This invention relates generally to associating data records, and in particular to identifying data records that may contain information about the same entity such that these data records may be associated. Even more particularly, this invention relates to the standardization and comparison of attributes within data records.
In today's day and age, the vast majority of businesses retain extensive amounts of data regarding various aspects of their operations, such as inventories, customers, products, etc. Data about entities, such as people, products, parts or anything else may be stored in digital format in a data store such as a computer database. These computer databases permit the data about an entity to be accessed rapidly and permit the data to be cross-referenced to other relevant pieces of data about the same entity. The databases also permit a person to query the database to find data records pertaining to a particular entity, such that data records from various data stores pertaining to the same entity may be associated with one another.
A data store, however, has several limitations which may limit the ability to find the correct data about an entity within the data store. The actual data within the data store is only as accurate as the person who entered the data, or an original data source. Thus, a mistake in the entry of the data into the data store may cause a search for data about an entity in the database to miss relevant data about the entity because, for example, a last name of a person was misspelled, a social security number was entered incorrectly, or one or more attributes are missing. A whole host of these types of problems may be imagined. For example, a separate record may be created for an entity that already has a record within the database, such that several data records contain information about the same entity, but the names or identification numbers contained in those data records differ, making it difficult to associate the data records referring to the same entity with one another.
For a business that operates one or more data stores containing a large number of data records, the ability to locate relevant information about a particular entity within and among the respective databases is very important, but not easily obtained. Once again, any mistake in the entry of data (including without limitation the creation of more than one data record for the same entity) at any information source may cause relevant data to be missed when the data for a particular entity is searched for in the database. In addition, in cases involving multiple information sources, each of the information sources may have slightly different data syntax or formats which may further complicate the process of finding data among the databases. An example of the need to properly identify an entity referred to in a data record, and to locate all data records relating to that entity, arises in the health care field, where a number of different hospitals associated with a particular health care organization may each have one or more information sources containing information about their patients, and the health care organization collects the information from each of the hospitals into a master database. It is necessary to link data records from all of the information sources pertaining to the same patient to enable searching for information for a particular patient in all of the hospital records.
There are several problems which limit the ability to find all of the relevant data about an entity in such a database. Multiple data records may exist for a particular entity as a result of separate data records received from one or more information sources, which leads to a problem that can be called data fragmentation. In the case of data fragmentation, a query of the master database may not retrieve all of the relevant information about a particular entity. In addition, as described above, the query may miss some relevant information about an entity due to a typographical error made during data entry, which leads to the problem of data inaccessibility. In addition, a large database may contain data records which appear to be identical, such as a plurality of records for people with the last name of Smith and the first name of Jim. A query of the database will retrieve all of these data records, and the person who made the query may often choose, at random, one of the data records retrieved, which may be the wrong data record. The person often does not attempt to determine which of the records is appropriate. This can lead to the data records for the wrong entity being retrieved even when the correct data records are available. These problems limit the ability to locate the information for a particular entity within the database.
To reduce the amount of data that must be reviewed and prevent the user from picking the wrong data record, it is also desirable to identify and associate data records from the various information sources that may contain information about the same entity. There are conventional systems that locate duplicate data records within a database and delete those duplicate data records, but these systems only locate data records which are identical to each other or use a fixed set of rules to determine if two records are identical. Thus, these conventional systems cannot determine if two data records, with, for example, slightly different last names, nevertheless contain information about the same entity. In addition, these conventional systems do not attempt to index data records from a plurality of different information sources, locate data records within the one or more information sources containing information about the same entity, and link those data records together. Consequently, it would be desirable to be able to associate data records from a plurality of information sources which pertain to the same entity, despite discrepancies between attributes of these data records.
Thus, there is a need for systems and methods for comparing attributes of data records which take into account discrepancies which may arise between these attributes, and it is to this end that embodiments of the present invention are directed.
Embodiments of systems and methods for comparing attributes of a data record are presented herein. Broadly speaking, embodiments of the present invention generate a weight based on a comparison of the name (or other) attributes of data records. More particularly, embodiments of the present invention provide a set of code (e.g., a computer program product comprising a set of computer instructions stored on a computer readable medium and executable or translatable by a computer processor) translatable to generate a weight based on a comparison of the name attributes of data records. More specifically, embodiments of the present invention may calculate an information score for each of two name attributes to be compared to get an average information score for the two name attributes. The two name attributes may then be compared against one another to generate a weight between the two attributes. This weight can then be normalized to generate a final weight between the two business name attributes.
In one embodiment, each of the tokens of one attribute may be compared to the tokens of the other attribute to generate a weight for the two attributes. The comparison of each of the pairs of tokens may be accomplished by determining a current match weight for the pair of tokens, determining a first and second previous match weight for the pair of tokens and setting the weight to the current match weight if the current match weight is greater than both the first and second previous match weight or setting the weight to the greater of the first previous match weight or the second previous match weight otherwise.
In another embodiment, tokens of either attribute which are acronyms for tokens in the other attribute may be taken into account when comparing the two attributes.
Embodiments of the present invention may provide the technical advantage that attributes of the data records (and attributes in general) may be more effectively compared by allowing a weight to be generated by performing a whole host of comparisons on the tokens of the attributes. By more effectively comparing attributes of data records, the comparison and linking of various data records may be more effective in taking into account a variety of real world conditions which may occur during the entering or processing of data records such as mistakes that may be made or differences which may occur when entering data, variations in capabilities or formats of different systems, changes in personal conditions such as name changes due to marriage, etc.
Other technical advantages of embodiments of the present invention include the lesser weighting of frequently utilized tokens (e.g. Inc., Store, Co.), improving the accuracy of the generated weight. Additionally, when the two attributes are compared, each is analyzed to determine if either attribute comprises one or more acronyms, which also improves the accuracy of the comparison. Furthermore, a variety of types of comparisons may take place between the different tokens, including an exact match where tokens match exactly or a phonetic match where the tokens match phonetically. The tokens may also be compared to determine if the edit distance between the two tokens is less than a certain distance (e.g. 20% of the length of the longer of the two tokens) or if an initial token of one attribute matches a token of the other attribute. The weight generated for the two attributes may reflect the different types of matches which occur between the tokens of the two attributes.
These, and other, aspects of the invention will be better appreciated and understood when considered in conjunction with the following description and the accompanying drawings. The following description, while indicating various embodiments of the invention and numerous specific details thereof, is given by way of illustration and not of limitation. Many substitutions, modifications, additions or rearrangements may be made within the scope of the invention, and the invention includes all such substitutions, modifications, additions or rearrangements.
The drawings accompanying and forming part of this specification are included to depict certain aspects of the invention. A clearer impression of the invention, and of the components and operation of systems provided with the invention, will become more readily apparent by referring to the exemplary, and therefore nonlimiting, embodiments illustrated in the drawings, wherein identical reference numerals designate the same components. Note that the features illustrated in the drawings are not necessarily drawn to scale.
The invention and the various features and advantageous details thereof are explained more fully with reference to the nonlimiting embodiments that are illustrated in the accompanying drawings and detailed in the following description. Descriptions of well known starting materials, processing techniques, components and equipment are omitted so as not to unnecessarily obscure the invention in detail. Skilled artisans should understand, however, that the detailed description and the specific examples, while disclosing preferred embodiments of the invention, are given by way of illustration only and not by way of limitation. Various substitutions, modifications, additions or rearrangements within the scope of the underlying inventive concept(s) will become apparent to those skilled in the art after reading this disclosure.
Reference is now made in detail to the exemplary embodiments of the invention, examples of which are illustrated in the accompanying drawings. Wherever possible, the same reference numbers will be used throughout the drawings to refer to the same or like parts (elements).
Before turning to embodiments of the present invention, a general description of an example infrastructure or context which may be helpful in explaining these various embodiments will be described. A block diagram of one embodiment of just such an example infrastructure is described in
As shown, the identity hub 32 may receive data records from the data sources 34, 36, 38 as well as write corrected data back into the data sources 34, 36, 38. The corrected data communicated to the data sources 34, 36, 38 may include information that was correct, but has changed, information about fixing information in a data record or information about links between data records.
In addition, one of the operators 40, 42, 44 may transmit a query to the identity hub 32 and receive a response to the query back from the identity hub 32. The one or more data sources 34, 36, 38 may be, for example, different databases that possibly have data records about the same entities. For example, in the health care field, each data source 34, 36, 38 may be associated with a particular hospital in a health care organization and the health care organization may use the identity hub 32 to relate the data records associated with the plurality of hospitals so that a data record for a patient in Los Angeles may be located when that same patient is on vacation and enters a hospital in New York. The identity hub 32 may be located at a central location and the data sources 34, 36, 38 and operators 40, 42, 44 may be located remotely from the identity hub 32 and may be connected to the identity hub 32 by, for example, a communications link, such as the Internet or any other type of communications network, such as a wide area network, intranet, wireless network, leased network, etc.
The identity hub 32 may have its own database that stores complete data records in the identity hub, or alternatively, the identity hub may also only contain sufficient data to identify a data record (e.g., an address in a particular data source 34, 36, 38) or any portion of the data fields that comprise a complete data record so that the identity hub 32 can retrieve the entire data record from the data source 34, 36, 38 when needed. The identity hub 32 may link data records together containing information about the same entity utilizing an entity identifier or an associative database separate from actual data records. Thus, the identity hub 32 may maintain links between data records in one or more data sources 34, 36, 38, but does not necessarily maintain a single uniform data record for an entity.
More specifically, the identity hub may link data records in data sources 34, 36, 38 by comparing a data record (received from an operator, or from a data source 34, 36, 38) with other data records in data sources 34, 36, 38 to identify data records which should be linked together. This identification process may entail a comparison of one or more of the attributes of the data records with like attributes of the other data records. For example, a name attribute associated with one record may be compared with the name of other data records, social security number may be compared with the social security number of another record, etc. In this manner, data records which should be linked may be identified.
It will be apparent to those of ordinary skill in the art that both the data sources 34, 36, 38 and the operators 40, 42, 44 may be affiliated with similar or different organizations or owners. For example, data source 34 may be affiliated with a hospital in Los Angeles run by one health care network, while data source 36 may be affiliated with a hospital in New York run by another health care network. Thus, the data records of each of the data sources may be of a different format.
This may be illustrated more clearly with reference to
Notice, however, that each of the records may have a different format. For example, data record 202 may have a field 210 for the attribute of insurer, while data record 200 may have no such field. Similarly, like attributes may have different formats as well. For example, name field 210b in record 202 may accept the entry of a full name, while name field 210a in record 200 may be designed to allow entry of a name of a limited length.
As may be imagined, discrepancies such as this may be problematic when comparing two or more data records (e.g. attributes of data records) to identify data records which should be linked. The name “Bobs Flower Shop” is not the same as “Bobs Very Pretty Flower Shoppe”. Similarly, a typo or mistake in entering data for a data record may also affect the comparison of data records (e.g. comparing the name “Bobs Pretty Flower Shop” with “Bobs Pretty Glower Shop” where “Glower” resulted from a typo in entering the word “Flower”).
To deal with these possibilities, a variety of name comparison techniques have been utilized to generate a weight based on the comparison (e.g. similarity) of names in different records, where this weight could then be utilized in determining whether two records should be linked. These techniques include various phonetic comparison methods, weighting based on frequency of name tokens, initial matches, nickname matches, etc. More specifically, the tokens of the name attribute of each record would be compared against one another, using methodologies to determine whether the tokens matched exactly, phonetically, etc. These matches could then be given a weight, based upon the determined match (e.g. an exact match was given one weight, a certain type of initial match was given a certain weight, etc.). These weights could then be aggregated to determine an overall weight for the degree of match between the name attributes of two data records.
These techniques were not without their various problems, however, especially when applied to business names. This is because business names may present a number of fairly specific problems as a result of their nature. Some business names are very short (e.g. “Quick-E-Mart”) while others are very long (e.g. “San Francisco's Best Coffee Shop”). Additionally, business names may frequently use similar words (e.g. “Shop”, “Inc.”, “Co.”) which, though identical, should not weigh heavily in any heuristic for comparing these names. Furthermore, acronyms are frequently used in business names, for example a business named “New York City Bagel” may frequently be entered into a data record as “NYC Bagel”.
In many cases the algorithms employed to compare business names would not take account of these specific peculiarities when comparing business names. There was no support for acronyms, the frequency of certain words in business names was not accounted for and the ordering of tokens within a business name was not accounted for (e.g. the name “Clinic of Austin” may have been deemed virtually identical to “Austin Clinic”). In fact, in some cases two name attributes which only partially matched may have received a higher weight than two name attributes which matched exactly. Consequently, match weights between business names were skewed, and therefore skewed the comparisons between data records themselves. Thus, it would be desirable to utilize comparison algorithms which generate weights which better reflect the similarity between name attributes of different records.
To that end, attention is now directed to systems and methods for comparing attributes. Broadly speaking, embodiments of the present invention generate a weight based on a comparison of the name attributes of data records. More specifically, embodiments of the present invention may calculate an information score for each of two name attributes to be compared to get an average information score for the two name attributes. The two name attributes may then be compared against one another to generate a weight between the two attributes. This weight can then be normalized to generate a final weight between the two business name attributes.
To aid in an understanding of the systems and methods of the present invention it will be helpful to present an example embodiment of a methodology for identifying records pertaining to the same entity which may utilize these systems and methods.
In one embodiment, when names are standardized, consecutive single-character tokens are combined into a single token (e.g. I.B.M. becomes IBM) and substitutions are performed (e.g. “Co.” is replaced by “Company” and “Inc.” is replaced by “Incorporated”). An equivalence table comprising abbreviations and their equivalent substitutions may be stored in a database associated with identity hub 32. Pseudocode for one embodiment of standardizing business names in accordance with embodiments of the present invention is as follows:
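By way of non-limiting illustration only, the standardization described above might be sketched in Python as follows. This sketch is not the pseudocode of any particular embodiment. The function name standardize_name, the in-memory EQUIVALENTS dictionary (standing in for the equivalence table stored in the database), the upper-casing of tokens and the treatment of punctuation as token separators are all assumptions made for the purposes of the example.

```python
import re

# Hypothetical equivalence table; the description stores such a table in a
# database associated with the identity hub. Entries here are examples only.
EQUIVALENTS = {"CO": "COMPANY", "INC": "INCORPORATED"}

def standardize_name(name, equivalents=EQUIVALENTS):
    """Return a list of standardized tokens for a business name."""
    # Assumption: apostrophes are dropped, other punctuation separates tokens,
    # and tokens are upper-cased, so "I.B.M." first becomes "I", "B", "M".
    cleaned = re.sub(r"[^A-Za-z0-9]+", " ", name.replace("'", ""))
    raw_tokens = cleaned.upper().split()

    tokens, run = [], []
    for tok in raw_tokens:
        if len(tok) == 1:
            run.append(tok)          # collect consecutive single characters
            continue
        if run:
            tokens.append("".join(run))  # e.g. I, B, M -> IBM
            run = []
        tokens.append(tok)
    if run:
        tokens.append("".join(run))

    # Replace abbreviations with their equivalents from the table.
    return [equivalents.get(tok, tok) for tok in tokens]
```

For example, under these assumptions standardize_name("I.B.M. Co.") yields the tokens IBM and COMPANY.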
Once the attributes of the data records to be compared have been standardized at step 320, a set of candidates may be selected to compare to the new data record at step 330. This candidate selection process may comprise a comparison of one or more attributes of the new data record to the existing data records to determine which of the existing data records are similar enough to the new data record to entail further comparison. These candidates may then undergo a more detailed comparison to the new record at step 340, where a set of attributes are compared between the records to determine if an existing data record should be linked or associated with the new data record. This more detailed comparison may entail comparing each of the set of attributes of one record (e.g. an existing record) to the corresponding attribute in the other record (e.g. the new record) to generate a weight for that attribute. The weights for each of the set of attributes may then be summed to generate an overall weight which can then be compared to a threshold to determine if the two records should be linked.
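As a simple illustration of the final step of this detailed comparison, the summing of per-attribute weights and the comparison against a threshold might be sketched in Python as follows. The function name link_decision, the attribute names, the numeric values and the use of a greater-than-or-equal test are assumptions of this sketch and are not taken from any particular embodiment.

```python
def link_decision(attribute_weights, threshold):
    """Sum per-attribute weights and compare the total to a threshold.

    attribute_weights maps an attribute name to the weight generated by
    comparing that attribute between two records (hypothetical values).
    Returns the overall weight and whether the records should be linked.
    """
    overall = sum(attribute_weights.values())
    # Assumption: records are linked when the overall weight meets the threshold.
    return overall, overall >= threshold
```

For instance, weights of 3.0 for name, 1.5 for address and -0.5 for phone sum to an overall weight of 4.0, which would exceed a hypothetical threshold of 3.5.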
Turning now to
At step 410 two names are given or provided (e.g. input to a software application) such that these names may be compared. The names may each be in a standardized form comprising a set of tokens, as discussed above with respect to
Using an average value for the information score of the two attributes (instead of, for example, a minimum or maximum information score between the two attributes) may allow the weight generated by embodiments of the name comparison algorithm to take into account tokens missing from one of the two attributes, and, in some embodiments, may allow the penalty imposed for a missing token to be half the penalty imposed for a mismatch between two tokens. The information score of each of the tokens may, in turn, be based on the frequency of the occurrence of a token in a data sample. By utilizing the relative frequency of tokens to determine an information score for the token, the commonality of certain tokens (e.g. “Inc.”) may be taken into account by scoring these tokens lower.
A weight between the two names can then be generated at step 440 by comparing the two names. This weight may then be normalized at step 450 to generate a final weight for the two names. In one embodiment, this normalization process may apply a scaling factor to the ratio of the generated weight to the average information score to generate a normalized index value. This normalized index value may then be used to index a table of values to generate a final weight.
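The normalization described above might be sketched in Python as follows, purely by way of example. The function names, the scaling factor of 10, the rounding and clamping of the index and the contents of the table of final weights are all assumptions of this sketch; the description states only that a scaling factor is applied to the ratio of the generated weight to the average information score and that the result indexes a table.

```python
def average_information_score(info_a, info_b):
    """Average of the information scores of the two name attributes."""
    return (info_a + info_b) / 2.0

def final_weight(raw_weight, info_a, info_b, weight_table, scale=10):
    """Normalize a raw comparison weight and look up a final weight.

    weight_table is a hypothetical list of final weights indexed by the
    normalized index value 0..len(weight_table) - 1.
    """
    avg = average_information_score(info_a, info_b)
    if avg <= 0:
        return weight_table[0]
    # Scale the ratio of the generated weight to the average information score.
    index = int(round(scale * raw_weight / avg))
    # Assumption: out-of-range index values are clamped to the table bounds.
    index = max(0, min(len(weight_table) - 1, index))
    return weight_table[index]
```

In this sketch, a name whose tokens all match but which is missing a token carrying an information score of 4 (scores of 10 and 14, raw weight 10) has an average score of 12, so only half of the missing token's score enters the denominator.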
It may be useful here to delve with more detail into the various steps of the embodiment of an algorithm for comparing names depicted in
In one embodiment, the weights associated with the tokens in the exact match weight table may be calculated from a sample set of data records, such as a set of data records associated with one or more of data sources 34, 36, 38 or a set of provided data records. Using the sample set of data records, exact match weights may be computed using frequency data and match set data. The number of name strings (e.g. name attributes) NameTot in the sample set of data records may be computed, and for each name token T corresponding to these name strings a count Tcount and a frequency Tfreq=Tcount/NameTot may be computed.
The tokens are then ordered by frequency with the highest frequency tokens first and a cumulative frequency for each token which is the sum of the frequencies for the token and all those that came before it is computed as depicted in Table 1:
All tokens up to and including the first token whose cumulative frequency exceeds 0.80 are then determined, and for each of these tokens the exact match weight may be computed using the formula: ExactTi=−ln(Tfreq−i). If TM is the first token whose cumulative frequency exceeds 0.80 and TN is the last token (i.e. the lowest frequency token), the default exact match weight can be computed by taking the average of −ln(Tfreq−M+1), . . . , −ln(Tfreq−N). An embodiment of the compare algorithm described herein for comparing names may then be applied to a set of random pairs of names in the data set to generate RanNameComp, the total number of name string pairs compared, and, for I=0 to MAX_SIM, RanSim−I, the total number of name string pairs whose normalized similarity is I. For each I, RanFreqSim−I=RanSim−I/RanNameComp can then be computed. Using the weight generation process as described in U.S. patent application Ser. No. 11/522,223 titled “Method and System For Comparing Attributes Such as Personal Names” by Norm Adams et al filed on Sep. 15, 2006, which is fully incorporated herein by reference and as described below, MatchFreqSim−I=MatchSim−I/MatchNameComp can also be computed for a token. Final weights for a token may then be computed as: Weight-Norm-SimI=log10(MatchFreqSim−I/RanFreqSim−I).
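The computation of exact match weights and of the default exact match weight described above might, for illustration only, be sketched in Python as follows. The function name exact_match_weights, the representation of each name string as a list of tokens, the counting of every occurrence of a token toward Tcount and the alphabetical ordering of tokens of equal frequency are assumptions of this sketch.

```python
import math
from collections import Counter

def exact_match_weights(names, cutoff=0.80):
    """Compute per-token exact match weights and a default weight.

    names is a list of name strings, each already split into tokens.
    Returns (weights, default), where weights maps each high-frequency
    token to -ln(Tfreq) and default is the average of -ln(Tfreq) over
    the remaining tokens (None if there are none).
    """
    name_tot = len(names)                                   # NameTot
    counts = Counter(tok for name in names for tok in name)  # Tcount per token
    # Highest frequency first; ties broken alphabetically (an assumption).
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))

    weights, remaining = {}, []
    cumulative, cutoff_reached = 0.0, False
    for tok, count in ordered:
        freq = count / name_tot                             # Tfreq
        if not cutoff_reached:
            weights[tok] = -math.log(freq)
            cumulative += freq
            # The first token whose cumulative frequency exceeds the
            # cutoff is still given its own exact match weight.
            cutoff_reached = cumulative > cutoff
        else:
            remaining.append(-math.log(freq))

    default = sum(remaining) / len(remaining) if remaining else None
    return weights, default
```

Tokens not present in the returned table would then be assigned the default exact match weight.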
Once the exact match weights for a set of tokens are calculated they may be stored in a table in a database associated with identity hub 32. For example, the following pseudocode depicts one embodiment for calculating an information score for an attribute utilizing two tables an “initialContent” table comprising exact match weights for initials, and “exactContent” comprising exact match weights for other tokens:
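As a non-limiting illustration, and not as the pseudocode of the embodiment, the calculation of an information score from two such tables might be sketched in Python as follows. Representing the “initialContent” and “exactContent” tables as dictionaries, treating a single-character token as an initial, summing the per-token weights and falling back to default weights for tokens absent from the tables are all assumptions of this sketch.

```python
def information_score(tokens, exact_content, initial_content,
                      default_exact, default_initial):
    """Sum the exact match weights of the tokens of one attribute.

    exact_content and initial_content are dictionaries standing in for
    the "exactContent" and "initialContent" tables. Tokens missing from
    a table receive the corresponding default weight (an assumption).
    """
    score = 0.0
    for tok in tokens:
        if len(tok) == 1:
            # Assumption: a single-character token is treated as an initial.
            score += initial_content.get(tok, default_initial)
        else:
            score += exact_content.get(tok, default_exact)
    return score
```

Because frequently occurring tokens carry low exact match weights, a common token contributes little to the information score computed this way.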
Referring still to
For each of these pairs of tokens it may be determined if a match exists between the two tokens at step 525. If no match exists between the two tokens at step 525 the current match weight may be set to zero at step 537. If a match exists between the two tokens, however, the current match weight for the two tokens may be calculated at step 535.
Once it has been determined if a match exists between the two tokens at step 525 and the match weight calculated at step 535 for the current match weight if such a match exists, it may be determined if a distance penalty should be imposed at step 547. In one embodiment, it may be determined if a distance penalty should be imposed, and the distance penalty computed, based on where the last match between a pair of tokens of the attributes occurred. To this end, a last match position may be determined at step 545 indicating where the last match between two tokens of the attributes occurred. If the difference in position (e.g. relative to the attributes) between the current two tokens being compared and the last match position is greater than a certain threshold, a distance penalty may be calculated at step 555 and the current match weight adjusted at step 557 by subtracting the distance penalty from the current match weight. It will be apparent that these distance penalties may differ based upon the difference between the last match position and the position of the current tokens.
Match weights for previous tokens of the attributes may also be determined at steps 565, 567 and 575. More particularly, at step 565 a first previous match weight is determined for the token of one attribute currently being compared and the previous (e.g. preceding the current token being compared in order) token of the second attribute currently being compared, if it exists. Similarly, at step 567 a second previous match weight is determined for the token of the second attribute currently being compared and the previous token of the first attribute currently being compared, if it exists. At step 575 a third previous match weight is determined using the previous tokens of each of the current attributes, if either token exists. The current match weight for the pair of tokens currently being compared may then be adjusted at step 577 by adding the third previous match weight to the current match weight.
The current match weight may then be compared to the first and second previous match weights at step 585, and if the current match weight is greater than or equal to both of the previous match weights the weight may be set to the current match weight at step 587. If, however, the first or second previous match weight is greater than the current match weight, the weight will be set to the greater of the first or second previous match weights at step 595. In this manner, after each of the tokens of the two attributes has been compared a weight will be produced.
It will be apparent that many types of data elements or data structures may be useful in implementing certain embodiments of the present invention. For example,
After the table is built at step 510, it may be initialized at step 520 such that certain initial cells within the table have initial values. More particularly, in one embodiment each of the first row and first column may be initialized such that the position indicator may receive a null or zero value and the weight associated with each of these cells may be initialized to a zero value.
Each of the other cells (e.g. besides the initial cells) of the table may then be iterated through to determine a position and a value to be associated with the cell. For each cell it is determined if the cell has already been matched through an acronym match at step 530, and if so the cell may be skipped. If the cell has not been previously matched, however, at step 540 it may be determined if a match exists between the two tokens corresponding to the cell. If no match exists, it may then be determined if either of the tokens corresponding to the cell is an acronym for a set of the tokens in the other name at step 532, by, in one embodiment, comparing the characters of one token to the first characters of a set of tokens of the other name. If one of the tokens is an acronym for a set of tokens in the other name, a last position indicator and cell weight (as described in more detail below) are calculated at step 534 for the set of cells whose corresponding tokens are the acronym and the set of tokens of the other name which correspond to the acronym. Pseudocode for determining if one token is an acronym for a set of tokens of the other name is as follows:
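For purposes of illustration only, and not as the pseudocode of the embodiment, the acronym determination described above might be sketched in Python as follows. The function name find_acronym_span, the requirement that the tokens forming the acronym be consecutive, the minimum acronym length of two characters and the assumption that tokens are non-empty and already in a common case are all assumptions of this sketch.

```python
def find_acronym_span(token, other_tokens):
    """Check whether token is an acronym for tokens of the other name.

    Compares the characters of token to the first characters of each run
    of len(token) consecutive tokens in other_tokens. Returns the
    half-open index range (start, end) of the matching run, or None.
    """
    length = len(token)
    if length < 2:
        return None  # assumption: a single character is an initial, not an acronym
    for start in range(len(other_tokens) - length + 1):
        run = other_tokens[start:start + length]
        initials = "".join(tok[0] for tok in run)
        if initials == token:
            return (start, start + length)
    return None
```

Under these assumptions the token NYC is found to be an acronym for the first three tokens of NEW, YORK, CITY, BAGEL, and the cells for those three tokens would be treated as matched together.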
If it is determined that neither of the tokens is an acronym at step 532, the match weight for the current cell may be set to zero at step 543. Returning to step 540, if a match exists between the two tokens corresponding to the current cell, the match weight for the two tokens may be calculated at step 542. Though virtually any type of comparison may be utilized to compare the two corresponding tokens and generate an associated match weight according to steps 540 and 542, in one embodiment it may be determined if an exact match, an initial match, a phonetic match, a nickname match or a nickname-phonetic match occurs and a corresponding match weight calculated as described in U.S. patent application Ser. No. 11/522,223 titled “Method and System For Comparing Attributes Such as Personal Names” by Norm Adams et al filed on Sep. 15, 2006 which is fully incorporated herein by reference. Pseudocode for comparing two tokens and generating an associated match weight is as follows:
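By way of example only, and not as the pseudocode of the embodiment, a comparison of two tokens might be sketched in Python as follows. The sketch covers only exact, initial and edit distance matches; phonetic and nickname matches are omitted because they depend on tables not reproduced here. The function names, the derivation of a non-exact match weight by subtracting a penalty from the smaller of the two tokens' exact match weights, and the penalty values supplied by the caller are assumptions of this sketch.

```python
def edit_distance(a, b):
    """Levenshtein edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def match_type(a, b, max_edit_ratio=0.20):
    """Classify the match between two non-empty tokens, or return None."""
    if a == b:
        return "exact"
    if (len(a) == 1 or len(b) == 1) and a[0] == b[0]:
        return "initial"
    # Edit distance less than 20% of the length of the longer token.
    if edit_distance(a, b) < max_edit_ratio * max(len(a), len(b)):
        return "edit"
    return None

def match_weight(a, b, exact_weights, default_exact, penalties):
    """Return a match weight for two tokens, or 0.0 if they do not match."""
    kind = match_type(a, b)
    if kind is None:
        return 0.0
    # Assumption: start from the smaller of the two exact match weights and
    # subtract the penalty for the match type (no penalty for an exact match).
    base = min(exact_weights.get(a, default_exact),
               exact_weights.get(b, default_exact))
    return base - penalties.get(kind, 0.0)
```

With these assumptions the tokens FLOWER and GLOWER, which differ by one character, are treated as an edit distance match, whereas SHOP and SHOPPE, which differ by two, are not.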
Looking still at
Using the match weight for the cell then, a cell weight and last match position for the cell may be calculated at step 570. A flow diagram for one embodiment of a method for calculating a last match position and a cell weight for a cell is depicted in
The cell weights from the two adjoining cells and the temporary cell weight may be compared at step 640. If the temporary cell weight is greater than or equal to the cell weights of both adjoining cells, the last match position of the current cell is set to the position of the current cell at step 642 and the cell weight of the current cell is set to the temporary cell weight at step 644. If, however, the cell weight of either adjoining cell exceeds the temporary cell weight, the greater of the two adjoining cell weights will be assigned as the cell weight of the current cell and the value of the last match position indicator of that cell (e.g. the adjoining cell with the higher cell weight) will be assigned as the last position indicator of the current cell at step 652 or step 654.
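The table-based comparison described above might be sketched in Python as follows, strictly as an illustration under stated assumptions. The function name compare_token_sequences, the caller-supplied match_weight function, the fixed distance penalty, the use of the last match position of the diagonal cell to measure distance, the threshold of one position and the rule that a cell without a match never records itself as a last match position are all assumptions of this sketch; acronym handling is omitted.

```python
def compare_token_sequences(tokens_a, tokens_b, match_weight,
                            distance_penalty=0.5, max_gap=1):
    """Fill a table of cell weights and last match positions.

    Row 0 and column 0 are the initial cells (zero weight, no position).
    Returns the cell weight of the final cell.
    """
    rows, cols = len(tokens_a) + 1, len(tokens_b) + 1
    weight = [[0.0] * cols for _ in range(rows)]
    last_match = [[None] * cols for _ in range(rows)]

    for i in range(1, rows):
        for j in range(1, cols):
            current = match_weight(tokens_a[i - 1], tokens_b[j - 1])
            matched = current > 0
            prev_pos = last_match[i - 1][j - 1]
            if matched and prev_pos is not None:
                # Assumption: distance is measured from the last match
                # position recorded in the diagonal cell.
                gap = max(i - prev_pos[0], j - prev_pos[1])
                if gap > max_gap:
                    current -= distance_penalty
            # Temporary cell weight: match weight plus the diagonal cell weight.
            temp = current + weight[i - 1][j - 1]
            left, up = weight[i][j - 1], weight[i - 1][j]
            if matched and temp >= left and temp >= up:
                weight[i][j] = temp
                last_match[i][j] = (i, j)
            elif left >= up:
                weight[i][j] = left
                last_match[i][j] = last_match[i][j - 1]
            else:
                weight[i][j] = up
                last_match[i][j] = last_match[i - 1][j]
    return weight[-1][-1]
```

With a match_weight function that returns 1.0 for identical tokens and 0.0 otherwise, comparing BOBS, PRETTY, FLOWER, SHOP to BOBS, FLOWER, SHOP yields 2.5 in this sketch (three matches less one distance penalty of 0.5), while comparing CLINIC, OF, AUSTIN to AUSTIN, CLINIC yields 1.0, because the ordering of tokens prevents both matches from contributing.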
Returning now to
Before delving into examples of the application of embodiments of the above described methods, it may be useful to discuss how various match and distribution penalties are determined. In one embodiment, to calculate these penalties an embodiment of a compare algorithm such as that described with respect to
The following frequencies can then be computed:
RanProbExact=RanExact/RanComp
RanProbInitial=RanInitial/RanComp
RanProbPhonetic=RanPhonetic/RanComp
RanProbNickname=RanNickname/RanComp
RanProbNickPhone=RanNickPhone/RanComp
RanProbEdit=RanEdit/RanComp
RanProbDist−0=RanDist−0/RanComp
RanProbDist−1=RanDist−1/RanComp
RanProbDist−2=RanDist−2/RanComp
RanProbDist−3=RanDist−3/RanComp
Using the process described above in conjunction with generating exact match weights, a set of matched name pairs can be derived, and the following frequencies derived:
MatchProbExact=MatchExact/MatchComp
MatchProbInitial=MatchInitial/MatchComp
MatchProbPhonetic=MatchPhonetic/MatchComp
MatchProbNickname=MatchNickname/MatchComp
MatchProbNickPhone=MatchNickPhone/MatchComp
MatchProbEdit=MatchEdit/MatchComp
MatchProbDist−0=MatchDist−0/MatchComp
MatchProbDist−1=MatchDist−1/MatchComp
MatchProbDist−2=MatchDist−2/MatchComp
MatchProbDist−3=MatchDist−3/MatchComp
Using these frequencies the following marginal weights may be computed:
MarginalExact=log10(MatchProbExact/RanProbExact)
MarginalInitial=log10(MatchProbInitial/RanProbInitial)
MarginalPhonetic=log10(MatchProbPhonetic/RanProbPhonetic)
MarginalNickname=log10(MatchProbNickname/RanProbNickname)
MarginalNickPhone=log10(MatchProbNickPhone/RanProbNickPhone)
MarginalEdit=log10(MatchProbEdit/RanProbEdit)
MarginalDist−0=log10(MatchProbDist−0/RanProbDist−0)
MarginalDist−1=log10(MatchProbDist−1/RanProbDist−1)
MarginalDist−2=log10(MatchProbDist−2/RanProbDist−2)
MarginalDist−3=log10(MatchProbDist−3/RanProbDist−3)
and the respective penalties computed as follows:
Initial Penalty=MarginalExact−MarginalInitial
Phonetic Penalty=MarginalExact−MarginalPhonetic
Nickname Penalty=MarginalExact−MarginalNickname
NickPhone Penalty=MarginalExact−MarginalNickPhone
Edit Distance Penalty=MarginalExact−MarginalEdit
DistPenalty1=MarginalDist−0−MarginalDist−1
DistPenalty2=MarginalDist−0−MarginalDist−2
DistPenalty3=MarginalDist−0−MarginalDist−3
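The chain from counts to frequencies, marginal weights and penalties may be illustrated with a short Python sketch. The function names, the dictionary keys and the sample counts are hypothetical and chosen only to make the arithmetic easy to follow. Each distance penalty is taken as the distance-0 marginal weight minus the marginal weight for the given distance, following the pattern of the distance-2 and distance-3 formulas, and all counts are assumed to be nonzero.

```python
import math


def marginal_weight(match_count, match_comp, ran_count, ran_comp):
    """log10 of the matched-pair frequency over the random-pair frequency."""
    match_prob = match_count / match_comp
    ran_prob = ran_count / ran_comp
    return math.log10(match_prob / ran_prob)


def compute_penalties(match_counts, match_comp, ran_counts, ran_comp):
    """Derive match-type and distance penalties from raw counts."""
    marginal = {
        kind: marginal_weight(match_counts[kind], match_comp,
                              ran_counts[kind], ran_comp)
        for kind in match_counts
    }
    penalties = {}
    # Match-type penalties are measured relative to an exact match.
    for kind in ("Initial", "Phonetic", "Nickname", "NickPhone", "Edit"):
        if kind in marginal:
            penalties[kind] = marginal["Exact"] - marginal[kind]
    # Distance penalties are measured relative to distance 0.
    for dist in (1, 2, 3):
        key = f"Dist-{dist}"
        if key in marginal:
            penalties[f"DistPenalty{dist}"] = marginal["Dist-0"] - marginal[key]
    return penalties


# Hypothetical counts out of 1000 matched and 1000 random comparisons.
match_counts = {"Exact": 500, "Initial": 100, "Dist-0": 800, "Dist-1": 100}
ran_counts = {"Exact": 5, "Initial": 10, "Dist-0": 80, "Dist-1": 100}
print(compute_penalties(match_counts, 1000, ran_counts, 1000))
```

The marginal weights here are in log10 units. Any scaling to integer weights such as those used in the examples that follow is outside the scope of this sketch.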
An example of the application of embodiments of the systems and methods of the present invention to two actual names may now be illustrated with respect to the example table of
In accordance with one embodiment of the systems and methods of the present invention an average information score may be calculated for the two names being compared (step 430). In one embodiment, this is done using the exact match weights for each of the tokens in each of the names. According to this method, the information score for the name “Bobs Flower Shop” is 750 (e.g. 200+400+150) and the information score for the name “Bobs Very Pretty Flower Shoppe” is 1650 (200+150+300+400+600), making the average of the two information scores 1200.
Once an average information score for the two names is computed (step 430) a weight for the two names may be generated (step 440). In one embodiment, table 700 is constructed (step 510), where each cell 702 has the ability to keep a position indicator (e.g. row, column) and a cell weight. Cells 702a of the table may then be initialized (step 520).
Once cells 702a of the table have been initialized, the remainder of the cells 702 of the table 700 may be iterated through. Starting with cell 702b (e.g. row 1, column 1), it is determined that a match occurs between the two tokens corresponding to the cell 702b (step 540). The match weight for these two tokens may then be calculated (step 542), which in this case is 200. The cell weight values for adjoining cells may then be determined (steps 610, 620), and from this it can be determined that the cell weight (0) from the diagonal cell 702a1 plus 200 (e.g. temporary cell weight for the cell) is greater than the cell weight of either adjoining cell 702a2, 702a3 (step 640). Thus, the last match position indicator of cell 702b is set to the current cell 702b (1,1) and the cell weight of the current cell is set to the calculated match weight (200) (steps 642, 644).
The last match position indicator and cell weight for the next cell 702c may then be calculated. It is determined that no match occurs between the two tokens corresponding to the cell 702c (step 540). As no acronym match occurs (step 532) the match weight for this cell is then set to zero (step 543). A temporary cell weight may then be calculated (step 630) and compared to the cell weights of adjoining cells 702b, 702a4 (steps 640, 650) and from this it can be determined that the cell weight (200) from the adjoining cell 702b is greater than the cell weight of adjoining cell 702a4 or the cell weight of diagonal cell 702a3 plus the match weight for the current cell (0) (e.g. temporary cell weight). Thus, the last match position indicator of current cell 702c is set to the last match position indicator of adjoining cell 702b (1,1) and the cell weight of the current cell 702c is set to the cell weight of the adjoining cell 702b with the greater cell weight (step 652).
Similarly cells 702d, 702e, 702f, 702g, 702h, 702i, 702j and 702k may be iterated through with similar results as those described above with respect to cell 702c. Upon reaching cell 702l, however, it may be determined that a match occurs between the two tokens corresponding to the cell 702l (step 540). The match weight for the tokens corresponding to cell 702l (e.g. “Flower” and “Flower”) may then be calculated (step 542), which in this case may be 400. It may then be determined if a distance penalty should be imposed by comparing the last match position of the diagonal cell 702h with the position of the current cell 702l (step 550). This comparison may be accomplished by subtracting the row indices from one another (e.g. 4−1) and the column indices from one another (e.g. 2−1), taking the maximum of these values (e.g. 3), and comparing this distance value to a threshold level to determine if a distance penalty should be imposed. In this case the threshold value for a distance penalty may be a distance of one; as three is greater than one, it may be determined that a distance penalty should be imposed. The distance penalty corresponding to the distance value (e.g. 3) may then be subtracted from the calculated match weight for the current cell (steps 552, 560). In this case, the distance penalty is 100, which may be subtracted from the match weight of 400 to adjust the match weight of cell 702l to 300. The cell weight values for adjoining cells may then be determined, and from this it can be determined that the cell weight (200) from the diagonal cell 702h plus the match weight for the current cell 702l (e.g. 300) is greater than the cell weight of either adjoining cell 702k, 702i (e.g. 200 and 200 respectively) (step 640). Thus, the last match position indicator of cell 702l is set to the current cell 702l (4,2) and the cell weight of the current cell 702l is set to the calculated match weight plus the cell weight from the diagonal cell 702h (e.g. 300+200=500) (steps 642, 644).
The last match position indicator and cell weights for cells 702m, 702n and 702o may be calculated similarly to the calculations described above. Upon reaching cell 702p, however, it may be determined that a match occurs between the two tokens corresponding to the cell 702p (step 540). The match weight for the tokens corresponding to cell 702p (e.g. “Shoppe” and “Shop”) may then be calculated (step 542), which in this case may be 50 (as the match between “Shoppe” and “Shop” may be a phonetic match, its weight may be the minimum of the exact match weights for “Shoppe” and “Shop” minus the phonetic penalty weight). It may then be determined if a distance penalty should be imposed by comparing the last match position of the diagonal cell 702l with the position of the current cell 702p (step 550). This comparison may be accomplished by subtracting the row indices from one another (e.g. 5−4) and the column indices from one another (e.g. 3−2), taking the maximum of these values (e.g. 1), and comparing this distance value to a threshold level to determine if a distance penalty should be imposed. In this case the threshold value for a distance penalty may be a distance of one, and as such a distance penalty should not be imposed. Thus, the match weight of the current cell 702p is 50. The cell weight values for adjoining cells 702o, 702m may then be determined (steps 610, 620), and from this it can be determined that the cell weight from the diagonal cell 702l (500) plus the match weight for the current cell 702p (50) is greater than the cell weight of either adjoining cell 702o, 702m (e.g. 500 and 500 respectively) (step 640). Thus, the last match position indicator of cell 702p is set to the current cell 702p (5,3) and the cell weight of the current cell 702p is set to the calculated match weight plus the cell weight from the diagonal cell 702l (e.g. 500+50=550) (steps 642, 644).
Reading the last cell 702p of table 700, it can be determined that the weight for the two names being compared is 550. This weight may then be normalized according to a maximum similarity index and a ratio of the weight to an average information score for the two names (step 450). For example, if the maximum similarity index is 10, the weight may be normalized to a normalized index value of 4 by rounding the results of the equation 10*550/1200. This normalized index value may be used as an index into a normalized weight table to generate the final weight for the two names (step 450). For example, the normalized weight of 4 may index to a final weight of 441 for the two names.
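The table walk for this example may be reproduced with the following Python sketch, offered as one possible reading of the steps described above rather than as a definitive implementation. The function name compare_names, the match_weight callback, the pair-weight dictionary and the tie-breaking between adjoining cells are assumptions. Acronym handling (steps 530, 532) is omitted. Only the distance-3 penalty of 100 comes from the example; the distance-2 value is a hypothetical placeholder that the example never uses.

```python
def compare_names(rows, cols, match_weight, dist_penalty, threshold=1):
    """Fill the comparison table and return the weight in its last cell."""
    n, m = len(rows), len(cols)
    # Row 0 and column 0 are the initialized border cells:
    # cell weight 0 and last match position (0, 0).
    weight = [[0] * (m + 1) for _ in range(n + 1)]
    last = [[(0, 0)] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            w = match_weight(rows[i - 1], cols[j - 1])
            if w > 0:
                # Distance from the diagonal cell's last match position.
                li, lj = last[i - 1][j - 1]
                dist = max(i - li, j - lj)
                if dist > threshold:
                    w -= dist_penalty.get(dist, 0)
            temp = weight[i - 1][j - 1] + w  # temporary cell weight
            up, left = weight[i - 1][j], weight[i][j - 1]
            if temp > up and temp > left:
                weight[i][j], last[i][j] = temp, (i, j)
            elif left >= up:  # tie-break toward the left cell (assumption)
                weight[i][j], last[i][j] = left, last[i][j - 1]
            else:
                weight[i][j], last[i][j] = up, last[i - 1][j]
    return weight[n][m]


# Match weights taken from the example; all other token pairs score zero.
PAIR_WEIGHTS = {("Bobs", "Bobs"): 200, ("Flower", "Flower"): 400,
                ("Shoppe", "Shop"): 50}
DIST_PENALTY = {2: 50, 3: 100}  # the value for distance 2 is hypothetical

total = compare_names(
    ["Bobs", "Very", "Pretty", "Flower", "Shoppe"],
    ["Bobs", "Flower", "Shop"],
    lambda a, b: PAIR_WEIGHTS.get((a, b), 0),
    DIST_PENALTY,
)
print(total)  # 550, the weight read from the last cell in the example
```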
To further aid in an understanding of embodiments of the systems and methods of the present invention it may be useful to illustrate an example of the application of embodiments of the systems and methods of the present invention to a comparison of names where an acronym is present in one of the two names being compared. A table for use in illustrating such an example is depicted in
In accordance with one embodiment of the systems and methods of the present invention an average information score may be calculated for the two names being compared (step 430). In one embodiment, this is done using the exact match weights for each of the tokens in each of the names. According to this method, the information score for the name “Bobs VP Flower Shop” is 1050 (e.g. 200+300+400+150) and the information score for the name “Bobs Very Pretty Flower Shop” is 1200 (200+150+300+400+150), making the average of the two information scores 1125.
Once an average information score for the two names is computed (step 430) a weight for the two names may be generated (step 440). In one embodiment, table 800 is constructed (step 510), where each cell 802 has the ability to keep a position indicator (e.g. row, column) and a cell weight. Cells 802a of the table may then be initialized (step 520).
Once cells 802a of the table have been initialized, the remainder of the cells 802 of the table 800 may be iterated through. Starting with cell 802b (e.g. row 1, column 1), it is determined that a match occurs between the two tokens corresponding to the cell 802b (step 540). The match weight for these two tokens may then be calculated (step 542), which in this case is 200. The cell weight values for adjoining cells may then be determined (steps 610, 620), and from this it can be determined that the cell weight (0) from the diagonal cell 802a1 plus 200 (e.g. temporary cell weight for the cell) is greater than the cell weight of either adjoining cell 802a2, 802a3 (step 640). Thus, the last match position indicator of cell 802b is set to the current cell 802b (1,1) and the cell weight of the current cell 802b is set to the calculated match weight (200) (steps 642, 644).
Cells 802c-802f may similarly be iterated through, as discussed above. Upon reaching cell 802g it may be determined that no match exists between the two tokens corresponding to cell 802g (step 540); however, it may be determined that “VP” is an acronym (step 532). This determination may be accomplished by comparing the first character of a first token “VP” corresponding to cell 802g (e.g. “V”) to the first character of the other token corresponding to cell 802g (e.g. “Very”). As the character “V” matches the first character of the token “Very”, the next character of the token “VP” (e.g. “P”) is compared to the first character of the following token in the other name (e.g. “Pretty”). As these characters match, and there are no more characters of the first token (e.g. “VP”), it can be determined that the token “VP” is an acronym, and values can be computed for the set of cells 802g, 802k corresponding to the acronym token (e.g. each cell which corresponds to one character of the acronym token and a token of the other name) similarly to the computation discussed above (in the example depicted with respect to
The rest of the cells 802 of table 800 may then be iterated through beginning with cell 802h to calculate last match positions and cell weights for these cells as described above. Cells 802g and 802k may be skipped during this iterative process as these cells have already been matched via an acronym (step 530). After iterating through the remainder of cells 802 of table 800, table 800 may resemble the table depicted in
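The acronym determination described for the token “VP” may be sketched in Python as follows. The function name is_acronym, the case-insensitive comparison and the requirement that an acronym have at least two characters are assumptions made for this illustration.

```python
def is_acronym(token, other_tokens, start):
    """Check whether each character of `token` matches the first character of
    consecutive tokens of the other name, beginning at index `start`."""
    # Assumption: a one-character token is not treated as an acronym.
    if len(token) < 2 or start + len(token) > len(other_tokens):
        return False
    return all(
        ch.lower() == other_tokens[start + k][:1].lower()
        for k, ch in enumerate(token)
    )


other_name = ["Bobs", "Very", "Pretty", "Flower", "Shop"]
print(is_acronym("VP", other_name, 1))  # True: "V" ~ "Very", "P" ~ "Pretty"
print(is_acronym("VP", other_name, 0))  # False: "V" does not match "Bobs"
```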
Reading the last cell 802u of table 800, it can be determined that the weight for the two names being compared is 850. This weight may then be normalized according to a maximum similarity index and a ratio of the weight to an average information score for the two names (step 450). For example, if the maximum similarity index is 10, the weight may be normalized to a normalized index value of 8 by rounding the results of the equation 10*850/1125. This normalized index value may be used as an index into a normalized weight table to generate the final weight for the two names (step 450). For example, the normalized weight of 8 may index to a final weight of 520 for the two names.
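The information scores and the normalized index for this acronym example may be checked with a brief Python sketch. The function names and the exact-weight dictionary layout are hypothetical. Rounding half up is an assumption, since the description says only that the result is rounded. The lookup of the final weight in the normalized weight table is not shown.

```python
import math

# Exact match weights used in the acronym example.
EXACT_WEIGHTS = {"Bobs": 200, "VP": 300, "Very": 150, "Pretty": 300,
                 "Flower": 400, "Shop": 150}


def information_score(tokens, exact_weights):
    """Sum of the exact match weights of the tokens of a name (step 430)."""
    return sum(exact_weights[t] for t in tokens)


def normalized_index(weight, avg_info_score, max_index=10):
    """Scale the name weight by the average information score (step 450).
    Rounding half up is an assumption of this sketch."""
    return math.floor(max_index * weight / avg_info_score + 0.5)


score_a = information_score(["Bobs", "VP", "Flower", "Shop"], EXACT_WEIGHTS)
score_b = information_score(["Bobs", "Very", "Pretty", "Flower", "Shop"],
                            EXACT_WEIGHTS)
average = (score_a + score_b) / 2
print(score_a, score_b, average)       # 1050 1200 1125.0
print(normalized_index(850, average))  # 8
```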
Pseudocode describing one embodiment of a method for comparing names is presented below to further help in the comprehension of embodiments of the present invention:
The normalized index value which may be returned by the embodiment of the present invention described in the pseudocode above may be used to index a table of values to obtain a final weight, as described above. Such a table may resemble the following, where the maximum index value may be 16:
In the foregoing specification, the invention has been described with reference to specific embodiments. However, one of ordinary skill in the art appreciates that various modifications and changes can be made without departing from the scope of the invention as set forth in the claims below. Accordingly, the specification and figures are to be regarded in an illustrative rather than a restrictive sense, and all such modifications are intended to be included within the scope of invention.
Benefits, other advantages, and solutions to problems have been described above with regard to specific embodiments. However, the benefits, advantages, solutions to problems, and any component(s) that may cause any benefit, advantage, or solution to occur or become more pronounced are not to be construed as a critical, required, or essential feature or component of any or all the claims.
This application is a continuation of, and claims a benefit of priority under 35 U.S.C. §120 of the filing date of, U.S. patent application Ser. No. 11/521,928, now allowed, entitled “METHOD AND SYSTEM FOR COMPARING ATTRIBUTES SUCH AS BUSINESS NAMES” by Norm Adams, Scott Ellard and Scott Schumacher, filed Sep. 15, 2006, now U.S. Pat. No. 7,685,093, which is fully incorporated herein by reference.
Number | Name | Date | Kind |
---|---|---|---|
4531186 | Knapman | Jul 1985 | A |
5020019 | Ogawa | May 1991 | A |
5134564 | Dunn et al. | Jul 1992 | A |
5247437 | Vale et al. | Sep 1993 | A |
5321833 | Chang et al. | Jun 1994 | A |
5323311 | Fukao et al. | Jun 1994 | A |
5333317 | Dann | Jul 1994 | A |
5381332 | Wood | Jan 1995 | A |
5442782 | Malatesta et al. | Aug 1995 | A |
5497486 | Stolfo | Mar 1996 | A |
5535322 | Hecht | Jul 1996 | A |
5535382 | Ogawa | Jul 1996 | A |
5537590 | Amado | Jul 1996 | A |
5555409 | Leenstra et al. | Sep 1996 | A |
5561794 | Fortier | Oct 1996 | A |
5583763 | Atcheson et al. | Dec 1996 | A |
5600835 | Garland et al. | Feb 1997 | A |
5606690 | Hunter et al. | Feb 1997 | A |
5615367 | Bennett et al. | Mar 1997 | A |
5640553 | Schultz | Jun 1997 | A |
5651108 | Cain et al. | Jul 1997 | A |
5675752 | Scott et al. | Oct 1997 | A |
5675753 | Hansen et al. | Oct 1997 | A |
5694593 | Baclawski | Dec 1997 | A |
5694594 | Chang | Dec 1997 | A |
5710916 | Barbara et al. | Jan 1998 | A |
5734907 | Jarossay et al. | Mar 1998 | A |
5765150 | Burrows | Jun 1998 | A |
5774661 | Chatterjee | Jun 1998 | A |
5774883 | Andersen | Jun 1998 | A |
5774887 | Wolff et al. | Jun 1998 | A |
5778370 | Emerson | Jul 1998 | A |
5787431 | Shaughnessy | Jul 1998 | A |
5787470 | DeSimone et al. | Jul 1998 | A |
5790173 | Strauss | Aug 1998 | A |
5796393 | MacNaughton et al. | Aug 1998 | A |
5805702 | Curry | Sep 1998 | A |
5809499 | Wong et al. | Sep 1998 | A |
5819264 | Palmon et al. | Oct 1998 | A |
5835712 | DuFresne | Nov 1998 | A |
5835912 | Pet | Nov 1998 | A |
5848271 | Caruso et al. | Dec 1998 | A |
5859972 | Subramaniam et al. | Jan 1999 | A |
5862322 | Anglin et al. | Jan 1999 | A |
5862325 | Reed et al. | Jan 1999 | A |
5878043 | Casey | Mar 1999 | A |
5893074 | Hughes et al. | Apr 1999 | A |
5893110 | Weber et al. | Apr 1999 | A |
5905496 | Lau et al. | May 1999 | A |
5930768 | Hooban | Jul 1999 | A |
5960411 | Hartman et al. | Sep 1999 | A |
5963915 | Kirsch | Oct 1999 | A |
5987422 | Buzsaki | Nov 1999 | A |
5991758 | Ellard | Nov 1999 | A |
5999937 | Ellard | Dec 1999 | A |
6014664 | Fagin et al. | Jan 2000 | A |
6016489 | Cavanaugh et al. | Jan 2000 | A |
6018733 | Kirsch et al. | Jan 2000 | A |
6018742 | Herbert, III | Jan 2000 | A |
6026433 | D'Arlach et al. | Feb 2000 | A |
6049847 | Vogt et al. | Apr 2000 | A |
6067549 | Smalley et al. | May 2000 | A |
6069628 | Farry et al. | May 2000 | A |
6078325 | Jolissaint et al. | Jun 2000 | A |
6108004 | Medl | Aug 2000 | A |
6134581 | Ismael et al. | Oct 2000 | A |
6185608 | Hon et al. | Feb 2001 | B1 |
6223145 | Hearst | Apr 2001 | B1 |
6269373 | Apte et al. | Jul 2001 | B1 |
6297824 | Hearst et al. | Oct 2001 | B1 |
6298478 | Nally et al. | Oct 2001 | B1 |
6311190 | Bayer et al. | Oct 2001 | B1 |
6327611 | Everingham | Dec 2001 | B1 |
6330569 | Baisley et al. | Dec 2001 | B1 |
6356931 | Ismael et al. | Mar 2002 | B2 |
6374241 | Lamburt et al. | Apr 2002 | B1 |
6385600 | McGuinness et al. | May 2002 | B1 |
6389429 | Kane et al. | May 2002 | B1 |
6446188 | Henderson et al. | Sep 2002 | B1 |
6449620 | Draper | Sep 2002 | B1 |
6457065 | Rich et al. | Sep 2002 | B1 |
6460045 | Aboulnaga et al. | Oct 2002 | B1 |
6496793 | Veditz et al. | Dec 2002 | B1 |
6502099 | Rampy et al. | Dec 2002 | B1 |
6510505 | Burns et al. | Jan 2003 | B1 |
6523019 | Borthwick | Feb 2003 | B1 |
6529888 | Heckerman et al. | Mar 2003 | B1 |
6556983 | Altschuler et al. | Apr 2003 | B1 |
6557100 | Knutson | Apr 2003 | B1 |
6621505 | Beauchamp et al. | Sep 2003 | B1 |
6633878 | Underwood | Oct 2003 | B1 |
6633882 | Fayyad et al. | Oct 2003 | B1 |
6633992 | Rosen | Oct 2003 | B1 |
6647383 | August et al. | Nov 2003 | B1 |
6662180 | Aref et al. | Dec 2003 | B1 |
6687702 | Vaitheeswaran et al. | Feb 2004 | B2 |
6704805 | Acker et al. | Mar 2004 | B1 |
6718535 | Underwood | Apr 2004 | B1 |
6742003 | Heckerman et al. | May 2004 | B2 |
6757708 | Craig et al. | Jun 2004 | B1 |
6795793 | Shayegan et al. | Sep 2004 | B2 |
6807537 | Thiesson et al. | Oct 2004 | B1 |
6842761 | Diamond et al. | Jan 2005 | B2 |
6842906 | Bowman-Amuah | Jan 2005 | B1 |
6879944 | Tipping et al. | Apr 2005 | B1 |
6907422 | Predovic | Jun 2005 | B1 |
6912549 | Rotter et al. | Jun 2005 | B2 |
6922695 | Skufca et al. | Jul 2005 | B2 |
6957186 | Guheen et al. | Oct 2005 | B1 |
6990636 | Beauchamp et al. | Jan 2006 | B2 |
6996565 | Skufca et al. | Feb 2006 | B2 |
7035809 | Miller et al. | Apr 2006 | B2 |
7043476 | Robson | May 2006 | B2 |
7099857 | Lambert | Aug 2006 | B2 |
7143091 | Charnock et al. | Nov 2006 | B2 |
7155427 | Prothia | Dec 2006 | B1 |
7181459 | Grant et al. | Feb 2007 | B2 |
7249131 | Skufca et al. | Jul 2007 | B2 |
7330845 | Lee et al. | Feb 2008 | B2 |
7487173 | Medicke et al. | Feb 2009 | B2 |
7526486 | Cushman, II et al. | Apr 2009 | B2 |
7567962 | Chakrabarti et al. | Jul 2009 | B2 |
7620647 | Stephens et al. | Nov 2009 | B2 |
7627550 | Adams et al. | Dec 2009 | B1 |
7685093 | Adams et al. | Mar 2010 | B1 |
7698268 | Adams et al. | Apr 2010 | B1 |
7788274 | Ionescu | Aug 2010 | B1 |
8321383 | Schumacher et al. | Nov 2012 | B2 |
8321393 | Adams et al. | Nov 2012 | B2 |
20020007284 | Schurenberg et al. | Jan 2002 | A1 |
20020073099 | Gilbert et al. | Jun 2002 | A1 |
20020080187 | Lawton | Jun 2002 | A1 |
20020087599 | Grant et al. | Jul 2002 | A1 |
20020095421 | Koskas | Jul 2002 | A1 |
20020099694 | Diamond et al. | Jul 2002 | A1 |
20020152422 | Sharma et al. | Oct 2002 | A1 |
20020156917 | Nye | Oct 2002 | A1 |
20020178360 | Wenocur et al. | Nov 2002 | A1 |
20030004770 | Miller et al. | Jan 2003 | A1 |
20030004771 | Yaung | Jan 2003 | A1 |
20030018652 | Heckerman et al. | Jan 2003 | A1 |
20030023773 | Lee et al. | Jan 2003 | A1 |
20030051063 | Skufca et al. | Mar 2003 | A1 |
20030065826 | Skufca et al. | Apr 2003 | A1 |
20030065827 | Shufca et al. | Apr 2003 | A1 |
20030105825 | Kring et al. | Jun 2003 | A1 |
20030120630 | Tunkelang | Jun 2003 | A1 |
20030145002 | Kleinberger et al. | Jul 2003 | A1 |
20030158850 | Lawrence et al. | Aug 2003 | A1 |
20030174179 | Suermondt et al. | Sep 2003 | A1 |
20030182101 | Lambert | Sep 2003 | A1 |
20030182310 | Charnock et al. | Sep 2003 | A1 |
20030195836 | Hayes et al. | Oct 2003 | A1 |
20030195889 | Yao et al. | Oct 2003 | A1 |
20030195890 | Oommen | Oct 2003 | A1 |
20030220858 | Lam et al. | Nov 2003 | A1 |
20030227487 | Hugh | Dec 2003 | A1 |
20040107189 | Burdick et al. | Jun 2004 | A1 |
20040107205 | Burdick et al. | Jun 2004 | A1 |
20040122790 | Walker et al. | Jun 2004 | A1 |
20040143477 | Wolff | Jul 2004 | A1 |
20040143508 | Bohn et al. | Jul 2004 | A1 |
20040181526 | Burdick et al. | Sep 2004 | A1 |
20040181554 | Heckerman et al. | Sep 2004 | A1 |
20040220926 | Lamkin et al. | Nov 2004 | A1 |
20040260694 | Chaudhuri et al. | Dec 2004 | A1 |
20050004895 | Schurenberg et al. | Jan 2005 | A1 |
20050015381 | Clifford et al. | Jan 2005 | A1 |
20050015675 | Kolawa et al. | Jan 2005 | A1 |
20050050068 | Vaschillo et al. | Mar 2005 | A1 |
20050055345 | Ripley | Mar 2005 | A1 |
20050060286 | Hansen et al. | Mar 2005 | A1 |
20050071194 | Bormann et al. | Mar 2005 | A1 |
20050075917 | Flores et al. | Apr 2005 | A1 |
20050114369 | Gould et al. | May 2005 | A1 |
20050149522 | Cookson et al. | Jul 2005 | A1 |
20050154615 | Rotter et al. | Jul 2005 | A1 |
20050210007 | Beres et al. | Sep 2005 | A1 |
20050228808 | Mamou et al. | Oct 2005 | A1 |
20050240392 | Munro et al. | Oct 2005 | A1 |
20050256740 | Kohan et al. | Nov 2005 | A1 |
20050256882 | Able et al. | Nov 2005 | A1 |
20050273452 | Molloy et al. | Dec 2005 | A1 |
20060053151 | Gardner et al. | Mar 2006 | A1 |
20060053172 | Gardner et al. | Mar 2006 | A1 |
20060053173 | Gardner et al. | Mar 2006 | A1 |
20060053382 | Gardner et al. | Mar 2006 | A1 |
20060064429 | Yao | Mar 2006 | A1 |
20060074832 | Gardner et al. | Apr 2006 | A1 |
20060074836 | Gardner et al. | Apr 2006 | A1 |
20060080312 | Friedlander et al. | Apr 2006 | A1 |
20060116983 | Dettinger et al. | Jun 2006 | A1 |
20060117032 | Dettinger et al. | Jun 2006 | A1 |
20060129605 | Doshi | Jun 2006 | A1 |
20060129971 | Rojer | Jun 2006 | A1 |
20060136205 | Song | Jun 2006 | A1 |
20060161522 | Dettinger et al. | Jul 2006 | A1 |
20060167896 | Kapur et al. | Jul 2006 | A1 |
20060179050 | Giang et al. | Aug 2006 | A1 |
20060190445 | Risberg et al. | Aug 2006 | A1 |
20060195560 | Newport | Aug 2006 | A1 |
20060265400 | Fain et al. | Nov 2006 | A1 |
20060271401 | Lassetter et al. | Nov 2006 | A1 |
20060271549 | Rayback et al. | Nov 2006 | A1 |
20060287890 | Stead et al. | Dec 2006 | A1 |
20070005567 | Hermansen et al. | Jan 2007 | A1 |
20070016450 | Bhora et al. | Jan 2007 | A1 |
20070055647 | Mullins et al. | Mar 2007 | A1 |
20070067285 | Blume et al. | Mar 2007 | A1 |
20070073678 | Scott et al. | Mar 2007 | A1 |
20070073745 | Scott et al. | Mar 2007 | A1 |
20070094060 | Apps et al. | Apr 2007 | A1 |
20070150279 | Gandhi et al. | Jun 2007 | A1 |
20070192715 | Kataria et al. | Aug 2007 | A1 |
20070198481 | Hogue et al. | Aug 2007 | A1 |
20070198600 | Betz | Aug 2007 | A1 |
20070214129 | Ture et al. | Sep 2007 | A1 |
20070214179 | Hoang | Sep 2007 | A1 |
20070217676 | Grauman et al. | Sep 2007 | A1 |
20070250487 | Reuther | Oct 2007 | A1 |
20070260492 | Feied et al. | Nov 2007 | A1 |
20070276844 | Segal et al. | Nov 2007 | A1 |
20070276858 | Cushman et al. | Nov 2007 | A1 |
20070299697 | Friedlander et al. | Dec 2007 | A1 |
20070299842 | Morris et al. | Dec 2007 | A1 |
20080005106 | Schumacher et al. | Jan 2008 | A1 |
20080016218 | Jones et al. | Jan 2008 | A1 |
20080069132 | Ellard et al. | Mar 2008 | A1 |
20080120432 | Lamoureux et al. | May 2008 | A1 |
20080126160 | Takuechi et al. | May 2008 | A1 |
20080243832 | Adams et al. | Oct 2008 | A1 |
20080243885 | Harger et al. | Oct 2008 | A1 |
20080244008 | Wilkinson et al. | Oct 2008 | A1 |
20090089317 | Ford et al. | Apr 2009 | A1 |
20090089332 | Harger et al. | Apr 2009 | A1 |
20090089630 | Goldenberg et al. | Apr 2009 | A1 |
20090198686 | Cushman, II et al. | Aug 2009 | A1 |
20100114877 | Adams et al. | May 2010 | A1 |
20100174725 | Adams et al. | Jul 2010 | A1 |
20100175024 | Schumacher et al. | Jul 2010 | A1 |
20110010214 | Carruth | Jan 2011 | A1 |
20110010346 | Goldenberg et al. | Jan 2011 | A1 |
20110010401 | Adams et al. | Jan 2011 | A1 |
20110010728 | Goldenberg et al. | Jan 2011 | A1 |
20110047044 | Wright et al. | Feb 2011 | A1 |
20110191349 | Ford et al. | Aug 2011 | A1 |
Number | Date | Country |
---|---|---|
9855947 | Dec 1998 | WO |
0159586 | Aug 2001 | WO |
0159586 | Aug 2001 | WO |
0175679 | Oct 2001 | WO |
03021485 | Mar 2003 | WO |
2004023297 | Mar 2004 | WO |
2004023311 | Mar 2004 | WO |
2004023345 | Mar 2004 | WO |
2009042931 | Apr 2009 | WO |
2009042941 | Apr 2009 | WO |
Entry |
---|
Microsoft Dictionary, Sep. 8, 2008, Microsoft Corp, “normalize”, fifth edition,at p. 4, downloaded from safaribooks, pp. 1-4. |
Fair, “Record Linkage in the National Dose Registry of Canada”, European Journal of Cancer, vol. 3, Supp. 3, pp. S37-S43, XP005058648 ISSN: 0959-8049, Apr. 1997. |
International Search Report and Written Opinion, for PCT/US2007/012073, Mailed Jul. 23, 2008, 12 pages. |
International Preliminary Report on Patentability Issued in PCT/US2007/013049, Mailed Dec. 17, 2008. |
International Search Report and Written Opinion issued in PCT/US2007/013049, mailed Jun. 13, 2008. |
Office Action issued in U.S. Appl. No. 11/809,792, mailed Aug. 21, 2009, 14 pages. |
Oracle Data Hubs: “The Emperor Has No Clothes?”, Feb. 21, 2005, Google.com, pp. 1-9. |
IEEE, no matched results , Jun. 30, 2009, p. 1. |
IEEE No matched Results, 1 Page, Sep. 11, 2009. |
Office Action issued in U.S. Appl. No. 11/522,223 dated Aug. 20, 2008, 16 pgs. |
Office Action issued in U.S. Appl. No. 11/522,223 dated Feb. 5, 2009, Adams, 17 pages. |
Notice of Allowance issued for U.S. Appl. No. 11/522,223, dated Sep. 17, 2009, 20 pages. |
De Rose, et al. “Building Structured Web Community Portals: A Top-Down, Compositional, and Incremental Approach”, VDLB, ACM, pp. 399-410, Sep. 2007. |
Microsoft Dictionary, “normalize”, at p. 20, Fifth Edition, Microsoft Corp., downloaded from http://proquest.safaribooksonline.com/0735614954 on Sep. 8, 2008. |
Office Action issued in U.S. Appl. No. 11/521,928 dated Apr. 1, 2009, 22 pages. |
Office Action issued in U.S. Appl. No. 11/521,928 dated Sep. 16, 2008, 14 pages. |
Notice of Allowance issued for U.S. Appl. No. 11/521,928, dated Sep. 18, 2009, 20 pages. |
Gopalan Suresh Raj, Modeling Using Session and Entity Beans, Dec. 1998, Web Cornucopia, pp. 1-15. |
Scott W. Ambler, Overcoming Data Design Challenges, Aug. 2001, p. 1-3. |
XML, JAVA, and the future of the Web, Bosak, J., Sun Microsystems, Mar. 10, 1997, pp. 1-9. |
Integrated Document and Workflow Management applied to Offer Processing a Machine Tool Company, Stefan Morschheuser, et al., Dept. of Information Systems I, COOCS '95 Milpitas CA, ACM 0-89791-706-5/95, p. 106-115, 1995. |
International Search Report mailed on Jul. 19, 2006, for PCT/IL2005/000784 (6 pages). |
Hamming Distance, HTML. Wikipedia.org, Available: http://en.wikipedia.org/wiki/Hamming—distance (as of May 8, 2008). |
Office Action Issued in U.S. Appl. No. 11/521,946 mailed May 14, 2008, 10 pgs. |
Office Action issued in U.S. Appl. No. 11/521,946 mailed Dec. 9, 2008, 10 pgs. |
Office Action issued in U.S. Appl. No. 11/521,946 mailed May 13, 2009, 12 pgs. |
Freund et al., Statistical Methods, 1993, Academic Press Inc., United Kingdom Edition, pp. 112-117. |
Merriam-Webster dictionary defines “member” as “individuals”, 2008. |
Waddington, D., “Does it signal convergence of operational and analytic MDM?” retrieved from the internet:<URL: http://www.intelligententerprise.com>, 2 pages, Aug. 2006. |
International Search Report mailed on Oct. 10, 2008, for PCT Application No. PCT/US07/20311 (10 pp). |
International Search Report and Written Opinion issued in PCT/US07/89211, mailing date of Jun. 20, 2008. |
International Search Report and Written Opinion for PCT/US08/58404, dated Aug. 15, 2008. |
International Preliminary Report on Patentability Under Chapter 1 for PCT Application No. PCT/US2008/058665, issued Sep. 29, 2009, mailed Oct. 8, 2009, 6 pgs. |
International Search Report and Written Opinion mailed on Dec. 3, 2008 for International Patent Application No. PCT/US2008/077985. |
Gu, Lifang, et al., “Record Linkage: Current Practice and Future Directions,” CSIRO Mathematical and Informational Sciences, 2003, pp. 1-32. |
O'Hara-Schettino, et al., “Dynamic Navigation in Multiple View Software Specifications and Designs,” Journal of Systems and Software, vol. 41, Issue 2, May 1998, pp. 93-103. |
International Search Report and Written Opinion mailed on Oct. 10, 2008 for PCT Application No. PCT/US08/68979. |
International Search Report and Written Opinion mailed on Dec. 2, 2008 for PCT/US2008/077970. |
Martha E. Fair, et al., “Tutorial on Record Linkage Slides Presentation”, Chapter 12, pp. 457-479, Apr. 1997. |
International Search Report and Written Opinion mailed on Aug. 28, 2008 for Application No. PCT/US2008/58665, 7 pgs. |
C.C. Gotlieb, Oral Interviews with C.C. Gotlieb, Apr. 1992, May 1992, ACM, pp. 1-72. |
Google.com, no match results, Jun. 30, 2009, p. 1. |
Supplementary European Search Report for EP 07 79 5659 dated May 18, 2010, 5 pages. |
European Communication for EP 98928878 (PCT/US9811438) dated Feb. 16, 2006. |
European Communication for EP 98928878 (PCT/US9811438) dated Mar. 10, 2008. |
European Communication for EP 98928878 (PCT/US9811438) dated Jun. 26, 2006. |
Gill, “OX-LINK: The Oxford Medical Record Linkage System”, Internet Citation, 1997. |
Newcombe et al., “The Use of Names for Linking Personal Records”, Journal of the American Statistical Association, vol. 87, Dec. 1, 1992, pp. 335-349. |
European Communication for EP 07795659 (PCT/US2007013049) dated May 27, 2010. |
Ohgaya, Ryosuke et al., “Conceptual Fuzzy Sets-Based Navigation System for Yahoo!”, NAFIPS 2002, Jun. 27-29, 2002, pp. 274-279. |
Xue, Gui-Rong et al., “Reinforcing Web-Object Categorization Through Interrelationships”, Data Mining and Knowledge Discovery, vol. 12, Apr. 4, 2006, pp. 229-248. |
Jason Woods, et al., “Baja Identity Hub Configuration Process”, Publicly available on Apr. 2, 2009, Version 1.3. |
Initiate Systems, Inc. “Refining the Auto-Link Threshold Based Upon Scored Sample”, Publicly available on Apr. 2, 2009; memorandum. |
Initiate Systems, Inc. “Introduction”, “False-Positive Rate (Auto-Link Threshold)”, Publicly available on Apr. 2, 2009; memorandum. |
Jason Woods, “Workbench 8.0 Bucket Analysis Tools”, Publicly available on Apr. 2, 2009. |
“Parsing” Publicly available on Oct. 2, 2008. |
Initiate, “Business Scenario: Multi-Lingual Algorithm and Hub,” Publicly available on Apr. 2, 2009. |
Initiate, “Business Scenario: Multi-Lingual & Many-To-Many Entity Solutions”, Publicly available on Apr. 2, 2009. |
Initiate, “Relationships-MLH”, presentation; Publicly available on Sep. 28, 2007. |
Initiate, “Multi-Lingual Hub Support via Memtype Expansion”, Publicly available on Apr. 2, 2009. |
Initiate Systems, Inc. “Multi-Language Hubs”, memorandum; Publicly available on Apr. 2, 2009. |
Initiate, “Business Scenario: Support for Members in Multiple Entities”, Publicly available on Oct. 2, 2008. |
Initiate, “Group Entities”, Publicly available on Mar. 30, 2007. |
Jim Cushman, MIO 0.5: MIO As a Source; Initiate; Publicly available on Oct. 2, 2008. |
Initiate, “Provider Registry Functionality”, Publicly available on Oct. 2, 2008. |
Edward Seabolt, “Requirement Specification Feature #NNNN Multiple Entity Relationship”, Version 0.1—Draft; Publicly available on Oct. 2, 2008. |
Initiate, “Arriba Training Engine Callouts”, presentation; Publicly available on Mar. 30, 2007. |
Initiate, “Business Scenario: Callout to Third Party System”, Publicly available on Oct. 2, 2008. |
John Dorney, “Requirement Specification Feature #NNNN Conditional Governance”, Version 1.0—Draft; Publicly available on Oct. 2, 2008. |
Initiate, Release Content Specification, Identity Hub Release 6.1, RCS Version 1.0; Publicly available on Sep. 16, 2005. |
Initiate, “Initiate Identity Hub™ Manager User Manual”, Release 6.1; Publicly available on Sep. 16, 2005. |
End User Training CMT; CIO Maintenance Tool (CMT) Training Doc; Publicly available on Sep. 29, 2006. |
“Hierarchy Viewer—OGT 3.0t”, Publicly available on Sep. 25, 2008. |
“Building and Searching the OGT”, Publicly available on Sep. 29, 2006. |
Sean Stephens, “Requirement Specification B2B Web Client Architecture”, Version 0.1—Draft; Publicly available on Sep. 25, 2008. |
“As of: OGT 2.0”, Publicly available on Sep. 29, 2006. |
Initiate, “Java SDK Self-Training Guide”, Release 7.0; Publicly available on Mar. 24, 2006. |
Initiate, “Memtype Expansion Detailed Design”, Publicly available on Apr. 2, 2009. |
Adami, Giordano et al., “Clustering Documents in a Web Directory”, WIDM '03, New Orleans, LA, Nov. 7-8, 2003, pp. 66-73. |
Chen, Hao et al., “Bringing Order to the Web: Automatically Categorizing Search Results”, CHI 2000, CHI Letters, vol. 2, Issue 1, Apr. 1-6, 2000, pp. 145-152. |
“Implementation Defined Segments—Exhibit A”, Publicly available on Mar. 20, 2008. |
Initiate, “Implementation Defined Segments—Gap Analysis”, Publicly available on Mar. 20, 2008. |
“Supporting Hierarchies”, Publicly available on Nov. 29, 2007. |
Xue, Gui-Rong et al., “Implicit Link Analysis for Small Web Search”, SIGIR '03, Toronto, Canada, Jul. 28-Aug. 1, 2003, pp. 56-63. |
Liu, Fang et al., “Personalized Web Search for Improving Retrieval Effectiveness”, IEEE Transactions on Knowledge and Data Engineering, vol. 16, No. 1, Jan. 2004, pp. 28-40. |
Anyanwu, Kemafor et al., “SemRank: Ranking Complex Relationship Search Results on the Semantic Web”, WWW 2005, Chiba, Japan, May 10-14, 2005, pp. 117-127. |
International Preliminary Report on Patentability, PCT/US2008/58404, Mar. 21, 2011, 4 pages. |
European Search Report/EP07795659.7, Apr. 15, 2011, 7 pages. |
Emdad Ahmed, “A Survey on Bioinformatics Data and Service Integration Using Ontology and Declarative Workflow Query Language”, Department of Computer Science, Wayne State University, USA, Mar. 15, 2007, pp. 1-67. |
International Preliminary Report on Patentability, PCT/US2007/89211, Apr. 30, 2012, 6 pages. |
European Search Report/EP07795108.5, May 29, 2012, 6 pages. |
Number | Date | Country | |
---|---|---|---|
20100174725 A1 | Jul 2010 | US |
 | Number | Date | Country
---|---|---|---
Parent | 11521928 | Sep 2006 | US
Child | 12687324 | | US