The present disclosure relates generally to the field of data management. In a more limited aspect, the present development relates to match grouping of a plurality of data records within a data set to reduce computational time and complexity in identifying matches among the data records using a computer-based information handling system. In certain embodiments the present development relates to a system and method for matching data records within a data set, each data record containing data elements representative of a business entity, wherein the present system and method may advantageously be used to derive business intelligence information about the entities described by the data records, and the present development will be described herein primarily by way of reference thereto, although it will be recognized that the present development is amenable to for use with entities of any type, including without limitation, persons, employees, customers, vendors, manufacturers, property, equipment and all manner of other items.
Business Intelligence and Analytics problems often require efficient matching and linking of very large, loosely coupled data sets in order to build holistic profiles of business entities, such as employers. Current techniques to analyze and link these data sets demand sequential processing in order to prevent data inconsistencies. When operating with relatively large data sets, e.g., data sets in the millions of records with numbers of matches in the tens or hundreds of millions of possibilities, sequential methods are no longer feasible. This is due to exceptionally long processing times with a sequential set of operations. The present disclosure presents a method and system that overcome such problems and others by providing a highly efficient, parallel process for identifying and linking matching records within a data set, which makes the present system and method feasible and practical to carry out on modest computing systems.
Consider a loosely structured data set A that consists of, for example, 10 million records, labeled A1, A2, A3, etc. through A10000000. Distinct records within data set A have variable numbers of populated data elements, for example, record A1 may have 20 populated data elements, while A2 may have 15 populated data elements and A3 may have 30 populated data elements. In the case of data pertaining to business entities, examples of data elements include entity name, address, telephone number, web address, email address, employer identification number (EIN), alternative business name, fax number, employee count, revenue range, year founded, and industry, among others.
Each record in data set A represents profile data (business intelligence) on a particular entity, such as an employer (company). The present development will be described herein primarily by way of reference to a data set relating to one or more businesses or employers, although the present development is not limited to such.
For a given employer represented within the data set, there may be (a) one record in data set A which represents that employer, or (b) more than one record in data set A representing that employer. However, there is no particular identifier within the data itself that decisively identifies the employer to which each record belongs. Instead, the likely identity of an employer associated with a given data record must be evaluated and derived via algorithmic processing.
The process to produce a holistic profile for a particular entity, in this case an employer, represented by data in data set A is to identify which records in data set A relate to the same employer (such as an employer we will label E1), and which relate to other employers (e.g., E2, up to EN). To do this, it is necessary to process each of the data records in data set A and match/link them together, so that all the records determined to be associated with the same employer E1 are linked together within a group G1 (see
The challenge of building these groups is further complicated due to the possibility of record match chaining. Chaining happens when matches such as the following occur:
In this situation, it is desirable that all of these chains be resolved in order to arrive at the appropriate linked group G1 for a single employer we will call E1, resulting in:
Employer E1=[A1, A7, A99, A207, A311]
Once linked successfully by the appropriate employer identifier, the available profile information (business intelligence) on employer E1 can be derived through examination of the aggregate set of values within the records that are linked to one another because they are associated with the same employer E1. For example, one record might contain an employer identification number (EIN), where another might contain an address, where another (or more than one) might contain phone numbers.
The matching and linking process needed in this case to build profiles for these employers must include a method to compare one record in set A with other records in set A, to determine which ones match each other and, thus, are assumed to be related to the same employer. In loosely structured, sparse data sets such as that described here, there is no definitive method to do this. For example, one cannot simply compare names because one record might have a name, and another may not have a name at all. Alternatively, the names may be different, even though the company is in fact the same, for example if one describes a brand name or fictitious trade name while another describes a legal corporation name. Likewise, phone numbers may or may not match, addresses may or may not match, and so on. Moreover, a particular comparison between two records may come to the conclusion that the two records are not associated with the same employer but instead represent different employers.
There are three primary challenges present in implementing a process capable of deriving employer profiles as outlined here. Those key challenges are:
To solve a problem such as this on large data sets using non-computational means would be virtually impossible. That is because, to complete the process, it is necessary to compare each record in set A with every other different record in set A, to apply the explicit and fuzzy matching logic and determine the likelihood that the records in each pair of records are related or not and whether that likelihood meets an acceptable threshold for deciding whether or not the records in each pair should be linked to each other. For a data set A with 10 million records, that total number of comparisons is combinatoric and requires up to 49.99 trillion comparisons. Then, chains need to followed and resolved, and, finally, efficient assignment of employer identifiers to each record must be applied.
The method described herein presents a highly efficient approach to match sparsely populated records, as required by step 1, and then to unwind and resolve match chains of arbitrary depth in order to identify non-overlapping record sets, as required in step 2 of this process. The unwinding and resolution process is able to collapse the recursive nature of the chained match data into an efficient sequential processing sequence where match groups are subsequently collapsed into prior match groups when chained matches are encountered.
By application of this particular method, it is possible to execute steps 1 and 3 in a highly parallel manner and thus achieve a high degree of overall performance in solving this type of problem. Sequential processing is isolated to Step 2, which can be executed very efficiently using the present method.
The invention may take form in various components and arrangements of components, and in various steps and arrangements of steps. The drawings are only for purposes of illustrating preferred embodiments and are not to be construed as limiting the invention.
The three steps described above in building business intelligence profiles for entities from loosely structured data sets comprises:
Assume we are operating on a set of records “A” with individual record identifiers A1, A2, A3, etc.; the following description outlines the three steps:
In certain embodiments, in Step 1, a high performance yet flexible multi-level comparison method is implemented that is able to predict within some acceptable probability threshold whether or not two records represent the same entity. This comparison method relies on both explicit indicators (such as Record A1 and A7 have data element in common, such as the same phone number), as well as “fuzzy” matching logic, which consists of partial matching, overlapping matching, similarity matching, and other business rules that contribute to making the match prediction. For example, in the case of business entities, and using such fuzzy matching logic, Record A1 and A7 would be determined to represent the same entity of the entity name if the records differed only in the presence or omission of an entity designator such as “Inc.” or “LLC” or truncation of a word, such as “Corporation” vs. “Corp.” The current fuzzy matching rules may also be logically extended to utilize other known matching strategies such as machine learning predictions based on historical examples.
In certain embodiments, the matching process is implemented in a unique multi-staged manner to deliver high performance. The process consists of collecting together likely matches (“match candidates”) from the universe of potential record matches, then utilizing a scoring system to predict the likelihood that the records match the same entity. The stages are as follows:
Stage 1: Quick Matching
This stage implements two fast matching rules to find candidate matches:
Stage 2: Field Matching
This stage identifies candidate match records that were not found by Stage 1, but which have at least one common overlapping field, such as employer identification number, phone number, web address, email address, zip code, street address, or the like.
Stage 3: Similarity Matching
This stage identifies candidate matching records that were not found by stages 1 or 2 but are candidates because of some level of name overlap or similarity. The steps are:
Stage 4: Candidate Scoring
For each candidate match record, a flexible scoring system is applied to predict the likelihood that the two records are representative of the same entity. In certain embodiments, the flexible scoring system consists of the following:
Stage 5: Match Prediction
The last stage makes a prediction based on the scoring factors to determine whether these records represent the same entity. In certain embodiments, the prediction itself has multiple thresholds, which can be selected based on the willingness of the consuming application to accept false positives. For example, in certain applications, it may be desirable to apply a “stricter” threshold, which produces higher confidence matches but might miss some matches. Likewise, in certain embodiments, it may be desirable to apply a “less strict” threshold, which normally predicts a higher number of matches but at a trade-off of letting through a higher rate of errors. In certain embodiments, match threshold level is selectable by the user. In certain embodiments, any number (e.g., 2, 3, 4, 5, 6, or more) of varying selectable matching threshold levels are provided.
For each of the candidate records that is predicted to be a match within the selected risk threshold, the candidate record is treated as a confirmed match. The outcome of comparing all the records against each other using these methods is a list of confirmed record matches of the example form:
Because Step 1 only identifies and records matches, but does not write any data back to the original source records, it is possible to execute the matching process in a highly parallel manner across a plurality, e.g., 128 or more, of simultaneous processes, where each distinct process is responsible for producing the list of matches for a given subset of the source records contained in the data set.
In Step 2, we will utilize a unique method to process these match chains efficiently and identify sets of non-overlapping record groups for each entity represented in the data. The purpose of Step 2 is to segregate records into distinct, non-overlapping sets so that parallel processing can be performed on each set independently, in order to maximize algorithmic performance and eliminate the possibility of concurrency conflicts that might lead to data corruption.
The method of Step 2 employs a “partitioning algorithm” described as follows. The partitioning algorithm takes a set of match records as input. Each match record is a pair of record identifiers where the left record identifier is deemed a match for the right record identifier. An illustration of a match record representative of a pair of matching records appears in
In Step 2, each subsequent match record produced by Step 1 is processed in sequence. If both records in the pair currently being processed are not in a partition, they are assigned the next available new partition identifier. If one record in the pair is already in a partition, then the second record is assigned the same partition.
By processing the match records in serial fashion to partition records, concurrency issues are avoided, which would manifest themselves if multiple processes were trying to write conflicting partition information into the same records within the data set at the same time. In certain embodiments, the partitioning algorithm herein works by tracking partition assignments in a numeric array indexed by record (as identified by its record ID). The members of each partition are also stored in sets to enable fast merging of partitions when following match chains. This process avoids the normal complexities of following recursive match chains individually.
A partition table is used as an index to determine which partition set to update during partition assignment. In certain embodiments, for utmost computational efficiency, the partition table is a bit array the size of which is set to the largest numerical record ID, however other implementations are possible. Once the partitioning process is complete, the values in the partition table can be used to write the entity identifiers back to the source records in Step 3, discussed in greater detail below.
The following examples explain the operation of the partitioning system. Examples include matches which have the form An1, An2 to mean An1 implies An2, or more specifically, record An1 matches record An2, wherein n1 and n2 are numerical data record designations. Assigned partition groups are shown next to records using the symbol “:”. For example An1:0 means record An1 is assigned partition 0 and An2:1 means record An2 is assigned partition 1. In the illustrated embodiment, partition “0” is a special partition identifier value that indicates that the record has not been partitioned yet. Finally, partition sets are shown in the form “1:{An1, An2}” meaning that partition number one contains the data records identified by An1 and An2.
In the illustrated embodiments: (a) the presence of a comma (“,”) between a pair of records indicates a match record identified in Step 1; (b) the presence of a dash symbol (“−”) before the pair indicates a match record that has already been processed; and (c) the presence of a colon “:” followed by a number appearing after a data record indicates the partition number to which that data record has been assigned. The process herein is illustrated below, with reference again to the example given above wherein:
A1 matches A7 (A1, A7)
A7 matches A99 (A7, A99)
A99 matches A207 (A99, A207)
A99 matches A311 (A99, A311)
A311 matches A1 (A311, A1).
The following Examples illustrate the process of Step 2.
Example 1 is a simple example wherein an initial partition assignment is made. Starting with an initial match record A1, A7 wherein neither record has yet been partitioned, we have an initial state of:
Matches: A1, A7
Partitions: A1:0 A7:0
Sets: { }.
Because neither A1 nor A7 have yet been assigned a partition group, processing this match record assigns both A1 and A7 to the next available partition identifier, in this case partition 1.
Partitions: A1:1 A7:1
After this step, we have the partition sets:
Sets: 1:{A1, A7}.
Example 2 illustrates a further example, which also considers circular references. Referring again to the example above, consider the following initial state:
Matches: A1,A7 A7,A99 A99,A311 A311,A1
Partitions: A1:0 A7:0 A99:0 A311:0
Sets: { }.
The process first handles the match A1, A7. Since neither record in the match record pair has a partition defined, both records are assigned to the next available partition, namely, partition 1 (note the first match A1, A7 is now shown as having been processed with the “−” prefix designation):
Matches: −A1,A7 A7,A99 A99,A311 A311,A1
Partitions: A1:1 A7:1 A99:0 A311:0
Sets: 1: {A1, A7}.
Now we process the second match, A7, A99, where A7 is already in partition 1, so A99 should also be added to partition 1:
Matches: −A1,A7 −A7,A99 A99,A311 A311,A1
Partitions: A1:1 A7:1 A99:1 A311:0
Sets: 1: {A1, A7, A99}
Now we process the third match, A99, A311, where A99 is already in partition 1, so A311 should be added to partition 1.
Matches: −A1,A7 −A7,A99 −A99,A311 A311,A1
Partitions: A1:1 A7:1 A99:1 A311:1
Sets: 1:{A1, A7, A99, A311}
Lastly, we process the next link, A311:A1, which is a circular reference. Because A311 and A1 are already in the same partition, nothing needs to be changed.
Matches: −A1,A7 −A7,A99 −A99,A311 −A311,A1
Partitions: A1:1 A7:1 A99:1 A311:1
Sets: 1: {A1, A7, A99, A311}
Example 3 looks at a case using matching records from the above example where sets are processed that require partition merges. Consider the following initial state:
Matches: A1,A7 A99,A311 A311,A1
Partitions: A1:0 A7:0 A99:0 A311:0
Sets: { }
The first match record A1, A7 is processed and since neither record is in a partition, A1 and A7 are both added to partition 1:
Matches: −A1,A7 A99,A311 A311,A1
Partitions: A1:1 A7:1 A99:0 A311:0
Sets: 1: {A1, A7}
The second match record A99, A311 is processed and since neither are in a partition, A99 and A311 are both added to the next available partition, which is partition 2:
Matches: −A1,A7 −A99,A311 A311,A1
Partitions: A1:1 A7:1 A99:2 A311:2
Sets: 1:{A1, A7} 2:{A99, A311}
Finally, the third match record A311, A1 is processed and it is determined that both records, though matching, are in different partitions. Because it is an object of Step 2 to assign all matching records into a single, unique partition, the right (e.g., numerically higher) partition (partition 2) is merged into the left (e.g., numerically lower) partition (partition 1). Thus, as a result of the comparison of records A311 and A1, the record A311, as well as all other records in partition 2 (in this example, A99), are merged into partition 1 by changing the partition identifier.
Matches: −A1,A7 −A99,A311 −A311,A1
Partitions: A1:1 A7:1 A99:1 A311:1
Sets: 1: {A1, A7, A99, A311}
Note that there are no more match pairs in partition 2 as a result of the merge operation. Optionally, partitions that become empty as a result of a merge operation, such as partition 2 in this example, can be designated as an available partition for processing of future match records and re-used, although it is not necessary to do so.
Finally, in Step 3, the partition identifiers determined by Step 2 are written back into all the source data records in Data Set A as unique entity identifiers. Based on the matching process and the subsequent partitioning process, we know that there is one unique entity represented in the original source record set for each partition identifier that has been produced. Therefore, in the preferred embodiment, the partition number can be used as a unique assigned entity identifier. However in other embodiments, it is also possible to map the partition identifier to other entity identifiers, or alternatively, to derive an identifier from some selected entity data present in the sparse data records in the same set.
Once these write operations are complete, all the original source records are designated with the particular entity identifier to which they associate. These write operations can likewise be executed safely in a bulk manner or even highly parallel manner because the resolution of which partition value needs to be written to which record has already been decisively determined during Step 2.
The partitioning process described in Step 2 above is illustrated in the flow chart 100 appearing in
If, at step 116, it is determined that record 2 is in a partition (i.e., has been assigned a partition ID), then the process proceeds to step 136 where the partition ID of record 2 defines a value (“PID2” in the illustrated example), and that value is assigned as the partition identifier for record 1 at step 138. The process then continues to step 124 and continues as described above.
If, at step 112, it is determined that record 1 is in a partition, the process proceeds to step 142 where the partition ID of record 1 defines a value (“PID1” in the illustrated example). The process then proceeds to step 144, where it is determined whether record 2 of the match record currently being processes is also in a partition. If at step 144, record 2 is not in a partition 2, then the value PID1 is assigned as the partition identifier for record 2 at step 168. The process then continues to step 124 and continues as described above.
If, at step 144, it is determined that record 2 is in a partition (i.e., has been assigned a partition ID), then the process proceeds to step 148 where the partition ID of record 2 defines the value PID2. Next, at step 152, the value of PID2 is then compared to the value of PID1 as determined at step 142. If it is determined at step 152 that the partition identifiers of records 1 and 2 are the same, then the process continues to step 124 and continues as described above.
If it is determined at step 152 that the partition identifiers of records 1 and 2 are not the same, then a partition merge is required and the process continues to step 156. At step 156, it is determined which partition value is greater. If PID1 is less than PID2, then the process continues to step 160 and the partition identifier of record 2, as well as all other records that were previously assigned the value PID2, are reassigned the partition ID value PID1, thereby merging the matching records 1 and 2 into the same partition, based on the lower partition value, and also cascading the reassignment across the full set of records that had been previously assigned the PID2 value. The process then proceeds to step 124 and continues as described above.
If, at step 156, it is determined that PID1 is greater than PID2, then the process continues to step 164 and the partition identifier of record 1, as well all other records that were previously assigned the PID1 value, are changed to have the PID2 value of record 2, thereby merging the matching records 1 and 2 into the same partition, again, based on the lower partition value, and also cascading the changes across the full set of records that were previously assigned the PID1 value. The process then proceeds to step 124 and continues as described above.
Referring now to
The hardware system 200 includes a central processing system 230, a memory 232, one or more storage devices 234, including main and auxiliary memory, an input/output (I/O) system 236, a network interface 238, a communications interface 240, and a display system 242 operably connected by a bus 244.
The hardware system 200 is controlled by the central processing system 230, which may include a central processing unit such as a microprocessor or microcontroller for executing programs, performing data manipulations and controlling the tasks of the hardware system. The processor 230 can be any suitable Intel, AMD, Motorola, Texas Instruments, or Sun processor, or the like. Communication with the central processor 230 is implemented through the system bus 244 for transferring information among the components of the hardware system.
The memory 232 provides storage of instructions and data for programs executing on the central processing system 230. The memory 232 is typically semiconductor-based memory as would be generally understood by persons skilled in the art. The storage devices 234 may include semiconductor-based memory such as read-only memory (ROM), programmable read-only memory (PROM), erasable programmable read-only memory (EPROM), electrically erasable programmable read-only memory (EEPROM), flash memory, and so forth. The storage devices 234 may also include a variety of non-semiconductor-based memories, including but not limited to hard disk, floppy disc, compact disc read-only memory (CD-ROM), digital versatile disc read-only memory (DVD-ROM), and so forth.
The display system 242 may comprise a display device and a video display adapter having the components for driving a display device, including video memory, buffer, and graphics engine as desired. The display device may comprise a video monitor such as a cathode ray-tube (CRT) display, liquid-crystal display (LCD), light-emitting diode (LED) display, gas or plasma display, and so forth.
The input/output (I/O) system 236 may comprise one or more controllers or adapters for providing interface functions between one or more I/O devices. The input/output system 236 may comprise one or more serial ports, parallel ports, universal serial bus (USB) ports, IEEE 1394 ports, infrared ports, etc., for interfacing with corresponding I/O devices such as a keyboard, mouse/pointing device, printer, modem, microphone, speaker, and so forth.
The network interface 238 may be connected to a network to communicate with other computers, external devices, networks, or information sources on the network 190. The network interface 238 may be a network adapter implementing, for example, IEEE 802 network standards (e.g., IEEE 802.3 for Ethernet networks, IEEE 802.11 for wireless networks, IEEE 802.15 for personal area networks, IEEE 802.16 for broadband wireless metropolitan networks, and so on.).
The communications interface 240 may be connected to a network, such as the Internet for communication with other computers or devices using an ISP and/or a dial up phone system to connect to the network. The communications interface 240 can be a modem, digital subscriber line (DSL), asymmetric digital subscriber line (ASDL), frame relay, asynchronous transfer mode (ATM), integrated digital services network (ISDN), personal communications services (PCS), transmission control protocol/Internet protocol (TCP/IP), serial line Internet protocol/point to point protocol (SLIP/PPP), and so on. It should be appreciated that the hardware system 200 of
The systems and methods disclosed herein can be implemented as sets of instructions resident in the main memory of one or more computer systems. Until required by the computer system, the set of instructions may be stored in another computer readable memory such as a hard disk drive or in a removable memory such as an optical disk for utilization in a DVD-ROM or CD-ROM drive, a magnetic media for utilization in a magnetic media drive, a magneto-optical disk for utilization in a magneto-optical drive, or a memory card for utilization in a memory card slot. Further, the set of instructions can be stored in the memory of another computer and transmitted over a local area network or a wide area network, such as the Internet, when desired by the user. Additionally, the instructions may be transmitted over a network in the form of an applet that is interpreted after transmission to the computer system rather than prior to transmission. One skilled in the art would appreciate that the physical storage of the sets of instructions or applets physically changes the medium upon which it is stored, e.g., electrically, magnetically, chemically, physically, or optically, so that the medium carries computer readable information.
The invention has been described with reference to the preferred embodiment. Modifications and alterations will occur to others upon a reading and understanding of the preceding detailed description. It is intended that the invention be construed as including all such modifications and alterations insofar as they come within the scope of the appended claims or the equivalents thereof.
This application claims the priority benefit of U.S. provisional application No. 62/626,811 filed Feb. 6, 2018. The aforementioned application is incorporated herein by reference in its entirety.
Number | Date | Country | |
---|---|---|---|
62626811 | Feb 2018 | US |