This invention generally relates to the comparison between two records, sets or batches of data and the efficient determination of any additions, modifications and/or deletions, and more particularly, to a system, method and software for comparing and analyzing a first set of data with a second set of data received by a computer while maintaining a persistent key.
Databases and data warehouses are computer-based data structures designed to allow the storing and querying of records which are typically received from one or more sources. The records generally correspond with entities, such as individuals, organizations and property. In certain cases, a database system is confronted with a situation wherein a new set of data may be substantially duplicative of a data set previously submitted to the system. Furthermore, the new data set may include a certain amount, even a small amount, of additions, modifications or deletions when compared with the previously submitted data set. Processing largely redundant sets of data misuses valuable system resources and presents significant scalability issues.
For example, a previously submitted data set may contain all the telephone residential listings of a particular geographic area. Thereafter, perhaps monthly or semiannually, the system may receive a new set of data that comprises a more recent set of either all or part of the telephone residential listings of the particular geographic area. Processing the new highly duplicative data set, at a minimum, will not identify records deleted from the more recent set and will require the intended recipient(s) to process substantially more data than necessary.
It is contemplated by the present invention that identifying or assigning a persistent key corresponding to each record could be used to facilitate efficient processing and identification of each record by the intended recipient(s) of the data set. For example, telephone residential listings do not contain a persistent key for each record. Therefore, any comparison in current systems is based upon the entire record or some combination of data in the record, such as last name, first name, telephone number and/or address. Occasionally, one record or many records in a data set may be different from a previously submitted data set, such as when a postal office splits a zip code. In such a case, a persistent key facilitates more efficient processing by the intended recipient(s) by enabling the intended recipient(s) to update the affected record(s) based upon the persistent key, thus minimizing the processing required to initially identify the affected record(s).
Unfortunately, current systems do not have an efficient way to compare two data sets and determine the additions, deletions or modifications between the two data sets while maintaining a persistent key. This includes, without limitation, an efficient way for generating a log representing a subset of such additions, deletions or modifications for further review, analysis and/or reporting with the respective persistent key.
The present invention is provided to address these and other issues.
It is an object of the invention to provide a method, program and system for processing data to compare two sets of data. The invention is implemented through a computer which may be connected to one or more additional computers in a network.
In one embodiment, the method, program and system comprises the step of receiving a first data set and a second data set, each data set comprising at least one record, where each record reflects at least one of a plurality of entities. In this regard, more than one record may reflect the same entity (e.g., an entity representing a specific person). The method, program and system further comprises the step of utilizing an algorithm to: (a) compare the second data set to at least a portion of the first data set; (b) identify or assign a persistent key (and perhaps the same persistent key for records reflecting the same entity) for each record in the second data set; and (c) create a database or file (i.e., a journal) to include any records: (i) in the second data set that: (1) do not exist in the first data set (perhaps with an “add” directive representing additions and the identified or assigned persistent key), (2) include at least one change to at least one record in the first data set (perhaps with a “modify” directive representing modifications and the identified or assigned persistent key), and/or (3) perhaps do not include at least one change to at least one record in the first data set, but reflect the identical one of a plurality of entities as a record in the first data set with a date (perhaps with a “no change” directive representing that the same record in the first data set was submitted in the second data set); and/or (ii) in the first data set that do not exist in the second data set, perhaps: (1) with a “delete” directive representing deletions and the identified persistent key and (2) only if the second data set is not an incremental data set (e.g., one containing only last month's changes, rather than a complete data set) of the at least a portion of the first data set.
The data contained in the first and second data sets preferably represents a plurality of entities. However, in some cases, each data set may include one or more records pertaining to a single (i.e., the same) entity. The entities can be individuals, property, organizations, or other identifying items that can be represented by identifying data.
The step of utilizing an algorithm may include, prior to comparing the second data set to the at least a portion of the first data set, at least one of: (a) creating a new portion of the second data set, (b) modifying at least a portion of the second data set and/or (c) organizing the second data set for efficient comparison, including, without limitation: (i) sorting the second data set, (ii) utilizing a database structure (e.g., a database with an index) and/or (iii) utilizing a memory array. It is yet further contemplated that the step of modifying at least a portion of the second data set may include removing or replacing a portion of the second data set meeting a user-defined criterion, such as removing or replacing characters contained within a record which are identified to be inappropriate.
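By way of illustration only, the preparatory steps described above might be sketched as follows. The record layout (a dictionary of text fields), the field names, and the particular user-defined criterion (dropping non-printable characters and collapsing runs of whitespace) are assumptions made for this sketch and are not prescribed by the specification.

```python
def clean_record(record):
    """Remove characters deemed inappropriate (assumed here: non-printable ones)."""
    cleaned = {}
    for field, value in record.items():
        # Keep printable characters and whitespace, drop everything else.
        kept = "".join(ch for ch in str(value) if ch.isprintable() or ch.isspace())
        # Collapse runs of whitespace and trim both ends.
        cleaned[field] = " ".join(kept.split())
    return cleaned


def organize(records, match_fields):
    """Clean, sort, and index a data set so later comparison is efficient.

    match_fields is a hypothetical tuple of field names used for ordering
    and lookup; the specification leaves the choice open.
    """
    cleaned = [clean_record(r) for r in records]

    def sort_key(r):
        return tuple(r.get(f, "").lower() for f in match_fields)

    ordered = sorted(cleaned, key=sort_key)   # (i) sorting
    index = {}                                # (ii)/(iii) in-memory index
    for r in ordered:
        index.setdefault(sort_key(r), []).append(r)
    return ordered, index
```

Sorting supports a sequential merge-style comparison, while the in-memory index supports direct lookup of a single record; either one alone may suffice in a given embodiment.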
It is yet further contemplated that the step of utilizing an algorithm further comprises: (a) organizing the at least a portion of the first data set for efficient comparison prior to comparing the second data set to the at least a portion of the first data set, including, without limitation: (i) sorting the first data set, (ii) utilizing a database structure (e.g., a database with an index) and/or (iii) utilizing a memory array and/or (b) modifying the first data set to reflect the second data set (with the identified or assigned persistent keys). In this manner, the first data set may reflect the last known data set.
It is yet further contemplated that the step of utilizing an algorithm to compare the second data set and the at least a portion of the first data set includes determining whether at least one record in the: (a) second data set: (i) does not exist in the first data set or (ii) includes at least one change to at least one record in the at least a portion of the first data set that is determined to reflect an identical one of the plurality of entities or (b) at least a portion of the first data set does not exist in the second data set.
It is yet further contemplated that the step of utilizing an algorithm to identify or assign a persistent key includes assigning the same persistent key that was previously identified or assigned to a record in the at least a portion of the first data set to at least one record in the second data set when the at least one record in the second data set is determined to reflect an identical one of the plurality of entities (e.g., the same person).
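One possible way to identify or assign persistent keys is sketched below, strictly as an example. The matching rule (case-insensitive agreement on a few hypothetical fields), the use of integer keys, and the function names are assumptions of this sketch; the specification does not fix how two records are determined to reflect the same entity.

```python
import itertools


def entity_signature(record, match_fields=("last_name", "first_name", "phone")):
    """Hypothetical matching rule: records agree on these fields, ignoring case."""
    return tuple(str(record.get(f, "")).strip().lower() for f in match_fields)


def assign_persistent_keys(first_data_set, second_data_set):
    """Return a list of (persistent_key, record) pairs for the second data set.

    first_data_set maps an integer persistent key to a record (assumed layout).
    """
    # Keys already identified or assigned for entities in the first data set.
    known = {entity_signature(rec): key for key, rec in first_data_set.items()}
    # New keys continue after the highest key already in use.
    new_keys = itertools.count(max(first_data_set, default=0) + 1)

    keyed = []
    for rec in second_data_set:
        sig = entity_signature(rec)
        if sig not in known:
            # Unseen entity: assign a new key and remember it so that later
            # records reflecting the same entity share it.
            known[sig] = next(new_keys)
        keyed.append((known[sig], rec))
    return keyed
```

Note that a record with changed non-matching data (for example, a new zip code) still receives the key of the earlier record, because matching is on the entity rather than on the full record.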
In a second embodiment, the method, program and system comprises the steps of: (a) receiving a first data set having a first record, (b) assigning a persistent key to the first record, (c) receiving a second data set having a second record, (d) comparing the second record to the first record, and (e) recording an entry in a journal pertaining to the comparison of the second record to the first record. It is yet further contemplated that the second embodiment further comprises the step of: (a) assigning a persistent key to the second record identical to the persistent key assigned to the first record if the second record matches (e.g., reflects an identical entity, but does not necessarily contain identical data) the first record, and/or (b) assigning a persistent key different from the persistent key assigned to the first record if the second record does not match the first record.
It is yet further contemplated that the step of recording an entry in a journal includes recording: (a) a changed second record entry in the journal if the second record matches the first record and includes changes to information contained in the first record (perhaps with a “modify” directive), (b) the persistent key and date (reflecting that the first record is identical to the second record) if the second record matches the first record and does not include changes to information contained in the first record (perhaps with a “no change” directive), and/or (c) the second record in the journal with an “add” directive if the second record does not match the first record.
It is yet further contemplated that the second embodiment further comprise the step of recording the first record in the journal with a “delete” directive if the first record does not match the second record and the second data set is not an incremental data set of the first data set.
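The journal rules of the second embodiment can be sketched for a single pair of records as shown below. This is only an illustrative sketch: the entry layout (a dictionary with `directive`, `key`, and either `record` or `date`) and the convention of passing `None` for a record that has no match are assumptions, not part of the specification.

```python
def journal_entry(first_record, second_record, persistent_key, submitted,
                  incremental=False):
    """Return the journal entry for one comparison, or None if nothing is recorded.

    first_record is None when the second record matched nothing in the first
    data set; second_record is None when the first record matched nothing in
    the second data set (assumed calling convention).
    """
    if first_record is not None and second_record is not None:
        if second_record != first_record:
            # Same entity, different data: record the changed second record.
            return {"directive": "modify", "key": persistent_key,
                    "record": second_record}
        # Same entity, same data: record only the key and submission date.
        return {"directive": "no change", "key": persistent_key,
                "date": submitted}
    if second_record is not None:
        return {"directive": "add", "key": persistent_key,
                "record": second_record}
    if first_record is not None and not incremental:
        # A missing record means deletion only when the second data set is
        # a complete data set rather than an incremental one.
        return {"directive": "delete", "key": persistent_key,
                "record": first_record}
    return None
```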
It is yet further contemplated that: (a) the first data set includes a plurality of first records, (b) each of the first records represents one of a plurality of entities, (c) the second data set includes a plurality of second records and/or (d) the second embodiment further comprise the step of modifying: (i) the second data set prior to comparing the second record to the first record (e.g., creating new data and/or replacing existing data) and/or (ii) the first data set to reflect the second data set (with the assigned persistent key).
It is yet further contemplated that the second embodiment further comprise the steps of organizing: (a) the first data set for efficient comparison prior to comparing the second record to the first record and/or (b) the second data set for efficient comparison prior to comparing the second record to the first record.
These and other aspects and attributes of the present invention will be discussed with reference to the following drawings and accompanying specification.
While this invention is susceptible of embodiment in many different forms, there is shown in the drawing, and will be described herein in detail, specific embodiments thereof with the understanding that the present disclosure is to be considered as an exemplification of the principles of the invention and is not intended to limit the invention to the specific embodiments illustrated.
A data processing system 10 for processing data is illustrated in
The system is configured to receive data sets from the source 18. The data sets comprise one or more records representing one or more entities. The entities may be individuals, organizations, property, proteins, chemical or organic compounds, biometric or atomic structures, or other items that can be represented by identifying data.
The system 10 utilizes an algorithm 20 to process a first data set 22 and a second data set 24 from the source 18. The algorithm 20 is stored in the memory 16 and is processed or implemented by the processor 14.
The first data set may represent a last known data set wherein each record has a persistent key that was identified or assigned prior to any comparison with the second data set. A persistent key is a unique numeric or alphanumeric identifier that, at a minimum, may be used to distinguish one or more records representing a particular entity from other records representing a different entity.
As illustrated in
If a record in the first data set (“First Record”) matches (e.g., reflects an identical entity, but does not necessarily contain identical data) a record in the second data set (“Second Record”) 34, the algorithm 20 assigns the Second Record the same persistent key 36 as that assigned to the matched First Record and determines whether the Second Record incorporates changes that are not reflected in the First Record 38. If the Second Record does incorporate changes that are not reflected in the First Record (“Changed Second Record”), the Changed Second Record is recorded with a “modify” directive 40 in a separate file (such as a flat file or database, hereafter a “Journal”) used for identifying changes, no changes, additions and deletions, and the first data set is updated based upon the directive 42 by replacing the First Record with the Changed Second Record (with persistent key and perhaps date/time stamp). The algorithm 20 then determines whether there are additional uncompared records 44.
If the Second Record does not incorporate changes, but is the same as the First Record (“Identical Record”), the algorithm 20 may record in the Journal a “no change” directive (with the persistent key) and a date representing that the Identical Record was submitted in the second data set 46. The algorithm 20 then determines whether there are additional uncompared records 44.
If a Second Record does not match a First Record 48, the Second Record is assigned a new persistent key 50 and recorded in the Journal with an “add” directive (“Add Record”) 52. The first data set is then updated based upon the directive 42 by adding the Add Record with the persistent key to the first data set directly or indirectly (e.g., directly to the first data set, into a separate file or database which could be later merged with the first data set, and/or through the utilization of a memory array).
If a First Record does not match a Second Record and is missing in the second data set (an “Unmatched Record”) 54, the algorithm 20 will determine whether the second data set is merely an incremental data set of the first data set 56, generally through instructions emanating from the source identifying the second data set as incremental or not incremental. If the second data set is not an incremental data set: (a) the Unmatched Record is recorded in the Journal with a “delete” directive (with the persistent key) 58 and (b) the first data set is updated based upon the directive 42 by removing the Unmatched Record from the first data set or marking it for deletion. The algorithm 20 then determines whether there are any additional uncompared records 44.
If there are additional uncompared records, the algorithm 20 would then compare the next record in the second data set to the first data set 60 and the process would be repeated. If there are no additional uncompared records, the algorithm 20 would save the updated data set and the Journal 62.
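The overall flow described above, from matching through saving the updated data set and the Journal, might be sketched as follows. This is a simplified sketch under stated assumptions, not the algorithm 20 itself: it holds the first data set as a dictionary with one record per integer persistent key, matches on a caller-supplied tuple of hypothetical field names, and returns the results rather than saving them.

```python
import itertools
from datetime import date


def signature(record, match_fields):
    """Assumed matching rule: case-insensitive agreement on match_fields."""
    return tuple(str(record.get(f, "")).strip().lower() for f in match_fields)


def compare_data_sets(first, second, match_fields, incremental=False,
                      submitted=None):
    """Compare second (list of records) against first (key -> record).

    Returns (updated first data set, journal). The input is not modified.
    """
    submitted = submitted or date.today().isoformat()
    index = {signature(rec, match_fields): key for key, rec in first.items()}
    new_keys = itertools.count(max(first, default=0) + 1)
    updated = dict(first)
    journal = []
    seen = set()  # persistent keys of first records matched by a second record

    for rec in second:
        sig = signature(rec, match_fields)
        key = index.get(sig)
        if key is None:
            # No match: assign a new persistent key and journal an "add".
            key = next(new_keys)
            index[sig] = key
            updated[key] = rec
            journal.append({"directive": "add", "key": key, "record": rec})
        elif rec != updated[key]:
            # Match with changes: journal a "modify" and replace the record.
            updated[key] = rec
            journal.append({"directive": "modify", "key": key, "record": rec})
        else:
            # Match without changes: journal the key and submission date.
            journal.append({"directive": "no change", "key": key,
                            "date": submitted})
        seen.add(key)

    if not incremental:
        # Unmatched first records are deletions only for a complete data set.
        for key, rec in first.items():
            if key not in seen:
                journal.append({"directive": "delete", "key": key,
                                "record": rec})
                del updated[key]
    return updated, journal
```

The dictionary lookup stands in for the "memory array" option; a sorted merge or an indexed database table would serve the same purpose for data sets too large to hold in memory.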
Depending upon the end-user preference, the Journal could produce (perhaps for additional processing or analysis) a report, file and/or subset of data identifying: (a) all Changed Second Records that reflect records that modify certain records in the first data set, (b) all Identical Records that reflect records that were left unchanged, but have a more recent date corresponding with the second data set, (c) all Add Records that reflect records to be added to the first data set, and/or (d) all Unmatched Records that reflect records to be deleted from the first data set.
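A report or subset of this kind could be derived from the Journal along the following lines. The sketch assumes each journal entry is a dictionary carrying a `directive` field, which is an illustrative layout rather than one required by the specification.

```python
def journal_report(journal, directives=("modify", "no change", "add", "delete")):
    """Group journal entries by directive, keeping only the requested ones."""
    report = {}
    for entry in journal:
        if entry["directive"] in directives:
            report.setdefault(entry["directive"], []).append(entry)
    return report
```

For example, an end user interested only in records to be removed could call `journal_report(journal, directives=("delete",))`.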
From the foregoing, it will be observed that numerous variations and modifications may be effected without departing from the spirit and scope of the invention. It is to be understood that no limitation with respect to the specific apparatus illustrated herein is intended or should be inferred. It is, of course, intended to cover by the appended claims all such modifications as fall within the scope of the claims.
Number | Date | Country |
---|---|---|
20040162802 A1 | Aug 2004 | US |