The present invention generally relates to computer-implemented systems for searching within a database. More specifically, the present invention relates to searching and scoring exact and non-exact matches of data from a plurality of databases to validate data integrity.
It is widely understood that the quality of data in enterprises is highly variable. Even companies that spend millions of dollars attempting to keep their data clean, accurate, and up-to-date often fail badly.
The current approach with respect to certain types of data, for example, customer data, is for companies to validate their internal data by purchasing data from external data source vendors. These vendors provide data that they claim to have verified. Typical methods of verification are labor intensive—for example, telephone numbers verified by placing calls to the number, email addresses verified by click-back responses, and so forth. Enterprises pay these vendors millions of dollars a year in license fees in order to be able to compare their own data to such an external source and, based on that comparison, attempt to determine the validity or falsity of their own data. This method is not only extremely expensive, but it is also a fairly limited check since there are relatively few external data vendors and the vendors' methodology of requiring verified sources means that the data is seldom up-to-date. Yet at present, a single vendor of such services reportedly earns over a billion dollars in revenues from supplying fact-checked customer data to companies.
Embodiments are described relating to novel systems and methods for validating data. The embodiments create a “consensus value” for various items of data based on information shared by different entities, whose separate data can be used for this purpose whilst maintaining its confidentiality from other entities, who may be business competitors and/or who for various reasons should preferably not be given access to the data. Use of consensus value validation provides significant advantages over today's methodology of reliance on outside data vendors to provide purportedly fact-checked clean data.
An exemplary embodiment of the invention is hereafter described with reference to the drawings, but such description and drawings do not limit the scope of the invention. For example, the exemplary embodiment describes a system for validating customer data. Other embodiments may validate social media linkages, product part numbers and descriptions, geographic place names, workplace titles, alternate names for companies, patients' medical data, or any other type of data which may be stored in a database and for which validation is desired. Additionally, the exemplary embodiment is described in terms of a relational database, but the invention may be used in connection with a non-relational database as well.
The exemplary embodiment utilizes a community of users referred to hereafter as the “Enterprise Community,” depicted in schematic form in
The exemplary embodiment further utilizes a system 200 comprising a central database 202 and an Applications Programming Interface (API) 201. Preferably, the contents of the central database 202 are stored in a location separate from the computers of Enterprise Members 101, 102, 103 and kept highly secure even from members of the Enterprise Community 100, so that there need be no concern about competitor access to private customer lists. In the exemplary embodiment, the central database 202 may be maintained on one or more computers. Other embodiments (not shown but more fully described below) may not require a central database 202.
In inventive embodiments not comprising a central database 202, the system 200 may comprise merely an API 201. The API 201 would comprise an aggregation process to create a list of dependencies and counts. The counts and encoded values would be provided to each Enterprise Member 101, 102, 103, which could then perform its own matching against the aggregate dependencies and counts. The encoded values provided may be non-invertible, giving the Enterprise Members 101, 102, 103 access to the encoded values but no information beyond their own records.
In certain embodiments, which may or may not comprise a central database 202, the system 200 may encode the original data entered by Enterprise Members 101, 102, 103 and discard the original data entries. This additional step provides increased security even beyond the concern about competitor access since the owner of the central database 202 would not have access to the original data.
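Purely as a minimal sketch of how such non-invertible encoding and aggregate counting might be implemented (the hashing scheme, the salt handling, and the field names below are illustrative assumptions, not part of the specification), each data element could be replaced by a one-way cryptographic hash before the dependency counts are built, so that neither the system owner nor other Enterprise Members can recover another member's raw values:

import hashlib
from collections import Counter

SALT = b"community-wide-salt"  # illustrative; a real deployment would manage this secret carefully

def encode(value):
    # Replace a raw data element with a non-invertible (one-way) hash.
    return hashlib.sha256(SALT + value.strip().lower().encode("utf-8")).hexdigest()

def aggregate_dependency_counts(records, dependency=("surname", "email")):
    # Count how often each encoded (surname, email) pair has been submitted.
    # Only hashes leave this function, so the resulting counts can be shared
    # with every Enterprise Member without exposing anyone else's raw data.
    counts = Counter()
    for rec in records:
        counts[tuple(encode(rec[field]) for field in dependency)] += 1
    return counts

An Enterprise Member receiving such counts can hash its own records locally and look them up; a count is informative only for values the member already holds, because the hash cannot be inverted.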
As described in more detail hereafter and shown in overview form in
Pertinent items of data are associated with a consensus value that reflects the presence or absence of such matches and the frequency of matches and/or dissonances. The raw data comprising the consensus value in this embodiment is determined by a counting function that is incremented when validated matches are present and may be decremented when dissonant data is detected. The Enterprise Member 101 ultimately receives a Validation Report 302 that identifies whether Customer Record 301 has been seen before, in whole or in part, and the consensus value(s) assigned to Customer Record 301 as a whole and/or as to pertinent data elements. After processing and evaluation have been completed, those data elements comprising Customer Record 301 that are new to the system, optionally together with coding added to those elements by the API 201, are added to the central database 202. In an iterative process, as more and more data is submitted, individual items of customer data as well as entire records relating to particular customers will, through use of the API 201, develop consensus values. The higher the consensus value, the more assured an enterprise may be of the data's trustworthiness.
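A minimal sketch of such a counting function is given below; the class and key names are assumptions offered only for illustration, and the increment and decrement amounts simply follow the default behavior described above:

from collections import defaultdict

class ConsensusCounter:
    # Tracks a simple count per data element; a higher count implies broader agreement.
    def __init__(self):
        self.counts = defaultdict(int)

    def record_match(self, element_key):
        # A validated match was found: increment the element's count.
        self.counts[element_key] += 1

    def record_dissonance(self, element_key, weight=1):
        # Dissonant data was detected: optionally decrement the prior element's count.
        self.counts[element_key] -= weight

    def consensus_value(self, element_key):
        return self.counts[element_key]

counter = ConsensusCounter()
counter.record_match(("email", "jsmith@example.com"))  # first submission seen
counter.record_match(("email", "jsmith@example.com"))  # a second member agrees; the value carries more weight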
The example embodiment shown in
At Step 401, the API receives a call from Enterprise Member 101, transmitting the Customer Record 301. At Step 402, which is optional, the API determines whether to accept the Customer Record 301 for processing, by screening and identifying the submission. First, the API verifies the identity of the Enterprise Member that is submitting the record. To avoid the potential for a hacker to corrupt the database, records preferably are accepted only from Enterprise Members whose identity can be verified. In addition, preferably as part of this screening and identification procedure, the API may verify whether the Enterprise Member has assigned to the record a Customer ID. If the record is not from a verified source or does not contain a required data element such as a Customer ID, then at Step 402B, the API stops processing the record and optionally may notify the submitter that processing has been stopped and why.
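For illustration only, the screening and identification of Step 402 might be reduced to a check along the following lines; the function and field names are assumptions, and a production system would use proper authentication rather than a simple membership test:

def screen_submission(record, member_id, verified_members):
    # Step 402 sketch: accept a record only from a verified Enterprise Member and
    # only if it carries that member's Customer ID.  Returns (accepted, reason).
    if member_id not in verified_members:
        return False, "submitter identity could not be verified"
    if not record.get("cust_id"):
        return False, "required Customer ID is missing"
    return True, "accepted for processing"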
At Step 403, after verifying and accepting the data, the API standardizes the data from Customer Record 301. This standardization includes identifying data corresponding to various pre-designated fields utilized by the central database and placing the data into the corresponding fields in the format designated for that field. For example, the API may identify a person's name as a name and an email address as an email address, and place the name into one or more “name” fields (for example, “surname” and “first name”) and the email address in an “email address” field.
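A simple sketch of the standardization in Step 403 might look as follows; the target field names, the naive name-splitting rule, and the email pattern are illustrative assumptions rather than the method actually required by the system:

import re

def standardize(raw):
    # Step 403 sketch: map raw submitted values into pre-designated fields.
    std = {}
    full_name = raw.get("name", "").strip()
    if full_name:
        parts = full_name.split()
        std["first_name"] = parts[0]
        std["surname"] = parts[-1] if len(parts) > 1 else ""
    email = raw.get("email", "").strip().lower()
    if re.match(r"^[^@\s]+@[^@\s]+\.[^@\s]+$", email):
        std["email"] = email
    if raw.get("postalcode"):
        std["zip_code"] = re.sub(r"[^0-9-]", "", raw["postalcode"])
    return std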
At Step 404, the API encodes functional dependencies within the data. That is, certain attributes associated with customers are presumed to be uniquely associated with one or more other attributes, and codes identifying these relationships are associated with the appropriate data elements. A functional dependency can be described as follows: one attribute (or set of attributes) functionally determines another when each value of the first is associated with exactly one value of the second.
In the context of customer data being manipulated by the inventive systems and methods, a functional dependency might be defined between a person's name and that person's email address as described above, whilst a functional dependency might not be defined between a person's email address and that person's residence address. The invariants that link functionally dependent data typically will be chosen to reflect the likelihood that the two (or more) data fields are, in fact, interdependent in some way and that through a series of such linkages, one can determine whether or not the data contained in the fields is associated with a particular person. Often, an email address is used by a single individual having a particular surname. Thus, defining a functional dependency between the email address and surname is likely to be useful. On the other hand, many people have the same zip code, and many people have the same first name; defining a functional dependency between zip code and first name fields may be less useful. When those two fields are further considered together with telephone number, the number of persons who would have the same zip code, same first name, and same telephone number is substantially reduced and thus a functional dependency might be created among those three fields. Of course, any type of data may be manipulated in similar ways to identify functional dependencies.
Typically, the functional dependencies between various fields are predefined and, once the data has been sorted into standardized fields, the API may add appropriate coding to each data element that indicates the predefined functional dependencies. Adding coding in this manner speeds sorting of the data although it would be possible, for sufficiently small datasets or if processing power or time were not critical limitations, to sort the data using matrices or tables by maintaining appropriate field relationships during subsequent manipulation of the data. It is worth noting that the Customer ID assigned by a particular member will uniquely identify—for that member—one particular customer; and from the perspective of that member, there is a functional dependency between their Customer ID and the data elements associated with that customer. However, the Customer ID of one Enterprise Member will not necessarily, or even usually, be the same as that of another member.
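The following sketch shows one way the predefined dependencies of Step 404 could be represented and attached to a standardized record; the dependency codes and field groupings simply mirror the surname/email and zip-code/first-name/telephone examples above and are otherwise assumptions:

# Predefined functional dependencies: each tuple of fields is presumed, taken
# together, to identify a single customer.
FUNCTIONAL_DEPENDENCIES = {
    "FD1": ("surname", "email"),
    "FD2": ("zip_code", "first_name", "phone"),
}

def tag_dependencies(record):
    # Step 404 sketch: attach to the record the code of every predefined
    # dependency for which the record supplies all of the member fields.
    record["fd_codes"] = [
        code for code, fields in FUNCTIONAL_DEPENDENCIES.items()
        if all(record.get(field) for field in fields)
    ]
    return record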
At Step 405A, the API evaluates whether there are any matches within the database for customer data having the various functional dependencies assigned to the new data. For example, if customer surname and customer email address have a functional dependency, the database will be queried for data sharing the same customer surname and email address. The number and type of functional dependencies that are encoded at Step 404 and/or evaluated at Step 405A may vary according to the practitioner of the inventive method. Whether a “match” exists will be determined using standard database techniques well known to persons of ordinary skill in the field.
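As an illustration of the matching in Step 405A (the table and column names are assumptions, and any standard database technique could be substituted), the query for the surname/email dependency might be expressed as:

import sqlite3

def find_matches(conn, record):
    # Step 405A sketch: look up stored records sharing the values of a
    # functionally dependent field pair, here surname + email.
    cur = conn.execute(
        "SELECT system_id FROM customer_records WHERE surname = ? AND email = ?",
        (record["surname"], record["email"]),
    )
    return [row[0] for row in cur.fetchall()]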
If the evaluation using the standard functional dependencies produces no matches, then optionally at Step 405B, the Enterprise Member may be given the opportunity to manually add functional dependencies to the submitted data. For example, even though many people would not find it effective to associate zip code and first name, the Enterprise Member may know that within its unique customer list, such associations are likely to generate useful matches and may, therefore, create a functional dependency between the customer first name and customer zip code fields. Of course, it is not necessary to stop processing the data whilst the Enterprise Member is asked whether to create such additional functional dependencies. The Enterprise Member may optionally pre-designate additional functional dependencies to evaluate either as a matter of standard procedure for all of its data, or for only those subsets of data that do not otherwise produce matches. The data may then be cycled through Step 405A again, i.e. compared with the database again to see if the new dependencies allowed any matches to be discovered. The data also may be cycled through Step 408 (described below).
After all desired functional dependencies have been added and all searches for matches have been performed, if the API cannot find match(es) sufficient to associate the submitted customer record with a previously-submitted customer record, then at Step 405C the data comprising the submitted customer record is sent to the central database 202 as a new entry, and stored in the database. As will be appreciated by those skilled in the art, at (or before) the time the customer record is stored in the central database, the customer record is assigned an identifier that is unique on a system-wide basis (rather than unique only to the Enterprise Member that submitted the record).
If, however, the API finds a match at Step 405A between the newly submitted record and a customer record that previously has been submitted, it associates the newly submitted record with the prior record and then at Step 406A evaluates whether the content of any of the data fields in the newly submitted customer record differs from the data associated with the same field in the prior customer record. If the newly submitted customer record matches in all respects the previously submitted customer record (i.e., the same Enterprise Member has submitted, under its own Customer ID, a customer record that matches in all respects a set of data previously submitted by that Enterprise Member under the same Customer ID), then the API may either update the date on the record to reflect the currency of the information or take no action affecting the consensus value assigned to the data elements contained in the customer record. The API then moves forward to Step 407. The customer record is treated in this manner because there is nothing to indicate that the Enterprise Member has in any way evaluated or otherwise enhanced the reliability of any of the data comprising the customer record since the last time it was submitted by that Enterprise Member.
If, however, comparison of the records shows that a newly submitted data element associated with a particular field differs from the previously submitted data element associated with the same field or if there was no prior data in that field (for example, if a new telephone number has been submitted), or if identical customer data is submitted by a different Enterprise Member (typically indicated by the different Customer ID assigned to the data), then the API does take action affecting the consensus values.
If the newly submitted data element represents a change in data previously submitted for the same customer (i.e., a record with the same Customer ID) by the same Enterprise Member, then at Step 406B, the API may decrement the count for the pre-existing data element. Although it would be feasible to decrement the count when discrepant data is submitted, it is preferable not to do so. When a prior submitter has changed data, it can be presumed that the change was made knowingly and for good cause. Where discrepant data from multiple sources has been input, there is less reason to believe that any one of the sources is more reliable than any other, and so there is no reason to downgrade the pre-existing data. Preferably, the API will use a simple counter function and, when decrementing, will decrement the value for the older data by 1 in light of the disagreement between the two data elements and the determination by the originator of the data that the first value no longer is correct. Optionally, an Enterprise Member may manually adjust the value of the decrement to reflect its confidence in the change it has made. Assuming that all Enterprise Members have relatively similar standards in assessing the reliability of their own data, such optional adjustments could assist in more rapidly reaching reliable consensus as to particular data elements. Alternatively, the API may evaluate the reliability of an Enterprise Member according to a variety of factors, such as how often other Members modify their data to match that Member, how often the Member has a high consensus value for its own data, the number of edits that a Member makes to its own data, and determinations by the system owner that a Member is especially reliable (e.g., a well-known data vendor may have more reliable information than a small unknown company).
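By way of illustration only, the default decrement behavior of Step 406B, together with the optional member-supplied confidence adjustment, might be expressed as follows (all names are assumptions):

def apply_member_revision(counts, field, old_value, new_value, decrement=1):
    # Step 406B sketch: the same Enterprise Member has re-submitted the same
    # Customer ID with a changed value.  Treat the change as a knowing correction:
    # decrement the count on the value being replaced and begin (or continue)
    # counting the new value.  `decrement` is the optional manual adjustment a
    # member may supply to reflect its confidence in the change.
    old_key = (field, old_value)
    new_key = (field, new_value)
    counts[old_key] = counts.get(old_key, 0) - decrement
    counts[new_key] = counts.get(new_key, 0) + 1
    return counts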
The API also will assign, at Step 406C, a consensus value to each newly submitted data element. Preferably in the example of
If the customer data entries have not changed or if one or more entries have changed and the consensus values already have been modified at Steps 406A through 406C, then at Step 407 the API records the counts of the functional dependencies. The final consensus values may take into account one or more of the number of times the same pairs appear regardless of the record with which they are associated; the number of times the same pairs appear in association with the system-wide identifier for this customer; the number of times the same pairs are associated with unique Customer IDs, thus indicating how many different Enterprise Members have the same information; and other statistical measures as appropriate.
At Step 408, the API then takes any additional steps necessary to validate the data. The API calculates a consensus value for the record as a whole, taking into consideration, as appropriate, one or more statistical measures, manual adjustments made by Enterprise Members, and/or the reliability of the Enterprise Members submitting data for the record. The API also calculates a consensus value for the data in each of the fields of the record, and/or the data in selected fields or functionally dependent fields considered to be of particular importance. The consensus value(s) in each case may be raw data in the form of counts and/or a calculated consensus value that rates the validity of the record and/or its constituent data elements compared to the values in the community database.
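One possible way of combining the counts of Step 407 into the record-level consensus value of Step 408 is sketched below; the particular weighting and normalization are assumptions, since the specification leaves the exact statistical measures to the practitioner:

def record_consensus_value(field_counts, weights=None):
    # Step 408 sketch: combine per-field counts into a single consensus value for
    # the record as a whole.  `weights` allows fields (or functionally dependent
    # field groups) considered especially important to count more heavily.
    weights = weights or {field: 1.0 for field in field_counts}
    weighted = sum(count * weights.get(field, 1.0) for field, count in field_counts.items())
    total_weight = sum(weights.get(field, 1.0) for field in field_counts)
    return weighted / total_weight if total_weight else 0.0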
Finally, at Step 409, the API provides a report regarding the customer record to the Enterprise Member that submitted the record. Within the return information, the API may provide various consensus values for the customer record, including a consensus value for the record as a whole as well as consensus values for some or all fields of data, and/or for functionally dependent fields of data.
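The specification does not prescribe the exact shape of the report; purely for illustration (every field name below is an assumption), the returned information might be assembled along the following lines:

import json

def build_validation_report(cust_id, seen_before, record_consensus, field_consensus):
    # Step 409 sketch: package the record-level and per-field consensus values
    # into a JSON report for the submitting Enterprise Member.
    report = {
        "cust_id": cust_id,
        "seen_before": seen_before,
        "record_consensus_value": record_consensus,
        "fields": field_consensus,  # e.g. {"email": 3, "phone": 1}
    }
    return json.dumps(report, indent=2)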
Optionally, the API may also provide a member-specific newsfeed, which would aggregate information regarding transactions that produce changes of consensus values and inform the member that new data has been submitted relating to that member's customers. As members review the changes and, in turn, update (or choose not to update) their own records in response to these newsfeeds, the community-derived consensus values will quickly be propagated throughout the entire community and further updated where appropriate.
In this example scenario, Enterprise Member 101 submits the following record to the central database:
The next day, Enterprise Member 102 submits the following record to the central database:
The following day, Enterprise Member 103 submits the following record:
On the fourth day, Enterprise Member 102 submits the following record:
At most, for John Smith, matching data can be found in only four data fields (FirstName, Surname, Street, and ZipCode). If the API has been programmed to increment counts only when five fields match, or when a combination of FirstName, Surname, and at least 3 of the remaining fields match, then the API would not correlate the first record for “John Smith” with the second record for “John Smith,” and none of the entries for “John Smith” would obtain counts greater than one at this point.
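The second of the two threshold rules mentioned above could be sketched as follows; the field names follow the scenario, and the function itself is an assumption offered only for illustration:

REQUIRED_FIELDS = ("FirstName", "Surname")

def records_correlate(rec_a, rec_b, min_other_matches=3):
    # Correlate two submitted records only if FirstName and Surname match and at
    # least three of the remaining fields also match.
    if any(rec_a.get(field) != rec_b.get(field) for field in REQUIRED_FIELDS):
        return False
    other_fields = (set(rec_a) | set(rec_b)) - set(REQUIRED_FIELDS)
    matches = sum(1 for field in other_fields
                  if rec_a.get(field) and rec_a.get(field) == rec_b.get(field))
    return matches >= min_other_matches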
Incrementing the count for data elements associated with customer John Smith in this scenario will require obtaining data from at least a third customer record pertaining to John Smith. If, for example, Enterprise Member 103 thereafter submits a customer record matching that submitted by Enterprise Member 101, then the data submitted by Enterprise Member 101 and Enterprise Member 103 for Company, Phone, and Email would each have a consensus value of two; and the data entries submitted by each of the Enterprise Members for FirstName, Surname, StreetAddress, and ZipCode would each have a consensus value of three.
The API may assign enhanced significance to modifications made by an enterprise member to data that the same enterprise member previously submitted. Rather than simply treating the new data as an addition to the knowledge base, the alteration can be treated as reflecting negatively on the originally submitted data. For example, after the records described above in this Scenario had been submitted, Enterprise Member 102 might submit a revised record for its customer Susie Jones. (The fact that this is the same customer Susie Jones, and not a separate Susie Jones, would be determined by matching the CustomerID assigned by Enterprise Member 102 to this record, with the CustomerID assigned by Enterprise Member 102 to the previously submitted “Susie Jones” record.) The revised customer record might contain data matching in all respects (except Customer ID) the customer record for Susie Jones previously submitted by Enterprise Member 103. Notably, this would mean that the company name, telephone number, and email address for this customer now have been changed by Enterprise Member 102. Treating this as a vote of no confidence in Enterprise Member 102's earlier-submitted data for those fields, the counts for Enterprise Member 102's previous data entries for Company, Phone, and Email would each be decremented by one, leaving those data entries with a count of zero. The counts, and thus consensus values, assigned to the data in the remaining fields would be unchanged.
It should be noted that one-letter domains are not, under current rules, valid and thus none of the email addresses given above would actually be valid email addresses. However, the comparison would not detect this invalidity if in fact the various enterprises did submit records containing the invalid addresses as given in the foregoing example. Separate steps using external data sources to validate email addresses (or data in other fields) can be employed, if desired, to enhance the validity of the data. Such steps are well known to those of ordinary skill in the art.
A consensus value is derived from a rules-driven engine that may be adjusted with respect to the weighting factors. Multiple consensus value algorithms may be supported, and each Enterprise Member may select the consensus scheme that makes the most sense for its business. For example, Enterprise Member 101 might adopt the weighting scheme described above wherein alterations in data submitted by a member decrement the consensus value for that data, while Enterprise Member 102 might not decrement the original data under those circumstances but might instead add an additional increment to the consensus value of the substituted data.
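The two alternative schemes just described could, for example, be registered side by side and selected per member; the function names and the fixed increment amounts below are assumptions used only to make the alternatives concrete:

def scheme_decrement_on_change(counts, old_key, new_key):
    # Enterprise Member 101's choice: an alteration discredits the replaced value.
    counts[old_key] = counts.get(old_key, 0) - 1
    counts[new_key] = counts.get(new_key, 0) + 1
    return counts

def scheme_boost_replacement(counts, old_key, new_key):
    # Enterprise Member 102's choice: leave the old value untouched but give the
    # substituted value an extra increment.
    counts[new_key] = counts.get(new_key, 0) + 2
    return counts

CONSENSUS_SCHEMES = {
    "decrement_on_change": scheme_decrement_on_change,
    "boost_replacement": scheme_boost_replacement,
}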
The following example is essentially a subset of the prior Scenario, described with reference to
System Record 1111 comprises data elements in each of fields 501, 601, 651, 701, 751, 801, and 901. Those fields identify, respectively, the customer's surname, first name, street address, zip code, telephone number, and email address. System Record 2222 comprises data elements in each of fields 502, 602, 652, 702, 752, 802, and 902 which likewise identify, respectively, the customer's surname, first name, street address, zip code, telephone number, and email address.
Customer Record 301 comprises data elements for a third record, which has been processed by the API and determined to match System Record 2222. Customer Record 301 may emanate from Enterprise Member 103 and be identified with Customer ID EN3-503 or may, alternatively, emanate from Enterprise Member 102 and be identified with Customer ID EN2-502 (the same customer identification number assigned to a previously submitted record from Enterprise Member 102).
The customer data comprising System Record 1111 in
Optionally, the consensus value for the duplicated data would vary depending on whether the newly matching data emanated from Enterprise Member 103, identified with Customer ID EN3-503 or, alternatively, emanated from Enterprise Member 102, identified with Customer ID EN2-502 (the same customer identification number assigned to a previously submitted record from Enterprise Member 102). If the data emanated from Enterprise Member 103, then it would be treated as described above. However, if the data emanated from Enterprise Member 102, then the information previously submitted for the telephone number and email address might be considered discredited and those prior-submitted data elements might have their consensus values reduced by some amount, for instance by a value of 1. In the
The following example provides an illustration of user documentation that could be provided to explain to an Enterprise Member how to submit records to an embodiment of the inventive system, and how to interpret the report that would be returned. The example assumes that the API utilizes a representational state transfer (REST) software architecture; that it can be queried using hypertext transfer protocol (HTTP) commands, and in particular HTTP GET requests; and that responses are provided in a language-independent data interchange format known as JavaScript Object Notation (JSON).
These endpoints are supported by the API:
valid_co - Validate Company
map_co_alias - Map Company Alias
valid_contact - Validate Contact
valid_email - Validate Email
map_title - Map Title
valid_address - Validate Address
valid_linkage - Validate Linkage
apikey The personal API key that is issued to each member of the community. The apikey is used to track all of your interactions. You can see your apikey on this page when you are logged in.
cmd The command for this call. You'll see cmd values in each of the various commands available for the API.
cust_id You must specify a unique key for each unique organization or person that you validate. This might be a primary key from the database that holds your data, an identifier that you use, or simply a generated value. We use this key to track changes in your data and to notify you when the consensus values for your data have changed.
payload This isn't really a parameter. When you validate a record in our system you may also submit extra fields that you want us to track. For example, if you were validating records from several disparate systems in your organization you might want to add an identifier named “system” that indicates the source of the record. To add a payload, just add a parameter that isn't one of our required parameters and we'll store it with your record.
callback Specify the name of a function and your returned JSON will come wrapped in a call to that function (JSONP-style).
Company API Parameters
These are organizations, whether commercial enterprises, non-profits, or government entities.
https://api.consensics.com/?apikey=YOUR_APIKEY
&cmd=valid_co
&userid=example_userid
&name=example_name
&url=example_url
&address1=example_address1
&address2=example_address2
&city=example_city
&state=example_state
&postalcode=example_postalcode
&country=example_country
&phone=example_phone
&fax=example_fax
&sic=example_sic
&naics=example_naics
&revenue=example_revenue
&employees=example_employees
These are alias names for companies. This allows the comparison between names such as “International Business Machines” and “IBM”. We do not encode these values—we keep them in the database and aliases with extremely high consensus values are used in the normalization process.
https://api.consensics.com/?apikey=YOUR_APIKEY
&cmd=map_co_alias
&userid=example_userid
&name=example_name
&alternate=example_alternate
Contacts API Parameters
These are individuals that are either standalone or are part of an organization.
https://api.consensics.com/?apikey=YOUR_APIKEY
&cmd=valid_contact
&userid=example_userid
&name=example_name
&company=example_company
&title=example_title
&email=example_email
&address1=example_address1
&address2=example_address2
&city=example_city
&state=example_state
&postalcode=example_postalcode
&country=example_country
&phone=example_phone
&fax=example_fax
&mobile=example_mobile
Email API Parameters
These are email addresses. They're different from a contact in that all we know is a name and an email.
https://api.consensics.com/?apikey=YOUR_APIKEY
&cmd=valid_email
&userid=example_userid
&name=example_name
&email=example_email
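Continuing the documentation example, a member could issue the Validate Email call from a script roughly as shown below; only the endpoint URL and request parameters come from the documentation above, while the response field read at the end is an assumption, since the exact JSON layout is not specified here:

import requests

params = {
    "apikey": "YOUR_APIKEY",
    "cmd": "valid_email",
    "userid": "example_userid",
    "name": "John Smith",
    "email": "jsmith@example.com",
}

resp = requests.get("https://api.consensics.com/", params=params, timeout=10)
resp.raise_for_status()
result = resp.json()  # responses are returned as JSON, per the REST convention above
print(result.get("consensus_value"))  # field name is an assumption for illustration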
Titles API Parameters
These are title mappings. This allows comparisons such as “Snr Engineer” and “Senior Engineer”. We do not encode these values. Title mappings with extremely high consensus values are used in the normalizations.
https://api.consensics.com/?apikey=YOUR_APIKEY
&cmd=map_title
&userid=example_userid
&name=example_name
&alternate=example_alternate
Addresses API Parameters
These are plain addresses, not necessarily connected with a company or contact. We do not encode these values.
https://api.consensics.com/?apikey=YOUR_APIKEY
&cmd=valid_address
&userid=example_userid
&address1=example_address1
&address2=example_address2
&city=example_city
&state=example_state
&postalcode=example_postalcode
&country=example_country
Linkages API Parameters
These are linkages between identities indicating that two different identifiers are actually the same person. bobzilla1742@gmail.com could also be @BigDaddy2783 on Twitter.
https://api.consensics.com/?apikey=YOUR_APIKEY
&cmd=valid_linkage
&userid=example_userid
&fromsystem=example_fromsystem
&fromid=example_fromid
&tosystem=example_tosystem
&toid=example_toid
The foregoing details are exemplary only. Other modifications that might be contemplated by those of ordinary skill in the art are within the scope of this invention, and the invention is not limited by the examples illustrated herein.
Related application: No. 61613757, Mar 2012, US.