The present invention relates to the field of matching data between two or more organizations in a private and secure manner using a distributed data system.
There is a plethora of personal information that is collected online and stored digitally. For one company to share data with another company, it must consider regulatory requirements associated with the sharing of a person's personal details, as well as ethical boundaries. The requirements may vary by industry; for example, banking and medical records would generally be held to a higher standard than musical or movie tastes. These personal details, often referred to as “Personal Information” (PI), “Sensitive Personal Information” (SPI), or “Personally Identifiable Information” (hereinafter “PII”), are fields or groups of fields found in one or more databases, spreadsheets, cloud providers, and data repositories of an organization, which may identify an individual. In each country, regulations may define those field details that could identify a person in question, and that are therefore subject to control. This PII is sensitive and valued by the individuals that are described by it, and by the organizations that collect and store it. Due to increasing awareness of privacy concerns, including identity theft, there are increasing regulations worldwide to prevent the communication of PII, yet the data holds a great amount of information that could provide useful insights for organizations, were they able to share it among themselves.
In the past, PII and other data has either been shared between entities without respect for the sensitivity of that PII, or been used only by the entity that collected the data, which presumably already had data security measures in place. However, there is a desire to combine the data from multiple entities to gain further insights and provide customers with better products and services, and to share data in a more ethical and private manner.
If data could be combined without contravening the regulations, without directly transmitting PII, the data could be used for other purposes by stripping the data of personally identifiable characteristics, such as name, email, and address.
In an effort to allow data sharing between organizations, third parties to the match process have come into play. These match systems require the organizations to share personal information with the independent party, which provides a match table to be used to share data. These independent matching organizations then have access to all of the personal data from many organizations, making them “honeypots” for unscrupulous actors.
Based on the foregoing, there is a need in the art for a system to remove the personally identifiable aspects of data, to permit the data to be shared between entities and across geographies to extrapolate insights from the data, and to decentralize the risk of collecting all PII records under a single organization's control. It would therefore be useful to have a data “shredder” that creates small data portions that cannot, on their own, identify a particular individual, to distribute those “shreds” to multiple parties, and to be able to match the shreds to determine if a person is the same between the original databases.
A distributed data system has a network, a first organization connected to the network having a first database having personal identifiable information, the personal identifiable information divisible into a plurality of components, and having a first token associated with the personal identifiable information, and a first personal information gateway in communication with the first database and the network, wherein the personal information gateway is configured to divide the personal identifiable information into a plurality of data shreds, each data shred corresponding to a component, as well as a plurality of matching nodes connected to the network, wherein the nodes are configured to match data, wherein each data shred is configured to be transmitted to a matching node receiving only that component, wherein the matching nodes are configured to determine whether different shreds match.
In one embodiment, there is a second organization having a second database of a second organization having personal identifiable information divisible into a plurality of components and having a second token associated with the personal identifiable information and a second personal information gateway in communication with the second database, wherein the second personal information gateway is configured to shred the personal identifiable information into a plurality of second data shreds, each data shred corresponding to a component, wherein the second data shred is transmitted to the matching node, and wherein the matching node matches a first data token to a second data token if the first and second data shreds match.
In an embodiment, the matching of the first and second tokens permits data that does not contain personal identifiable information to be exchanged between the first and second organization and matched with an individual. The system may also have a policy administration system in communication with the first personal information gateway to provide personal identifiable information rules.
The system may have a data exchange configured to transmit data between the first and second organizations, using the match of the first and second tokens, without transmitting personal identifiable information. In an embodiment, the data shreds are hashed before being transmitted to the matching nodes. The hashed data shreds may be compared by the matching nodes, and the hashed data shreds may be hashed a second time after being matched by the matching node.
In an embodiment, the matching node is configured to provide a matching confidence score based on a number of positive matches. The system may also have more than one matching node, wherein an overall matching confidence score is determined from the matching confidence score of each matching node. The personal information gateways may convert the personal identifiable information of the first organization to binary format. Each of the one or more nodes is configured to store a specific data field.
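The scoring described above can be illustrated with a minimal sketch. The text states only that a node's score is based on a number of positive matches and that an overall score is determined from the node scores; the specific formulas below (a fraction of positive comparisons per node, and a plain average across nodes) are assumptions chosen for illustration, as are the function names.

```python
def node_confidence(positive_matches: int, comparisons: int) -> float:
    """Assumed per-node score: fraction of compared shreds that matched."""
    if comparisons == 0:
        return 0.0
    return positive_matches / comparisons


def overall_confidence(node_scores: list[float]) -> float:
    """Assumed aggregation: unweighted mean of the per-node scores."""
    if not node_scores:
        return 0.0
    return sum(node_scores) / len(node_scores)


# Hypothetical example: three nodes, each comparing the shreds it holds.
scores = [node_confidence(3, 4), node_confidence(1, 1), node_confidence(1, 2)]
print(overall_confidence(scores))
```

A weighted average, for instance giving an email component more weight than a city component, would fit the same description equally well.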
The distributed data system may have a network, a first organization connected to the network having a first database having personal identifiable information, the personal identifiable information divisible into a plurality of components, and having a first token associated with the personal identifiable information, and a first personal information gateway in communication with the first database and the network, wherein the personal information gateway is configured to divide the personal identifiable information into a plurality of data shreds, each data shred corresponding to a component, a second organization connected to the network, having a second database of a second organization having personal identifiable information divisible into a plurality of components and having a second token associated with the personal identifiable information, a second personal information gateway in communication with the second database, wherein the second personal information gateway is configured to shred the personal identifiable information into a plurality of second data shreds, each data shred corresponding to a component, and a plurality of matching nodes connected to the network, wherein the nodes are configured to match data, wherein each data shred is configured to be transmitted to a matching node receiving only that component, wherein the data shreds are hashed before being transmitted to the matching node, wherein the matching nodes are configured to determine whether different shreds match, and wherein the second data shred is transmitted to the matching node, wherein the matching node matches a first data token to a second data token if the first and second data shreds match, and wherein if the first data token and second data token match, data that is not personal information may be exchanged between the first and second organizations through a data exchange.
A method of transmitting and comparing data is disclosed having the steps of sending data from a first database to a first personal information gateway, the personal information gateway shredding the data according to components, each component corresponding to a matching node, sending data from a second database to a second personal information gateway, the first personal information gateway generating a first token for the data received and sending the unique token back to the database, the second personal information gateway generating a second token for the data received and sending the unique token back to the database, the personal information gateway transmitting the data to one or more matching nodes according to the corresponding component, the first personal information gateway transmitting the matching request to the one or more nodes, each matching node corresponding to a component providing a match confidence score, and the one or more nodes generating a matching table comprising data of matching first and second tokens.
The method may have the additional step of the personal information gateway generating a first token for the data received and sending the unique token back to the database. It may also have the step of removing the data from the database after it has been sent to the personal information gateway. Another optional step is the personal information gateway cleansing and normalizing the data it has received.
In an embodiment, the personal information gateway places a one-way hash on the data it has received such that it does not contain plaintext data. The first and second organizations may exchange data that is not personal information when the first and second tokens are matched, and the first personal information gateway hashes the shredded data.
The foregoing, and other features and advantages of the invention, will be apparent from the following, more particular description of the preferred embodiments of the invention, the accompanying drawings, and the claims.
For a more complete understanding of the present invention, the objects and advantages thereof, reference is now made to the ensuing descriptions taken in connection with the accompanying drawings briefly described as follows.
Preferred embodiments of the present invention and their advantages may be understood by referring to
In reference to
In the preferred embodiment, when the first organization 1 wishes to send data from the database 20 to a second organization 2, to be combined with the data of the second organization, the data to be transmitted is split into PII records and non-PII records. The PII records are passed from the database 20 to the PIG 10 within the first organization 1. The PIG consists of a system which processes or “shreds” the PII into granular data elements (data shreds), typically individual fields of the PII. The granularity may be smaller, in the form of parts of fields (individual or small groups of characters) or parts of the ASCII characters forming the data. Each information field is broken into smaller portions by the PIG 10 as it is prepared for transmission, and is attributed a token ID that is unique to the complete PII record. The token ID provides the PIG 10 with a way to link granular parts of PII together to determine the identity of the record. The information is transformed or shredded by the PIG 10 into portions small enough that the resulting data cannot be considered PII. The data is transmitted to nodes 15 for further processing. Those transmissions may be secured within virtual private networks, secured by Secure Sockets Layer (SSL) or equivalent technology, and may be accepted only from a whitelist of subscribers.
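The shredding step can be sketched as follows. This is a minimal illustration, assuming that each field is cut into a fixed number of roughly equal character groups and that the token ID is a random identifier; the names Shred, split_evenly, and shred_record, and the choice of three parts per field, are hypothetical and not taken from the description.

```python
import uuid
from dataclasses import dataclass


@dataclass(frozen=True)
class Shred:
    token_id: str   # unique to the complete PII record
    component: str  # source field, e.g. "email"
    part: int       # position of this piece within the field
    value: str      # the small group of characters


def split_evenly(value: str, parts: int) -> list[str]:
    """Cut a string into `parts` consecutive pieces (later ones may be empty)."""
    size = -(-len(value) // parts)  # ceiling division
    return [value[i * size:(i + 1) * size] for i in range(parts)]


def shred_record(record: dict[str, str], parts_per_field: int = 3) -> list[Shred]:
    """Shred every field of one PII record under a single token ID."""
    token_id = uuid.uuid4().hex  # assumed token format
    shreds = []
    for component, value in record.items():
        for part, piece in enumerate(split_evenly(value, parts_per_field)):
            shreds.append(Shred(token_id, component, part, piece))
    return shreds


shreds = shred_record({"email": "alice@example.com", "phone": "4165551234"})
for s in shreds:
    print(s.component, s.part, repr(s.value))
```

Only the gateway keeps the link between the token ID and the full record; each shred on its own carries a few characters and the token.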
The PIG may process the information to reduce identifiability in other ways than shredding, such as combining multiple fields, or maintenance of pseudo-records and/or aliases to match field values, albeit with a lowered match confidence or probability.
The organization 1 is connected to other organizations 2, 3 through the cloud 100, wherein each of those organizations has its own PIG 22, 23 serving as a gateway for its data. Each organization 1, 2, 3 is connected to the policy administration system 6. The policy administration system 6 contains data policy information as to what is considered PII, which policies may be provided by regulatory or government bodies, both domestic and international, and determines what may be transmitted between which types of organizations, defining what is an acceptable or allowable match. The policy administration system 6 is connected to a data exchange 4. The data exchange 4 facilitates anonymized data transfer using tokens, and uses a record, or match list, of corresponding tokens between different organizations 1, 2, 3. The data exchange 4 may send non-PII attributes appended to tokens, as described in further detail in
With reference to
With reference to
With reference to
With reference to
In an embodiment of the data shredding by the PIG 10, wherein the data is not hashed, in step 83 the alphanumeric characters of data entries are converted from ASCII to binary, wherein the binary coding may be further broken up to better anonymize data before being transmitted, and to make any reconstruction meaningless and difficult. For example, an 8-bit binary ASCII character may be broken into two 4-bit nibbles. Future iterations could break that down further into 2-3 bit portions. Further, a secure tunnel (VPN) is generated between the PIG 10 and the nodes 15, to prevent interception of information sent through the tunnel.
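The nibble example above can be sketched in a few lines. This is a minimal illustration that assumes pure ASCII input; the function names to_nibbles and from_nibbles are hypothetical.

```python
def to_nibbles(text: str) -> list[int]:
    """Split each 8-bit ASCII character into a high and a low 4-bit nibble."""
    nibbles = []
    for byte in text.encode("ascii"):
        nibbles.append(byte >> 4)    # upper four bits
        nibbles.append(byte & 0x0F)  # lower four bits
    return nibbles


def from_nibbles(nibbles: list[int]) -> str:
    """Reassemble characters from consecutive (high, low) nibble pairs."""
    pairs = zip(nibbles[0::2], nibbles[1::2])
    return bytes((hi << 4) | lo for hi, lo in pairs).decode("ascii")


print(to_nibbles("Jo"))  # 'J' is 0x4A and 'o' is 0x6F
```

If the high and low nibbles are sent to different nodes, neither node holds a complete character, and reassembly requires both halves in the correct order.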
With reference to
In one embodiment, each node carries a particular portion of the information. For example, if an email address is divided into 3 parts by the data shredding, Node A always receives the first part, Node Y always receives the second part, and Node Z always receives the third part of the email address. Due to the shredding and distribution of the data, no one node 15 contains enough information to re-identify a person or be considered PII. In this way, personal information may be stored on a torrent-style network where all nodes 15 contribute to the distribution of the shreds of the original PII data.
The nodes 15 are connected to the cloud in a torrent-style network. The data may be received non-sequentially, maximizing the efficiency of the different network connections between the nodes and the organizations. Data is received by organizations from many small data requests over different IP connections to different nodes, and reassembled from the small data requests on-site, as is common in torrent-style systems.
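The fixed assignment of parts to nodes can be sketched as a routing table. This is a minimal illustration; the table layout, the tuple format of a shred, and the function name route are assumptions, while the node names follow the email example above.

```python
from collections import defaultdict

# Assumed routing table: (component, part index) -> node that always receives it.
ROUTING = {
    ("email", 0): "Node A",
    ("email", 1): "Node Y",
    ("email", 2): "Node Z",
}


def route(shreds):
    """Group shreds by destination node.

    Each shred is a (token_id, component, part, value) tuple; a node is sent
    only the token and the value of the part assigned to it.
    """
    outbox = defaultdict(list)
    for token_id, component, part, value in shreds:
        node = ROUTING[(component, part)]
        outbox[node].append((token_id, value))
    return dict(outbox)


shreds = [
    ("tok-1", "email", 0, "alice@"),
    ("tok-1", "email", 1, "exampl"),
    ("tok-1", "email", 2, "e.com"),
]
print(route(shreds))
```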
With reference to
In an embodiment, the contributing party will encrypt or hash their data using a key or salt known only to them. The key or salt would be submitted through the management system into the matching nodes during the matching phase to unlock the use of their shred(s). This process can be likened to the two-key systems used in safe deposit boxes.
With reference to
With reference to
With reference to
In the preferred embodiment, the PIG 10 will be in communication with a policy administration system 6 to ensure proper regulation of data being transmitted. The policy administration system 6 describes whether a match is permitted ethically or legally, after applying rules regarding the type of information, its final destination (national or international, taking into account jurisdictional peculiarities), and optionally what other information is being transmitted alongside the information. Additionally, blacklists could be implemented via the PII policy administration system 6, to keep data or metadata from being obtained by a competitor's organization. Examples may include a blacklist for banks transmitting to another financial institution. In one embodiment, a permitted use governance system may be used by the organizations themselves to manage the white and black lists.
Each of the nodes 15 and organizations 1, 2, 3 are connected to a network, preferably the Internet to pass through a cloud. Due to the risk of interception of traffic that passes through publicly available networks, the data intended for communication is hashed before transmission, wherein the data hashes to a unique hash value, and wherein the data cannot be un-hashed to reveal the original data. There are a number of hash functions known in the art that could be used, for which a non-limiting example might be SHA or its variants. Preferred hash functions always produce the same output for a given input, and map the inputs as evenly as possible over the output range, and good hash functions also have a circumscribed output range. Ideally, to reduce ambiguity, the hashes are unique and for a given value only a single hash output is the result.
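The one-way hashing of a shred before transmission can be sketched with SHA-256, one of the SHA variants mentioned above. This is a minimal illustration; the normalization step (trimming whitespace and lower-casing) is an assumption standing in for the cleansing and normalizing performed by the gateway, and the function name hash_shred is hypothetical.

```python
import hashlib


def hash_shred(value: str) -> str:
    """Return a SHA-256 hex digest of a normalized shred value."""
    normalized = value.strip().lower()  # assumed normalization rules
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


# The same input always yields the same digest, so two gateways that
# normalize identically produce comparable values without sending plaintext.
print(hash_shred("Alice@") == hash_shred(" alice@ "))
print(len(hash_shred("alice@")))
```

Because a short shred has few possible values, an unsalted digest of it could be reversed by enumerating candidates; the keyed or salted variant described elsewhere in this description is one way to address that.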
Each of the matching nodes 15 has a matching engine built in. In one embodiment, the contributor nodes also match and have a matching engine built in. The matching node 15 receives data from multiple organizations 1, 2, 3. If a particular data entry exists in multiple organizations 1, 2, 3, a simple grouping of those data entries is created within the node 15. In an embodiment, the nodes 15 are independent of the organizations 1, 2, 3. They are connected to the network 100, and are distributed similar to a torrent in one embodiment. In an embodiment, each node maintains a particular piece of the shredded data for multiple data records, so in an example a particular node may contain thousands of second triplets of users' telephone numbers, while another node may contain only the first triplet of users' telephone numbers. If an event arises wherein the originating organization 1 would like to utilize attributes of receiving organization 2, identity-matching needs to occur to ensure that the individual is the same person. The policy administration system 6 receives a request for a match between two individuals' PII data in order to facilitate an exchange of attributes for an existing customer. Once the policy engine has approved the request, a match request will be sent out to each node requesting the two tokens of any “MATCHED” requests for those accounts. Each node would independently respond with a table containing tokens that match the request.
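The grouping performed inside a single matching node can be sketched as follows. This is a minimal illustration, assuming the node stores entries as (organization, token, hashed shred) tuples and reports pairs of tokens from different organizations whose hashed shreds are equal; the data layout and the function name match_table are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def match_table(entries):
    """Build a table of cross-organization token pairs with equal hashed shreds.

    entries: iterable of (organization, token, hashed_shred) tuples.
    Returns a sorted list of ((org, token), (org, token)) pairs.
    """
    groups = defaultdict(list)
    for org, token, digest in entries:
        groups[digest].append((org, token))  # simple grouping by shred value

    matches = set()
    for members in groups.values():
        for left, right in combinations(members, 2):
            if left[0] != right[0]:  # only match across organizations
                matches.add(tuple(sorted((left, right))))
    return sorted(matches)


entries = [
    ("org1", "T-100", "9f2c"),
    ("org2", "U-777", "9f2c"),
    ("org2", "U-778", "41aa"),
    ("org3", "V-001", "9f2c"),
]
for pair in match_table(entries):
    print(pair)
```

A match on one node covers only the component that node holds; the tables returned by several nodes would be combined before two tokens are treated as the same individual.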
Optionally, during a match request, a map will be created to confirm all bits are available between parties and report missing components if required. This will allow the PIG admin to add additional nodes to increase the quality of the matching map. Even though this is permitted by the technology, it may be restricted from a regulatory point of view.
The organizations 1, 2, 3 are connected to the cloud 100 (generally a server network) through their PIGs 10. A number of nodes 15 are also connected to the cloud and may communicate with the organizations 1, 2, 3 through their PIGs 10, and may also communicate directly with other nodes 15. In an embodiment, the data will be further protected using a one-way hash, using any one of a number of hash functions known in the art. In an embodiment, the hash is applied by the PIG 10 when the data first exits the organization 1, 2, 3. This ensures that, during its transmission to the storage node(s), the data cannot be seen in clear text, which maintains data privacy. Optionally, a second one-way hash is applied by the receiving node 15 when the data is received, and the data is stored in a double-hashed format, which further obfuscates the data and makes it more resistant to any other site attempting to hack that location from the cloud. This additional layer of protection also helps ensure that the PI bank management organization cannot access an individual's actual data.
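The double-hashed storage format can be sketched as two successive digests. This is a minimal illustration using SHA-256 for both stages; the optional per-node salt and the function names gateway_hash and node_hash are assumptions not stated in the description.

```python
import hashlib


def gateway_hash(shred: str) -> str:
    """First one-way hash, applied by the gateway before the data leaves."""
    return hashlib.sha256(shred.encode("utf-8")).hexdigest()


def node_hash(received_digest: str, node_salt: bytes = b"") -> str:
    """Second one-way hash, applied by the receiving node before storage.

    node_salt is a hypothetical per-node value; with the default empty salt
    this is simply a second SHA-256 pass over the received digest.
    """
    return hashlib.sha256(node_salt + received_digest.encode("ascii")).hexdigest()


# Two organizations sending the same shred end up with the same stored value
# on a given node, so matching still works on the double-hashed form.
stored_1 = node_hash(gateway_hash("alice@"), node_salt=b"node-A")
stored_2 = node_hash(gateway_hash("alice@"), node_salt=b"node-A")
print(stored_1 == stored_2)
```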
In an embodiment, for each action on any given node, a transaction is recorded against a common ledger so that an immutable record exists on each match ever requested. In one embodiment, the ledger is recorded as a blockchain, such that prior records cannot be altered, and an audit path is always maintained. In an embodiment, a multi-tiered encryption model is used in which a transaction data block of the actor is individually encrypted, a transaction data block of each transaction is individually encrypted, and a chain of data blocks is encrypted. Before decrypting the data pertaining to a party of a transaction, the chain of data blocks must be decrypted, followed by a decryption of the transaction's transaction data block, followed by a decryption of the party's transaction data block. In this way the placement and use of all PII by any employee of an organization is now fully captured in an independent, immutable, and distributed way.
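The tamper-evident property of such a ledger can be illustrated with a simple hash chain, in which every entry includes the digest of the entry before it. This is a minimal sketch only: it does not implement the multi-tiered encryption described above or any distributed consensus, and the record fields and function names are hypothetical.

```python
import hashlib
import json

GENESIS = "0" * 64  # assumed placeholder digest for the first entry


def _digest(prev_hash: str, record: dict) -> str:
    payload = json.dumps({"prev": prev_hash, "record": record}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def append_entry(ledger: list, record: dict) -> None:
    """Append a transaction record linked to the previous entry's hash."""
    prev_hash = ledger[-1]["hash"] if ledger else GENESIS
    ledger.append({"prev": prev_hash, "record": record,
                   "hash": _digest(prev_hash, record)})


def verify(ledger: list) -> bool:
    """Recompute every link; any altered prior record breaks the chain."""
    prev_hash = GENESIS
    for entry in ledger:
        if entry["prev"] != prev_hash:
            return False
        if entry["hash"] != _digest(prev_hash, entry["record"]):
            return False
        prev_hash = entry["hash"]
    return True


ledger = []
append_entry(ledger, {"action": "match_request", "node": "Node A", "actor": "org1"})
append_entry(ledger, {"action": "match_response", "node": "Node A", "actor": "org2"})
print(verify(ledger))
```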
In an embodiment, each Company Database can replace their PII with tokens. Any person or application requiring the use of PII would use a governance engine that supports permitted use of that data. In this way, the PIG 10 becomes the single source of all PII data within an organization as well as the single place requiring protection and management. This improves and standardizes records management and data cleansing while maintaining internal data security measures. In another embodiment, the PIG 10 may actually be formed of two components, a first that holds the master records (on a secluded network) and a second that stores the hashed shredded records and can communicate directly with the Internet.
In an embodiment, the initial architecture of the system will require there to be enough nodes to ensure no single node can re-identify a person. For example, if Node 1 held a given name and surname, or a surname and phone number, that could be enough to re-identify. Even though all chunks are stored in an encrypted way, this will ensure that the data stays de-identified. Some other PI chunks could be placed together with less risk, such as birth month and city. In the embodiment, the data schema is laid out in such a way as to ensure no single point of failure could cause an outage in the use of data. Redundant copies or multiple parity chunks, such as those employed in object storage or other scale-out storage solutions, can be used for this purpose.
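The placement constraint can be sketched as a check over a proposed layout. This is a minimal illustration; the list of risky combinations mirrors the examples above, while the field names, the layout format, and the function name is_safe_placement are assumptions.

```python
# Combinations that could re-identify a person if held by one node
# (taken from the examples in the text; a real policy would list more).
RISKY_COMBINATIONS = [
    {"given_name", "surname"},
    {"surname", "phone"},
]


def is_safe_placement(placement: dict[str, set[str]]) -> bool:
    """Return True if no node holds every field of any risky combination."""
    for fields in placement.values():
        for combo in RISKY_COMBINATIONS:
            if combo <= fields:  # the node holds the whole combination
                return False
    return True


good = {"Node 1": {"given_name"}, "Node 2": {"surname"},
        "Node 3": {"birth_month", "city"}}
bad = {"Node 1": {"given_name", "surname"}, "Node 2": {"phone"}}
print(is_safe_placement(good), is_safe_placement(bad))
```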
In an embodiment, due to the distributed nature of the deployment, each organization will have used varying levels of security. A hacker would be required to hack multiple environments simultaneously to retrieve useful data. Even then, it may be similar to retrieving a large phone book with an arbitrary account number as the single identifier of the provider organization.
In an embodiment, a binary conversion may be utilized to convert alphanumeric characters to binary to increase the granularity of the distribution of characters. As more complex characters are intended for use, the coding should not be limited to UTF-8. Because the binary elements are distributed, even the letters of a name are unintelligible to the PIG storing the data, and the matching process between organizations takes little processing to accomplish.
The invention has been described herein using specific embodiments for the purposes of illustration only. It will be readily apparent to one of ordinary skill in the art, however, that the principles of the invention can be embodied in other ways. Therefore, the invention should not be regarded as being limited in scope to the specific embodiments disclosed herein, but instead as being fully commensurate in scope with the following claims.