A business organization (a retail business, a professional corporation, a financial institution, and so forth) may collect, process and/or store data that represents sensitive or confidential information about individuals or business organizations. For example, the data may be personal data that may represent names, residence addresses, medical information, salaries, banking information, and so forth. The data may be initially collected or acquired in “plaintext form,” and as such may be referred to as “plaintext data.” Plaintext data refers to ordinarily readable data. As examples, plaintext data may be a sequence of character codes, which represent the residence address of an individual in a particular language; or the plaintext data may be a number that conveys, for example, a blood pressure reading.
For purposes of controlling access to sensitive information (e.g., information relating to confidential or sensitive information about one or more business enterprises and/or individuals), plaintext data items, which represent the sensitive information, may be converted, through a process called “pseudonymization,” into corresponding pseudonyms, or pseudonym values. In this context, a “plaintext data item” (also referred to as “plaintext,” or a “plaintext value” herein) refers to a unit of data (a string, an integer, a real number, and so forth) that represents ordinarily readable content. As examples, a plaintext data item may be a string of character codes that represents a number conveying, in a particular number representation (an Arabic representation, for example), a blood pressure measurement, a salary, and so forth. The pseudonym value ideally conveys no information about the entity associated with the corresponding plaintext value. The pseudonymization process may or may not be reversible, in that reversible pseudonymization processes allow plaintext values to be recovered from pseudonym values, whereas irreversible pseudonymization processes do not.
The pseudonymization process may serve various purposes, such as regulating access to sensitive information and allowing the sensitive information to be analyzed by third parties. For example, the sensitive data may be personal data, which represents personal information about the public, private and/or professional lives of individuals. In some cases, it may be useful to process pseudonymized data to gather statistical information about the underlying personal information. For example, it may be beneficial to statistically analyze pseudonymized health records (i.e., health records in which sensitive plaintext values have been replaced with corresponding pseudonym values), for purposes of gathering statistical information about certain characteristics (weights, blood pressures, diseases or conditions, diagnoses, and so forth) of particular sectors, or demographics, of the population. The pseudonymization process may, however, potentially alter, if not destroy, statistical properties of the personal information. In other words, a collection of plaintext values may have certain statistical properties that are represented by various statistical measures (means, variances, ranges, distributions, expected values, and so forth). These statistical properties may not be reflected in the corresponding set of pseudonym values, and accordingly, useful statistical information about the personal information may not be determined from the pseudonymized data.
As a more specific example, one way to convert plaintext data (e.g., personal data, such as data representing health records, salaries, addresses, and so forth) into a corresponding set of pseudonyms is to encrypt the plaintext data. However, encrypting data may destroy statistical properties of the data. For example, the encryption of plaintext data that has a Gaussian, or normal, statistical distribution may produce a set of pseudonym values that have an associated uniform probability distribution.
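This loss of statistical structure can be sketched in a few lines of Python. The sketch below uses SHA-256 as a stand-in for an encryption-like one-way transformation (an assumption for illustration; any cipher with pseudorandom output behaves similarly): normally distributed inputs yield outputs that look uniform.

```python
import hashlib
import random
import statistics

# Sketch (not from the text): use SHA-256 as a stand-in for an
# encryption-like one-way transformation, to show that the output
# distribution is roughly uniform regardless of the input distribution.
random.seed(1)

# Plaintext values with a normal distribution (e.g., blood pressure-like
# readings with mean 120 and standard deviation 15).
plaintexts = [random.gauss(120.0, 15.0) for _ in range(10_000)]

# Transform each value; normalize the 256-bit digest to [0, 1).
ciphertext_like = [
    int.from_bytes(hashlib.sha256(str(p).encode()).digest(), "big") / 2**256
    for p in plaintexts
]

# The inputs are normal (mean ~120, std ~15); the outputs look uniform
# on [0, 1): mean ~0.5, std ~1/sqrt(12) ~ 0.289.
input_mean = statistics.fmean(plaintexts)
output_mean = statistics.fmean(ciphertext_like)
output_std = statistics.stdev(ciphertext_like)
```

The normal shape of the inputs is entirely absent from the outputs, which is the statistical destruction described above.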
In accordance with example implementations that are described herein, a pseudonymization process converts plaintext values into corresponding pseudonym values in a process that preserves a statistical distribution of the plaintext values. Moreover, in accordance with example implementations, the pseudonymization process is irreversible. In other words, in accordance with example implementations, it may be quite challenging, if not impossible, to reconstruct the plaintext values from the pseudonym values.
More specifically, in accordance with example implementations, a pseudonymization engine converts plaintext values (assumed to have a normal statistical distribution) to pseudonym values that have a normal statistical distribution. In accordance with example implementations, the pseudonymization engine repeatedly applies a hash function (a cryptographic hash function, such as an SHA-2 hash function or an SHA-3 hash function, as examples) in the conversion of each plaintext value.
The output of a hash function is a pseudorandom value. In accordance with the Central Limit Theorem, the sum of several such hash values may approximate or reach a normal, or Gaussian, distribution. More specifically, if “H” represents a hash function and “H(x)” represents the application of the hash function to an input value x, the sum H(x)+H(H(x))+H(H(H(x))) approximates, if not exactly matches, a normal distribution. In accordance with example implementations that are described herein, a pseudonym value is determined by repeatedly applying a hash function and adding the resulting hashes together, as set forth in the summation above. In accordance with example implementations, the resulting set, or collection, of pseudonym values has a predetermined statistical distribution (a Gaussian or normal distribution, as an example); and due to the hash function being a one-way function, the pseudonymization may be irreversible.
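The summation H(x)+H(H(x))+H(H(H(x))) can be sketched as follows. This is a minimal illustration, not the claimed implementation; it assumes SHA-256 as the hash function and treats each 256-bit digest as an integer. Each digest is approximately uniform, so the normalized sum of three of them follows an Irwin-Hall-like shape that begins to approximate a normal distribution.

```python
import hashlib
import statistics

def pseudonymize(plaintext: str, iterations: int = 3) -> int:
    # Compute H(x) + H(H(x)) + H(H(H(x))) with SHA-256 as H:
    # each round hashes the previous digest and adds it to the sum.
    current = plaintext.encode()
    total = 0
    for _ in range(iterations):
        current = hashlib.sha256(current).digest()
        total += int.from_bytes(current, "big")
    return total

# Each SHA-256 output is roughly uniform on [0, 2**256); per the Central
# Limit Theorem, the normalized sum of three such values has mean ~1.5
# and standard deviation ~0.5, approaching a normal shape.
pseudonyms = [pseudonymize(f"record-{i}") for i in range(10_000)]
normalized = [p / 2**256 for p in pseudonyms]
mean_val = statistics.fmean(normalized)
std_val = statistics.stdev(normalized)
```

Because SHA-256 is a one-way function, recovering `plaintext` from a pseudonym value is computationally infeasible, which matches the irreversibility property noted above.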
Referring to
Regardless of its particular form, in accordance with some implementations, the computer system 100 may include one or multiple processing nodes; and each processing node 110 may include one or multiple personal computers, workstations, servers, rack-mounted computers, special purpose computers, and so forth. Depending on the particular implementations, the processing nodes 110 may be located at the same geographical location or may be located at multiple geographical locations. Moreover, in accordance with some implementations, multiple processing nodes 110 may be rack-mounted computers, such that sets of the processing nodes 110 may be installed in the same rack. In accordance with further example implementations, the processing nodes 110 may be associated with one or multiple virtual machines that are hosted by one or multiple physical machines.
In accordance with some implementations, the processing nodes 110 may be coupled to a storage 160 of the computer system 100 through network fabric (not depicted in
The storage 160 may include one or multiple physical storage devices that store data using one or multiple storage technologies, such as semiconductor device-based storage, phase change memory-based storage, magnetic material-based storage, memristor-based storage, and so forth. Depending on the particular implementation, the storage devices of the storage 160 may be located at the same geographical location or may be located at multiple geographical locations. Regardless of its particular form, the storage 160 may store pseudonymized data records 164 (i.e., data representing pseudonyms, or pseudonym values, generated as described herein).
In accordance with some implementations, a given processing node 110 may contain a pseudonymization engine 122, which is constructed to, for a given plaintext value, repeatedly apply a hash function (a cryptographic hash function, as an example) to produce multiple hash values, which are added together to produce the corresponding pseudonym value, as described herein. Due to the use of a hash function and the corresponding hash values, the pseudonymization process is irreversible, in accordance with example implementations.
In accordance with example implementations, the processing node 110 may include one or multiple physical hardware processors 134, such as one or multiple central processing units (CPUs), one or multiple CPU cores, and so forth. Moreover, the processing node 110 may include a local memory 138. In general, the local memory 138 is a non-transitory memory that may be formed from, as examples, semiconductor storage devices, phase change storage devices, magnetic storage devices, memristor-based devices, a combination of storage devices associated with multiple storage technologies, and so forth.
Regardless of its particular form, the memory 138 may store various data 146 (data representing plaintext values, pseudonym values, hash function outputs, mathematical combinations of hash values, intermediate results pertaining to the pseudonymization process, and so forth). The memory 138 may store instructions 142 that, when executed by one or multiple processors 134, cause the processor(s) 134 to form one or multiple components of the processing node 110, such as, for example, the pseudonymization engine 122.
In accordance with some implementations, the pseudonymization engine 122 may be implemented at least in part by a hardware circuit that does not include a processor executing machine executable instructions. In this regard, in accordance with some implementations, the pseudonymization engine 122 may be formed in whole or in part by a hardware processor that does not execute machine executable instructions, such as, for example, an application specific integrated circuit (ASIC), a field programmable gate array (FPGA), and so forth. Thus, many implementations are contemplated, which are within the scope of the appended claims.
As depicted in
In accordance with further example implementations, the pseudonymization engine 122 may determine fewer or more than three hash values and base the determination of each pseudonym value on the summation of these hash values. For example, in accordance with further example implementations, the pseudonymization engine 122 may set the pseudonym value equal to H(x)+H(H(x)).
The number of hash function iterations controls the statistical distribution of the pseudonym values.
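The effect of the iteration count can be sketched as follows (an illustrative assumption: SHA-256 as the hash function). The normalized sum of k near-uniform hash values follows an Irwin-Hall distribution with mean k/2 and variance k/12, approaching a normal shape as k grows, so varying k reshapes the pseudonym distribution.

```python
import hashlib
import statistics

def hash_sum(x: bytes, iterations: int) -> float:
    # Normalized sum of iterated SHA-256 values, lying in [0, iterations).
    total, current = 0, x
    for _ in range(iterations):
        current = hashlib.sha256(current).digest()
        total += int.from_bytes(current, "big")
    return total / 2**256

# Irwin-Hall: for k summed uniforms, mean = k/2 and std = sqrt(k/12);
# with k = 1 the result stays uniform, and larger k looks more normal.
sums_by_k = {
    k: [hash_sum(f"r{i}".encode(), k) for i in range(5_000)]
    for k in (1, 2, 3, 5)
}
```

For example, with k = 3 the sample mean sits near 1.5 and the sample standard deviation near 0.5, consistent with the three-term summation used above.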
Moreover, in accordance with further example implementations, the pseudonymization engine 122 may further process a set of pseudonym values that are derived using a summation of hashes (such as one of the summations described above) to further manipulate statistical properties of the pseudonym values. For example, after the pseudonymization engine 122 uses one or multiple hash function iterations to reach or approximate a given distribution, such as a normal distribution, as depicted in
The pseudonymization engine 122 may, in accordance with example implementations, apply a statistical distribution transformation function to the set of intermediate pseudonym values to further manipulate statistical properties of the resulting pseudonym dataset. For example, in accordance with some implementations, the pseudonymization engine 122 may apply a Box-Muller or a Marsaglia polar transformation, as just a few examples. In this manner, the pseudonymization engine 122 may, for example, convert a set of intermediate pseudonym values having a normal statistical distribution into a set of pseudonym values that have a log-normal statistical distribution.
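One such transformation chain can be sketched as follows. Note the hedges: the Box-Muller transform, as usually stated, maps uniform values to normal values (here the uniform inputs are stand-ins for normalized hash outputs), and the normal-to-log-normal step shown is plain exponentiation, a standard identity that the text does not name explicitly.

```python
import math
import random
import statistics

def box_muller(u1: float, u2: float) -> float:
    # Box-Muller transform: maps two independent uniform(0,1) samples
    # to one standard-normal sample.
    return math.sqrt(-2.0 * math.log(u1)) * math.cos(2.0 * math.pi * u2)

random.seed(7)
# Stand-ins for uniform intermediate values; an actual engine would
# derive these from normalized hash outputs.
normals = [
    box_muller(random.uniform(1e-12, 1.0), random.random())
    for _ in range(20_000)
]

# If Z ~ Normal(0, 1), then exp(Z) is log-normally distributed, so
# exponentiation converts the normal intermediate set into a log-normal
# final set (an illustrative choice, not a method named in the text).
log_normals = [math.exp(z) for z in normals]
```

The resulting `log_normals` set has the heavy right tail characteristic of a log-normal distribution, while no individual plaintext value is recoverable from it.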
Referring to
Referring to
Referring to
While the present disclosure has been described with respect to a limited number of embodiments, those skilled in the art, having the benefit of this disclosure, will appreciate numerous modifications and variations therefrom. It is intended that the appended claims cover all such modifications and variations.