1. Field of the Invention
The present disclosure relates to the field of secured data processing and in particular to an apparatus and method for anonymizing data for further processing.
2. Background Art
Data owners, which can also include those that have generated the data or those who are associated with the generated data, are increasingly concerned with the privacy and the security of their data. Those that have earned or been given the right to electronically process this data, referred to herein as third party data processors, must ensure its security and integrity. Processing of the data may include the storage and/or analysis of the data as well as transformation of the data, for example using the original data as input to generate output data. One technique that can be used to enhance security is to process the data in an anonymous fashion. Anonymous data processing typically includes replacing an identifier of the data owner, such as a name or numerical identifier, included in or with the data with a proxy identifier such as a serially assigned unique number. This allows other unproxied information in the data, for example an income range, to be associated with independent data owners when being processed while also preventing the unproxied information from being easily attributed or tracked back to the actual data owner using the proxied data.
While the use of proxy identifiers is effective in anonymizing data during downstream processing, proxying the data directly limits the flexibility of an anonymizing system. For example, in the case where separate data items, such as postal code and gender, associated with the same data owner are provided to two or more downstream data processors using the same proxy identifier for the data owner, the possibility of correlating the postal code to the gender exists if the different data processors share or exchange data. When two or more data processors are able to correlate the anonymized data that they receive from the data processor, then the data owner's privacy and security have a higher potential to be at least partially compromised. For example, two data processors, one that receives postal code data and one that receives gender data, could collectively determine that a person (or number of persons) of a particular gender lives in a particular postal code. As will be appreciated, the foregoing example is a trivial illustration; however, with larger amounts of data, extensive correlation analysis of anonymized data may result in discovery of multiple characterizing data items attributable to one or more data owners.
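By way of non-limiting illustration only, the correlation risk described above may be sketched as follows. The proxy identifiers, postal codes and the `correlate` helper are hypothetical and are used only to show that two data sets keyed by the same proxy identifier can be joined without knowledge of the actual data owner.

```python
# Hypothetical data sets given to two different downstream data processors.
# Both are keyed by the SAME proxy identifier for each data owner.
postal_by_proxy = {1001: "K1A 0B1", 1002: "M5V 2T6"}
gender_by_proxy = {1001: "F", 1002: "M"}


def correlate(first, second):
    """Join two proxied data sets on their shared proxy identifiers."""
    shared = first.keys() & second.keys()
    return {proxy_id: (first[proxy_id], second[proxy_id]) for proxy_id in shared}


# If the two processors exchange data, each proxy identifier now carries
# both a postal code and a gender.
joined = correlate(postal_by_proxy, gender_by_proxy)
```

The join requires nothing more than the shared key, which is why providing the same proxy identifier to multiple data processors weakens the anonymization.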
Furthermore, even if the data is provided to only one data processor, directly proxying data is inflexible since it may be difficult to change the proxied value associated with a particular data owner.
Therefore, there is a need for a mechanism for anonymous data processing that mitigates the possibility of compromising data owners' privacy and security when data associated with the data owner is provided to multiple downstream data processors and/or provides flexible anonymization of data.
In accordance with the present disclosure there is provided a method, implemented in a processing unit, of anonymizing data of one or more data owners. The method comprises receiving an identifier and data associated with a data owner of the one or more data owners, determining a static owner identifier using the received identifier, performing a first hashing process using a first generated cryptographic salt and the static owner identifier to generate a first unique one-way hash result (HASH1) associated with the static owner identifier, determining a first anonymous identifier (AID1) associated with the HASH1, and storing in a memory unit associated with the processing unit the determined AID1 with at least a first portion of data (DATA1) associated with the received data.
In accordance with the present disclosure there is further provided a system for anonymizing data associated with a subscriber, the system comprising a computer readable memory unit for storing instructions and data, a network interface coupling the system to a network, and a processing unit for executing the instructions stored in the computer readable memory unit, the instructions when executed by the processing unit configuring the system to perform a method, implemented in a processing unit, of anonymizing data of one or more data owners. The method comprises receiving an identifier and data associated with a data owner of the one or more data owners, determining a static owner identifier using the received identifier, performing a first hashing process using a first generated cryptographic salt and the static owner identifier to generate a first unique one-way hash result (HASH1) associated with the static owner identifier, determining a first anonymous identifier (AID1) associated with the HASH1, and storing in a memory unit associated with the processing unit the determined AID1 with at least a first portion of data (DATA1) associated with the received data.
In accordance with the present disclosure there is further still provided a computer readable memory storing instructions for configuring a computer to perform a method, implemented in a processing unit, of anonymizing data of one or more data owners. The method comprises receiving an identifier and data associated with a data owner of the one or more data owners, determining a static owner identifier using the received identifier, performing a first hashing process using a first generated cryptographic salt and the static owner identifier to generate a first unique one-way hash result (HASH1) associated with the static owner identifier, determining a first anonymous identifier (AID1) associated with the HASH1, and storing in a memory unit associated with the processing unit the determined AID1 with at least a first portion of data (DATA1) associated with the received data.
Embodiments are described herein with reference to the drawings in which:
The data repository 102 may be any repository of data associated with a data owner. Additionally or alternatively, the data 104 and associated static owner identifier 106 may be received individually or in batches from one or more processes or components. The stored data may be data-of-interest that is to be provided to one or more downstream data processors 122. The data contained in the data repository 102 may include a plurality or set of information of various types and various formats and ranges. Each set of information may be associated with a data owner via a static owner identifier that uniquely identifies the data owner. In addition, as described further herein, the data owner may have one or more dynamic identifiers that uniquely identify the data owner, but that can change over time to identify a different data owner. The static owner identifier is persistent and does not change with time, whereas each dynamic identifier can change over time. In an example used herein, a data owner may be an Internet access subscribing household, the data-of-interest may include the Internet data traffic to and from the household, also referred to as click stream data. The data may also comprise data resulting from the processing of the click stream data. The traffic data to and from the household is associated with an Internet Protocol (IP) address that can change dynamically over time. This IP address may be the dynamic identifier. The data owner may be associated with one or more static owner identifiers such as, for example, an account identifier provided by an Internet Service Provider (ISP) and a media access control (MAC) address associated with a modem used to access the Internet through the ISP. It is possible for the ISP, or authority that provides the dynamic identifier, to determine the static owner identifier from the dynamic identifier. The dynamic identifier has been described above as a dynamically assigned IP address. 
It will be appreciated that IP addresses may also be statically assigned. It may also be possible to determine the static owner identifier, for example the MAC address, from a static identifier such as a statically assigned IP address, using the same process used for determining a static owner identifier from a dynamic identifier. Additionally or alternatively, the static identifier may be used as the static owner identifier.
It may be desirable to have the data 104 stored in the data repository 102 processed by a third party processor 122. However it may not be desirable to provide the data 104 associated with the static owner identifier 106 to the third party due to privacy or other concerns. In order to provide the data stored in the data repository to a third party without being able to associate the data to the data owner using the static owner identifier, the data is first anonymized. An anonymizer 108 receives the data 104 and associated static owner identifier 106. The anonymizer 108 associates an anonymous identifier (AID) with the data. In the example depicted in
The data stored in the repositories 114, 120 may be provided to different third party processors 122a, 122b (referred to collectively as third party processors 122). The third party processors 122a, 122b may process the data and store results in the repositories 114, 120 or alternatively in another repository. Since the anonymizer 108 associates different AIDs with different types of data, or with different copies of the same type of data, associated with the same static owner identifier, the third party processors 122a, 122b will not be able to associate the different data types back to the same data owner. Additional privacy may be provided by providing different AIDs based on the type of data, the third party processor the data is to be provided to, or both.
The ISP network 202 comprises a plurality of switches, routers or other network equipment 212a, 212b that routes data between the subscriber's computer 204 and the Internet 206. One or more network sensors 214 collect data from the ISP network. The data collected may be associated with a static owner identifier, or other identifier that can be used to determine an associated static owner identifier of the data owner. As will be appreciated, the static owner identifier may need to be determined from the network traffic. For example, if the subscriber's computer 204 is assigned an Internet Protocol (IP) address dynamically, the static owner identifier may be determined by using the dynamically assigned IP address to look up, or request from the address authority, the associated static owner identifier.
The network sensors 214 may pass the collected data to a data anonymization unit 216 that implements at least a portion of the anonymization system 100 including the anonymizer 108. The data anonymization unit 216 may comprise a processing unit and a memory unit (not depicted). As will be appreciated, the processing unit may comprise one or more processors coupled together. The one or more processors of the processing unit may be arranged on the same physical chip, or they may be arranged on multiple separate chips. Additionally, the processing unit may further comprise multiple processors or computing devices containing one or more processors coupled together, for example over a network. Similarly, the memory unit may comprise a plurality of memory devices for storing information. The memory devices of the memory unit may store information, including instructions and data, in volatile memory. The memory unit may also comprise memory devices for storing information in non-volatile storage. Although the data anonymization unit 216 is depicted as being a single physical component, it will be appreciated that the data anonymization unit 216 may include multiple physical components coupled together. The multiple components may be located in the same location or may be located in different geographical locations.
Regardless of the specific physical configuration of the data anonymization unit 216, the data anonymization unit 216 is configured to anonymize the data collected by the one or more network sensors 214. The anonymized data may then be provided to one or more third party processors 122.
As depicted in
The anonymizer 108 receives a static owner identifier 106 that identifies the data owner and is associated with the data 104. The anonymizer 108 comprises a hash processor 302 that receives the static owner identifier 106. The hash processor 302 provides a one-way hash process 306 that takes the static owner identifier 106 and a cryptographic salt 304 as input. The cryptographic salt 304 is a plurality of random bits that are used to help prevent the resultant hash from being reversed using a dictionary type attack. The hash process 306 takes the cryptographic salt 304 and the static owner identifier 106 as input and produces a fixed length string based on the inputs. Given the same inputs, the hash process 306 will produce the same output. Given different inputs, the hash process 306 will, with a high probability, produce different outputs. Given the output of the hash process 306, it is mathematically complex to determine the original inputs; as such, the hash process provides a one-way association between the input and output. Additionally, by using the cryptographic salt 304, it is more difficult to retrieve the static owner identifier from the output, since the salt value would need to be known in order to determine the static owner identifier 106. The hash process may be any appropriate one-way function. For example the hash process may implement a message digest process such as Message-Digest algorithm 5 (MD5), or a secure hash algorithm such as Secure Hash Algorithm (SHA) 128 or SHA 256.
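By way of non-limiting illustration only, a salted one-way hash of a static owner identifier may be sketched as follows. The function name `salted_hash`, the example MAC addresses, the 16-byte salt length and the combination of salt and identifier by simple concatenation are assumptions made for the sketch; the description does not prescribe how the two inputs are combined. SHA-256 is used as one of the hash processes named above.

```python
import hashlib
import secrets


def salted_hash(static_owner_id: str, salt: bytes) -> str:
    """Return a fixed-length one-way hash of the salt and the identifier.

    Concatenating salt and identifier is an assumption of this sketch.
    """
    return hashlib.sha256(salt + static_owner_id.encode("utf-8")).hexdigest()


# A machine-generated salt of random bits (length chosen for illustration).
salt = secrets.token_bytes(16)

h1 = salted_hash("00:1A:2B:3C:4D:5E", salt)  # hypothetical MAC address
h2 = salted_hash("00:1A:2B:3C:4D:5E", salt)  # same inputs, same output
h3 = salted_hash("00:1A:2B:3C:4D:5F", salt)  # different identifier
```

With these inputs, `h1` and `h2` are identical, while `h3` differs with high probability, consistent with the properties of the hash process described above.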
The cryptographic salt 304 used by the hash processor 302 is the same for all static owner identifiers that are hashed. The cryptographic salt 304 may be changed periodically; however, once the salt used is changed, inputting the same static owner identifier 106 into the hash processor 302 will produce a different output, and as such any data associated with the previous hash output of the static owner identifier 106 will be inaccessible, or will not be able to be associated with the same static owner identifier. If it is desirable to periodically change the salt used but still have the static owner identifier be associable to the previous hash output, the old salt can be saved. Alternatively, it may be desirable to periodically change the salt without storing it in order to make it impossible to associate data anonymized with the old salt with data anonymized with the new salt. For example, if the salt is changed once a month, only one month's worth of data will be able to be associated with a particular static owner identifier.
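By way of non-limiting illustration only, the two salt-change options described above may be sketched as follows. The class `RotatingSalt`, its `keep_old` flag and the `salted_hash` helper are hypothetical names introduced for the sketch and are not part of the described system.

```python
import hashlib
import secrets


def salted_hash(static_owner_id: str, salt: bytes) -> str:
    """One-way hash of salt plus identifier (concatenation is an assumption)."""
    return hashlib.sha256(salt + static_owner_id.encode("utf-8")).hexdigest()


class RotatingSalt:
    """Holds the current salt and optionally retains earlier salts."""

    def __init__(self, keep_old: bool):
        self.keep_old = keep_old
        self.current = secrets.token_bytes(16)
        self.previous = []  # stays empty when old salts are discarded

    def rotate(self) -> None:
        """Replace the current salt, saving the old one only if requested."""
        if self.keep_old:
            self.previous.append(self.current)
        self.current = secrets.token_bytes(16)


# Option 1: the old salt is saved, so earlier hash outputs can be reproduced.
kept = RotatingSalt(keep_old=True)
before = salted_hash("acct-42", kept.current)
kept.rotate()
after = salted_hash("acct-42", kept.current)
reproduced = salted_hash("acct-42", kept.previous[0])

# Option 2: the old salt is discarded, so earlier outputs cannot be linked.
discarded = RotatingSalt(keep_old=False)
discarded.rotate()
```

In the first option `reproduced` equals `before`, so data anonymized before the change remains associable; in the second option no earlier salt is available for that purpose.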
The cryptographic salt 304 may be provided in various ways. As depicted in
The salt 304 may be generated internally by the anonymizer 108 and the resultant salt made inaccessible to processes external to the anonymizer 108. Additional privacy may be provided by having the salt 304 inaccessible from outside the anonymizer 108 since the salt 304 used when hashing a static owner identifier 106 must be known in order to be able to determine the static owner identifier 106 from the output of the hash process 306.
The salt generator 308 may produce the cryptographic salt 304, which is then stored in volatile memory of the memory unit. Alternatively, the cryptographic salt 304 may be produced by the salt generator 308 each time it is required by the hash process 306. The salt 304 stored in the volatile memory may be stored in a secured area of the volatile memory so that it is inaccessible to processes external to the anonymizer 108. The salt 304 stored in the protected memory of the memory unit may be accessed by the hash process 306 as required. Additionally or alternatively, the salt 304 may be stored in non-volatile memory of the memory unit. By storing the cryptographic salt 304 in non-volatile storage, the same salt may be used even following a power failure or rebooting of the anonymizer 108, or the hardware that has been configured to implement the anonymizer 108.
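By way of non-limiting illustration only, storing the salt in non-volatile storage so that it survives a restart may be sketched as follows. The function `load_or_create_salt`, the file name `salt.bin` and the use of an ordinary file are assumptions of the sketch; a practical system would additionally restrict access to the stored salt, which is not shown here.

```python
import os
import secrets
import tempfile


def load_or_create_salt(path: str, n_bytes: int = 16) -> bytes:
    """Return the salt stored at path, generating and storing it if absent."""
    if os.path.exists(path):
        with open(path, "rb") as salt_file:
            return salt_file.read()
    salt = secrets.token_bytes(n_bytes)
    with open(path, "wb") as salt_file:
        salt_file.write(salt)
    return salt


# Usage: the second call stands in for a reboot of the anonymizer and
# reads back the salt generated by the first call.
demo_dir = tempfile.mkdtemp()
salt_path = os.path.join(demo_dir, "salt.bin")
first = load_or_create_salt(salt_path)
second = load_or_create_salt(salt_path)
```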
As described above, the hash processor 302 receives a static owner identifier and in combination with a machine generated cryptographic salt generates a hash output 310 (HASH1). The anonymizer 108 associates the hash output (HASH1) 310 with an anonymous identifier 312 (AID1). The hash output 310 and the associated anonymous identifier 312 may be stored, for example in a look-up table or other similar structure such as repository 314. The hash output 310 and the anonymous identifier 312 may be stored in non-volatile storage of the memory unit.
The anonymous identifier 312 is associated in a one-to-one relationship with the hash output 310. The anonymous identifier may be produced by an anonymous identifier generator 318. The anonymous identifier 312 may be a unique random number or string or a unique number provided in a sequential order. Each anonymous identifier is associated with a unique hash output. Before generating a new anonymous identifier, the anonymizer 108 may check the hash outputs 310 stored in the repository 314 to determine if the hash output is already associated with an anonymous identifier 312. If the hash output 310 is already stored in the repository 314 and associated with an anonymous identifier 312, a new anonymous identifier does not need to be created. If, however, the hash output 310 is not already stored in the repository 314, and so is not associated with an anonymous identifier 312, then a new anonymous identifier 312 is generated and the hash output 310 and new anonymous identifier 312 are then stored in the repository 314. The anonymous identifier 312 may be provided to third party processors 122.
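By way of non-limiting illustration only, the check-then-generate behaviour described above may be sketched as follows. The class `AidRepository` and its method `get_or_create` are hypothetical names, an in-memory dictionary stands in for the repository 314, and sequentially assigned integers are used as one of the anonymous identifier forms mentioned above.

```python
import itertools


class AidRepository:
    """Look-up table associating each hash output with one anonymous identifier."""

    def __init__(self):
        self._aid_by_hash = {}
        self._next_aid = itertools.count(1)  # sequentially assigned identifiers

    def get_or_create(self, hash_output: str) -> int:
        """Return the stored identifier for a hash output, creating one if needed."""
        aid = self._aid_by_hash.get(hash_output)
        if aid is None:
            # Hash output not yet stored: generate and store a new identifier.
            aid = next(self._next_aid)
            self._aid_by_hash[hash_output] = aid
        return aid


repo = AidRepository()
aid_first = repo.get_or_create("hash-of-owner-a")   # placeholder hash outputs
aid_again = repo.get_or_create("hash-of-owner-a")
aid_other = repo.get_or_create("hash-of-owner-b")
```

The same hash output always yields the same anonymous identifier, and distinct hash outputs yield distinct identifiers, preserving the one-to-one relationship.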
Once the anonymous identifier 312 associated with the hash output 310 is determined, either by creating a new anonymous identifier or retrieving it from the repository 314, it is associated with at least a portion of the data 104 (DATA1) 316 that was associated with the static owner identifier 106. DATA1 316 may be a portion of the data associated with the static owner identifier, or may be based on the data associated with the static owner identifier. Regardless of what DATA1 316 is, it is associated with the anonymous identifier 312 that in turn is associated with the hash output 310 of the static owner identifier 106. The anonymous identifier 312 and DATA1 316 may be stored in an anonymized repository 114. The anonymized repository 114 is depicted as being part of the anonymizer 108; however, rather than storing the anonymized data, the anonymizer may provide the anonymous identifier of a static owner identifier to another component or process external to the anonymizer 108 to be associated and stored with DATA1.
A third party processor 122 may access the anonymized repository 114 in order to process the anonymized data. All the data 316 that was originally associated with a particular static owner identifier 106 is associated with the same anonymous identifier so that all relationships between the data still exist; however, the anonymized data cannot be directly related back to a particular static owner identifier 106 or data owner.
Furthermore, the anonymizer may be configured to allow access to the data associated with an anonymous identifier. For example, a third party processor 122 may receive an identifier associated with a data owner and desire to retrieve data associated with the data owner from the anonymized repository 114. The third party processor 122 provides the identifier for which the anonymized data is requested. The identifier may then be used to determine the static owner identifier. The anonymizer may then determine the hash output using the static owner identifier and associated anonymous identifier stored in the repository 314. The anonymizer 108 may then be used to retrieve and provide the data associated with the anonymous identifier to the third party processor. By providing access to third parties, it is possible to allow the third parties to request data associated with a dynamic identifier, such as an IP address, and have the ISP provide the data from the anonymized data.
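By way of non-limiting illustration only, retrieval of anonymized data in response to an identifier supplied by a third party processor may be sketched as follows. All names and values (`SALT`, `ip_to_account`, `aid_by_hash`, `data_by_aid`, `lookup`, the example IP address and account identifier) are hypothetical, and the salt is fixed only so that the sketch is reproducible.

```python
import hashlib

SALT = b"example-salt"  # fixed for the sketch only; normally machine generated


def salted_hash(static_owner_id: str) -> str:
    """One-way hash of the salt and identifier (concatenation is an assumption)."""
    return hashlib.sha256(SALT + static_owner_id.encode("utf-8")).hexdigest()


# Hypothetical state held on the anonymizer side.
ip_to_account = {"203.0.113.7": "acct-42"}           # dynamic to static identifier
aid_by_hash = {salted_hash("acct-42"): 1}            # hash output to anonymous identifier
data_by_aid = {1: {"profile": "sports"}}             # anonymized repository


def lookup(identifier: str):
    """Resolve an identifier to its anonymized data, or None if unknown."""
    static_id = ip_to_account.get(identifier, identifier)
    aid = aid_by_hash.get(salted_hash(static_id))
    if aid is None:
        return None
    return data_by_aid.get(aid)


result = lookup("203.0.113.7")
```

The third party processor supplies only the identifier and receives only the data; the static owner identifier, the hash output and the salt remain within the anonymizer.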
The anonymizer 108b is similar to that of
The data 416a, 416b, 416c may be a portion of the data 104 associated with the static owner identifier 106. Additionally or alternatively some of the data 416a, 416b, 416c may be the same data as other data 416a, 416b, 416c. The data 416a, 416b, 416c may be different types of data received separately at the anonymizer 108b, or it may be different parts of data received at the anonymizer at the same time. The data 416a, 416b, 416c may also be derived from the received data 104. By providing multiple hash processors 402a, 402b, 402c it is possible to create separate anonymous identifiers for different pieces of data. As such, even if multiple pieces of data are provided to the same third party processor 122, the third party processor will not be able to associate data of one type from a particular data owner with data of another type from the same data owner since each type of data will be associated with a different anonymous identifier 312, 412a, 412b, 412c.
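By way of non-limiting illustration only, producing a separate hash output for each type of data belonging to the same data owner may be sketched as follows. The sketch uses a different salt per data type, which is only one of the options contemplated in this description; the names `DATA_TYPES`, `salts` and `hash_for_type` are hypothetical.

```python
import hashlib
import secrets

# Hypothetical data types, each handled by its own hash processor and salt.
DATA_TYPES = ("postal_code", "gender")
salts = {data_type: secrets.token_bytes(16) for data_type in DATA_TYPES}


def hash_for_type(static_owner_id: str, data_type: str) -> str:
    """Hash the identifier with the salt belonging to the given data type."""
    salt = salts[data_type]
    return hashlib.sha256(salt + static_owner_id.encode("utf-8")).hexdigest()


postal_hash = hash_for_type("acct-42", "postal_code")
gender_hash = hash_for_type("acct-42", "gender")
```

Because the two hash outputs differ, they are associated with different anonymous identifiers, and a third party processor receiving both pieces of data cannot join them on a common key.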
As described above, each hash processor 402a, 402b, 402c may use a different input instead of the static owner identifier 106 used by hash processor 302. An anonymizer 108, 108b may use different combinations of the inputs described herein. As depicted in
It will be appreciated that the same repository may be used to store the hash output from multiple hash processors and associated anonymous identifiers. However, if the same repository is used, an indication of the hash processor used to generate the hash output should also be stored in order to ensure that if two hash processors generate the same hash output, they will be associated with different anonymous identifiers. Additionally or alternatively, the hash processors may be configured such that given the same input they produce different hash outputs. This may be done for example by having each hash processor use different cryptographic salts, different hash processes, or both different salts and different hash processes.
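By way of non-limiting illustration only, a single repository shared by several hash processors may be sketched as follows, with an indication of the hash processor stored alongside each hash output. The class `SharedAidRepository` and the processor labels are hypothetical.

```python
import itertools


class SharedAidRepository:
    """One look-up table keyed by (hash processor indication, hash output)."""

    def __init__(self):
        self._aid_by_key = {}
        self._next_aid = itertools.count(1)

    def get_or_create(self, processor_id: str, hash_output: str) -> int:
        """Return the anonymous identifier for this processor and hash output."""
        key = (processor_id, hash_output)
        if key not in self._aid_by_key:
            self._aid_by_key[key] = next(self._next_aid)
        return self._aid_by_key[key]


shared = SharedAidRepository()
# Two hash processors that happen to produce the same hash output.
aid_a = shared.get_or_create("processor-302", "same-hash")
aid_b = shared.get_or_create("processor-402a", "same-hash")
aid_a_again = shared.get_or_create("processor-302", "same-hash")
```

Including the processor indication in the key ensures that identical hash outputs from different hash processors are still associated with different anonymous identifiers.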
Since the input to the second hash processor 402a will always be different than the input to the first hash processor 302, both the hash process 406a and the salt 404a used by the second hash processor 402a may be the same as used by the first hash processor 302. However, additional security may be provided by using different cryptographic salts for each of the hash processors.
As depicted in
The hash processors 302, 402a, 402b each produce a given respective output for each static owner identifier. In contrast the hash processor 402c uses as input a random number produced by a random number generator 408. Since the hash processor 402c uses a random number as an input, multiple pieces of data associated with the same static owner identifier will likely result in different hash outputs and so be associated with different anonymous identifiers.
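By way of non-limiting illustration only, a hash processor that takes a random number as input in place of the static owner identifier may be sketched as follows. The function `random_input_hash` and the 16-byte sizes are assumptions of the sketch.

```python
import hashlib
import secrets


def random_input_hash(salt: bytes) -> str:
    """Hash a freshly generated random number together with the salt."""
    random_input = secrets.token_bytes(16)  # stands in for the random number generator
    return hashlib.sha256(salt + random_input).hexdigest()


salt = secrets.token_bytes(16)
# Two pieces of data from the same data owner receive unrelated hash outputs.
first_output = random_input_hash(salt)
second_output = random_input_hash(salt)
```

Since neither output depends on the static owner identifier, the resulting anonymous identifiers cannot be used to link the two pieces of data to each other.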
Each of the anonymous identifiers 312, 412a, 412b, 412c is associated with a respective piece of data 316, 416a, 416b, 416c and stored in one or more anonymous repositories 114, 418a, 418b, 418c. Any one of the third party processors 122 may then access the anonymous data repositories in order to process the data.
The third party processors may provide different functionality. For example, a third party processor may process the data for an ISP, for example generating a user profile from click stream data. Additionally or alternatively, a third party processor may request the retrieval of data associated with an identifier. The third party processor may provide the identifier to the ISP and receive data in response. For example, the anonymized data may include a user profile associated with a subscriber of the ISP. The profile data is associated with an AID. A third party processor may be, for example, an advertisement delivery service that provides advertisements for display on web sites or with other media. The third party processor receives an IP address of a subscriber to provide an advertisement for. The third party processor provides the IP address to the ISP, which determines the AID, as described above, and then retrieves the profile associated with the AID and provides the profile to the third party processor. The third party processor may then use the retrieved data, for example to provide an advertisement based on the retrieved profile.
The anonymizer 108c is similar to the anonymizer 108 described above with reference to
As depicted in
The dynamic identifier translator 509 receives a dynamic identifier 506 associated with data 104 and uses the dynamic to static owner identifier translation table to determine the static owner identifier that is associated with the dynamic identifier. The dynamic translator 509 then provides the static owner identifier to the hash processor 302, which hashes the static owner identifier, associates the hash output 310 of the hash process 306 with an anonymous identifier 312 and associates the anonymous identifier 312 with data 316 as described above with regards to
Although
If the received identifier is determined to be a dynamic identifier (Yes at 704), a static owner identifier associated with the dynamic identifier is retrieved 706, for example using a dynamic to static owner identifier translation table. If the identifier is determined not to be a dynamic identifier (No at 704), the received identifier is used as the static owner identifier 708. Once the static owner identifier is determined (either at 706 or 708), a hash is performed using the static owner identifier and a generated cryptographic salt 710. Once the hash output is generated, it is determined if there is an anonymous identifier already associated with the hash output 712. If there is no anonymous identifier associated with the hash output (No at 712), a unique identifier is generated for the anonymous identifier 714 and the hash output and generated anonymous identifier are stored together 716. If the hash output is already associated with an anonymous identifier (Yes at 712), the anonymous identifier associated with the hash output is retrieved 718. The anonymous identifier is associated with a piece of data associated with the received identifier 720. The piece of data and anonymous identifier may be stored in a repository 722 and access to the data provided to one or more third party processors.
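By way of non-limiting illustration only, steps 704 to 722 may be sketched together as follows. The class `Anonymizer`, its attribute names and the example identifiers are hypothetical. Treating an identifier as dynamic whenever it appears in the translation table is an assumption of the sketch, as is the use of SHA-256 over the concatenated salt and identifier.

```python
import hashlib
import itertools
import secrets


class Anonymizer:
    """Minimal in-memory sketch of the method described above."""

    def __init__(self, translation_table):
        self._salt = secrets.token_bytes(16)         # generated cryptographic salt
        self._translation = dict(translation_table)  # dynamic to static identifiers
        self._aid_by_hash = {}                       # hash output to anonymous identifier
        self._next_aid = itertools.count(1)
        self.repository = []                         # (anonymous identifier, data) pairs

    def anonymize(self, identifier: str, data) -> int:
        # 704, 706, 708: determine the static owner identifier.
        static_id = self._translation.get(identifier, identifier)
        # 710: hash the static owner identifier with the salt.
        digest = hashlib.sha256(self._salt + static_id.encode("utf-8")).hexdigest()
        # 712, 714, 716, 718: retrieve or generate the anonymous identifier.
        aid = self._aid_by_hash.get(digest)
        if aid is None:
            aid = next(self._next_aid)
            self._aid_by_hash[digest] = aid
        # 720, 722: associate the identifier with the data and store both.
        self.repository.append((aid, data))
        return aid


anonymizer = Anonymizer({"198.51.100.4": "acct-7"})
aid_from_ip = anonymizer.anonymize("198.51.100.4", {"url": "example.com"})
aid_from_account = anonymizer.anonymize("acct-7", {"url": "example.org"})
aid_other_owner = anonymizer.anonymize("acct-8", {"url": "example.net"})
```

Data received under the dynamic identifier and under the corresponding static owner identifier is stored with the same anonymous identifier, while data of a different data owner receives a different one.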
The method of
The above description has described various systems and methods for anonymizing data. The systems and methods have been described with reference to various embodiments, and in particular to the implementation of the system and methods in an ISP network. The systems and methods described above can readily be adapted to anonymize data in environments or applications other than those described herein.
This application claims priority under 35 U.S.C. §119(e) to U.S. Provisional Application Ser. No. 61/225,203, filed on Jul. 13, 2009, the content of which is hereby incorporated by reference in its entirety.
Number | Date | Country
---|---|---
61225203 | Jul 2009 | US