The present invention relates to anonymization of personal information.
Now that enormous quantities of data about individuals are being integrated, a corporation that handles personal information is required to consider the protection of privacy. A business operator that handles personal information must observe at least the Act on the Protection of Personal Information (hereinafter simply referred to as the Protection Law) and applicable laws and regulations. The Protection Law imposes management and administration obligations on the collection and use of personal information, and government ministries stipulate guidelines for concrete measures thereof.
One of the management measures stipulated by the guidelines is anonymization of personal information. For example, the Ministry of Health, Labour and Welfare requires that personal information be anonymized when personal information regarding medical care is provided to a third party, presented at a conference, or included in a report of a medical accident, unless it is particularly necessary. Further, the Ministry of Economy, Trade and Industry also lists the anonymization of personal information as a desirable measure when personal information is provided to a third party.
The simplest anonymizing processes for personal information are removing information that is capable of identifying an individual from the personal information, and obfuscating the information. An example of the former is processing that removes a name and an address, and examples of the latter are processing that converts an address into the unit of prefectural and city governments and processing that converts an age into a unit of 10 years. Hereinafter, when an object to be obfuscated is represented by a tree structure in accordance with the level of obfuscation, the tree is referred to as a generalization hierarchy tree.
However, even though the anonymization processing is performed, in some cases, if a plurality of attributes regarding the individual is combined, the individual may be identified. For example, if the combination by the address of the unit of prefectural and city governments and the age of a unit of 10 years is a very rare case, the individual may be specified. Therefore, in anonymization, it is required to further definitively remove the identifiability.
As a technology for removing the identifiability, there is an anonymization technology that sets a threshold and generates anonymous data in which every combination of attribute values included in the personal information data is guaranteed to occur at least the threshold number of times. This invention belongs to this kind of anonymization technology. This kind of anonymization technology is disclosed in Non-Patent Document 1.
K. LeFevre, D. J. DeWitt, and R. Ramakrishnan, “Incognito: Efficient Full-Domain K-Anonymity,” 2005 ACM SIGMOD International Conf. Management of Data, pp. 49-60, 2005 (Non-Patent Document 1) discloses that, by obfuscating attribute values in personal information data using a generalization hierarchy tree, it is guaranteed that every combination of attribute values occurs at least a threshold number of times in the disclosed data.
The technology of Non-Patent Document 1 requires that a generalization hierarchy tree defining the levels of obfuscation be defined separately for every attribute. Further, since all candidates which reach the threshold value or higher are output, the anonymous data to be used needs to be selected from among them. It is therefore difficult to automate the determination of which anonymous data is more useful.
The present invention has been made in an effort to appropriately protect personal information while lowering an operational cost of anonymization of personal information.
It is disclosed that a personal information anonymization device includes a personal information storing unit configured to store one or more pieces of personal information formed of an attribute value for every attribute; a generalization hierarchy tree automatic generation unit configured to select one attribute and automatically configure, for each attribute, a generalization hierarchy tree that represents broader concepts of each attribute value which occurs in the input personal information as a tree structure in accordance with a level of obfuscation, using a frequency obtaining unit that counts, for every attribute value that occurs in the selected attribute, the number of pieces of input personal information having the attribute value; and a unit configured to recode the input personal information using the generalization hierarchy tree generated for each attribute by the generalization hierarchy tree automatic generation unit. Therefore, the above-mentioned problems may be solved.
It is possible to reduce the operational cost accompanied by the automation and appropriately protect the personal information.
a) is a view illustrating an example of a generalization hierarchy tree table in the first embodiment.
b) is a view illustrating an example of a generalization hierarchy tree table in the first embodiment.
c) is a view illustrating an example of a generalization hierarchy tree table in the first embodiment.
a) is a view illustrating an example of a user defined hierarchy tree and a generalization hierarchy tree based on the user defined hierarchy tree in the third embodiment.
b) is a view illustrating an example of a user defined hierarchy tree and a generalization hierarchy tree based on the user defined hierarchy tree in the third embodiment.
c) is a view illustrating an example of a user defined hierarchy tree and a generalization hierarchy tree based on the user defined hierarchy tree in the third embodiment.
example in the third embodiment.
a) is a view illustrating an operational example in the third embodiment.
b) is a view illustrating an operational example in the third embodiment.
c) is a view illustrating an operational example in the third embodiment.
Hereinafter the best modes for carrying out the present invention will be described in detail with reference to the drawings.
The three embodiments which will be described below are technologies that mainly protect personal information in electronic format. The term “personal information” used in the embodiments means information about an individual which may identify a specific individual by name, date of birth, or other information. Further, information which may be easily cross-checked with other information to identify the specific individual may be included in the personal information. In the embodiments, the term “anonymization of the personal information” refers to processing that converts the personal information so that a subject of the information cannot be easily identified. Further, the term “recoding” means replacing an attribute value that describes an arbitrary attribute of an individual with a more ambiguous concept.
A configuration example of a device that implements a technology of a first embodiment will be described with reference to
The storage 103 is, for example, a storage medium such as a CD-R (compact disc recordable), a DVD-RAM (digital versatile disk random access memory), or a silicon disk, a driving device of the storage medium, or an HDD (hard disk drive). The storage 103 stores a personal information table 131, an anonymous information table 132, minimum identical value occurrence information 133, attribute type information 134, and a program 151. The personal information table 131 stores personal information regarding a plurality of individuals. In this embodiment, personal information for each individual is formed of item values for a plurality of items. The anonymous information table 132 stores the result of anonymizing the personal information table 131 according to the embodiment of the present invention. The minimum identical value occurrence information 133 stores a threshold value. The attribute type information 134 stores information types of attributes of the personal information table 131. The program 151 implements the functions which will be described below.
The input device 104 is, for example, a keyboard, a mouse, a scanner, or a microphone. The output device 105 is, for example, a display, a printer, or a speaker. The communication device 106 is, for example, a LAN (local area network) board and is connected to a communication network (not illustrated).
The CPU 101 loads the program 151 in the memory 102 and executes the program to implement a generalization hierarchy tree automatic generation unit 121 and a recoding unit 122. If necessary, the recoding unit 122 implements a lost information amount metric unit 123 as internal processing.
The generalization hierarchy tree automatic generation unit 121 has the personal information table 131 and the attribute type information 134 as an input to obtain a frequency of all attribute values from the attributes of the personal information table 131 and create a Huffman coding tree or a Shannon-Fano coding tree or Hu-Tucker coding tree from the obtained frequency information and type information of the attribute obtained from the attribute type information 134. The generalization hierarchy tree automatic generation unit 121 stores the created trees in a generalization hierarchy tree table 135 as a generalization hierarchy tree.
The recoding unit 122 has the personal information table 131, the minimum identical value occurrence information 133, and the generalization hierarchy tree table 135 as inputs, and recodes the attribute values in accordance with the generalization hierarchy tree corresponding to each attribute obtained from the generalization hierarchy tree table 135 so that every combination of attribute values present in the table occurs in at least as many records as the value stored in the minimum identical value occurrence information 133. The recoding unit 122 outputs the result to the anonymous information table 132. Further, the result may be output to the output device 105.
The lost information amount metric unit 123 is a part that quantitatively estimates an amount of information of data lost by recoding the attribute value and is called from the recoding unit 122, if necessary.
Next, a specific example of the above-mentioned tables will be described.
First, referring to
In
A first row of the table illustrated in
Information in the above-mentioned personal information table 131 is considered to be stored in advance.
Further, an item of personal information is not limited to the items illustrated in
Next, referring to
In the example of
Further, the value of the minimum identical value occurrences 301 is not limited to five, but may be arbitrarily set.
Next, referring to
The attribute type information 134 defines an information type of an attribute for designating a configuring method when a generalization hierarchy tree of an attribute to be anonymized is configured. Table 134-a of the example of
Next, referring to
Here, as described above, the generalization hierarchy tree table 135 is created by the generalization hierarchy tree automatic generation unit 121 by referring to the personal information table 131 and the attribute type information 134. First, a conceptual view of the generalization hierarchy tree 135-a1 created for the attribute “address” 201 is illustrated in
In
For example, the nodes 501 and 502 are internal nodes. A label 5031 and a frequency 5032 are associated with each node. An original attribute value is associated with each leaf as its label, and the number of occurrences of the attribute value in the personal information table is associated as its frequency. For example, the leaf 503 is labeled with “Bunkyo-ku, Tokyo” and the number of occurrences, 35, is associated as its frequency. An abstract concept that is capable of indicating all of the children is allocated as the label of an internal node, and the total of the frequencies of all of the children is allocated as its frequency.
For example, an attribute “address” 201 is a string manipulation type of a right-hand truncation type if the attribute type information 134 is referred to. Therefore, the node 503 “Bunkyo-ku, Tokyo” and the node 504 “Toshima-ku, Tokyo” are generalized to a more abstract concept as the same parent node 502 and “Tokyo” is allocated as a label of the node 502. Further, as a frequency of the node 502, the total frequencies of all of the children are associated. Similarly, a result that performs the string manipulation of the right-hand truncation type on the generalization hierarchy structure of all of the attribute values and outputs the generalization hierarchy structure as a tree structure is a generalization hierarchy tree 135-a1.
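The right-hand truncation described above can be sketched in Python as follows. This is a minimal illustration and not the implementation of the generalization hierarchy tree automatic generation unit 121. It assumes that the most specific component of an address is the first comma-separated component of the English rendering, and that the root is labeled “*”. The function names `parent_label` and `build_tree` are hypothetical, and every frequency other than the 35 occurrences of “Bunkyo-ku, Tokyo” is made up for the example.

```python
from collections import Counter

ROOT = "*"  # assumed label for the root node


def parent_label(label, sep=", "):
    """Return the label one level more general, or None for the root."""
    if label == ROOT:
        return None
    _head, found, rest = label.partition(sep)
    # Dropping the most specific component: "Bunkyo-ku, Tokyo" -> "Tokyo".
    return rest if found else ROOT


def build_tree(values):
    """Map each node label to [parent label, frequency]."""
    tree = {}
    for leaf, count in Counter(values).items():
        label = leaf
        # Walk from the leaf up to the root, adding the leaf count to
        # every ancestor so that a parent holds the total of its children.
        while label is not None:
            node = tree.setdefault(label, [parent_label(label), 0])
            node[1] += count
            label = node[0]
    return tree


# Illustrative data only.
addresses = (["Bunkyo-ku, Tokyo"] * 35
             + ["Toshima-ku, Tokyo"] * 20
             + ["Yokohama-shi, Kanagawa"] * 10)
tree = build_tree(addresses)
for label, (parent, freq) in tree.items():
    print(label, parent, freq)
```

Each printed row has the form (label, parent label, frequency), with the parent of the root shown as `None`.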
In
A first row 511 of the table 135-a2 indicates a label of each column and each record of second and subsequent rows corresponds to one node. In other words, a left column refers to a label of the node, a center column refers to a label of a parent node of the node, and a right column refers to a frequency of the node. For example, the record 512 corresponds to the node 501. Since the node 501 is a root, the node 501 does not have a parent. In this case, in the center column, a value which is referred to as “Null” is stored and a frequency 205 of the node 501 is stored in the right column. Similarly, a record corresponding to the node 502 is a record 513.
Further, the invention is not limited to an attribute of a string manipulation type of the right-hand truncation type, but a generalization hierarchy tree for an arbitrary attribute type may be stored in the storage by this method.
In
Further, the invention is not limited to an attribute of a string manipulation type of the right-hand truncation type, but a generalization hierarchy tree for an arbitrary attribute type may be managed on the memory by this method.
Next, referring to
The table 135-b2 represents a frequency of the attribute value of the attribute “age” 202 as a table which shows that the number of records having an attribute value “20” is 50, the number of records having an attribute value “25” is 35, the number of records having an attribute value “27” is 25, the number of records having an attribute value “33” is 40, and the number of records having an attribute value “38” is 55, and there is no record having other attribute values. In this example, the type of attribute values is limited to five kinds, but does not need to be limited thereto. When the order preservation type generalization hierarchy tree is constructed using the frequency table 135-b2, a generalization hierarchy tree 135-b1 is created.
Further, the generalization hierarchy tree 135-b1 is configured so as to preserve the size order of the attribute values, so that a label of an internal node may be designated in the form of a range. For example, a label of “20-27” may be designated for the node 531. In other words, the ranges indicated by the labels of two nodes which do not have an ancestor-descendant relationship do not overlap.
Referring to
Next, referring to
An example that configures the generalization hierarchy tree using frequency information 135-c2 is a tree 135-c1. Labels which are allocated to the internal nodes list labels of leaves which are lower-ranked than the internal node. For example, in the node 541, labels of “China, France, Germany, United States, England” are allocated, which may be interpreted as “China or France or Germany or United States or England”.
Referring to
Further, in the example of
Next, referring to
First, the generalization hierarchy tree automatic generation unit 121 automatically generates generalization hierarchy trees referring to the personal information table 131 and the attribute type information 134 and stores the result in the generalization hierarchy tree table 135 (S801). Next, referring to the personal information table 131, the minimum identical value occurrence information 133, and the generalization hierarchy tree table 135, the recoding unit 122 recodes data such that every combination of attribute values occurs in five or more records, as designated by the minimum identical value occurrences 301, and stores the result in the anonymous information table 132 (S802).
Further, in
Next, referring to
First, some notations will be defined. m refers to a total number (number of columns) of attributes of the personal information table 131. The columns of the personal information table 131 will be called as zeroth column, first column, . . . , m−1-th column in order from the left.
In
Next, it is checked whether j is smaller than m (S903). If j is equal to or larger than m, the processing is completed.
In the determination of the step S903, if j is smaller than m, an attribute type of a j-th attribute is obtained from the attribute type information 134 (S904) and the processing is conditionally branched in accordance with the result (S905).
If the attribute type of the attribute is the “string manipulation type” in the step S905, first, all attribute values of the j-th attribute that occur in the personal information table 131 are listed without omission (S911). Specifically, while all records are scanned, it is determined whether the attribute value corresponding to the j-th attribute is already listed. If the attribute value is not listed, the attribute value is listed. In order to determine whether to list an attribute value, for example, a data structure such as the set provided by the standard library of the programming language C++ may be used.
Next, the designated string manipulation is performed on the listed attribute values, an inclusive relationship is extracted, and a tree is configured based on the inclusive relationship (S912). The method of extracting the inclusive relationship depends on various known string manipulation methods. For example, in the case of string manipulation of the right-hand truncation type as illustrated in the example of
If the attribute type of the attribute is “order preservation type” in the step S905, first, frequency information of all attribute values of the j-th attribute is obtained (S921). Specifically, while all records are scanned, it is determined whether the attribute value corresponding to the j-th attribute of the record which is being currently scanned is already listed. If it is determined that the attribute value is listed, a counter that counts the frequency of the attribute value is increased by one. If it is determined that the attribute value is not listed, the counter of the frequency of the attribute value is set to 1. As a data structure, the map provided by the C++ standard library is used. The map associates a value with each element of a set such as the set described above. The element of the set is referred to as a key and the associated value is referred to as a value. When the scan of all records is completed, the frequencies of the attribute values are stored in the map.
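The listing step (S911) and the counting step (S921) can be sketched as follows. The sketch uses Python's built-in `set` and `dict` in place of the C++ `set` and `map` named in the text; unlike the C++ map, a Python `dict` keeps insertion order rather than key order, which does not matter for counting. The function names and the sample records are hypothetical.

```python
def distinct_values(records, j):
    """List every value of the j-th attribute exactly once (cf. S911)."""
    seen = set()
    for record in records:
        seen.add(record[j])  # adding an already listed value has no effect
    return seen


def attribute_frequencies(records, j):
    """Count the occurrences of every value of the j-th attribute (cf. S921)."""
    freq = {}
    for record in records:
        value = record[j]
        if value in freq:
            freq[value] += 1  # already listed: increase the counter by one
        else:
            freq[value] = 1   # not listed yet: set the counter to 1
    return freq


# Illustrative records of the form (address, age).
records = [
    ("Bunkyo-ku, Tokyo", 20),
    ("Toshima-ku, Tokyo", 25),
    ("Bunkyo-ku, Tokyo", 20),
    ("Bunkyo-ku, Tokyo", 33),
]
age_freq = attribute_frequencies(records, 1)
address_values = distinct_values(records, 0)
```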
Next, using the frequency information of the j-th attribute obtained above, the Hu-Tucker coding tree is configured, which becomes a generalization hierarchy tree of the attribute (S922). As a method of configuring the coding tree, a method disclosed in Non-Patent Literature “D. E. Knuth, “The Art of Computer Programming: Volume 3 Sorting and Searching,” Addison-Wesley, pp. 439-444, 1973” may be used. Also in this case, similarly to the step S912, a label may be appropriately allocated to the node. Further, in the case of “order preservation type”, as described above, as a range where the attribute values do not overlap, a label of the internal node may be allocated. After completing the processing of the step S922, the sequence proceeds to processing of the step S941 which will be described below.
If the attribute type of the attribute is “the others” in the step S905, first, frequency information of all attribute values of the j-th attribute is obtained (S931), which is identical to the processing of the step S921.
Next, using the frequency information of the j-th attribute obtained above, the Huffman coding tree or the Shannon-Fano coding tree is configured, which becomes the generalization hierarchy tree of the attribute (S932). Which coding tree is used is determined by a designer of the computer 100 in advance. Further, as a method of configuring the Huffman coding tree, a method disclosed in Non-Patent Literature “T. S. Han and K. Kobayashi, “Mathematics of Information and Coding,” American Mathematical Society, pp. 99-105, 2002” is used. As a method of configuring the Shannon-Fano coding tree, a method disclosed in Non-Patent Literature “T. S. Han and K. Kobayashi, “Mathematics of Information and Coding,” American Mathematical Society, pp. 95-96, 2002” is used. After completing the processing of the step S932, the sequence proceeds to processing of the step S941 which will be described below.
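As an illustration of the step S932, the following Python sketch builds a Huffman coding tree from a frequency table by repeatedly merging the two least frequent nodes. It is a textbook construction using the standard `heapq` module and is not taken from the cited literature. The dictionary-based node layout, the function names, and the nationality frequencies are assumptions made for the example, and the sketch assumes the frequency table is not empty. The label of an internal node lists the labels of the leaves below it.

```python
import heapq
from itertools import count


def huffman_tree(freq):
    """Build a Huffman coding tree from {attribute value: frequency}."""
    tiebreak = count()  # unique sequence number so heap entries never compare nodes
    heap = [(n, next(tiebreak), {"label": v, "freq": n, "children": []})
            for v, n in freq.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        n1, _, a = heapq.heappop(heap)  # least frequent node
        n2, _, b = heapq.heappop(heap)  # second least frequent node
        parent = {"label": a["label"] + ", " + b["label"],
                  "freq": n1 + n2,
                  "children": [a, b]}
        heapq.heappush(heap, (n1 + n2, next(tiebreak), parent))
    return heap[0][2]


def leaf_depths(node, depth=0, out=None):
    """Return {leaf label: depth} for the tree rooted at node."""
    out = {} if out is None else out
    if not node["children"]:
        out[node["label"]] = depth
    for child in node["children"]:
        leaf_depths(child, depth + 1, out)
    return out


# Hypothetical frequencies for a "nationality" attribute.
nationality_freq = {"Japan": 60, "United States": 20, "China": 10,
                    "France": 6, "Germany": 4}
root = huffman_tree(nationality_freq)
depths = leaf_depths(root)
```

In this example the most frequent value ends up directly below the root and the rarest values end up deepest, so rare values are the first to be merged into a common label during recoding.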
After completing the processing of the step S912, S922, or S932, the frequency information of the nodes of the generalization hierarchy tree configured in the steps is updated (S941). Further, a detailed updating method will be described below with reference to
Next, the configured generalization hierarchy tree is stored in the generalization hierarchy tree table 135 (S942), j+1 is substituted in j (S943), and then the sequence returns to the evaluation of the above-mentioned step S903.
Since j monotonically increases, j eventually becomes equal to or larger than m and the processing terminates. Therefore, the generalization hierarchy trees for all attributes may be configured as described above.
Referring to
First, frequency information of all attribute values of the j-th attribute is obtained (S1001). The step S1001 is identical to the step S921.
Next, the obtained frequency information is allocated to the corresponding leaves of the generalization hierarchy tree of the j-th attribute (S1002). Specifically, a frequency obtained in the step S1001 is substituted in the frequency 5215 of the data structure of the corresponding leaf, which is carried out for all leaves.
A routine of
The routine of
Next, 0 is substituted in i (step S1005).
Next, it is determined whether i is smaller than p (S1006). If i is equal to or larger than p, the sequence proceeds to a step S1010 which will be described below.
In the determination of the step S1006, if i is smaller than p, it is determined whether a frequency is already allocated to the i-th child (S1007). If the frequency is already allocated, i+1 is substituted in i (S1009), and then the sequence returns to the step S1006.
In the determination of the step S1007, if the frequency is not allocated to the i-th child yet, the routine of
In the determination of the step S1006, if i is equal to or larger than p, the total number of frequencies of zero-th, first, . . . , p−1-th child is set as a frequency of the node (S1010).
By doing this, frequencies of all nodes may be set.
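The frequency update can be sketched recursively in Python as follows. The `Node` class is a hypothetical stand-in for the node data structure of the embodiment, the two-level shape of the example tree is an assumption, and the leaf frequencies are those given for the attribute “age” in the table 135-b2. Leaf frequencies are allocated first (cf. S1002); each internal node then takes the total of its children, recursing into any child whose frequency has not been allocated yet (cf. S1005 to S1010).

```python
class Node:
    """Hypothetical tree node: a label, child nodes, and a frequency."""

    def __init__(self, label, children=None):
        self.label = label
        self.children = children or []
        self.freq = None  # None means "not allocated yet"


def assign_leaf_frequencies(node, leaf_freq):
    """Allocate the counted frequency to every leaf (cf. S1002)."""
    if not node.children:
        node.freq = leaf_freq.get(node.label, 0)
    for child in node.children:
        assign_leaf_frequencies(child, leaf_freq)


def update_frequency(node):
    """Set the frequency of an internal node to the total of its children."""
    if not node.children:
        return  # a leaf keeps the frequency allocated to it
    total = 0
    for child in node.children:
        if child.freq is None:       # frequency not allocated yet
            update_frequency(child)  # recurse into the child first
        total += child.freq
    node.freq = total


age_freq = {"20": 50, "25": 35, "27": 25, "33": 40, "38": 55}
low = Node("20-27", [Node("20"), Node("25"), Node("27")])
high = Node("33-38", [Node("33"), Node("38")])
root = Node("20-38", [low, high])

assign_leaf_frequencies(root, age_freq)
update_frequency(root)
```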
Next, referring to
First, the personal information table 131 and the generalization hierarchy tree table 135 are loaded on the memory (S1101). The generalization hierarchy tree table 135 is specifically managed on the memory using the above-mentioned data structure 521. Further, as described above, the automatic generation S801 of the generalization hierarchy trees and the recoding S802 are performed at different timings. Therefore, if the generalization hierarchy trees are corrected or have been corrected, the generalization hierarchy tree automatic generation unit 121 needs to update the frequency information of the generalization hierarchy trees using the method of
Next, an empty list v in which the nodes are stored is prepared (S1102) and 0 is substituted in j (step S1103). Nodes are stored in the list v prepared in the step S1102, and each stored element e indicates a candidate for recoding the labels of the children of e to the label of e. The list is dynamically changed in the processing of the step S802.
Next, it is determined whether j is smaller than m (S1104). If it is determined that j is smaller than m, in the j-th generalization hierarchy tree, all nodes in which all children are leaves are added to v (step S1105). j+1 is substituted in j (S1106) and the sequence returns to the step S1104.
In the determination of S1104, if it is determined that j is equal to or larger than m, it is determined whether the number of occurrences of every tuple of all attribute data that occurs in the personal information table on the memory is k or larger (S1107). Specifically, a data structure such as a map is prepared, and if the tuple of all attribute data indicated by a record is present in the key set of the map, the count stored as its value is incremented by one. If the tuple is not present in the key set, the tuple is added as a key with a value of 1. The above processing is carried out for all records. It is then determined whether every value stored in the map is k or larger.
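The determination of the step S1107 can be sketched in Python as follows, with `collections.Counter` playing the role of the map. The function name `is_k_anonymous` and the sample records are hypothetical.

```python
from collections import Counter


def is_k_anonymous(records, k):
    """Return True if every tuple of attribute values occurs k or more times."""
    counts = Counter(tuple(record) for record in records)  # tuple -> occurrences
    return all(n >= k for n in counts.values())


# Illustrative records of the form (address, age).
records = [
    ("Tokyo", "20-27"),
    ("Tokyo", "20-27"),
    ("Tokyo", "33-38"),
]
print(is_k_anonymous(records, 1))  # every tuple occurs at least once
print(is_k_anonymous(records, 2))  # ("Tokyo", "33-38") occurs only once
```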
In the determination of the step S1107, if it is determined that the number of occurrences of some data tuple is smaller than k, a loop of the step S1108 is processed. The loop is carried out on all elements w in v.
In the loop S1108, a lost information amount when an attribute value of all records having a label of a node of a child of w as data is recoded to a label of w is calculated by the lost information amount metric unit 123 (S1109). The method of calculating the lost information amount will be described below.
After completing the loop S1108, the labels of all records having, as data, a label of a child of the node u having the least lost information amount in v are recoded to the label of u (S1110).
Next, all children of u are deleted and u is used as a leaf so that the generalization hierarchy tree including u is updated (S1111).
Next, if a parent of u is t and all children of t are leaves, t is added to v (S1112) and the sequence returns to the evaluation of the step S1107.
In the determination of the step S1107, if it is determined that the number of occurrences of every tuple of attribute data is k or larger in the personal information table on the memory, the recoded result on the memory is written in the anonymous information table 132 (S1113), and the processing is completed.
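The recoding loop of the steps S1102 to S1113 can be sketched in Python as follows. This is a simplified illustration under several assumptions: each generalization hierarchy tree is given as a hypothetical `{label: parent label}` dictionary, counts are recomputed from the records instead of being read from stored node frequencies, the lost information amount is computed with base 2, and the loop also stops if no candidate remains. The function name `recode` and the sample data are made up for the example.

```python
import math
from collections import Counter


def recode(records, parents, k):
    """Greedily recode records until every tuple occurs k or more times.

    records: list of records, one label per attribute.
    parents: one {label: parent label} dict per attribute (root maps to None).
    """
    records = [list(r) for r in records]
    m = len(parents)
    children = []  # per attribute: {internal node label: set of child labels}
    for j in range(m):
        ch = {}
        for label, parent in parents[j].items():
            if parent is not None:
                ch.setdefault(parent, set()).add(label)
        children.append(ch)

    def all_leaves(j, w):
        return all(c not in children[j] for c in children[j][w])

    def loss(j, w):
        # Total of count(c) * -log2(count(c) / count(w)) over the children c of w.
        counts = [sum(1 for r in records if r[j] == c) for c in children[j][w]]
        total = sum(counts)
        return sum(-n * math.log2(n / total) for n in counts if n)

    def k_anonymous():
        return all(n >= k for n in Counter(map(tuple, records)).values())

    # Candidate list v: nodes all of whose children are leaves (cf. S1105).
    v = [(j, w) for j in range(m) for w in children[j] if all_leaves(j, w)]
    while not k_anonymous() and v:
        j, u = min(v, key=lambda cand: loss(*cand))  # least lost information
        for r in records:
            if r[j] in children[j][u]:
                r[j] = u                             # cf. S1110
        del children[j][u]                           # u becomes a leaf (cf. S1111)
        v.remove((j, u))
        t = parents[j].get(u)
        if t is not None and all_leaves(j, t):
            v.append((j, t))                         # cf. S1112
    return records


age_parents = {"20": "20-27", "25": "20-27", "27": "20-27",
               "33": "33-38", "38": "33-38",
               "20-27": "*", "33-38": "*", "*": None}
data = [["20"], ["25"], ["27"], ["33"], ["33"], ["33"], ["38"], ["38"], ["38"]]
result = recode(data, [age_parents], k=2)
```

In this example only the rare values “20”, “25”, and “27” are merged into “20-27”, while “33” and “38” already occur often enough and are left unchanged.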
Next, referring to
First, a variable I in which the finally calculated lost information amount is stored is initialized to 0 (S1201). A loop S1202 is a loop for all children c of a node w.
In the loop S1202, a lost information amount i when one record having a label of c as data is recoded into a label of w is calculated (S1203). A method of calculating the lost information amount will be described below. Next, count(c)*i is added to I (S1204). Here, count(c) refers to the total number of records having a label of c as data in the personal information table on the memory, and the calculation is a multiplication of real numbers. Specifically, count(c) may be obtained by referring to the frequency 5215 of the node.
After completing the loop S1202, I is returned and the processing is completed.
Next, referring to
The amount of information of data that is lost when one record having a label of c as data is recoded into a label of w is calculated by −log{count(c)/count(w)} (S1205). Although 2 is usually used as the base of the logarithm, changing the base only multiplies the lost information amount by a constant factor. Therefore, any base may be used, provided that the same base is used throughout the system. As described above, count(c) refers to the total number of records having a label of c as data in the personal information table on the memory.
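The calculation of the steps S1201 to S1205 can be sketched in Python as follows. The function name `lost_information` and its parameters are hypothetical. The sketch takes count(w) and the list of count(c) for the children c of w as inputs, and it skips children with a count of 0, for which the logarithm is undefined; that guard is an assumption of the example.

```python
import math


def lost_information(count_w, child_counts, base=2):
    """Total information lost when all children of w are recoded to w."""
    total = 0.0                                   # I is initialized to 0 (cf. S1201)
    for count_c in child_counts:                  # loop over all children c (cf. S1202)
        if count_c == 0:
            continue                              # no record carries this label
        i = -math.log(count_c / count_w, base)    # loss per record (cf. S1205)
        total += count_c * i                      # add count(c) * i to I (cf. S1204)
    return total


# Two children with 2 records each under a parent with 4 records:
# each record loses 1 bit, so 4 bits are lost in total.
print(lost_information(4, [2, 2]))
```

Merging a rare value into a frequent sibling loses less in total than merging two equally frequent values, which is why candidates with small frequencies tend to be recoded first.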
Further, in the calculating method of a lost information amount at the time of receding as illustrated in
As described above, a feature of the computer 100 is that it includes a method that automatically configures the generalization hierarchy tree and a method of calculating a lost information amount. The Hu-Tucker coding tree, the Huffman coding tree, and the Shannon-Fano coding tree are trees in which an attribute value having a smaller frequency is disposed in a deep position and an attribute value having a larger frequency is disposed in a shallow position as described above. Therefore, at the time of recoding, since the possibility of recoding the attribute values having smaller frequencies into the same label increases, highly useful anonymous data may be generated while avoiding excessive recoding. Further, if the above-mentioned coding trees are used as the generalization hierarchy tree, the lost information amount at the time of recoding may be reduced.
Next, a second embodiment will be described. The second embodiment improves the usability of data. Hereinafter, when the second embodiment is described, configurations which overlap the first embodiment are denoted by the same reference numerals and the description thereof will be omitted. Further, most operations of the second embodiment are the same as in the first embodiment. The same operations are denoted by the same reference numerals, and the description thereof will be omitted.
First, referring to
In
Next, referring to
The generation information table 1332, as illustrated in
Next, referring to
In
Referring to
First, the anonymous information table 132 and the generalization hierarchy tree table 135 are obtained on the memory (S1601). After obtaining the tables, the following processing will be carried out on a loop for all records r (S1602) and a loop for all attributes of a record r as an internal loop (S1603). However, an attribute which is being currently processed is referred to as a j-th attribute.
First, the node of the generalization hierarchy tree to which the attribute value of the j-th attribute of the record r corresponds is specified, and the node is defined as w (S1604). Next, all leaves that are descendants of w are listed, which are referred to as c1, c2, . . . , cn (S1605). Specifically, a searching method such as breadth-first search from w may be used. Once the search is performed, the search result may be stored in association with the node and then reused.
Next, the j-th attribute of the record r, which is labeled as w, is replaced with a label of one leaf of the generalization hierarchy tree by the method described below (S1606). Using the frequency information of the nodes stored in the generalization hierarchy tree, one of c1, c2, . . . , cn is randomly generated such that c1 is selected with a probability of count(c1)/count(w), c2 is selected with a probability of count(c2)/count(w), and so on, and the attribute value is replaced with the label of the generated node.
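The random replacement of the step S1606 can be sketched in Python with the standard `random.choices` function, which draws an element with probability proportional to its weight. Because count(w) is the total of the counts of the leaves below w, using count(ci) as the weight gives the probability count(ci)/count(w). The function name `sample_leaf`, the seed, and the leaf frequencies are assumptions made for the example.

```python
import random


def sample_leaf(leaves, rng=random):
    """Pick one leaf label with probability count(ci) / count(w).

    leaves: list of (label, count) pairs for the leaves c1, ..., cn below w.
    """
    labels = [label for label, _ in leaves]
    weights = [count for _, count in leaves]
    return rng.choices(labels, weights=weights, k=1)[0]


# Hypothetical leaves below a node labeled "10-19".
leaves_below_w = [("10", 3), ("14", 5), ("19", 2)]
rng = random.Random(0)  # seeded only to make the example repeatable
samples = [sample_leaf(leaves_below_w, rng) for _ in range(1000)]
```

Over many draws the sampled labels approach the frequencies of the original values, which matches the expectation that the distribution of the generated table approaches that of the original table.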
Finally, all records are stored in the generation information table 1332 (S1607).
The feature of the computer 100 configured in the second embodiment is that, since each attribute value of the generation information table 1332 is taken from the same set of values as the attribute values of the original personal information table 131, the applications that can use the data are not restricted. For example, if there is a record indicating that the age is 10 years old, in many cases, the record may be stored in the memory as an integer. If the data is recoded to “10-19 years old”, it is difficult to represent the record as an integer, and the record cannot be used in an arbitrary application. However, in the second embodiment, the record is replaced with an age within “10-19 years old” using the frequency information. For example, the record is replaced with “14 years old”. Therefore, the record may be represented as an integer and may be used in any application which may be used for the original personal information. Further, it is expected that the distribution of the attributes of the generation information table 1332 approaches the distribution of the original personal information table 131.
Further, in the second embodiment, even though it is described that a step of configuring the anonymous information table 132 is included, a method that configures the anonymous information table 132 in advance as described above and performs only the pseudo-personal information generation unit 1321 later is also suggested. According to the method, the personal information table 131 is not necessary so that the system may be configured only by the anonymous information table 132, the generalization hierarchy tree table 135, and the pseudo-personal information generation unit 1321. Therefore, by externally depositing only the anonymous information and generalization hierarchy tree, an available system may be constructed and the personal information does not need to be deposited so that the system has high anonymity.
Next, a third embodiment will be described.
The third embodiment uses a classification of the attribute values desired by a user to improve the availability of data. Predetermined classifications exist in various fields, such as the international classification of diseases, library classifications, and patent classifications. For age as well, frequently used classifications such as 10's or 20's exist. The third embodiment automatically generates a generalization hierarchy tree that takes the user-desired classification into account; the user defines in advance, as a generalization hierarchy tree, only the hierarchy structure that the user desires. For example, the age classification is defined in advance as “20 to 24 years old” and “25 to 29 years old” so as to prevent the data from being recoded into a classification that departs from the user-desired classification, such as “24 to 27 years old”.
Further, when the generalization hierarchy tree is configured, the third embodiment permits a node to be added as long as it does not depart from the user defined hierarchy tree. For example, if the user defines a classification of “20 to 24 years old”, configuring a node “20 to 22 years old” as a child of the node “20 to 24 years old” is permitted. Further, if the user defines “*”, which includes all attribute values, as the parent of “20 to 24 years old”, a node “20 to 29 years old” may be newly added as a parent of “20 to 24 years old”. By permitting the addition of hierarchies that do not depart from the user defined hierarchy tree, more detailed anonymous data may be output while the classification desired by the user is still used.
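The acceptance rule for added nodes can be sketched for an age attribute as follows. This is only an illustrative reading, assuming each classification is an inclusive integer range and that a candidate node "does not depart" from the user defined hierarchy when it is nested with, or disjoint from, every user-defined range; the function name `is_compatible` is hypothetical.

```python
def is_compatible(candidate, user_ranges):
    """Return True if `candidate` nests with or is disjoint from every range.

    Ranges are inclusive (low, high) tuples. A candidate that only partly
    overlaps a user-defined range would cut across the user's classification.
    """
    c_lo, c_hi = candidate
    for lo, hi in user_ranges:
        disjoint = c_hi < lo or hi < c_lo
        nested = (lo <= c_lo and c_hi <= hi) or (c_lo <= lo and hi <= c_hi)
        if not (disjoint or nested):
            return False
    return True

# User-defined classification from the example in the text.
user_ranges = [(20, 24), (25, 29)]

child_ok = is_compatible((20, 22), user_ranges)      # inside "20 to 24"
parent_ok = is_compatible((20, 29), user_ranges)     # contains both classes
crossing_ok = is_compatible((24, 27), user_ranges)   # straddles the boundary
```

Under these assumptions the first two candidates are accepted and “24 to 27 years old” is rejected, matching the examples given in the text.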
Hereinafter, when the third embodiment is described, configurations which overlap the first embodiment are denoted by the same reference numerals and the description thereof will be omitted. Further, some of operations of the third embodiment are the same as in the first embodiment. The same operations are denoted by the same reference numerals, and the description thereof will be omitted.
First, referring to
In
A CPU 101 loads the program 1731 into a memory 102 and implements a generalization hierarchy tree automatic generation unit 1721 based on the user defined hierarchy tree, and a recoding unit 122. If necessary, the recoding unit 122 implements a lost information amount metric unit 123 as internal processing.
The user defined hierarchy tree table 1732 stores the definition of a classification desired by a user for an arbitrary attribute. The user does not need to define a user defined hierarchy tree for all attributes to be anonymized, but may define one only for the attributes for which the user wants to define a classification. Further, as described above, the user may define only the desired classification for an attribute and does not need to define all hierarchies. Further, for any attribute type, whether “string manipulation type”, “order preservation type”, or “the others”, the classification should be defined such that, among nodes that do not have a grandparent-grandchild relationship with each other, the attribute values that become grandchildren of each node do not overlap. For example, a classification such as “25 to 38 years old” and “20 to 33 years old”, or a classification such as “{Yokohama-shi, Kanagawa-ken, Kawasaki-shi, Kanagawa-ken}” and “{Yokohama-shi, Kanagawa-ken, Fujisawa-shi, Kanagawa-ken}”, may not be defined.
Referring to
First, referring to
Referring to
In
In
Next, referring to
a-1) is an example of the user defined hierarchy tree of an attribute “address” of the string manipulation type and
b-1) illustrates an example of the user defined hierarchy tree of an order preservation type attribute “age” and
c-1) illustrates an example of the user defined hierarchy tree of the other attribute “nationality” and
In
Next, referring to
First, the generalization hierarchy tree automatic generation unit 1721 based on a user defined hierarchy tree automatically generates a generalization hierarchy tree by referring to the personal information table 131, the attribute type information 134, and the user defined hierarchy tree table 1732, and stores the result in the generalization hierarchy tree table 135 (S2001). Next, the recoding unit 122 recodes the data and stores the result in the anonymous information table 132 (S802). The step S802 is the same as in the first embodiment. As with the relationship between the steps S801 and S802 in the first embodiment, the steps S2001 and S802 do not need to be performed consecutively; they may be processed at different times.
Next, referring to
First, the personal information table 131 and the user defined hierarchy tree table 1732 are loaded into the memory 102 (S2101). At this time, it is checked whether the classifications defined in the user defined hierarchy trees overlap. Specifically, among the nodes that configure the user defined hierarchy trees, for nodes that do not have a grandparent-grandchild relationship with each other, it is checked that the grandchildren of the nodes do not overlap. If the grandchildren overlap, the processing is terminated.
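The overlap check performed when the tables are loaded might be sketched as below. The sketch assumes that each node of a user defined hierarchy tree is described by the set of attribute values it covers, and that the tree is given as a child-to-parent mapping; the names `is_ancestor`, `find_overlaps`, `covered`, and `parent` are hypothetical.

```python
from itertools import combinations

def is_ancestor(parent, a, b):
    """True if node `a` lies on the path from node `b` up to the root."""
    node = parent.get(b)
    while node is not None:
        if node == a:
            return True
        node = parent.get(node)
    return False

def find_overlaps(covered, parent):
    """List pairs of unrelated nodes whose covered value sets intersect."""
    overlaps = []
    for a, b in combinations(covered, 2):
        related = is_ancestor(parent, a, b) or is_ancestor(parent, b, a)
        if not related and covered[a] & covered[b]:
            overlaps.append((a, b))
    return overlaps

# The disallowed age classification from the text: two sibling ranges that
# share the ages 25 to 33.
covered = {
    "*": set(range(20, 39)),
    "25 to 38": set(range(25, 39)),
    "20 to 33": set(range(20, 34)),
}
parent = {"25 to 38": "*", "20 to 33": "*"}
overlaps = find_overlaps(covered, parent)
```

A non-empty result corresponds to the case in which the processing is stopped rather than continuing with an inconsistent classification.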
Steps S902 and S903 are equal to those of the first embodiment.
In the step S2102, it is determined whether a user defined hierarchy tree for the j-th attribute is present. If the user defined hierarchy tree is not present, the sequence proceeds to the step S2103. If the user defined hierarchy tree is present, the sequence proceeds to the step S2104. Details of the steps S2103 and S2104 will be described below. After completing the processing of the step S2103 or S2104, the sequence proceeds to the processing of the step S943.
The processing of the step S943 is equal to that of the first embodiment.
Referring to
Next, referring to
The processing of the steps S904 and S905 is the same as the above description. In the step S905, if the attribute type of the attribute is a “string manipulation type”, the sequence proceeds to the step S2311, if the attribute type of the attribute is an “order preservation type”, the sequence proceeds to the step S2321, and if the attribute type of the attribute is “the others”, the sequence proceeds to the step S2331. The details of the steps S2311, S2321, and S2331 will be described below. After completing the processing of the step S2311, S2321, or S2331, the sequence proceeds to the step S942.
The processing of the step S942 is the same as the above description.
Referring to
First, some notations will be defined. y refers to a hierarchy number of the deepest hierarchy of the user defined hierarchy tree 1732. “*” which includes all attribute values is a hierarchy 0 and the lower hierarchies are referred to as a hierarchy 1, a hierarchy 2, . . . , a hierarchy y.
The step S911 is equal to that of the first embodiment.
In the step S2401, a parameter x is initialized to y.
Next, it is checked whether x is smaller than 0 (S2402). If x is smaller than 0, the processing ends. In contrast, if x is equal to or larger than 0, the sequence proceeds to the step S2403.
In the step S2403, a user defined hierarchy tree having a j-th attribute is used to prepare a list z in which all nodes of the hierarchy x are listed.
In the step S2404, it is determined whether the list z is empty. If the list z is empty, the sequence proceeds to the step S2407. If the list z is not empty, the sequence proceeds to the step S2405.
In the step S2405, one node is selected from the list z and the selected node is deleted from the list z.
In the step S2411, nodes which are grandchildren of the node selected in the step S2405 are listed. Specifically, if a node that does not have a child in the user defined hierarchy tree 1732 is selected in the step S2405, attribute values which are the grandchildren of the node are listed using the attribute value information obtained in the step S911. For example, if a node of “Kawasaki-shi, Kanagawa-ken” is selected, attribute values including the string “Kawasaki-shi, Kanagawa-ken” are listed. Further, if a node having a child in the user defined hierarchy tree 1732 is selected in the step S2405, the nodes defined as children of the node in the user defined hierarchy tree 1732 are listed. For example, if a node of “{Yokohama-shi, Kanagawa-ken, Kawasaki-shi, Kanagawa-ken}” is selected, “Yokohama-shi, Kanagawa-ken” and “Kawasaki-shi, Kanagawa-ken”, which are defined as children of “{Yokohama-shi, Kanagawa-ken, Kawasaki-shi, Kanagawa-ken}” in the user defined hierarchy tree 1732, are listed.
In the step S2412, the designated string manipulation is carried out on the nodes listed in the step S2411 and an inclusive relationship is extracted. A tree having the node selected in the step S2405 as a root is configured based on the inclusive relationship. As in the first embodiment, the method of configuring the tree depends on various known string manipulation methods. The configured tree becomes a part of a generalization hierarchy tree based on the user defined hierarchy tree. The user defined hierarchy tree is updated using the configured tree.
In the step S2406, frequency information of the tree configured in the step S2412 is updated. The processing of the step S2406 will be described below. After completing the processing of the step S2406, the sequence returns to the evaluation of the above-mentioned step S2404.
In the step S2407, x−1 is substituted in x and the sequence returns to the evaluation of the above-mentioned step S2402.
As described above, when the attribute type is the “string manipulation type” attribute, the generalization hierarchy tree is configured based on the user defined hierarchy tree.
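The loop over hierarchy levels in the steps S2401 to S2407 can be sketched as follows. This is a simplified illustration rather than the embodiment itself: it assumes the user defined hierarchy tree is a parent-to-children mapping, it lists grandchildren of a childless node by a plain substring test, and it attaches them directly instead of grouping them by a further string manipulation as in the step S2412. All function and variable names are hypothetical.

```python
def depth(children, root):
    """Hierarchy number y of the deepest level below `root` (root is 0)."""
    kids = children.get(root, [])
    return 0 if not kids else 1 + max(depth(children, k) for k in kids)

def nodes_at_level(children, root, level):
    """List the nodes found `level` steps below `root`."""
    current = [root]
    for _ in range(level):
        current = [c for n in current for c in children.get(n, [])]
    return current

def expand_user_tree(children, root, attribute_values):
    """Walk levels y..0 and attach attribute values under childless nodes."""
    tree = {n: list(c) for n, c in children.items()}
    x = depth(children, root)                      # S2401: x starts at y
    while x >= 0:                                  # S2402
        z = nodes_at_level(children, root, x)      # S2403
        while z:                                   # S2404
            node = z.pop(0)                        # S2405
            if not children.get(node):             # S2411, childless case
                tree[node] = [v for v in attribute_values
                              if node in v and v != node]
        x -= 1                                     # S2407
    return tree

user_tree = {
    "*": ["Kanagawa-ken"],
    "Kanagawa-ken": ["Yokohama-shi, Kanagawa-ken", "Kawasaki-shi, Kanagawa-ken"],
}
addresses = [
    "Naka-ku, Yokohama-shi, Kanagawa-ken",
    "Nishi-ku, Yokohama-shi, Kanagawa-ken",
    "Saiwai-ku, Kawasaki-shi, Kanagawa-ken",
]
expanded = expand_user_tree(user_tree, "*", addresses)
```

Processing the deepest level first means that each user-defined node is already complete by the time its parent is visited.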
Referring to
First, in the step S2501, frequency information of nodes which become leaves of a partial tree which is a frequency information updating target is obtained. Here, the partial tree which is the frequency information updating target indicates a tree configured in the step S2412 and nodes which become leaves of the partial tree indicate all nodes listed in the step S2411.
In the step S2502, the frequency information obtained in the step S2501 is allocated to the corresponding leaves.
In the step S2503, a routine of
The routine of
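For orientation only, the frequency update of a partial tree can be sketched as below. The routine invoked in the step S2503 is not reproduced in this passage, so the sketch makes an explicit assumption: the frequency of an internal node is the sum of the frequencies of its children. The function name `update_frequencies` and the sample counts are hypothetical.

```python
def update_frequencies(children, root, leaf_frequency):
    """Assign leaf counts (S2501, S2502) and sum them upward (assumed S2503).

    `children` maps a node to its child nodes; nodes absent from the
    mapping are leaves. Returns a dict node -> frequency.
    """
    freq = {}

    def visit(node):
        kids = children.get(node, [])
        if not kids:
            freq[node] = leaf_frequency.get(node, 0)
        else:
            freq[node] = sum(visit(k) for k in kids)
        return freq[node]

    visit(root)
    return freq

# Hypothetical partial tree rooted at "20 to 24 years old".
partial_tree = {
    "20-24": ["20-22", "23-24"],
    "20-22": ["20", "21", "22"],
    "23-24": ["23", "24"],
}
leaf_counts = {"20": 2, "21": 1, "22": 4, "23": 3, "24": 5}
frequencies = update_frequencies(partial_tree, "20-24", leaf_counts)
```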
Next, referring to
The processing of the steps S921, S2401, S2402, S2403, S2404, and S2405 is the same as described above.
In the step S2421, frequency information of nodes which become grandchildren of the node selected in the step S2405 is obtained. Specifically, if a node that does not have a child in the user defined hierarchy tree 1732 is selected in the step S2405, the frequency information of the attribute values which become grandchildren of the node is obtained using the attribute value information obtained in the step S921. Further, if a node that has a child in the user defined hierarchy tree 1732 is selected in the step S2405, frequency information of the nodes which are defined as children of the node in the user defined hierarchy tree 1732 is obtained. For example, if a node of “20 to 24 years old” is selected in the user defined hierarchy tree 1732, frequency information for the attribute values “20 years old”, “21 years old”, “22 years old”, “23 years old”, and “24 years old” is obtained.
In the step S2422, using the frequency information obtained in the step S2421, a Hu-Tucker coding tree having the node selected in the step S2405 as a root is configured. The user defined hierarchy tree is updated using the configured tree.
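To show what an order-preserving tree under a selected node looks like, the following sketch builds one by a simple greedy rule: repeatedly merging the adjacent pair with the smallest combined frequency. This greedy rule is a stand-in chosen for brevity and is not the Hu-Tucker algorithm; it keeps the leaf order but does not guarantee the optimal tree that Hu-Tucker coding produces. The function names and sample counts are hypothetical.

```python
def order_preserving_tree(values, frequency):
    """Build a binary tree over `values` without changing their order.

    Greedy stand-in (not Hu-Tucker): merge the adjacent pair whose summed
    frequency is smallest until one tree remains. Trees are nested tuples.
    """
    items = [(v, frequency.get(v, 0)) for v in values]
    while len(items) > 1:
        i = min(range(len(items) - 1),
                key=lambda k: items[k][1] + items[k + 1][1])
        (left, lw), (right, rw) = items[i], items[i + 1]
        items[i:i + 2] = [((left, right), lw + rw)]
    return items[0][0]

def leaves(tree):
    """Return the leaves of a nested-tuple tree from left to right."""
    if isinstance(tree, tuple):
        return leaves(tree[0]) + leaves(tree[1])
    return [tree]

# Hypothetical frequencies under the node "20 to 24 years old".
ages = [20, 21, 22, 23, 24]
age_counts = {20: 2, 21: 1, 22: 4, 23: 3, 24: 5}
age_tree = order_preserving_tree(ages, age_counts)
```

Only adjacent values are ever merged, so every internal node covers a contiguous age range, which is the property required of an order preservation type attribute.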
The processing of the steps S2406 and S2407 is the same as described above.
As described above, when the attribute type is the “order preservation type” attribute, the generalization hierarchy tree is configured based on the user defined hierarchy tree.
Next, referring to
The processing of the steps S931, S2401, S2402, S2403, S2404, and S2405 is the same as described above.
In the step S2431, frequency information of attribute values of nodes which become grandchildren of the node selected in the step S2405 is obtained. Specifically, if a node that does not have a child in the user defined hierarchy tree 1732 is selected in the step S2405, the frequency information of the attribute values which become grandchildren of the node is obtained using the attribute value information obtained in the step S931. Further, if a node that has a child in the user defined hierarchy tree 1732 is selected in the step S2405, frequency information of the nodes which are defined as children of the node in the user defined hierarchy tree 1732 is obtained. For example, if “Europe” is selected in the user defined hierarchy tree 1732, frequency information of “England”, “France”, and “Germany” is obtained.
In the step S2432, using the frequency information obtained in the step S2431, a Huffman coding tree or a Shannon-Fano coding tree is configured. As in the first embodiment, which coding tree is used is determined in advance by a designer of the computer 100. The user defined hierarchy tree is updated using the configured tree.
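A Huffman coding tree over the frequency information of one node's children can be built with a priority queue, as in the minimal sketch below. The function name `huffman_tree`, the nested-tuple representation, and the sample counts for the children of “Europe” are hypothetical; the counter exists only to break ties between equal weights in a defined order.

```python
import heapq
from itertools import count

def huffman_tree(frequency):
    """Build a Huffman tree as nested tuples from a value -> count mapping."""
    if not frequency:
        raise ValueError("frequency information must not be empty")
    tiebreak = count()
    heap = [(w, next(tiebreak), v) for v, w in frequency.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        # Merge the two least frequent subtrees into one new subtree.
        w1, _, a = heapq.heappop(heap)
        w2, _, b = heapq.heappop(heap)
        heapq.heappush(heap, (w1 + w2, next(tiebreak), (a, b)))
    return heap[0][2]

# Hypothetical frequency information for the children of "Europe".
europe_counts = {"England": 5, "France": 3, "Germany": 2}
europe_tree = huffman_tree(europe_counts)
```

Rare values are merged first and therefore sit deeper in the tree, so they are generalized together before frequent values are.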
The processing of the steps S2406 and S2407 is the same as described above.
As described above, when the attribute type is the “the others” attribute, the generalization hierarchy tree is configured based on the user defined hierarchy tree.
A feature of the computer 100 configured in the third embodiment is that only some of the attributes, and only the part of the hierarchy that carries the user-desired classification, are defined as a user defined hierarchy tree, and a generalization hierarchy tree that takes the classification desired by the user into account is then generated automatically. Further, because the generalization hierarchy tree is automatically generated using frequency information, data may be anonymized with only a small lost information amount.
Reference Signs List
100 Computer
101 CPU
102 Memory
121 Generalization Hierarchy Tree Automatic Generation Unit
122 Recoding Unit
123 Lost Information Amount Metric Unit
103 Storage
131 Personal Information Table
132 Anonymization Information Table
133 Minimum Identical Value Occurrence Information
134 Attribute Type Information
135 Generalization Hierarchy Tree Table
151 Program
104 Input Device
105 Output Device
106 Communication Device
107 Internal Communication Line
1321 Pseudo-personal Information Generation Unit
1331 Program
1332 Generation Information Table
1721 Generalization Hierarchy Tree Generation Unit Based on User Defined Hierarchy Tree
1731 Program
1732 User Defined Hierarchy Tree Table
Number | Date | Country | Kind |
---|---|---|---|
2010-114885 | May 2010 | JP | national |
Filing Document | Filing Date | Country | Kind | 371c Date |
---|---|---|---|---|
PCT/JP2011/058590 | 4/5/2011 | WO | 00 | 1/22/2013 |