The present invention relates to synthetic data generation techniques.
In the area of privacy protection, synthetic data as a substitute for personal information can be generated for analysis and the like when original data containing personal information cannot be handled due to concerns about security and the like. Consider here a case of creating synthetic data in tabular format from original data in tabular format. An example of data in tabular format is shown in
Non-patent Literatures 1 and 2 are known as conventional techniques for creating synthetic data in tabular format from original data in tabular format. When the table for which synthetic data is to be created has only numerical attributes, these conventional techniques generate synthetic data by formatting random numbers so that they maintain statistical properties (such as the variance-covariance, the correlation, and the mean vector) among attributes in the original data.
Non-patent Literature 1: Zhengli Huang, Wenliang Du, and Biao Chen. “Deriving private information from randomized data”, In Proceedings of the 2005 ACM SIGMOD international conference on Management of data, pages 37-48. ACM, 2005.
Non-patent Literature 2: Haoran Li, Li Xiong, and Xiaoqian Jiang. “Differentially private synthesization of multi-dimensional data using copula functions”, In Advances in database technology: proceedings. International Conference on Extending Database Technology, Vol. 2014, p. 475. NIH Public Access, 2014.
With these conventional approaches, however, the mean vector and the correlations can be maintained, but the frequency distribution and the maximum and minimum of each attribute are not. Thus, they have the problem that a large discrepancy arises between the original data and the synthetic data, for example when the data is visualized and analyzed.
An object of the present invention is to provide a synthetic data generation apparatus, a method for the same, and a program that are capable of generating synthetic data without a large discrepancy from the original data even when data is visualized and analyzed.
To attain the object, a synthetic data generation apparatus according to an aspect of the present invention includes: a random number generating unit that generates first synthetic data with a ratio of a frequency distribution of each attribute being approximate to the ratio of the frequency distribution of that attribute in target data for which synthetic data is to be generated; and a data formatting unit that formats the first synthetic data using a matrix given by Cholesky decomposition of a variance-covariance matrix of the target data or a scaling matrix given by singular value decomposition of the variance-covariance matrix of the target data such that a mean vector and a correlation matrix of the first synthetic data agree with a mean vector and a correlation matrix of the target data and that a minimum and a maximum of the first synthetic data are present in ranges of a minimum and a maximum of the target data, and provides the first synthetic data after formatting as synthetic data.
To attain the object, a synthetic data generation method according to another aspect of the present invention is for execution by the synthetic data generation apparatus and includes: a random number generating step of generating first synthetic data with a ratio of a frequency distribution of each attribute being approximate to the ratio of the frequency distribution of that attribute in target data for which synthetic data is to be generated; and a data formatting step of formatting the first synthetic data using a matrix given by Cholesky decomposition of a variance-covariance matrix of the target data or a scaling matrix given by singular value decomposition of the variance-covariance matrix of the target data such that a mean vector and a correlation matrix of the first synthetic data agree with a mean vector and a correlation matrix of the target data and that a minimum and a maximum of the first synthetic data are present in ranges of a minimum and a maximum of the target data, and providing the first synthetic data after formatting as synthetic data.
The present invention has the effect of being able to generate synthetic data without a large discrepancy from the original data even when data is visualized and analyzed.
An embodiment of the present invention is described below. In the drawings used in the following description, components having the same functions and steps that perform the same processing are given the same reference characters and overlapping description is avoided. In the following description, processing that is performed on each element of a vector or a matrix is intended to be applied to all the elements of the vector or matrix unless otherwise specified.
Consider a matrix Q given by Cholesky decomposition of the variance-covariance matrix of the original data. By multiplying the matrix Q by a proportionality coefficient p, synthetic data whose values lie within the range between the minimum and maximum of each attribute can be created while perfectly maintaining the mean vector and the correlation matrix of the original data and approximating the frequency distribution.
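As a brief check on why a single scalar coefficient leaves the correlations untouched, the following standard identity may help. The notation is assumed here for illustration: Y is a zero-mean random vector with variance-covariance matrix Σ_D = (σ_ij), μ_D is the mean vector, and p > 0. Scaling the Cholesky factor by p corresponds to scaling Y by p.

```latex
\operatorname{E}[\mu_D + pY] = \mu_D, \qquad
\operatorname{Cov}(\mu_D + pY) = p^{2}\,\Sigma_D, \qquad
\rho'_{ij}
  = \frac{p^{2}\sigma_{ij}}{\sqrt{p^{2}\sigma_{ii}}\,\sqrt{p^{2}\sigma_{jj}}}
  = \frac{\sigma_{ij}}{\sqrt{\sigma_{ii}\,\sigma_{jj}}}
  = \rho_{ij}.
```

The variances shrink by the factor p², but the mean vector and every correlation coefficient are unchanged, which is what allows p to be used for fitting the values into a given range.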
The synthetic data generation apparatus according to the first embodiment includes a random number generating unit 210 and a data formatting unit 230.
The synthetic data generation apparatus is a special device configured by loading of a special program into a well-known or a dedicated computer having a central processing unit (CPU), main storage unit (random access memory: RAM), and the like, for example. The synthetic data generation apparatus executes various kinds of processing under control of the central processing unit, for example. Data input to the synthetic data generation apparatus and data resulting from processing are stored in the main storage unit, for example, and the data stored in the main storage unit is read into the central processing unit and utilized for other processing as necessary. The processing components of the synthetic data generation apparatus may be at least partially composed of hardware such as an integrated circuit. Storages of the synthetic data generation apparatus can include the main storage unit such as random access memory (RAM), auxiliary storage unit composed of a hard disk, an optical disk, or a semiconductor memory device such as flash memory, or middleware such as a relational database or a key value store, for example.
The synthetic data generation apparatus according to the first embodiment takes, as input, original data D and the number of records n′ which is contained in synthetic data D′ to be generated, and generates and outputs synthetic data D′. Here, the synthetic data D′ ∈ R^{n′×d} perfectly maintains a mean vector μ_D and a correlation matrix of the original data D and approximates the frequency distribution, with data present in the ranges of the maximum and minimum of each attribute.
Data in tabular format such as the one shown in
The numerical attributes to which this embodiment is applicable include a date attribute. When this embodiment is applied to a date attribute, a target date in an original database is previously converted to a sequential value, such as m milliseconds earlier or m milliseconds later with respect to a particular date.
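A minimal sketch of such a conversion is shown below. The reference date `REFERENCE` and the function names are hypothetical choices made for illustration; the description only requires that dates become sequential values relative to some particular date.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical reference date; the description only says "a particular date".
REFERENCE = datetime(2000, 1, 1, tzinfo=timezone.utc)

def date_to_offset_ms(d, reference=REFERENCE):
    """Signed number of whole milliseconds from `reference` to `d`.

    Negative values mean `d` is earlier than the reference date.
    """
    return (d - reference) // timedelta(milliseconds=1)

def offset_ms_to_date(ms, reference=REFERENCE):
    """Inverse conversion, e.g. for turning synthetic values back into dates."""
    return reference + timedelta(milliseconds=ms)
```

After conversion, the date attribute can be handled like any other numerical attribute, and the generated values can be mapped back to dates with the inverse function.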
Input: original data D ∈ R^{n×d} and the number of records to be generated n′
Output: first synthetic data X ∈ R^{n′×d}
The random number generating unit 210 generates first synthetic data X with the ratio of the frequency distribution of each attribute being approximate to the ratio of the frequency distribution of that attribute in the original data D (S210), and outputs it. The accuracy of approximation is related to the magnitude of the number of records n′ contained in the synthetic data: the accuracy of approximation tends to be higher as n′ is greater.
For example, the random number generating unit 210 first calculates a frequency distribution of each attribute in the original data D.
Next, the random number generating unit 210 randomly generates an ith column vector so that the ratio of the frequency distribution of the ith attribute for the first synthetic data X is approximate to the ratio of frequency distribution h_i for the original data D. This operation is repeated for i of 1 through d. For a way of generating column vectors, various known techniques can be employed. For example, the rejection method or the inverse function method known from Reference Literature 1 or the like may be employed.
(Reference Literature 1) Kazumasa Wakimoto, “Knowledge of Random Numbers”, Morikita Publishing Co., Ltd., 1970, p.61-71
The random number generating unit 210 arranges the d generated column vectors in the same order as the order in the original data D and outputs it as the first synthetic data X in tabular format.
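The column-by-column generation described above can be sketched as follows, using the inverse function method on a histogram of each attribute. This is an illustrative sketch under stated assumptions, not the implementation of the random number generating unit 210: the function names, the use of NumPy, the choice of 10 equal-width bins, and the uniform spreading of values inside each bin are all assumptions.

```python
import numpy as np

def sample_column(values, n_new, rng, bins=10):
    """Draw n_new values whose histogram approximates that of `values`."""
    counts, edges = np.histogram(values, bins=bins)
    probs = counts / counts.sum()          # ratio of the frequency distribution
    cdf = np.cumsum(probs)
    # Inverse function method: map uniform numbers through the stepwise CDF.
    u = rng.random(n_new)
    idx = np.searchsorted(cdf, u, side="right")
    idx = np.minimum(idx, len(counts) - 1)  # guard against rounding in cdf[-1]
    # Place each value uniformly inside its selected bin.
    width = edges[idx + 1] - edges[idx]
    return edges[idx] + rng.random(n_new) * width

def generate_first_synthetic(D, n_new, seed=0, bins=10):
    """Generate X (n_new x d), one independent column per attribute of D."""
    rng = np.random.default_rng(seed)
    cols = [sample_column(D[:, i], n_new, rng, bins) for i in range(D.shape[1])]
    # Keep the columns in the same order as in the original data.
    return np.column_stack(cols)
```

Because each column is drawn independently, X reproduces only the per-attribute frequency distributions at this stage; the relationships among attributes are restored by the subsequent formatting.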
Input: original data D ∈ R^{n×d} and the first synthetic data X ∈ R^{n′×d}
Output: synthetic data D′ ∈ R^{n′×d}
The data formatting unit 230 formats the first synthetic data X (S230) using a matrix given by Cholesky decomposition of the variance-covariance matrix for the original data D such that the mean vector μ and the correlation matrix of the first synthetic data X agree with the mean vector μ_D and the correlation matrix of the original data D and that the minimum and maximum of the first synthetic data X are present in the ranges of the minimum and maximum of the original data D, and outputs the first synthetic data after formatting as the synthetic data D′.
For instance, the first synthetic data X is formatted in processes 1 to 11 below.
The configuration described above can generate synthetic data D′ whose values lie within the range between the minimum and maximum of each attribute while perfectly maintaining the mean vector and the correlation matrix of the original data D and approximating the frequency distribution. Because the generated synthetic data D′ perfectly maintains the mean vector and the correlation matrix of the original data D, exactly the same linear regression model is obtained as with the original data D. Particularly when the attributes in the original data D have similar ranges of values that they can assume, an approximation of the frequency distribution and the maximum/minimum of each attribute in the original data D can be maintained. Thus, synthetic data D′ without a large discrepancy from the original data D can be generated even when data is visualized and analyzed. For example, the frequency distribution of attributes in the original data D can be approximated without generating an implausible record such as one with a height of −170 cm.
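The following is a minimal sketch of one way such formatting could be realized. It is not a reproduction of the numbered processes of this embodiment. It assumes NumPy, nonsingular variance-covariance matrices, and a hypothetical rule for choosing the proportionality coefficient p (the largest value not exceeding 1 that keeps every attribute inside the original range); the function name is also hypothetical.

```python
import numpy as np

def format_synthetic(D, X):
    """Reshape X so its mean and correlation match D and its values fit D's range."""
    mu_D, mu_X = D.mean(axis=0), X.mean(axis=0)
    # Cholesky factors: Sigma_D = Q_D Q_D^T and Sigma = Q Q^T.
    Q_D = np.linalg.cholesky(np.cov(D, rowvar=False))
    Q = np.linalg.cholesky(np.cov(X, rowvar=False))

    # Whiten the centered X with Q^{-1}, then recolor with Q_D.
    W = np.linalg.solve(Q, (X - mu_X).T)   # shape (d, n'), identity covariance
    Z = (Q_D @ W).T                        # shape (n', d), covariance Sigma_D

    # Assumed rule for p: shrink deviations from the mean just enough
    # that mu_D + p * Z stays within [min, max] of each attribute of D.
    lo, hi = D.min(axis=0), D.max(axis=0)
    p_hi = (hi - mu_D) / Z.max(axis=0)
    p_lo = (mu_D - lo) / -Z.min(axis=0)
    p = min(1.0, p_hi.min(), p_lo.min())

    return mu_D + p * Z
```

A uniform p rescales all variances by p², so the mean vector and the correlation matrix are preserved exactly (up to floating-point error) while the range constraint is met. Regression coefficients, which depend on ratios of covariances to variances, are likewise unchanged.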
When there is no original data D but there is data to be reproduced (target data for which synthetic data is to be generated) in this embodiment, statistics of the target data (such as the frequency distribution of attributes, the mean vector, variance-covariance matrix, and the range of values that can be assumed by each attribute) may be used as input instead of the original data D. The original data D can also be considered as an example of the target data.
Although in this embodiment the frequency distribution of attributes in the original data D, the mean vector, the variance-covariance matrix, and the range of values that can be assumed by each attribute (the maximum and the minimum) are calculated in each unit, they may be calculated outside the units in advance and given as the input to the random number generating unit 210 and the data formatting unit 230 so that no calculation is performed in the units.
Although the first synthetic data X is formatted with Cholesky decomposition at the data formatting unit 230 in this embodiment, the first synthetic data X may be formatted with singular value decomposition. An example of such processing is described. For example, processing similar to this embodiment is performed for processes 1 to 7, 10, and 11, and processing is performed as follows in processes 8 and 9.
Similarly in process 4, singular value decomposition may be used instead of Cholesky decomposition. That is, the following process 4 is performed.
In theory, a variance-covariance matrix is a positive definite matrix, so that it is possible to calculate Q and Q_D that give Σ = QQ^T and Σ_D = Q_D Q_D^T by Cholesky decomposition. In numerical calculations on a computer, however, Q and Q_D often become unstable and cannot be calculated when the number of records n in the original data or the number of records n′ in X is small. Thus, rather than determining Q and Q_D directly by Cholesky decomposition, Q = UΛ^{1/2} and Q_D = U_D Λ_D^{1/2} can be calculated by calculating U and Λ, and U_D and Λ_D by singular value decomposition.
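The factorization Q = UΛ^{1/2} can be sketched as below. The function name and the use of NumPy are assumptions for illustration. The example deliberately uses fewer records than attributes, a case in which the sample variance-covariance matrix is rank deficient.

```python
import numpy as np

def factor_by_svd(sigma):
    """Return Q = U Lambda^{1/2}, which satisfies Q Q^T = sigma.

    `sigma` is assumed to be a symmetric positive semidefinite
    variance-covariance matrix.
    """
    U, s, _ = np.linalg.svd(sigma)
    # Broadcasting scales column k of U by sqrt(s[k]), i.e. U @ diag(sqrt(s)).
    return U * np.sqrt(s)

# Three records with five attributes: the covariance matrix has rank at most 2.
rng = np.random.default_rng(0)
small_D = rng.normal(size=(3, 5))
sigma_D = np.cov(small_D, rowvar=False)
Q_D = factor_by_svd(sigma_D)
```

Unlike a Cholesky factor, this Q is generally not triangular, but it reproduces Σ in the same way and remains computable when some singular values are zero or nearly zero.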
The above processing provides similar effects to the first embodiment.
The present invention is not limited to the above embodiment and modifications. For example, the above-described various kinds of processing may be executed, in addition to being executed in chronological order in accordance with the descriptions, in parallel or individually depending on the processing power of an apparatus that executes the processing or when necessary. In addition, changes may be made as appropriate without departing from the spirit of the present invention.
Further, various types of processing functions in the apparatuses described in the above embodiment and modifications may be implemented on a computer. In that case, the processing functions to be provided by each apparatus are described by a program. When this program is executed on the computer, the various types of processing functions in the above-described apparatuses are implemented on the computer.
This program in which the contents of processing are written can be recorded in a computer-readable recording medium. The computer-readable recording medium may be any medium such as a magnetic recording device, an optical disk, a magneto-optical recording medium, and a semiconductor memory.
Distribution of this program is implemented by sales, transfer, rental, and other transactions of a portable recording medium such as a DVD and a CD-ROM on which the program is recorded, for example. Furthermore, this program may be distributed by storing the program in a storage device of a server computer and transferring the program from the server computer to other computers via a network.
A computer which executes such a program first stores the program recorded in a portable recording medium or transferred from a server computer once in a storage thereof, for example. When the processing is performed, the computer reads out the program stored in the storage thereof and performs processing in accordance with the program thus read out. As another execution form of this program, the computer may directly read out the program from a portable recording medium and perform processing in accordance with the program. Furthermore, each time the program is transferred to the computer from the server computer, the computer may sequentially perform processing in accordance with the received program. Alternatively, a configuration may be adopted in which the transfer of a program to the computer from the server computer is not performed and the above-described processing is executed by a so-called application service provider (ASP)-type service by which the processing functions are implemented only by an instruction for execution thereof and result acquisition. It should be noted that the program includes information which is provided for processing performed by electronic calculation equipment and which is equivalent to a program (such as data which is not a direct instruction to the computer but has a property specifying the processing performed by the computer).
Moreover, the apparatuses are assumed to be configured with a predetermined program executed on a computer. However, at least part of these processing contents may be realized in a hardware manner.
Number | Date | Country | Kind |
---|---|---|---|
2017-199201 | Oct 2017 | JP | national |
Filing Document | Filing Date | Country | Kind |
---|---|---|---|
PCT/JP2018/037310 | 10/5/2018 | WO | 00 |