An aspect of the present invention relates to a data processing device, a data processing method, and a program for making effective use of data containing missing data.
With advancements in IoT (Internet of Things) technologies, an environment is being formed in which household electronics such as hemadynamometers and bathroom scales are connected to a network and health data such as blood pressure and body weight measured in daily life are collected through the network. Health data is often supposed to be measured regularly and in many cases contains information representing dates and times of measurement along with the measured values. One issue with health data is that data is easily missing due to forgetting to measure, a fault of a measurement device, and the like. Such missing data can lead to reduced accuracy in the analysis of health data.
For data analysis with missing data taken into account, a learning method has been proposed that considers the effect of missing data by using an array representing the missing data so as to minimize an error only in portions without missing data (see Patent Literature 1, for example).
Patent Literature 1: International Publication No. WO 2018/047655
However, one possible issue with the analysis of data containing missing data is that the amount of usable data is reduced. Particularly when the overall size of acquired data is small or when the proportion of missing data is large relative to the overall size of the data, analysis disregarding missing data can result in a small amount of valid data.
For example, for health data that is measured multiple times a day, like blood pressure, some of the measured values for a day can be missing.
Another issue is that the degree of missingness is not taken into account.
The present invention has been made in view of such circumstances and an object thereof is to provide a data processing device, a data processing method and a program for making effective use of data containing missing data.
To solve the above issues, a first aspect of the present invention provides a data processing device including: a data acquisition section that acquires a series of data containing missing data; a statistics calculation section that calculates a representative value of data and a validity ratio which represents a proportion of valid data being present from the series of data according to a predefined unit of aggregation; and a learning section that performs learning of an estimation model so as to minimize an error which is based on a difference between an output resulting from inputting the representative value and the validity ratio to the estimation model, and the representative value.
According to a second aspect of the present invention, in the first aspect, the learning section inputs to the estimation model an input vector made up of elements which are a concatenation of a predefined number of representative values and validity ratios corresponding to the respective representative values.
According to a third aspect of the present invention, in the second aspect, when X is defined as a vector with elements being the predefined number of representative values, W is defined as a vector with elements being validity ratios corresponding to the respective elements of X, and Y is defined as an output vector resulting from inputting the input vector to the estimation model, the learning section performs the learning of the estimation model so as to minimize an error L represented by L=|W·(Y−X)|².
According to a fourth aspect of the present invention, the first aspect further includes a first estimation section that, when a series of data containing missing data to be subjected to estimation is acquired by the data acquisition section, inputs representative values of the data and validity ratios representing the proportion of valid data being present calculated from the series of data by the statistics calculation section according to the unit of aggregation to the learned estimation model, and outputs an output from intermediate layers of the estimation model in response to the input as a feature value for the series of data.
According to a fifth aspect of the present invention, the first aspect further includes a second estimation section that, when a series of data containing missing data to be subjected to estimation is acquired by the data acquisition section, inputs representative values of the data and validity ratios representing the proportion of valid data being present calculated from the series of data by the statistics calculation section according to the unit of aggregation to the learned estimation model, and outputs an output from the estimation model in response to the input as estimated data with the missing data interpolated.
According to the first aspect of the present invention, a representative value of data and a validity ratio which represents the proportion of valid data being present are calculated from a series of data containing missing data according to a predefined unit of aggregation, and the estimation model is learned so as to minimize an error which is based on a difference between an output value resulting from inputting input values based on the representative value and the validity ratio to the estimation model, and the representative value.
As a result, even if the acquired series of data contains missing data, all the data can be effectively utilized as information per unit of aggregation without discarding data, by calculating representative values and validity ratios as statistics according to a predefined unit of aggregation and using them for learning. Also, because not only whether data is missing but also the proportion of valid data being present in each unit of aggregation is calculated and used for learning, effective learning that takes into account even the degree of missingness can be performed.
According to the second aspect of the present invention, an input vector made up of elements which are a concatenation of a predefined number of representative values and validity ratios corresponding to the respective representative values is input to the estimation model and used for the learning of the estimation model. This enables learning to be performed with reliable association between the representative value and the validity ratio for each unit of aggregation without requiring complicated data processing even in a case where a learning data group contains missing data without regularity.
According to the third aspect of the present invention, learning of the estimation model is performed so as to minimize the error L=|W·(Y−X)|², which is calculated from the vector X with elements being a predefined number of representative values, the vector W with elements being validity ratios corresponding to the respective elements of X, and the vector Y resulting from inputting the input vector to the estimation model. As a result, the validity ratio is applied to both the input-side vector X and the output-side vector Y, enabling learning of the estimation model to be performed using an error that explicitly takes into account the degree of missingness.
According to the fourth aspect of the present invention, when a series of data containing missing data to be subjected to estimation is acquired, representative values of the data and validity ratios representing the proportion of valid data being present which are calculated from the series of data according to the unit of aggregation are input to the learned estimation model, and an output from intermediate layers of the estimation model in response to the input is output as a feature value for the series of data. This can provide a feature value that takes into account even the degree of missingness for a series of data containing missing data, allowing a more accurate grasp of the features of the series of data.
According to the fifth aspect of the present invention, when a series of data containing missing data to be subjected to estimation is acquired, representative values of the data and validity ratios representing the proportion of valid data being present which are calculated from the series of data according to the unit of aggregation are input to the learned estimation model, and an output from the estimation model in response to the input is output as estimated data with the missing data interpolated. This can provide an estimation result that takes into account even the degree of missingness for a series of data containing missing data.
Accordingly, the aspects of the present invention can provide techniques for making effective use of data containing missing data.
In the following, embodiments of the present invention are described with reference to the drawings.
(Configuration)
The data processing device 1 is managed by a medical institution, a healthcare center, or the like, for example, and is composed of a server computer or a personal computer, for example. The data processing device 1 can acquire a series of data (also referred to as a "data group") containing missing data, such as health data, via a network NW or via an input device not shown in the figure. The data processing device 1 may be installed on a stand-alone basis or may be provided as an extension of a terminal of a medical professional such as a doctor, an Electronic Medical Records (EMR) server installed in an individual medical institution, an Electronic Health Records (EHR) server installed in an individual region including multiple medical institutions, or even a cloud server of a service provider. Furthermore, the data processing device 1 may be provided as an extension of a user terminal or the like possessed by a user.
The data processing device 1 according to an embodiment includes an input/output interface unit 10, a control unit 20, and a storage unit 30.
The input/output interface unit 10 includes one or more wired or wireless communication interface units, for example, enabling transmission and reception of information to/from external apparatuses. The wired interface can be a wired LAN, for example, and the wireless interface can be an interface that supports low-power wireless data communication standards such as wireless LAN and Bluetooth (a registered trademark), for example.
For example, the input/output interface unit 10 performs processing for receiving data transmitted from a measurement device such as a hemadynamometer having communication capability, or accessing a database server to read stored data, and passing the data to the control unit 20 for analysis, under control of the control unit 20. The input/output interface unit 10 can also perform processing for outputting instruction information entered via an input device (not shown), such as a keyboard, to the control unit 20. The input/output interface unit 10 can further perform processing for outputting results of learning or results of estimation output from the control unit 20 to a display device (not shown) such as a liquid crystal display, or transmitting them to external apparatuses over the network NW.
The storage unit 30 uses non-volatile memory capable of dynamic writing and reading, e.g., an HDD (Hard Disk Drive) or an SSD (Solid State Drive), as a storage medium, and includes a data storage section 31, a statistics storage section 32, and a model storage section 33, in addition to a program storage section, as storage areas necessary for implementing this embodiment.
The data storage section 31 is used to store a data group for analysis acquired via the input/output interface unit 10.
The statistics storage section 32 is used to store statistics calculated from the data group.
The model storage section 33 is used to store an estimation model for estimating, from a data group containing missing data, a data group in which the missing data has been interpolated.
The storage sections 31 to 33, however, are not essential components; the data processing device 1 may acquire necessary data from a measurement device or a user device as needed. Alternatively, the storage sections 31 to 33 need not be built into the data processing device 1, but may be provided in an external storage medium such as a USB memory, or in a storage device such as a database server located in a cloud, for example.
The control unit 20 has a hardware processor such as a CPU (Central Processing Unit) and an MPU (Micro Processing Unit), not shown in the figure, and memories such as DRAM (Dynamic Random Access Memory) and SRAM (Static Random Access Memory), and includes a data acquisition section 21, a statistics calculation section 22, a vector generation section 23, a learning section 24, an estimation section 25, and an output control section 26 as processing functions necessary for implementing this embodiment. These processing functions are all implemented by execution of programs stored in the storage unit 30 by the processor. The control unit 20 may also be implemented in any of other various forms, including an integrated circuit such as ASIC (Application Specific Integrated Circuit) and FPGA (field-programmable gate array).
The data acquisition section 21 performs processing for acquiring a data group for analysis via the input/output interface unit 10 and storing it in the data storage section 31.
The statistics calculation section 22 performs processing for reading data stored in the data storage section 31, calculating statistics according to a predefined unit of aggregation, and storing the result of calculation in the statistics storage section 32. In an embodiment, statistics include a representative value of data included in each unit of aggregation and a validity ratio representing the proportion of valid data included in each unit of aggregation.
The vector generation section 23 performs processing for reading statistics stored in the statistics storage section 32 and generating a vector made up of a predefined number of elements. In an embodiment, the vector generation section 23 generates a vector X with elements being a predefined number of representative values, and a vector W with elements being validity ratios corresponding to the respective elements of the vector X. The vector generation section 23 outputs the generated vector X and vector W to the learning section 24 in a learning phase and to the estimation section 25 in an estimation phase.
The learning section 24 performs, in the learning phase, processing for reading the estimation model stored in the model storage section 33 and inputting the vector X and vector W received from the vector generation section 23 into the estimation model for learning of parameters of the estimation model. In an embodiment, the learning section 24 inputs a vector formed from a concatenation of the elements of the vector X and the elements of the vector W to the estimation model, and acquires a vector Y which is output by the estimation model in response to the input. Then, the learning section 24 performs processing for learning the parameters of the estimation model so as to minimize an error which is calculated based on a difference between the vector X and the vector Y and updating the estimation model stored in the model storage section 33 as necessary.
The estimation section 25 reads, in the estimation phase, the learned estimation model stored in the model storage section 33 and inputs the vector X and vector W received from the vector generation section 23 into the estimation model to perform data estimation processing. In an embodiment, the estimation section 25 inputs a vector formed from a concatenation of the elements of the vector X and the elements of the vector W to the learned estimation model, and outputs the vector Y or a feature value Z of intermediate layers which is output by the estimation model in response to the input to the output control section 26 as an estimation result.
The output control section 26 performs processing for outputting the vector Y or the feature value Z output by the estimation section 25. Alternatively, the output control section 26 can output parameters related to the learned estimation model stored in the model storage section 33.
(Operation)
Next, information processing operations of the data processing device 1 configured as described above are described. The data processing device 1 can accept an instruction signal entered by an operator through an input device, for example, and operate in either the learning phase or the estimation phase.
(1) Learning Phase
When the learning phase is set, the data processing device 1 executes learning processing for the estimation model as follows.
(1-1) Acquisition of Learning Data
First, at step S201, the data processing device 1 acquires a series of data containing missing data as learning data via the input/output interface unit 10, and stores the acquired data in the data storage section 31 under control of the data acquisition section 21.
In addition, acquired data can also include a user ID, a device ID, information representing measurement dates/times and the like along with numerical data representing blood pressure measurements.
(1-2) Calculation of Statistics
Next, at step S202, the data processing device 1 performs processing for reading data stored in the data storage section 31 and calculating statistics according to a preset unit of aggregation under control of the statistics calculation section 22. The unit of aggregation is assumed to be set by the operator, designer, administrator and the like of the data processing device 1 as desired according to data type, for example, and stored in the storage unit 30. The statistics calculation section 22 reads the setting for the unit of aggregation stored in the storage unit 30, divides the data read from the data storage section 31 according to the unit of aggregation, and calculates statistics.
In the example described here, the unit of aggregation is one day, and the representative value is calculated as an average of the valid data measured on that day.
The validity ratio indicates the proportion of valid data being present in the unit of aggregation. For example, when only some of the measurements expected within a day are present, the validity ratio takes a continuous value between 0 and 1 according to the proportion actually present, rather than a binary present/absent value.
The results thus calculated by the statistics calculation section 22 can be stored in the statistics storage section 32 as statistics data in association with identification numbers identifying the units of aggregation and/or date information, for example.
The unit of aggregation is not limited to per day but any unit can be employed. For example, it may be set to a certain time width such as per several hours, per three days and per week, or it may be a unit defined by the number of data containing missing data without using time information. Furthermore, units of aggregation may overlap each other. For example, they may be set such that with respect to a particular date, a moving average is calculated from data corresponding to two days, or the day before that date and the date in question.
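As one concrete illustration of this statistics calculation, the following Python sketch computes a per-day representative value (the average of valid data) and a per-day validity ratio. The assumption of three scheduled measurements per day and the record layout are hypothetical choices for illustration, not fixed by this embodiment.

```python
# A minimal sketch of the statistics calculation (step S202), assuming
# blood pressure is scheduled to be measured three times per day and the
# unit of aggregation is one day. Record layout is hypothetical.
from collections import defaultdict

EXPECTED_PER_DAY = 3  # assumed measurement schedule, not specified by the source

def daily_statistics(records):
    """records: iterable of (date_str, value_or_None) pairs.

    Returns {date: (representative_value, validity_ratio)}, where the
    representative value is the average of valid data for the day and
    the validity ratio is the proportion of valid measurements present.
    """
    by_day = defaultdict(list)
    for date, value in records:
        by_day[date].append(value)

    stats = {}
    for date, values in by_day.items():
        valid = [v for v in values if v is not None]
        # Average over valid data only; 0.0 as a placeholder for a fully missing day.
        representative = sum(valid) / len(valid) if valid else 0.0
        validity_ratio = len(valid) / EXPECTED_PER_DAY
        stats[date] = (representative, validity_ratio)
    return stats

measurements = [
    ("2019-09-01", 128.0), ("2019-09-01", 131.0), ("2019-09-01", None),
    ("2019-09-02", None), ("2019-09-02", None), ("2019-09-02", None),
]
print(daily_statistics(measurements))
# {'2019-09-01': (129.5, 0.666...), '2019-09-02': (0.0, 0.0)}
```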
(1-3) Generation of Vectors
Next, at step S203, the data processing device 1 performs processing for reading the statistics data stored in the statistics storage section 32 and generating two types of vectors (the vector X and the vector W) for use in the learning of the estimation model under control of the vector generation section 23.
The vector generation section 23 selects a preset number (n) of units of aggregation from the statistics data that has been read, extracts the representative values and validity ratios from the respective ones of the n units of aggregation, and generates the vector X (x1, x2, . . . , xn) with elements being n representative values and the vector W (w1, w2, . . . , wn) with elements being n validity ratios corresponding to the respective elements of the vector X. The number n of elements corresponds to ½ of the number of input dimensions of the estimation model to be learned, as mentioned later, and the number of input dimensions of the estimation model can be set as desired by the designer, administrator, and the like of the data processing device 1. The number N of vector pairs (vector X and vector W) to be generated corresponds to the number of samples of learning data, and this number N can also be set as desired.
For example, where the number of elements is set as n=3 and the number of vector pairs is set as N=2, two pairs, each consisting of a three-element vector X and a three-element vector W, are generated from the statistics data.
The vector generation section 23 outputs the vector pair (the vector X and the vector W) generated in the above-described manner to the learning section 24.
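A minimal sketch of this vector generation is shown below, assuming the statistics are already ordered by unit of aggregation and that consecutive, non-overlapping windows of n units are used; whether windows overlap is a design choice, as noted above, so this windowing is an assumption.

```python
# A sketch of the vector generation (step S203) using non-overlapping windows.
def make_vector_pairs(stats, n):
    """stats: list of (representative_value, validity_ratio) tuples in
    aggregation order. Returns a list of (X, W) pairs, each of length n."""
    pairs = []
    for start in range(0, len(stats) - n + 1, n):
        window = stats[start:start + n]
        X = [rep for rep, _ in window]      # n representative values
        W = [ratio for _, ratio in window]  # n corresponding validity ratios
        pairs.append((X, W))
    return pairs

stats = [(129.5, 2/3), (0.0, 0.0), (125.0, 1.0),
         (130.0, 1/3), (127.5, 2/3), (126.0, 1.0)]
for X, W in make_vector_pairs(stats, n=3):
    print(X, W)  # two pairs (N=2) when n=3 and six units are available
```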
(1-4) Learning of Estimation Model
Next, at step S204, the data processing device 1 reads the estimation model to be learned which is previously stored in the model storage section 33 and inputs the vector X and vector W received from the vector generation section 23 to the estimation model to perform the learning of it under control of the learning section 24. The estimation model to be learned can be set as desired by the designer, the administrator, or the like.
In an embodiment, a hierarchical neural network is used as the estimation model.
In a neural network, generally, the elements of an input vector are input to the nodes of the input layer, where they are multiplied by weights and given a bias, and then passed to the nodes of the next layer, where an activation function is applied and the result is output. Accordingly, where A is a weight coefficient, B is a bias, and f is an activation function, an output Q of an intermediate layer (a first layer) when P is input to the input layer is generally represented by:
Q=f(AP+B) (1).
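In code form, Formula (1) can be sketched as follows; the sigmoid activation and the layer sizes are illustrative assumptions, since the embodiment does not fix the activation function.

```python
# Formula (1) as a single layer computing Q = f(A·P + B).
import numpy as np

def layer(P, A, B, f=lambda v: 1.0 / (1.0 + np.exp(-v))):
    # Weighted sum plus bias, then an elementwise activation function.
    return f(A @ P + B)

rng = np.random.default_rng(0)
P = rng.normal(size=6)          # 2n-dimensional input vector, here n = 3
A = rng.normal(size=(4, 6))     # weight matrix of the first layer
B = rng.normal(size=4)          # bias of the first layer
print(layer(P, A, B))           # output Q of the intermediate layer
```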
In this embodiment, a vector formed from a concatenation of the elements of the vector X and the elements of the vector W is input to the input layer. That is, the input vector P is a 2n-dimensional vector whose elements are the n representative values of the vector X followed by the n corresponding validity ratios of the vector W.
A feature value Z1 of an intermediate layer (a first layer) in response to the input P is then represented by:
Z1=f1(A1P+B1) (2),
and a feature value Z2 of an intermediate layer (a second layer) is represented by:
Z2=f2(A2(f1(A1P+B1))+B2) (3).
The subscripts 1 and 2 mean that a parameter contributes to the output of the first and the second layer, respectively.
A feature value generally represents what kind of features the input data has. It is known that a feature value Z obtained from a learned model in which the number of units in the intermediate layers is smaller than that in the input layer represents the intrinsic features of the input data in fewer dimensions than the original input.
The learning section 24 inputs an input vector formed from a concatenation of the elements of the vector X and the elements of the vector W to such an estimation model as discussed above, and acquires the output vector Y which is output by the estimation model in response to the input. Then, the learning section 24 performs learning of the parameters of the estimation model (such as the weight coefficient and the bias) so as to minimize an error L, calculated with Formula (4) below, for all the generated vector pairs (vector X and vector W).
L=|W·(Y−X)|² (4)
In Formula (4), it can be seen that the validity ratio vector W is applied to both the input-side vector X and the output-side vector Y, taking into account the degree of missingness in data in the learning of the estimation model.
In this manner, the learning section 24 learns the estimation model as an autoencoder so that the output from the output layer reproduces the input as closely as possible. Here, the learning section 24 can perform learning of the estimation model so as to minimize the error L using a stochastic gradient descent method such as Adam or AdaDelta, but these are not limiting and any other technique can be used.
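The following PyTorch sketch illustrates this learning step. The layer sizes, sigmoid activations, learning rate, and single training sample are all hypothetical; only the concatenated input P = (X, W) and the weighted error of Formula (4) are taken from the description above.

```python
# A minimal training sketch of the estimation model as an autoencoder with
# a 2n-dimensional input (X concatenated with W) and an n-dimensional
# output Y, trained with the weighted error of Formula (4).
import torch
from torch import nn

n = 3
model = nn.Sequential(
    nn.Linear(2 * n, 4), nn.Sigmoid(),   # first intermediate layer  -> Z1
    nn.Linear(4, 2), nn.Sigmoid(),       # second intermediate layer -> Z2 (bottleneck)
    nn.Linear(2, 4), nn.Sigmoid(),
    nn.Linear(4, n),                     # output layer -> Y
)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

def weighted_loss(Y, X, W):
    # Formula (4): L = |W · (Y − X)|², elementwise product then squared norm,
    # so units of aggregation with more missing data contribute less.
    return torch.sum((W * (Y - X)) ** 2)

X = torch.tensor([[129.5, 0.0, 125.0]])   # representative values (one sample)
W = torch.tensor([[2/3, 0.0, 1.0]])       # corresponding validity ratios

for _ in range(100):
    optimizer.zero_grad()
    Y = model(torch.cat([X, W], dim=1))   # input vector P = (X, W)
    loss = weighted_loss(Y, X, W)
    loss.backward()
    optimizer.step()
```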
(1-5) Updating of the Model
After the parameters of the estimation model are determined so as to minimize the error L, the learning section 24 performs processing for updating the estimation model stored in the model storage section 33 at step S205. The data processing device 1 may also be configured to output the parameters of the learned model stored in the model storage section 33 through the output control section 26 in response to input of an instruction signal from the operator, for example, under control of the control unit 20.
When the learning phase ends, the data processing device 1 can now perform data estimation based on a newly acquired data group containing missing data, using the learned model stored in the model storage section 33.
(2) Estimation Phase
When the estimation phase is set, the data processing device 1 can perform data estimation processing with the learned model as follows.
(2-1) Acquisition of Estimation Data
First, at step S301, the data processing device 1 acquires a series of data containing missing data as estimation data via the input/output interface unit 10 and stores the acquired data in the data storage section 31 as with step S201 under control of the data acquisition section 21.
(2-2) Calculation of Statistics
Next, at step S302, the data processing device 1 performs processing for reading the data stored in the data storage section 31 and calculating statistics according to a preset unit of aggregation as with step S202 under control of the statistics calculation section 22. For the unit of aggregation, the same settings as those used in the learning phase are preferably used; however, it is not necessarily limited to this. Likewise, for the representative value, the same representative value as that used in the learning phase (e.g., in the example above, an average of valid data) is preferably used; however, it is not necessarily limited to this. After the representative values and validity ratios have been calculated as statistics according to the unit of aggregation, the statistics calculation section 22 can store the results of calculation in the statistics storage section 32 as statistics data in association with identification numbers identifying the units of aggregation and/or date information, for example.
(2-3) Generation of Vectors
Next, at step S303, the data processing device 1 performs processing for reading the statistics data stored in the statistics storage section 32 and generating two types of vectors (the vector X and the vector W) for performing estimation as with step S203 under control of the vector generation section 23.
The vector generation section 23 selects a set number (n) of units of aggregation from the statistics data that has been read, extracts the representative values and validity ratios from the respective ones of the n units of aggregation, and generates the vector X (x1, x2, . . . , xn) with elements being n representative values and the vector W (w1, w2, . . . , wn) with elements being n validity ratios corresponding to the respective elements of the vector X. The number n of elements may be the stored value of n used in learning, or may be obtained as ½ of the number of input dimensions of the learned model stored in the model storage section 33, for example.
The vector generation section 23 outputs the generated vector pair (the vector X and the vector W) to the estimation section 25.
(2-4) Data Estimation
Next, at step S304, the data processing device 1 performs processing for reading the learned estimation model stored in the model storage section 33, and inputting the vector X and vector W received from the vector generation section 23 to the learned estimation model to acquire the output vector Y which is output by the estimation model in response to the input, under control of the estimation section 25. As described for the learning phase, in the case of an estimation model having four layers, the output vector Y is represented by:
Y=f4(A4(f3(A3(f2(A2(f1(A1P+B1))+B2))+B3))+B4) (5)
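Using the model trained in the sketch above, this estimation step can be illustrated as follows; the new sample values are hypothetical, and `model` refers to the earlier training sketch.

```python
# Estimation-phase sketch (step S304), reusing `model` from the training
# sketch above. A validity ratio of 0 marks a unit whose data is fully
# missing; the model's output Y provides an estimate for that unit.
import torch

X_new = torch.tensor([[0.0, 131.0, 128.0]])  # representative values; first unit fully missing
W_new = torch.tensor([[0.0, 1.0, 2/3]])      # corresponding validity ratios

with torch.no_grad():
    Y_est = model(torch.cat([X_new, W_new], dim=1))
print(Y_est)  # estimated data with the missing portion interpolated
```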
(2-5) Outputting of Estimation Result
At step S305, the data processing device 1 can output the result of estimation by the estimation section 25 via the input/output interface unit 10 in response to input of an instruction signal from the operator, for example, under control of the output control section 26. The output control section 26 can take the output vector Y output from the estimation model and display it on a display device such as a liquid crystal display, or transmit it to external apparatuses over the network NW, as a data group in which the missing data in the input data group has been interpolated, for example.
Alternatively, the output control section 26 can extract and output the feature value Z of the intermediate layers corresponding to the input data group. The feature value Z can be considered to represent the intrinsic features of the input data group in fewer dimensions than the original input data group, as noted above. Therefore, using the feature value Z as the input to a separate learner enables processing with a reduced load compared to using the original input data group directly. Conceivable examples of such a separate learner include a classifier such as logistic regression, a support vector machine, or a random forest, and a regression model using multiple regression analysis or a regression tree, as sketched below.
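The following sketch feeds the intermediate-layer feature value Z to a separate learner, here scikit-learn logistic regression. The encoder slice assumes the Sequential layout of the earlier training sketch, and the samples and labels `y` are hypothetical.

```python
# Feature-extraction sketch: the bottleneck output Z2 feeds a classifier.
# `model` is the learned Sequential from the training sketch above.
import numpy as np
import torch
from sklearn.linear_model import LogisticRegression

P = torch.tensor([[129.5, 0.0, 125.0, 2/3, 0.0, 1.0],
                  [130.0, 127.5, 126.0, 1/3, 2/3, 1.0]])  # two (X, W) samples
y = np.array([0, 1])                  # hypothetical binary labels

encoder = model[:4]                   # layers up to the bottleneck output Z2
with torch.no_grad():
    Z = encoder(P).numpy()            # 2-dimensional features vs. the 6-dim input

clf = LogisticRegression().fit(Z, y)  # train a classifier on the low-dimensional Z
print(clf.predict(Z))
```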
(Effects)
As detailed above, in an embodiment of the present invention, a series of data containing missing data are acquired by the data acquisition section 21, and from this series of data, representative values of the data and validity ratios representing the proportion of valid data being present are calculated as statistics according to a predetermined unit of aggregation by the statistics calculation section 22. In the calculation of the validity ratio, missing data is represented by a continuous value as a proportion in the embodiment above, rather than being represented by binary values of present/absent.
Then, in the learning phase, the vector X with elements being representative values extracted from a predetermined number n of units of aggregation and the vector W with elements being the corresponding validity ratios are generated by the vector generation section 23. Next, an input vector formed from a concatenation of the elements of the vector X and the elements of the vector W is input to the estimation model by the learning section 24, and learning of the estimation model is performed as an autoencoder so as to minimize the error L, which is based on the vector Y output by the estimation model in response to the input.
As a result, even when some data or all the data in a unit of aggregation are missing, the data can be effectively utilized for use in learning without discarding the unit of aggregation so that reduction in data can be prevented in the learning of the estimation model. This is particularly advantageous when the proportion of missing data is large relative to the overall size of data or when the overall size of data is small.
Further, according to the embodiment above, learning can be performed taking into account the degree of missingness for each unit of aggregation. Since learning is performed so that units of aggregation with more missing data contribute less, by way of the vector W contained in the error L as shown in Formula (4), even the degree of missingness can be effectively employed to make effective use of data.
Also in the estimation phase, the vector X with elements being representative values extracted from a predetermined number n of units of aggregation and the vector W with elements being the corresponding validity ratios are generated by the vector generation section 23 as in the learning phase. Then, an input vector formed from a concatenation of the elements of the vector X and the elements of the vector W is input by the estimation section 25 to the learned estimation model which has been learned as described above, and the vector Y which is output by the estimation model in response to the input or the feature value Z which is output from the intermediate layers is acquired.
Thus, estimation processing can be performed with effective use of the original data without discarding data and also in consideration of even the degree of missingness when data is estimated using a learned estimation model on the basis of a data group containing missing data or when feature values are obtained from the intermediate layers of the learned estimation model.
Furthermore, since the embodiment above does not require excessively complicated manipulations for calculating statistics or generating the input vector for either the learning phase or the estimation phase, it can be implemented with desired settings or modifications by the administrator or the like according to the nature of data or the purpose of analysis.
The present invention is not limited to the foregoing embodiments.
The unit of aggregation that is used by the statistics calculation section 22 is not limited to the embodiment above but any unit of aggregation can be set.
Moreover, generation of vectors by the vector generation section 23 is not limited to the above described embodiment.
Furthermore, the embodiment above can be applied when multiple types of data are present.
When two types of data are present, for example, the vectors generated from the respective types of data may be concatenated and input to the estimation model together.
While the embodiment above was described taking time-series data that is recorded on a daily basis in particular as an example, the recording frequency of data does not have to be one day but data recorded at any frequency can be used.
Furthermore, the embodiment above can also be applied to data other than time-series data, as noted above. For example, it can be applied to temperature data recorded at different observation points, or to image data. For data that is represented by a two-dimensional array, like image data, implementation can be done by extracting the data row by row and concatenating and inputting the rows, as discussed for the case where multiple types of data are present.
The embodiment above can also be applied to a compilation result for questionnaires or tests. For example, in the case of questionnaires, it is expected that data will be missing for some questions, or that data with no answers at all will be obtained for particular subjects for reasons such as non-applicability or unwillingness to answer. Even in such a situation, the embodiment above permits learning and estimation to be performed while distinguishing between partially unanswered and completely unanswered cases, taking both into account, and making effective use of data without discarding it. Where data contains verbal information, such as free answers to questionnaires, the embodiment above can be applied after converting the data into numerical values by a certain method, such as analyzing the frequency of appearance of keywords with text mining.
Moreover, all of the functional components of the data processing device 1 need not necessarily be provided in a single device. For example, the functional components 21 to 26 of the data processing device 1 may be distributed across cloud computers, edge routers, and the like such that those devices cooperate with each other to perform learning and estimation. This can reduce the processing burden on the individual devices and improve processing efficiency.
Additionally, calculation of statistics, data storage format and the like can be implemented with various modifications within the scope of the present invention.
In short, the present invention is not limited to the exact embodiments described above; the components can be embodied with modifications within the scope of the invention at the implementation stage. Also, different inventions can be formed by appropriate combinations of the multiple components disclosed in the embodiments above. For example, several components may be removed from all the components shown in the embodiments. Further, components from different embodiments may be combined as appropriate.