The present invention relates to a learning device, an estimation device, a learning method, and a learning program.
A variational autoencoder (VAE) using a latent variable and a neural network to perform density estimation is known as a technology for estimating a probability distribution of data through machine learning (see NPL 1 to NPL 3). The VAE can estimate a probability distribution of large-scale and complicated data, and thus is applied to various fields such as abnormality detection, image recognition, moving image recognition, and voice recognition.
Meanwhile, it is known that a VAE of the related art requires a large amount of data for machine learning, and performance deteriorates when the amount of data is small. Thus, as a scheme for preparing a large amount of learning data, multitask learning in which data of other tasks is used to improve performance of density estimation of data of a target task is known. In multitask learning, invariant features between tasks are learned and invariant knowledge between a target task and other tasks is shared, so that performance is improved. For example, with a conditional variational autoencoder (CVAE), a task-invariant prior distribution is assumed for a latent variable, so that dependency of the latent variable on a task can be reduced and task-invariant features can be learned.
NPL 1: Diederik P. Kingma, et al., “Semi-supervised Learning with Deep Generative Models,” Advances in neural information processing systems, 2014, [Retrieved on Oct. 25, 2019], Internet <URL: http://papers.nips.cc/paper/5352-semi-supervised-learning-with-deep-generative-models.pdf>
NPL 2: Christos Louizos, et al., “The Variational Fair Autoencoder,” [online], arXiv preprint arXiv: 1511.00830, 2015, [Retrieved on Oct. 25, 2019], Internet <URL: https://arxiv.org/pdf/1511.00830.pdf>
NPL 3: Hiroshi Takahashi, et al., “Variational Autoencoder with Implicit Optimal Priors,” [online], Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33, 2019, [Retrieved on Oct. 25, 2019], Internet <URL: https://aaai.org/ojs/index.php/AAAI/article/view/443>
However, in the CVAE, it is known that dependency of a latent variable on a task remains in many cases, and reduction of task dependency is insufficient. Thus, there is a problem that accuracy of multitask learning cannot be sufficiently improved in some cases.
The present invention has been made in view of the above, and an object of the present invention is to improve accuracy of multitask learning.
In order to solve the above-described problems and achieve the object, a learning device according to the present invention includes an acquisition unit configured to acquire data in a task; and a learning unit configured to learn a model representing a distribution of a probability that the data in the task is generated so that a mutual information amount between a latent variable and an observed variable is minimized in the model.
According to the present invention, it is possible to improve accuracy of multitask learning.
Hereinafter, an embodiment of the present invention will be described in detail with reference to the drawings. The present invention is not limited to the embodiment. Further, in description of the drawings, the same parts are denoted by the same reference signs.
Overview of Learning Device
A learning device of the present embodiment creates a generation model based on a CVAE and performs task-invariant density estimation. Here,
The encoder qφ(z|x, s) encodes data x in a task s to convert the data into a representation in which a latent variable z is used. Here, φ is a parameter of the encoder. Further, the decoder pθ(x|z, s) decodes the data encoded by the encoder to reproduce the original data x in the task s. Here, θ is a parameter of the decoder. When the original data x is a continuous value, a Gaussian distribution is typically applied to the encoder and decoder. In the example illustrated in
Specifically, the CVAE estimates a probability pθ(x|s) of the data x in the task s using the latent variable z, as expressed in Equation (1) below. Here, p(z) is called a prior distribution.
[Math. 1]
$$p_\theta(x \mid s) = \int p_\theta(x \mid z, s)\, p(z)\, dz \qquad (1)$$
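As an illustration of Equation (1), the following minimal sketch approximates the integral by Monte Carlo sampling from the prior p(z). The per-task linear-Gaussian decoder, its parameters (`DECODER`, `NOISE_VAR`), and the function names are hypothetical stand-ins chosen so that the result can be compared with a closed form; they are not taken from the embodiment, in which the decoder is a neural network.

```python
import math
import random


def normal_pdf(x, mean, var):
    """Density of a scalar Gaussian N(mean, var) at x."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


# Hypothetical decoder p(x|z, s) = N(a_s * z + b_s, NOISE_VAR), one (a_s, b_s) per task s.
DECODER = {0: (1.0, 0.0), 1: (0.5, 2.0)}
NOISE_VAR = 0.25


def marginal_mc(x, s, n_samples=50_000, seed=0):
    """Monte Carlo estimate of p(x|s): average p(x|z, s) over z drawn from p(z) = N(0, 1)."""
    rng = random.Random(seed)
    a, b = DECODER[s]
    total = 0.0
    for _ in range(n_samples):
        z = rng.gauss(0.0, 1.0)
        total += normal_pdf(x, a * z + b, NOISE_VAR)
    return total / n_samples


def marginal_exact(x, s):
    """Closed form available only for this toy decoder: x|s ~ N(b_s, a_s^2 + NOISE_VAR)."""
    a, b = DECODER[s]
    return normal_pdf(x, b, a * a + NOISE_VAR)
```

For a neural-network decoder no closed form exists, which is why the variational lower bound described next is used instead of evaluating the integral directly.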
In the CVAE learning, learning is performed so that the expected value of a variational lower bound L of ln pθ(x|s) in Equation (2) below is maximized, and a parameter is determined.
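For reference, the standard derivation of a variational lower bound of ln pθ(x|s) by Jensen's inequality can be written with the symbols used here as sketched below. This is an assumed form that agrees with the two terms of Equation (3); it should not be read as a verbatim copy of Equation (2).

```latex
\[
\begin{aligned}
\ln p_\theta(x \mid s)
  &= \ln \mathbb{E}_{q_\phi(z \mid x, s)}\!\left[
       \frac{p_\theta(x \mid z, s)\, p(z)}{q_\phi(z \mid x, s)} \right] \\
  &\ge \mathbb{E}_{q_\phi(z \mid x, s)}\!\left[
       \ln \frac{p_\theta(x \mid z, s)\, p(z)}{q_\phi(z \mid x, s)} \right] \\
  &= \mathbb{E}_{q_\phi(z \mid x, s)}\!\left[ \ln p_\theta(x \mid z, s) \right]
     - D_{KL}\!\left( q_\phi(z \mid x, s) \,\|\, p(z) \right)
\end{aligned}
\]
```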
Here, a first term of the variational lower bound L in Equation (3) below is called a reconstruction error (RE), and a second term is called a Kullback-Leibler information amount (KL).
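[Math. 3]
$$\mathcal{L}(x, s; \theta, \phi) = \mathbb{E}_{q_\phi(z \mid x, s)}\left[\ln p_\theta(x \mid z, s)\right] - D_{KL}\left(q_\phi(z \mid x, s) \,\|\, p(z)\right) \qquad (3)$$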
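The two terms of Equation (3) can be evaluated in closed form for a scalar toy model, as in the following minimal sketch. The linear-Gaussian decoder N(a z + b, noise_var), the Gaussian encoder N(q_mean, q_var), and all function names are assumptions made for illustration. With this choice the bound can be checked against the exact value of ln pθ(x|s).

```python
import math


def kl_to_standard_normal(mean, var):
    """KL term: D_KL(N(mean, var) || N(0, 1)) for a scalar latent variable."""
    return 0.5 * (mean ** 2 + var - math.log(var) - 1.0)


def lower_bound(x, a, b, noise_var, q_mean, q_var):
    """Variational lower bound: RE term minus KL term.

    RE term is E_q[ln N(x; a*z + b, noise_var)] with z ~ N(q_mean, q_var),
    which has a closed form because the decoder mean is linear in z.
    """
    re = (-0.5 * math.log(2.0 * math.pi * noise_var)
          - ((x - a * q_mean - b) ** 2 + a * a * q_var) / (2.0 * noise_var))
    return re - kl_to_standard_normal(q_mean, q_var)


def log_marginal(x, a, b, noise_var):
    """Exact ln p(x|s) for the toy model: x ~ N(b, a^2 + noise_var)."""
    var = a * a + noise_var
    return -0.5 * math.log(2.0 * math.pi * var) - (x - b) ** 2 / (2.0 * var)


def exact_posterior(x, a, b, noise_var):
    """Mean and variance of the exact posterior p(z|x, s) for the toy model."""
    var = 1.0 / (1.0 + a * a / noise_var)
    mean = var * a * (x - b) / noise_var
    return mean, var
```

In this toy setting the bound is tight when the encoder equals the exact posterior and strictly smaller otherwise, which is the property that makes maximizing it a reasonable surrogate for maximizing ln pθ(x|s).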
Specifically, in the CVAE, for a true joint distribution pD(x, s) of the data x and the task s, an expected value of the variational lower bound L is used as the objective function as expressed in Equation (4) below, and learning is performed so that the objective function is maximized.
[Math. 4]
$$\mathcal{F}_{CVAE}(\theta, \phi) = \mathbb{E}_{p_D(x, s)}\left[\mathcal{L}(x, s; \theta, \phi)\right] \qquad (4)$$
Thus, in the CVAE, an expected value R(φ) of KL of the CVAE in Equation (3) above is minimized, so that the expected value of the variational lower bound L is maximized. The expected value R(φ) of KL of the CVAE is expressed by Equation (5) below.
[Math. 5]
Here, I (O; Z) is a mutual information amount between the latent variable z and observed variables x and s, and is expressed by Equation (6) below.
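A decomposition of R(φ) that agrees with Equations (10) and (13) can be worked out as sketched below. The forms given for R(φ), I(O; Z), and the aggregated posterior qφ(z) are the standard ones implied by the surrounding description and are stated here as assumptions.

```latex
\[
\begin{aligned}
R(\phi)
  &= \mathbb{E}_{p_D(x, s)}\!\left[ D_{KL}\!\left( q_\phi(z \mid x, s) \,\|\, p(z) \right) \right] \\
  &= \mathbb{E}_{p_D(x, s)}\, \mathbb{E}_{q_\phi(z \mid x, s)}\!\left[
       \ln \frac{q_\phi(z \mid x, s)}{q_\phi(z)} + \ln \frac{q_\phi(z)}{p(z)} \right] \\
  &= I(O; Z) + D_{KL}\!\left( q_\phi(z) \,\|\, p(z) \right), \\[4pt]
I(O; Z)
  &= \mathbb{E}_{p_D(x, s)}\!\left[ D_{KL}\!\left( q_\phi(z \mid x, s) \,\|\, q_\phi(z) \right) \right],
\qquad
q_\phi(z) = \sum_{s} \int q_\phi(z \mid x, s)\, p_D(x, s)\, dx .
\end{aligned}
\]
```

Because the second term is a KL divergence and therefore non-negative, R(φ) is an upper bound of I(O; Z).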
Further, when respective probabilities of K tasks are pD(s=k)=πk, the JS divergence in Equation (8) below is introduced for the posterior distribution of the latent variable z with respect to the task s in Equation (7) below.
[Math. 7]
$$q_\phi(z \mid s) = \int q_\phi(z \mid x, s)\, p_D(x \mid s)\, dx \qquad (7)$$
Here, qϕ(z) is expressed by Equation (9) below.
J(φ), which is the JS divergence in Equation (8) above, has a large value in a case in which the latent variable z depends on the task s, and a small value in a case in which the latent variable z does not depend on the task s. Thus, the JS divergence can be used as a measure of task dependency.
In the CVAE, the expected value R(φ) of KL of the CVAE in Equation (5) above is minimized. Because this J(φ) is bounded from above by R(φ), J(φ) is also minimized in the CVAE, so that the dependence of the latent variable z on the task s is reduced.
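The behavior of J(φ) as a measure of task dependency can be illustrated with the following minimal sketch. It assumes the weighted (generalized) JS divergence, that is, the πk-weighted sum of KL divergences from each qφ(z|s=k) to their mixture qφ(z), and it uses a finite set of latent states so that the sums are exact. The function names are hypothetical.

```python
import math


def kl(p, q):
    """KL divergence between two discrete distributions given as lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def task_js_divergence(posteriors, weights):
    """Weighted JS divergence among per-task posteriors.

    posteriors[k] is q(z|s=k) over a finite set of latent states,
    weights[k] is the task probability pi_k.
    """
    n_states = len(posteriors[0])
    # Mixture q(z) = sum_k pi_k * q(z|s=k)
    mixture = [sum(w * q[i] for w, q in zip(weights, posteriors))
               for i in range(n_states)]
    return sum(w * kl(q, mixture) for w, q in zip(weights, posteriors))
```

For two equally likely tasks, the value is zero when both tasks share the same posterior and reaches ln 2 when the posteriors do not overlap at all.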
Here,
Thus, the learning device of the present embodiment minimizes a mutual information amount I(O; Z). As illustrated in
Further, a difference between R(φ) and I(O; Z) in Equation (10) below derived from Equation (5) above becomes zero when p(z)=qφ(z). That is, minimizing I(O; Z) instead of R(φ) is equivalent to changing a prior distribution p(z) to qφ(z) in Equation (9) above.
[Math. 10]
$$R(\phi) - I(O; Z) = D_{KL}\left(q_\phi(z) \,\|\, p(z)\right) \qquad (10)$$
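Equation (10) can be checked numerically on a small discrete example, as in the following minimal sketch. The joint distribution `P_D`, the encoder table `ENCODER`, and the uniform `PRIOR` are arbitrary hypothetical values, and I(O; Z) is computed as the expected KL divergence from qφ(z|x, s) to the aggregated posterior qφ(z), which is an assumed but standard form.

```python
import math


def kl(p, q):
    """KL divergence between two discrete distributions given as lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


# Hypothetical toy setting: keys are observed pairs (x, s), three latent states.
P_D = {(0, 0): 0.3, (1, 0): 0.2, (0, 1): 0.1, (1, 1): 0.4}      # p_D(x, s)
ENCODER = {(0, 0): [0.7, 0.2, 0.1], (1, 0): [0.1, 0.6, 0.3],    # q(z|x, s)
           (0, 1): [0.3, 0.3, 0.4], (1, 1): [0.2, 0.1, 0.7]}
PRIOR = [1.0 / 3.0, 1.0 / 3.0, 1.0 / 3.0]                        # p(z)


def aggregated_posterior():
    """q(z) = sum over (x, s) of q(z|x, s) * p_D(x, s)."""
    return [sum(P_D[o] * ENCODER[o][i] for o in P_D) for i in range(len(PRIOR))]


def expected_kl_to(target):
    """E over p_D(x, s) of D_KL(q(z|x, s) || target)."""
    return sum(P_D[o] * kl(ENCODER[o], target) for o in P_D)


q_z = aggregated_posterior()
r_value = expected_kl_to(PRIOR)    # R(phi): expected KL to the prior
mi_value = expected_kl_to(q_z)     # I(O; Z): expected KL to the aggregated posterior
prior_gap = kl(q_z, PRIOR)         # right-hand side of Equation (10)
```

In this example `r_value - mi_value` equals `prior_gap` up to floating-point rounding, and the gap vanishes if `PRIOR` is replaced by `q_z`.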
This allows the learning device of the present embodiment to further reduce the task dependency as compared with CVAE and improve the accuracy of multitask learning.
Configuration of Learning Device
The input unit 11 is achieved by using an input device such as a keyboard or a mouse, and inputs various types of instruction information such as processing start to the control unit 15 in response to an input operation from an operator. The output unit 12 is achieved by a display device such as a liquid crystal display, a printing device such as a printer, or the like.
The communication control unit 13 is achieved by a network interface card (NIC) or the like, and controls communication between an external device connected via a network 3, such as a server, and the control unit 15. For example, the communication control unit 13 controls communication between a management device or the like that manages various types of information and the control unit 15.
The storage unit 14 is achieved by a semiconductor memory element such as a random access memory (RAM) or a flash memory, or a storage device such as a hard disk or an optical disc, and stores, for example, a parameter of a data generation model learned through learning processing to be described below. The storage unit 14 may be configured to communicate with the control unit 15 via the communication control unit 13.
The control unit 15 is achieved by using a central processing unit (CPU) or the like, and executes a processing program stored in a memory. This allows the control unit 15 to function as an acquisition unit 15a and a learning unit 15b, as illustrated in
The acquisition unit 15a acquires the data in the task. For example, the acquisition unit 15a acquires, for each task, sensor data output by a sensor attached to an IoT device via the communication control unit 13. Examples of the sensor data include data of sensors for temperature, speed, rotation speed, traveling distance, and the like attached to a car, and data of sensors for temperature, frequency, sound, and the like attached to a wide variety of devices operating in a factory. Further, the acquisition unit 15a may store the acquired data in the storage unit 14. Alternatively, the acquisition unit 15a may transfer such information to the learning unit 15b without storing the information in the storage unit 14.
The learning unit 15b learns the generation model representing a distribution of a probability that the data x in the task s is generated so that the mutual information amount between the latent variable and the observed variable is minimized in the generation model. This mutual information amount is a predetermined mutual information amount I(O; Z) having, as an upper bound, an expected value R(φ) of the Kullback-Leibler information amount KL for a variational lower bound L of a logarithm of the probability distribution.
Specifically, the learning unit 15b creates a generation model representing a distribution of a probability that the data x in the task s is generated, in Equation (1) above, based on the CVAE. In this case, the learning unit 15b learns the generation model so that the mutual information amount I(O; Z) in Equation (5) above is minimized. I(O; Z) is minimized instead of R(φ) in this manner, so that the task dependency can be further reduced as compared with the CVAE.
Further, the learning unit 15b estimates I(O; Z) by using density ratio estimation. The density ratio estimation is a scheme for estimating a density ratio (difference) of two probability distributions without estimating each of the two probability distributions.
Here, as expressed in Equation (5) above, R(φ) is an expected value of the Kullback-Leibler information amount KL for the variational lower bound L of the logarithm of the probability distribution in Equation (3) above, and is an upper bound of the mutual information amount I(O; Z). Thus, the learning unit 15b estimates the difference between R(φ) and I(O; Z) by using the density ratio estimation.
Specifically, the learning unit 15b estimates the difference between R(φ) and I(O; Z) using a neural network Tψ(z), as expressed in Equation (11) below. Because this difference is the Kullback-Leibler divergence in Equation (10) above, it has a non-negative value.
[Math. 11]
$$D_{KL}\left(q_\phi(z) \,\|\, p(z)\right) \simeq \mathbb{E}_{q_\phi(z)}\left[T_\psi(z)\right] \qquad (11)$$
Here, Tψ(z) is a neural network that maximizes an objective function in Equation (12) below.
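One common way to realize such a density ratio estimator is to train a binary classifier that separates samples of qφ(z) from samples of p(z); with equal sample counts, its logit approximates ln qφ(z)/p(z). The following minimal sketch assumes this logistic formulation and replaces the neural network with a hypothetical linear model T(z) = w z + c fitted by Newton's method, which is sufficient when both distributions are unit-variance Gaussians. The distributions, sample sizes, and function names are illustrative assumptions only.

```python
import math
import random


def draw(mean, n, seed):
    """Draw n samples from N(mean, 1) with a fixed seed."""
    rng = random.Random(seed)
    return [rng.gauss(mean, 1.0) for _ in range(n)]


def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))


def fit_log_density_ratio(samples_q, samples_p, newton_steps=10):
    """Fit T(z) = w*z + c by logistic regression (label 1: from q, label 0: from p)."""
    data = [(z, 1.0) for z in samples_q] + [(z, 0.0) for z in samples_p]
    w, c = 0.0, 0.0
    for _ in range(newton_steps):
        g_w = g_c = h_ww = h_wc = h_cc = 0.0
        for z, y in data:
            p = sigmoid(w * z + c)
            g_w += (p - y) * z          # gradient of the logistic loss
            g_c += (p - y)
            k = p * (1.0 - p)           # Hessian weights
            h_ww += k * z * z
            h_wc += k * z
            h_cc += k
        det = h_ww * h_cc - h_wc * h_wc
        w -= (h_cc * g_w - h_wc * g_c) / det   # Newton update, 2x2 inverse by hand
        c -= (h_ww * g_c - h_wc * g_w) / det
    return w, c


def kl_estimate(samples_q, w, c):
    """Estimate D_KL(q || p) as the average of T(z) over samples of q, as in Equation (11)."""
    return sum(w * z + c for z in samples_q) / len(samples_q)
```

For q = N(1, 1) and p = N(0, 1) the true log ratio is z - 0.5 and the true KL divergence is 0.5, so the fitted `(w, c)` should be close to `(1, -0.5)`. Neither density is evaluated during fitting; only samples are used.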
In this case, the mutual information amount I(O; Z) can be estimated by subtracting the difference estimated by Equation (11) above from the upper bound R(φ), as shown in Equation (13) below.
[Math. 13]
$$I(O; Z) \simeq \mathbb{E}_{p_D(x, s)}\left[D_{KL}\left(q_\phi(z \mid x, s) \,\|\, p(z)\right)\right] - \mathbb{E}_{q_\phi(z)}\left[T_\psi(z)\right] \qquad (13)$$
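The estimate in Equation (13) can be illustrated on a Gaussian toy problem whose mutual information is known in closed form. In the following minimal sketch the task variable is omitted for brevity, the data follow x ~ N(0, 1), the encoder is a hypothetical q(z|x) = N(W x, V), and the exact log density ratio ln qφ(z)/p(z) stands in for a trained Tψ(z). All of these choices are assumptions made for illustration.

```python
import math
import random

W, V = 1.5, 0.5  # hypothetical encoder parameters: q(z|x) = N(W*x, V)


def kl_to_standard_normal(mean, var):
    """D_KL(N(mean, var) || N(0, 1)) for a scalar latent variable."""
    return 0.5 * (mean ** 2 + var - math.log(var) - 1.0)


def log_ratio(z):
    """Exact ln q(z)/p(z) with q(z) = N(0, W^2 + V) and p(z) = N(0, 1).

    Stands in for a trained density ratio network T_psi(z).
    """
    s = W * W + V
    return -0.5 * math.log(s) - 0.5 * z * z / s + 0.5 * z * z


def estimate_mutual_information(n=100_000, seed=0):
    """Monte Carlo version of Equation (13): mean KL to the prior minus mean T(z)."""
    rng = random.Random(seed)
    r_sum = 0.0
    t_sum = 0.0
    for _ in range(n):
        x = rng.gauss(0.0, 1.0)                    # sample from the data distribution
        z = rng.gauss(W * x, math.sqrt(V))         # sample from the encoder
        r_sum += kl_to_standard_normal(W * x, V)   # contributes to R(phi)
        t_sum += log_ratio(z)                      # contributes to E_q(z)[T(z)]
    return r_sum / n - t_sum / n


def exact_mutual_information():
    """Closed form for this Gaussian channel: 0.5 * ln(1 + W^2 / V)."""
    return 0.5 * math.log(1.0 + W * W / V)
```

With these parameters R(φ) is about 1.22 while the mutual information is about 0.85, so subtracting the density ratio term removes a substantial gap.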
The learning unit 15b substitutes the estimated mutual information amount I(O; Z) into an objective function FCVAE(θ, φ) of the CVAE in Equation (4) above to obtain an objective function FProposed(θ, φ) of the present embodiment in Equation (14) below.
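[Math. 14]
$$\mathcal{F}_{Proposed}(\theta, \phi) = \mathbb{E}_{p_D(x, s)}\left[\mathcal{L}(x, s; \theta, \phi)\right] + \mathbb{E}_{q_\phi(z)}\left[T_\psi(z)\right] \qquad (14)$$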
The learning unit 15b performs learning so that the objective function FProposed(θ, φ) is maximized to determine parameters. As expressed in Equation (14) above, the objective function FProposed(θ, φ) has a value greater by the difference in Equation (11) above than the objective function FCVAE(θ, φ) in Equation (4) above. Thus, the learning unit 15b can estimate the probability distribution of the data x in the task s with higher accuracy than in the CVAE.
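The relation between the two objective functions can be written out step by step as sketched below, using Equations (3), (4), (11), and (13) and writing R(φ) for the expected KL term of Equation (3). The intermediate line is an assumed reading of "substituting the estimated mutual information amount" as replacing R(φ) by the estimate of I(O; Z).

```latex
\[
\begin{aligned}
\mathcal{F}_{CVAE}(\theta, \phi)
  &= \mathbb{E}_{p_D(x, s)}\, \mathbb{E}_{q_\phi(z \mid x, s)}\!\left[ \ln p_\theta(x \mid z, s) \right] - R(\phi), \\
\mathcal{F}_{Proposed}(\theta, \phi)
  &= \mathbb{E}_{p_D(x, s)}\, \mathbb{E}_{q_\phi(z \mid x, s)}\!\left[ \ln p_\theta(x \mid z, s) \right]
     - \left( R(\phi) - \mathbb{E}_{q_\phi(z)}\!\left[ T_\psi(z) \right] \right) \\
  &= \mathcal{F}_{CVAE}(\theta, \phi) + \mathbb{E}_{q_\phi(z)}\!\left[ T_\psi(z) \right].
\end{aligned}
\]
```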
In
As illustrated in
Configuration of Estimation Device
The input unit 21 is achieved by using an input device such as a keyboard or a mouse, and inputs various types of instruction information such as processing start to the control unit 25 in response to an input operation from the operator. The output unit 22 is achieved by a display device such as a liquid crystal display, a printing device such as a printer, or the like.
The communication control unit 23 is achieved by a network interface card (NIC) or the like, and controls communication between an external device connected via a network, such as a server, and the control unit 25. For example, the communication control unit 23 controls communication between a management device or the like that manages various types of information and the control unit 25.
The storage unit 24 is achieved by a semiconductor memory element such as a random access memory (RAM) or a flash memory, or a storage device such as a hard disk or an optical disc, and stores, for example, a parameter of the data generation model learned by the learning device 10 described above. The storage unit 24 may be configured to communicate with the control unit 25 via the communication control unit 23.
The control unit 25 is achieved by using a central processing unit (CPU) or the like, and executes a processing program stored in a memory. This allows the control unit 25 to function as the acquisition unit 15a, the learning unit 15b, and the detection unit 25c, as illustrated in
Because the acquisition unit 15a and the learning unit 15b are the same functional units as the learning device 10 described above, description thereof will be omitted.
The detection unit 25c estimates a probability that newly acquired data in the task is generated, using the learned generation model, and detects an abnormality when the generation probability is lower than a predetermined threshold value. For example,
Further, the detection unit 25c uses the created generation model to estimate the distribution of the probability that the data in the task newly acquired by the acquisition unit 15a is generated. Further, the detection unit 25c determines a normality when the estimated probability that the data in the task newly acquired by the acquisition unit 15a is generated is equal to or more than the predetermined threshold value, and determines an abnormality when the estimated generation probability is lower than the predetermined threshold value.
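The threshold determination described above can be sketched as follows. The per-task Gaussian in `TASK_MODELS` is a hypothetical stand-in for the learned generation model, and the comparison is made on log-probabilities, which is equivalent to comparing probabilities because the logarithm is monotonic. The task names, parameter values, and threshold are illustrative assumptions.

```python
import math


def gaussian_log_density(x, mean, var):
    """Log-density of a scalar Gaussian N(mean, var) at x."""
    return -0.5 * math.log(2.0 * math.pi * var) - (x - mean) ** 2 / (2.0 * var)


# Hypothetical stand-in for the learned generation model: (mean, variance) per task.
TASK_MODELS = {"car": (60.0, 25.0), "factory": (20.0, 4.0)}


def detect(x, task, log_threshold):
    """Return "normal" when the estimated log-probability is at or above the threshold."""
    mean, var = TASK_MODELS[task]
    log_prob = gaussian_log_density(x, mean, var)
    return "abnormal" if log_prob < log_threshold else "normal"
```

For instance, with a log-threshold of -6.0, a reading of 60.0 for the task "car" is judged normal, whereas a reading of 100.0 is judged abnormal.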
For example, as illustrated in
As described above, the generation model created by the learning unit 15b has low task dependency and can estimate the data generation probability with high accuracy independently of the task. Thus, the detection unit 25c can detect abnormal data with high accuracy.
Further, the detection unit 25c outputs an alarm when the abnormality has been detected. For example, the detection unit 25c outputs a message or an alarm indicating abnormality detection to the management device or the like via the output unit 22 or the communication control unit 23.
Learning Processing
Next, learning processing of the learning device 10 according to the present embodiment will be described with reference to
First, the acquisition unit 15a acquires the data in the task (step S1). For example, the acquisition unit 15a acquires, for each task, data of sensors for speed, rotation speed, traveling distance, and the like attached to an object such as a car.
Then, the learning unit 15b learns the generation model representing the distribution of the probability that the data x in the task s is generated so that the mutual information amount between the latent variable and the observed variable is minimized in the generation model (step S2). This mutual information amount is a mutual information amount I(O; Z) having, as the upper bound, the expected value R(φ) of the Kullback-Leibler information amount KL for the variational lower bound L of the logarithm of the probability distribution. Specifically, the learning unit 15b creates the generation model representing the distribution of the probability that the data x in the task s is generated based on the CVAE, and learns the generation model so that the mutual information amount I(O; Z) is minimized.
In this case, the learning unit 15b estimates I(O; Z) by using the density ratio estimation. Further, the learning unit 15b performs learning so that the objective function FProposed(θ, φ) obtained by substituting the estimated mutual information amount I(O; Z) into the objective function FCVAE(θ, φ) of the CVAE is maximized, to determine the parameter of the generation model. Thus, the series of learning processing ends.
Estimation Processing
Next, estimation processing in the estimation device 20 according to the present embodiment will be described with reference to
The detection unit 25c uses the created generation model to estimate the distribution of the probability that the data in the task newly acquired by the acquisition unit 15a is generated (step S3). Further, the detection unit 25c determines a normality when the estimated probability that the data in the task newly acquired by the acquisition unit 15a is generated is equal to or more than the predetermined threshold value, and determines an abnormality when the estimated probability of the data generation is lower than the predetermined threshold value (step S4). The detection unit 25c outputs an alarm when the detection unit 25c detects the abnormality. Thus, the series of estimation processes ends.
As described above, in the learning device 10 of the present embodiment, the acquisition unit 15a acquires the data in the task. Further, the learning unit 15b learns the generation model representing the distribution of a probability that the data in the task is generated so that the mutual information amount between the latent variable and the observed variable is minimized in the generation model. The mutual information amount is a predetermined mutual information amount having, as the upper bound, the expected value of the Kullback-Leibler information amount for the variational lower bound of the logarithm of the probability distribution. Further, this generation model includes an encoder that encodes data to convert the data into a representation using a latent variable, and a decoder that decodes the data encoded by the encoder, and is generated based on the CVAE.
Thus, the learning device 10 can reduce the task dependency and estimate the distribution of the probability that the data in the task is generated with higher accuracy. Thus, according to the learning device 10, it is possible to improve the accuracy of multitask learning.
Further, the learning unit 15b estimates the mutual information amount by using density ratio estimation. This allows the learning device 10 to efficiently reduce the task dependency of the generation model.
Further, in the estimation device 20 of the present embodiment, the acquisition unit 15a acquires the data in the task. Further, the learning unit 15b learns the generation model representing the distribution of a probability that the data in the task is generated so that the mutual information amount between the latent variable and the observed variable is minimized in the generation model. Further, the detection unit 25c uses the learned generation model to estimate the probability that the newly acquired data in the task is generated, and detects an abnormality when the probability of generation is lower than the predetermined threshold value. This allows the estimation device 20 to estimate the data generation probability with high accuracy independently of the task and detect the abnormal data with high accuracy through multitask learning.
For example, the estimation device 20 can acquire a large amount of large-scale and complicated data output by various sensors for temperature, speed, rotation speed, traveling distance, and the like attached to a car, and detect an abnormality occurring in a traveling car with high accuracy. Alternatively, the estimation device 20 can acquire, for each task, large-scale and complicated data output by sensors for temperature, frequency, sound, and the like attached to a wide variety of devices operating in a factory, and detect an abnormality with high accuracy independently of the task when an abnormality occurs in any of the devices.
Further, the detection unit 25c outputs an alarm when the detection unit 25c has detected an abnormality. This allows the estimation device 20 to notify a notification destination capable of dealing with the detected abnormality so that the abnormality is dealt with.
The learning device 10 and the estimation device 20 of the present embodiment are not limited to those based on the CVAE of the related art. For example, processing of the learning unit 15b may be based on processing obtained by adding conditions of a task to an autoencoder (AE), which is a special case of VAE, or the encoder and the decoder may follow a probability distribution other than the Gaussian distribution.
Program
It is also possible to create a program in which the processing executed by the learning device 10 and the estimation device 20 according to the embodiment is described in a language that can be executed by a computer. In an embodiment, the learning device 10 can be implemented by a learning program that executes the learning processing being installed as package software or online software on a desired computer. For example, the information processing device is caused to execute the learning program so that the information processing device can function as the learning device 10. Similarly, an estimation program that executes the above estimation processing is installed on a desired computer so that the information processing device can function as the estimation device 20. The information processing device referred to herein includes a desktop type or notebook type personal computer. In addition, examples of the information processing device include a smartphone, a mobile communication terminal such as a mobile phone or a personal handyphone system (PHS), and a slate terminal such as a personal digital assistant (PDA). Further, functions of the learning device 10 or functions of the estimation device 20 may be implemented in a cloud server.
The memory 1010 includes a read only memory (ROM) 1011 and a RAM 1012. The ROM 1011 stores, for example, a boot program such as a basic input output system (BIOS). The hard disk drive interface 1030 is connected to a hard disk drive 1031. The disc drive interface 1040 is connected to the disc drive 1041. A removable storage medium such as a magnetic disk or an optical disc is inserted into the disc drive 1041. A mouse 1051 and a keyboard 1052, for example, are connected to the serial port interface 1050. A display 1061, for example, is connected to the video adapter 1060.
Here, the hard disk drive 1031 stores, for example, an OS 1091, an application program 1092, a program module 1093, and program data 1094. Each piece of information described in the above embodiment is stored in, for example, the hard disk drive 1031 or the memory 1010.
Further, the learning program or the estimation program is stored in the hard disk drive 1031 as, for example, the program module 1093 in which commands executed by the computer 1000 are described. Specifically, the program module 1093 in which each processing executed by the learning device 10 or the estimation device 20 described in the above embodiment is described is stored in the hard disk drive 1031.
Further, data used for information processing in the learning program or the estimation program is stored as the program data 1094 in, for example, the hard disk drive 1031. The CPU 1020 reads the program module 1093 or the program data 1094 stored in the hard disk drive 1031 into the RAM 1012 as necessary, and executes each of the above-described procedures.
The program module 1093 or the program data 1094 related to the learning program or the estimation program is not limited to a case in which the program module 1093 or the program data 1094 are stored in the hard disk drive 1031, and for example, the program module 1093 or the program data 1094 may be stored in a removable storage medium and read by the CPU 1020 via the disc drive 1041 or the like. Alternatively, the program module 1093 or the program data 1094 related to the learning program or the estimation program may be stored in another computer connected via a network such as local area network (LAN) or wide area network (WAN) and read by the CPU 1020 via the network interface 1070.
Although the embodiment to which the invention made by the present inventor is applied has been described above, the present invention is not limited by the description and the drawings which constitute a part of the disclosure of the present invention according to the present embodiment. That is, other embodiments, examples, operation technologies, and the like made by those skilled in the art based on the present embodiment are all included in the scope of the present invention.
10 Learning device
11, 21 Input unit
12, 22 Output unit
13, 23 Communication control unit
14, 24 Storage unit
15, 25 Control unit
15a Acquisition unit
15b Learning unit
20 Estimation device
25c Detection unit
Filing Document | Filing Date | Country | Kind |
---|---|---|---|
PCT/JP2019/045693 | 11/21/2019 | WO |