Neural networks are used in the field of machine learning and artificial intelligence (AI). A neural network comprises a plurality of nodes which are interconnected by links, sometimes referred to as edges. The input edges of one or more nodes form the input of the network as a whole, and the output edges of one or more other nodes form the output of the network as a whole, whilst the output edges of various nodes within the network form the input edges to other nodes. Each node represents a function of its input edge(s) weighted by a respective weight, the result being output on its output edge(s). The weights can be gradually tuned based on a set of experience data (training data) so as to tend towards a state where the network will output a desired value for a given input.
Typically the nodes are arranged into layers with at least an input and an output layer. A “deep” neural network comprises one or more intermediate or “hidden” layers in between the input layer and the output layer. The neural network can take input data and propagate the input data through the layers of the network to generate output data. Certain nodes within the network perform operations on the data, and the result of those operations is passed to other nodes, and so on.
At some or all of the nodes of the network, the input to that node is weighted by a respective weight. A weight may define the connectivity between a node in a given layer and the nodes in the next layer of the neural network. A weight can take the form of a single scalar value or can be modelled as a probabilistic distribution. When the weights are defined by a distribution, as in a Bayesian model, the neural network can be fully probabilistic and captures the concept of uncertainty. The values of the connections 106 between nodes may also be modelled as distributions. This is illustrated schematically in
The network learns by operating on data input at the input layer, and adjusting the weights applied by some or all of the nodes based on the input data. There are different learning approaches, but in general there is a forward propagation through the network from left to right in
The input to the network is typically a vector, each element of the vector representing a different corresponding feature. E.g. in the case of image recognition the elements of this feature vector may represent different pixel values, or in a medical application the different features may represent different symptoms or patient questionnaire responses. The output of the network may be a scalar or a vector. The output may represent a classification, e.g. an indication of whether a certain object such as an elephant is recognized in the image, or a diagnosis of the patient in the medical example.
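As a concrete illustration of such a feature vector, the following sketch encodes a hypothetical patient record as a numeric vector; the feature names and the two-category one-hot encoding are assumptions made for the example only.

```python
# Hypothetical example of forming an input feature vector from a record
# with mixed features (a categorical field plus numeric fields).
def encode_record(record):
    """Map a raw record to a numeric feature vector.

    The categorical 'gender' field is one-hot encoded over two assumed
    categories; the numeric fields pass through unchanged.
    """
    gender_onehot = [1.0, 0.0] if record["gender"] == "female" else [0.0, 1.0]
    return gender_onehot + [float(record["age"]), float(record["weight_kg"])]

x = encode_record({"gender": "female", "age": 42, "weight_kg": 70.5})
# x is the 4-element vector [1.0, 0.0, 42.0, 70.5]
```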
Training in this manner is sometimes referred to as a supervised approach. Other approaches are also possible, such as a reinforcement approach wherein each data point is not initially labelled. The learning algorithm begins by guessing the corresponding output for each point, and is then told whether it was correct, gradually tuning the weights with each such piece of feedback. Another example is an unsupervised approach where input data points are not labelled at all and the learning algorithm is instead left to infer its own structure in the experience data. The term “training” herein does not necessarily limit to a supervised, reinforcement or unsupervised approach.
A machine learning model (also known as a “knowledge model”) can also be formed from more than one constituent neural network. An example of this is an auto encoder, as illustrated by way of example in
The encoder is sometimes referred to as an inference network in that it infers the latent vector Z from an input observation Xo. The decoder is sometimes referred to as a generative network in that it generates a version {circumflex over (X)} of the input feature space from the latent vector Z.
Once trained, the auto encoder can be used to impute missing values from a subsequently observed feature vector Xo. Alternatively or additionally, a third network can be trained to predict a classification Y from the latent vector, and then once trained, used to predict the classification of a subsequent, unlabelled observation.
It is identified herein that conventional VAEs perform particularly poorly when the feature space of the input vector comprises mixed types of data. For example, in a medical setting, one or more of the features in the input feature space may be categorical values (e.g. a yes/no answer to a questionnaire, or gender) whilst one or more others may be continuous numerical values (e.g. height, or weight). Contrast for example with the case of image recognition where all the input features may represent pixel values.
In a VAE, the performance of any imputation or prediction performed based on the latent vector depends on the dimensionality of the latent space. In other words, the more elements (i.e. the greater the number of dimensions) included in the latent vector, the better the performance (where performance may be measured in terms of accuracy of prediction compared to a known ground truth in some test data). However, it is identified herein that when it comes to modelling mixed type data, the limiting factor on a conventional VAE is not the size of the latent vector, but rather the mixed nature of the data types. It is identified herein that in such cases, increasing the latent size will not improve the performance significantly. On the other hand, the computational complexity (in terms of both training and prediction or imputation) will continue to scale with the dimensionality of the latent space (the number of elements in the latent vector Z) even if increasing the dimensionality is no longer increasing performance. Hence in applications handling mixed types of data, conventional VAEs do not make efficient use of the computational complexity incurred.
It would be desirable to provide a machine learning model which can handle mixed types of data with reduced computational complexity for a given performance, or improved performance for a given computational complexity.
According to one aspect disclosed herein, there is provided a method comprising a first and a second stage. In the first stage, the method comprises training each of a plurality of individual first variational auto encoders, VAEs, each comprising an individual respective first encoder arranged to encode a respective subset of one or more features of a feature space into an individual respective first latent representation having one or more dimensions, and an individual respective first decoder arranged to decode from the respective latent representation back to a decoded version of the respective subset of the feature space, wherein different subsets comprise features of different types of data. In the second stage, following the first stage, the method comprises training a second VAE comprising a second encoder arranged to encode a plurality of inputs into a second latent representation having a plurality of dimensions, and a second decoder arranged to decode the second latent representation into decoded versions of the first latent representations, wherein each respective one of the plurality of inputs comprises a combination of a different respective one of the feature subsets with the respective first latent representation.
As the first decoders are trained individually, separately from one another, they can be trained without influencing one another. A second encoder and decoder can then be trained in a subsequent stage to encode into a second latent space and decode back to the individual first latent values, and thus learn the dependencies between the different data types. This two-stage approach, including a stage of separation between the different types of data, provides improved performance when handling mixed data.
In a conventional (“vanilla”) VAE, the dimensionality of the latent space is simply the dimensionality of the single latent vector Z between encoder and decoder. In the presently disclosed approach, the dimensionality is the sum of the dimensionality of the second latent representation (the number of elements in the second latent vector) plus the dimensionality of each of the first latent representations (in embodiments one element each). E.g. the dimensionality may be represented as dim(H)+D, where dim(H) is the number of elements in the second latent vector H, and D is the number of features or feature subsets. However, an issue with a vanilla VAE is that under mixed type data, it cannot make use of the latent space very efficiently. Hence increasing the size of the latent space will not help. On the contrary, since the disclosed method has a two-stage structure, it will actually have a larger latent size if H has the same dimensionality as Z. However, by disentangling the different feature types in the first learning stage, the increase of latent size in the disclosed model gives a significant boost compared with a vanilla VAE. So the latent space and training procedure are designed to make use of the latent space much more efficiently.
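The dim(H)+D arithmetic above can be sketched as a one-line helper (a trivial illustration for concreteness, not part of any library):

```python
# Total latent dimensionality of the two-stage model: dim(H) + D, where
# each of the D first-stage VAEs contributes a one-dimensional latent z_d
# and the second-stage latent vector H contributes dim(H) dimensions.
def vaem_latent_dim(dim_h, num_feature_subsets):
    return dim_h + num_feature_subsets

# e.g. dim(H) = 10 and D = 5 feature subsets gives 15 latent dimensions,
# versus dim(Z) = 10 for a vanilla VAE with the same H-sized latent vector.
total = vaem_latent_dim(10, 5)
```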
This Summary is provided to introduce a selection of concepts in a simplified form that are further described below in the Detailed Description. This Summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used to limit the scope of the claimed subject matter. Nor is the claimed subject matter limited to implementations that solve any or all of the disadvantages noted herein.
To assist understanding of embodiments of the present disclosure and to show how such embodiments may be put into effect, reference is made, by way of example only, to the accompanying drawings in which:
Deep generative models often perform poorly in real-world applications due to the heterogeneity of natural data sets. Heterogeneity arises from having different types of features (e.g. categorical, continuous, etc.) each with their own marginal properties which can be drastically different. “Marginal” refers to the distribution of different possible values of the feature versus the number of samples, disregarding co-dependency with other features. In other words the shape of the distribution for different types of feature can be quite different. The types of data may include for example: categorical (the value of the feature takes one of a plurality of non-numerical categories), ordinal (integer numerical values) and/or continuous (continuous numerical values). A VAE will try to optimize all likelihood functions at once. In practice some likelihood functions may have larger values, hence the VAE will pay attention to a particular likelihood function and ignore others. In this case, the contribution that each likelihood makes to the training objective can be very different, leading to challenging optimization problems in which some data dimensions may be poorly-modelled in favour of others.
Using VAEs for modelling mixed type real-world data is under-explored in the literature, especially when combined with downstream decision-making tasks. To overcome the limitations of VAEs in this setting, the present disclosure provides a new method which may be referred to as a variational auto-encoder for heterogeneous mixed type data (VAEM). Later some examples of its performance for decision making in real-world applications are studied. VAEM uses a hierarchy of latent variables which is fit in two stages. In the first stage, one type-specific VAE is learned for each dimension. These initial one-dimensional VAEs capture marginal distribution properties and provide a latent representation that is uniform across dimensions. In the second stage, another VAE is used to capture dependencies among the one-dimensional latent representations from the first stage.
Thus there is provided an improved model for heterogeneous mixed type data which alleviates the limitations of conventional VAEs. In embodiments the VAEM employs a deep generative model for the heterogeneous mixed type data.
The disclosure herein will study the data generation quality of VAEM comparing with VAEs and other baselines on five different datasets (e.g. see
In embodiments, VAEM may be extended to handle missing data, perform conditional data generation, and employ algorithms that enable it to be used for efficient sequential active information acquisition. It will be shown herein that VAEM obtains strong performance for conditional data generation as well as sequential active information acquisition in cases where VAEs perform poorly.
The two-stage VAEM model will be discussed in more detail shortly with reference to
The computing apparatus 200 comprises a controller 202, an interface 204, and an artificial intelligence (AI) algorithm 206. The controller 202 is operatively coupled to each of the interface 204 and the AI algorithm 206.
Each of the controller 202, interface 204 and AI algorithm 206 may be implemented in the form of software code embodied on computer readable storage and run on processing apparatus comprising one or more processors such as CPUs, work accelerator co-processors such as GPUs, and/or other application specific processors, implemented on one or more computer terminals or units at one or more geographic sites. The storage on which the code is stored may comprise one or more memory devices employing one or more memory media (e.g. electronic or magnetic media), again implemented on one or more computer terminals or units at one or more geographic sites. In embodiments, one, some or all of the controller 202, interface 204 and AI algorithm 206 may be implemented on the server. Alternatively, a respective instance of one, some or all of these components may be implemented in part or even wholly on each of one, some or all of the one or more user terminals. In further examples, the functionality of the above-mentioned components may be split between any combination of the user terminals and the server. Again it is noted that, where required, distributed computing techniques are in themselves known in the art. It is also not excluded that one or more of these components may be implemented in dedicated hardware.
The controller 202 comprises a control function for coordinating the functionality of the interface 204 and the AI algorithm 206. The interface 204 refers to the functionality for receiving and/or outputting data. The interface 204 may comprise a user interface (UI) for receiving and/or outputting data to and/or from one or more users, respectively; or it may comprise an interface to one or more other, external devices which may provide an interface to one or more users. Alternatively the interface may be arranged to collect data from and/or output data to an automated function or equipment implemented on the same apparatus and/or one or more external devices, e.g. from sensor devices such as industrial sensor devices or IoT devices. In the case of interfacing to an external device, the interface 204 may comprise a wired or wireless interface for communicating, via a wired or wireless connection respectively, with the external device. The interface 204 may comprise one or more constituent types of interface, such as voice interface, and/or a graphical user interface.
The interface 204 is thus arranged to gather observations (i.e. observed values) of various features of an input feature space. It may for example be arranged to collect inputs entered by one or more users via a UI front end, e.g. microphone, touch screen, etc.; or to automatically collect data from unmanned devices such as sensor devices. The logic of the interface may be implemented on a server, and arranged to collect data from one or more external user devices such as user devices or sensor devices. Alternatively some or all of the logic of the interface 204 may be implemented on the user device(s) or sensor device(s) themselves.
The controller 202 is configured to control the AI algorithm 206 to perform operations in accordance with the embodiments described herein. It will be understood that any of the operations disclosed herein may be performed by the AI algorithm 206, under control of the controller 202 to collect experience data from the user and/or an automated process via the interface 204, pass it to the AI algorithm 206, receive predictions back from the AI algorithm and output the predictions to the user and/or automated process through the interface 204.
The machine learning (ML) algorithm 206 comprises a machine-learning model 208, comprising one or more constituent neural networks 101. A machine-learning model 208 such as this may also be referred to as a knowledge model. The machine learning algorithm 206 also comprises a learning function 209 arranged to tune the weights w of the nodes 104 of the neural network(s) 101 of the machine-learning model 208 according to a learning process, e.g. training based on a set of training data.
Each node 104 represents a function of the input value(s) received on its input edge(s) 106i, the outputs of the function being output on the output edge(s) 106o of the respective node 104, such that the value(s) output on the output edge(s) 106o of the node 104 depend on the respective input value(s) according to the respective function. The function of each node 104 is also parametrized by one or more respective parameters w, sometimes also referred to as weights (not necessarily weights in the sense of multiplicative weights, though that is certainly one possibility). Thus the relation between the values of the input(s) 106i and the output(s) 106o of each node 104 depends on the respective function of the node and its respective weight(s).
Each weight could simply be a scalar value. Alternatively, as shown in
As shown in
The different weights of the various nodes 104 in the neural network 101 can be gradually tuned based on a set of experience data (training data), so as to tend towards a state where the output 108o of the network will produce a desired value for a given input 108i. For instance, before being used in an actual application, the neural network 101 may first be trained for that application. Training comprises inputting experience data in the form of training data to the inputs 108i of the graph and then tuning the weights w of the nodes 104 based on feedback from the output(s) 108o of the graph. The training data comprises multiple different input data points, each comprising a value or vector of values corresponding to the input edge or edges 108i of the graph 101.
For instance, consider a simple example as in
The classification Y could be a scalar or a vector. For instance in the simple example of the elephant-recognizer, Y could be a single binary value representing either elephant or not elephant, or a soft value representing a probability or confidence that the image comprises an image of an elephant. Or similarly, if the neural network 101 is being used to test for a particular medical condition, Y could be a single binary value representing whether the subject has the condition or not, or a soft value representing a probability or confidence that the subject has the condition in question. As another example, Y could comprise a “1-hot” vector, where each element represents a different animal or condition. E.g. Y=[1, 0, 0, . . . ] represents an elephant, Y=[0, 1, 0, . . . ] represents a hippopotamus, Y=[0, 0, 1, . . . ] represents a rhinoceros, etc. Or if soft values are used, Y=[0.81, 0.12, 0.05, . . . ] represents an 81% confidence that the image comprises an image of an elephant, 12% confidence that it comprises an image of a hippopotamus, 5% confidence of a rhinoceros, etc.
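The soft output vector above can be turned into a label and confidence by taking the highest-scoring element. A minimal sketch (the label names are the illustrative ones from the text, not part of any real model):

```python
# Interpret a soft classification vector Y as (label, confidence) by
# selecting the element with the highest confidence.
def predict_label(y, labels=("elephant", "hippopotamus", "rhinoceros")):
    i = max(range(len(y)), key=lambda k: y[k])  # index of the largest element
    return labels[i], y[i]

label, confidence = predict_label([0.81, 0.12, 0.05])
# → ("elephant", 0.81)
```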
In the training phase, the true value of Yi for each data point i is known. With each training data point i, the AI algorithm 206 measures the resulting output value(s) at the output edge or edges 108o of the graph, and uses this feedback to gradually tune the different weights w of the various nodes 104 so that, over many observed data points, the weights tend towards values which make the output(s) 108o (Y) of the graph 101 as close as possible to the actual observed value(s) in the experience data across the training inputs (for some measure of overall error). I.e. with each piece of input training data, the predetermined training output is compared with the actual observed output of the graph 108o. This comparison provides the feedback which, over many pieces of training data, is used to gradually tune the weights of the various nodes 104 in the graph toward a state whereby the actual output 108o of the graph will closely match the desired or expected output for a given input 108i. Examples of such feedback techniques include for instance stochastic back-propagation.
Once trained, the neural network 101 can then be used to infer a value of the output 108o (Y) for a given value of the input vector 108i (X), or vice versa.
Explicit training based on labelled training data is sometimes referred to as a supervised approach. Other approaches to machine learning are also possible. For instance another example is the reinforcement approach. In this case, the neural network 101 begins making predictions of the classification Yi for each data point i, at first with little or no accuracy. After making the prediction for each data point i (or at least some of them), the AI algorithm 206 receives feedback (e.g. from a human) as to whether the prediction was correct, and uses this to tune the weights so as to perform better next time. Another example is referred to as the unsupervised approach. In this case the AI algorithm receives no labelling or feedback and instead is left to infer its own structure in the experienced input data.
The encoder 208q is arranged to receive the observed feature vector Xo as an input and encode it into a latent vector Z (a representation in a latent space). The decoder 208p is arranged to receive the latent vector Z and decode back to the original feature space of the feature vector. The version of the feature vector output by the decoder 208p may be labelled herein {circumflex over (X)}.
The latent vector Z is a compressed (i.e. encoded) representation of the information contained in the input observations Xo. No one element of the latent vector Z necessarily represents directly any real world quantity, but the vector Z as a whole represents the information in the input data in compressed form. It could be considered conceptually to represent abstract features abstracted from the input data Xo, such as “wrinklyness” and “trunk-like-ness” in the example of elephant recognition (though no one element of the latent vector Z can necessarily be mapped onto any one such factor, and rather the latent vector Z as a whole encodes such abstract information). The decoder 208p is arranged to decode the latent vector Z back into values in a real-world feature space, i.e. back to an uncompressed form {circumflex over (X)} representing the actual observed properties (e.g. pixel values). The decoded feature vector {circumflex over (X)} has the same number of elements representing the same respective features as the input vector Xo.
The weights w of the inference network (encoder) 208q are labelled herein ø, whilst the weights w of the generative network (decoder) 208p are labelled θ. Each node 104 applies its own respective weight as illustrated in
With each data point in the training data (each data point in the experience data during learning), the learning function 209 tunes the weights ø and θ so that the VAE 208 learns to encode the feature vector X into the latent space Z and back again. For instance, this may be done by minimizing a measure of divergence between qø(Zi|Xi) and pθ(Xi|Zi), where qø(Zi|Xi) is a function parameterised by ø representing a vector of the probabilistic distributions of the elements of Zi output by the encoder 208q given the input values of Xi, whilst pθ(Xi|Zi) is a function parameterized by θ representing a vector of the probabilistic distributions of the elements of Xi output by the decoder 208p given Zi. The symbol “|” means “given”. The model is trained to reconstruct Xi and therefore maintains a distribution over Xi. At the “input side”, the value of Xoi is known, and at the “output side”, the likelihood of {circumflex over (X)}i under the output distribution of the model is evaluated. Typically p(z|x) is referred to as the posterior, and q(z|x) as the approximate posterior. p(z) and q(z) are referred to as priors.
For instance, this may be done by minimizing the Kullback-Leibler (KL) divergence between qø(Zi|Xi) and pθ(Xi|Zi). In practice the tuning may be performed by maximizing an ELBO (evidence lower bound) objective, equivalently minimizing the negative ELBO as a cost function based on gradient descent. An ELBO function may be referred to herein by way of example, but this is not limiting and other metrics and functions are also known in the art for tuning the encoder and decoder networks of a VAE.
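The KL term of such an ELBO has a well-known closed form when the approximate posterior is a diagonal Gaussian and the prior is a standard normal. A minimal sketch of that one term (the reconstruction term and the gradient-descent loop are omitted):

```python
import math

def gaussian_kl_to_standard_normal(mu, sigma):
    """Closed-form KL( N(mu, sigma^2) || N(0, 1) ) for a diagonal Gaussian,
    summed over latent dimensions: 0.5 * sum(mu^2 + sigma^2 - 1 - log sigma^2).
    This is the regularization term of the ELBO."""
    return sum(0.5 * (m * m + s * s - 1.0 - math.log(s * s))
               for m, s in zip(mu, sigma))

# The KL is zero exactly when the approximate posterior equals the prior:
kl = gaussian_kl_to_standard_normal([0.0, 0.0], [1.0, 1.0])
# → 0.0
```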
The requirement to learn to encode to Z and back again amounts to a constraint placed on the overall neural network 208 of the VAE formed from the constituent neural networks of the encoder and decoder 208q, 208p. This is the general principle of an autoencoder. The purpose of forcing the autoencoder to learn to encode and then decode a compressed form of the data is that this can achieve one or more advantages in the learning compared to a generic neural network; such as learning to ignore noise in the input data, making better generalizations, or because when far away from a solution the compressed form gives better gradient information about how to quickly converge to a solution. In a variational autoencoder, the latent vector Z is subject to an additional constraint that it follows a predetermined form of probabilistic distribution such as a multidimensional Gaussian distribution or gamma distribution.
There are a number of ways that a VAE 208 can be used for a practical purpose. One use is, once the VAE has been trained, to generate a new, unobserved instance of the feature vector {circumflex over (X)} by inputting a random or unobserved value of the latent vector Z into the decoder 208p. For example if the feature space of X represents the pixels of an image, and the VAE has been trained to encode and decode human faces, then by inputting a random value of Z into the decoder 208p it is possible to generate a new face that did not belong to any of the sampled subjects during training. E.g. this could be used to generate a fictional character for a movie or video game.
Another use is to impute missing values. In this case, once the VAE has been trained, another instance of an input vector Xo may be input to the encoder 208q with missing values. I.e. no observed value of one or more (but not all) of the elements of the feature vector Xo. The values of these elements (representing the unobserved features) may be set to zero, or 50%, or some other predetermined value representing “no observation.” The corresponding element(s) in the decoded version of the feature vector {circumflex over (X)} can then be read out from the decoder 208p in order to impute the missing value(s). The VAE may also be trained using some data points that have missing values of some features.
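The imputation procedure above can be sketched as follows. The `encoder` and `decoder` arguments are stand-ins for the trained networks (assumptions for the example, not a real API):

```python
def impute(x_observed, mask, encoder, decoder, missing_value=0.0):
    """Impute missing features with a trained auto-encoder.

    mask[i] is True where feature i was observed. Unobserved entries are
    replaced by a predetermined "no observation" value before encoding;
    their imputed values are then read out of the decoded vector X-hat.
    """
    x_in = [v if m else missing_value for v, m in zip(x_observed, mask)]
    x_hat = decoder(encoder(x_in))
    # Keep the observed values; take the missing ones from the decoded version.
    return [v if m else h for v, h, m in zip(x_observed, x_hat, mask)]
```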
Another possible use of a VAE is to predict a classification, similarly to the idea described in relation to
An improved method of forming a machine learning model 208′, in accordance with embodiments disclosed herein, is now described with reference to
The model is trained in two stages. In a first stage, an individual VAE is trained for each of the individual features or feature types, without one influencing another. In a second stage, a further VAE is then trained to learn the inter-feature dependencies.
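The two stages can be sketched at a high level as follows, assuming a hypothetical `train_vae(data)` helper that trains an ordinary VAE and returns an object with an `encode` method; none of these names come from a real library.

```python
# Sketch of the two-stage VAEM training procedure described above.
def train_vaem(feature_subsets, train_vae):
    # Stage 1: one individual VAE per feature subset, trained separately
    # so that the different data types do not influence one another.
    marginal_vaes = [train_vae(x_d) for x_d in feature_subsets]
    z = [vae.encode(x_d) for vae, x_d in zip(marginal_vaes, feature_subsets)]
    # Stage 2: a second VAE over the (x_d, z_d) combinations learns the
    # dependencies between the different data types.
    dependency_vae = train_vae(list(zip(feature_subsets, z)))
    return marginal_vaes, dependency_vae
```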
Both a vanilla VAE and the disclosed form of VAE use multiple likelihood functions. However, the issue with a vanilla VAE is that it tries to optimize all likelihood functions at once. In practice some likelihood functions may have larger values, hence the VAE will pay attention to a particular likelihood function and ignore others. In contrast, the disclosed method optimizes all likelihood functions separately, which mitigates this issue.
As shown in
The number of subsets may be labelled herein d=1 . . . D, where d is an index of the subset and D is the total number of subsets. In embodiments, each subset Xod is only a single respective feature. E.g. one feature Xo1 could be gender, another feature Xo2 could be age, whilst another feature Xo3 could be weight (such as in an example for predicting or imputing a medical condition of a user). Alternatively features of the same type could be grouped together into the subset trained by one of the individual VAEs. E.g. one subset Xo1 could consist of categorical variables, another subset Xo2 could consist of ordinal variables, whilst another subset Xo3 could consist of continuous variables.
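The grouping-by-type alternative above can be sketched as follows; the type names and the feature dictionary are illustrative assumptions for the example.

```python
# Group features into the subsets X_o^d by declared data type, so that
# each individual first-stage VAE is trained on features of one type.
def group_by_type(features):
    subsets = {}
    for name, (ftype, value) in features.items():
        subsets.setdefault(ftype, {})[name] = value
    return subsets

subsets = group_by_type({
    "gender": ("categorical", "female"),
    "age":    ("ordinal", 42),
    "weight": ("continuous", 70.5),
})
# → three subsets, keyed "categorical", "ordinal" and "continuous"
```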
Each individual VAE comprises a respective first encoder 208qd (d=1 . . . D) arranged to encode the respective feature Xod into a respective latent representation (i.e. latent space) Zd. Each individual VAE also comprises a respective first decoder 208pd (d=1 . . . D) arranged to decode the respective latent representation Zd back into the respective dimension(s) of the feature space of the respective subset of features, i.e. to generate a decoded version {circumflex over (X)}d of the respective observed feature subset Xod. So Xo1 is encoded into Z1 and then decoded into {circumflex over (X)}1, whilst Xo2
In embodiments each of the latent representations Zd is one-dimensional, i.e. consists of only a single latent variable (element). Note however this does not imply the latent variable Zd is modelled only as a simple, fixed scalar value. Rather, as the auto-encoder is a variational auto-encoder, for each latent variable Zd the encoder learns a statistical or probabilistic distribution, and the value input to the decoder is a random sample from the distribution. This means that for each individual element of latent space, the encoder learns one or more parameters of the respective distribution, e.g. a measure of centre point and spread of the distribution. For instance each latent variable Zd (a single dimension) may be modelled in the encoder by a respective mean value μd and standard deviation σd or variance σd2. The possibility of a multi-dimensional Zd is also not excluded (in which case each dimension is modelled by one or more parameters of a respective distribution), though this would increase computational complexity, and generally the idea of a latent representation is that it compresses the information from the input feature space into a lower dimensionality.
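Sampling a one-dimensional latent from its learned mean and standard deviation is commonly done via the reparameterization trick, which keeps the sample differentiable with respect to μd and σd. A minimal sketch:

```python
import random

# Reparameterized sample of a one-dimensional latent z_d ~ N(mu_d, sigma_d^2):
# draw noise from a standard normal and shift/scale it by the learned
# mean and standard deviation.
def sample_latent(mu_d, sigma_d, rng=random):
    eps = rng.gauss(0.0, 1.0)
    return mu_d + sigma_d * eps

# With zero spread the sample collapses to the mean:
z = sample_latent(2.0, 0.0)
# → 2.0
```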
In the first stage, each individual VAE is trained (i.e. has its weights tuned) by the learning function 209 (e.g. an ELBO function) to minimize a measure of difference between the respective observed feature subset Xod and the respective decoded version of that feature subset {circumflex over (X)}d.
As shown in
At the input of the second encoder 208qH, each of the feature subsets Xod is combined with its respective latent vector Zd (using the values of Zd learned using the first VAE in the first stage). In embodiments this combination comprises concatenating each feature subset Xod with its respective latent vector Zd. However in principle any function which combines the information of the two could be used, e.g. a multiplication, or interleaving, etc. Whatever function is used, each such combination forms one of the inputs of the second encoder 208qH. The second encoder 208qH is arranged to encode these inputs into a second latent representation in the form of a latent vector H, having multiple dimensions (with each dimension—i.e. each element of the vector—being modelled as a respective distribution, so represented in terms of one or more parameters of the respective distribution, e.g. a respective mean and variance or standard deviation). H is also referred to later as h (in the vector form), not to be confused with h(⋅) the function.
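The concatenation variant of the combination described above can be sketched as a small helper (one possible choice among the combinations mentioned in the text):

```python
# Form the second-stage encoder inputs by concatenating each feature
# subset x_d with its one-dimensional first-stage latent z_d.
def second_stage_inputs(feature_subsets, first_latents):
    return [list(x_d) + [z_d]
            for x_d, z_d in zip(feature_subsets, first_latents)]

inputs = second_stage_inputs([[0.3], [1.0, 0.0]], [0.7, -0.2])
# → [[0.3, 0.7], [1.0, 0.0, -0.2]]
```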
The second decoder 208pH is arranged to decode the second latent representation H back into a version of the individual first latent representations {circumflex over (Z)}d (d=1 . . . D). In the second learning stage, the second VAE is trained (i.e. has its weights tuned) by the learning function 209 to minimize a measure of difference between the first latent representation Z and the decoded version thereof {circumflex over (Z)} (where Z is the vector made up of the individual first latent representations Z1, Z2, Z3, . . . ; and {circumflex over (Z)}1, {circumflex over (Z)}2, {circumflex over (Z)}3, . . . are the corresponding decoded versions). In
A more abstracted, higher-level representation of the model 208′ is shown in
Based on this two-stage approach, the model thus learns first to disentangle the dependencies between different data types, and then to learn the effect of the dependencies between data types.
The computational complexity of an auto-encoder increases with the dimensionality of the latent space. For instance, consider a conventional VAE 208 as shown in
As shown in
In another example, the model 208′ can be used to impute missing values in the input feature vector Xo. Following training, a subsequent observed instance of the feature vector Xo may be input to the second encoder 208qH, wherein this instance of the feature vector Xo which has some (but not all) of the features (i.e. elements) of the feature vector missing (i.e. unobserved). The missing elements may be set to zero, 50% or some other predetermined value representing “no observation”. The value(s) of the corresponding features (i.e. same elements) of the feature space can then be read out from the decoded version {circumflex over (X)} of the feature vector, and taken as imputed values of the missing observations. In embodiments, the model 208′ may also be trained using some data points that have one or more missing values.
An issue with this basic method of imputation is that the predetermined value representing “no observation” may still be interpreted by the encoder as if it was a sampled value. E.g. if 0 is used, then the encoder cannot tell the difference between “no observation” and an actual observation of zero (e.g. a black pixel, or a sensor reading of zero, etc.). Similar issues may apply if, say, a predetermined value of 50% probability is used.
Each value v is combined with its respective embedding, e.g. by multiplication or concatenation, etc. In embodiments multiplication is used here, but it could be any operator that combines the information from the two. The embedding e is the coordinate of the respective input—it tells the encoder which element is being input at that input. E.g. this could be a coordinate of a pixel or an index of the feature d.
Each individual neural network h(⋅) outputs a vector. These vectors are combined by a permutation invariant operator g, such as a summation. A permutation invariant operator is an operator which outputs a value—in this case a vector—which depends on the values of the inputs to the operator but which is independent of the order of those inputs. Furthermore, this output vector is of a fixed size regardless of the number of inputs to the operator. This means that g(⋅) can supply a vector c of a given format regardless of which inputs are present and which are not, and the order in which they are supplied. This enables the encoder 208qH to handle missing inputs.
The encoder 208qH comprises a further, common neural network f(⋅) which is common to all of the inputs v. The output c of the permutation invariant operator g(⋅) is supplied to the input of this further neural network f(⋅). This neural network encodes the output c of g(⋅) into the second latent vector H (also labelled h, as a vector rather than a function, in the later working). In embodiments the further neural network f(⋅) is used, rather than just using c directly, because the number of observed features is not fixed. Therefore a common function f is preferably first applied to all observed features.
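A minimal sketch of this permutation-invariant encoding follows. The hand-coded maps below are toy stand-ins for the learned networks h(⋅) and f(⋅); a real implementation would use trained neural networks:

```python
import math

def shared_net(value, embedding):
    """Toy stand-in for the per-input network h(.): maps a (value, embedding)
    pair to a fixed-size vector. Fixed output size K = 2 here."""
    x = value * embedding          # combine value with its coordinate embedding
    return [x, math.tanh(x)]

def permutation_invariant_encode(observed):
    """observed: dict {feature_index: value}, possibly with entries missing.
    g(.) is an elementwise sum: the output c has a fixed size regardless of
    how many inputs are present, and is independent of their order."""
    c = [0.0, 0.0]
    for d, v in observed.items():
        k = shared_net(v, float(d))
        c = [a + b for a, b in zip(c, k)]
    return c  # c would then be fed to the common network f(.) to produce H

# Same result whatever order the observations arrive in, and with any subset:
assert permutation_invariant_encode({1: 0.5, 3: 0.2}) == permutation_invariant_encode({3: 0.2, 1: 0.5})
```

Because the aggregated vector c has a fixed size, the downstream network f(⋅) sees a fixed-format input even when some features are unobserved.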
In an optional additional application of the disclosed model, a reward function R1 may be used to determine which observation to make next following the first and second stages of training of the model 208′. The reward function is a function of the observations obtained so far, and represents the amount of new information that would be added by observing a given missing input. By determining which currently missing feature maximizes the reward function (or equivalently minimizes a cost function), this determines which of the unobserved inputs would be the most informative input to collect next. It represents the fact that some inputs have a greater dependency on one another than others, so the input that is least correlated with the other, already-observed inputs will provide the most new information. The reward function is evaluated for a plurality of different candidate unobserved features, and the feature which maximises the reward (or minimizes the cost) will be the feature that gives the most new information by being observed next. In some cases the model 208′ may then undergo another cycle of the first and second training stage, now incorporating the new observation. Alternatively the new observation could be used to improve the quality of a prediction, or simply be used by a human analyst such as a doctor in conjunction with the result (e.g. classification Y or an inputted missing feature Xd) of the already-trained model 208′.
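The greedy selection step described above may be sketched as follows (the feature names and reward values are hypothetical, purely for illustration; the reward estimates themselves would come from the trained model):

```python
def next_feature_to_observe(candidate_rewards):
    """Greedy acquisition step: pick the currently-unobserved feature whose
    estimated information reward is largest (equivalently, whose cost is
    smallest). candidate_rewards maps feature name -> estimated reward R1."""
    return max(candidate_rewards, key=candidate_rewards.get)

# Hypothetical reward estimates for three as-yet-unobserved features:
rewards = {"blood_pressure": 0.12, "age": 0.40, "cholesterol": 0.31}
best = next_feature_to_observe(rewards)   # → "age"
```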
Note that while examples herein have been described as using labelled training data, the disclosed techniques are not limited to a supervised approach. More generally “training” herein could refer to any of supervised, reinforcement or unsupervised learning. The disclosed method is a specific way to obtain a model that can model datasets with mixed-type variables. Once the model is trained, it can be used in many ways such as reinforcement learning and prediction.
Some example implementation details of various concepts discussed above will now be discussed further by way of illustration.
In order to properly handle the mixed type data with heterogeneous marginals, the proposed method fits the data in a two-stage procedure. As shown in
Stage one: training individual marginal VAEs for each single variable. In the first stage, we focus on modelling the marginal distribution of each variable, by training D individual VAEs $p_{\theta_d}(x_d) = \int p(z_d)\, p_{\theta_d}(x_d \mid z_d)\, dz_d$, one for each variable d. Each marginal VAE is trained by maximising the per-variable ELBO:

$$\mathcal{L}_d = \mathbb{E}_{q_{\phi_d}(z_d \mid x_d)}\left[\log p_{\theta_d}(x_d \mid z_d)\right] - \mathrm{KL}\left(q_{\phi_d}(z_d \mid x_d) \,\|\, p(z_d)\right)$$

where $p(z_d)$ is the standard Gaussian prior and $q_{\phi_d}(z_d \mid x_d)$ is the variational posterior (the encoder) of the d-th marginal VAE.
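Assuming a Gaussian posterior and the standard normal prior, the terms of this per-variable ELBO have simple closed forms, sketched below (illustrative only; a practical implementation would use a deep-learning framework with learned encoder/decoder networks):

```python
import math

def kl_to_standard_normal(mu, log_var):
    """Analytic KL( N(mu, sigma^2) || N(0, 1) ): the regulariser term of the
    per-variable ELBO when q is Gaussian and p(z_d) = N(0, 1)."""
    return 0.5 * (math.exp(log_var) + mu * mu - 1.0 - log_var)

def gaussian_log_lik(x, mean, sigma):
    """Reconstruction term log p(x_d | z_d) for a continuous variable,
    modelled here as a Gaussian with fixed observation noise sigma."""
    return -0.5 * math.log(2.0 * math.pi * sigma ** 2) - (x - mean) ** 2 / (2.0 * sigma ** 2)

def marginal_elbo_sample(x, mu, log_var, decode, eps, sigma=0.1):
    """Single-sample Monte Carlo estimate of the per-variable ELBO:
    E_q[log p(x|z)] - KL(q(z|x) || N(0,1)), with z reparameterised as
    z = mu + exp(log_var / 2) * eps."""
    z = mu + math.exp(0.5 * log_var) * eps
    return gaussian_log_lik(x, decode(z), sigma) - kl_to_standard_normal(mu, log_var)
```

Maximising this quantity over the encoder parameters (mu, log_var as functions of x) and the decoder is what "fitting a marginal VAE" amounts to for a continuous variable; categorical or ordinal variables would swap in an appropriate likelihood term.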
Stage two: training a dependency network to assemble the marginal VAEs. In the second stage, we model the inter-variable statistical dependencies by training a new multi-dimensional VAE $p_\psi(z) = \int p(h)\, p_\psi(z \mid h)\, dh$, called the dependency network, which is built on top of the latent representations z provided by the encoders of the marginal VAEs from the first stage. Specifically, we train $p_\psi(z)$ by maximising:

$$\mathbb{E}_{q_\lambda(h \mid z)}\left[\log p_\psi(z \mid h)\right] - \mathrm{KL}\left(q_\lambda(h \mid z) \,\|\, p(h)\right)$$

where h is the latent space of the dependency network. The above procedure effectively disentangles the intra-variable heterogeneous properties of mixed type data (modelled by the marginal VAEs) from the inter-variable dependencies (modelled by the dependency network). We call our model the VAE for heterogeneous mixed type data (VAEM).
After training the marginal VAEs and the dependency network, our final generative model is given by:

$$p_{\theta,\psi}(x) = \int p(h)\, p_\psi(z \mid h) \prod_{d=1}^{D} p_{\theta_d}(x_d \mid z_d)\, dz\, dh$$
To handle complicated statistical dependencies, we utilize the VampPrior, which uses a mixture of Gaussians (MoGs) as the prior distribution for the high-level latent variable, i.e.,

$$p(h) = \frac{1}{K} \sum_{k=1}^{K} q_\lambda(h \mid u_k)$$

where K ≪ N and the pseudo-inputs $u_k$ are a subset of data points.
In generic machine learning applications, normalization is considered to be an essential preprocessing step. For example, it is common to first normalize the data to have zero mean and unit standard deviation. However, for mixed-type data, no standard normalization method can be applied. With our VAEM, each marginal VAE is trained independently to model the heterogeneous properties of its data dimension, thus transforming the mixed type data xd to a continuous representation zd. The collection of the zd forms the aggregated posterior, which is close to a standard normal distribution thanks to the regularization effect of the prior p(zd). In this way, we overcome the heterogeneous mixed-type problem, and the dependency VAE can focus on learning the relationships among variables.
We further extend our method for decision making under uncertainty. In particular, we focus on the sequential active information acquisition application as an exemplar case. With this application context, we present the extension of using our model in the presence of missing data and Lindley information estimation.
Suppose that for a data instance x, we are interested in predicting a target $x_\Phi \subseteq x_U$ given the currently observed $x_O$ (with $x_\Phi \cap x_O = \emptyset$), where $x_O$ denotes the set of currently observed variables and $x_U$ the unobserved ones. One important problem is sequential active information acquisition (SAIA): how can we decide which variable $x_i \in x_{U \setminus \Phi}$ is the best one to observe next, so that we can optimally increase our knowledge (e.g., predictive ability) regarding $x_\Phi$?
To solve this problem, we need:
1) a good generative model that can handle missing data, and which can effectively generate conditional samples from $p(x_U \mid x_O)$; and
2) the ability to estimate a reward function, in this case the Lindley information, to enable decision making based on generative models.
We now present our extensions of VAEM that fulfil these two requirements.
The amortized inference approach of VAEM cannot handle partially observed data as it stands, since the dimensionality of the observed variables $x_O$ might vary across different data instances. We apply a PointNet encoding structure to build a partial inference network for the dependency VAE, to infer h based on partial observations in an amortized manner. Specifically, at the first stage, we estimate each marginal VAE with only the observed samples for that dimension:

$$\max_{\theta_d, \phi_d} \sum_{n:\, d \in O_n} \mathbb{E}_{q_{\phi_d}(z_{nd} \mid x_{nd})}\left[\log p_{\theta_d}(x_{nd} \mid z_{nd})\right] - \mathrm{KL}\left(q_{\phi_d}(z_{nd} \mid x_{nd}) \,\|\, p(z_{nd})\right)$$

where $x_{nd}$ denotes the d-th variable of the n-th data instance, and $O_n$ the set of variables observed for that instance.
At the second stage, a VAE which can handle partial observations is needed. Similarly to the partial-VAE, the dependency VAE in the presence of missing data is defined by

$$p_\psi(z_O) = \int p(h)\, p_\psi(z_O \mid h)\, dh$$
This is trained by maximising the partial ELBO:

$$\log p_\psi(z_O) \geq \mathbb{E}_{q_\lambda(h \mid z_O, x_O)}\left[\log p_\psi(z_O \mid h) + \log p(h) - \log q_\lambda(h \mid z_O, x_O)\right]$$
where h is the latent space of the dependency network, and $q_\lambda(h \mid z_O, x_O)$ is a set function, the so-called partial inference net. Each observed element is first mapped by a shared feature network $h(\cdot): \mathbb{R}^M \to \mathbb{R}^K$, where M and K are the dimensions of the feature embedding and the feature map, respectively. Finally, we apply a permutation invariant aggregation operation g(⋅), such as summation. In this way, $q_\lambda(h \mid z_O, x_O)$ is invariant to permutations of the elements of $x_O$, and $x_O$ can have arbitrary length.
Once the marginal VAEs and the partial dependency network are trained, we can generate conditional samples from $p_\theta(x_U \mid x_O)$ by the following inference procedure: first, the latent representations $z_d$ for the observed variables are inferred. With these representations, we utilize the partial inference network to infer h, the latent code of the second-stage VAE. From h, we can generate the $z_s$, the latent codes for the unobserved dimensions, and then generate the $x_s$.
$$z_d \sim q_{\phi_d}(z_d \mid x_d) \;\; \forall d \in O, \qquad z_O = \{z_d \mid d \in O\} \qquad \text{Eq. 10}$$
$$h \sim q_\lambda(h \mid z_O, x_O) \qquad \text{Eq. 11}$$
$$z_s \sim p_\psi(z_s \mid h) \;\; \forall s \in U, \qquad z_U = \{z_s \mid s \in U\} \qquad \text{Eq. 12}$$
$$x_s \sim p_{\theta_s}(x_s \mid z_s) \;\; \forall s \in U, \qquad x_U = \{x_s \mid s \in U\} \qquad \text{Eq. 13}$$
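The inference procedure of Eqs. 10–13 may be sketched as follows. The model components are passed in as callables standing in for the trained networks; the deterministic stubs at the bottom are purely illustrative of the data flow, not of the actual learned distributions:

```python
import random

def conditional_sample(x_obs, q_marginal, q_partial, p_dep, p_marginal, unobserved, rng):
    """Conditional generation following Eqs. 10-13: encode each observed x_d to
    z_d, infer the second-stage latent h from the partial set z_O, sample a
    latent z_s for each unobserved dimension, then decode each z_s to x_s."""
    z_obs = {d: q_marginal(d, v, rng) for d, v in x_obs.items()}      # Eq. 10
    h = q_partial(z_obs, x_obs, rng)                                  # Eq. 11
    z_un = {s: p_dep(s, h, rng) for s in unobserved}                  # Eq. 12
    return {s: p_marginal(s, z, rng) for s, z in z_un.items()}        # Eq. 13

# Toy deterministic stand-ins for the trained networks:
q_marg = lambda d, v, rng: v                            # "encode" x_d to z_d
q_part = lambda z_obs, x_obs, rng: sum(z_obs.values())  # aggregate z_O to h
p_dep = lambda s, h, rng: h                             # "decode" h to z_s
p_marg = lambda s, z, rng: z                            # "decode" z_s to x_s

x_u = conditional_sample({0: 0.2, 2: 0.3}, q_marg, q_part, p_dep, p_marg, [1, 3], random)
# x_u imputes the unobserved dimensions 1 and 3
```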
SAIA can be framed as a Bayesian experimental design problem. $x_i \in x_{U \setminus \Phi}$ is selected by the following information reward function:

$$R_I(x_i, x_O) = \mathbb{E}_{p(x_i \mid x_O)}\left[\mathrm{KL}\left(p(x_\Phi \mid x_i, x_O) \,\|\, p(x_\Phi \mid x_O)\right)\right] \qquad \text{Eq. 14}$$
We use a pre-trained partial VAEM model to estimate the required distributions $p(x_i \mid x_O)$, $p(x_\Phi \mid x_i, x_O)$, and $p(x_\Phi \mid x_O)$. Due to the intractability of $\mathrm{KL}\left(p(x_\Phi \mid x_i, x_O) \,\|\, p(x_\Phi \mid x_O)\right)$, we must resort to approximations. An efficient latent-space estimate of $R_I(x_i, x_O)$ is given by:
$$\hat{R}_I(x_i, x_O) = \mathbb{E}_{p_\theta(x_i \mid x_O)}\left[\mathrm{KL}\left(q_\lambda(h \mid z_i, z_O) \,\|\, q_\lambda(h \mid z_O)\right)\right] - \mathbb{E}_{p_\theta(x_i, x_\Phi \mid x_O)}\left[\mathrm{KL}\left(q_\lambda(h \mid z_\Phi, z_i, z_O) \,\|\, q_\lambda(h \mid z_\Phi, z_O)\right)\right] \qquad \text{Eq. 15}$$
Note that for compactness, we omitted the notation for input xo and xi to the partial inference nets. The approximation is very efficient to compute, since all KL terms can be calculated analytically, assuming that the partial inference net qλ(h|zO) is Gaussian (or other common distributions such as normalising flows).
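Since each KL term is then between diagonal Gaussians, it has the following closed form (a sketch; the variable names are illustrative):

```python
import math

def kl_diag_gaussians(mu1, var1, mu2, var2):
    """Analytic KL( N(mu1, diag(var1)) || N(mu2, diag(var2)) ).
    Each KL term in the latent-space reward of Eq. 15 reduces to this closed
    form when the partial inference net outputs diagonal Gaussians over h."""
    kl = 0.0
    for m1, v1, m2, v2 in zip(mu1, var1, mu2, var2):
        kl += 0.5 * (math.log(v2 / v1) + (v1 + (m1 - m2) ** 2) / v2 - 1.0)
    return kl

# KL between identical posteriors is zero; at unit variance, shifting the mean
# by one adds 0.5:
kl_same = kl_diag_gaussians([0.0, 1.0], [1.0, 1.0], [0.0, 1.0], [1.0, 1.0])
kl_shift = kl_diag_gaussians([1.0], [1.0], [0.0], [1.0])
```

This is why the approximation is cheap to evaluate: no sampling is needed for the KL terms themselves, only for the outer expectations over $p_\theta$.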
In active information acquisition, the target of interest $x_\Phi$ (where $x_\Phi \cap x_O = \emptyset$) is often the target that we try to predict. In order to enhance the predictive performance of VAEM, we propose to use the following factorization:
$$p_\theta(x_O, x_\Phi) = p_\theta(x_O)\, p_\lambda(x_\Phi \mid x_O, x_{U \setminus \Phi}, h) \qquad \text{Eq. 16}$$

where $p_\lambda(x_\Phi \mid x_O, x_{U \setminus \Phi}, h)$ is the discriminator that gives a probabilistic prediction of $x_\Phi$ based on both the observed variables $x_O$, the imputed variables $x_{U \setminus \Phi}$ and the global latent representation h (the last of which is optional). The discriminator in Eq. 16 offers additional predictive power for the target $x_\Phi$ of interest.
To evaluate the performance and validity of our proposed VAEM model, we first assess it on the task of mixed type heterogeneous data generation. Then, we compare the performance of conditional mixed type data generation (imputation). Finally, to evaluate the conditional generation quality of our models more comprehensively, we apply VAEM to the task of sequential active information acquisition. In this task, the underlying generative model is asked to generate samples of unobserved variables for each instance, and then to decide which variables to acquire next. The same set of datasets is used for all experiments, which includes two UCI benchmark datasets (Boston housing and Energy), two real-world datasets (Avocado and Bank), and a medical dataset (MIMIC-III). We compare our proposed VAEM with a number of baseline methods.
For our proposed method VAEM, unless specified, we use the partial version proposed above, and the discriminator structure specified by Eq. 16.
Throughout the experiments, we consider a number of baselines. Unless specified, all VAE baselines also use similar partial inference method and discriminator structure. Moreover, all baselines are equipped with MoG priors. Our main baselines include:
We use the same set of mixed type datasets for all tasks. They include:
In this task, we evaluate the quality of our generative model in terms of mixed type data generation. During training, all variables are scaled to the range between 0 and 1. For all datasets, we first train the models and then quantitatively compare their performance on the test set, using a 90%-10% train-test split. All experiments are repeated 5 times over different random seeds.
Visualization by pair plots: For deep generative models, the data generation quality reflects how well the model fits the data. Thus, we first visualize the data generation quality of each model on a representative dataset, Bank marketing. The Bank dataset contains three different data types, each with drastically different marginals, which presents challenges for learning. We fit our models to the Bank dataset, and then generate the pair plots for three of the variables, x0, x1 and x2 (the first two are categorical, the third one continuous) selected from the data (
The vanilla VAE is able to generate the second categorical variable. However, note that the third variable of the dataset (
Quantitative evaluation on all datasets: To evaluate the data generation quality quantitatively, we compute the marginal negative log-likelihood (NLL) of the models on test set. Note that all NLL numbers are divided by the number of variables of the dataset. As shown in Table 1, VAEM can consistently generate realistic samples, and on average significantly outperforms other baselines.
An important aspect of generative models is the ability to perform conditional data generation. That is, given a data instance, to infer the posterior distribution over the unobserved variables xU given xO. For all baselines evaluated in this task, we train the partial version of them (i.e., generative + partial inference net). To train the partial models, we randomly sample 90% of the dataset to be the training set, and remove a random portion (uniformly sampled between 0% and 99%) of the observations each epoch during training. Then, we remove 50% of the test set and use the generative models to make inferences regarding the unobserved data. Since all inferences are probabilistic, we report the test negative log-likelihoods (NLLs) on unobserved data, as opposed to the imputation RMSE typically used in the literature.
Results are summarized in Table 2, where all NLL values have been divided by the number of observed variables. We repeat our experiments for 5 runs and report standard errors. Note how the automatic balancing strategy (VAE-balanced) almost always makes the performance worse. By contrast, Table 2 shows that our proposed method is very robust, yielding significantly better performance than all baselines on 4 out of 5 datasets, and competitive performance on the Energy dataset.
In our final experiments, we apply VAEM to the task of sequential active information acquisition (SAIA) based on the formulation described above. We use this task as an example to showcase how VAEM can be used in decision making under uncertainty. In SAIA, at each step the underlying generative model is asked to generate posterior samples of unobserved variables xU for each data instance x, and then decide which variables to acquire next. SAIA is a perfect task for evaluating generative models on mixed type data, since it integrates data generation, conditional generation, target prediction and decision making into a single task. Deep generative models with efficient inference that can handle partial observations are essential components for SAIA task.
We first pre-train our models and baselines according to the settings outlined above. Then, in SAIA, we actively select variables for each test instance, starting with the empty observation xO=ø. The reward function of VAEM is estimated as described above. We add an additional baseline, denoted VAE-no-disc, which is a VAE without the discriminator structure. This baseline shows the importance of the extension described above in prediction tasks. Other settings are the same as the VAE baseline. All experiments are repeated ten times.
It will be appreciated that the above embodiments have been described by way of example only.
More generally, according to one aspect disclosed herein, there is provided a method comprising: in a first stage, training each of a plurality of individual first variational auto encoders, VAEs, each comprising an individual respective first encoder arranged to encode a respective subset of one or more features of a feature space into an individual respective first latent representation having one or more dimensions, and an individual respective first decoder arranged to decode from the respective latent representation back to a decoded version of the respective subset of the feature space, wherein different subsets comprise features of different types of data; and in a second stage following the first stage, training a second VAE comprising a second encoder arranged to encode a plurality of inputs into a second latent representation having a plurality of dimensions, and a second decoder arranged to decode the second latent representation into decoded versions of the first latent representations, wherein each respective one of the plurality of inputs comprises a combination of a different respective one of the feature subsets with the respective first latent representation.
As the auto encoders are variational auto encoders, each dimension of their latent representation is modelled as a probabilistic distribution. In embodiments, the decoded versions of the features as output by the decoders may also be modelled as distributions, or may be simple scalars. The weights of the nodes in the neural networks may also be modelled as distributions or scalars.
Each of the encoders and decoders may comprise one or more neural networks. The training of each VAE may comprise comparing the features as output by the decoder with the features as input to the encoder, and tuning parameters of the nodes of the neural networks in the VAE to reduce the difference therebetween.
In embodiments, each of said subsets is a single feature.
Alternatively, in embodiments, each of said subsets may comprise more than one feature. In this case the respective features within each subset may be of the same data type as one another, but of a different data type relative to the other subsets.
In embodiments, each of the first latent representations is a single respective one-dimensional latent variable.
Note again however that as the auto encoders are variational auto encoders, each latent variable is nonetheless still modelled as a distribution.
In embodiments, the different data types may comprise two or more of: categorical, ordinal, and continuous.
In embodiments, the different data types may comprise: binary categorical, and categorical with more than two categories.
In embodiments, the features may comprise one or more sensor readings from one or more sensors sensing a material or machine.
In embodiments, the features may comprise one or more sensor readings and/or questionnaire responses from a user relating to the user's health.
In embodiments, a third decoder may be trained to generate a categorization from the second latent representation.
In embodiments, the second encoder may comprise a respective individual second encoder arranged to encode each of a plurality of the feature subsets and/or first latent representations, a permutation invariant operator arranged to combine encoded outputs of the individual second encoders into a fixed size output, and a further encoder arranged to encode the fixed size output into the second latent representation.
In embodiments, said combination may be a concatenation.
Aspects disclosed herein also provide a method of using the second VAE, after having been trained as hereinabove mentioned in any of the aspects or embodiments, to perform a prediction or imputation.
In embodiments, the method may use the second VAE to predict or impute a condition of the material or machine.
In embodiments, the method may use the second VAE to predict or impute a health condition of the user.
In embodiments, the method may use the third decoder together with the second encoder, after having been trained, to predict the categorization of a subsequently observed feature vector of said feature space.
In embodiments the method may use the second VAE, after having been trained, to impute a value of one or more missing features in a subsequently observed feature vector of said feature space, by:
In embodiments the method may use the second encoder after having been trained, to impute one or more unobserved features by:
Another aspect provides a computer program embodied on computer-readable storage and configured so as when run on one or more processing units to perform the method of any of the aspects or embodiments hereinabove defined.
Another aspect provides a computer system comprising: memory comprising one or more memory units, and processing apparatus comprising one or more processing units; wherein the memory stores code arranged to run on the processing apparatus, the code being configured so as when run on the processing apparatus to carry out the method of any of the aspects or embodiments hereinabove defined.
In embodiments, the computer system may be implemented as a server comprising one or more server units at one or more geographic sites, the server arranged to perform one or both of:
In embodiments the network for the purpose of one or both of these services may be a wide area internetwork such as the Internet. In the case of gathering observations, said gathering may comprise gathering some or all of the observations from a plurality of different users through different respective user devices. As another example said gathering may comprise gathering some or all of the observations from a plurality of different sensor devices, e.g. IoT devices or industrial measurement devices.
Another aspect provides use of a variational encoder which has been trained by, in a first stage, training each of a plurality of individual first variational auto encoders, VAEs, each comprising an individual respective first encoder arranged to encode a respective subset of one or more features of a feature space into an individual respective first latent representation having one or more dimensions, and an individual respective first decoder arranged to decode from the respective latent representation back to a decoded version of the respective subset of the feature space, wherein different subsets comprise features of different types of data; and in a second stage following the first stage, training a second VAE comprising a second encoder arranged to encode a plurality of inputs into a second latent representation having a plurality of dimensions, and a second decoder arranged to decode the second latent representation into decoded versions of the first latent representations, wherein each respective one of the plurality of inputs comprises a combination of a different respective one of the feature subsets with the respective first latent representation, the use being of the second variational encoder.
In example applications, the trained model may be employed to predict the state of a condition of a user, such as a disease or other health condition. For example, once trained, the model may receive the answers to questions presented to a user about their health status to provide data to the model. A user interface may be provided to enable questions to be output to a user and to receive responses from the user, for example through a voice or other interface means. In some examples, the user interface may comprise a chatbot. In other examples, the user interface may comprise a graphical user interface (GUI) such as a point-and-click user interface or a touch screen user interface. The trained algorithm may be configured to generate an overall score from the user responses, which provide his or her health data, to predict a condition of the user from that data. In some embodiments, the model can be used to predict the onset of a certain condition of the user, for example, a health condition such as asthma, depression or heart disease.
A user's condition may be monitored by asking questions which are repeated instances of the same question (asking the same thing, i.e. the same question content), and/or different questions (asking different things, i.e. different question content). The questions may relate to a condition of the user in order to monitor that condition. For example, the condition may be a health condition such as asthma, depression, fitness etc. The monitoring could be for the purpose of making a prediction on a future state of the user's condition, e.g. to predict the onset of a problem with the user's health, or for the purpose of information for the user, a health practitioner or a clinical trial etc.
User data may also be provided from sensor devices, e.g. a wearable or portable sensor device worn or carried about the user's person. For example, such a device could take the form of an inhaler or spirometer with embedded communication interface for connecting to a controller and supplying data to the controller. Data from the sensor may be input to the model and form part of the patient data for using the model to make predictions.
Contextual metadata may also be provided for training and using the algorithm. Such metadata could comprise a user's location. A user's location could be monitored by a portable or wearable device disposed about the user's person (plus any one or more of a variety of known localisation techniques such as triangulation, trilateration, multilateration or fingerprinting relative to a network of known nodes, such as WLAN access points, cellular base stations, satellites or anchor nodes of a dedicated positioning network such as an indoor location network).
Other contextual information such as sleep quality may be inferred from personal device data, for example by using a wearable sleep monitor. In further alternative or additional examples, sensor data from e.g. a camera, localisation system, motion sensor and/or heart rate monitor can be used as metadata.
The model may be trained to recognise a particular disease or health outcome. For example, a particular health condition such as a certain type of cancer or diabetes may be used to train the model using existing feature sets from patients. Once a model has been trained, it can be utilised to provide a diagnosis of that particular disease when patient data is provided from a new patient. The model may make other health related predictions, such as predictions of mortality once it has been trained on a suitable set of patient training data with known mortality outcomes.
Another example use of the model is to determine geological conditions, for example for drilling, to establish the likelihood of encountering oil or gas. Different sensors may be utilised on a tool at a particular geographic location. The sensors could comprise for example radar, lidar and location sensors. Other sensors such as thermometers or vibration sensors may also be utilised. Data from the sensors may be in different data categories and therefore constitute mixed data. Once the model has been effectively trained on this mixed data, it may be applied in an unknown context by taking sensor readings from equivalent sensors in that unknown context, and used to generate a prediction of geological conditions.
A possible further application is to determine the status of a self-driving car. In that case, data may be generated from sensors such as radar sensors, lidar sensors and location sensors on a car, and used as a feature set to train the model for certain conditions that the car may be in. Once a model has been trained, a corresponding mixed data set may be provided to the model to predict certain car conditions.
A further possible application of the trained model is in machine diagnosis and management in an industrial context. For example, readings from different machine sensors including, without limitation, temperature sensors, vibration sensors, accelerometers and fluid pressure sensors may be used to train the model for certain breakdown conditions of a machine. Once a model has been trained, it can be utilised to predict what may have caused a machine breakdown once data from that machine has been provided from corresponding sensors.
A further application is in the context of predicting heat load and cooling load for different buildings. Attributes of a building may be provided to the model for training purposes, these attributes including for example surface area, wall area, roof area, height, orientation etc. Such attributes may be of a mixed data type. As an example, orientation may be a categorical data type and area may be a continuous data type. Once trained, the model can be used to predict the heating load or cooling load of a particular building once corresponding data has been supplied to it for a new building.
Other variants or use cases of the disclosed techniques may become apparent to the person skilled in the art once given the disclosure herein. The scope of the disclosure is not limited by the described embodiments but only by the accompanying claims.
Number | Date | Country | Kind |
---|---|---|---|
2006809.4 | May 2020 | GB | national |