The present disclosure pertains to systems and methods for optimizing hyperparameters for machine learning models.
Machine learning (ML) models allow computers to perform tasks such as learning patterns or making decisions without being explicitly programmed to do so. Instead, such tasks are learned (fully or partially) from training data via a structured training process. A training set comprises examples from which the model learns to improve its performance on a defined task. During training, one or more parameters of a machine learning model are tuned in a systematic fashion based on the training set, e.g., using a gradient-based approach (such as gradient descent or gradient ascent) applied to the model parameter(s). Examples of such parameters include coefficients of a linear regression model, weights in a neural network, or support vectors in a support vector machine. For supervised training, elements of the training set are labelled and the model generates outputs (referred to as predictions) comparable to the labels. A training loss function is defined that quantifies overall error between the model outputs and the labels, and the model parameters are tuned so as to reduce the overall error expressed in the training loss function. For example, in a gradient-based approach, gradients of the loss function with respect to the parameters are computed and used to calculate updated parameters, with the aim of reducing those gradients to zero or thereabouts. Parameter(s) of a machine learning model thus may be characterized as internal variables that the model learns from training data.
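By way of illustration only, gradient-based tuning of model parameters may be sketched as follows for a simple linear regression model (the function name, example data and learning rate are hypothetical, chosen purely for illustration; the learning rate is itself a hyperparameter, discussed below):

```python
# Illustrative sketch: fitting internal parameters (w, b) of a linear model
# y = w*x + b by gradient descent on a mean-squared-error training loss.

def train_linear(xs, ys, lr=0.1, steps=200):
    """Tune parameters w, b to reduce the squared-error loss on (xs, ys)."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        # Gradients of L = (1/n) * sum((w*x + b - y)^2) w.r.t. w and b
        grad_w = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / n
        # Step against the gradient; lr is a hyperparameter fixed in advance
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

# Hypothetical training set drawn from y = 2x + 1
w, b = train_linear([0.0, 1.0, 2.0, 3.0], [1.0, 3.0, 5.0, 7.0])
```

Here, the coefficients w and b are internal parameters learned from the training data, converging towards values that minimize the training loss for the example data shown.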
By contrast, hyperparameters are external configuration settings that are not learned from the training data. For example, hyperparameters may be set prior to the training process and fixed during training.
Incorrect or poorly tuned hyperparameters can lead to suboptimal model performance. Therefore, choosing appropriate hyperparameter values is an important aspect of machine learning model development to optimize model performance. Conventional hyperparameter selection may, for example, involve hand-tuning or random searching. Such approaches involve training and evaluating multiple models with different hyperparameters to compare model performance, which consumes significant computational resources.
A combined hyperparameter and proxy model tuning method is described by way of example. The method involves multiple search iterations. In each search iteration, candidate hyperparameters are considered. An initial (‘seed’) hyperparameter is determined, and used to train one or more first proxy models on a target dataset. From the first proxy model(s), one or more first synthetic datasets are sampled. A first evaluation model is fitted to each first synthetic dataset, for each candidate hyperparameter, enabling each candidate hyperparameter to be scored. Based on the respective scores assigned to the candidate hyperparameters, a candidate hyperparameter is selected and used to train one or more second proxy models on the target dataset. In embodiments, the method repeats, to consider a new round of candidate hyperparameters in a further search iteration, with further synthetic dataset(s) being sampled from the second proxy model to assess the new candidate hyperparameters in the same manner (but now with synthetic data sampled from the second proxy model). A new hyperparameter selected in this assessment may, in turn, be used to train one or more third proxy models on the target dataset.
Particular embodiments will now be described, by way of example only, with reference to the following schematic figures, in which:
Hyperparameters pertaining to an ML model control overall model behaviour. Hyperparameters are important for tuning and optimizing the performance of a machine learning model. They are conventionally set by a machine learning engineer before the training process begins and remain constant during training. Examples of hyperparameters include a learning rate in gradient descent or gradient ascent, the number of hidden layers and neurons in a neural network, the depth of a decision tree, or the choice of a kernel in a support vector machine. As another example, a training loss function may comprise multiple terms that reward or penalize different aspects of model performance. Those terms may be weighted by one or more hyperparameter(s) in the form of relative weighting factors (to prioritize/deprioritize one aspect of performance relative to another) that are fixed in the training process.
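By way of illustration only, a weighted multi-term training loss of this kind may be sketched as follows, where the weighting factor lam is a hyperparameter that is set before training and held fixed while the parameters are tuned (names and data are hypothetical):

```python
# Illustrative sketch: a two-term loss in which a hyperparameter `lam`
# trades off prediction error against a complexity penalty on parameter w.

def ridge_loss(w, xs, ys, lam):
    """Squared-error term plus an L2 penalty weighted by hyperparameter lam."""
    data_term = sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)
    penalty = w * w  # penalizes large parameter values
    return data_term + lam * penalty
```

Larger lam deprioritizes prediction accuracy relative to model simplicity; during training, only w would be tuned, with lam held constant.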
Hyperparameters can be variables of the model itself (e.g. number of layers) or the training process (e.g. learning rate). Whilst parameters are tuned as part of the training process itself, hyperparameter tuning is a separate optimization task to enhance model performance. Usually this is not a choice, but rather a consequence of the fact that it is often not possible to take the gradient of the training loss function with respect to the hyperparameters. As noted, incorrect or poorly tuned hyperparameters can lead to suboptimal model performance. Therefore, choosing appropriate hyperparameter values is an important aspect of machine learning model development to achieve high model performance.
Hyperparameter tuning is a laborious process. To evaluate a hyperparameter set (that is, a hyperparameter or a combination of multiple hyperparameters), a model needs to be trained using those hyperparameter(s) and the trained model needs to be evaluated, as the model performance is used as an indication of the quality of the hyperparameter(s) used to train it. A training process performed to train at least one instance of a machine learning model with a given hyperparameter (referred to as a single ‘training run’ herein) coupled with an evaluation process to evaluate model performance (referred to as an ‘evaluation run’ herein) consumes significant computational resources. In some cases, a single run might involve training of multiple models, e.g. in a cross-validation approach. Hyperparameter optimization involves multiple training and evaluation runs with different hyperparameter sets. Conventional approaches typically involve a large element of trial and error, perhaps guided by human expert knowledge and ‘educated guesswork’.
Significant computational resources are wasted in performing training and evaluation runs for ‘bad’ choices of hyperparameters. More structured approaches to hyperparameter tuning have been explored, such as grid search, random search and Bayesian optimization. Nevertheless, such approaches have still been found to involve significant numbers of ‘wasted’ training and evaluation runs evaluating poor-quality hyperparameters.
In embodiments described herein, an improved hyperparameter optimization process is described, which is able to increase the quality of selected hyperparameter(s) over a given number of training and evaluation runs or, equivalently, reduce the number of training and evaluation runs required to achieve a given level of hyperparameter quality (where hyperparameter quality is assessed in terms of resulting trained model performance). Considering the significant computational resources consumed by a single training and evaluation run, an aim is to use computational resources of a training system more efficiently by reducing the number of ‘wasted’ training and evaluation runs performed in hyperparameter optimization.
Particular consideration is given to hyperparameter tuning for generative data synthesis models, also referred to as ‘proxy models’ herein. A proxy model supports generation of synthetic data of a specified type. A proxy model has parameters θ which are learned via training based on a target dataset D and hyperparameters v (note, in the following, references to parameters and hyperparameters in the plural also encompass a single parameter or single hyperparameter). The following description may refer to a hyperparameter of a model for conciseness, noting that this terminology extends to external variables of the process used to train the model (such as learning rate) which do not explicitly form part of the trained model but nevertheless influence the performance of the trained model.
Looking beyond the efficiency benefits, the described approach has the potential to achieve a better end result than is possible with conventional hyperparameter tuning since it enables an optimization for a target objective on synthetic proxies.
A proxy model trained on a target dataset D={Xi}, where Xi is an individual sample in the target dataset D, can be used to synthesize corresponding synthetic data samples X̃i. D is treated as having been drawn from some unknown distribution, denoted P ({Xi}=D˜P). An iterative hyperparameter optimization method is used to learn a parameterized distribution P̃θ (parameterized by θ) for each iteration i that gets gradually closer to P over multiple iterations. By assuming that the Xi are IID (Independent, Identically Distributed), each can be modelled and sampled independently. The parameters θ are learned through training in each iteration, and are influenced by the choice of hyperparameters, denoted v below.
Proxy models have many practical applications. For example, proxy models may be used to support causal methods, such as causal discovery methods or causal inference methods. Causal methods are a broad class of methods concerned with truly causal relationships exhibited in data (as opposed to mere correlations). Causal methods have broad applications in many fields of technology, some of which are considered in more detail below.
Causal discovery is primarily concerned with identifying and understanding an underlying causal structure within a given system. It involves the use of statistical techniques to analyze patterns in data and draw conclusions about potential cause-effect relationships. The goal may be to predict one or more properties of a dataset, such as causal properties, e.g., the form of a causal graph. A causal graph G is one way to capture information about a data generating process (DGP) associated with a dataset. For example, a DGP may be encoded probabilistically as a joint distribution over a causal graph and a dataset. A causal graph encodes assumptions about a DGP, with nodes representing variables in a system and interactions between different variables as directed edges between nodes. Other forms of causal model may be used to encode causal properties of datasets.
On the other hand, causal inference focuses on estimating the effect of a specific cause on an outcome. Once a causal model has been established (e.g., through causal discovery), causal inference seeks to quantify an impact of changing one variable on another variable. Causal inference may be used when considering potential interventions (treatments), where the goal is to understand what would happen to an outcome if a certain treatment action were performed.
The hyperparameter optimization process also involves iterations of model training dependent on the target dataset 101, as described in more detail below. A fit function 106 is provided for this purpose. The fit function 106 is configured to receive as input a hyperparameter set and a dataset, and returns a trained model that has been trained on the inputted dataset using the inputted hyperparameter set. This function may also be referred to as fitting a model to the inputted dataset. Depending on the context in which the fit function 106 is used, the inputted dataset could be the target dataset 101 itself or a synthetic dataset generated using a proxy model (see below for further details). For use in generating the latter, a sampling function 108 is provided. The sampling function 108 receives as input a proxy model and returns one or more synthetic datasets sampled from the inputted proxy model.
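By way of illustration only, the respective roles of the fit function 106 and the sampling function 108 may be sketched as follows, using a deliberately simple Gaussian ‘proxy model’ in place of a full generative model (the names, the smoothing hyperparameter and the Gaussian form are hypothetical simplifications, not the actual implementation):

```python
import random

def fit(hyperparams, dataset):
    """Return a proxy model fitted to the inputted dataset using the
    inputted hyperparameter set. Here the proxy is just a Gaussian with a
    hypothetical 'smoothing' hyperparameter; a real implementation would
    train a generative model such as a deep generative network."""
    smoothing = hyperparams["smoothing"]
    mean = sum(dataset) / len(dataset)
    var = sum((x - mean) ** 2 for x in dataset) / len(dataset) + smoothing
    return {"mean": mean, "std": var ** 0.5}

def sample(model, n=100, num_datasets=1):
    """Return one or more synthetic datasets sampled from a proxy model."""
    return [
        [random.gauss(model["mean"], model["std"]) for _ in range(n)]
        for _ in range(num_datasets)
    ]

model = fit({"smoothing": 0.1}, [1.0, 2.0, 3.0])   # fit to a target dataset
synthetic = sample(model, n=50, num_datasets=2)    # sample synthetic data
```

The same fit function can then be applied to a synthetic dataset (in place of the target dataset) to produce an evaluation model, as described below.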
An initialization function 110 is used to initialize the hyperparameter optimization process. The initialization function returns an initial set of hyperparameters v0 used to train an initial proxy model (or models).
A hyperparameter generator 112 is configured to generate candidate hyperparameter sets for use in the optimization process. As described in more detail below, the hyperparameter optimization process is performed in multiple search iterations, with each search iteration involving a search over multiple hyperparameter sets. Within a given iteration, a search process such as random search, grid search or Bayesian search may be used.
The hyperparameter optimization process described herein incorporates hyperparameter search methods such as random, grid and Bayesian hyperparameter searching in a manner that is different to conventional hyperparameter optimization methodologies.
In brief, each search iteration starts from a current proxy model, and performs a search over multiple candidate hyperparameter sets (e.g. grid search, random search, Bayesian search etc.). Suitability of each of those candidates is evaluated using the current proxy model. At the end of a search iteration, the most promising candidate hyperparameter set is chosen. This, in turn, is used to train a new proxy model on the target dataset 101, and the next search iteration is performed in the same manner, but starting from the new proxy model (meaning the new proxy model is now used as a baseline to evaluate candidate hyperparameter suitability). The first iteration is performed using a first proxy model generated using initial hyperparameters returned by the initialization function 110. Subsequent search iterations are performed using the proxy model fitted to the target dataset 101 using the most promising hyperparameter set at the end of the previous search iteration. Thus, over multiple search iterations, not only are the hyperparameters iteratively improved, but the proxy model is also iteratively improved.
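By way of illustration only, the overall iterative structure described above may be sketched as follows. All of the callables are placeholders supplied by the caller (in an implementation, they would correspond to the initialization function 110, fit function 106, sampling function 108, hyperparameter generator 112 and scoring function 114 respectively); the sketch is not a definitive implementation:

```python
def optimize_hyperparameters(target, init, fit, sample, candidates, score, T):
    """Iterative proxy/hyperparameter tuning loop.

    init()            -> initial ('seed') hyperparameter set
    fit(v, data)      -> model trained on data with hyperparameters v
    sample(model)     -> synthetic data sampled from a proxy model
    candidates(t)     -> candidate hyperparameter sets for iteration t
    score(v, synth)   -> suitability score for candidate v
    """
    v = init()                      # seed hyperparameters
    proxy = fit(v, target)          # first proxy, trained on the target data
    for t in range(1, T + 1):
        synthetic = sample(proxy)   # synthetic data from the current proxy
        # Score every candidate against the current proxy's synthetic data
        # and keep the most promising one.
        v = max(candidates(t), key=lambda c: score(c, synthetic))
        proxy = fit(v, target)      # new proxy trained with the chosen set
    return v, proxy
```

Over the T search iterations, both the selected hyperparameters and the proxy model are iteratively improved.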
Evaluating the suitability of a candidate hyperparameter set uses the fit function 106 and a scoring function 114. Firstly, an evaluation model is trained, using the candidate hyperparameter set, on a synthetic dataset sampled from the current proxy model (using the sampling function 108). Then, the fitted evaluation model is evaluated using the scoring function 114. The scoring function 114 receives as input an evaluation model and a dataset, and returns a score that quantifies how well the evaluation model fits the dataset. One example of a suitable score is an F1 score for the evaluation model with respect to some known property (or properties) of the dataset. The F1 score is a machine learning evaluation metric that measures a model's accuracy (combining precision and recall). Other scores, such as a precision score or a recall score, may be used. This score may be referred to as a matching score, as it quantifies an extent to which the evaluation model matches the dataset. To evaluate suitability of a candidate hyperparameter set, the scoring function 114 is applied to an evaluation model and the same synthetic dataset on which the evaluation model was trained using the candidate hyperparameter set under evaluation. In some embodiments, multiple synthetic datasets are sampled from the current proxy model. In that case, an evaluation model may be fitted to each synthetic dataset, resulting in multiple evaluation models (trained using the same candidate hyperparameter set as each other), each of which is used to score the candidate hyperparameter set. Those scores are then aggregated to provide an overall matching score, which in turn is used to quantify the suitability of the candidate hyperparameter set.
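By way of illustration only, scoring a single candidate hyperparameter set against multiple synthetic datasets may be sketched as follows (mean aggregation is shown as one possible aggregation; the callables and names are hypothetical placeholders for the fit function 106 and scoring function 114):

```python
def score_candidate(candidate_v, synthetic_datasets, fit, matching_score):
    """Aggregate per-dataset matching scores for one candidate
    hyperparameter set into a single suitability score."""
    scores = []
    for synth in synthetic_datasets:
        # Fit an evaluation model to the synthetic dataset using the
        # candidate hyperparameters, then score how well it matches the
        # same dataset it was trained on.
        e_model = fit(candidate_v, synth)
        scores.append(matching_score(e_model, synth))
    return sum(scores) / len(scores)  # mean aggregation (one choice)
```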
Causal discovery applications are considered herein. In causal discovery, a true data generating process underlying a real dataset is generally unknown. However, a generative model (e.g. having a generative artificial neural network (ANN) architecture) can be trained on a dataset and used to predict a data generating process underlying the dataset. A known data generating process can also be used to synthesize a synthetic dataset.
Generative causal models, e.g., having generative neural network architectures are considered herein. Deep neural network architectures are considered. A class of generative causal model is considered with the ability to predict a causal property (e.g., a causal graph) for a dataset on which it is trained.
Reference is made to Geffner et al., “Deep End-to-end Causal Inference” (2022), arXiv:2202.02195 (the DECI paper), which discloses a deep learning-based end-to-end causal inference framework named DECI. DECI is a single flow-based non-linear additive noise model (ANM) that takes in observational data and can perform causal discovery, enabling causal quantities to be estimated using only observational data as input. The DECI model can also be used to perform causal inference.
In DECI, a causal graph G is described using a structural equation model (SEM). DECI takes a Bayesian approach to causal discovery. Given a training dataset x=(x1, . . . , xN) of d-dimensional datapoints xn with an unknown causal graph G, a variational distribution over the unknown causal graph G, parameterized by first parameters ϕ, is learned. The variational distribution, denoted qϕ(G), approximates a posterior distribution, dependent on second parameters θ, of the causal graph G given the training dataset, pθ(G|x). Functional relationships are modelled using a predetermined noise distribution pz and a set of feedforward, fully-connected artificial neural networks (also known as multilayer perceptrons or MLPs) whose operation is described by the following equations:
where each ζi and li is an MLP with weights shared across nodes. The second parameters θ comprise the (shared) weights of the neural networks ζi and li.
The first and second parameters ϕ, θ are learned in training for the given training set x by maximizing an evidence lower bound (ELBO) thereof, denoted ELBO(θ, ϕ). Adam optimization is used. In one example implementation, the fit function 106, when applied to a target dataset D of the form (x1, . . . , xN), trains a DECI model in this manner, using a given set of hyperparameters v. The model training component 104 may operate in the same way, using optimized hyperparameters. Different trained models may be obtained with the same hyperparameters, for example, with different random initializations of the first and second parameters θ, ϕ. In this context, hyperparameters to be optimized may, for example, include one or more of those listed in Appendix B.2 of the DECI paper, such as a scalar λs used to define a causal graph prior (see Equation 6 of the DECI paper), a temperature of a Gumbel softmax method used for ELBO gradient estimation, the number of layers and/or neurons of the neural networks ζi and li, etc. It will be appreciated that this is not an exhaustive list of hyperparameters, and other implementations are applied to additional or alternative hyperparameter(s).
Once trained, synthetic data is sampled by sampling noise variables z from the predetermined noise distribution, z˜pz, and sampling a causal graph G from the learned variational distribution, G˜qϕ(G). Synthetic data samples are then obtained by solving for x the final two equations listed above (corresponding to equations (1) and (8) in the DECI paper) using the sampled graph G˜qϕ(G) and the sampled noise variables z˜pz. In one example implementation, the sampling function 108 operates in this manner, utilizing the functional relationships described in equations (7) and (8) of the DECI paper. The data sampling procedure can be summarized as: sample graph, sample noise, and forward propagate through the graph with the functional relationships to sample data.
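By way of illustration only, the ‘sample graph, sample noise, forward propagate’ procedure may be sketched as follows for an additive noise model with a fixed, already-sampled graph (a full DECI implementation would additionally sample the graph from qϕ(G) and use the learned MLPs as the functional relationships; the names and Gaussian noise are illustrative assumptions):

```python
import random

def sample_synthetic(adjacency, functions, noise_std, n_samples):
    """Sample synthetic data from an additive noise model:
    draw noise z, then propagate through the graph so that
    x_i = f_i(parents of x_i) + z_i. adjacency[j][i] is 1 if j -> i;
    node indices 0..d-1 are assumed already topologically ordered."""
    d = len(adjacency)
    data = []
    for _ in range(n_samples):
        z = [random.gauss(0.0, noise_std) for _ in range(d)]  # sample noise
        x = [0.0] * d
        for i in range(d):  # forward propagate through the sampled graph
            parents = [x[j] for j in range(d) if adjacency[j][i]]
            x[i] = functions[i](parents) + z[i]
        data.append(x)
    return data
```

For example, with a two-node chain 0 → 1, node 0 takes a constant plus noise and node 1 is a function of node 0 plus noise.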
In one example causal discovery application, the proxy and evaluation models take the form of causal discovery models (such as DECI models) trained on real and synthetic datasets respectively, with the model training component 104, the fit function 106 and the sampling function 108 configured as described above. A dataset (D) having an unknown causal graph (such as a real dataset) is received. Some initial hyperparameter set v0 is chosen. A first proxy model (model1) is fitted to D through training based on v0. Once trained, model1 provides a first prediction of the unknown causal graph (denoted G1) for the target dataset D on which it is trained. A first synthetic dataset (D̂1) is, in turn, synthesised using G1. A first evaluation model (e-model1) is then fitted to D̂1 based on a first hyperparameter set v1. The first evaluation model, e-model1, provides a predicted causal graph for synthetic dataset D̂1, denoted H1, in the same way that proxy model1 provides the predicted causal graph G1 for target dataset D. Note that, whilst G1 is only a first estimate of the (unknown) causal graph for the target dataset D, it is by definition the true causal graph for synthetic dataset D̂1. Therefore, it is possible to score e-model1 by comparing G1 (the true causal graph underlying D̂1) with H1 (e-model1's prediction of G1). A hyperparameter search is used to evaluate different combinations of hyperparameters v1 in this manner (training and scoring different evaluation models with different v1), enabling the most promising v1 to be selected.
Having selected the most promising v1, the method repeats iteratively, training a second proxy model, model2, on the target dataset D using the selected v1 (yielding a second causal graph prediction for D, G2), synthesising a second synthetic dataset D̂2 using G2, fitting a second evaluation model, e-model2, to D̂2 through training based on a second hyperparameter set v2 (yielding a causal graph prediction for D̂2, denoted H2), and scoring e-model2 based on a comparison of G2 (the true causal graph underlying D̂2) with H2 (e-model2's prediction of G2). This is repeated over T iterations, eventually yielding optimized hyperparameters vT and optimized proxy modelT.
With DECI, the described approach can be extended to causal inference, such as average treatment effect (ATE) estimation, by incorporating additional samples from an interventional distribution in the manner described in the DECI paper.
In summary, a generative model is learned that is able to predict the causal graph of the target dataset. This causal graph is used to generate data. However, finding this causal graph (on the real target dataset) is the ultimate objective. The method assumes that the proxy is close to the real dataset, enabling that objective to be approximated on the synthetic datasets.
In each iteration t, Gt and Ht may each take the form of a causal graph adjacency matrix (meaning a matrix representation of a finite causal graph). In such embodiments, a matching score (e.g., F1 score) may be computed between the causal graph adjacency matrices Gt and Ht (that is, between the true adjacency matrix of the proxy, and the adjacency matrix predicted by the evaluation model).
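By way of illustration only, an F1 matching score between two binary adjacency matrices may be computed as follows, treating each possible directed edge as one binary prediction:

```python
def adjacency_f1(true_adj, pred_adj):
    """F1 score between binary causal-graph adjacency matrices: each cell
    (a possible directed edge) is treated as one binary prediction."""
    tp = fp = fn = 0
    for t_row, p_row in zip(true_adj, pred_adj):
        for t, p in zip(t_row, p_row):
            tp += 1 if (t and p) else 0          # edge present in both
            fp += 1 if (p and not t) else 0      # predicted but not true
            fn += 1 if (t and not p) else 0      # true but not predicted
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

Here, precision penalizes spurious predicted edges and recall penalizes missed true edges, so the F1 score combines both aspects of graph-recovery accuracy.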
Similar principles can be applied more broadly, to any setting in which a synthetic dataset can be generated based on a known property or properties: a proxy model (e.g. generative neural network) trained on a target dataset provides a predicted property (or properties) of the dataset, which can in turn be used to synthesise a synthetic dataset. In this case, an evaluation model (e.g. generative neural network) trained on the synthetic dataset provides a predicted property (or properties), which can be compared with the true property (or properties) in a similar manner. This approach is particularly useful in circumstances where the property (or properties) of the target dataset are unknown.
The terms “predicted target dataset property” and “predicted synthetic dataset property” are used to refer to predictions by a proxy model (trained on a target dataset) and an evaluation model (trained on a synthetic dataset) respectively.
In a validation use-case, several different causal discovery models may be validated. This can be achieved by tuning them to the proxy dataset and evaluating their performance on the known graph of the synthetic dataset.
An evaluation model is fitted to a synthetic dataset in the same way a proxy model is fitted to the target dataset 101, and has the same form as a proxy model. An evaluation model may therefore be generated in the same manner as a proxy model, using the fit function 106 applied to a candidate hyperparameter set and a synthetic dataset sampled from a proxy model (in place of the target dataset 101). The ‘proxy/evaluation’ terminology is used purely for ease of understanding, as it reflects the different roles of these models in the hyperparameter optimization process.
The optimized proxy model 105 obtained at the end of the process may also be generated in the same manner, using the fit function 106 applied to the target dataset D and the optimized hyperparameter set 103 obtained in the hyperparameter optimization process. This proxy model 105 is optimized in the sense that it has been trained using the optimized hyperparameters 103. Note, the term optimized does not necessarily imply ‘fully optimal’. Rather, it refers to hyperparameters that have been improved by way of the iterative optimization process. In an extended use case, the generative model (e.g., the final, optimized model) can be used to tune another target algorithm, such as a non-generative machine learning model.
A proxy model application 120 is shown in
Proxy models of the kind described above have many practical applications. As mentioned, one broad class of application pertains to causal methods.
As discussed, causal inference is a fundamental problem with wide-ranging real-world applications in fields such as manufacturing, engineering and medicine. Causal inference involves estimating a treatment effect of actions on a system (such as interventions or decisions affecting the system). This is particularly important for real-world decision makers, not only to measure the effect of actions, but also to select the most effective action.
One such application is causal method selection or validation. In this case, a proxy model trained on a causal dataset can be used to validate a causal method, such as a causal inference method. In this context, improvements to the proxy model yield consequent improvements in the causal method evaluation, which in turn increases the probability of selecting the most appropriate causal method for a given application context.
For example, proxy models may be used to generate synthetic datasets exhibiting causal relationships, which in turn may be used to evaluate different causal methods, e.g., to select an appropriate causal method from multiple candidate causal methods for use in a given practical application. High-quality synthetic causal data is highly desirable in this context because, in practice, it is challenging or impossible to obtain real data with ground truth that can be used for cross-validation between candidate causal methods. By contrast, for synthetic data, such ground truth can be readily generated (or may be intrinsic to the process of generating the synthetic data). For the validation case, different causal discovery models can be evaluated by tuning them to the proxy dataset and evaluating their performance on the known graph of the synthetic dataset.
Causal inference methods may be used to estimate a treatment effect of an action on some real-world system. For example, a causal graph (or other causal properties) predicted by a proxy model (optimized using the described methods) may be used in a causal inference method. A ‘treatment’ refers to an action performed on a physical or logical system. Testing may be performed on a number of ‘units’ to estimate effectiveness of a given treatment, where a unit refers to a physical system in a configuration that is characterized by one or more measurable quantities (referred to as ‘covariates’). Different units may be different physical systems, or the same physical system but in different configurations characterized by different (sets of) covariates. Treatment effectiveness is evaluated in terms of a measured ‘outcome’ (such as resource consumption). Outcomes are measured in respect of units where treatment is varied across the units. For example, in a ‘binary’ treatment setup, a first subset of units (the ‘treatment group’) receives a given treatment, whilst a second subset of units (the ‘control group’) receives no treatment, and outcomes are measured for both. More generally, units may be separated into any number of test groups, with treatment varied between the test groups.
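By way of illustration only, a naive treatment-effect estimate for a binary treatment setup of this kind may be sketched as follows (this simple difference-in-means sketch is an illustrative assumption; practical causal inference methods, such as those considered in the DECI paper, additionally adjust for confounding covariates):

```python
def average_treatment_effect(treated_outcomes, control_outcomes):
    """Naive ATE estimate for a binary treatment setup: the difference
    between the mean outcome of the treatment group and that of the
    control group (no confounder adjustment)."""
    mean_t = sum(treated_outcomes) / len(treated_outcomes)
    mean_c = sum(control_outcomes) / len(control_outcomes)
    return mean_t - mean_c
```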
For example, in the manufacturing industry, causal inference can help quantitatively identify the impact of different factors that affect product quality, production efficiency, and machinery performance in manufacturing processes. By understanding causal relationships between these factors, manufacturers can perform industrial machine actions (such as tuning, adapting, modifying or replacing an industrial machine) to optimize their processes, reduce waste, and improve overall efficiency. As another example, in the field of engineering, causal inference can be used for root cause analysis, to identify underlying causes of faults and malfunctions in machines or electronic systems such as vehicles or unmanned drones (e.g. aircraft systems). By analyzing data from sensors, maintenance records, and incident reports, causal inference methods can help determine which factors are responsible for observed issues and guide targeted maintenance and repair actions. In genome-wide association studies (GWAS), causal inference may be used, for example, to identify associations between genetic variants and a trait or disease, accounting for potential confounding factors, which in turn may allow therapeutic treatments to be developed or refined. As another example, different energy management actions may be evaluated in a manufacturing or engineering context, or more generally in respect of some energy-consuming system, to estimate their effectiveness in terms of energy saving, as a way to reduce energy consumption of the energy-consuming system. A similar approach may be used to evaluate effectiveness of an action on a resource-consuming physical system with respect to any measurable resource. Causal inference may interface with the real world in terms of both its inputs and its outputs/effects.
For example, multiple candidate actions may be evaluated via causal inference, in order to select an action (or subset of actions) of highest estimated effectiveness, and perform the selected action on a physical system(s), resulting in a tangible, real-world outcome. Input may take the form of measurable physical quantities such as energy, material properties, processing, usage of memory/storage resources in a computer system, therapeutic effect etc. Such quantities may, for example, be measured directly using a sensor system or estimated from measurements of another physical quantity or quantities. Causal analysis may be performed in a cybersecurity context, e.g. to identify causes of cyberthreats or potential cyberthreats, and mitigate such causes through appropriate security mitigation actions.
The hyperparameter tuning method is especially useful for any type of generative process where tuning hyperparameters is hard, especially where formulating a true target is hard. Other applications of proxy models beyond causal methods are envisaged.
For example, in industrial processes where certain quantities are hard to measure, a model may be formulated for the process and its hyperparameters iteratively optimized via proxy tuning. Other examples include scenarios in which the behaviour of any type of agent is modelled, e.g. humans driving in traffic (applicable to autonomous driving), or players in a game. With the present techniques, it is possible to learn models that gradually learn to reproduce that behaviour better, and since internal characteristics of the models are accessible, it is feasible to exactly measure, e.g., the likelihood of a certain action.
Other applications include, for example, hyperparameter tuning for generative image models, generative audio models or language models (such as large language models), or other content models, particularly synthesis based on inferred properties of images, audio, text and/or other content etc.
At step 202, the target dataset 101 (D) is received.
At step 203, the initialization function 110 is used to generate an initial set of hyperparameters v0. Several example initialization processes are described below.
At step 204, a first proxy model (model1) is trained on the target dataset D using the initial hyperparameters v0.
Reference numeral 205 is used to denote a hyperparameter search process, which is repeated multiple times in an iterative manner, using iteratively updated proxy models. Each instance of the hyperparameter search process is referred to as a search iteration. In the following description, an index t is used to denote a current search iteration, with t=1, . . . , T. Reference is made to a current proxy model (modelt); for the first search iteration (t=1), this is the first proxy model 300-1 (model1) generated at step 204.
In the first extension, multiple proxy models are trained at each iteration t, with different random seeds, and in that case the notation modeltj is used to denote the jth proxy model in the tth iteration.
The hyperparameter search process 205 is performed as follows.
At step 206, a synthetic dataset {circumflex over (D)}t is sampled from the current proxy model (modelt) using the sampling function 108 applied to the current proxy model.
In the first extension (multiple proxies), a synthetic dataset {circumflex over (D)}tj is sampled from each current proxy model modeltj with j=1, . . . , kt, resulting in kt synthetic datasets sampled from the kt current proxy models.
In a second extension visualized in
The notation {circumflex over (D)}t,m (used in
In search iteration t, nt candidate hyperparameter sets are considered. The number of candidate hyperparameter sets nt may or may not vary between different search iterations depending on the embodiment (that is, nt may or may not be constant). The nt candidate hyperparameter sets are generated in steps 208-1, . . . , 208-nt in
The notation vt,n (used in
For each of the nt candidate hyperparameter sets, an evaluation model is trained on the synthetic dataset {circumflex over (D)}t sampled in step 206 (steps 210-1, . . . , 210-nt). That is, in step 210-1, an evaluation model is trained on {circumflex over (D)}t using vt,1; in step 210-2, an evaluation model is trained on {circumflex over (D)}t using vt,2 etc. In the simplest case of a single proxy modelt and a single synthetic dataset {circumflex over (D)}t, steps 210-1, . . . , 210-nt result in nt evaluation models in total (one for each candidate hyperparameter set).
In the first extension (multiple proxies), kt synthetic datasets have been sampled from the kt current proxy models, and an evaluation model is fitted to each of these for each set of candidate hyperparameters vt,n, resulting in nt×kt evaluation models.
In the second extension of
In
Each evaluation model (1,1), . . . (kt,1), . . . , (1,nt), . . . (kt, nt) is trained by applying the fit function 106 to the applicable synthetic dataset and candidate hyperparameter. So, evaluation models (1,1), . . . , (1,nt) are each trained on synthetic dataset {circumflex over (D)}t,1, using hyperparameter sets vt,1, . . . , vt,nt
As noted, the first and second extensions may be combined, resulting in nt×kt×mt evaluation models.
For each evaluation model, the scoring function 114 is evaluated for the applicable candidate hyperparameter set, at steps 212-1, . . . , 212-nt respectively.
Thus at step 212-1, the scoring function 114 is evaluated for the evaluation model trained in step 210-1 using the candidate hyperparameter set generated at step 208-1, resulting in a matching score. With multiple evaluation models, resulting in multiple scores, those scores are aggregated using an aggregation function 302 (e.g. summation function) to provide an overall score for the candidate hyperparameter set generated in step 208-1 (and so on).
As depicted in
In the second extension of
Steps 208-1, 210-1 and 212-1 together constitute a first training and evaluation run. Steps 208-2, 210-2 and 212-2 constitute a second training and evaluation run, and so on. In search iteration t, nt training and evaluation runs are thus performed in total, to generate and score nt candidate hyperparameter sets.
The hyperparameter generator 112 generates the candidate hyperparameters at steps 208-1, . . . , 208-nt, as depicted in
In one embodiment, a random search is performed. In a random search, hyperparameters are selected randomly, within some defined criteria (for example, they may be randomly sampled from a probability distribution defined over hyperparameter space).
In another embodiment, a grid search approach is used, in which the candidate hyperparameter sets are typically chosen to be spaced uniformly apart in hyperparameter space (in a grid-like fashion).
For random and grid searches, there is no inter-dependence between selected hyperparameters. Therefore, in this case, steps 208-1, . . . , 208-nt can be performed in parallel, or sequentially in any order with respect to each other. The same is true of steps 210-1, . . . , 210-nt and steps 212-1, . . . , 212-nt.
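By way of a non-limiting illustration, random and grid candidate generation of the kind described above may be sketched as follows. The function names, the (low, high) range convention and the specific hyperparameters shown are illustrative assumptions, not part of the method itself:

```python
import itertools
import random

def random_candidates(space, n, rng=random.Random(0)):
    """Random search: sample n hyperparameter sets independently.

    `space` maps each hyperparameter name to an illustrative (low, high)
    range; each set is drawn uniformly within those ranges.
    """
    return [{k: rng.uniform(lo, hi) for k, (lo, hi) in space.items()}
            for _ in range(n)]

def grid_candidates(space, points_per_axis):
    """Grid search: values spaced uniformly apart along each axis."""
    axes = {k: [lo + i * (hi - lo) / (points_per_axis - 1)
                for i in range(points_per_axis)]
            for k, (lo, hi) in space.items()}
    keys = list(axes)
    return [dict(zip(keys, combo))
            for combo in itertools.product(*(axes[k] for k in keys))]

# Illustrative hyperparameter space.
space = {"learning_rate": (1e-4, 1e-1), "dropout": (0.0, 0.5)}
rand_sets = random_candidates(space, n=8)
grid_sets = grid_candidates(space, points_per_axis=3)  # 3 x 3 = 9 sets
```

Because each candidate set is generated independently of the others, the resulting training and evaluation runs may be dispatched in parallel, as noted above.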
In another embodiment, a Bayesian search approach is used. In essence, a Bayesian search strategy applies Bayesian optimization to the hyperparameter search problem. A Bayesian approach uses the posterior of the previous search iteration as the prior in the next. One or more initial hyperparameter set(s) are chosen (e.g., randomly), and used to perform initial training and evaluation run(s). The resulting trained models are used to evaluate the initial hyperparameter set(s). Those results are then used to build a performance prediction model, which in turn can be used to select a subsequent hyperparameter set (e.g., one that is predicted to yield the best model performance according to the performance prediction model). A further training run is then performed using the selected hyperparameter set(s), again resulting in a trained model that is used to evaluate the selected hyperparameter set. That result is, in turn, used to update the performance prediction model, which in turn is used to select the next hyperparameter set, and so on. In this manner, earlier training runs based on previously-selected hyperparameters are used to guide the selection of hyperparameters for later runs. The idea is that earlier runs can indicate potentially promising area(s) of the hyperparameter space, with the selection of hyperparameters for subsequent training runs being guided towards those promising region(s) by the incrementally updated performance prediction model.
In a Bayesian search embodiment, the hyperparameter generation at step 208-2 is dependent on the evaluation of the previous hyperparameters at step 212-1, with an intervening step (not depicted) of training a performance prediction model (and so on). Thus, in such embodiments, step 208-2 is dependent on step 212-1 and so on.
Random search has had practical advantages, as it enables a larger number of trials to be run in parallel. However, a Bayesian search may yield better performance of the hyperparameter optimization method.
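A sequential, model-guided search of the kind described above may be sketched as follows. For simplicity, a toy surrogate (a distance-weighted average of observed scores, plus an exploration bonus for sparsely sampled regions) stands in for a full Bayesian performance prediction model; what is illustrated is the loop shape (evaluate, update the prediction model, select the next promising candidate), and all names and constants are illustrative assumptions:

```python
import math
import random

def sequential_search(score_fn, low, high, n_init=3, n_iters=10, seed=0):
    """Sequential model-based search over one scalar hyperparameter.

    `score_fn(v)` plays the role of a full training-and-evaluation run
    returning a matching score for hyperparameter value v.
    """
    rng = random.Random(seed)
    observed = []  # (hyperparameter value, score) pairs from earlier runs

    def surrogate(v):
        # Predicted score: nearby observations count for more.
        weights = [math.exp(-abs(v - x)) for x, _ in observed]
        mean = sum(w * s for w, (_, s) in zip(weights, observed)) / sum(weights)
        # Exploration bonus: large where no observation is close.
        bonus = min(abs(v - x) for x, _ in observed)
        return mean + 0.1 * bonus

    # Initial randomly chosen runs seed the prediction model.
    for x in [rng.uniform(low, high) for _ in range(n_init)]:
        observed.append((x, score_fn(x)))

    # Later runs are guided towards promising regions by the surrogate.
    for _ in range(n_iters):
        candidates = [rng.uniform(low, high) for _ in range(50)]
        x = max(candidates, key=surrogate)
        observed.append((x, score_fn(x)))

    return max(observed, key=lambda pair: pair[1])

# Toy objective with a maximum at v = 0.3.
best_v, best_score = sequential_search(lambda v: -(v - 0.3) ** 2, 0.0, 1.0)
```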
At step 214, one of the nt candidate hyperparameter sets is selected based on the overall scores computed in steps 212-1, . . . , 212-nt. In one embodiment, the highest scoring candidate hyperparameter set is selected. In another embodiment, an acceptance condition is applied, as described in further detail below. The hyperparameter set selected in step 214 of search iteration t is denoted vt. Search iteration t terminates at this point.
In step 216, a determination is made as to whether to terminate the hyperparameter optimization process as a whole. The final search iteration is denoted by T. In one embodiment, the value of T is predetermined. Thus, the process terminates after a fixed number of search iterations. In other embodiments, different termination condition(s) may be defined, meaning that T is variable.
If the termination condition(s) are not satisfied (t<T), the method proceeds to step 222, at which a new (intermediate) proxy model is fitted to the target dataset 101, using the candidate hyperparameter set selected in step 214 of the most recent search iteration (vt). Training of the new proxy comprises selecting the candidate hyperparameters in step 214 and fitting the new proxy to the target dataset D, which in turn involves learning parameters of the new proxy based on the target dataset using the selected hyperparameters. From here, the method returns to step 206, beginning a new search iteration based on the new proxy model determined at step 222, with t incrementing by one (t: =t+1). At this point, the proxy model trained in step 222 becomes the new current proxy model, modelt (noting that t has now incremented by one) denoted by reference sign 300-t in
In the first extension, multiple proxy models are trained at step 222 using the selected hyperparameters vt and respective random seeds, resulting in multiple proxy models that form the basis of the next iteration.
The process continues iteratively in this manner, until eventually returning to step 216 for the final time when the termination condition(s) are determined to be satisfied (thus reaching t=T). At this point, the optimized hyperparameter set 103 (vT) has been obtained, and this is returned (step 218) as a final output of the hyperparameter optimization process.
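The overall process of steps 203-222, in the simplest case (a single proxy model per iteration, highest-score selection, and a fixed number of search iterations, without the extensions or the acceptance condition), may be sketched as follows. The component functions are supplied as callables, and all names are illustrative assumptions:

```python
def optimize_hyperparameters(target_data, fit, sample, score,
                             propose_candidates, v0, n_search_iters):
    # fit(data, v): train a model on `data` with hyperparameters v
    # sample(model): draw a synthetic dataset from a trained model
    # score(eval_model, proxy): matching score for an evaluation model
    # propose_candidates(t): candidate hyperparameter sets for iteration t
    proxy = fit(target_data, v0)                             # step 204
    v = v0
    for t in range(1, n_search_iters + 1):                   # search process 205
        synthetic = sample(proxy)                            # step 206
        scored = []
        for cand in propose_candidates(t):                   # steps 208-1..208-nt
            eval_model = fit(synthetic, cand)                # steps 210-1..210-nt
            scored.append((score(eval_model, proxy), cand))  # steps 212-1..212-nt
        _, v = max(scored)                                   # step 214 (highest score)
        proxy = fit(target_data, v)                          # step 222
    return v, proxy                                          # steps 218 and 220

# Toy components: a "model" simply records its hyperparameter and training
# data, and the matching score peaks at v = 0.5, so the loop should settle
# on that candidate.
fit = lambda data, v: {"v": v, "data": data}
sample = lambda model: model["data"]
score = lambda eval_model, proxy: -abs(eval_model["v"] - 0.5)
best_v, final_proxy = optimize_hyperparameters(
    [1.0, 2.0, 3.0], fit, sample, score,
    propose_candidates=lambda t: [0.1, 0.4, 0.5, 0.9],
    v0=0.1, n_search_iters=3)
```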
At step 220, the optimized hyperparameter set 103 is, in turn, used to train an optimized proxy model 105 (modelT). In the first extension, multiple final proxy models are trained using vT.
In a causal discovery application, the optimized proxy model(s) 105 provides an iteratively refined estimate of the causal properties (e.g. causal graph) of the target dataset D.
The implementation of
To determine the ‘argmax’ in Algorithm 1, multiple candidate hyperparameters are evaluated and scored, as per
As noted, another embodiment includes an acceptance condition at this stage. Rather than selecting vt directly based on the scores, all candidates vt,n are ordered by their scores. Then, the algorithm iterates over the candidates in batches (subsets) of a fixed size (e.g. around ten candidates in one example). For each batch, training jobs are launched on the target dataset (to train a ‘candidate’ proxy model on the target dataset for each candidate in the batch). This embodiment thus involves training multiple new (candidate) proxies at the end of each iteration. As part of this process, one or more batches of hyperparameters are selected based on their scores to enable different candidate proxies to be evaluated in the following manner. The batch is then re-ordered by log likelihood (for each candidate, this is the log likelihood of the target dataset with respect to the corresponding proxy model) in order to test whether to accept or reject the candidates in order. A candidate is accepted with a probability determined by its log likelihood (ll) and the previously accepted log likelihood (prev_ll) (that is, the log likelihood of the candidate selected in the previous iteration) and, once accepted, the corresponding candidate proxy model is selected for the next iteration. In one example, a candidate is accepted based on a score computed as exp(ll−prev_ll), which is interpreted as a probability of acceptance if less than 1. This quantity may be larger than 1, which happens when ll>prev_ll, in which case the candidate is always accepted. With this second approach, the hyperparameter selection at the end of each iteration is still dependent on the scores, but unlike the first approach, it is not necessarily the highest scoring candidate that is selected.
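The acceptance test of the example above (accept with probability exp(ll−prev_ll), always accepting when ll>prev_ll) may be sketched as follows; the function names and batch handling details are illustrative assumptions:

```python
import math
import random

def accept_candidate(ll, prev_ll, rng=random):
    """Accept/reject test: exp(ll - prev_ll) >= 1 whenever ll >= prev_ll,
    so such candidates are always accepted; otherwise the candidate is
    accepted with probability exp(ll - prev_ll)."""
    return rng.random() < math.exp(min(ll - prev_ll, 0.0))

def select_from_batch(batch, prev_ll, rng=random):
    """Iterate over one batch of (candidate, log likelihood) pairs in
    descending log-likelihood order and return the first accepted pair,
    or None if the whole batch is rejected."""
    for cand, ll in sorted(batch, key=lambda p: p[1], reverse=True):
        if accept_candidate(ll, prev_ll, rng):
            return cand, ll
    return None
```

For instance, given a previously accepted log likelihood of −3, a batch member with log likelihood −2 is always accepted, whereas one with log likelihood −4 would be accepted with probability exp(−1).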
The introduction of the acceptance condition addresses a convergence problem that can otherwise occur. Specifically, an issue can arise in that the highest score in iteration t+1 might be lower than the highest score in iteration t, implying a drop in performance between iterations. The acceptance condition has been found to more reliably achieve an increase in performance across the multiple iterations.
The reordering means that, in a given iteration, the highest-scoring candidate will not necessarily be selected (it may or may not be, with a probability that is determined by the reordering and the log likelihoods), but that over multiple iterations the scores will tend to increase.
Algorithm 1 does not expressly cover the extension of
There are various ways of implementing the initialization function 110.
A first implementation starts with a manually constructed synthetic approximation which allows computing the objective (e.g. orientation f1 score). This may be written in pseudocode as:
In the first implementation, the ‘synthetic’ function returns an initial model, model0, from which v0 is determined. This can be seen as iteration t=0 that is performed in the same way as subsequent iterations, but starting from e.g. a random initial proxy model.
A second implementation randomly or manually picks the initial set of hyperparameters v0, and trains a first model to act as the proxy.
A third implementation performs hyperparameter tuning on a different objective that can be computed without access to the underlying graph (such as tuning a log likelihood of the target data with respect to an initial proxy model, model0). Existing hyperparameter tuning methods may be used for this purpose.
In one causal application, the target dataset 101 may take the form of D={(Xi, Ti, Yi)}, with i=1, . . . , N entities (e.g. physical systems), where Xi denotes one or more covariates associated with entity i, Ti denotes a treatment (e.g. a binary indicator of whether a treatment was applied to the entity), and Yi denotes an outcome (e.g. a binary indicator of whether a particular outcome was obtained).
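The form D={(Xi, Ti, Yi)} may be illustrated as follows; the record type and the binary treatment/outcome encodings shown are illustrative assumptions:

```python
from dataclasses import dataclass

@dataclass
class Record:
    """One entity i of the target dataset D = {(X_i, T_i, Y_i)}."""
    X: list[float]  # covariates associated with entity i
    T: int          # treatment indicator (1 = treatment applied, 0 = not)
    Y: int          # outcome indicator (1 = outcome obtained, 0 = not)

# A toy target dataset with N = 2 entities.
D = [
    Record(X=[0.2, 1.5], T=1, Y=1),
    Record(X=[0.9, 0.3], T=0, Y=0),
]
```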
In another example, the target dataset 101 contains purely observational data, such as real-world outcomes with no particular controlled treatment, in which case D={Xi}, with Ti and Yi effectively subsumed in Xi.
A first aspect herein provides a computer-implemented method, comprising: receiving a target dataset; determining an initial hyperparameter; training a first proxy model based on the target dataset and the initial hyperparameter, resulting in a trained first proxy model; sampling a first synthetic dataset based on the trained first proxy model; training a first evaluation model based on the first synthetic dataset and a first candidate hyperparameter, resulting in a trained first evaluation model; calculating a first matching score based on the trained first evaluation model and the trained first proxy model; and training a second proxy model based on the target dataset, the first candidate hyperparameter and the first matching score, resulting in a trained second proxy model.
The method may comprise training a further first proxy model based on the target dataset and the initial candidate hyperparameter; sampling a further first synthetic dataset based on the trained further first proxy model; training a further first evaluation model based on the first candidate hyperparameter and the further first synthetic dataset; calculating a further first matching score based on the trained further first evaluation model and the trained further first proxy model; computing a first aggregate score for the first candidate hyperparameter based on the first matching score and the further first matching score, wherein the second proxy model may be trained based on the first aggregate score.
The method may comprise training multiple first proxy models based on the target dataset and the initial candidate hyperparameter; sampling a first synthetic dataset based on each first proxy model, resulting in multiple first synthetic datasets; training a first evaluation model based on the first candidate hyperparameter and each first synthetic dataset, resulting in multiple first evaluation models dependent on the first candidate hyperparameter; calculating a first matching score between each first synthetic dataset and the first evaluation model trained thereon, resulting in multiple first matching scores relating to the first candidate hyperparameter; computing a first aggregate score for the first candidate hyperparameter based on the multiple first matching scores, wherein the first candidate hyperparameter is selected based on the first aggregate score.
The first proxy model may be trained based on the target dataset, the initial candidate hyperparameter, and a first random seed, wherein the further first proxy model is trained based on the target dataset, the initial candidate hyperparameter, and a further first random seed.
The method may comprise sampling a second synthetic dataset based on the second proxy model; training a second evaluation model based on the second synthetic dataset and a second candidate hyperparameter, resulting in a trained second evaluation model; calculating a second matching score based on the trained second evaluation model and the trained second proxy model; and training a third proxy model based on the target dataset, the second candidate hyperparameter and the second matching score, resulting in a trained third proxy model.
The method may comprise sampling a third synthetic dataset based on the trained third proxy model; training a third evaluation model based on the third synthetic dataset and a third candidate hyperparameter, resulting in a trained third evaluation model; calculating a third matching score based on the trained third evaluation model and the trained third proxy model; and training a fourth proxy model based on the target dataset, the third candidate hyperparameter and the third matching score, resulting in a trained fourth proxy model.
The first proxy model, the second proxy model and the first evaluation model may each have a generative neural network architecture.
The method may comprise determining using the trained first proxy model a first predicted target dataset property, wherein the first synthetic dataset may be sampled based on the first predicted target dataset property; and determining using the trained first evaluation model a first predicted synthetic dataset property, wherein the first matching score may be computed between the first predicted target dataset property and the first synthetic dataset property.
The first predicted target dataset property may comprise a first predicted target dataset causal property, wherein the first predicted synthetic dataset property may comprise a first predicted synthetic dataset causal property.
The first predicted target dataset causal property and the first predicted synthetic dataset causal property may be embodied in respective causal graphs sampled from the trained first proxy model and the trained first evaluation model respectively.
The first proxy model and the second proxy model may be causal models, the method may comprise: sampling a causal graph from the trained second proxy model or another trained proxy model derived from the trained second proxy model via additional sampling and training operations.
The method may comprise determining a treatment action based on the causal graph sampled from the trained second proxy model or the other proxy model; and performing the treatment action on a physical or logical system.
The first evaluation model may be one of multiple first evaluation models trained based on respective first candidate hyperparameters, resulting in multiple trained first evaluation models, wherein training the second proxy model based on the target dataset, the first candidate hyperparameter and the first matching score may comprise: selecting the first candidate hyperparameter from the respective first candidate hyperparameters based on respective first matching scores computed based on the multiple trained first evaluation models and the trained first proxy model, and fitting the second proxy model to the target dataset using the first candidate hyperparameter as selected based on the respective first matching scores, resulting in the trained second proxy model.
Selecting the first candidate hyperparameter may comprise selecting a subset of the respective first candidate hyperparameters based on the respective first matching scores, the subset comprising the first candidate hyperparameter and an additional first candidate hyperparameter; fitting an additional second proxy model to the target dataset using the additional first candidate hyperparameter, resulting in a trained additional second proxy model; determining a first likelihood of the trained first proxy model with respect to the target dataset; determining a second likelihood of the trained second proxy model with respect to the target dataset; determining an additional second likelihood of the trained additional second proxy model with respect to the target dataset; and selecting the trained second proxy model from a set comprising the trained second proxy model and the trained additional second proxy model based on the first likelihood, the second likelihood and the additional second likelihood.
The method may comprise training multiple first evaluation models based on respective first candidate hyperparameters, the first candidate hyperparameter selected from the respective first candidate hyperparameters based on respective first matching scores computed between the multiple first evaluation models and the first synthetic dataset.
Selecting the first candidate hyperparameter may comprise: determining a subset of the respective first candidate hyperparameters based on the respective first matching scores; for each first candidate hyperparameter of the determined subset: training a candidate second proxy model based on the target dataset, and determining a likelihood of the candidate second proxy model with respect to the target dataset, resulting in multiple candidate second proxy models; and selecting the second proxy model from the multiple candidate second proxy models based on the likelihood and a likelihood of the first proxy model with respect to the target dataset.
The respective first candidate hyperparameters may be generated randomly.
The respective first candidate hyperparameters may be generated via a Bayesian search based on the respective first matching scores.
Determining the initial hyperparameter may comprise: generating an initial proxy model, sampling an initial synthetic dataset based on the initial proxy model, training multiple initial evaluation models based on the initial synthetic dataset and respective initial hyperparameters, computing an initial matching score based on each trained initial evaluation model and the initial proxy model, and selecting the initial candidate hyperparameter from the respective initial hyperparameters based on the initial matching score computed for each trained initial evaluation model.
The initial hyperparameter may be determined by generating an initial proxy model, sampling an initial synthetic dataset based on the initial proxy model, training an initial evaluation model based on the initial synthetic dataset and the initial hyperparameter, computing an initial matching score between the initial evaluation model and the initial synthetic dataset, and selecting the initial candidate hyperparameter based on the initial matching score.
The initial hyperparameter may be randomly generated, or determined via tuning of a likelihood of the target dataset with respect to an initial proxy model.
The method may comprise, based on the trained second proxy model: tuning, adapting, modifying or replacing an industrial machine, performing a maintenance or repair action performed on a machine, generating image data, audio data, text or other content, or performing a security mitigation action.
A second aspect provides a computer system comprising: at least one memory configured to store computer-readable instructions; and at least one hardware processor coupled to the at least one memory, wherein the computer-readable instructions are configured to cause the at least one hardware processor to perform any above method.
A third aspect provides a computer-readable storage medium embodying computer-readable instructions, the computer-readable instructions configured upon execution on at least one hardware processor to cause the at least one hardware processor to perform any above method.
It will be appreciated that the above embodiments have been disclosed by way of example only. Other variants or use cases may become apparent to a person skilled in the art once given the disclosure herein. The scope of the present disclosure is not limited by the above-described embodiments, but only by the accompanying claims.