The present disclosure generally relates to modeling techniques for black-box functions, and in particular, to a system and associated method for modeling and optimization of unknown, globally non-stationary heterogeneous functions that can be applied in various science and engineering applications.
Gaussian Processes (GP) are a powerful framework for modeling expensive black-box functions. However, many current modeling frameworks operate under assumptions of stationarity of an underlying, unknown function. When applied for modeling and optimization of real-world systems, this assumption can lead to significant limitations and inaccuracies.
It is with these observations in mind, among others, that various aspects of the present disclosure were conceived and developed.
The present patent or application file contains at least one drawing executed in color. Copies of this patent or patent application publication with color drawing(s) will be provided by the Office upon request and payment of the necessary fee.
Corresponding reference characters indicate corresponding elements among the views of the drawings. The headings used in the figures do not limit the scope of the claims.
Many modern-day science and engineering applications, such as machine learning, hyperparameter optimization of neural networks, robotics, and cyber-physical systems, call for modeling techniques to model black-box functions. Gaussian Process (GP) modeling is a popular Bayesian non-parametric framework heavily employed to model expensive black-box functions for analyses such as prediction or optimization. Traditionally, GP models assume stationarity of the underlying, unknown function. As a result, a unique covariance kernel (with constant hyperparameters) can be used over the entire domain. However, many real-world systems, such as cyber-physical, natural, and recommendation systems, can only be characterized by locally stationary but globally non-stationary functions. Breaking the assumption underlying the stationary kernel can deteriorate the quality of predictions generated by the GP.
Bayesian optimization (BO) is a powerful sequential optimization algorithm for non-convex, expensive-to-evaluate black-box functions due to its sample-efficient and derivative-free framework. Standard BO includes two key components: (i) a Bayesian statistical prior (typically a Gaussian Process) to model the function to be optimized; and (ii) an acquisition function to navigate the input space efficiently. The Bayesian statistical prior typically assumes homogeneity of the function over the domain, which allows the use of a single covariance kernel function with constant hyperparameters over the entire domain to model the function. However, in many emerging engineering applications, such as robotics, circuit design optimization, and the falsification of cyber-physical systems, the function to be optimized is heterogeneous with some underlying structure. The standard BO algorithm does not utilize this underlying structure for optimization. Further, its performance for heterogeneous functions drops significantly because very small kernel length-scale values are needed to capture high-frequency components, leading to inadequate sampling in smooth or low-frequency regions. Hence, these drawbacks call for more efficient and sophisticated optimization techniques that can effectively utilize the underlying structure to sample the input space efficiently.
In many engineering systems, a finite set of locally stationary functions can characterize the heterogeneous function, where the structure of the non-stationarity can be evaluated or is known a priori. For example, a vehicle's automatic transmission will exhibit switching behaviors with a discrete and finite number of switches (gear changes). When a specific behavior (gear) is exercised, the system exhibits smooth state dynamics, and the associated metrics of interest maintain such smoothness. One study considered the case of identifying unsafe system-level behaviors for cyber-physical systems without exploiting any information about, for example, switching dynamics.
The efficiency of BO heavily relies on the modeling capabilities of the statistical prior (or surrogate), and many techniques have been proposed to scale BO for heterogeneous function optimization by improving the surrogate. Some studies use non-stationary kernels to model the non-stationarity; however, these methods are complex and computationally intractable. Other studies warp the input space to a new space where the function is stationary and the standard BO framework is applied, but finding such a mapping is a difficult problem in its own right. Further studies partition the input space to fit individual GPs in each sub-region. However, none of these methods utilize the system's underlying structure for efficient sampling.
Many studies in the literature tackle this problem. These studies can be classified into three categories:
1. Locally stationary and partition-based approaches: One early work tackles modeling of heterogeneous functions by partitioning the input space with tree-based algorithms and using independent stationary GPs to model the underlying function. Other works propose Voronoi tessellation-based partitioning of the input space. Another work overcomes the modeling limitation of axis-aligned partitions by using support vector machine (SVM)-based classifiers at each node of the tree. A further work uses an alternative kernel, modeled as a convolution of a fixed kernel with an independent stationary kernel whose parameters vary over the sub-regions of the space.
2. GPs with non-stationary kernels: Some studies use non-stationary kernels to model the heterogeneous function, including one which uses non-stationary kernels with input-dependent kernel parameters and further models the parameter functions with a smooth stationary Gaussian process. However, the use of non-stationary kernels makes these methods complex and computationally intractable.
3. Warping or spatial deformation based methods: Some methods map the original input space to a new space where the process is assumed to be stationary; one such work uses a composite GP to warp the input space.
The present disclosure outlines a novel tree-based Bayesian Optimization framework to optimize heterogeneous black-box functions where the heterogeneity is represented by a finite set of homogeneous functions defined over the sub-regions of the input space. The present disclosure improves upon the partition-based approaches mentioned above by additionally employing the underlying structure and allowing for information sharing across non-contiguous sub-regions where the locally stationary functions are from similar function classes. Specifically, we tackle two key subproblems: (i) Modeling and learning the structure of the heterogeneous functions using a Monte Carlo tree-based algorithm (i.e., including identifying partitions of the input space), followed by (ii) optimization of the heterogeneous black box function using a Bayesian optimization approach that uses the learned structure and a novel acquisition function (“Class-UCB”).
The present disclosure outlines a first step in the direction of improving analysis of systems that are characterized by a discrete and finite number of "classes" of behaviors. Notice that a single class may be represented by disconnected sets of the input space. In particular, given an input, it is assumed that both the class and the closest class that the system can switch to can be evaluated. Under this scenario, the existing partition-based methods can be extended to a family of heterogeneous functions by adding the class information. A system outlined herein models homogeneous behavior classes by imposing that the GPs learnt within the sub-regions of the input space that belong to the same class have the same kernel hyperparameters. These functions are often encountered in many real-world applications, where the class information can be accessed by learning a classifier using the features. The system implements a tree-based framework with information sharing across non-contiguous partitions that belong to the same homogeneous behavior class to better learn the GP models. Contributions of the present disclosure include the modeling and optimization frameworks outlined in the following sections.
Let $f: \mathcal{X} \subseteq \mathbb{R}^d \to \mathbb{R}$ be a heterogeneous black-box function defined over a compact input space $\mathcal{X}$ such that the function ƒ can be represented by a finite set of homogeneous functions $g_j$ for all $j \in [p]$ defined over $\mathcal{X}$ with p axis-aligned sub-regions (i.e., partitions) $\{\mathcal{X}_j\}_{j=1}^{p}$ of the input space. Further, each $g_j$ is independently sampled from a Gaussian process with a continuous stationary kernel $\kappa_j$, i.e., $g_j \sim \mathcal{GP}(\mu_j(.), \kappa_j(.,.))$. For notational convenience, and without loss of generality, the prior mean $\mu_j = 0$. Additionally, multiple sub-regions (i.e., partitions) $j_1, j_2 \in [p]$ that belong to the same class $i \in [m]$ include functions sampled from GPs with the same covariance kernel function, i.e., $\kappa_{j_1} = \kappa_{j_2}$.
For consistency, partitions are denoted with a subscript j and classes with a subscript i; owing to this notation, any variable with subscript i or j refers to a class or partition variable, respectively.
Let $\mathcal{X}_i$ denote the union of all partitions with class label i, i.e., $\mathcal{X}_i = \bigcup_{\{j:\, class(j) = i\}} \mathcal{X}_j$; this can be visualized with the help of an example as shown in the drawings.
The observation model is such that evaluating the function at any point x reveals the function evaluation (y), the class label (z), and the minimum distance to a boundary of the sub-region (w), since the structure of the non-stationarity can be evaluated for many engineering problems. A training data set is denoted $\mathcal{D} = \{x_n, y_n, z_n, w_n\}_{n=1}^{N}$, where N is the number of training data points. Also, $X = [x_1, \ldots, x_N]^T$; y and z are the vectors of corresponding evaluations and classes, respectively; and W is a list of tuples of distance and the feature index along which the distance is measured.
System 100 comprises one or more network interfaces 110 (e.g., wired, wireless, PLC, etc.), at least one processor 120, and a memory 140 interconnected by a system bus 150, as well as a power supply 160 (e.g., battery, plug-in, etc.).
Network interface(s) 110 include the mechanical, electrical, and signaling circuitry for communicating data over the communication links coupled to a communication network. Network interfaces 110 are configured to transmit and/or receive data using a variety of different communication protocols. As illustrated, the box representing network interfaces 110 is shown for simplicity, and it is appreciated that such interfaces may represent different types of network connections such as wireless and wired (physical) connections. Network interfaces 110 are shown separately from power supply 160, however it is appreciated that the interfaces that support PLC protocols may communicate through power supply 160 and/or may be an integral component coupled to power supply 160.
Memory 140 includes a plurality of storage locations that are addressable by processor 120 and network interfaces 110 for storing software programs and data structures associated with the embodiments described herein. In some embodiments, system 100 may have limited memory or no memory (e.g., no memory for storage other than for programs/processes operating on the device and associated caches). Memory 140 can include instructions executable by the processor 120 that, when executed by the processor 120, cause the processor 120 to implement aspects of the ClassGP and/or ClassBO framework and associated methods (e.g., method 200 and/or method 400) outlined herein. In some examples, memory 140 can include non-transitory computer readable media including instructions encoded thereon that are executable by the processor 120 to implement aspects of the ClassGP and/or ClassBO framework and associated methods (e.g., method 200 and/or method 400) outlined herein.
Processor 120 comprises hardware elements or logic adapted to execute the software programs (e.g., instructions) and manipulate data structures 145. An operating system 142, portions of which are typically resident in memory 140 and executed by the processor, functionally organizes system 100 by, inter alia, invoking operations in support of software processes and/or services executing on the device. These software processes and/or services may include Class-based Bayesian Modeling/Optimization processes/services 190, which can include aspects of the methods and/or implementations of various modules described herein, particularly with respect to Section 4 herein including method 200 and method 400. Note that while Class-based Bayesian Modeling/Optimization processes/services 190 is illustrated in centralized memory 140, alternative embodiments provide for the process to be operated within the network interfaces 110, such as a component of a MAC layer, and/or as part of a distributed computing network environment.
It will be apparent to those skilled in the art that other processor and memory types, including various computer-readable media, may be used to store and execute program instructions pertaining to the techniques described herein. Also, while the description illustrates various processes, it is expressly contemplated that various processes may be embodied as modules or engines configured to operate in accordance with the techniques herein (e.g., according to the functionality of a similar process). In this context, the terms module and engine may be interchangeable. In general, the term module or engine refers to a model or an organization of interrelated software components/functions. Further, while the Class-based Bayesian Modeling/Optimization processes/services 190 is shown as a standalone process, those skilled in the art will appreciate that this process may be executed as a routine or module within other processes.
This section gives a brief overview of Gaussian Process modeling and the classification tree algorithm used in the ClassGP framework.
Gaussian process (GP) modeling is a popular statistical framework to model non-linear black-box functions ƒ due to the analytical tractability of its posteriors. Within this framework, the function being modeled, $f: \mathcal{X} \subseteq \mathbb{R}^d \to \mathbb{R}$, is assumed to be distributed as per a Gaussian process prior, formally written as follows:
$$f(x) \sim \mathcal{GP}(\mu(x), \kappa(x, x'))$$
A GP is completely specified by its mean μ(.) and covariance kernel κ(.,.), where, for convenience and without loss of generality, the mean function μ(.) is set to 0. The choice of the covariance kernel depends on the degree of smoothness warranted by the function being modeled and defaults to stationary kernels such as the squared exponential (SE) kernel or the Matérn kernel. Functions modeled within this framework are typically assumed to be stationary, i.e., the function can be modeled using the same stationary covariance function over the entire input space.
Learning a GP model involves computing a posterior conditioned on the observed data and learning the kernel hyperparameters. Let $\mathcal{D}_n = \{(x_1, y_1), \ldots, (x_n, y_n)\}$ be the sampled training data of the objective function ƒ. The posterior of the function ƒ conditioned on the training data $\mathcal{D}_n$ is given by a Gaussian distribution, i.e., $f(x) \mid \mathcal{D}_n \sim \mathcal{N}(\mu_n(x), \sigma_n^2(x))$, where the mean $\mu_n(x)$ and covariance $\sigma_n^2(x)$ are given as follows:
$$\mu_n(x) = k^T K^{-1} y, \qquad \sigma_n^2(x) = \kappa(x, x) - k^T K^{-1} k$$
Here, y is the vector of noise-free observations, k is a vector with $k_p = \kappa(x, x_p)$, and the matrix K is such that $K_{p,q} = \kappa(x_p, x_q)$ for $p, q \in \{1, \ldots, n\}$.
The hyperparameters of the model are learnt by maximizing the log marginal likelihood, which is given as follows:
$$\log p(y \mid X, \theta) = -\tfrac{1}{2} y^T K^{-1} y - \tfrac{1}{2} \log |K| - \tfrac{n}{2} \log 2\pi$$
and $\theta^* = \arg\max_\theta \log p(y \mid X, \theta)$. This optimization problem is solved using off-the-shelf non-convex optimization packages such as Dividing Rectangles (DIRECT), L-BFGS, or CMA-ES.
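For illustration, a minimal numerical sketch of the GP posterior computation and hyperparameter fitting described above might look as follows (a squared exponential kernel, the noise-free setting, and an L-BFGS optimizer are assumed; names are illustrative and not part of the disclosure):

```python
import numpy as np
from scipy.optimize import minimize

def se_kernel(A, B, lengthscale, variance):
    # Squared exponential kernel: variance * exp(-||a - b||^2 / (2 * lengthscale^2)).
    d2 = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return variance * np.exp(-0.5 * d2 / lengthscale**2)

def gp_posterior(X, y, Xstar, lengthscale, variance, jitter=1e-8):
    # Posterior mean and variance of a zero-mean GP conditioned on noise-free data (X, y).
    K = se_kernel(X, X, lengthscale, variance) + jitter * np.eye(len(X))
    k = se_kernel(X, Xstar, lengthscale, variance)
    mean = k.T @ np.linalg.solve(K, y)
    var = variance - np.sum(k * np.linalg.solve(K, k), axis=0)
    return mean, var

def neg_log_marginal_likelihood(log_params, X, y):
    # Negative log marginal likelihood, optimized over log-hyperparameters for stability.
    lengthscale, variance = np.exp(log_params)
    K = se_kernel(X, X, lengthscale, variance) + 1e-8 * np.eye(len(X))
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    return 0.5 * y @ alpha + np.sum(np.log(np.diag(L))) + 0.5 * len(y) * np.log(2 * np.pi)

def fit_hyperparameters(X, y):
    # theta* = argmax_theta log p(y | X, theta), here via L-BFGS as a generic non-convex optimizer.
    res = minimize(neg_log_marginal_likelihood, x0=np.zeros(2), args=(X, y), method="L-BFGS-B")
    return np.exp(res.x)
```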
A classification tree (also referred to as a decision tree classifier) is a binary tree algorithm which yields axis-aligned partitions of the input space by recursively partitioning the input space on one dimension (feature) at a time. The classification tree is "learnt" from the training data, and the predictions for any input x are given by traversing the learnt tree from the root node to a leaf node. Notice that each leaf node corresponds to a partition of the input space.
In some examples, the CART algorithm is used to grow/learn the tree. During training, the goal is to select the best feature and splitting threshold that minimize the weighted impurity metric of the children nodes. Many tree-based algorithms typically use the Gini index as the impurity metric to grow the tree, which is given as follows:
$$G = \sum_{i} p_i (1 - p_i)$$
where $p_i$ is the probability of a given class at a given node. The recursion is continued until one of the stopping criteria is met or no further improvement in the impurity index is achievable. Typical choices of stopping criteria include the maximum depth of the tree, the minimum number of samples in a leaf, and the maximum number of leaves/partitions.
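As a concrete illustration of this classification-tree step (using scikit-learn's DecisionTreeClassifier as an assumed stand-in; the labels and stopping criteria shown are hypothetical), a tree yielding axis-aligned partitions can be learned as follows:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def gini_impurity(labels):
    # Gini index: sum_i p_i * (1 - p_i) = 1 - sum_i p_i^2, with p_i the class-i probability at a node.
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p**2)

# Hypothetical training data: inputs X and class labels z observed alongside function evaluations.
rng = np.random.default_rng(0)
X = rng.uniform(-10, 10, size=(200, 2))
z = (np.floor(X[:, 0] / 5) + np.floor(X[:, 1] / 5)).astype(int) % 2  # checkered-style labels

# CART with the Gini criterion; stopping criteria (max_depth, min_samples_leaf) bound the partitions.
tree = DecisionTreeClassifier(criterion="gini", max_depth=4, min_samples_leaf=5).fit(X, z)

# Each leaf of the learned tree corresponds to an axis-aligned partition of the input space.
leaf_ids = tree.apply(X)  # partition index for each training point
print(gini_impurity(z), np.unique(leaf_ids).size)
```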
Standard BO uses a Bayesian statistical model, typically a Gaussian process, to model the objective function and an acquisition function to navigate the input space efficiently.
The Bayesian statistical model (prior) formulates a distribution over the space of objective functions, which is then used to compute the posterior over the set of observations. A Gaussian process (GP) prior is used in general due to its analytical tractability when evaluating the posterior (e.g., as in the ClassGP framework outlined in Section 4A). The GP prior is completely given by its mean function μ(.) and covariance function $\kappa_\theta(.,.)$, i.e.,
$$f(x) \sim \mathcal{GP}(\mu(x), \kappa_\theta(x, x'))$$
here, θ represents the (unknown) hyperparameters (length scales) of the model and, w.l.o.g., the mean function μ(.) is set to 0. For the set of noisy observations $\mathcal{D}_n = \{(x_1, y_1), \ldots, (x_n, y_n)\}$, where the noise is i.i.d. Gaussian with variance $\sigma^2$, the posterior on the data $\mathcal{D}_n$ at any new point in the input space is given by a Gaussian distribution, i.e., $f(x) \mid \mathcal{D}_n \sim \mathcal{N}(\mu_n(x), \sigma_n^2(x))$, with the mean $\mu_n(x)$ and variance $\sigma_n^2(x)$ given by the following closed-form expressions:
$$\mu_n(x) = k^T (K + \sigma^2 I_n)^{-1} Y, \qquad \sigma_n^2(x) = \kappa_\theta(x, x) - k^T (K + \sigma^2 I_n)^{-1} k$$
Here, Y is the observation vector, k is a vector with $k_s = \kappa_\theta(x, x_s)$, and the matrix K is such that $K_{s,t} = \kappa_\theta(x_s, x_t)$ for all $s, t \in \{1, \ldots, n\}$. The hyperparameters θ of the model are usually estimated by maximizing the log marginal likelihood (MLE), which is given as follows:
$$\theta^* = \arg\max_\theta \log p(Y \mid X, \theta), \qquad \log p(Y \mid X, \theta) = -\tfrac{1}{2} Y^T (K + \sigma^2 I_n)^{-1} Y - \tfrac{1}{2} \log |K + \sigma^2 I_n| - \tfrac{n}{2} \log 2\pi$$
An acquisition function uses information from past observed samples to efficiently navigate the input space by leveraging both the mean and uncertainty estimates of the posterior. The most widely used acquisition functions are the Upper Confidence Bound (UCB) and Expected Improvement (EI). In some examples outlined herein, an example UCB acquisition function can be given as follows:
$$a_n(x) = \mu_n(x) + \sqrt{\beta}\, \sigma_n(x)$$
Here, β is a hyperparameter that controls the trade-off between exploration and exploitation.
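A minimal sketch of this UCB acquisition function and its maximization over a candidate set (random-search maximization is used here purely for illustration) is shown below:

```python
import numpy as np

def ucb(mean, std, beta):
    # Upper Confidence Bound: mu_n(x) + sqrt(beta) * sigma_n(x).
    return mean + np.sqrt(beta) * std

def propose_next_point(posterior, bounds, beta=2.0, n_candidates=2048, rng=None):
    # Maximize the UCB acquisition over uniformly drawn candidates (an illustrative stand-in
    # for an off-the-shelf optimizer such as DIRECT or CMA-ES).
    rng = rng or np.random.default_rng()
    lo, hi = bounds
    cands = rng.uniform(lo, hi, size=(n_candidates, len(lo)))
    mean, var = posterior(cands)  # posterior callable returning (mu_n, sigma_n^2)
    scores = ucb(mean, np.sqrt(var), beta)
    return cands[np.argmax(scores)]
```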
This section introduces ClassGP for implementation by the systems outlined herein to model an observed system as a family of heterogeneous functions as defined in sections 2 and 3 and as illustrated in the drawings. The input space $\mathcal{X} \subseteq \mathbb{R}^d$ can be a compact space with p axis-aligned partitions $\{\mathcal{X}_j\}_{j=1}^{p}$, and each partition $j \in [p]$ is assigned a class label $i \in [m]$, i.e., $class(j) = i$, referred to as the class-partition space. The class labels may be known, and the partitions (i.e., as subsets of the classes) can be unknown. The unknown function of the observed system 10 can be modeled as a family of non-stationary functions ƒ defined over the class-partition space, $f: \mathcal{X} \to \mathbb{R}$, such that ƒ reduces to a stationary function $g_j$ over each partition $j \in [p]$, where each $g_j$ is sampled from a Gaussian process with a continuous stationary kernel $\kappa_j$, i.e., $g_j \sim \mathcal{GP}(\mu_j(.), \kappa_j(.,.))$. Importantly, partitions $j_1, j_2$ that belong to the same class i can have the same covariance kernel, i.e., $\kappa_{j_1} = \kappa_{j_2}$.
The Class-based Bayesian Modeling/Optimization processes/services 190 can be implemented by the system 100 to develop a heterogeneous function model ƒ for the observed system 10 based on observation data obtained from the observed system 10 (i.e., training data), where individual stationary functions gj over each partition j∈[p] can be sampled from a Gaussian process with a continuous stationary kernel κj. As further shown, the output of the system 100 can be used in downstream applications, such as but not limited to simulation (e.g., simulation application 20) to simulate the observed system 10 using the heterogeneous function model ƒ. This can help researchers, designers, and engineers by providing a more accurate model for simulating the observed system. The output of the system 100 can also be used for optimization (e.g., optimization application 30) which can help researchers, designers, and engineers optimize aspects of a system or supporting structure that depends on, affects, or is affected by the dynamics of the observed system 10. Other practical uses of the output of the system 100 outlined herein can include further study of the observed system 10, including understanding the dynamics of the observed system 10 based on how the system behaves and how individual partitions and classes contribute to behavior of the observed system 10.
Referring to step 210 of a method 200 shown in the drawings, the system 100 obtains a training data set including observation data from the observed system 10, which can include function evaluations, class labels, and distance-to-boundary information.
As further shown at step 220 of the method 200, the system 100 learns partitions of the input space by constructing a binary classification tree over the training data.
At step 230 of the method 200, the system 100 models the underlying stationary function within each learned partition using a Gaussian process prior and learns class-level kernel hyperparameters as outlined herein.
The binary classification tree is further elaborated on below.
A decision tree algorithm tailored for the framework is used to learn the partitions of the input space. The decision tree algorithm implemented by the system "learns" or otherwise constructs a binary classification tree in two or more sub-steps as shown in step 220 of the method 200.
In a non-limiting illustrative example corresponding to an automatic transmission within a vehicle being the observed system 10, where partitions correspond to individual gears that the transmission could switch to, an input instance to the observed system 10 could include a current ground speed of the vehicle, a current gear, and/or an engine speed (e.g., RPM). For the root node, consider an example where a feature index obtained during construction of the binary classification tree 310 could start with the current gear, and a splitting threshold could be "3rd gear or higher". If the input instance indicates that the vehicle is in 3rd gear or higher, then a "yes" pathway can be selected which leads to a first child node. If not, then a "no" pathway can be selected which leads to a second child node. Continuing with the example, consider that the "yes" pathway at the root node is taken, leading to the first child node. For the first child node, a feature index could be the same feature (e.g., directed to which gear the vehicle is currently in) but with a different splitting threshold (e.g., "5th gear or higher") which can isolate higher gears from medium gears. Alternatively, the first child node can be directed to a different feature (e.g., directed to a current engine speed) having a splitting threshold (e.g., higher or lower than a certain RPM value). The binary splitting process can continue recursively along the length of the binary classification tree 310 until a leaf node is reached, which can correlate with a partition within a class.
The partitions of the input space learnt from the decision tree algorithm in section 4A.1 are used to divide the training data set into partition-based subsets $\mathcal{D}_j$ with $n_j$ data points for all $j \in [p]$. Referring to the drawings, for each partition-based subset $\mathcal{D}_j$, the underlying stationary function $g_j$ is modeled using a zero-mean Gaussian process prior with continuous stationary kernel $\kappa_j(.,.)$, and the subset of training data $\mathcal{D}_j$ is used to learn/train the model. The function modeling and training in each partition can be similar to that of a standard Gaussian process regression problem, with the exception of learning the hyperparameters of the partition GPs. The posterior of a partition GP conditioned on the partition training data is given by $y(x) \mid x, j, \mathcal{D}_j \sim \mathcal{N}(\mu_{j,n_j}(x), \sigma_{j,n_j}^2(x))$, with mean $\mu_{j,n_j}(x)$ and variance $\sigma_{j,n_j}^2(x)$ given as follows:
$$\mu_{j,n_j}(x) = k_j^T \hat{K}_j\, y_j, \qquad \sigma_{j,n_j}^2(x) = \kappa_j(x, x) - k_j^T \hat{K}_j\, k_j, \qquad \hat{K}_j = (K_j + \eta^2 I_{n_j})^{-1}$$
where $y_j$ is the vector of observations in the given partition j, $k_j$ is the vector with $k_j^{(s)} = \kappa_j(x, x_s)$, and the matrix $K_j$ is such that $K_j^{(s,t)} = \kappa_j(x_s, x_t)$ where $s, t \in \{1, \ldots, n_j\}$. Note the superscripts here represent the components of the vector and matrix.
Learning hyperparameters in a standard GP typically involves finding the set of hyperparameters that maximizes the log marginal likelihood of the observations (eq. 4), whereas in the current framework the set of hyperparameters needs to maximize the log marginal likelihood across all the partitions within a class. As such, a method of learning hyperparameters by the system is outlined herein.
Each partition of the plurality of partitions of the class label can respectively have an underlying partition-level stationary function (gj) modeled using a Gaussian process (GP) prior distribution having a continuous stationary kernel (κj), the continuous stationary kernel (κj) being applicable for each partition of the plurality of partitions of the class label. A new class likelihood is formed as a summation of log marginal likelihoods of all partition GPs for a given class. Class kernel hyperparameters are learnt by joint optimization (e.g., maximization) of the partition-level log marginal likelihoods for the plurality of partitions of the class label. The formulation of the class-likelihood function assumes that the data from different partitions are independent of each other but can still be combined into the class-likelihood function, and this reduces the computational complexity of learning the hyperparameters significantly while still taking into account data from all the partitions in the class.
The extensive empirical results show that this new class likelihood reduces the modeling error and provides better estimates of the hyperparameters; intuitively this makes sense, as there are more data points to estimate the hyperparameters even though all the data points do not belong to the same partition. The new class likelihood function for a given class i is given as follows:
$$\mathcal{L}_i(y_i \mid X_i, \theta_i) = \sum_{\{j:\, class(j) = i\}} \log p(y_j \mid X_j, \theta_i) = -\tfrac{1}{2} y_i^T K_i^{-1} y_i - \tfrac{1}{2} \log |K_i| - \tfrac{n_i}{2} \log 2\pi$$
where $\theta_i^* = \arg\max_{\theta_i} \mathcal{L}_i(y_i \mid X_i, \theta_i)$, $K_i$ is the block-diagonal matrix with blocks $K_j$ for all $\{j: class(j) = i\}$, $n_i = \sum_{\{j:\, class(j) = i\}} n_j$, and $y_i$ is the vector formed by $y_j$ for all $\{j: class(j) = i\}$.
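A minimal sketch of learning class-level hyperparameters by maximizing this summed class likelihood (kernel and optimizer choices here are assumptions, not the disclosure's implementation) might look as follows:

```python
import numpy as np
from scipy.optimize import minimize

def partition_neg_lml(log_theta, Xj, yj, kernel, noise=1e-6):
    # Negative log marginal likelihood of one partition GP using shared class hyperparameters theta.
    Kj = kernel(Xj, Xj, np.exp(log_theta)) + noise * np.eye(len(Xj))
    L = np.linalg.cholesky(Kj)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, yj))
    return 0.5 * yj @ alpha + np.sum(np.log(np.diag(L))) + 0.5 * len(yj) * np.log(2 * np.pi)

def fit_class_hyperparameters(partitions, kernel, theta0):
    # partitions: list of (Xj, yj) for all partitions j with class(j) = i.
    # The class likelihood is the sum of the partition log marginal likelihoods (block-diagonal K_i),
    # so a single optimization yields one theta_i shared by every partition of the class.
    def class_neg_lml(log_theta):
        return sum(partition_neg_lml(log_theta, Xj, yj, kernel) for Xj, yj in partitions)
    res = minimize(class_neg_lml, x0=np.log(theta0), method="L-BFGS-B")
    return np.exp(res.x)
```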
In an illustrative example shown in the drawings, a heterogeneous function model ƒ for the observed system can include, for a first partition $j_1$, a first partition-level stationary function $g_{j_1}$ modeled using a Gaussian process prior with a first covariance kernel $\kappa_{j_1}$, and, for a second partition $j_2$ belonging to the same class, a second partition-level stationary function $g_{j_2}$ with a second covariance kernel $\kappa_{j_2}$ that shares the same class-level kernel hyperparameters as $\kappa_{j_1}$.
Further, for a third partition $j_3$, the heterogeneous function model ƒ for the observed system can include a third partition-level stationary function $g_{j_3}$ modeled using a Gaussian process prior with a third covariance kernel $\kappa_{j_3}$.
Because a third input example $x_3$ for the third partition $j_3$ belongs to a class i=2, which is different from that of the first partition $j_1$ and the second partition $j_2$, the third covariance kernel $\kappa_{j_3}$ can be different from the first covariance kernel $\kappa_{j_1}$ and the second covariance kernel $\kappa_{j_2}$.
The heterogeneous function model ƒ can be constructed from class likelihood functions for each class of the plurality of classes, which are themselves constructed from partition-level likelihood functions from the partitions within their respective class (where the partition-level likelihood functions correspond with their associated Gaussian process priors or stationary functions $g_j$).
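The following sketch (with hypothetical container names) illustrates one way such a model could route a prediction: the classification tree maps an input to its partition, and that partition's GP posterior, whose kernel hyperparameters are shared across its class, produces the prediction.

```python
import numpy as np

class ClassGPModel:
    def __init__(self, tree, partition_gps, partition_class):
        self.tree = tree                        # learned classification tree: x -> partition index j
        self.partition_gps = partition_gps      # dict: j -> GP posterior callable returning (mu_j, sigma_j^2)
        self.partition_class = partition_class  # dict: j -> class label i (hyperparameters shared per class)

    def predict(self, x):
        # Route the query point to its partition, then use that partition's GP posterior.
        j = int(self.tree.apply(np.atleast_2d(x))[0])
        mean, var = self.partition_gps[j](np.atleast_2d(x))
        return mean, var, self.partition_class[j]
```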
The functions performed in the processes and methods may be implemented in differing order. Furthermore, the outlined steps and operations are provided as examples, and some of the steps and operations may be optional, combined into fewer steps and operations, or expanded into additional steps and operations without detracting from the essence of the disclosed embodiments.
Referring to the ClassBO algorithm of method 400, each iteration t proceeds as follows. Given the observations collected so far, $\mathcal{D}_t = \{(x_1, y_1), \ldots, (x_t, y_t)\}$, the sub-regions (partitions) of the input space are learned/sampled and the ClassGP posterior mean $\mu_{j,t}$ and variance $\sigma_{j,t}^2$ are computed for each sub-region j. The next sample point is selected by maximizing the ClassUCB acquisition function $a_t(x)$, i.e., $x_{t+1} = \arg\max_x a_t(x)$. The function is then evaluated at $x_{t+1}$ to obtain a noisy observation $y_{t+1}$ with observation noise drawn from $\mathcal{N}(0, \eta^2)$, and the data set is updated as $\mathcal{D}_{t+1} = \mathcal{D}_t \cup \{(x_{t+1}, y_{t+1})\}$.
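A high-level sketch of this iteration loop is shown below; the helper callables are placeholders for the partition learning, ClassGP fitting, and ClassUCB maximization steps described herein:

```python
import numpy as np

def class_bo(f_eval, X0, y0, z0, w0, n_iters, learn_partitions, fit_class_gps, maximize_class_ucb):
    # f_eval(x) -> (y, class_label, distance_to_boundary); the other callables are placeholders.
    X, y, z, w = list(X0), list(y0), list(z0), list(w0)
    for t in range(n_iters):
        tree = learn_partitions(np.array(X), np.array(z), np.array(w))    # decision tree + MH tree sampling
        gps = fit_class_gps(tree, np.array(X), np.array(y), np.array(z))  # ClassGP posteriors per sub-region
        x_next = maximize_class_ucb(tree, gps)                            # x_{t+1} = argmax_x a_t(x)
        y_next, z_next, w_next = f_eval(x_next)                           # noisy evaluation, class, distance
        X.append(x_next); y.append(y_next); z.append(z_next); w.append(w_next)
    best = int(np.argmax(y))
    return X[best], y[best]
```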
As discussed above with respect to step 420 of method 400, partition sampling for ClassBO involves two key steps: learning the initial partitions of the input space based on the distance information using a decision tree algorithm (similar to sub-steps 1 and 2 of step 220 of the method 200), and sampling candidate trees from the posterior tree distribution using a Metropolis-Hastings tree sampler to account for the uncertainty in the initial tree.
The decision tree (CART) algorithm "learns" a tree T by executing recursive binary splits on the data, selecting the feature and the splitting threshold that provide the maximum reduction in the impurity (which can be measured with the Gini index). To reiterate, the Gini impurity index used in this work is given as follows:
$$G = \sum_{i} p_i (1 - p_i)$$
where $p_i$ is the probability of a given class at a given node. Notice that each leaf node of the tree corresponds to a sub-region of the input space. A tree can be represented by T, with interior nodes indexed by $\eta \in \{1, \ldots, u\}$ and leaf nodes indexed by $j \in \{1, \ldots, v\}$, where u is the total number of interior nodes and v is the total number of leaf nodes in the tree T.
ClassBO can use the modified decision tree algorithm of ClassGP to learn the sub-regions (i.e., partitions) of the input space based on the distance information. Specifically, ClassBO uses the distance-to-closest-boundary information W for selecting the features and splitting thresholds to form an initial tree in each iteration of the ClassBO algorithm (as in sections 3A and 4A pertaining to ClassGP). The partitioned sub-regions (leaf nodes) of the input space are assigned class labels based on the maximally observed sample class in the given sub-region.
ClassBO can employ a Metropolis-Hastings (MH) based tree sampling algorithm to account for the uncertainty in the initial tree from steps 1 and 2 of the decision tree algorithm of ClassGP. The MH algorithm is a Markov chain Monte Carlo method to sample from a target distribution when direct sampling is difficult.
The MH tree sampler uses the tree learned from the decision tree algorithm as initialization and uses a proposal distribution to sample candidate trees from the posterior. An MH tree sampler can be formulated by defining: (i) a target distribution on the space of trees $p(T \mid \mathcal{D})$, (ii) a proposal distribution with transition kernel $q(T, T')$ to generate tree candidates, and (iii) an acceptance ratio $A(T, T')$ to accept or reject the candidate trees.
An implicit tree prior can be specified on the space of trees. Specifically, a stochastic process is defined for generating trees, and each realization of this tree-generating process is a random sample from the prior p(T). The tree-generating process starts with an initial tree $T_0$. The tree is then grown randomly by splitting each leaf node (j), where the leaf node is split with a probability $p_{split}$, followed by assigning a splitting rule with probability $p_{rule}$. The present disclosure outlines a split probability function dependent on the depth of the node, where $d_j$ is the depth of the node j and $\alpha, \beta > 0$ are user-defined tunable parameters. A splitting rule is assigned to the splitting node by uniformly selecting a feature to split on, followed by uniformly selecting the threshold to split on over the feasible range.
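For illustration, the sketch below draws one random tree from such an implicit prior; the specific depth-dependent form α(1+d_j)^(−β) used here is an assumption (a standard choice for tree priors of this kind), not necessarily the exact expression of the disclosure:

```python
import numpy as np

rng = np.random.default_rng(0)

def p_split(depth, alpha=0.95, beta=1.0):
    # Depth-dependent split probability; alpha * (1 + depth)^(-beta) is assumed here, alpha, beta > 0.
    return alpha * (1.0 + depth) ** (-beta)

def grow_random_tree(bounds, depth=0, max_depth=6):
    # Draw one tree from the implicit prior: split a leaf with probability p_split(depth), then pick a
    # feature uniformly and a threshold uniformly over its feasible range (the splitting rule).
    if depth >= max_depth or rng.random() > p_split(depth):
        return {"leaf": True, "bounds": bounds}
    feat = rng.integers(len(bounds))
    lo, hi = bounds[feat]
    thr = rng.uniform(lo, hi)
    left = [b if k != feat else (lo, thr) for k, b in enumerate(bounds)]
    right = [b if k != feat else (thr, hi) for k, b in enumerate(bounds)]
    return {"leaf": False, "feature": feat, "threshold": thr,
            "left": grow_random_tree(left, depth + 1, max_depth),
            "right": grow_random_tree(right, depth + 1, max_depth)}
```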
The proposal distribution generates the candidate tree T′ given the current tree T. As illustrated in the drawings, candidate trees are generated by randomly modifying the structure of the current tree according to the transition kernel q(T, T′).
The acceptance ratio A(T, T′) to accept or reject the generated tree candidate is given as follows:
$$A(T, T') = \min\left\{1,\; \frac{p(\mathcal{D} \mid T')\, p(T')\, q(T', T)}{p(\mathcal{D} \mid T)\, p(T)\, q(T, T')}\right\}$$
Notice that it is not necessary to compute p(T) or p(T′). Instead, the ratio p(T′)/p(T) can be computed, which is only dependent on $p_{split}$ and $p_{rule}$.
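A minimal sketch of the resulting MH accept/reject step over trees (the proposal, prior-ratio, and likelihood functions are placeholders) is shown below:

```python
import numpy as np

def mh_tree_step(tree, data, propose, log_marginal_lik, log_prior_ratio, log_proposal_ratio, rng):
    # One MH step: propose T' ~ q(T, .), then accept with probability
    # A(T, T') = min{1, [p(D|T') p(T') q(T', T)] / [p(D|T) p(T) q(T, T')]}.
    cand = propose(tree, rng)
    log_A = (log_marginal_lik(cand, data) - log_marginal_lik(tree, data)
             + log_prior_ratio(cand, tree)       # log p(T')/p(T), depends only on p_split, p_rule
             + log_proposal_ratio(cand, tree))   # log q(T', T)/q(T, T')
    if np.log(rng.random()) < min(0.0, log_A):
        return cand
    return tree
```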
ClassBO uses ClassGP as outlined in sections 3A and 4A herein to model the target function, where the sub-regions (i.e., partitions) of the input space learned based on the binary classification tree divide the training data set into subsets $\mathcal{D}_j$ with $n_j$ data points for all $j \in [p]$. For each sub-region $\mathcal{X}_j$, the underlying stationary function $g_j$ can be modeled using a zero-mean Gaussian process prior with continuous stationary kernel $\kappa_j(.,.)$. The subset $\mathcal{D}_j$ is used to compute the posterior for the sub-region j. The computation of the posterior is similar to that of standard GP regression, with one exception: learning the hyperparameters for each sub-region. The conditional GP posterior for a sub-region j is given by $y(x) \mid x, j, \mathcal{D}_j \sim \mathcal{N}(\mu_{j,n_j}(x), \sigma_{j,n_j}^2(x))$, where
$$\mu_{j,n_j}(x) = k_j^T \hat{K}_j\, y_j, \qquad \sigma_{j,n_j}^2(x) = \kappa_j(x, x) - k_j^T \hat{K}_j\, k_j, \qquad \hat{K}_j = (K_j + \eta^2 I_{n_j})^{-1}$$
where $y_j$ is the vector of observations in the given sub-region j, $k_j$ is the vector with $k_j^{(s)} = \kappa_j(x, x_s)$, and the matrix $K_j$ is such that $K_j^{(s,t)} = \kappa_j(x_s, x_t)$ where $s, t \in \{1, \ldots, n_j\}$. Note the superscripts here represent the components of the vector and matrix.
Learning hyperparameters in a standard GP involves maximizing the log marginal likelihood of the observations as shown in equation (6), whereas, in order to share information across the sub-regions with the same class label, the hyperparameters are learned by maximizing the log marginal likelihood across all the sub-regions that have the same class label. The class likelihood, as outlined in section 4A.2, is formed as a summation of the log marginal likelihoods of the GPs of all the sub-regions with a given class label and is maximized to learn the hyperparameters. Intuitively, the new class likelihood makes sense, as the sub-regions that have the same class label share the same underlying homogeneous function. The new class likelihood function for a given class i is given as follows:
$$\mathcal{L}_i(y_i \mid X_i, \theta_i) = \sum_{\{j:\, class(j) = i\}} \log p(y_j \mid X_j, \theta_i)$$
where the set $\{j: class(j) = i\}$ denotes all sub-regions j that have class label i, $K_i$ is the block-diagonal matrix with blocks of $K_j$'s, $y_i$ is the vector formed by concatenating $y_j$ for all $\{j: class(j) = i\}$, $n_i = \sum_{\{j:\, class(j) = i\}} n_j$, and $\theta_i^* = \arg\max_{\theta_i} \mathcal{L}_i(y_i \mid X_i, \theta_i)$.
As discussed with respect to step 440 of method 400, ClassBO utilizes the statistical model learned over the sub-regions of the input space to formulate the ClassUCB acquisition function to navigate the input space efficiently. The next sampling point for each iteration is decided by maximizing the ClassUCB acquisition function, i.e., $x_{t+1} = \arg\max_{x} a_t(x)$, where the ClassUCB acquisition function $a_t(x)$ applies an upper confidence bound of the form $\mu_{j,t}(x) + \sqrt{\beta_{j,t}}\, \sigma_{j,t}(x)$ using the posterior of the sub-region j that contains x. In each iteration, once the new data point is sampled, the algorithm relearns the sub-regions of the input space, followed by computing the posterior in each sub-region.
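A minimal sketch of evaluating a ClassUCB-style acquisition over a candidate set using the per-sub-region posteriors (the dictionary-based bookkeeping and candidate-set maximization are illustrative assumptions) might look as follows:

```python
import numpy as np

def class_ucb_next_point(tree, partition_gps, beta_by_partition, bounds, n_candidates=4096, rng=None):
    # For each candidate x, score it with the posterior of the sub-region containing x:
    # a_t(x) = mu_{j,t}(x) + sqrt(beta_{j,t}) * sigma_{j,t}(x), with j given by the learned tree.
    rng = rng or np.random.default_rng()
    lo, hi = bounds
    cands = rng.uniform(lo, hi, size=(n_candidates, len(lo)))
    leafs = tree.apply(cands)
    scores = np.empty(n_candidates)
    for j in np.unique(leafs):
        idx = leafs == j
        mean, var = partition_gps[j](cands[idx])
        scores[idx] = mean + np.sqrt(beta_by_partition[j]) * np.sqrt(var)
    return cands[np.argmax(scores)]
```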
The functions performed in the processes and methods may be implemented in differing order. Furthermore, the outlined steps and operations are provided as examples, and some of the steps and operations may be optional, combined into fewer steps and operations, or expanded into additional steps and operations without detracting from the essence of the disclosed embodiments.
This section provides a formal statement for probabilistic uniform error bounds for ClassGP framework. These bounds are derived based on the assumption that each unknown function gj is a sample from a Gaussian process with known kernel function κj.
Theorem 1A. Consider an unknown function $f: \mathcal{X} \to \mathbb{R}$ which induces p partitions on the input space and is given as $f = \sum_{j=1}^{p} \mathbb{1}\{x \in \mathcal{X}_j\}\, g_j$, where each $g_j$ has a Lipschitz constant $L_{g_j}$ and $\kappa_j$ is the continuous kernel from which each $g_j$ is sampled, for all $j \in [p]$. Given $n_j \in \mathbb{N}$ noisy observations $y_j$ with i.i.d. zero-mean Gaussian noise in a given partition, the posterior mean $\mu_{n_j}$ and posterior standard deviation $\sigma_{n_j}$ conditioned on the data $\mathcal{D}_j = \{X_j, y_j\}$ of the partition are continuous with a Lipschitz constant and a modulus of continuity on $\mathcal{X}_j$ that are expressed in terms of $\hat{K}_j = (K_j + \eta^2 I_{n_j})^{-1}$ and the kernel $\kappa_j$. Moreover, pick $\delta_j \in (0, 1)$ and $r \in \mathbb{R}_+$ and set the corresponding confidence and discretization constants; then a probabilistic uniform error bound between $g_j$ and the posterior mean $\mu_{n_j}$ holds on each partition with probability $1 - \delta_j$, and the corresponding bound on the entire input space holds with probability $1 - \delta$, where $\delta = \sum_{j=1}^{p} \mathbb{1}\{x \in \mathcal{X}_j\}\, \delta_j$.
Corollary 1. Given the problem setup defined in Theorem 1A, a corresponding bound on the $L_1$ norm of the error holds with probability $1 - \delta$, where ζ is a non-negative constant appearing in the bound, $\delta = \sum_{j=1}^{p} \delta_j$, and each $\delta_j$ is determined by a covering number $M(r, \mathcal{X}_j)$ of the partition and an exponential concentration term.
This section analyzes the ClassBO algorithm under the assumption that the true partitioning of the input space (tree T) is known, and shows that the algorithm converges to the optimal value asymptotically. To restate the problem setup and assumptions used to analyze the algorithm: let $f: \mathcal{X} \subseteq \mathbb{R}^d \to \mathbb{R}$ be a heterogeneous black-box function defined over a compact space $\mathcal{X}$, and let the $g_j$'s be the homogeneous functions in the sub-regions $\mathcal{X}_j$, $\forall j \in [p]$. Let $x^* \in \mathcal{X} \subseteq \mathbb{R}^d$ be the unique optimal point. Querying the function returns a noisy observation $y_v = f(x_v) + \epsilon_v$ with i.i.d. $\epsilon_v \sim \mathcal{N}(0, \sigma^2)$.
Assumption 1. The function in each sub-region is a sample from a Gaussian process, i.e., $g_j \sim \mathcal{GP}(0, \kappa_j(.,.))\ \forall j \in [p]$, with the known kernel function $\kappa_j$. This is equivalent to $f \sim \mathcal{GP}(0, \kappa(.,.))$, where $\kappa(x, x') = \sum_{j} \mathbb{1}\{x, x' \in \mathcal{X}_j\}\, \kappa_j(x, x')$.
A mild assumption on the smoothness of ƒ is needed to make the problem solvable:
Assumption 2. Let $D \subset [0, r]^d$ be compact and convex, $d \in \mathbb{N}$, $r > 0$. The kernel $\kappa(x, x')$ satisfies a high-probability bound on the derivatives of sample paths ƒ for some constants $a, b > 0$, i.e., $\Pr\{\sup_{x \in D} |\partial f / \partial x_k| > L\} \le a\, e^{-(L/b)^2}$ for $k = 1, \ldots, d$.
By this assumption, the Lebesgue measure of the set of discontinuities of ƒ is zero, which holds true in this setup as the discontinuities would only occur along the boundaries of the sub-regions.
Assumption 3. For the purpose of analysis, it is assumed that the sub-regions of the input space are known a priori.
Additional notations: Without loss of generality, assume that the maximum ƒ(x*) lies in sub-region j=1, i.e., $f(x^*) = g_1(x^*)$. At any instant t of the algorithm run, $t_j$ is the number of times the jth sub-region has been sampled, and $T_j$ is the total number of times the jth sub-region is sampled after T iterations of the algorithm. Hence, $\sum_{j=1}^{p} t_j = t$ and $\sum_{j=1}^{p} T_j = T$. The regret $r_t$ at iteration t when the sub-region j is sampled is represented by $r_{j,t_j}$.
The analysis of the algorithm is divided into two parts: first, when the decision set $\mathcal{X}_j$ is finite for all partition indices $j \in [p]$, and then the analysis is extended to general compact decision sets. We start by giving probabilistic error bounds, followed by bounding the regret with the predictive variance and then expressing the predictive variance in terms of the mutual information for the case of a finite decision set.
Lemma 1. For a finite decision set $\mathcal{X}_j$, the following holds with probability at least $1 - \delta_j$:
$$|g_j(x) - \mu_{j,t_j}(x)| \le \beta_{j,t_j}^{1/2}\, \sigma_{j,t_j}(x) \qquad \forall x \in \mathcal{X}_j,\ \forall t_j \ge 1$$
Proof sketch: The following inequality and a union bound over both $t_j$ and x are used to prove the lemma: if $r \sim \mathcal{N}(0, 1)$, then $\Pr\{r > c\} \le \tfrac{1}{2} e^{-c^2/2}$.
Lemma 2. For a fixed $t_j \ge 1$, if $|g_j(x) - \mu_{j,t_j}(x)| \le \beta_{j,t_j}^{1/2}\, \sigma_{j,t_j}(x)$ for all $x \in \mathcal{X}_j$, then the regret $r_t$ is bounded by $2\beta_{j,t_j}^{1/2}\, \sigma_{j,t_j}(x_{t_j})$.
Proof. By definition of $x_{t_j}$ as the maximizer of the acquisition function, the upper confidence bound at $x_{t_j}$ dominates that at the optimum, and the stated bound follows.
Lemma 3. The information gain for the points selected can be expressed in terms of the predictive variances. For any sub-region j, let $g_{j,T_j} \in \mathbb{R}^{T_j}$ denote the function values at the points sampled in sub-region j; then the mutual information satisfies $I(y_{j,T_j}; g_{j,T_j}) = \tfrac{1}{2} \sum_{t_j=1}^{T_j} \log\left(1 + \sigma^{-2}\, \sigma_{j,t_j-1}^2(x_{t_j})\right)$.
Further, note that the mutual information between the samples of two different sub-regions is 0, since the functions in each sub-region are independent.
Lemma 4. Pick $\delta_j \in (0, 1)$ and let $\beta_{j,t_j}$ be chosen as in Lemma 1. Then, with probability at least $1 - \delta_j$, the sum of squared regrets in sub-region j satisfies $\sum_{t_j=1}^{T_j} r_{j,t_j}^2 \le C_1\, \beta_{j,T_j}\, \gamma_{T_j}$,
where $C_1 := 8/\log(1 + \sigma^{-2}) \ge 8\sigma^2$ and $\gamma_{T_j}$ is the maximum information gain after $T_j$ samples in sub-region j.
Now, consider the total cumulative regret $R_T$ of the function ƒ after T iterations, $R_T = \sum_{j=1}^{p} R_{j,T_j}$.
Using the Cauchy-Schwarz inequality, the following holds: $R_{j,T_j} \le \sqrt{C_1\, T_j\, \beta_{j,T_j}\, \gamma_{T_j}}$.
We now extend the analysis to the general case of a compact decision set $\mathcal{X}_j$ for all $j \in \{1, \ldots, p\}$, using Assumption 2. The analysis follows similar steps as in the finite decision set case, except for the additional step of discretizing the input space $\mathcal{X}_j$ to a finite set $\mathcal{X}_{j,t_j}$ at each iteration.
Lemma 5. Pick $\delta_j \in (0, 1)$ and set $\beta_{t_j}$ to account for the discretization under Assumption 2; then, with probability at least $1 - \delta_j$, the cumulative regret in each sub-region is bounded analogously to Lemma 4.
Lemmas 3, 4, and 5 follow from (Srinivas et al. 2009). Now, we state the convergence result for the ClassBO algorithm.
Theorem 1B. Under the assumptions stated above, running ClassBO for ƒ with $\delta \in (0, 1)$ and $\beta_{t_j}$ chosen as in the lemmas above, the total cumulative regret satisfies $R_T \le \sum_{j=1}^{p} \sqrt{C_1\, T_j\, \beta_{j,T_j}\, \gamma_{T_j}}$ with probability at least $1 - \delta$; hence the average regret $R_T/T$ vanishes and ClassBO converges to the optimal value asymptotically.
The proof for the theorem above follows from Lemmas 1 to 5.
This section compares the performance of the system implementing the ClassGP framework with other baselines, Partition-GP and Standard-GP, over extensive empirical evaluations on both noisy and noiseless simulated datasets. Performance of the models is evaluated using the mean square error (MSE) metric. A brief overview of the baseline models used for comparison is given below:
Standard GP: Standard GP uses a single GP with continuous stationary kernel across the entire space to model the underlying function.
Partition GP: Partition GP is similar to the Treed Gaussian Process framework with additional class information. Partitions of the input space are learnt using the class information, followed by modeling the function in each partition individually, i.e., the hyperparameters in each partition are learnt independently of the data of other partitions of the same class. Put differently, in Partition GP the hyperparameters learned belong to the individual partitions ONLY and do not cover the whole class. In contrast, ClassGP introduced within the present disclosure learns or otherwise determines hyperparameters for a class in view of the individual partitions within the class.
Synthetic data for all the experiments is uniformly sampled from the input space $\mathcal{X} = [-10, 10]^d$, where d is the dimension of the input space. Partitions p are generated by equally dividing the input space, and each partition $j \in [p]$ is assigned a class label $i \in [m]$ to form a checkered pattern. The Gini index is used as the node-splitting criterion to learn the tree, and a squared exponential (SE) covariance kernel is used for all GPs to model the underlying function in each learnt partition.
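The sketch below (a simplified stand-in; the exact partition counts and class assignments are as configured per simulation) illustrates how such a checkered class-partition layout over [−10, 10]^d can be generated:

```python
import numpy as np

def sample_checkered_data(n, d=2, partitions_per_dim=3, n_classes=2, rng=None):
    # Uniformly sample inputs from [-10, 10]^d, divide each dimension into equal cells,
    # and assign class labels so that adjacent cells alternate (a checkered pattern).
    rng = rng or np.random.default_rng()
    X = rng.uniform(-10.0, 10.0, size=(n, d))
    cell = np.floor((X + 10.0) / (20.0 / partitions_per_dim)).astype(int)
    cell = np.clip(cell, 0, partitions_per_dim - 1)
    classes = cell.sum(axis=1) % n_classes  # alternating class label per cell
    partition_id = np.ravel_multi_index(tuple(cell.T), (partitions_per_dim,) * d)  # partition index j
    return X, classes, partition_id
```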
Each model is evaluated and compared across different functions as given below:
1. Harmonic function:
2. Modified Levy function (for d=2):
3. Modified Qing function (for d=2):
4. Modified Rosenbrock function (for d=2):
here, $v_k = k\, \omega_{class(j),k}$, where the frequency vector $\omega_i = i \cdot \omega$ depends on the class i and $\omega_{i,k}$ is its kth component. Further, the vector w is sampled from a normal distribution, and the constant $b_j$ (intercept) depends on the partition j, with each $b_j$ sampled from a normal distribution.
The following parameters are initialized for each simulation: dimension d, training budget N, number of partitions p, number of classes m, frequency vector ω, constant (intercept) vector b, maximum depth of the tree, minimum samples per node, and initial kernel hyperparameters θ. For a fixed set of initialized parameters, 50 independent simulation runs for each baseline are performed to compute a confidence interval over the model performance, and the training data is re-sampled for each run.
To analyze the effects and compare the model performance with the baselines, each parameter, i.e., number of partitions (p), number of classes (m), training budget (N), and dimension (d), is varied while keeping the others constant. Various initializations of the parameters for the simulations are shown in Table 1. The performance metric (MSE) for each model across all the simulations is evaluated on a uniformly sampled test data set of fixed size, i.e., 5000 data points.
Effects of each parameter on the model performance are analyzed below.
Performance of the models is also evaluated over noisy data sets for various parameter initializations, but for brevity a subset of the results is provided herein in tabular format. Table 2 shows the average MSE (not normalized) for each model over 50 independent runs. It can be observed that when the number of partitions is low, the performance of ClassGP is as good as that of Partition GP, whereas when the number of partitions increases, ClassGP outperforms the other methods.
The ClassGP framework extends partition-based methods to a family of heterogeneous functions with access to class information. The framework incorporates a new class likelihood function to exploit the homogeneous behavior of the partitions that belong to the same class, leading to enhanced performance of GP models by enabling them to learn hyperparameters across an entire class rather than individual partitions. Furthermore, the framework establishes a tailored tree algorithm that uses the closest-boundary information to learn the initial tree. Theoretical results are provided in terms of probabilistic uniform error bounds and bounds on the L1 norm. Finally, extensive empirical analysis was conducted that clearly shows the superior performance of ClassGP as compared to the other baselines. The ClassGP modeling framework can be extended to optimization (e.g., using ClassBO), scaling to higher dimensions, and extensive theoretical analysis of the algorithm with practical error bounds.
We compare the performance of ClassBO against the TuRBO algorithm on an objective function formulated by assigning a standard test function to each sub-region based on the class labels. Each test function was centered (translated) to its sub-region. The test functions used are given as follows (the enumeration order is as per the class label assigned):
1. Modified Griewank function:
2. Modified Rosenbrock function:
3. Modified Rastrigin function:
4. Modified Levy function:
5. Modified Qing function:
6. Harmonic function:
The simulations are performed with a fixed initialization of dimension d=2, initial samples N=5, and iterations T=150, and for different initializations of classes k=2, 3, 4, 5, 6 and sub-regions p=4, 9, 16, 25, for both balanced (all sub-regions of the same size) and unbalanced (all sub-regions of different sizes) sub-regions, and both in the presence and absence of noise. In each simulation, for a fixed set of initializations, all the algorithms are run 5 times (macro runs) to obtain average performance plots. To compare the performance of the algorithms, we plot the maximum observed reward (averaged over multiple runs) versus iterations T, and the algorithm that achieves the function maximum quickest is considered to outperform the other methods.
In all our simulations we observe that ClassBO performs at least as well as TuRBO when the number of sub-regions of the input space is low, whereas in other cases ClassBO outperforms the TuRBO algorithm.
ClassBO employs a novel tree-based BO algorithm coupled with an MCMC tree sampling strategy to sample trees from the posterior tree distribution. ClassBO, using ClassGP, learns the underlying structure of the function to be optimized, shares information across non-contiguous sub-regions, and improves optimization performance.
It should be understood from the foregoing that, while particular embodiments have been illustrated and described, various modifications can be made thereto without departing from the spirit and scope of the invention as will be apparent to those skilled in the art. Such changes and modifications are within the scope and teachings of this invention as defined in the claims appended hereto.
This is a non-provisional application that claims benefit to U.S. Provisional Application Ser. No. 63/539,278 filed on Sep. 19, 2023, which is herein incorporated by reference in its entirety.
This invention was made with government support under 2003081, 2048223, 2200161, and 1540040 awarded by the National Science Foundation. The government has certain rights in the invention.