SYSTEMS AND METHODS FOR ADAPTIVE PROBING OF PIECEWISE CONTINUOUS SURFACES

BACKGROUND

Imaging or scanning a sample is often generally performed by “probing” different points on the sample to obtain data about the sample. Based on these data points, the reconstructed image of the sample may be generated. Methods currently exist to select these probing locations. However, these methods are typically optimized for probing continuous surfaces instead of piecewise continuous surfaces. Consequentially, these existing methods are inefficient for probing piecewise continuous surfaces because they require a larger number of probe data points to be captured to produce an accurate image reconstruction. This requirement of additional probe data points results in additional probing time and a higher risk of damaging the sample due to increased exposure to the probes.

BRIEF SUMMARY

Embodiments of the subject invention provide novel and advantageous systems and methods for image reconstruction via adaptive probing of piecewise continuous surfaces. A machine learning algorithm can be employed with scanning-based measurement instruments or experimental probers to optimize the selection of probe locations for effectively scanning piecewise continuous surfaces. A limited number of initial probes may first be obtained to estimate the piecewise continuous surface. The machine learning algorithm may then be leveraged to identify any subsequent probe locations used to obtain additional data points about the piecewise continuous surface. The selection of the probe locations may be performed iteratively until sufficient data has been obtained to generate an accurate image reconstruction.

In an embodiment, a system for reconstructing an image of a sample can comprise: a processor; and a machine-readable medium in operable communication with the processor and having instructions thereon that, when executed, perform the following steps: receiving first data corresponding to a plurality of probe points of the sample; generating a first estimate of a piecewise continuous surface based on the first data; and using a machine learning algorithm to perform adaptive probing on the piecewise continuous surface to obtain a reconstructed image of the sample. The using of the machine learning algorithm to perform adaptive probing on the piecewise continuous surface can comprise: i) identifying (e.g., by the machine learning algorithm), based on the first estimate of the piecewise continuous surface, an updated plurality of probe points of the sample; ii) receiving (e.g., by the machine learning algorithm) updated data corresponding to the updated plurality of probe points of the sample; iii) generating (e.g., by the machine learning algorithm) an updated estimate of the piecewise continuous surface based on the updated data; iv) identifying (e.g., by the machine learning algorithm), based on the updated estimate of the piecewise continuous surface, an updated plurality of probe points of the sample; and v) repeating substeps ii)-iv) at least once. Substep v) can comprise iteratively repeating substeps ii)-iv) until the updated data is sufficient data to generate an accurate reconstructed image (this sufficiency can be determined by, for example, the machine learning algorithm, or based on user input (e.g., reviewing the current iteration and deciding whether to continue) and/or a predetermined number of iterations of the algorithm). Substep v) can comprise iteratively repeating substeps ii)-iv) a predetermined number of times (e.g., at least twice, at least three times, at least four times, at least five times, at least six times, at least seven times, at least 10 times, at least 20 times, at least 30 times, at least 50 times, at least 100 times, or more). In substeps i) and iv), the updated plurality of probe points of the sample can be identified based on bias and variance. In substeps i) and iv), the updated plurality of probe points of the sample can be identified using a jump Gaussian process (JGP). The JGP can use mean square error (MSE) and/or mean square prediction error (MSPE). The instructions when executed can further perform the step of training the machine learning algorithm (e.g., before receiving the first data). The system can further comprise a display in operable communication with the processor and/or the machine readable medium. The instructions when executed can further perform the step of displaying the reconstructed image (and/or any intermediate image, updated data, and/or updated plurality of probe points) on the display.

In another embodiment, a method for reconstructing an image of a sample can comprise: receiving (e.g., by a processor) first data corresponding to a plurality of probe points of the sample; generating (e.g., by the processor) a first estimate of a piecewise continuous surface based on the first data; and using (e.g., by the processor) a machine learning algorithm to perform adaptive probing on the piecewise continuous surface to obtain a reconstructed image of the sample. The using of the machine learning algorithm to perform adaptive probing on the piecewise continuous surface can comprise: i) identifying (e.g., by the machine learning algorithm), based on the first estimate of the piecewise continuous surface, an updated plurality of probe points of the sample; ii) receiving (e.g., by the machine learning algorithm) updated data corresponding to the updated plurality of probe points of the sample; iii) generating (e.g., by the machine learning algorithm) an updated estimate of the piecewise continuous surface based on the updated data; iv) identifying (e.g., by the machine learning algorithm), based on the updated estimate of the piecewise continuous surface, an updated plurality of probe points of the sample; and v) repeating substeps ii)-iv) at least once. Substep v) can comprise iteratively repeating substeps ii)-iv) until the updated data is sufficient data to generate an accurate reconstructed image (this sufficiency can be determined by, for example, the machine learning algorithm, or based on user input (e.g., reviewing the current iteration and deciding whether to continue) and/or a predetermined number of iterations of the algorithm). Substep v) can comprise iteratively repeating substeps ii)-iv) a predetermined number of times (e.g., at least twice, at least three times, at least four times, at least five times, at least six times, at least seven times, at least 10 times, at least 20 times, at least 30 times, at least 50 times, at least 100 times, or more). In substeps i) and iv), the updated plurality of probe points of the sample can be identified based on bias and variance. In substeps i) and iv), the updated plurality of probe points of the sample can be identified using a JGP. The JGP can use MSE and/or MSPE. The method can further comprise training (e.g., by the processor) the machine learning algorithm (e.g., before receiving the first data). The method can further comprise displaying (e.g., by the processor) the reconstructed image (and/or any intermediate image, updated data, and/or updated plurality of probe points) on a display (e.g., a display in operable communication with the processor).

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 illustrates an example sparse or compressive imaging process, according to an embodiment of the subject invention.

FIG. 2 illustrates example sparse or compressive imaging, according to an embodiment of the subject invention.

FIG. 3 illustrates example partitioned regression approaches, according to an embodiment of the subject invention.

FIG. 4 illustrates example image reconstructions based on the example partitioned regression approaches, according to an embodiment of the subject invention.

FIG. 5A illustrates an example local nonparametric estimation, according to an embodiment of the subject invention.

FIG. 5B illustrates another example local nonparametric estimation, according to an embodiment of the subject invention.

FIG. 6 illustrates an example involving linear boundaries, according to an embodiment of the subject invention.

FIG. 7 illustrates an example involving quadratic boundaries, according to an embodiment of the subject invention.

FIG. 8 illustrates an example involving quadratic boundaries, according to an embodiment of the subject invention.

FIG. 9 illustrates a comparison between image construction performed using a jump Gaussian process (JGP) and conventional methods, according to an embodiment of the subject invention.

FIG. 10 illustrates a comparison between probe locations selected using the JGP approach described herein and probe locations selected using a conventional approach, according to an embodiment of the subject invention.

FIG. 11 is a plot depicting an estimation error based on probe location selection, according to an embodiment of the subject invention.

FIG. 12 illustrates an empirical distribution of {circumflex over (p)}_ifor a test function, according to an embodiment of the subject invention.

FIG. 13 illustrates bias and variances of JGP, according to an embodiment of the subject invention.

FIG. 14 illustrates three acquisitions functions, according to an embodiment of the subject invention.

FIG. 15 illustrates active selection of design points for three acquisition functions, according to an embodiment of the subject invention.

FIG. 16 illustrates an example of a computing system, according to an embodiment of the subject invention.

DETAILED DESCRIPTION

Embodiments of the subject invention provide novel and advantageous systems and methods for image reconstruction of a sample via adaptive probing of piecewise continuous surfaces. A machine learning algorithm can be employed with scanning-based measurement instruments or experimental probers to optimize the selection of probe locations for effectively scanning piecewise continuous surfaces. A limited number of initial probes may first be obtained to estimate the piecewise continuous surface. The machine learning algorithm may then be leveraged to identify any subsequent probe locations used to obtain additional data points about the piecewise continuous surface. The selection of the probe locations may be performed iteratively until sufficient data has been obtained to generate an accurate image reconstruction (this sufficiency can be determined by, for example, the machine learning algorithm, or based on user input and/or a predetermined number of iterations of the algorithm).

This method for probe location selection is advantageous over existing methods because it allows for probing piecewise continuous surfaces, whereas existing algorithms may only be applicable to probing continuous surfaces. Probing of piecewise continuous surfaces finds many commercial applications in microscopy, instrumentation, and autonomous systems, for example (this is not intended to be limiting). One currently promising application is usage in electron microscopes (while reference is made herein to microscopes, these systems and methods may also be applicable to any other scanning-based instruments as well) to make effective uses of electron doses for scanning nanomaterials. Another promising application is as an algorithmic experimental planner in an autonomous system for scientific discovery, to algorithmically optimize experimental probes (or designs) for scientific and engineering experiments.

In one or more embodiments, the methods described herein may involve the use of a jump Gaussian process (JGP) model as surrogates for piecewise continuous response surfaces, which may be continuous within the same regimes of a design space but discontinuous across the regimes. Estimates of the bias and variance may be developed for the JGP model. The model bias may be largely influenced by the accuracy of classifying training data by governing regimes of surrogates, and the model variance may be comparable to that of the standard GP model (e.g., spacing of training data largely contributes to the variance). This suggests that, in order to reduce the model bias and variance together, more data points may be obtained around the boundaries between regimes while placing data points around less populated areas of a design space. Based on this principle and bias and/or variance estimates of Jump GP, three active learning criteria may be introduced: one minimizing the integrated mean square prediction error (IMSPE criterion), another placing at the peak of the mean square prediction error (MSPE criterion), and the last one placing at the peak of the predictive variance (e.g., variance criterion).

The three criteria were evaluated using various simulation scenarios by tracking the changes in the mean square error (MSE) and the negative log posterior density (NLPD) metrics for each choice of the criteria. The method described herein may involve the use of JGP with the MSPE criterion, however, this is not intended to be limiting.

Turning to the figures, FIG. 1 illustrates an example sparse or compressive imaging process, according to one or more embodiments of the subject invention.

The sparse or compressive imaging process generally involves receiving an unknown sample image (shown on the left in FIG. 1), selecting probe locations to obtain data about the unknown sample image (shown in the middle in FIG. 1), and reconstructing the sample image using the probe data (shown on the right in FIG. 1).

Let custom-character represent an image scape (2D grid of coordinate locations. Let f (x) represent an unknown sample image with f (x) as an unknown image intensity at an image location x∈. Reconstruction may be a regression problem for estimating unknown f, given (noisy) partial probe data given by:

custom-character ={(x_i, y_i), y_i=f(x_i)+∈_i, i=1, . . . , N}.

Say the estimate is given by {circumflex over (f)}(x; custom-character ).

Subsampling is an active machine learning problem: how to optimize the probe locations {x_i, i=1, . . . , N} for minimizing the reconstruction error given by:

Err
custom-character
=

_y[(f(x)−{circumflex over (f)}(x; ))²dx].

FIG. 2 illustrates example sparse or compressive imaging, according to one or more embodiments of the subject invention.

In most regression analysis, the underlying regression model f(x) may be assumed to be a smooth and continuous function. The assumption helps to average out image noises. However, image intensities may not be continuous. The top row of images shown in FIG. 2 illustrates sample images (images that are desired to be reconstructed) and selected probing locations for the sample images (shown as red dots on the sample images). The bottom row of images shows the reconstructed images based on the probe data obtained from the sample images in the first row. As shown in image reconstruction 202 on the right, conventional approaches for probe location selection may result in poor image reconstructions of the original sample image for piecewise continuous surfaces.

FIGS. 3 and 4 illustrate example partitioned regression approaches, in accordance with one or more embodiments of the subject invention.

Proper models for image intensity function f(x) may be a partitioned regression. Consider a partition of the image space custom-character into subregions {_k, k=1, . . . , K}. There may be an independent regression model for each region:

$f (x) = \sum_{k = 1}^{K} f_{k} (x) 1_{ℛ_{k}} (x) .$

The f_k(x) may be a continuous regression model for region custom-character _k, parameterized by θ_k. Here, Gaussian process (GP) repressors may be used because this provides well-calibrated uncertainty quantification capability for active machine learning. The challenge here is that there may exist a large number of model parameters. The regional regression models {θ_k, k=1, K}, the space partition { custom-character _k, k=1, . . . , K}, and even the number K may be unknown.

FIG. 3 illustrates two conventional approaches for selecting probe locations that involve the use of partitioned regression. A first approach, shown on the left, involves Treed partitioning and GP regression. A second approach, shown on the right, involves Voronoi tessellation and GP regression. FIG. 4 shows the resulting image reconstructions when using these conventional approaches. Image (a) in FIG. 4 shows the sample image and selected probe locations. Image (b) in FIG. 4 shows an image reconstruction that is performed without using any partitioned regression. Images (c) and (d) in FIG. 4 show image reconstructions when the Treed partitioning and Voronoi tessellation approaches illustrated in FIG. 3 are used to select probe locations. As shown in images (c) and (d), the Voronoi tessellation approaches do provide for a higher quality image reconstruction than image (b) from the Treed partitioning, but still do not effectively capture the original shape in the sample image shown in image (a).

FIG. 5A illustrates an example local nonparametric estimation, according to one or more embodiments of the subject invention.

Particularly, FIG. 5A shows an example conventional local nonparametric estimation.

This conventional approach involves taking a small subset of probe data nearing a test location x_*(e.g., n-nearest neighbors).

custom-character
_n={(x_i,*, y_i,*): i=1, . . . , n}.

The weighted average of the local data is determined to make a prediction for f (x_*). This is advantageous because for many test locations, the local data may be from one region. This may be more adaptive to local trends. However, this approach may still be insufficient because the local data may be mixed from different regions, when x_*is around border lines of different regions of the sample.

FIG. 5B illustrates another example local nonparametric estimation, according to one or more embodiments of the subject invention.

Particularly, FIG. 5B shows example local nonparametric estimation associated with the methods described herein. This approach may involve bisecting local data custom-character _n(x_*) by a parametric curve g(x, w)=0 into two sides. In this example, Group 1 may be the same side as test point x_*. Group 0 may be the other side. The boundary g(x, w)=0 may be fine-tuned so as to have Group 1 include only data from the same region as the test point. Finally, a local regress or is fit to Group 1. This concept may be referred to as “jump GP” or “JGP” herein. For example, FIG. 6 shows an example in which linear boundaries are used and FIGS. 7-8 show examples where quadratic boundaries are used. Table 1 presented below provides some examples of distinguishing features between existing piecewise models and the “JGP” described herein.

TABLE 1

Existing piecewise models
Jump GP model

parameters for K regional models
parameter for one local

{f_k(x), k = 1, . . . , K}
model f(x)

needs to estimate K
no need to know K

global noise parameter σ²
local noise parameter σ²

need to estimate complex global
simple local bisection

partitioning { custom-character

_k, k = 1, . . . , K}
g(x, w) = 0

Complex Bayesian model
simple bimixture model

expensive MCMC calculations
expectation-maximization

(EM) algorithm

FIG. 9 illustrates a comparison between image construction performed using JGP and conventional methods, according to one or more embodiments of the subject invention.

Image (b) of FIG. 9 shows an image reconstruction based on a local GP estimate. Image (c) shows an image reconstruction based on the Treed regression estimate shown in FIG. 3. Finally, image (d) shows an image reconstruction based on the JGP estimate described herein. As shown in FIG. 9, the image reconstruction performed using the JGP estimate better resembles the original sample image shown on the left with the example probe locations.

As shown in FIG. 9, the use of JGP alone provides more accurate image reconstruction for given data custom-character . However, in many cases, the data acquisition process may be controlled in order to elect for achieving specific machine goals, such as scanning coil control in Scanning Transmission Electron Microscope (STEM), for example. Thus, active learning (AL) (or sequential design of experiments) may also be used in associated with the JGP to further optimize the probe locations in compressive imaging.

AL attempts to make a virtuous cycle between data collection and model learning. AL may involve beginning with a small seed data custom-character _N={(x_i, y_i), i=1, . . . , N} from a space filling design. The data _Nmay be augmented with a new data point (e.g., a new probe location) (x_N+1, Y_N+1) and repeating this process to add additional probe locations. One approach to placing X_N+1is the maximum error criterion:

$x_{N + 1} = \underset{x_{*} \in χ}{\arg \max} Err [\overset{⋀}{f} (x_{*}; 𝒟_{𝒩})]$

The error criterion may include of two parts: the model bias and variance. The bias may be the average discrepancy of the model prediction {circumflex over (f)}(x_*; custom-character _N) from the true response f (x_*). The variance may be the variance of the model prediction {circumflex over (f)}(x_*; _N), depending on the choice of data _N. All existing AL criteria consider only the variance, assuming no bias. This draws an ineffective choice of the probe locations. The method described herein instead takes both the bias and the variance into account to optimize the probe locations.

FIG. 11 is a plot depicting an estimation error based on probe location selection, according to one or more embodiments of the subject invention.

On the left is shown a same image with indications of selected probe locations. The red dots indicate initiate probe locations and the green boxes indicate probe locations actively selected by the JGP described herein. The plot on the right shows mean squared estimation error as a function of the number of AL stages. The plot shows that the mean squared error drops significantly as the number of AL stages increases.

FIGS. 12-16 provide additional implementation details about conventional methods, the improved methods described herein, and the distinguishing factors between the two approaches.

AL of Gaussian process (GP) surrogates is useful for optimizing experimental designs for physical and/or computer simulation experiments, and for steering data acquisition schemes in machine learning. Described herein is a method for active learning of piecewise, JGP surrogates. JGPs may be continuous within, but discontinuous across, regions of a design space, as required for applications spanning autonomous materials design and configuration of smart factory systems. AL schemes may additionally account for model bias, as opposed to the usual model uncertainty, which may be useful in the JGP context. Toward that end, an estimator may be used for bias and variance of JGP models.

One goal of machine learning in general is to create an autonomous computer system that may learn from data with minimal human intervention. In many machine learning tasks, the data acquisition process may be controlled in order to select training examples that target specific goals. AL, or sequential design of experiments, is the study of how to select data toward optimizing a given learning objective. AL for piecewise continuous GP regression models may be described herein.

A motivating application is surrogate modeling of modern engineering systems, to explore and understand overall system performance and ultimately to optimize aspects of their design. A particular focus here is on engineering systems whose behaviors intermittently exhibit abrupt jumps or local discontinuities across regimes of a design space. Such “jump system” behaviors are found in many applications. For example, carbon nanotube yield from a chemical vapor deposition (CVD) process may vary depending on many design variables. Changes in dynamics may be gradual, but process yield can suddenly jump, depending on chemical equilibrium conditions, from ‘no-growth’ to ‘growth’ regions. Specific boundary conditions dictating these regime shifts may depend on experimental and system design details. Such jump system behaviors may be universal to many material and chemistry applications owing to many factors (i.e., equilibrium, phase changes, activation energy). Jump behaviors are also frequently seen in engineering systems operating near capacity. When a system runs below its capacity, performance is generally sufficient and exhibits little fluctuation. However, performance may suddenly break down as the system is forced to run slightly over its capacity.

Suitable surrogate models for jump systems may accommodate piece-wise continuous functional relationships, where disparate input-output dynamics can be learned (if data from the process exemplify them) in geographically distinct regions on input/configuration space. Most existing surrogate modeling schemes make an assumption of stationarity, and thus may not be well-suited to such processes. AL strategies paired with such surrogates are, consequently, sub-optimal for acquiring training examples in such settings. For example, Gaussian processes may be a favorable choice for surrogate modeling of physical and computer. Gaussian processes are flexible, nonparametric, nonlinear, lend a degree of analytic tractability, and provide well-calibrated uncertainty quantification without having to tune many unknown quantities. However, the canonical, relative-distance-based kernels used with GPs result in stationary processes. Space-filling designs, and their sequential analogues, are inefficient when input-output dynamics change across regions of the input space. Intuitively, we need a higher density of training examples in harder-to-model regions, and near boundaries where regime dynamics change.

Regime-changing dynamics may be inherently non-stationary. Both position and relative distance information (in the input configuration space) is required for effective modeling. Conventional non-stationary GP modeling strategies exist, however, these approaches are often too slow, in many cases demanding enormous computational resources in their own right, or limited to two input dimensions.

In contrast, deep GPs may provide a more effectively alternative. Input dimensions may be larger, and fast inference may be provided by doubly stochastic variational inference. However, such methods may be data-hungry, requiring tens of thousands of training examples before they are competitive with conventional GP methods. An ALC-type active learning criterion has been developed for deep GPs, making them less data-hungry, but computational expense for Markov chain Monte Carlo (MCMC) inference may still be a bottleneck.

A class of methods built around divide-and-conquer strategies may offer the best of both worlds (computational thrift with modeling fidelity) by simultaneously imposing statistical and computational independence. The best-known examples include treed GPs and Voronoi tessellation-based GPs (shown in FIG. 3). Partitioning facilitates non-stationarity almost trivially, by independently fitting different GPs in different parts of the input space. However, learning the partition may be challenging. Sequential design and/or AL criteria have been adapted to some of these divide-and-conquer surrogates. ALM and ALC, for example, have been adapted for treed GPs. However the axis-aligned nature of the treed GP is not flexible enough to handle the complex, nonlinear manifold of regime change exhibited of many real datasets, as illustrated below.

In contrast, JGP seeks a local approximation to an otherwise potentially complex domain-partitioning and GP-modeling scheme. Crucially, direct inference for the JGP enjoys the same degree of analytic tractability as an ordinary, stationary GP. However, the methods described herein extend conventional AL strategies to consider both model bias and variance. The addition of bias may be in a non-stationary modeling setting. In particular, ordinary stationary GP surrogates can exhibit substantial bias for test location nearby regime changes. The JGP may help mitigate this bias, but may not completely remove it. Consequently established AL strategies that don't incorporate estimates of bias are limited in their ability to improve the sequential learning of the JGP. The method described herein may estimate both bias and variance for JGPs and parlay these into novel AL strategies for nonstationary surrogate modeling.

With respect to stationary GP regression, X may denote a d-dimensional input configuration space. A problem of estimating an unknown function f: X→ custom-character relating inputs x_I∈X to a noisy real-valued response variables y, lid(f(x_i), σ²) though examples composed as training data, D_N={(x_i, y_i), i=1, . . . , N}. In GP regression, a finite collection f_N=(f₁, . . . , f_N) of f(x_i)=f_ivalues is modeled as a Multi-Variate Normal (MVN) random vector. A common specification may involve a constant, scalar mean μ, and N*N correlation matrix C_N:f_N˜ custom-character _n(μ1_N, C_N).

Rather than treating all custom-character (N²) values in C_Nas “tunable parameters,” it is common to use a kernel c(x_i, x_j, θ) defining correlations in terms of a small number of hyperparameters, θ. Kernel families may be decreasing functions of the geographic “distance” between its arguments x_iand x_j. The method described herein, however, may be agnostic to these choices. An assumption of stationarity is common, whereby c(x_i, x_j, θ)≡c(x_i−x_j; θ), i.e., only relative displacement x_i−x_jbetween inputs, not their positions, matters for modeling.

Integrating out latent f_Nvalues to obtain a distribution for y_Nmay be straightforward because both are Gaussian. This leads to the marginal likelihood y_N˜ custom-character _N(μ1_N, C_N+σ²_N) which may be used to learn hyperparameters. MLEs {circumflex over (μ)} and {circumflex over (σ)}²may have closed forms conditional on θ. In some instances, {circumflex over (μ)}=1_N^T({circumflex over (σ)}²I_N+C_N)⁻¹y_N/1_N^T({circumflex over (σ)}²I_N+C_N)⁻¹1_n. Estimates for θ may depend on the kernel and generally requires numerical methods.

Analytic tractability may extend to prediction. Basic MVN conditioning from a joint model of Y_Nand an unknown testing output Y(x_*) gives that Y(x_*)|y_Nis univariate Gaussian. The distribution for the latent function value {circumflex over (f)}(x)≡f(x_*)|y_Nmay be presented below. This distribution is also Gaussian, with:

mean: μ(x_*)={circumflex over (μ)}+c_N^T(σ²I_N+C_N)⁻¹(y_N−{circumflex over (μ)}1_N), and

variance: s²(x_*)=c(x_*, x_*; {circumflex over (θ)})−c_N^T({circumflex over (σ)}² custom-character _N+C_N)⁻¹c_N, (1)

where c_N=[c(x_i, x_*; {circumflex over (θ)}): I=1, . . . , N] is a N*1 vector of the covariance values between the training data and the test data point. Evaluating these prediction equations, like evaluating the MVN likelihood for hyperparameter inference, may involve decompositing the N*N matrix C_N. Although there is a high degree of analytic tractability, there are still substantial numerical hurdles to application in large-data settings.

With respect to divide-and-conquer GP modeling, partitioned GP models, generally, and the Jump GP, specifically, may consider an “f” that is piecewise continuous.

$\begin{matrix} f (x) = \sum_{k = 1}^{K} f_{k} (x) 1_{χ_{k}} (x) . & (2) \end{matrix}$

where custom-character ₁, ₂, . . . , _kare a partition of . Above, 1_k(x) is an indicator function that determines whether x belongs to region _k, and each f_k(x) is a continuous function that serves as the bases of regression model on region _k. Although variations abound, in one or more embodiments, each functional piece f_k(x) may be taken to be a stationary GP.

Typically, each f_kis taken to independent conditional on the partitioning mechanism. This assumption is summarized below for easy referencing later.

Independence: f_kis independent of f_jfor j≠k. (3)

Consequently, all hyperparameters describing f_kmay be analogously indexed and may be treated independently, e.g., μ_k, σ_k², and θ_k. Generally speaking, the data within region custom-character _kmay be used to learn these hyperparameters, via the likelihood applied on the subset of data _N, whose x-locations may reside in _k. Although it is possible to allow novel kernels c_kin each region, it is common to fix a particular form (i.e., a family) for use throughout. Only its hyperparameters θ_kvary across regions, as in c(.,.; θ_k). Predicting with {circumflex over (f)}(x_*), conditional on a partition and estimated hyperparameters, is simply a matter of following Equation 2 with “hats.” That is, with {circumflex over (f)}_kdefined analogously to Equation 1, i.e., using only y-values exclusive to each region. In practice, the sum over indicators in Equation 2 may be bypassed and one simply identifies the custom-character _kto which x_*belongs and uses the corresponding {circumflex over (f)}_kdirectly.

Popular, data-driven partitioning schemes leveraging local stationary GP models include Voronoi tessellation or recursive axis-aligned, tree-based partitioning. These “structures,” defining K, and within-partition hyperparameters (μ_k, θ_k, σ²) may be jointly learned, via posterior sampling (e.g., Markov Chain Monte Carlo sampling). In so doing, one is organically learning a degree of non-stationarity. Independent GPs, via disparate independently learned hyperparameters, facilitate a position-dependent correlation structure. Learning separate σ_k²in each region can also accommodate heteroskedasticy. Such divide-and-conquer can additionally bring computational gains, through smaller-N calculations within each region of the partition.

With respect to local GP modeling, although there are many example settings where such partition-based GP models excel, their rigid partitioning structures may be a mismatch to many important real-data settings. The Jump GP is motivated by such applications. The idea may be best introduced through the lens of local, approximate GP modeling. For each test location x_*, select a small subset of training data nearby x_*: custom-character _n(x_*={(x_i*, y_i*)}_i=1ⁿ⊂_n. Then, use a conventional, stationary GP model _n(x_*) via {circumflex over (f)}_n(x_*). This is fast, because (n³) is much better than (N³) when n<<N, and massively parallelizable over many x_*∈. It is has a nice divide-and-conquer structure, but it is not a partition model (2). Nearby custom-character _n(x′_*) might have some, all, or no elements in common. In some cases, LAGP can furnish biased predictions because independence (3) may be violated: local data _n(x′_*) might mix training examples from regions of the input space exhibiting disparate input-output dynamics.

A JGP differs from basic LAGP modeling by selecting local data subsets in such a way as a partition (2) is maintained and independence (3) is enforced, so that bias is reduced. Toward this end, the JGP may introduce a latent, binary random variable Zi∈{0, 1} to express uncertainties on whether a local data point x_i,*belongs to a region of the input exhibiting the same (stationary) input-output dynamics as the test location x_*, or not:

$Z_{i} = {\begin{matrix} 1 & if x_{i, *} and x_{*} belong to the same region \\ 0 & otherwise . \end{matrix}$

Conditional on Z_ivalues, i=1, . . . , N, the local data custom-character _nmay be partitioned into two groups: _*={i=1, . . . , N: Z_i=1} and _o={1, . . . , n}\_*, lying in regions of the input space containing x_*and not, respectively.

Complete the specification by modeling custom-character _*with a stationary GP (for example, as described above), _owith dummy likelihood p(y_i,*|Z_i=0)∝U for some constant, U, and assign a prior for the latent variable Z_ivia a sigmoid x on an unknown partitioning function g(x, w),

p(Z_i=1|x_i,*,ω)=π(g(x_i,*,ω)),

where ω is another hyperparameter. Specifically, for Z=(Z_i,i=1, . . . , n), f_*=(f_i,*,i=1, . . . , n) and Θ={w, m_*, θ_*, σ²}, the JGP model may be summarized as follows.

$\begin{matrix} p (y_{n} ❘ f_{*}, Z, Θ) = \prod_{i = 1}^{n} {𝒩_{i} (y_{i, *} ❘ ?, σ^{2})}^{Z_{?} U^{1 - Z_{?}}}, \\ p (Z ❘ w) = \prod_{i = 1}^{n} π * {(g (x_{i, *}, w))}^{Z_{?}} {(1 - π (g (x_{i, *}, w)))}^{1 - Z_{?}}, \\ p (f_{*} ❘ m_{?} θ_{*}) = 𝒩_{?} (f_{*} ❘ m_{?} 1_{?}, C_{?}), \end{matrix}$

$? indicates text missing or illegible when filed$

where y_n=(y_i,*,i=1, . . . , n) and C_nn=[cx_i,*,x_i,*; θ_*:i, j=1, . . . , n] is a square matrix of the covariance values evaluated for all pairs of the local data custom-character _n(x_*).

Conditional on Θ, prediction {circumflex over (f)}(x_*) follows Equation (1) using local data custom-character _n(x_*). Inference for latent Z may proceed by expectation maximization. However, a difficulty arises because the joint posterior distribution of Z and f_*is not tractable, complicating the step. As a workaround, a classification EM variation which replaces the E-step with a pointwise maximum a posteriori (MAP) of {circumflex over (Z)}.

AL attempts to sustain a virtuous cycle between data collection and model learning. Begin with training data of size N, custom-character _N={(x_i, y_i), i=1, . . . , N}, such as a space-filling Latin hypercube design. Then, _Nmay be augmented with a new data point (x_N+1, y_N+1) chosen to optimize a criterion quantifying an important aspect or capability of the model, and this process may be repeated. Mean square prediction error (MSPE) may be used comprising of squared bias and variance.

Many machine learning algorithms are equipped with the proofs of unbiasedness of predictions under regularity conditions. When training and testing data jointly satisfy a stationarity assumption, the GP predictor (1) is unbiased, and so the MSPE is equal to s²(x_s). Consequently, many AL leverage this quantity. For example, the active learning may maximize this quantity directly: X_N+1= custom-character s(x_s). In repeated application, this AL strategy can be shown to approximate a maximum entropy design.

An integrated mean squared prediction error (IMSPE) criterion considers how the MSPE of GP is affected, globally in the input space, after injecting new data at x_N+1. Let S_N+1²(x_*) denote the predictive variance (1) a test location x_*, when the training data custom-character _Nis augmented with one additional input location x_N+1:

s
²(x_*; x_N+1)=c(x_*, x_*; {circumflex over (θ)})−c_N+1^T({circumflex over (σ)}²I_N+1+C_N+1)⁻¹c_N+1,

where c_N+1=[c(x_i, x_*; {circumflex over (θ)}): i=1, . . . , N+1], C_N+1analogousy via {circumflex over (θ)} custom-character _N. Then,

IMPSE(x_N+1)= custom-character s(x_s; x_N+1)dx_*,

which has a closed form, although in machine learning an quadrature-based version called may be used.

Such variance-only criteria makes sense when data satisfies the unbiasedness condition, i.e., under stationarity, which can be egregiously violated in many real-world settings. In Bayesian optimization contexts, acquisition criteria have been extended to account for this bias, but this may not exist for AL targeting overall accuracy. Thus, the method described herein involves bias and variance estimates for JGP and exploit in order to improve their AL performance.

Bias-variance decomposition for JGPs may also be performed, for example, using Equation (1) with custom-character _N(x_*). For convenience, these are re-written here, explicitly in that JGP notation. Let {circumflex over (Z)}_irepresent the MAP estimate at convergence (of the CEM algorithm) and let _n,*={i=1, . . . , n: {circumflex over (Z)}=1} denote the estimate of _*with n_*being the number of training data pairs in the set. Conditional on {circumflex over (Θ)}, the posterior predictive distribution of {circumflex over (f)}_*(x) at a test location x_*is univariate Gaussian with

mean: μ_J(x_*)={circumflex over (m)}_*+c_*^T(σ²I_n_*+C_**)⁻¹(y_*−{circumflex over (m)}_*1_n_*), and

variance: s_J²(x_*)=c(x_*, x_*; {circumflex over (θ)}_*)−c_*^T({circumflex over (σ)}²I_*+C_**)⁻¹c_*, (5)

where y_*=[y_i:i∈ custom-character _n,*is a n_*x1 vector of the selected local data, c_*=[c(x_i,*, x_*; θ_*):i∈_n,*is a column vector of the covariance values between y_*and f(x_*), and C_**=[c(x_i, x_j; θ_*): i, j∈_n,*is a square matrix of the covariance values evaluated for all pairs of the selected local data. Here, {circumflex over (σ)}²and {circumflex over (θ)}_*represent the MLEs of σ²and θ_*respectively, and {circumflex over (m)}_*is the

MLE of m_*, which has the form: