The local adaptive fusion regression (LAFR) algorithm is designed to facilitate onsite chemical analysis with a handheld spectrometer as well as with dedicated in-line process analyzers and benchtop instruments. It is an interpretable process based on a Beer’s law-like linear relationship in which a calibration model (mathematical relationship) is formed that linearly relates the analyte amount, e.g., concentration, to the measured spectral responses. The calibration model is then used to predict (quantitate) the analyte amounts present in new samples. Unlike other calibration methods, LAFR takes advantage of unique features present in chemical spectral data, where each measured sample spectrum stems not only from the analyte property amount but from all other spectrally responding factors responsible for the measured sample values. These factors include the sample-wise unique molecular relationships relative to the respective physicochemical and, if applicable, physiochemical conditions, measurement conditions, and instrumentation effects. The composite of these measurement influences goes by many names, but in analytical chemistry it is termed the sample matrix effects. The usual approach in chemical analysis by calibration is to correct for the matrix effects, which causes current calibration methods to fail in new sample situations. The LAFR thesis is instead to use the inherent matrix effects as information, enabling training sample sets (calibration samples) to successfully predict new samples. Matrix effects can be considered hidden variables and thus are impossible to explicitly detail. The LAFR algorithm searches (mines) through a library of spectral samples with corresponding analyte values and identifies subsets of samples with similar matrix effects. No other algorithm is currently able to do this with chemical data. The objective of LAFR is for the final local training set to be composed of samples with analyte amounts highly similar to the unknown analyte amount in the target sample needing prediction, i.e., the calibration sample analyte amounts closely bracket the unknown target sample analyte amount. Because the target sample analyte amount is not known, the inability of other methods to identify calibration samples accomplishing the LAFR objective is why they fail.
For LAFR to successfully identify matrix-matched samples, a novel computational tool termed the indicator of system uniqueness (ISU) was developed to assess the degree of matrix matching between reference samples and the target sample. It includes a novel sample-wise difference approach to matrix-match spectra (ISUX) and to matrix-match actual and predicted analyte reference amounts (ISUy). It is the ISUy that allows matching the unlabeled (unknown) target sample analyte amount to the known analyte reference samples; without this measure, other local modeling algorithms fail. The ISU (1) holistically characterizes similarity by fusion of multiple similarity merits; (2) does not require advanced optimization processes; and (3) is used throughout the many LAFR steps. Note that the concept of matching by fusion of similarity measures was used in the first 2015 LAFR disclosure, but the current process is now much different: many similarity measures in the first fusion version have been removed because they were found to be detrimental to LAFR, and some new similarity measures, also referred to as reliability measures, are now included, without which LAFR could not succeed.
Aspects and advantages of the presently disclosed subject matter will be set forth in part in the following description, or may be apparent from the description, or may be learned through practice of the presently disclosed subject matter.
Broadly speaking, the presently disclosed subject matter relates in some respects to the LAFR algorithm using the ISUX and ISUy sample-wise similarities to mine a library of field sample spectra with reference amounts for a local training set explicitly matrix-matched to the target prediction sample. A model is formed with this training set and is used to predict the target sample. While it has many adjustable parameters, all parameters are self-optimized.
Since LAFR includes an algorithm, the steps leading to a final local calibration set to form a prediction model for a particular target sample are now outlined:
Handheld spectral devices expand chemical analysis to now be possible onsite, i.e., bringing the laboratory to the sample. However, being able to make spectral measurements in the field is not useful unless a real-time accurate local calibration/prediction method exists. In order to obtain a quantitative prediction of a target sample analyte, the calibration set must properly span the target sample in terms of all hidden matrix effects, e.g., physicochemical properties, relative to both the spectral measurements (X) and the amounts of all spectrally responding factors (Y). This matching constraint confounds automatic calibration and prediction, limiting chemical analysis by handheld devices. Local modeling is a framework that produces calibration sets matched to target samples. Local modeling requires mining a library composed of thousands of analyte reference spectra with a vast array of implicit matrix effects with the intent of identifying a unique calibration set specifically matrix matched to the particular target sample being predicted. Rather than forming one local model, the presented approach, termed local adaptive fusion regression (LAFR), forms hundreds of linear local models from a library, with each model representing a distinct combination of hidden matrix effects. The LAFR approach treats local modeling as a classification situation where a target sample is classified into the local model with the most similar matrix effects. Unique to LAFR are many new concepts not used in any other local modeling method currently available. Key innovations are the indicator of system uniqueness (ISU), a hybrid fusion algorithm to characterize X and Y similarities including sample-wise differences, and that each sample prediction amount receives a membership value relative to the predicting calibration set.
The computer algorithm and device allow any person, including the typical consumer, with an appropriate handheld measuring device to perform chemical analysis for the amount of a substance present in a sample. Such a capability allows the person to perform the chemical analysis in the field. The algorithm also performs equally well with a laboratory-based device.
There are many computer algorithms that perform quantitative chemical analysis with handheld and laboratory-based instruments. Typical current jargon is “machine learning” and “artificial intelligence.” These algorithms require big-data-type training sets to build prediction models. Where such algorithms fail is when the chemical analysis is performed on new samples that are slightly or greatly different from the training set. Unfortunately, this predicament is the usual situation. The LAFR algorithm solves the problem stifling local modeling from going forward to real-world applications. Unique to the algorithm is the ability to identify, from a big-data library base, those reference samples best mimicking the new sample requiring quantitative analysis. No other algorithm can accomplish this. With this key sample set identified, it becomes the training set to form the model to predict the particular sample.
As with all local modeling algorithms, the process is repeated for each new sample. With LAFR in hand, immediate onsite analysis with a handheld device becomes possible. For example, forest personnel could quickly characterize the health or net worth of trees by, for example, the pulp content. Another example is a farmer who could immediately assess the nutrient amounts in their soil, or a rancher who could instantly assess the health of their livestock from their feces. Lastly, it may be possible to finally solve the decades-old search for an algorithm that would allow non-invasive glucose monitoring, i.e., without any requirement of blood samples or implanted electrodes. The potential uses of LAFR are abundant.
A large market for LAFR is the everyday consumer, for whom LAFR would make it possible to perform immediate chemical analysis on the spot. The algorithm may also be applicable to non-invasive glucose monitoring, thereby expanding the market. Industries requiring onsite chemical analysis with handheld devices or using dedicated in-line sensors, such as agriculture and pharmaceutical applications, could use LAFR.
With a large enough database and a purchased spectrometer, access to the LAFR algorithm could be regulated by a smartphone app with a charge for each analysis depending on the product design. With a large library, the costly expense of sending samples to a laboratory for analysis would no longer be necessary, and results would be essentially instantaneous. With LAFR, the time expense of a wet-chemical laboratory-based analysis would be eliminated. Currently, there is a large enough NIR soil library that agriculture would be the first target area.
A fast and efficient strategy has been proposed – the representative approach – for big data analysis with generalized linear models, especially for distributed data with localization requirements or limited network bandwidth. With a given partition of a massive dataset, this approach constructs a representative data point for each data block and fits the target model using the representative dataset. In terms of time complexity, it is as fast as the subsampling approaches in the literature. As for efficiency, its accuracy in estimating parameters given a homogeneous partition is comparable with the divide-and-conquer method. Supported by comprehensive simulation studies and theoretical justifications, mean representatives (MR) work well for linear models or generalized linear models with a flat inverse link function and moderate coefficients of continuous predictors. For general cases, the proposed score-matching representatives (SMR) may improve the accuracy of estimators significantly by matching the score function values.
Considered another way, presently disclosed methodologies and corresponding systems for a local adaptive fusion regression (LAFR) process are able to search (data mine) a large library of spectral measurements (such as near infrared (NIR), Raman, nuclear magnetic resonance (NMR), or another form of collected sample measurements) for a linear calibration (training) set. The training set is not only spectrally matrix matched to the target sample spectrum, but also tightly brackets the “unknown” prediction property (analyte) for the target sample. Using a matched calibration set, the likelihood of an accurate prediction by the selected calibration set is greatly enhanced. The LAFR process integrates multiple spectral similarity information with contextual considerations between source analyte contents, the model, and analyte predictions. LAFR facilitates onsite chemical analysis, such as with a handheld spectrometer, dedicated in-line process analyzers, and benchtop instruments. LAFR is based on a Beer’s law-like linear relationship where a calibration model (mathematical relationship) is made that linearly relates the analyte amount, e.g., concentration, to the measured spectral responses. The calibration model is then used to predict (quantitate) the analyte amounts present in new samples.
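For concreteness, the following is a minimal illustrative sketch, in Python, of the Beer’s law-like linear calibration/prediction step described above. The simulated data, dimensions, and least-squares solver here are illustrative assumptions and are not the disclosed LAFR implementation:

```python
import numpy as np

# Minimal illustrative sketch (not the disclosed LAFR code): a Beer's law-like
# linear calibration built by least squares, then used to predict a new sample.
rng = np.random.default_rng(0)
K = rng.random((50, 3))                 # hypothetical pure-component spectra (50 wavelengths)
Y = rng.random((20, 3))                 # analyte amounts for 20 calibration samples
X = Y @ K.T + 0.01 * rng.standard_normal((20, 50))   # measured calibration spectra

b, *_ = np.linalg.lstsq(X, Y[:, 0], rcond=None)      # regression vector for analyte 1
x_new = rng.random(3) @ K.T                          # spectrum of a new target sample
y_hat = x_new @ b                                    # predicted (quantitated) analyte amount
print(f"predicted analyte amount: {y_hat:.3f}")
```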
In one exemplary embodiment disclosed herewith, a methodology for searching a large library of field sample spectra (Library X, y) uses a generalized local adaptive fusion regression (LAFR) process for quantitative analysis of molecular-based spectroscopic data (xnew) from a target sample of analytes. Such methodology preferably comprises defining LAFR process parameters, including the number of library samples to retain in a decimation step, the number of calibration clusters to form, and the number of fundamental parameters to use; applying a decimation step to reduce the library to the N samples most spectrally similar to the target sample, and performing an outlier check to remove reduced library components for which the target sample is an outlier; forming linear calibration sets defined by the LAFR process parameters; performing an outlier check to remove linear calibration sets for which the target sample is an outlier; using ISUX and ISUy sample-wise similarities to mine the library of field sample spectra with reference amounts for a local training set explicitly matrix-matched to the target prediction sample; forming a prediction model with the local training set; and using the prediction model to predict the quantitative analysis of the target sample. Preferably per such methodology, the ISUX and ISUy sample-wise similarities comprise indicators of system uniqueness (ISU) to assess the degree of matrix matching between reference samples and the target sample.
Another exemplary embodiment disclosed herewith relates to methodology for searching a large library of spectral measurements (Library X, y) using a generalized local adaptive fusion regression (LAFR) process for quantitative analysis of molecular-based spectroscopic data (xnew) from a target sample of analytes. Such methodology preferably comprises defining process parameters and obtaining all possible hyperparameter combinations (HPPCs) thereof; for each HPPC: reducing the Library to the N samples most spectrally similar to the target sample; forming calibration sets (CalSets) by clustering analyte-ranged windows; removing all CalSets for which the target is an outlier to produce approved CalSets; using matrix matching to identify a selected CalSet from the approved CalSets which best matches the target sample; and storing the selected CalSet for each HPPC; then using matrix matching to select the best N sets from the stored selected CalSets; using matrix matching to select the best K samples from the N selected CalSets; forming a calibration model with the K samples; and applying the LAFR process to a new target sample to predict the analysis thereof.
Yet another exemplary embodiment disclosed herewith relates to methodology for predicting the quantitative analysis of a target sample, preferably comprising searching through a library of spectral samples with corresponding analyte values (Library X, y) using a generalized local adaptive fusion regression (LAFR) process for identifying subsets of samples with similar matrix effects; forming linear training sets defined by the LAFR process identified subsets of samples; forming a final local training set from the linear training sets; forming a prediction model with the final local training set; and using the prediction model to predict the quantitative analysis of a target sample, where the final local training set is composed of samples with the analyte amount highly similar to the unknown analyte amount in the target sample to be predicted.
It is to be understood that the presently disclosed subject matter equally relates to associated and/or corresponding apparatuses.
Other example aspects of the present disclosure are directed to systems, apparatus, tangible, non-transitory computer-readable media, user interfaces, memory devices, and electronic devices for analysis of molecular based spectroscopic data. To implement methodology and technology herewith, one or more processors may be provided, programmed to perform the steps and functions as called for by the presently disclosed subject matter, as will be understood by those of ordinary skill in the art.
One presently disclosed exemplary embodiment may relate to a handheld spectral device operating according to any of the methodologies disclosed herewith, for making spectral measurements of a target sample in the field and predicting the quantitative analysis thereof. Per some embodiments there, such handheld device may comprise a smartphone remotely accessing an app for operating according to any of the methodologies disclosed herewith.
Additional objects and advantages of the presently disclosed subject matter are set forth in, or will be apparent to, those of ordinary skill in the art from the detailed description herein. Also, it should be further appreciated that modifications and variations to the specifically illustrated, referred and discussed features, elements, and steps hereof may be practiced in various embodiments, uses, and practices of the presently disclosed subject matter without departing from the spirit and scope of the subject matter. Variations may include, but are not limited to, substitution of equivalent means, features, or steps for those illustrated, referenced, or discussed; and the functional, operational, or positional reversal of various parts, features, steps, or the like.
Still further, it is to be understood that different embodiments, as well as different presently preferred embodiments, of the presently disclosed subject matter may include various combinations or configurations of presently disclosed features, steps, or elements, or their equivalents (including combinations of features, parts, or steps or configurations thereof not expressly shown in the figures or stated in the detailed description of such figures). Additional embodiments of the presently disclosed subject matter, not necessarily expressed in the summarized section, may include and incorporate various combinations of aspects of features, components, or steps referenced in the summarized objects above, and/or other features, components, or steps as otherwise discussed in this application. Those of ordinary skill in the art will better appreciate the features and aspects of such embodiments, and others, upon review of the remainder of the specification, and will appreciate that the presently disclosed subject matter applies equally to corresponding methodologies as associated with practice of any of the present exemplary devices, and vice versa.
These and other features, aspects and advantages of various embodiments will become better understood with reference to the following description and appended claims. The accompanying figures, which are incorporated in and constitute a part of this specification, illustrate embodiments of the present disclosure and, together with the description, serve to explain the related principles.
A full and enabling disclosure of the present subject matter, including the best mode thereof to one of ordinary skill in the art, is set forth more particularly in the remainder of the specification, including reference to the accompanying figures in which:
Repeat use of reference characters in the present specification and figures is intended to represent the same or analogous features or elements or steps of the presently disclosed subject matter.
Reference will now be made in detail to various embodiments of the disclosed subject matter, one or more examples of which are set forth below. Each embodiment is provided by way of explanation of the subject matter, not limitation thereof. In fact, it will be apparent to those skilled in the art that various modifications and variations may be made in the present disclosure without departing from the scope or spirit of the subject matter. For instance, features illustrated or described as part of one embodiment, may be used in another embodiment to yield a still further embodiment.
In general, the present disclosure is directed to technology which is a generalized local adaptive fusion regression (LAFR) process for quantitative analysis of molecular based spectroscopic data. Local target-adaptive calibration and prediction allows for onsite analysis by handheld devices. In certain respects, various embodiments of presently disclosed subject matter relate to building simple machine learning models adapted to the deployment domain for quantitative chemical analysis.
Hand-held measurement (spectral) devices make onsite chemical analysis in the field possible. However, the full potential of hand-held devices is still undeveloped due to the absence of a robust real-time training/prediction regression process. Specifically, in order to obtain a quantitative prediction of a property needed for a target sample analyte from the deployment domain, such as the amount of pulp content in a tree for potential harvesting, the training set must match (span) the particular target sample in terms of all sample specific hidden effects (e.g., variances due to physicochemical and measurement effects). In other words, in order to obtain a quantitative prediction of a target sample analyte, the calibration set must properly span the target sample in terms of all hidden matrix effects, e.g., physicochemical properties, relative to both spectral measurements (X) and amounts of all spectral responding factors (Y).
Matching the training sample set to the new deployment domain variance is a common problem to all machine learning disciplines. Hidden effects typically vary from sample to sample making it difficult to match one training set to all possible onsite target samples thereby confounding onsite analysis by hand-held devices. One machine learning framework to rectify the situation is local modeling. Local modeling is a framework that produces calibration sets matched to target samples. However, local modeling requires mining a database composed of thousands of training samples with varying hidden effects for a unique training set specifically matched to each new target sample. In other words, local modeling requires mining a library composed of thousands of analyte reference spectra with a vast array of implicit matrix effects with the intent of identifying a unique calibration set specifically matrix matched to the particular target sample being predicted.
Rather than forming one local model, the presently disclosed approach, termed local adaptive fusion regression (LAFR), forms hundreds of linear local models from a library with each model representing a distinct combination of hidden matrix effects. The LAFR approach considers local modeling as a classification situation where a target sample is classified into the local model with the most similar matrix effects. In other words, target samples are classified to the best matched linear training set.
Developed for the presently disclosed LAFR is a measure termed the indicator of system uniqueness (ISU), a hybrid fusion algorithm based on over a hundred similarity measures using a novel cross-modeling procedure with both X and Y matching; each sample prediction amount receives a membership value relative to the predicting calibration set. Results are presented for multiple near infrared (NIR) spectral datasets, including a difficult soil library with nearly 100,000 reference samples. All datasets demonstrate the suitability of LAFR for handheld spectral devices.
An objective of the presently disclosed Local Adaptive Fusion Regression (LAFR) technology is to develop a local modeling method to form calibration sets from large or non-linear datasets. Such calibration sets (CalSets) should be matched to each target sample by both spectra and analyte amount. Data fusion is utilized to make each step of the process robust.
The following frames various concepts, beginning with local modeling. The theory behind the approach is to identify a calibration set from a greater library to increase prediction accuracy for a target sample. Common use cases include (1) large datasets: local modeling mines through datasets of 1000+ samples to find the closest matched samples; and (2) non-linear datasets: most non-linear data is sufficiently linear within a local range of a target point.
Examples of existing methods include:
There are shortcomings of standard methods. For example, most only use spectral similarity measures and cannot match the target’s analyte concentration precisely.
The theory of the presently disclosed data fusion approach is that data fusion combines multiple similarity metrics to obtain a holistic understanding of the similarity between samples. Local modeling is concerned with sample selection. Samples differ by spectral magnitude, shape, and analyte concentration; selected samples need to be similar in all three categories. Many similarity measures are calculated and fused for robustness.
Capstones of the presently disclosed approach include that it considers all possible user-defined variables. Local modeling depends on input hyperparameters (e.g., similarity merits, number of samples). The presently disclosed approach considers all possible hyperparameter combinations (HPPCs) and selects the best overall calibration set (CalSet).
Other present capstones include utilizing Data Fusion. Fusion across many similarity merits robustifies the sample selection process. Also, for some embodiments of presently disclosed subject matter, match is based on analyte amount. Our presently disclosed matrix matching protocol harnesses the local regression vector to match target analyte magnitude.
Additional information which may be involved in presently disclosed embodiments relates to hyperparameter combinations (HPPCs). Nineteen user-defined input hyperparameters define the LAFR protocol (CalSet size, quality control thresholds, etc.). LAFR determines all possible combinations of these 19 and determines the strongest CalSet defined by each. Further information can relate to quality controlling CalSets: CalSets should contain the predicted target analyte amount within their own distribution (encapsulation), and CalSets should predict their own calibration samples well (e.g., strong R2, RMSEC, etc.); a minimal sketch of these two checks follows this paragraph. Still further information may relate to matrix matching: twenty-nine similarity measures assess the degree of similarity between the target sample and each calibration set, and “cross-modeling” is used to prevent chance matching.
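A minimal sketch of the two quality-control checks just described follows, assuming a least-squares CalSet model vector b and a hypothetical R2 threshold:

```python
import numpy as np

def calset_passes_qc(X, y, b, y_target_pred, r2_min=0.9):
    """Sketch of the two quality checks (thresholds and model form are hypothetical)."""
    # Encapsulation: the predicted target analyte amount must fall inside
    # the CalSet's own analyte distribution.
    encapsulated = y.min() <= y_target_pred <= y.max()
    # Self-prediction: the CalSet must predict its own samples well (strong R2).
    y_fit = X @ b
    r2 = 1.0 - np.sum((y - y_fit) ** 2) / np.sum((y - y.mean()) ** 2)
    return encapsulated and r2 >= r2_min
```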
The following relates to applicable performance metrics for consideration. Each target sample is predicted using the full LAFR protocol, and four metrics can be used to assess the overall prediction accuracy.
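The four metrics are not enumerated at this point in the disclosure; the following sketch computes common multivariate calibration accuracy metrics (RMSEP, R2, bias, and SEP) as plausible stand-ins:

```python
import numpy as np

def prediction_metrics(y_true, y_pred):
    """Computes RMSEP, R2, bias, and SEP over all predicted target samples."""
    y_true = np.asarray(y_true, dtype=float)
    err = np.asarray(y_pred, dtype=float) - y_true
    rmsep = np.sqrt(np.mean(err ** 2))            # root mean square error of prediction
    bias = np.mean(err)                           # systematic offset
    sep = np.sqrt(np.mean((err - bias) ** 2))     # standard error of prediction (bias-corrected)
    r2 = 1.0 - np.sum(err ** 2) / np.sum((y_true - y_true.mean()) ** 2)
    return {"RMSEP": rmsep, "R2": r2, "bias": bias, "SEP": sep}
```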
The following provides one dataset description example, involving meat. Specifically, 170 samples of ground pork meat were used as the library to locally model 43 target samples. Spectra were measured in the near-infrared range at 100 wavelengths, with analytes protein, moisture, and fat.
Results for the meat fat may be considered for both the global model and local model data, as shown by the following Table 1:
Results for the meat protein may be considered for both the global model and local model data, as shown by the following Table 2:
A similar dataset example also involved ground pork meat. Specifically, 156 samples were used as the library to locally model 37 target samples. Spectra from 850-1050 nm were measured in the near-infrared range at 100 wavelengths, again with analytes protein, moisture, and fat.
Further efforts discussed herein relate to analysis and optimization of LAFR for new datasets, with particular interest in large datasets and a novel CalSet formation protocol. Part of the presently disclosed subject matter relates to the concept of what it means for samples to be similar. Subject matter relates to comprehensively characterizing similarity for spectral data. As a form of introduction, one may consider, for example, chemical data structure. In such context, objects may be regarded as being similar if many of their important properties are similar. From the perspective of spectroscopic samples, one may consider for example:
Matrix effects in this context relate to the confounding relationship between spectrum and analyte. Generally, sample conditions may relate to non-analyte chemical composition and all responding species. Measurement conditions may involve, for example, instrument novelties or baseline shift.
Characterizing similarity generally involves similarity assessment. The above three properties (analyte values, spectra, and matrix effects) are mutually dependent: if two are similar, then all three are similar. Spectral similarity is straightforward to assess, e.g., by Euclidean distance, Mahalanobis distance, Q-residual, or other known approaches; a sketch is provided below. It is more difficult to determine analyte or matrix effect similarity.
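A brief sketch of the named spectral similarity measures follows, computed between a target spectrum x and a calibration set X_cal; the number of principal components used is an illustrative choice:

```python
import numpy as np

def spectral_similarities(x, X_cal, n_pcs=5):
    """Euclidean, Mahalanobis, and Q-residual similarity of target spectrum x
    to calibration set X_cal (rows are spectra)."""
    centroid = X_cal.mean(axis=0)
    euclidean = np.linalg.norm(x - centroid)

    # PCA of the mean-centered calibration set for the subspace-based measures.
    Xc = X_cal - centroid
    _, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    V = Vt[:n_pcs].T                                  # loadings
    t = (x - centroid) @ V                            # target scores
    var = s[:n_pcs] ** 2 / (X_cal.shape[0] - 1)       # score variances

    mahalanobis = np.sqrt(np.sum(t ** 2 / var))       # distance in score space
    residual = (x - centroid) - t @ V.T               # part of x outside the PC subspace
    q_residual = residual @ residual
    return euclidean, mahalanobis, q_residual
```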
Similarity criterion characterizes the similarity between two samples to give a numerical indicator of similarity. Many applications need to quantify the similarity between samples, or between a sample and a calibration space.
Similarity applications may involve, for example, applicability domain/outlier detection: if the target sample is similar to the calibration samples, then the model is likely applicable and accurate. Per calibration transfer, it can be possible to transfer a calibration from one set of matrix effects to encompass a different set; the degree of matrix effect dissimilarity determines which mechanism to use. Regarding matrix matching, one should preferably choose the best calibration set for each target sample. Per classification, one should preferably classify a target sample into the most similar class.
Generally speaking, the present objectives involve:
The following relates to various aspects of theory concerning presently disclosed subject matter. In particular, for assessing analyte similarity, a sample i with chemical composition y, matrix-unaffected pure component spectra K, and matrix effects m is represented by:

xi = Kyi + mi
Therefore, for two samples i and j:

xi − xj = K(yi − yj) + (mi − mj)
The y factor is analyte matching while the m factor is matrix effect matching. Although confounded, there is still a rough assessment of analyte and matrix effect similarity. Generally, four requisite decisions are preferably made to develop a similarity criterion:
Relating to the presently disclosed Indicator of System Uniqueness (ISU), an example of the four decisions to be made include as follows:
The following is an example dataset involving corn.
Results in the applicability domain show that dataset combinations with a low ISUx and ISUy are predicted poorly. Per the following referenced illustrations, all samples from a source-target combination are of the same color/designation.
Regarding results in the context of matrix matching, CalSet-sample pairs with high ISU are well matrix matched to each other.
Regarding results in the context of calibration transfer,
Conclusions of the foregoing may be considered in the context of accomplishments. These include sample similarity characterized comprehensively using many spectral and analyte similarity merits, an optimization scheme which is simple and user-invariant, and ready application to four multivariate spectral applications. Considered in the context of takeaways, it is observed that for samples to be similar, they must be similar in all regards, and similarity criteria must address this. In applications regarding similarity, the ISU is a robust, user-friendly, and powerful tool. In particular, as disclosed herein, use of the ISU is pertinent for a novel local modeling scheme.
The following more particularly relates to aspects of automatic unlabeled target-adaptive spectral models with target prediction, including (1) Local Adaptive Fusion Regression (LAFR) and (2) Null Augmented Regression (NAR). The following equations may be pertinent to presently disclosed spectral multivariate calibration/prediction:

y = Xb + e and ŷnew = xnewᵀb̂

where X denotes the matrix of calibration sample spectra, y the corresponding analyte reference values, b the model (regression) vector with estimate b̂, e the error, xnew a new target sample spectrum, and ŷnew its predicted analyte amount.
Biased regression solutions, such as PLS, PCR, and Tikhonov regularization (TR) including ridge regression (RR), require meta-parameter (tuning parameter) selection.
In consideration of possible calibration problems, it is recognized herewith that new target samples are often outliers to calibration data. Calibration samples must span expected target variances and correlations (matrix effects). This includes consideration of physicochemical properties (joint action of both physical and chemical processes). Further to be considered are secondary analyte correlations, environment and instrument measurement conditions, along with the nature of the specific agriculture products (for example, food brands, species, geographical region, growing seasons), medical diagnostics (inclusive of subject physiochemical dependent properties), and possible other hidden variables.
Relative to possible solutions, one (or a first solution) could be regarded as local modeling. As represented by
A second solution could be referred to as a bucket of models (or a form of ensemble learning approach). First, models may be from calibration sets with similar implicit matrix effects, as graphically represented by
A third solution may be referenced as model updating. One considers in particular combinations of (Xsource, ysource) matrix effect differences relative to (Xtarget, ytarget) matrix effects. From the perspective of domain adaptation, only Xsource and xtarget have shifted. Per a three-step approach, first retain all or part of historical primary source data, secondly modify model to predict target samples from new target conditions, and thirdly require model selection for multiple tuning parameters.
Per presently disclosed Local Adaptive Fusion Regression (LAFR), one may combine solutions 1 and 2:
One problem is the need for target y information. “Localization should be carried out with respect to both spectrum and the analyte property” in accordance with Anderson RG, Osborne BG, Wesley IJ. J. Near Infrared Spectrosc. 11 (2003) 39-48, and Williams P. NIR News 30 (2019) 9-11.
A macro overview of LAFR may be regarded as follows. Local modeling is treated as a classification problem: a target sample is classified by hidden matrix effects relative to each calibration set. In a first instance, the spectral library is broken into a collection of linear calibration sets, with sets that maintain similar implicit sample matrix effects identified by indicators of system uniqueness (ISUX and ISUy). Then, calibration sets are evaluated relative to the target sample by ISUX and ISUy, with the best ISU-matched calibration model identified for final prediction.
Decomposing a sample spectrum x, one may assume a linear Beer-Lambert law type relationship, which has the common representation:

x = Ky + e
A calibration set (henceforth ignoring e, i.e., random noise) is represented by the mathematical representations of
For consideration of further analysis of a sample spectrum:
where ε = the quantum mechanical matrix-effect-free spectrum (isolated molecule).
Further, for sample dependent terms (diagonal matrices):
Then, from Eq. 8 and Eq. 9 above, the following holds:
where m = Pma + ∑Pmi, and mtotal is a catch-all for any matrix effect altering a spectrum: instrument, measurement, and chemical and physical interaction effects, etc.
When considering a calibration set:
where Mtotal spans the calibration set of matrix effects.
For a specific calibration sample:
For a new target sample:
The matrix is matched when:
If mtotal,cal ≈ mtotal,new, ya,cal ≈ ya,new, yi,cal ≈ yi,new, then the two samples are matrix matched and ŷa,cal ≈ ŷa,new
If ŷa,cal ≈ ŷa,new for the calibration set, then:
As to potential problems with ŷa,cal ≈ ŷa,new, predictions for a single calibration sample can be similar to the target prediction without the samples being matrix matched. One can obtain chance prediction equivalency, as follows:
Linear combinations of different concentration and matrix effect values can produce ŷa,cal ≈ ŷa,new. Matrix effects are confounding: many combinations give similar x, and similar x’s do not imply similar matrix effects. Chance prediction equivalency reduces with multiple calibration samples, as not all samples in a calibration set can have chance prediction equivalency. The presently disclosed LAFR uses a novel cross-modeling to further reduce chance prediction equivalency. A small numerical demonstration follows.
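The following toy demonstration (random data, illustrative only) exhibits chance prediction equivalency: a spectral difference orthogonal to a fixed regression vector changes the sample, and hence its implied matrix effects, without changing the prediction:

```python
import numpy as np

# Sketch of chance prediction equivalency: a spectral difference orthogonal to the
# regression vector changes the sample (different matrix effects) but not the prediction.
rng = np.random.default_rng(1)
b = rng.standard_normal(50)                 # fixed local regression vector
x_cal = rng.standard_normal(50)             # calibration sample spectrum

d = rng.standard_normal(50)
d -= (d @ b) / (b @ b) * b                  # remove the component of d along b
x_new = x_cal + d                           # different spectrum / matrix effects

print(np.isclose(x_cal @ b, x_new @ b))     # True: y_hat_cal equals y_hat_new by chance
```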
The following concerns presently disclosed indicators of system uniqueness (ISUX). ISUX is in the present context a holistic characterization of implicit X sample-wise differences between target and calibration sample matrix effects. Values do depend on sample y values (x = Ey + m). Fifteen similarity measures are compared:
Some are based on SVD singular vectors (PCs), with windows of PCs used to eliminate optimization; for example, for Mahalanobis distance, all distances are used from 1 PC up through the 99% rule. Eight similarity measures use PC windows, as sketched below.
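A sketch of the PC-window idea for one such measure follows: Mahalanobis distance computed at every PC count from 1 up through the 99% variance rule, so that no single number of PCs must be optimized:

```python
import numpy as np

def mahalanobis_pc_windows(x, X_cal, var_rule=0.99):
    """Mahalanobis distance at every PC count from 1 through the 99% variance rule."""
    centroid = X_cal.mean(axis=0)
    _, s, Vt = np.linalg.svd(X_cal - centroid, full_matrices=False)
    var = s ** 2 / (X_cal.shape[0] - 1)
    k_max = int(np.searchsorted(np.cumsum(var) / var.sum(), var_rule)) + 1

    t = (x - centroid) @ Vt.T                 # target scores on all PCs
    return np.array([np.sqrt(np.sum(t[:k] ** 2 / var[:k]))
                     for k in range(1, k_max + 1)])
```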
The following relates to applications of X matching with Windows (ISUX). For example, outlier detection without tuning parameter selection is represented by the graphical illustration of
The following relates to indicators of system uniqueness (ISUy). ISUy is in the present context a holistic characterization of implicit y sample-wise differences between target and calibration sample matrix effects, where values do depend on sample X values (x = Ey + m). The measures involve the interaction of the regression vector b̂ with X and xnew, where b̂ has magnitude and direction and carries sensitivity, selectivity, and net analyte signal (NAS) information. Seventeen similarity measures compare variations of the prediction value.
Models are automatically selected using U-curves.
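The U-curve computation is not detailed at this point in the disclosure; the following is a hedged sketch of one common U-curve style automatic model selection (ridge regression models over a tuning-parameter grid, choosing the model that minimizes the sum of range-scaled model 2-norm and calibration error):

```python
import numpy as np

def u_curve_select(X, y, lambdas):
    """Selects a ridge model by a U-curve: minimize the sum of range-scaled
    model 2-norm (overfitting axis) and RMSEC (underfitting axis)."""
    norms, rmsecs, models = [], [], []
    for lam in lambdas:
        b = np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)
        models.append(b)
        norms.append(np.linalg.norm(b))
        rmsecs.append(np.sqrt(np.mean((X @ b - y) ** 2)))

    def scale(v):   # range-scale each axis to [0, 1]
        v = np.asarray(v)
        return (v - v.min()) / (v.max() - v.min() + 1e-12)

    return models[int(np.argmin(scale(norms) + scale(rmsecs)))]
```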
The following relates to cross-modeling for robust matrix matching, including description of self- and cross-model calibration sets with the new target sample. For self-model differences: 1. For a calibration set, remove a calibration sample and form a model with the remaining samples; 2. Compute ISUX and ISUy for the target and removed calibration samples relative to the remaining calibration samples and model; 3. Replace the calibration sample and repeat steps 1-2 for each calibration sample; 4. Compute the sum of mean ISU differences between the target and calibration samples; and 5. Repeat steps 1-4 for each calibration set.
For cross-model differences: 6. Use a calibration sample set and its model as the primary source to compute ISUX and ISUy for each other calibration sample set and the new sample as the target samples; 7. Compute the sum of mean ISU differences between the respective target calibration set samples and the new target sample; and 8. Repeat steps 6 and 7 with each calibration set acting as the primary source calibration set. A skeleton of the self-model loop is sketched below.
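The following skeleton sketches the self-model portion of this cross-modeling procedure (steps 1-5). The function isu is a hypothetical placeholder standing in for the full fused ISUX/ISUy computation:

```python
import numpy as np

def self_model_isu_differences(cal_sets, x_target, isu):
    """Steps 1-5 above. `isu(X, y, x)` is a hypothetical placeholder returning a
    fused ISU value for sample x relative to calibration set (X, y) and its model."""
    sums = []
    for X, y in cal_sets:                              # step 5: each calibration set
        diffs = []
        for i in range(X.shape[0]):                    # steps 1-3: leave one sample out
            keep = np.arange(X.shape[0]) != i
            isu_target = isu(X[keep], y[keep], x_target)
            isu_removed = isu(X[keep], y[keep], X[i])
            diffs.append(abs(isu_target - isu_removed))
        sums.append(np.mean(diffs))                    # step 4: mean ISU difference
    return np.asarray(sums)                            # smaller = better matrix match
```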
The following relates to best matrix matched calibration set as having minimum self- and cross-modeling ISU differences. For Self- and Cross-Model ISU differences, reference is made to the mean of ISU differences across 6 calibration sets, as graphically represented by
The overall parameter sets may include the number of PCs and the partial least squares (PLS) model selection. The decimation facets involve spectral similarity measures and the number of samples post-decimation. Outlier cleaning relates to outlier detection measures, the number of outliers to remove each iteration, whether to remove when checking the unlabeled xtarget, and self-prediction thresholds.
The subject set formation involves the number of calibration sets, the first cluster sample and initial distance measure, and ISUX/ISUy/E weighting on set formation. Present matrix matching involves similarity measures and use of the fusion rule. Set and sample selection involves N CSWiM sets and K SWiM samples.
A typical or exemplary parameter set for soil may include:
One exemplary summary of unique components of presently disclosed LAFR may be listed as follows:
A further presently disclosed approach may be referred to as Solution 3: model updating. For such approach, in conjunction with domain adaptation, only Xsource and Xtarget have shifted. This results in retaining all or part of the historical primary source data and modifying the model to predict target samples from new target conditions, as represented in part by Table 3 below.
The presently disclosed process requires model selection for multiple tuning parameters, as represented for example by the block diagram illustrations of
For model selection by MDPS (model diversity and prediction similarity), model diversity is shown in part by the cosine of the angle between two model vectors b̂1 and b̂2, as represented by Eq. 19:

cos(θ) = b̂1ᵀb̂2/(‖b̂1‖2‖b̂2‖2)   (Eq. 19)
Models in the range 0.3 ≤ cos(θ) ≤ 0.5 are retained. Regarding prediction similarity, there are unlabeled prediction differences, and an overfitting safeguard (the mean model 2-norm) is used, per Eq. 20:

(1/N)∑i‖b̂i‖2   (Eq. 20)
An underfitting safeguard is the mean source RMSEC from the calibration source. For the final PS, superscript RS denotes range-scaled values and ω weights the U-curve (ω = 0.4 is used), per Eq. 21:
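A brief sketch of the model-diversity screen of Eq. 19 follows. How the retained model set is assembled from the pairwise cosines is an assumption here; only the 0.3 ≤ cos(θ) ≤ 0.5 retention window is taken from the disclosure:

```python
import numpy as np

def mdps_diversity_retain(models, cos_lo=0.3, cos_hi=0.5):
    """Keeps models participating in at least one pair whose angle cosine
    falls in the retention window of Eq. 19."""
    keep = set()
    for i in range(len(models)):
        for j in range(i + 1, len(models)):
            b1, b2 = models[i], models[j]
            cos = abs(b1 @ b2) / (np.linalg.norm(b1) * np.linalg.norm(b2))
            if cos_lo <= cos <= cos_hi:
                keep.update((i, j))
    return sorted(keep)
```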
In the exemplary case of soy seed samples, the following may be applicable:
The overall goal is to form and select NARE models without labels that perform equivalently to LMC models with labels.
In conjunction with SPS (secondary predicting secondary), the following may be considered.
The following relates to exemplary soy seed MDPS model selection histograms. MDPS selects models correlated to lower RMSEV and greater R2 values for all methods up to 3 tuning parameters. MDPS can be used to select primary source models for PLS and RR. The following summarizes Components of NAR with MDPS (as represented by
In an ongoing context, one may mine historical data for a “target like” primary source sample set (combine LAFR with NAR for LAF-NAR; and RS-LOCAL: Lobsey CR, et al. Eur. J. Soil Sci. 68 (2017) 840-852). Using automatic analysis with the presently disclosed LAFR and NAR, field analysis becomes more practical in a number of areas, for example, cosmetics, clothes, flora, soil, jewels, oils, plastics, pharmaceuticals, and medical diagnostics, as variously and respectively represented by
Per the present disclosure, local modeling may be achieved by classification of matrix effects. The following relates to inaccuracy in multivariate calibration. Multivariate calibration succeeds when the target and source library have equivalent sample and measurement conditions, with calibration represented by

y = Xb + e

and prediction represented by

ŷnew = xnewᵀb̂
It is generally understood that matrix effect differences between source and target produce inaccuracy. The following notations are applicable as listed in Table 5:
Examples of changes in sample or measurement conditions can include:
Various sources of inaccuracy may exist; for example, when the target sample is not spanned by the matrix effects of the calibration samples, the prediction is inaccurate.
The following relates to potential library expansion. It is viable to expand the library; for example, batch processing, pharmaceutical, and soil science practitioners all have libraries spanning vast matrix effects. Regarding linear regression failure, a single global model is compromised by too many matrix effects, and its prediction is therefore untrustworthy.
The following relates to generic local modeling. Per the local modeling process, one would find samples which are highly spectrally (x) similar to the target sample, and predict the target sample using only those similar samples. Method examples include (1) CARNAC: Davies A et al. Mikrochim. Acta 96 (1988) 61-64; and (2) LOCAL: Shenk JS, Berzaghi P, Westerhaus MO. J. Near Infrared Spectrosc. 5 (1997) 223-232. Drawbacks include that samples are only similar in spectra, not necessarily in analyte nor matrix effects.
An extended local modeling process involves (1) using the global model prediction to grossly characterize the target analyte, (2) finding local samples which are spectrally (x) and analyte (y) similar to the target, and (3) predicting the target sample using only those similar samples. One method example is LWR: Naes T, Isaksson T, Kowalski B. Anal. Chem. 62 (1990) 664-673. Drawbacks include that it (1) uses an often poor global model to characterize analyte similarity and (2) the final calibration samples are not similar to one another, only to the target sample. Additional drawbacks are that (3) it requires user-dependent input parameters to obtain best results, and (4) the best similarity criterion for multivariate data is not known.
The disclosure herewith includes a new paradigm for local modeling. The ideal local model should have (1) a dense analyte distribution tightly spanning the target’s true analyte amount (similar y) and (2) matrix effects consistent with and equivalent to the target matrix effects (similar matrix effects). To consider a classification approach: (1) suppose all matrix effects of each sample in a dataset are known, (2) construct classes out of the different matrix effects, (3) classify the target sample into the best matched matrix effect class, and (4) predict the target sample using this class of consistent matrix effects. However, the matrix effects are not labeled. The presently disclosed Local Adaptive Fusion Regression (LAFR) process solves this by clustering the library into linear calibration sets with consistent matrix effects, classifying the target sample into the best matched set, and predicting with that set.
When considering how to cluster by matrix effects, one may use Beer’s Law: a calibration set with consistent matrix effects is linear between X and y. Thus, one may use linear clustering to group samples by matrix effects, as sketched below. There may be some initial issues with this approach. For example, there are as many predictions as there are clusters, so one must decide which prediction to trust. Furthermore, one needs a similarity criterion to assess which matrix effect cluster the target sample is matched to.
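The following is a hedged sketch of one way such linear clustering could proceed (a greedy growth heuristic; an illustrative assumption, not the disclosed set formation protocol). A candidate cluster is scored by how linear its X-y relationship is, per Beer’s Law:

```python
import numpy as np

def linearity_rmse(X, y):
    """Beer's law linearity of a candidate cluster: RMSE of its own linear fit.
    Lower values indicate more internally consistent matrix effects."""
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    return np.sqrt(np.mean((X @ b - y) ** 2))

def greedy_linear_cluster(X, y, seed_idx, size):
    """Grows a cluster from a seed sample, greedily adding whichever sample
    keeps the cluster most linear in X versus y."""
    members = [seed_idx]
    pool = [i for i in range(X.shape[0]) if i != seed_idx]
    while len(members) < size and pool:
        scores = [linearity_rmse(X[members + [i]], y[members + [i]]) for i in pool]
        members.append(pool.pop(int(np.argmin(scores))))
    return members
```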
The following relates to the presently disclosed assessment of sample similarity. A similarity criterion is generally necessary; for example, LAFR needs to assess similarity for clustering and classification, and real spectral data is multivariate, making similarity assessment complex.
The following relates to the presently disclosed Indicator of System Uniqueness (ISU). As disclosed herein, ISU:
Advantages of the ISU Criterion may be regarded as follows. In the context of spectral similarity measures, outlier detection may be achieved without tuning parameter selection, as graphically represented by
Figure library info from: Borggaard C, Thodberg HH. Anal. Chem. 64 (1992) 545-551.
Regarding the protein content of the target samples, the analysis required about 10 minutes per sample with 12 parameter sets, and the LAFR calibration sets are yprotein matched to the yprotein of each xnew.
The following relates to a soil data based example, with figure from the Rapid Carbon Assessment Project (RaCA), USDA. Spectra were from 350-2500 nm (308 wavelengths), relating to SOC (soil organic carbon), using 98,836 library samples (global calibration) and 50 random target samples. Figure library source from Wijewardane NK, Ge Y, Wills S, Loecke T. Soil Sci. Soc. Am. J. 80 (2016) 973-982.
US calibration samples tend to be from US regions, as respectively represented by
The following relates to a corn data based example, with figure as a library source from: Wise BM, Gallagher NB. Eigenvector Research, Manson, WA. http://www.eigenvector.com/data/index.htm. The corn data involved the target mp5 instrument held out of the library, with spectra from 1100-2500 nm (700 wavelengths). Data related to moisture, oil, protein, and starch. The data involved 160 library samples (global calibration) from instruments m5 and mp6, with 30 random target samples from instrument mp5. The presently-disclosed LAFR process should: (1) form mp6 and m5 clusters (calibration sets) and (2) select mp6 calibration sets and samples as best matched.
Another set of corn data was considered, with all instruments in the library. Data spectra were from 1100-2500 nm (700 wavelengths), relating to moisture, oil, protein, and starch. The 240 library samples used for global calibration involved instruments m5, mp5, and mp6. The 30 random target samples drew 10 from each of the three instruments. The presently-disclosed LAFR process should: (1) form mp6/mp5 and m5 clusters (calibration sets) and (2) select calibration samples from the respective instrument of origin.
Another presently disclosed example is based on consideration of cattle feces data. The library data source was drawn from Coates DB, Dixon RM. J. Near Infrared Spectrosc. 19 (2011) 507-519. The data involved a North Australia 10-year collection with three sampling methods: (1) penned cattle fed freshly harvested pasture, (2) penned cattle fed forage hays, and (3) grazed pasture. Spectral data were from 700-2492 nm (225 wavelengths), and the focus of this exercise was on crude protein. A total of 1172 library samples were used for global calibration, with 30 random target samples.
As discussed herein, the presently-disclosed Local Adaptive Fusion Regression (LAFR) Process involves steps of Decimate, Cluster, Classify, and Predict, and as represented by the flowchart of
In accordance with presently-disclosed subject matter, there can be many clustering options. For example, there are multiple parameters to form matrix-effect-cognizant clusters. It is also to be understood that the best parameter combination is rarely known to the user. For example,
As shown herein, the presently-disclosed Local Adaptive Fusion Regression (LAFR) paradigm or process solves issues faced by other local methods. For example, selected calibration set analyte distributions are dense. Analyte similarity is characterized without using a global model. The presently-disclosed indicator of system uniqueness (ISU) provides a holistic characterization of sample similarity. LAFR self-optimizes to find the best clustering parameters. Further, LAFR is shown to be robust to many unique datasets. Additionally, the presently-disclosed clustering-classification approach produces clusters of actual matrix effects. Per future variations, for example, the process of Cluster-Classify-Regress may become Cluster-Classify-Update to handle more complex data situations, such as combination of LAFR with NAR for LAF-NAR.
The following outlines an overview and/or summary of one particular embodiment of the presently disclosed Local Adaptive Fusion Regression (LAFR) process or algorithm, which is extraordinarily complicated in comparison to other contemporary chemometrics modeling methods. This section of disclosure attempts to completely describe the implementation of the LAFR process to a degree that results could be replicated without referring to the original source code. We lay out this section in the following manner.
One important piece of terminology is the concept of a hyperparameter. Hyperparameters control the flow of the LAFR algorithm; they specify how each stage of the protocol should be carried out, from the similarity merits to use, to the data fusion method, to the number of samples or sets each process should complete with. The LAFR process or algorithm can have a number of hyperparameters (for example, 31) that the user has control over. Despite being possible to alter, however, most of these parameters are kept fixed throughout all the analysis and are generally not expected to be changed by anyone except the most ambitious user.
We define a hyperparameter combination as the unique set of all 31 hyperparameter values which describes one particular runtime protocol for the LAFR algorithm. For example, if one wishes to alter the number of samples per cluster from 20 to 30, there would be two hyperparameter combinations: one associated with the hyperparameter configuration with 20 samples per cluster, and one which is unchanged except for having 30 samples per cluster. LAFR is founded on the principle of iterating over many (generally 10-20) hyperparameter combinations to locate the best configuration for each target sample. The LAFR process or algorithm is set up to accept the variable hyperparameters a user would like to include in their search space, then sets all the other hyperparameters to their default values; a short sketch follows. A detail which is important to the computational speed, but not to the results, is that the hyperparameter combinations are organized such that two combinations which differ only in the later stages of the algorithm are grouped together, so that the early stages do not have to be recomputed. Recomputing these statistics would simply increase the computational time and has no effect on the algorithm outcome.
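A short sketch of the hyperparameter-combination enumeration follows; the parameter names, defaults, and search values are illustrative, not the disclosed 31-parameter list:

```python
from itertools import product

# Only the parameters the user places in the search space vary; all others keep
# their defaults. Names, defaults, and values here are illustrative.
defaults = {"samples_per_cluster": 20, "n_clusters": 50, "n_pcs": 10}
search_space = {"samples_per_cluster": [20, 30], "n_clusters": [50, 100]}

hppcs = []
for values in product(*search_space.values()):
    hppc = dict(defaults)
    hppc.update(zip(search_space.keys(), values))
    hppcs.append(hppc)

# Grouping HPPCs that share early-stage settings avoids recomputing those stages.
hppcs.sort(key=lambda h: (h["n_pcs"], h["n_clusters"]))
print(len(hppcs), "hyperparameter combinations")
```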
This section of the disclosure also frequently refers to sample similarity analysis and similarity merits. The foundation for this work is the Physicochemically Responsive Integrated Similarity Measure (PRISM). It groups the assessment of sample similarity into two categories: spectral similarity and model-informed or analyte-related similarity. All similarity merits in this work fall into one of those two categories. In the most general sense, these similarity merits measure some form of similarity between a sample and a sample subspace (note, though, that in some cases only a single sample describes this subspace). If one uses a similarity merit to compare one sample to a subspace, and then another sample to that same subspace, one can characterize the difference in these measured values, which can be interpreted as the difference in how two samples respond to a particular similarity analysis. In the PRISM sense, we term this type of similarity characterization Δ-similarity. Both standard similarity and Δ-similarity are used many times in the LAFR algorithm. Similarity merits can be further sub-grouped as to whether they require a subspace decomposition. This includes such merits as Mahalanobis distance and Q-residual, both of which use a principal component decomposition. These merits are naturally more computationally expensive than their simple vector-to-vector counterparts.
Implementation Details: The LAFR process or algorithm can in some embodiments be coarsely grouped into seven stages: truncation, set formation (or clustering), quality checking, set selection 1, set selection 2, sample selection (or cherry-picking), and prediction. The first four of these occur for each hyperparameter combination; that is, for every hyperparameter combination many calibration sets are formed, and then the best matrix-matched set is selected. Once a best set is identified from each hyperparameter combination, set selection 2 chooses a subset of those to be passed forward into sample selection, where only the best samples from the input sets are assembled into a calibration set. This calibration set then predicts the target sample. This entire program flow is depicted in
Though the overarching idea for the LAFR process or algorithm is fairly straightforward, the details of its implementation require quite a bit more explanation. For example, there is an intermediary outlier detection step between the truncation and clustering, the clustering requires quite a bit more theory, and the quality checking process has many sub-steps. The following sections are dedicated to describing these intricacies in detail; however, this flowchart in
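As a complement to the stage descriptions above, the following is a simplified, runnable toy walk-through of the seven-stage flow. Every stage is reduced to its simplest plausible form (nearest-neighbor truncation, analyte-sorted windows, least-squares fits, etc.); these reductions are illustrative assumptions and are NOT the disclosed implementation:

```python
import numpy as np

def lafr_pipeline_sketch(X_lib, y_lib, x_target, n_truncate=100, n_sets=5, set_size=20):
    """Toy end-to-end walk-through of the seven stages; every stage is reduced
    to a simplest plausible form and is NOT the disclosed implementation."""
    # Stage 1 - truncation: keep the library samples spectrally nearest the target.
    order = np.argsort(np.linalg.norm(X_lib - x_target, axis=1))[:n_truncate]
    X, y = X_lib[order], y_lib[order]

    # Stage 2 - set formation: here, simply analyte-sorted windows of samples.
    idx = np.argsort(y)
    sets = [idx[i * set_size:(i + 1) * set_size] for i in range(n_sets)]

    # Stage 3 - quality checking: drop sets that cannot fit themselves linearly.
    def rmsec(s):
        b, *_ = np.linalg.lstsq(X[s], y[s], rcond=None)
        return np.sqrt(np.mean((X[s] @ b - y[s]) ** 2))
    sets = [s for s in sets if len(s) == set_size and rmsec(s) < np.std(y)] or [idx[:set_size]]

    # Stages 4-5 - set selections: pick the set whose centroid is nearest the target.
    best = min(sets, key=lambda s: np.linalg.norm(X[s].mean(axis=0) - x_target))

    # Stage 6 - sample selection (cherry-picking): keep the nearest half of the set.
    near = np.argsort(np.linalg.norm(X[best] - x_target, axis=1))[:max(2, len(best) // 2)]

    # Stage 7 - prediction with the final local calibration set.
    b, *_ = np.linalg.lstsq(X[best][near], y[best][near], rcond=None)
    return x_target @ b
```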
Truncation: The idea behind truncation is to reduce the size of the library by a sufficient amount so that clustering is computationally feasible. It aims to eliminate the samples from the library which have negligible likelihood of being matched to the target sample and are thus simply cumbersome to the analysis. This is, however, only a concern in medium and large datasets. In spectral libraries with fewer than 500 samples, clustering is likely computationally viable on the whole library and could be performed as such. Library reduction can be performed in one or both of two ways: categorical truncation and similarity-based truncation.
Categorical truncation uses known information about the target sample to restrict the local modeling search space only to library samples which are matched in that known information. Take, for example, the case of a known geographical region of origin for a set of samples. A new target sample should likely be predicted best by the samples within its same geographical region; this is a category one could truncate by. However, categorical truncation can only occur if the same categories are measured both on the target sample and on the library samples, and if this known categorical information is strongly correlated with true sample similarity; otherwise the analysis is restricted to samples which are not necessarily any more similar.
Categorical truncation in LAFR is simple: identify library samples which are matched in all the same categories as the target sample. It is possible that there are multiple different measured categories; the truncated set should be matched by all of these. Hyperparameters associated with categorical truncation are “TruncCategorical”, which determines whether categorical truncation should take place, and “TruncCategories”, which specifies which categories to truncate according to. Once this categorical truncation is carried out, the user can either move forward into the next stage or truncate even further by a similarity-based analysis.
Whether categorical truncation was carried out or not, the next step which can occur is similarity-based truncation. The basic idea is to use computationally inexpensive similarity merits to identify which library samples are similar to the target sample. Of the similarity merits used in this work, the least computationally expensive ones are the spectral similarity merits which do not use principal components.
The exact implementation of this similarity-based truncation in this exemplary embodiment is as follows. Calculate the vector-to-vector similarity directly between the target spectrum and each of the spectra in the sample library. Normalize the similarity merit measurements such that the vector containing the similarity values for all the library samples for a particular merit has unit magnitude (i.e., divide each value in the vector by the magnitude of the whole vector). Sum over all the similarity merits for each sample so that every sample now has one score containing aggregate information from all the similarity measurements. Finally, choose the samples with the lowest fusion score (i.e., the samples which are most similar to the target sample) to continue on to the next stage. The number of samples selected is a hyperparameter described at greater length in the appendix.
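For illustration, a minimal sketch of this fused truncation is given below. The three vector-to-vector merits used (Euclidean, cosine, and city-block distances) are representative stand-ins and are not necessarily the exact merit set of the disclosed embodiment:

```python
import numpy as np

def similarity_truncation(library, target, n_keep):
    """Similarity-based truncation: keep the n_keep library spectra most
    similar to the target by a fused score over inexpensive
    vector-to-vector merits (illustrative merits, not the LAFR set).

    library: (n_samples, n_wavelengths) array of library spectra
    target:  (n_wavelengths,) target spectrum
    """
    merits = [
        lambda X, t: np.linalg.norm(X - t, axis=1),              # Euclidean distance
        lambda X, t: 1.0 - (X @ t) / (np.linalg.norm(X, axis=1)
                                      * np.linalg.norm(t)),      # 1 - cosine similarity
        lambda X, t: np.abs(X - t).sum(axis=1),                  # city-block distance
    ]
    scores = np.zeros(library.shape[0])
    for merit in merits:
        m = merit(library, target)
        # normalize each merit vector to unit magnitude (divide each
        # value by the magnitude of the whole vector), then fuse by summing
        scores += m / np.linalg.norm(m)
    return np.argsort(scores)[:n_keep]   # lowest fused score = most similar
```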
Library Outlier Cleaning and Detection: Another guiding principle of LAFR in this exemplary embodiment is that, at certain times in the algorithm after particularly disruptive operations, the data should be swept for outliers. The idea is that outliers retained in the data can propagate error, particularly into the similarity analyses. By analyzing the data for outliers, cleanliness of the data throughout the algorithm can be assured. The outlier analysis in this section is two-fold: first, clean the library (i.e., remove any outliers from the truncated library), then detect whether the target sample is an outlier to this new truncated library. It is quite possible (and frequently observed) that the target sample conditions are not spanned by the conditions of the library samples, making the target an outlier to the entire library. This section of the disclosure aims to detect this situation.
Cleaning the library and detecting whether the target sample is an outlier are fundamentally the same problem. One must simply determine whether a sample of interest, whether a library sample or the target sample, appears to be an outlier with respect to the rest of the library samples. First, the sample of interest is separated from the rest of the library. All the spectral similarity merits, including those which require a principal component decomposition, are used to assess the similarity between the sample of interest and the remaining library samples as a whole. For the vector-to-vector similarity merits, this means calculating the similarity between the spectrum of interest and the centroid (mean spectrum) of the remaining library samples. For the subspace similarity merits, such as the Mahalanobis distance, this similarity between a sample set and a spectrum of interest is already well defined and is performed here. The resulting similarity data from this type of measurement is three dimensional: the number of similarity merits by the number of library samples (plus one for the target sample) by the number of principal components. To fuse this data, first average over all the principal components, then normalize each set of measurements at one similarity merit to unit magnitude as was done in the truncation stage. Finally, sum over the normalized similarity measurements for each sample. The result is a collection of numbers describing how similar each library sample, and the target sample, is to the remaining library samples when it is removed from the set and compared.
A simple way to use this fused similarity data in detecting outliers is as a z-score. The measurements of the library samples with respect to the remaining library are used to find a standard deviation and a mean. Samples with a low fusion score are highly similar to the remaining space, and samples with a high fusion score are very dissimilar to the space (outliers). A hyperparameter controls the number of standard deviations from the mean before a sample is considered to be an outlier. This value defaults to 3, meaning that outliers have a fusion score more than three standard deviations greater than the mean of the library sample scores. It should be noted that this z-scaling is based on the standard deviation and mean of the library samples alone and does not include the target sample. This is because a target sample that is an outlier would skew these statistics, whereas the library samples are already considered reasonably outlier-free.
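A minimal sketch of this fusion and z-score screening follows, assuming the similarity measurements have already been arranged into an array indexed by merit, sample, and principal component:

```python
import numpy as np

def fused_scores(sim_cube):
    """Fuse a (n_merits, n_samples, n_pcs) similarity array into one
    score per sample: average over principal components, normalize each
    merit's vector of sample scores to unit magnitude, sum over merits."""
    per_merit = sim_cube.mean(axis=2)                    # average over PCs
    per_merit = per_merit / np.linalg.norm(per_merit, axis=1, keepdims=True)
    return per_merit.sum(axis=0)                         # fuse over merits

def flag_outliers(lib_scores, target_score, z_threshold=3.0):
    """Z-score the fused scores against the library statistics only; the
    target is excluded so an outlying target cannot skew the mean and
    standard deviation. High scores indicate dissimilarity (outliers)."""
    mu, sigma = lib_scores.mean(), lib_scores.std(ddof=1)
    lib_outliers = (lib_scores - mu) / sigma > z_threshold
    target_is_outlier = (target_score - mu) / sigma > z_threshold
    return lib_outliers, target_is_outlier
```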
If the target sample is identified as an outlier (i.e., more than the specified number of standard deviations from the mean), then the sample is deemed “non-predictable” and LAFR will decline to provide a prediction for this particular target sample. If a user nonetheless strongly desires a prediction, likely at the cost of considerable accuracy, the outlier detection z-threshold can be relaxed to permit one.
Since some of the similarity merits use a principal component decomposition of the library spectra, it is necessary to discuss how the number of principal components is selected. Generally, the selection is carried out using a 99% variance rule: the library spectra are mean-centered and a singular value decomposition (SVD) is taken. The maximum number of principal components is the smallest number for which the cumulative sum of the singular values reaches 99% of the total sum of the singular values. The similarity analysis is calculated over the window of principal components (from one principal component up to this maximum), then averaged as described above. This is the case for all applications of principal component analysis in this algorithm.
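The 99% rule can be sketched as follows; per the description above, the cumulative sum is taken over the singular values themselves:

```python
import numpy as np

def max_principal_components(X, fraction=0.99):
    """Smallest number of principal components whose singular values
    account for 99% of the total sum of singular values of the
    mean-centered library spectra X (samples x wavelengths)."""
    s = np.linalg.svd(X - X.mean(axis=0), compute_uv=False)  # descending singular values
    cumulative = np.cumsum(s) / s.sum()
    return int(np.searchsorted(cumulative, fraction) + 1)
```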
Set Formation: The set formation (or clustering) in this exemplary embodiment of the LAFR process or algorithm can be described as grouping the samples by matrix effects, or homogenizing the calibration sets with respect to their matrix effects. Sample clusters which are sufficiently homogeneous in their matrix effects will have a linear (i.e., Beer-Lambert type law) relationship between their spectra and analyte amounts. Conversely, calibration sets that behave linearly are more likely to be homogeneous in their matrix effects. The task of creating matrix-effect-grouped calibration sets is thereby mapped onto the problem of creating linear clusters, an active field of research in machine learning.
We employ a modified cluster-wise linear regression approach that simultaneously optimizes calibration sets to be spectrally similar, analyte similar, and linear. Linearity is enforced through the prediction error of each sample under a cluster’s model; the rationale for why this works is laid out in the linear clustering literature referenced in the introduction. The overarching idea is that clusters are initialized and then iterated, reassigning every sample on each iteration to the cluster it is most similar to by a weighted combination of spectral similarity, analyte similarity, and the ability of that cluster to predict the sample accurately.
The first step is to initialize the clusters. Alterations in the cluster initialization process substantially impact the converged clusters. The simplest and most widely employed initialization tactic is to randomly assign samples to the clusters. This option is available in our algorithm but is not the default, as specified by the controlling hyperparameter. Instead, we opt for a Kennard-Stone approach in which the first cluster centroid is the sample at the center of the library and the remaining cluster centers are the outermost samples. These centroids are thus spread as widely as the dataset allows. The Kennard-Stone algorithm typically uses only the Euclidean distance; however, we modify the algorithm to weight with equal importance the Euclidean distance between samples and the actual analyte difference. This Kennard-Stone process provides centroids, but the next part of the clustering process requires entire clusters, not simply centroids. To accomplish this, the remaining samples are assigned to the cluster they are most similar to according to a spectral and analyte similarity measure. For the spectral analysis, all the vector-to-vector spectral similarity merits (the same as in truncation) are used. The analyte similarity analysis is simply the difference in true analyte between the unassigned sample and the centroid. Each unassigned sample is compared to the centroids using the spectral similarity merits and the analyte difference. All the measurements at a particular merit are normalized to unit length, as always. The spectral similarity measurements are then averaged for each sample, and this average is itself averaged with the analyte similarity measurement to produce an overall similarity score for each unassigned sample with respect to each cluster. Each unassigned sample is then assigned to the cluster for which it has the lowest similarity score.
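A minimal sketch of the modified Kennard-Stone centroid selection follows. Scaling each distance matrix by its maximum, so the spectral and analyte terms carry equal weight, is an assumed normalization; the disclosure specifies only that the two are weighted with equal importance:

```python
import numpy as np

def weighted_kennard_stone(X, y, n_clusters):
    """Modified Kennard-Stone centroid selection: the first centroid is
    the sample nearest the library center, and each subsequent centroid
    is the sample farthest (by max-min distance) from those already
    chosen, using an equal-weight fusion of spectral Euclidean distance
    and absolute analyte difference."""
    d_spec = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    d_anal = np.abs(y[:, None] - y[None, :])
    D = d_spec / d_spec.max() + d_anal / d_anal.max()    # equal-weight fusion
    chosen = [int(np.argmin(np.linalg.norm(X - X.mean(axis=0), axis=1)))]
    while len(chosen) < n_clusters:
        min_d = D[:, chosen].min(axis=1)   # distance to nearest chosen centroid
        min_d[chosen] = -np.inf            # never re-pick a centroid
        chosen.append(int(np.argmax(min_d)))
    return chosen
```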
Now that the initialized clusters exist, either by randomly assigning samples or by the Kennard-Stone plus simple similarity analysis, the k-means based linearization can occur. All the samples are up for reassignment, meaning that they may move from their current cluster to a new cluster. The similarity merits compare the samples up for reassignment to the overall subspaces defined by the prior sample clusters (i.e., the clusters from the last iteration). For the spectral vector-to-vector merits, this means comparing the spectrum up for reassignment to the mean spectrum of the cluster. For the analyte merits, it compares the sample up for reassignment to the mean analyte of the cluster. Many similarity merits are involved in assessing which cluster the samples are most similar to. The spectral scores for each of the samples compared to each of the clusters, as well as the corresponding analyte similarity scores, are fused the same way as in the other cases (average over principal components or latent variables, normalize merits to unit length, then average over the similarity merits). The last measurement is the linearization condition, which is the prediction error of the cluster model when applied to the sample up for reassignment. This is straightforward to calculate and is also normalized to unit length over all the samples, just as for the other similarity merits. At this point, there are measurements of spectral similarity, analyte similarity, and prediction error for each of the library samples with respect to each of the clusters from the prior iteration. The spectral similarity, analyte similarity, and prediction error scores are fused together into one composite score using a user-defined (hyperparameter) weighting scheme. For weights α1, α2, and α3, this weighted fusion obeys Eq. 24.
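Because Eq. 24 itself is not reproduced here, the sketch below assumes it takes the form of a plain weighted sum of the three normalized scores:

```python
import numpy as np

def composite_score(s_spec, s_anal, s_err, alpha1=1.0, alpha2=1.0, alpha3=1.0):
    """Weighted fusion of the three normalized score arrays (spectral
    similarity, analyte similarity, prediction error), each shaped
    (n_samples, n_clusters) with lower values meaning a better match.
    A plain weighted sum is assumed here as the form of Eq. 24."""
    return alpha1 * s_spec + alpha2 * s_anal + alpha3 * s_err

# each sample then goes to the cluster with the smallest composite score:
# assignments = np.argmin(composite_score(s_spec, s_anal, s_err), axis=1)
```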
Each sample is then assigned to the cluster for which it has the smallest composite similarity score. This readjusts the clusters, and the process is iterated to converge toward a solution in which all samples are in clusters to which they are highly similar. Convergence, defined in this context to mean that no samples change clusters during an iteration of reassignment, is the ideal stopping point. However, it is typically computationally infeasible to iterate all the way to complete convergence, since the many similarity calculations required for each step are quite expensive. Instead, a hyperparameter controls the maximum number of iterations (default 30) before the clusters are exported to the next stage. Even without complete convergence, the clusters can still satisfy the linearity constraints to a very high degree.
A debilitating problem with this process is that it almost always drives the samples into calibration sets with uneven numbers of samples. This causes some calibration sets to be so small that they cannot even develop a regression vector and thus cannot be used in the iteration process. Another hyperparameter controls the minimum number of samples required in each cluster on every iteration. If a cluster falls below this minimum, the algorithm splits the largest cluster in half (into high-analyte and low-analyte subsets) and appends the high-analyte subset to the undersized cluster. This process is applied recursively until all the clusters are of sufficient size. The next iteration then proceeds as normal.
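A sketch of this recursive size repair follows, assuming clusters are represented as lists of sample indices and that the library is large enough for the repair to terminate:

```python
def enforce_minimum_size(clusters, y, min_size):
    """Repair undersized clusters: split the largest cluster at its
    median analyte value and append the high-analyte half to the
    undersized cluster, repeating until every cluster is large enough.
    clusters is a list of index lists; y holds the analyte values."""
    while any(len(c) < min_size for c in clusters):
        small = next(i for i, c in enumerate(clusters) if len(c) < min_size)
        donor = max(range(len(clusters)), key=lambda i: len(clusters[i]))
        by_analyte = sorted(clusters[donor], key=lambda j: y[j])
        half = len(by_analyte) // 2
        clusters[donor] = by_analyte[:half]                     # low-analyte half stays
        clusters[small] = clusters[small] + by_analyte[half:]   # high-analyte half moves
    return clusters
```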
Set Outlier and Quality Controls: This step exists in this exemplary embodiment to ensure that the formed calibration sets are outlier free, are reasonably linear between spectrum and analyte, and are fairly similar to the target sample. There are five substages which ensure these conditions are met: the Grubbs’ test, the labeled analyte clean, the spectrum and labeled analyte clean, the linearity quality control, and the unlabeled PRISM check.
The Grubbs’ test simply checks whether any of the samples within the calibration set have analyte amounts far from the rest of the samples. The confidence level associated with this Grubbs’ test is a controllable hyperparameter, defaulting to 95%. Any samples detected as outliers are removed.
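For reference, one pass of a standard two-sided Grubbs’ test on the analyte values can be sketched as follows (repeated application removes one outlier at a time):

```python
import numpy as np
from scipy import stats

def grubbs_outlier_index(y, alpha=0.05):
    """One pass of the two-sided Grubbs' test on analyte values y:
    returns the index of a detected outlier or None. The default alpha
    corresponds to the stated 95% confidence level."""
    n = len(y)
    deviations = np.abs(y - y.mean())
    G = deviations.max() / y.std(ddof=1)
    # standard Grubbs' critical value from the t-distribution
    t2 = stats.t.ppf(1 - alpha / (2 * n), n - 2) ** 2
    G_crit = (n - 1) / np.sqrt(n) * np.sqrt(t2 / (n - 2 + t2))
    return int(np.argmax(deviations)) if G > G_crit else None
```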
The labeled analyte clean uses many similarity merits to assess whether any of the samples in the calibration set are dissimilar enough to be outliers. This analysis uses the PRISM style of similarity checking, namely Δ-similarity. A sample is taken out of the calibration set; call this the sample out. This sample is compared to the remaining calibration subset using all the similarity merits. A sample from within the remaining calibration subset is likewise compared to that same subset using the same similarity merits; call this the sample left in. There are now similarity measurements for the sample out against the remaining subspace and for the sample left in against the remaining subspace; the absolute value of the difference of these two measurements is the Δ-similarity describing the similarity between the sample out and the sample left in. Iterating over all possible samples left in gives the full data picture, and the process is then repeated for all possible samples out. What results is four-dimensional similarity data: all samples out by all samples left in by all similarity merits by all latent variables. To fuse, first average over the latent variables, then over all the samples left in. Next, normalize the similarity merit rows to unit length, and finally sum over the similarity merits. Just as for the earlier library outlier detection, there is one similarity value for each sample left out describing its similarity to the rest of the sample space. As before, this value is scaled according to the z-distribution and any samples with scores outside the standard deviation threshold are rejected.
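A sketch of the Δ-similarity fusion is given below, assuming the four-dimensional array of absolute differences has already been computed as described:

```python
import numpy as np

def fuse_delta_similarity(delta):
    """Fuse a four-dimensional Delta-similarity array shaped
    (n_samples_out, n_samples_in, n_merits, n_latent_vars) into one
    score per sample left out: average over latent variables, then over
    samples left in, normalize each merit's values over the samples out
    to unit length, and finally sum over the merits."""
    a = delta.mean(axis=3).mean(axis=1)                # -> (n_out, n_merits)
    a = a / np.linalg.norm(a, axis=0, keepdims=True)   # unit-length merit columns
    return a.sum(axis=1)                               # one fused score per sample out
```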
That outlier detection process considers only the analyte similarity. To characterize the outliers more holistically, the exact process is repeated with both the analyte similarity merits and the spectral similarity merits, using the same fusion process and outlier rejection.
The quality controls come next. The idea is that poor calibration sets should be rejected before they have the chance to go through the similarity analysis in the next stage. The simplest rejection criterion is the calibration set size. If the outlier cleaning has left a set with so few samples (hyperparameter default 10 samples) that a regression is likely inaccurate, the calibration set is rejected. Next, if the predicted analyte value of the target sample does not fall within both the actual analyte range of the calibration set samples and the range of predicted values of the calibration set samples, the calibration set is also rejected, since these types of models are generally not suitable for extrapolation. The next few conditions relate to dataset linearity and are based on fitting a regression line to the univariate plot of the predicted analyte values of the calibration set against the actual analyte values. The R2 correlation should optimally be 1.00, the slope of this fit should be 1.00, and the intercept should ideally be 0. The required adherence to these optimal conditions is set by a hyperparameter, but generally the only one that is easy to control strictly in a dataset-independent manner is the R2, which by default is confined to be strictly greater than 0.80. Finally, the RMSEC can be constrained by the quality controls, but again this constraint is not dataset independent.
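These quality controls can be sketched as a single pass/fail check; the slope, intercept, and RMSEC bounds are dataset dependent and are therefore omitted from this minimal version:

```python
import numpy as np

def passes_quality_controls(y_true, y_pred, target_pred,
                            min_size=10, r2_min=0.80):
    """Quality checks on one candidate calibration set using the stated
    default thresholds. y_true and y_pred are the set's actual and
    predicted analyte values; target_pred is the target's prediction."""
    if len(y_true) < min_size:
        return False                                     # too few samples
    if not (y_true.min() <= target_pred <= y_true.max()):
        return False                                     # outside actual analyte range
    if not (y_pred.min() <= target_pred <= y_pred.max()):
        return False                                     # outside predicted range
    r2 = np.corrcoef(y_true, y_pred)[0, 1] ** 2          # fit of predicted vs. actual
    return r2 > r2_min
```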
The final part of the checks is to determine whether the target sample is an outlier to the calibration set. This is performed the same way as in the PRISM algorithm and the same as the aforementioned outlier cleaning mechanism, except with different similarity merits. The merits must differ because analyte amounts are not known for the target sample. As always, if the target sample is outside the allowed standard deviation threshold, it is identified as an outlier.
Set Selection 1: The first set selection of this exemplary embodiment takes the generated calibration sets from the particular hyperparameter combination and identifies the set most similar to the target sample. This similarity analysis is relatively straightforward compared to some of the earlier ones. However, it brings up the topic of self-modeling and cross-modeling in set selection, or matrix matching. Self-modeling is the typical use of similarity merits in the Δ-similarity setup. To calculate the self-modeling similarity, one sample is taken out of the calibration set and compared to the remaining subspace of the calibration set using the similarity merits. The target sample is also compared to this remaining calibration set. The difference between the two measurements, made for different samples with respect to the same underlying subspace, is recorded as the Δ-similarity and indicates one measurement of similarity between the target sample and the calibration sample out. This can be performed for all the samples out, for all the calibration sets. Recent work, however, introduces the idea of cross-modeling, wherein the sample out and the target sample are compared not to the calibration set the sample out originated from, but to a different calibration set. For example, if there are two calibration sets, one can calculate the similarity between a sample pulled from the first calibration set and the entire second calibration set, and also between the target sample and the entire second calibration set, then difference them to obtain a Δ-similarity. This cross-modeling not only expands the volume of accessible data from which to draw similarity conclusions, but also provides a separate view of the internal physicochemical similarity structure between the samples.
So, in order to pick the best matched set, LAFR calculates the Δ-similarity between the sample out and the target sample with respect to not only the set the sample out originated from, but also the domain of each of the other calibration sets. The number of effective similarity merits is expanded multiplicatively by the number of calibration sets, so that there is effectively a “Mahalanobis distance with respect to the first calibration set”, a “Mahalanobis distance with respect to the second calibration set”, etc. To fuse over this data, first it is averaged over principal components and latent variables, then averaged over all the calibration samples left out. Each effective similarity merit is normalized to unit length, then summed over all the similarity merits. At this point, there is one spectral similarity score and one analyte similarity score for each of the calibration sets.
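The self- and cross-modeling measurements can be sketched compactly; the callable merit(x, S), which scores a spectrum x against the subspace of a calibration set S, is an assumed interface:

```python
import numpy as np

def self_and_cross_deltas(sample_out, target, cal_sets, merit):
    """Delta-similarity between a sample out and the target with respect
    to every calibration set, not only the one the sample came from.
    Each set multiplies the effective number of similarity merits (e.g.,
    "Mahalanobis distance with respect to set 1", "... set 2", etc.)."""
    return np.array([abs(merit(sample_out, S) - merit(target, S))
                     for S in cal_sets])
```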
The method used to fuse the spectral and analyte similarity scores is also a controllable hyperparameter. The simple “mean” fusion sums the spectral score and the analyte score. The default, “plot” fusion, instead takes the 2-norm of the coordinate vector defined by the spectral similarity score and the analyte similarity score.
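Both fusion options can be sketched in a few lines; the 2-norm form below is the assumed shape of Eq. 25, which is referenced again for the sample selection:

```python
import numpy as np

def fuse_spectral_analyte(spec_score, anal_score, method="plot"):
    """Combine a spectral and an analyte similarity score. "mean" fusion
    sums the two; "plot" fusion (the default) takes the 2-norm of the
    (spectral, analyte) coordinate vector."""
    if method == "mean":
        return spec_score + anal_score
    return np.sqrt(spec_score**2 + anal_score**2)   # "plot" fusion
```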
The set with the lowest composite similarity score is selected and passed forward to the second set selection. However, it is at this point that the LAFR algorithm goes back to iterate over the other hyperparameter combinations, yielding a “best” calibration set for each of them. The aggregate of all these calibration sets, one for each hyperparameter combination, will be the input for the second set selection.
Set Selection 2: The second set selection per this exemplary embodiment is functionally almost equivalent to the first in terms of implementation. The first difference is that cross-modeling is not used for the second set selection; only self-modeling is analyzed. This is because the clustering algorithm can create nearly identical sets across hyperparameter combinations, so the second set selection frequently receives near-duplicate calibration sets, which would severely advantage the duplicated sets in cross-modeling. Using only self-modeling abates this issue. The second small difference between the two set selection stages is that the second set selection chooses multiple best calibration sets instead of only one. The idea is that the sample selection which occurs next works best when it selects samples from multiple different calibration sets and aggregates them, rather than selecting all the samples from an already relatively small single calibration set.
Sample Selection: Now that the duplicate calibration sets have ideally been removed by the second set selection, it is appropriate to return to using the cross-modeling statistics. Thus, the self- and cross-modeling measures are computed for the new subset of calibration sets. Recall, though, that earlier the set selection process averaged over the samples left out. For sample selection, however, this fusion will not take place. Instead, when the merits are normalized to unit length and summed over, there will be one spectral similarity value and one analyte similarity value for each sample in the calibration sets, rather than only for each calibration set. As before, the plot fusion in Eq. 25 can be used to combine the spectral similarity score with the analyte-based similarity score.
Another hyperparameter controls the number of unique samples selected by the algorithm for the final local calibration set. The parameter counts unique samples because the same sample often appears in multiple of the calibration sets passed to the final sample selection and can be chosen more than once. Since such samples are duplicated in the final calibration set, this works as a de facto weighting scheme whereby samples repeated more often have additional influence on the formation of the model regression vector.
One substantial concern with the last piece of the LAFR process or algorithm is that these final hyperparameters cannot self-optimize because they arise outside of the loop over all hyperparameter combinations. It is certainly possible that minor changes to the latent variable selection method or the number of final chosen samples could substantially affect the predictive capabilities of the LAFR model. To abate this issue, we analyze the variation of the model predictions when the number of unique samples and the number of latent variables is altered. We vary the number of unique samples from 15 less than the hyperparameter value to 15 more and look at the latent variables from 2 less than the chosen amount to 2 more than the chosen amount. This defines a parameter space in which we can analyze all the predictions and determine whether they vary to a degree which is unacceptable for the required predictive reliability. To make this minimally restrictive, the LAFR default hyperparameter for this is that if the standard deviation of the varying predictions is greater than 40% of the total span in the analyte amounts for the whole original sample library, then the sample is identified as non-predictable. This case is almost never triggered, but if the standard deviation is this large then the sample certainly is not reliably predicted by the calibration set.
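This robustness sweep can be sketched as follows; predict(n, lv), a callable that rebuilds the final model with n unique samples and lv latent variables and predicts the target, is an assumed wrapper:

```python
import numpy as np

def is_reliably_predictable(predict, n_unique, n_lvs, library_span, frac=0.40):
    """Vary the number of unique samples by +/-15 and the latent
    variables by +/-2, collect all predictions, and flag the target as
    non-predictable when their standard deviation exceeds 40% of the
    library analyte span (the stated default)."""
    preds = [predict(n, lv)
             for n in range(n_unique - 15, n_unique + 16)
             for lv in range(max(1, n_lvs - 2), n_lvs + 3)]
    return np.std(preds) <= frac * library_span   # True = acceptably stable
```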
Prediction: Having determined a local calibration set from the processes prior to this, the prediction of the target sample is straightforward. For the purposes of this exemplary embodiment of the process, partial least squares regression is used, but a variety of linear regression techniques are viable for generating predictions.
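A minimal sketch of this final step using scikit-learn’s PLSRegression, one of several viable implementations, follows:

```python
from sklearn.cross_decomposition import PLSRegression

def predict_target(X_cal, y_cal, x_target, n_lvs):
    """Final prediction: fit partial least squares regression on the
    selected local calibration set and predict the target spectrum. Any
    linear regression technique could be substituted here."""
    model = PLSRegression(n_components=n_lvs)
    model.fit(X_cal, y_cal)
    return float(model.predict(x_target.reshape(1, -1)).ravel()[0])
```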
While certain embodiments of the disclosed subject matter have been described using specific terms, such description is for illustrative purposes only, and it is to be understood that changes and variations may be made without departing from the spirit or scope of the subject matter.
The present application claims the benefit of priority of U.S. Provisional Pat. Application No. 63/304,856, titled “Generalized Local Adaptive Fusion Regression Process Based on Physicochemical and Physiochemical Underlying Hidden Properties for Quantitative Analysis of Molecular Based Spectroscopic Data,” filed Jan. 31, 2022, which is fully incorporated herein by reference for all purposes.
The presently disclosed subject matter was made with Government support under Grant Nos. CHE-1506417 and CHE-1904166, awarded by the National Science Foundation. The Government has certain rights in the invention.