1. Field of the Invention
The present invention relates to the development of optimized multivariate models through a calibration set of empirical data. More particularly, the present invention relates to the automatic selection of a data sub-set from a larger set of potential calibration data that provides improved performance (accuracy) and robustness.
2. Description of Related Technology
In general, near-infrared (NIR) diffuse reflectance spectroscopy involves the illumination of a spot on the body with low energy near-infrared light (700-2500 nm). The light is partially absorbed and scattered, according to its interaction with chemical components within the tissue, prior to being reflected back to a detector. The absorbance of light at each wavelength is a function of the structural properties and chemical composition of the tissue. Tissue layers, each containing a unique heterogeneous particulate distribution, affects light absorbance through scattering. Chemical components such as water, protein, fat, and analytes absorb light proportionally to their concentration through unique absorption profiles or signatures. The measurement of glucose is based on detecting the magnitude of light scatter and attenuation related to its concentration as spectrally manifested through the use of a calibration.
A calibration is a mathematical model, g(?), that relates a set of M independent variables, xεM×1, to a dependent variable, y through
ŷ=g(x)
where ŷ is an estimate of the dependent variable. In the linear case,
ŷ=xG+b
where GεM×1 is a regression vector and b is an offset. The process of calibration involves the determination of g(?) on the basis of an exemplary set of N paired data points or samples, called the “calibration set”. Each sample consists of a measurement of the independent variable, x, and an associated measurement of a dependent variable, y. The method for designing the structure of g(?) is through the process of system of identification [L. Ljung, Systems Identification: Theory for the User, 2d.ed., Prentice Hall (1999)]. The model parameters are calculated using known methods including multivariate regression or weighted multivariate regression [N. Draper, H. Smith, Applied Regression Analysis, 2d.ed., John Wiley and Sons, New York (1981)], principal component regression [H. Martens, T. Naes, Multivariate Calibration, John Wiley and Sons, New York (1989)], partial least squares regression [P. Geladi, B. Kowalski, Partial least-squares regression: a tutorial, Analytica Chimica Acta, 185, pp.1-17, (1986)], or artificial neural networks [S. Haykin, Neural Networks: A Comprehensive Foundation, Prentice Hall, Upper Saddle River N.J. (1994)].
As indicated above, a primary use of a calibration is for the estimation of a dependent variable on the basis of an independent measurement. In the case of the non-invasive measurement of glucose through near-infrared spectroscopy, the dependent variable is the subject's glucose concentration and the independent variable is a near-infrared spectrum, after suitable processing. However, the use of calibrations is not limited to non-invasive measurement of glucose but, rather, applies to any application in which an indirect measurement of a property value (dependent variable) is required on the basis of more than one independent variable.
The design and collection of the calibration set is of great importance because the performance of the resulting model is intimately linked to the quality of the calibration data [see, for example, T. Isaksson, T. Naes, Selection of samples for calibration in near-infrared spectroscopy. Part I: general prinicples illustrated by example, Applied Spectroscopy, Vol. 43, No. 2, pp. 328-335, 1989 and T. Isaksson and T. Naes, Selection of samples for calibration in near-infrared spectroscopy. Part II: selection based on spectral measurements, Applied Spectroscopy, Vol. 44, No. 7, pp. 1152-1158, 1990]. A minimal requirement is that the data in the calibration set must comprehensively represent the potential variation in x and y. However, this criterion does not guarantee a calibration set will be sufficient. In particular, two significant problems related to the calibration set can adversely effect the determination of g(?). First, individual paired data points can contain errors in x or y as a result of measurement error, poor instrument performance and other anomalies. Such data points, often referred to as “outliers”, should be removed to avoid a poor estimate of g(?).
Second, interfering variables or constituents in x that are present at the time of data collection introduce the potential for unintended or ancillary correlations between the dependent variable and other unrelated variables. If this correlation is manifested in x and is consistent throughout the calibration set, a false calibration will result that fails when this correlation is absent. In the case of noninvasive glucose measurement, the potential for false correlations is a consequence of the complexity of the sample and the measurement process [see M. Arnold, J. Burmeister, G. Small, Phantom glucose calibration models from simulated noninvasive human near-infrared spectra, Analytical Chemistry, vol. 70:9, pp. 1773-1771 (May 1, 1998)]. The multifaceted matrix of blood and tissue constituents introduces the potential for unintended correlations between glucose and other analytes.
In addition, the glucose levels of subjects may move relatively slowly throughout the course of a data collection period and may correspond consistently with other variables such as time, sample order, instrument drift, room temperature, patient skin temperature and skin hydration. Therefore, experimental conditions can lead to spectral aberrations that fortuitously vary consistently with glucose. Models based on data containing fortuitous and spurious correlations between glucose and other variables are erroneous and therefore not suitable for directing insulin therapy in diabetics.
Therefore, the creation of a suitable calibration set is generally performed on the basis of an experimental design and subsequent execution of the experiment [see H. Martens, T. Naes, Multivariate Calibration, John Wiley and Sons, New York (1989)]. However, there are often circumstances that prohibit a comprehensive experimental design and/or involve uncontrollable samples. For example, when the target apparatus involves the measurement of an attribute of a biological system, such as near-infrared measurement of glucose in humans, absolute control of the diversity of factors affecting calibration is difficult. As reported by S. Malin, T. Ruchti, An Intelligent System for Noninvasive Blood Analyte Prediction, U.S. Pat. No. 6,280,381 (Aug. 28, 2001), commonly-owned with the current application, uncontrollable chemical, structural, and physiological variations occur in tissue that produce dramatic and nonlinear changes in the optical properties of the tissue sample.
In such circumstances, an additional step of selecting a suitable subset of calibration data from a larger data is desirable. Several methods have been reported that base the selection of a calibration subset on the basis of the independent variable [see, for example D. E. Honigs, G. M. Hieftje, H. L. Mark and T. B. Hirschfeld, Unique-Sample Selection via Near-Infrared Spectral Subtraction, Analytical Chemistry, Vol. 57, No. 12, pp. 2299-2303, 1985; E. Bouveresse, C. Harmn, D. L. Massart, I. R. Last, and K. A. Prebble, Analytical Chemistry, Vol. 68, pp. 982-990, 1996; and Isaksson, et al., supra (1)]. Such methods fail to make use of the dependent variables and do not guarantee to produce unbiased models. In addition, the problem of fortuitous correlations remains unaddressed in these reports.
J. M. Brown, Method for optimizing multivariate calibrations, U.S. Pat. No. 6,233,133 (Apr. 24, 2001) describes a method for selecting a subset of samples for calibration on the basis of a larger set to minimize the bias in the y-block while ensuring that the x-block range is adequately spanned by the calibration set. However, the method fails to address the problem of ancillary correlations that, in certain applications, are pervasive within the larger data set. In addition, the method of selection is undirected and based upon a fitness function that depends upon the results from a calibration model that is calculated for each potential subset. Consequently, the results may vary significantly on the basis of the method of calibration and the determination of the suitable rank of the calibration model.
Fundamentally, no method has been reported to automatically select calibration samples that minimizes the potential for a calibration model that includes spurious correlations. In addition, no automated process has been designed to enhance the accessibility of the target signal within the calibration set while minimizing the correlation to interfering variables. Finally, no method has been reported that automatically identifies and removes invalid samples from a calibration set.
In view of the problems left unsolved by the prior art, there exists a need for a method to optimize the calibration set in a manner that reduces the likelihood of spurious correlations. Further, it would be beneficial to provide a method of selecting calibration samples that enables the efficient extraction of the target signal. Finally, it would be a significant advancement if the method were automatic and, as part of its operation, removed invalid samples.
The invention provides a process for enhancing a multivariate calibration through improved design of a calibration data set. The process of the invention operates on a large calibration set of samples consisting of measurements, x, and associated reference values, y, to automatically select an optimal or near-optimal sub-set of samples that will lead to the determination or calculation of an improved calibration model. The general methodology for sample selection is based on maximizing the causal variation represented in the measurement, x, over the calibration set that is related to the target variable, y while reducing the correlation between y and unrelated variables. The process is automatic and bases sample selection on two basic criteria:
In addition, the process can be performed with both criteria or with either criterion singly for the improvement of the calibration set.
The method includes two fundamental steps:
Evaluation is a method for assigning a measurement of calibration suitability, herein referred to as a figure of merit, to a particular subset of data. The means of evaluation may take the form of a fitness or cost function that provides a figure of merit associated with a subset of data or with each individual sample. The figure of merit represents (positively) at least the presence of the analytical signal and (negatively) the existence of a correlation to ancillary variables. In addition, the range of the dependent and independent variables, the spectral variation and number of samples can be included in the function.
Subsequently, a method of optimization selects an optimal or near-optimal subset of data as directed by the cost function or figure of merit. The invention provides several examples of an iterative approach to optimization. However, one skilled in the art will recognize that other methods of optimization are applicable including, for example, genetic algorithms [see Goldberg, D. E., Genetic Algorithms in Search, Optimization and Machine Learning, Addison Wesley Publishing Company, 1989].
The following process is explained specifically for application to calibration of a near-infrared spectrometer for noninvasive glucose measurement. However, one skilled in the art will recognize that this process is applicable to any instrument and any target analyte or variable.
While the invention ideally finds application in enhancing and automating the calibration process for the non-invasive measurement of glucose as disclosed in S. Malin, T. Ruchti, An Intelligent System for Noninvasive Blood Analyte Prediction, U.S. Pat. No. 6,280,381 (Aug. 28, 2001); and T. Ruchti, S. Thennadil, T. Blank, A. Lorenz, S. Monfre, Noninvasive measurement of glucose through the optical properties of tissue, PCT Application No. US02/02288, filed on Jan. 25, 2002, the entireties of which are hereby incorporated by reference, one skilled in the art will appreciate that the invention can be applied in any system involving the calculation of multivariate models from empirical data sets.
Evaluation
The invention utilizes either a figure of merit or a cost function for determining the quality of a particular subset of calibration samples. The basic elements of the cost function include:
In addition, the cost function may optionally include any of:
In one implementation, the cost function involves the assignment of a cost or performance measurement for an overall set. For example, the cost function can be written
where J is the cost associated with a particular subset of samples, {dot over (J)}k(x,y) represents the kth cost associated with the sub-set of samples represented by the matrices xs and ys, and αk represents the weight or emphasis associated with the kth figure of merit. In an alternate implementation, an individual cost is computed for each sample that is present within a particular subset according to
where Jp is the cost associated with the pth sample, αk represents the weight associated with the kth cost function, {dot over (J)}k,p(xp,yp) represents the kth sub-cost function associated with sample s, and m represents the total number of sub-cost functions. The overall cost is the summation of the individual figures of merit associated with the samples of a particular subset.
Optimization
Optimization of the calibration set involves the determination of the subset of data that either minimizes a cost function or maximizes a figure of merit. In one embodiment, the process is performed in an iterative manner, beginning with the set of all samples. The cost of each sample or the combination of m samples is used at each iteration to remove one or more samples until a particular level of performance is achieved. Under this framework, hard inclusion and exclusion limits can be set to the final cost of each sample such that they are either included or excluded from the calibration set. Where it is possible for decisions to be made in an ordered manner according to a preset priority, a set of decision rules can be established in a hierarchical framework in which each individual cost or figure of merit is applied separately and in a particular order. Alternately, the calibration subset can begin with one or more samples and the cost function can be used to iteratively determine which sample, or combination of several samples, will lead to an improvement in overall performance. The process of adding one or more samples into the calibration set continues until a desired performance level is achieved.
Finally, the cost function can be optimized by global search techniques such as dynamic programming [Bellman, R. E. Dynamic Programming, Princeton University Press, Princetone, N.J., USA, 1957.], gradient search techniques [Gill, P. E., W. Murray and M. H. Wright, Practical Optimization, Academic Press, 1981.], random search techniques, genetic algorithms [Goldberg, D. E., Genetic Algorithm in Search, Optimization and Machine Learning, Addison Wesley Publishing Company, 1989.], or evolutionary programming [Fogel, D. B. An Introduction to Simulated Evolutionary Optimization, IEEE Trans. On Neural Networks, vol. 5, no. 1, January 1994.]. Given a cost function or method of evaluation as described previously, the sub-set of optimal or near-optimal samples are selected by application of any of these methods.
A genetic algorithm may be employed to select that sub-set from a set of samples that produces the best performance as indicated by the method of evaluation. The process involves the following steps:
The preferred embodiment of the invention evaluates each sample and iteratively removes samples from the data set to enhance the overall performance. This process continues until a particular level of performance is achieved. The process is shown through a block diagram in FIG. 1 and consists of these general steps.
1. Collect Data (101)
Data is collected in paired data points each consisting of a measurement of the independent variable(s), x, and a target or dependent variable, y, from a reference device. For example, x can consist of a set of tissue absorbance spectra collected on a near-infrared spectrometer and y can be the set of corresponding glucose measurements taken on a Yellow Springs Instruments glucose analyzer. The data set, consisting of a N-by-M matrix, {overscore (x)} and an N-by-1 matrix, {overscore (y)}, is termed the “calibration set” and is used in multivariate calibration to determine a model for calculating future analyte values, ŷ, given future device measurements.
2. Calculate Partner Variable (103)
The N-by-P partner variable, {overscore (z)}, is determined from the independent variable, {overscore (x)}, and is either the net analyte signal or is a feature related to the net analyte signal (NAS). The net analyte signal is the part of {overscore (x)} that relates uniquely to {overscore (y)} and is determined through a previously calculated calibration model [see A. Lorber, K. Faber, B. Kowalski, Net analyte signal calculation in multivariate calibration, Analytical Chemistry, vol. 69, pp. 1620-1626 (1997)]. Note that one or more partner variables can be defined when a multiplicity of prior calibrations exist or when more than one feature is required to reflect the net analyte signal. When a prior calibration model exists in the form of a regression vector, {overscore (w)}, then {overscore (z)}={overscore (xw)}. When a calibration does not exist but a feature is known that reflects the net analyte signal present in {overscore (x)} then it is used by itself or in combination with other features as {overscore (z)}. For example, in the application of noninvasive glucose measurement the critical points of the first and second derivative from the measured near-infrared spectrum are valuable features related to the net analyte signal of glucose.
In one embodiment of the invention, the absorbance spectra of water, fat and/or protein are used as features related to the net analyte signal. The features are resolved by calculating the first or second derivative of each spectrum and determining the magnitude at the appropriate wavelengths (e.g. the following wavelengths ±1-5 nm: 1095, 1125, 1155, 1180, 1200, 1308, 1338, 1380, 1414, 1443, 1458, 1550, 1593, 1607, 1625, 1661, 1687, 1706, 1727, 1747, 1783, 1874, 1784, 1910). These wavelengths are disclosed because variations that are consistently related to glucose can be detected through the absorbance spectrum, the first derivative absorbance spectrum and second derivative absorbance spectrum at or near these wavelengths.
Further methods for determining partner analytes in the form of physiological variation that is consistently related to a particular analyte have been previously described in Ruchti, et al., supra.
In the preferred embodiment, the regression vector from another subject, a variety of subjects or a group of subjects is used to determine P estimates of the net analyte signal present in each row of {overscore (x)}.
3. Determine the Interference (102)
The interference is any variable other than the dependent variable that is manifested in the independent variable (after processing and feature extraction) that may correlate with the target variable over time. For noninvasive glucose measurement the following interferences have been identified:
In the preferred embodiment, one or more of the variables above are contained in the vector {overscore (b)} and are identified as the interference.
4. Determine the Inaccessible Portion of the Target Variable (104)
Determine the inaccessible portion of the target variable, {overscore (y)}, by projecting it onto the null space of the partner variables through
nasnoise=∥(I−{overscore (zz)}+){overscore (y)}∥
where 51 is the identity matrix, {overscore (z)}+ is the pseudo inverse of {overscore (z)} given by ({overscore (z)}′{overscore (z)})−1{overscore (z)}′ and ∥?∥ is the norm. The scalar nasnoise is an estimate of the variance in {overscore (y)} that is not represented in {overscore (z)}. Reasons for higher values of nasnoise include poor sample reproducibly, noise and interference.
5. Determine the Accessible Portion of the Analyte (104)
Determine the accessible portion of the analyte, given the interference, according to
nassignal=∥(I−{overscore (b)}{overscore (b)}+){overscore (y)}∥
The scalar nassignal is an estimate of the independence of the analyte values from the interference or the information that can be accessed in the presence of the interference. When spurious correlations are present in the data a net reduction in nassignal is detected.
6. Estimate the Signal-to-Noise Ratio (SNR) (104)
Estimate SNR:
7. Estimate Correlation Coefficient Between nassignal and {overscore (z)} (104)
Estimate R, the correlation coefficient between nassignal and {overscore (z)} (if {overscore (z)} has been more than one column, R is the minimum correlation coefficient between the columns of {overscore (z)} and nassignal).
8. Iteratively Perform Steps 2-7 (105)
Iteratively perform steps 2-7 106 leaving one of the samples out (106) at each iteration. The result will be N-dimensional vectors for {overscore (SNR)}, {overscore (nas)}signal, {overscore (nas)}noise and {overscore (R)} corresponding to the respective calculation when each particular given sample is removed.
9. Select Sample to Remove (107, 108)
The sample to remove is selected as the one that maximizes the N-by-1 goodness measure {overscore (M)} where {overscore (M)} is a combination of one or more of the following:
In the preferred embodiment limits are set for the minimum value of {overscore (nas)}signal and the maximum value for {overscore (nas)}noise, denoted csignal and cnoise respectively. The goodness measure is determined 107 as follows:
where nassignal,k−1 and nasnoise,k−1 refer to the final nassignal and nasnoise from the prior iteration. Alternately, {overscore (M)} is given by
{overscore (M)}=a1{overscore (R)}2+a2{overscore (SNR)}+a3{overscore (nas)}signal
where ak are coefficients. The sample selected for removal corresponds to the element of {overscore (M)} with the maximum value 108). Note that the criteria above can be modified according to the information that is available. When a partner analyte is not available then a2=0 and the criteria above is used to reduce the correlation and magnitude of the interference. Alternately, if measures of interference are not accessible, then a1=0 and a3=0 and SNR is replace with nasnoise.
10. Steps 2-9 Continue (109)
Steps 2-9 continue until the target values for csignal and cnoise are obtained, a desired number of samples are removed the maximum number of samples are removed, or removal of further samples would fail to produce a statistically significant increase in {overscore (M)} through an F-test of max({overscore (M)}) versus max({overscore (M)}k−1).
At the conclusion of the sample selection process, the samples are evaluated to determine if csignal and cnoise have been obtained. If they have, then the measurement data are preprocessed and a calibration is automatically calculated using known methods including any of: linear regression, multiple linear regression, principal component regression, partial least squares regression, artificial neural networks or other method for calibration determination.
Several versions of current embodiment of the invention have been developed by modifying the manner in which samples are left out during for the calculation of {overscore (M)}. In one version, groups of samples are left out at each iteration, either ordered by sample or randomly selected. Alternately, {overscore (M)} can be determined by iteratively removing samples and iteratively putting previously removed samples back into the calculation. In this embodiment samples can be removed and put back into the final calibration set.
Other methods of optimization are readily applicable. For example, M can be calculated for every possible combination of the N samples or every possible combination of j samples where j is the desired number of samples in the final data set. The final calibration set is selected as the one associated with the highest value for M. In applications with many samples genetic algorithms or simulated annealing are applied to optimize M.
Alternate Embodiment
An alternate embodiment, shown in
where X is the variance scaled data points, x is the measured or computed property values, {overscore (x)} is the mean of the x, and σx is the standard deviation of x. The next step is to compute a difference between the two by subtracting one data set from the other 202.
The residuals are compared to specified parameter 203. The last step 204 is then to remove samples with absolute residuals greater or less than a specified parameter in order to increase or decrease the correlation to the target parameter respectively.
To demonstrate this method, three data sets containing 30 values each will be used. The first data set (d1) represents the property value that is being modeled. The second data set (d2) represents an ancillary parameter that is related to the NAS, and the last data set (d3) represents an ancillary parameter that is destructive to the NAS.
While this embodiment details the iterative evaluation and removal of single samples, an alternate implementation involves the iterative evaluation and removal of groups of samples. In this embodiment groups of two or more samples are evaluated together and removed as herein described.
Although the invention has been described herein with reference to certain preferred embodiments, one skilled in the art will readily appreciate that other applications may be substituted for those set forth herein without departing from the spirit and scope of the present invention. Accordingly, the invention should only be limited by the claims included below.
This application claims benefit of U.S. Provisional Patent Application Ser. No. 60/310,033, filed on Aug. 3, 2001.
Number | Name | Date | Kind |
---|---|---|---|
5857462 | Thomas et al. | Jan 1999 | A |
6134458 | Rosenthal | Oct 2000 | A |
6135965 | Tumer et al. | Oct 2000 | A |
6280381 | Malin et al. | Aug 2001 | B1 |
Number | Date | Country | |
---|---|---|---|
20030109998 A1 | Jun 2003 | US |
Number | Date | Country | |
---|---|---|---|
60310033 | Aug 2001 | US |