This invention relates generally to quantitative analysis methods and, in particular, to a method of treating raw data as independent signal sources to reduce computational lag without adversely affecting signal-to-noise ratio (SNR).
A class of quantitative analysis methods involves collection of one or more raw data inputs, which are combined in a linear or nonlinear fashion to produce one or more outputs. For example, one method class is called a weighted sum and has the following equation for conversion of inputs to outputs:
$$R_1 I_1 + R_2 I_2 + R_3 I_3 + \dots + R_n I_n = W \qquad \text{(Equation 1)}$$
where:
$R_n$ equals the response factor, or input weighting, for input n,
$I_n$ equals input n, and
W equals the weighted sum of the inputs, i.e., the output.
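As a minimal sketch of Equation 1, the weighted sum can be written as a short Python function. The function name `weighted_sum` and the numeric values are hypothetical and chosen only for illustration; they do not come from the disclosure.

```python
from typing import Sequence


def weighted_sum(responses: Sequence[float], inputs: Sequence[float]) -> float:
    """Equation 1: W = R_1*I_1 + R_2*I_2 + ... + R_n*I_n."""
    if len(responses) != len(inputs):
        raise ValueError("one response factor is needed per input")
    return sum(r * i for r, i in zip(responses, inputs))


# Hypothetical response factors and inputs, for illustration only.
W = weighted_sum([0.5, 2.0, 1.0], [10.0, 3.0, 4.0])
print(W)  # 0.5*10 + 2*3 + 1*4 = 15.0
```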
These inputs typically have their own statistical characteristics, such as error distribution and mean. In most cases, it is desirable to have a high signal-to-noise ratio in the output, which is normally produced by averaging multiple sequential outputs over time:
$$W_{avg} = \frac{1}{N}\sum_{p=1}^{N} W_p \qquad \text{(Equation 2)}$$
where $W_{avg}$ equals the average output over the time period from point 1 to point N of the individual outputs $W_p$.
This approach, however, has the unwanted side effect of increasing the lag time for the output, because each individual Wp must be measured before the first Wavg is available. While the signal-to-noise ratio of Wavg increases by a factor equal to the square root of N (the number of points averaged), the conflict between increased signal-to-noise ratio and increased lag time is of a general nature and is difficult to overcome.
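The lag introduced by averaging the output can be illustrated with a short sketch. It applies a trailing moving-window average to a noise-free step and reports when the averaged output first reaches the new level. The helper name `moving_average` and the test signal are assumptions made for this illustration.

```python
def moving_average(values, window):
    """Trailing moving-window average; uses fewer points until the window fills."""
    out = []
    for k in range(len(values)):
        start = max(0, k - window + 1)
        chunk = values[start:k + 1]
        out.append(sum(chunk) / len(chunk))
    return out


# Hypothetical output series: the true value steps from 0 to 1 at index 5.
step = [0.0] * 5 + [1.0] * 10
avg = moving_average(step, 4)

# Index at which the averaged output first equals the new level.
settled = next(k for k, v in enumerate(avg) if v == 1.0)
print(avg[5], settled)  # 0.25 at the step, fully settled at index 8
```

With a window of N = 4 points, the averaged output needs N - 1 further measurements after the step before it reports the new value, which is the lag described above.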
If all of the inputs to a calculation are averaged identically, then the lag time will be equal to the time period of an individual measurement times the number of measurements averaged, and will be carried over into the calculation of the new output. When all inputs are averaged equally, this lag time is identical to that which would result if the output were averaged directly.
Many literature and patent references exist with respect to a constant-width moving-window average of output quantities, as well as weighted variants such as linear regression, Savitzky-Golay smoothing, and so forth. However, no work has been found in which the raw inputs to a calculation producing an output are treated independently.
This invention is broadly directed to a method of reducing computational lag without adversely affecting signal-to-noise ratio (SNR) in a system wherein raw data inputs are combined in a linear or nonlinear manner to produce one or more outputs. A plurality of raw data inputs are received at a computer processor, wherein one or more of the data inputs exhibits an inherently high SNR, and one or more of the data inputs exhibits an inherently low SNR. An averaging factor is applied to each data input on an independent basis that is a function of the SNR for that input, and the inputs are combined following the application of the averaging factor to produce one or more outputs. In a preferred embodiment, the computer processor does not average raw data inputs exhibiting the inherently high SNR.
Each data input may represent a plurality of data points having a fixed error distribution and mean, in which case the computer processor may apply a constant averaging factor to each data input as a function of the number of data points for that input. The data inputs may be averaged starting from the most recent value received, working backwards until a desired error is obtained or until a predetermined limiting averaging factor is reached. Alternatively, the averaging is carried out using an adaptive infinite impulse response filter, in which the weight of each new input point added to the running average is determined by the difference between the new input point and the running average.
The data inputs having an inherently low signal-to-noise ratio represent material concentrations that are small or unchanging. In one disclosed example, the data inputs may represent Raman spectra. In particular, the data inputs may be averaged Raman spectra from which peak heights or areas are obtained through integration. Alternatively, the peak heights or areas may be obtained from unaveraged spectra and then averaged before use in further calculations as inputs to produce one or more desired outputs. The output(s) may be linear or nonlinear combinations of the peak heights or areas, coupled with weighting factors which relate the raw inputs to a quantitative output such as concentration of a chemical species.
An alternative approach, which is the subject of this invention disclosure, is to treat each of the inputs as an independent signal source and apply averaging that is specific to each signal source in such a fashion as to produce both a high signal-to-noise and a minimum lag time in the output. This can be accomplished because some inputs have an inherently higher signal-to-noise and require little, if any, averaging. Other inputs may have inherently low signal-to-noise, but because their concentrations are small and/or unchanging, can be averaged with little effect on the overall lag time of the output.
In one embodiment of the invention, the error distribution and mean of each input are assumed to be fixed and thus a constant averaging factor, optimized for each input, can be used. This is often called a “boxcar average” or “moving-window average” and is common in the industry when used on the outputs. The characteristic descriptor of a moving-window average is the size (in points) of the window, i.e., the number of points to be averaged. In this invention, each moving-window average can have a different number of points that is appropriate for the specific input with the resulting goal of optimizing both the precision and lag time of the output.
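A minimal sketch of this embodiment follows: each input receives its own moving-window size before Equation 1 is applied point by point. The function names, window sizes, and data are hypothetical; a window of 1 means the input is not averaged at all.

```python
def moving_average(values, window):
    """Trailing moving-window average; uses fewer points until the window fills."""
    out = []
    for k in range(len(values)):
        start = max(0, k - window + 1)
        chunk = values[start:k + 1]
        out.append(sum(chunk) / len(chunk))
    return out


def averaged_weighted_sum(input_series, responses, windows):
    """Average each input with its own window, then apply Equation 1 pointwise."""
    smoothed = [moving_average(s, w) for s, w in zip(input_series, windows)]
    n = len(input_series[0])
    return [sum(r * s[k] for r, s in zip(responses, smoothed)) for k in range(n)]


# Hypothetical data: a high-SNR major input that steps at index 3,
# and a noisy minor input whose true value is constant at 1.0.
major = [10.0, 10.0, 10.0, 20.0, 20.0, 20.0]
minor = [1.5, 0.5, 1.5, 0.5, 1.5, 0.5]

out = averaged_weighted_sum([major, minor], [1.0, 1.0], windows=[1, 2])
print(out)  # [11.5, 11.0, 11.0, 21.0, 21.0, 21.0]
```

In this example the output follows the step in the major input with no lag, while the noise of the minor input is averaged away once its window fills.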
In another embodiment, the averaging factor can be based on the current error distribution of the signal itself. In this embodiment, the inputs are averaged starting from the most recent value and working backwards until the desired error is obtained, or a predetermined limiting averaging factor is reached. Again, the averaging is optimized for each input. This embodiment would be preferred when the precision of the output is of more importance than the lag time.
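The backward-averaging embodiment might be sketched as follows. The disclosure does not say how the error is estimated, so this sketch assumes the standard error of the mean (sample standard deviation divided by the square root of the number of points). The name `backward_average` and its parameters `target_error` and `max_points` are hypothetical.

```python
import math
import statistics


def backward_average(history, target_error, max_points):
    """Average from the newest point backwards until the estimated error of
    the mean is at or below target_error, or max_points have been used.

    history is ordered oldest first. Returns (average, points_used).
    """
    if not history:
        raise ValueError("history must contain at least one point")
    if max_points < 1:
        raise ValueError("max_points must be at least 1")
    recent = history[-max_points:]
    # A sample standard deviation needs at least two points.
    for k in range(2, len(recent) + 1):
        window = recent[-k:]
        # Assumed error estimate: standard error of the mean.
        sem = statistics.stdev(window) / math.sqrt(k)
        if sem <= target_error:
            return statistics.fmean(window), k
    # Limit reached: fall back to averaging everything allowed.
    return statistics.fmean(recent), len(recent)


# Hypothetical input history, oldest first.
print(backward_average([1.0, 3.0, 1.0, 3.0, 2.0, 2.0], 0.1, 6))  # (2.0, 2)
```

A two-point window can satisfy the error target by chance, so a practical version would likely enforce a larger minimum window; that choice is left open here.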
In a third embodiment, the averaging is done by an adaptive infinite impulse response filter, where the weight of the new input point being added to the running average input is determined by the difference between the new input point and the running average. This is shown in Equation 4.
Where: E is an estimate of the error of the input, such as its standard deviation
LA is the last averaged point, i.e. the running average
CP is the current input point before averaging
D is the absolute difference between LA and CP
This type of average allows lag time to be reduced to zero when the signal is changing very slowly, and at the same time is very good at rejecting sudden movements in an input, such as a spurious signal or spike. When D is about the same as the expected error, about half the weight is given to the new point and half to the running average. When D becomes much smaller than E, the new point essentially becomes the new average. When D becomes much larger than E, the new point is essentially rejected. Algorithm means are used to catch the cases where D is zero or when a permanent step-change occurs that is beyond the normal expected error. Other averaging schemes, such as weighted moving-average, linear regression averaging, and so forth can also be used.
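The behavior described above can be sketched with an assumed weighting. Equation 4 is not reproduced in this text, so the weight w = E / (E + D) used below is an assumption chosen only because it matches the three stated cases (w is about 1/2 when D is about E, w approaches 1 when D is much smaller than E, and w approaches 0 when D is much larger than E). The step-change handling and the parameters `step_factor` and `step_count` are likewise hypothetical.

```python
def adaptive_update(la, cp, e):
    """One update of the running average LA with current point CP.

    The weight w = E / (E + D) is an assumed form, not the source's Equation 4.
    """
    if e <= 0:
        raise ValueError("the error estimate E must be positive")
    d = abs(la - cp)          # D: absolute difference between LA and CP
    w = e / (e + d)           # weight given to the new point
    return w * cp + (1.0 - w) * la


def adaptive_filter(points, e, step_factor=5.0, step_count=3):
    """Run the adaptive average over a series of points.

    Hypothetical step handling: if step_count consecutive points lie more than
    step_factor * E from the running average, accept the new level outright.
    """
    out = []
    la = None
    far = 0
    for cp in points:
        if la is None:
            la = cp
        else:
            far = far + 1 if abs(la - cp) > step_factor * e else 0
            if far >= step_count:
                la, far = cp, 0
            else:
                la = adaptive_update(la, cp, e)
        out.append(la)
    return out


print(adaptive_update(10.0, 11.0, 1.0))                     # D == E -> 10.5
print(adaptive_filter([10.0, 10.0, 50.0, 10.0, 10.0], 1.0))  # spike mostly rejected
```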
Spectroscopy involves generating the raw data, or inputs, as individual points consisting of some measure of light intensity vs. wavelength. For example, in absorbance spectroscopy, the light intensity is expressed as the log of the percent transmittance of light through a sample and the wavelengths may be expressed in nanometers, for near-infrared, or inverse centimeters, also called wavenumbers, for the mid-infrared range. For types of spectroscopy involving scattered light, such as Raman spectroscopy, the light intensity is measured as raw counts from the digitization of a detector signal. For Raman spectroscopy, the wavelength is expressed as a wavenumber shift from the incident light source which stimulated the Raman scattering. An example Raman spectrum is shown in
Enlarging a small region of the spectrum in
Application of Equation 1 to the raw inputs, where each Rn is equal to 1, will give the absolute peak area. Thus the peak area can be considered a type of output for spectroscopic quantitation. It is relatively easy to see that each Rn could be chosen as some number other than 1, such that the sum would be a concentration value instead of a peak area with units of counts.
The reader may notice that the peak shape appears to flatten as one gets closer to the edges of the plot in
In the case where peaks are very large with respect to the background (a high signal-to-noise), the effect of drawing the baseline slightly wrong has little effect on the area of the peak. This is shown in
It should be apparent that in the case of
In spectroscopy there may be an instance where molecule A has a peak close to a peak from molecule B (peak B1) such that the area of peak A always includes some area from peak B1, i.e., the peaks are overlapped (peak AB1). If there is an additional peak for molecule B, e.g., peak B2, this peak area can be used to calculate the correct area of peak A. Equation 1 is applied in such a manner that the R for peak AB1 is positive and the R for peak B2 is negative, which results in the true peak area for A being calculated by subtracting some area from the overlapped peak AB1. This technique is called multiple linear regression and is often used in spectroscopy to quantitate molecular composition when there are no unoverlapped peaks for a particular component.
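The overlap correction can be sketched as a two-term instance of Equation 1. In practice the R values would come from multiple linear regression; here a fixed, known ratio between the areas of peaks B1 and B2 is assumed purely for illustration, and all names and numbers are hypothetical.

```python
def corrected_peak_area(area_ab1, area_b2, b1_to_b2_ratio):
    """Estimate the area of peak A from the overlapped peak AB1.

    Assumes the B1 contribution equals b1_to_b2_ratio times the area of B2,
    so R for AB1 is positive and R for B2 is negative, as in Equation 1.
    """
    r_ab1 = 1.0
    r_b2 = -b1_to_b2_ratio
    return r_ab1 * area_ab1 + r_b2 * area_b2


# Hypothetical areas in counts: AB1 = 130, B2 = 60, and B1 is half the size of B2.
print(corrected_peak_area(130.0, 60.0, 0.5))  # 130 - 0.5*60 = 100.0
```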
Another common example of multiple peaks being used in a calculation is called mass balancing. In this case, the sum of the concentrations of all components in a mixture is known to add to 100%. However, for many reasons, the sum as measured may add to more or less than 100%. A simple correction is to normalize each concentration by dividing by the sum of the concentrations, which results in the new normalized sum adding to unity, or when multiplied by 100, adding to 100% (see Equation 3):
Where Ci is the concentration of an individual component before normalization. Because of the nature of the sum in the denominator of Equation 3, errors from every component are carried through the calculation and affect the error of every resulting normalized concentration. This is true regardless of whether spectra are averaged and then a peak area is calculated or whether peak areas are calculated from unaveraged spectra and then averaged afterward.
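A short sketch of the normalization shows how an error in one component propagates to every normalized value. Equation 3 is not reproduced in this text, so the code follows the verbal description (each concentration divided by the sum of all concentrations, multiplied by 100). The function name and the values are hypothetical.

```python
def normalize(concentrations):
    """Scale concentrations so they sum to 100%."""
    total = sum(concentrations)
    if total == 0:
        raise ValueError("the sum of concentrations must be nonzero")
    return [100.0 * c / total for c in concentrations]


# Hypothetical measured concentrations: one major and two minor components.
clean = normalize([48.0, 1.0, 1.0])
# The same major reading, but one minor component is measured with an error.
noisy = normalize([48.0, 2.0, 1.0])

print(clean)     # [96.0, 2.0, 2.0]
print(noisy[0])  # about 94.1: the major result moves although its input did not
```

Because the sum sits in every denominator, noise in a minor input shifts the normalized major component, which is why averaging the noisy minor inputs can improve the precision of the major output.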
Another example of peak areas being used as inputs for calculation of another quantity is when the quantity is a physical property of the sample. In these cases the assumption is that the physical property of the sample can be related to some combination of the peak areas. Using multiple linear regression, R values are calculated for each peak such that the weighted sum of peak areas equals the physical property of the mixture. For example, in the liquified natural gas (LNG) industry, heating value is calculated by determining the molecular composition and then assigning a heating value to each molecule, such that the total heating value is a weighted sum of each molecule's heating value times that molecule's concentration. Since concentration is simply a weighted sum of peak areas, peak area inputs can be used to determine outputs of physical properties such as heating value.
The technique was applied to spectra obtained from liquified natural gas (LNG) consisting of the approximate composition shown in Table I. Moving-window averaging was used on the spectra and compared with moving-window averaging on the areas from unaveraged spectra. In addition, results are shown for no averaging and the case where all components have the same averaging applied, which is mathematically the same as averaging the output, a practice well-known in the industry.
The peaks which could benefit the most from averaging are those of the four lowest-concentration components: isopentane, pentane, neopentane, and nitrogen.
The technique will be applied to the calculation of mol % methane, which is the major component of the sample.
Next we examine the same four approaches, except in this instance a different time period is displayed. In this time period the composition of liquified natural gas undergoes a sharp step change. The case where all components are averaged equally shows a lag of about 30 minutes (blue line). The other approaches almost perfectly overlap at this scale and show negligible lag (
Looking at a minor component (