This invention relates to methods of analyzing data obtained from instrumental analysis techniques used in analytical chemistry and, in particular, to methods of automatically identifying peaks in liquid chromatograms, gas chromatograms, mass chromatograms or optical or other spectra without input from or intervention of a user.
The various techniques of instrumental analysis used in the broad field of analytical chemistry have been developed and refined primarily over the last century. Many of these techniques—such as the spectroscopic techniques including atomic absorption spectroscopy, atomic emission spectroscopy, UV-visible spectroscopy, infrared spectroscopy, NMR spectroscopy and Raman spectroscopy, among others—involve complex interactions of electromagnetic radiation with samples, possibly containing unknown substances to be identified or characterized. Other techniques, such as liquid chromatography (LC), gas chromatography (GC), mass spectrometry (MS) and the hybrid techniques of liquid chromatography-mass spectrometry (LC-MS or HPLC-MS), gas-chromatography-mass spectrometry (GC-MS) and others involve the detection and possibly identification or characterization of various substances derived from mixtures of substances, possibly including unknown analytes, as these substances are separated from one another in a chromatographic column.
One common feature of all the above-listed instrumental techniques is the capability, in use, of generating possibly complex graphs—the graphs generally referred to as spectra—of detected intensity versus some other controlled or measured physical quantity, such as time, frequency, wavelength or mass. In this document, the terms “spectroscopy” and “spectrum” are used in a fashion so as to include additional analytical chemical techniques and data that are not strictly concerned with measuring or representing analytical spectra in the electromagnetic realm. Such additional techniques and data include, but are not limited to, mass spectrometry, mass spectra, chromatography and chromatograms (including liquid chromatography, high-performance liquid chromatography and gas chromatography, either with or without coupling to mass spectrograph instrumentation).
Atomic spectroscopic techniques may produce, for each detected element, spectra consisting of multiple lines representing absorption or emission of electromagnetic energy by various electronic transitions of the atomized element. Likewise, molecular spectroscopic techniques may produce spectra of multiple lines or complexly shaped bands representing absorption or emission of electromagnetic energy by various transitions of molecules among or between various excited and/or ground energy states, such energy states possibly being electronic, vibrational or rotational, depending upon the technique employed.
Still further, mass spectrometry techniques may produce complex spectra consisting of several detected peaks, each such peak representing detection of an ion of a particular mass unit. In mass analysis mode, several peaks, of different m/z values (where m represents mass and z represents charge), may be produced as for each ionized species, as a result of both isotopic variation and differing charges. In the various chromatographic techniques, including those techniques (for instance, GC-MS or LC-MS) in which eluting substances are detected by MS as well as those techniques in which detection is by optical spectroscopy, various possibly overlapping peaks of Gaussian or other skewed shapes may be produced as a function of time as the various substances are eluted.
Traditionally, analytical spectroscopy instrumentation has found its greatest use in specialized research or clinical laboratories in which instrument operation and data analysis is performed by personnel who are highly trained and or experienced in the operation and data collection of the particular employed instruments. However, as the use of analytical spectroscopy instrumentation has expanded, in recent years, from specialized research laboratory environments to general industrial, clinical or even public environments for, for instance, high-throughput screening, there has emerged a need to make instrument operation and data collection and interpretation accessible to less highly trained or experienced users. Thus, there is a need for instrument firmware and software to fulfill greater roles in instrument control and data collection, analysis and presentation so as to render overall turnkey high-throughput operation, with minimal user input or intervention.
Historically, in traditional instrumental analysis situations, collected data is analyzed offline with the aid of specialized software. A first step in conventional data analysis procedures is peak picking, so as to identify and quantify spectral peaks. Such chromatographic or spectroscopic peak detection is one of the most important functions performed by any data analysis system. Typically, chromatographic or spectroscopic peak detection software has employed various user-settable parameters, allowing the operator to provide input in the form of initial guesses for peak locations and intensities and subsequently, to “optimize” the results, after execution of some form of fitting routine that employs the operator's guesses as a starting point. Existing methods of peak detection have many adjustable parameters, requiring operator skill and patience in arriving at an acceptable result. Novice or untrained operators will very likely get an incorrect result or no result at all. This typically results in a very time-consuming process, and the “tweaking” by or inexperience of the user often results in data that is not reproducible and suspect. Further, such traditional forms of peak identification are not suitable for high-throughput industrial process monitoring or clinical or other chemical screening operations, in which there may be a requirement to analyze many hundreds or even thousands of samples per day.
Another critical feature in peak detection is integration of the peak area. With regards to many spectra, the area under a resolved peak is proportional to the number of molecules of a particular analyte. Therefore, assessment of the relative abundances of analytes in a sample requires accurate, robust algorithms for peak integration. Prior attempts at providing automated methods (that is, without input of peak parameters by a user or operator) of peak area calculation generally employ algorithms that calculate the area under the graph of the raw spectral data (e.g., by the trapezoidal method of numerical integration) and, as such, may have multiple or inconsistent criteria to determine how far to extend the numerical integration along the flanks of peaks. Also, such prior numerical integration methods handle overlapped peaks poorly, if at all.
From the foregoing discussion, there is a need in the art for reproducible methods of automated detection, location and area calculation of peaks that do not require initial parameter input or other intervention by a user or operator. The present invention addresses such a need.
Embodiments in accordance with the present invention may address the above-noted needs in the art by providing methods and computer program products for identifying peaks in spectral data that do not require parameter input or intervention by users or instrument operators. Methods in accordance with the present invention do not make a priori assumptions about the particular line shape of the chromatographic or spectroscopic peak(s) and may fit any individual peak to either a Gaussian, exponentially modified Gaussian, Gamma distribution or other form of shape. By not exposing any adjustable parameters to users, methods in accordance with the invention avoid the problems discussed above, and lend themselves to automated analysis. Further, methods in accordance with the invention avoid all the aforementioned problems associated with peak area integration in prior art automated analyses, since the peak area is known from the peak fitting parameters. The present methods may add together multiple Gaussian shapes to yield a final (complex) peak shape or can cleanly separate overlapping peaks that come from different components. Thus, overlapped peaks are easily handled and the integrals computed from the Gaussian (or other) widths and intensities.
According to first aspect of the invention, there is provided a method of automatically identifying and characterizing spectral peaks of a spectrum generated by a chromatography/mass spectrometry apparatus wherein the spectrum includes abundance data in terms of a time-related variable and a mass-related variable, characterized by the steps of:
(a) receiving a portion of the spectrum generated by the chromatography/mass spectrometry apparatus;
(b) automatically subtracting a baseline from the portion of the spectrum so as to generate a baseline-corrected spectrum portion;
(c) automatically determining whether at least a first spectral peak occurs in the baseline-corrected spectrum portion;
(d) if a first spectral peak occurs in the baseline-corrected spectrum portion, automatically determining location coordinates of the first spectral peak in terms of the time-related variable and the mass-related variable; and
(e) if the first spectral peak occurs in the baseline-corrected spectrum portion, reporting to a user or recording to electronic data storage location coordinates of the first spectral peak of the spectrum portion.
According to another aspect of the invention, there is provided a method of automatically identifying and characterizing spectral peaks of a spectrum generated by a chromatography/mass spectrometry apparatus wherein the spectrum includes abundance data in terms of a time-related variable and a mass-related variable, characterized by the steps of:
(a) receiving a region of the spectrum generated by the chromatography/mass spectrometry apparatus;
(b) setting a remaining region of the spectrum equal to the region of the spectrum;
(c) automatically subtracting a baseline from the spectrum so as to generate a baseline-corrected spectrum;
(d) dividing each remaining region of the spectrum into sub-regions;
(e) determining, for each sub-region, whether a respective first sub-region spectral peak occurs within the respective sub-region;
(f) automatically calculating and recording, for each sub-region for which a respective first sub-region spectral peak occurs, location coordinates of the respective first sub-region spectral peak in terms of the time-related variable and the mass-related variable;
(g) for each sub-region for which a respective first sub-region spectral peak occurs, automatically removing said respective first sub-region spectral peak and automatically determining whether a respective second sub-region spectral peak occurs within said sub-region;
(h) setting a respective new remaining region of the spectrum equal to each sub-region for which the respective second sub-region spectral peak occurs; and
(i) repeating steps (d) through (h), if any new remaining regions were set in the prior step (h).
According to another aspect of the invention, there is provided a method of automatically identifying and characterizing spectral peaks of a spectrum generated by a chromatography/mass spectrometry apparatus wherein the spectrum includes abundance data in terms of a time-related variable and a mass-related variable, characterized by the steps of:
(a) receiving a region of the spectrum generated by the chromatography/mass spectrometry apparatus;
(b) setting a remaining region of the spectrum equal to the region of the spectrum;
(c) automatically subtracting a baseline from the spectrum so as to generate a baseline-corrected spectrum;
(d) dividing each remaining region of the spectrum for which a width is greater than an instrument resolution into sub-regions by dividing a range of only a first one of the time-related variable and the mass-related variable;
(e) automatically calculating, for each sub-region formed in the most recent execution of step (d), a respective summed spectrum by calculating a sum over the first one of the time-related variable and the mass-related variable for each value of the second one of the time-related variable and the mass-related variable;
(f) determining, for each sub-region formed in the most recent execution of step (d), whether a respective first sub-region spectral peak occurs within the respective summed spectrum;
(g) setting a new respective remaining region of the spectrum equal to each sub-region formed in the most recent execution of step (d) for which a respective sub-region spectral peak occurs;
(h) repeating steps (d) through (g), if any new remaining regions were set in the prior step (h); and
(i) automatically detecting and calculating location coordinates of at least one spectral peak in the respective summed spectrum corresponding to each remaining region.
In embodiments, baseline model curve parameters are neither input by nor exposed to the user prior to the reporting step. In embodiments, peak model curve parameters are neither input by nor exposed to the user prior to the reporting step. Some embodiments may include automatically determining, for at least one chromatographic peak, a peak model curve that provides the best fit to said at least one chromatographic peak from among the group consisting of Gaussian distributions, Gamma distributions and exponentially modified Gaussian distributions. Further, various embodiments may include a step of reporting to a user or recording to electronic data storage a width, an intensity or a skew of a spectral peak; dividing a baseline-corrected spectrum portion into quadrants or the additional steps of receiving another portion of the spectrum, said other spectrum portion generated by the chromatography/mass spectrometry apparatus concurrently with the processing of a first portion; and, then, performing processing steps in conjunction with the other spectrum portion.
According to another aspect of the invention, there is a provided an apparatus characterized by:
(i) a chromatograph configured to receive a mixture of substances and to separate the substances;
(ii) a mass spectrometer configured to receive the separated substances from the chromatograph at respective times, produce a plurality of ion types from each of the substances and separate the ion types produced from each of the substances according to their respective mass-to-charge ratios;
(iii) a detector configured to receive and detect each of the separated ion types and to generate spectra comprising ion abundance data in terms of a time-related variable and a mass-related variable;
(iv) a programmable processor electrically coupled to the detector and configured to:
(a) receive a first portion of a spectrum generated by the detector;
(b) automatically subtract a baseline from the first portion of the spectrum so as to generate a baseline-corrected spectrum portion;
(c) automatically determine whether a first spectral peak occurs in the baseline-corrected spectrum portion; and
(d) if the first spectral peak occurs in the baseline-corrected spectrum portion, automatically determine location coordinates of the first spectral peak in terms of the time-related variable and the mass-related variable; and
(v) either an electronic data storage device or an information output device electrically coupled to the programmable processor configured so as to report or record location coordinates of a spectral peak of the spectrum portion.
The above noted and various other aspects of the present invention will become apparent from the following description which is given by way of example only and with reference to the accompanying drawings, not drawn to scale, in which:
The present invention provides methods of automated spectral peak detection and quantification that do not require any user input or intervention. The methods described herein can accommodate and model all types of spectral data, where the term “spectral data” is broadly defined as described above, and provide robust automatic detection, integration and reporting of spectral peaks. Any and even all model parameters utilized in these methods may be adaptively determined in a manner that is invisible to the user. The following description is presented to enable any person skilled in the art to make and use the invention, and is provided in the context of a particular application and its requirements. Various modifications to the described embodiments will be readily apparent to those skilled in the art and the generic principles herein may be applied to other embodiments. Thus, the present invention is not intended to be limited to the embodiments and examples shown but is to be accorded the widest possible scope in accordance with the features and principles shown and described. The particular features and advantages of the invention will become more apparent with reference to the appended
In embodiments of methods in accordance with the instant invention, the various steps may be grouped into an input step, three basic stages of data processing, each stage possibly comprising several steps, and a reporting step as outlined in the method 100 as illustrated in
The next step, step 120, of the method 100 is a preprocessing stage in which baseline features may be removed from the received spectrum and in which a level of random “noise” of the spectrum may be estimated, this step being described in greater detail in subsequent
Finally, in step 190, the parameters of the final model peaks are reported to a user. The reporting may be performed in numerous alternative ways—for instance via a visual display terminal, a paper printout, or, indirectly, by outputting the parameter information to a database on a storage medium for later retrieval by a user. The reporting step may include reporting either textual or graphical information, or both. This reporting step 190 may include the additional actions of comparing peak parameters (for instance, peak position) to a database and reporting, to a user, the identities of analytes that correspond to one or more peaks. Some methods of the invention may further include, in step 190, the action of extracting, from the model spectral parameters, information related to or inferred to be related to the physical functioning or operational state or an operational parameter of an analytical instrument that provided the spectral data and reporting such instrument-related information to a user.
The term “model” and its derivatives, as used herein, may refer to either statistically finding a best fit synthetic peak or, alternatively, to calculating a synthetic peak that exactly passes through a limited number of given points. The term “fit” and its derivatives refer to statistical fitting so as to find a best-fit (possibly within certain restrictions) synthetic peak such as is commonly done by least squares analysis. Note that the method of least squares (minimizing the chi-squared metric) is the maximum likelihood solution for additive white Gaussian noise. In other situations (e.g., photon-counting), it might be appropriate to minimize a different error metric, as directed by the maximum likelihood criterion. More detailed discussion of individual method steps and alternative methods is provided in the following discussion and associated figures.
A feature of a first, pre-processing stage of the new methods of peak detection takes note of the concept that (disregarding, for the moment, any chemical or electronic noise) a spectroscopic signal (such as, for instance, a chromatogram which is a signal obtained versus time) consists of signal plus baseline. If one can subtract the baseline correctly, everything that remains must be signal, and should be fitted to some sort of data peak.
Therefore, embodiments in accordance with the present invention may start by determining the correct baseline. Steps in the methods may apply a polynomial curve as the baseline curve, and measure the residual (the difference between the chromatographic data and the computed baseline) as a function of polynomial order. For instance,
To locate the plateau region 302 as indicated in
Once it is found that ΔSSR less than the pre-defined percentage of the reference value for c iterations, then one of the most recent polynomial orders (for instance, the lowest order of the previous four) is chosen as the correct polynomial order. The subtraction of the polynomial with the chosen order yields a preliminary baseline corrected spectrum, which may perhaps be subsequently finalized by subtracting exponential functions that are fit to the end regions.
Returning, now, to the discussion of method 120 shown in
From step 122, the method 120 proceeds to a step 124, which is the first step in a loop. The step 124 comprises fitting a polynomial of the current order (that is, determining the best fit polynomial of the current order) to the raw spectrum by the well-known technique of minimization of a sum of squared residuals (SSR). The SSR as a function of n, SSR(n) is stored at each iteration for comparison with the results of other iterations.
From step 124, the method 120 proceeds to a decision step 126 in which, if the current polynomial order n is greater than zero, then execution of the method is directed to step 128 in order to calculate and store the difference of SSR, ΔSSR(n), relative to its value in the iteration just prior. In other words, ΔSSR(n)=SSR(n)−SSR(n−1). The value of ΔSSR(n) may be taken a measure of the improvement in baseline fit as the order of the baseline fitting polynomial is incremented to n.
The iterative loop defined by all steps from step 124 through step 132, inclusive, proceeds until SSR changes, from iteration to iteration, by less than some pre-defined percentage, t %, of the reference value for a pre-defined integer number, c, of consecutive iterations. Thus, the number of completed iterations, integer n, is compared to c in step 130. If n≧c, then the method branches to step 132, in which the last c values of ΔSSR(n) are compared to the reference value. However, in the alternative situation (n<c), there are necessarily fewer than c recorded values of ΔSSR(n), and step 132 is bypassed, with execution being directed to step 134, in which the integer n is incremented by one.
The sequence of steps from step 124 up to step 132 (going through step 128, as appropriate) is repeated until it is determined, in step 132, that the there have been c consecutive iterations in which the SSR value has changed by less than t % of the reference value. At this point, the polynomial portion of baseline correction is completed and the method branches to step 136, in which the final polynomial order is set and a polynomial of such order is subtracted from the raw spectrum to yield a preliminary baseline-corrected spectrum.
The polynomial baseline correction is referred to as “preliminary” since, as the inventors have observed, edge effects may cause the polynomial baseline fit to be inadequate at the ends of the data, even though the central region of the data may be well fit.
At this point, after the application of the steps outlined above, the baseline is fully removed from the data and the features that remain within the spectrum above the noise level may be assumed to be analyte signals. The methods described in
The method 150, as shown in
The first step 502 of method 150 comprises locating the most intense peak in the final baseline-corrected spectrum and setting a program variable, current greatest peak, to the peak so located. In this discussion, the terms “peak” or “spectrum” are used to refer to curves (that is, either an array of x,y Cartesian coordinate pairs or, in reference to a synthetic peak, possibly a function y=f(x)) that may be considered as sub-spectra (and which may be an entire spectrum) and which may be defined on a certain subset (which may be the full set) of the available range of x-axis data. The variable x may represent time, wavelength, etc. and y generally, but not necessarily, represents intensity. Accordingly, it is to be kept in mind that, as used in this discussion, the acts of locating a peak or spectrum, setting or defining a peak or spectrum, performing algebraic operations on a peak or spectrum, etc. implicitly involve either point-wise operations on sets of points or involve operations on functional representations of sets of points. Thus, for instance, the operation of locating the most intense peak in step 502 involves locating all points in the vicinity of the most intense point that are above a presumed noise level, under the proviso that the total number of points defining a peak must be greater than or equal to four. Also, the operation of “setting” a program variable, current greatest peak, comprises storing the data of the most intense peak as an array of data points.
From step 502, the method 150 proceeds to second initialization step 506 in which another program variable, “difference spectrum” is set to be equal to the final baseline-corrected spectrum (see step 140 of method 120,
Subsequently, the method 150 enters a loop at step 508, in which initial estimates are made of the coordinates of the peak maximum point and of the left and right half-height points for the current greatest peak and in which peak skew, S is calculated. The method of estimating these co-ordinates is schematically illustrated in
In steps 509 and 510, the peak skew, S, may be used to determine a particular form (or shape) of synthetic curve (in particular, a distribution function) that will be subsequently used to model the current greatest peak. Thus, in step 509, if S<(1−ε), where ε is some pre-defined positive number, such as, for instance, ε=0.05, then the method 150 branches to step 515 in which the current greatest peak is modeled as a sum of two Gaussian distribution functions (in other words, two Gaussian lines). Otherwise, in step 510, if S≦(1+ε), then the method 150 branches to step 511 in which a (single) Gaussian distribution function is used as the model peak form with regard to the current greatest peak. Otherwise, the method 150 branches to step 512, in which either a gamma distribution function or an exponentially modified Gaussian (EMG) or some other form of distribution function is used as the model peak form. A non-linear optimization method such as the Marquardt-Levenberg Algorithm (MLA) or, alternatively, the Newton-Raphson algorithm may be used to determine the best fit using any particular line shape. After either step 511, step 512 or step 515, the synthetic peak resulting from the modeling of the current greatest peak is removed from the spectral data (that is, subtracted from the current version of the “difference spectrum”) so as to yield a “trial difference spectrum” in step 516. Additional details of the gamma and EMG distribution functions and a method of choosing between them are discussed in greater detail, partially with reference to
Occasionally, the synthetic curve representing the statistical overall best-fit to a given spectral peak will lie above the actual peak data within certain regions of the peak. Subtraction of the synthetic best fit curve from the data will then necessarily introduce a “negative” peak artifact into the difference spectrum at those regions. Such artifacts result purely from the statistical nature of the fitting process and, once introduced into the difference spectrum, can never be subtracted by removing further positive peaks. However, physical constraints generally require that all peaks should be positive features. Therefore, an optional adjustment step is provided as step 518 in which the synthetic peak parameters are adjusted so as to minimize or eliminate such artifacts.
In step 518 (
In step 524, the root-of-the-mean squared values (root-mean-square or RMS) of the difference spectrum is calculated. The ratio of this RMS value to the intensity of the most recently synthesized peak may be taken as a measure of the signal-to-noise (SNR) ratio of any possibly remaining peaks. As peaks continue to be removed (that is, as synthetic fit peaks are subtracted in each iteration of the loop), the RMS value of the difference spectrum approaches the RMS value of the noise. As each tentative peak is found, its maximum intensity, I, is compared to the current RMS value, and if I<(RMS×ξ) where ξ is a certain pre-defined noise threshold value, greater than or equal to unity, then further peak detection is terminated. Thus, the loop termination decision step 526 utilizes such a comparison to determine if any peaks of significant intensity remain distinguishable above the system noise. If there are no remaining significant peaks present in the difference spectrum, then the method 150 branches to the final reporting step 527. However, if data peaks are still present in the residual spectrum, the calculated RMS value will be larger than is appropriate for random noise and at least one more peak must be fitted and removed from the residual spectrum. In this situation, the method 150 branches to step 528 in which the most intense peak in the current difference spectrum is located and then to step 530 in which the program variable, current greatest peak, is set to the most intense peak located in step 528. The method then loops back to step 508, as indicated in
Now that the overall set of steps in the method 150 have been described, the process that is used to model individual spectral features is now discussed in greater detail. Traditional spectral peak fitting routines generally model spectral features using either a Gaussian or Lorentzian forms (commonly referred to as peak shapes or line shapes) and tend to either use one preferred line shape throughout the fitting procedure or to query a user as to which line shape to use. Although any arbitrary peak shape can be modeled with a sum of Gaussians (perhaps requiring some Gaussians with negative intensities), the inventors have observed that commonly occurring natural peak shapes (especially in chromatographic spectral data) include Gaussians or even Gamma-distribution-like functions with tailing or leading edges. Therefore, methods in accordance with the present invention may employ a library of peak shapes containing at least four curves (and possibly others) to model observed peaks: a Gaussian for peaks that are nearly symmetric; a sum of two Gaussians for peaks that have a leading edge (negative skewness); a and either an exponentially modified Gaussian or a Gamma distribution function for peaks that have a tailing edge (positive skewness).
The modeling of spectral peaks with Gaussian line shapes is well known and will not be described in great detail here. Methods in accordance with the invention may use a Gaussian functional form that utilizes exactly three parameters for its complete description, these parameters usually being taken as area A, mean μ a and variance σ2 in the defining equation:
in which x is the variable of spectral dispersion (generally the independent variable or abscissa of an experiment or spectral plot) such as wavelength, frequency, or time and I is the spectral ordinate or measured or dependent variable, possibly dimensionless, such as intensity, counts, absorbance, detector current, voltage, etc. Note that a normalized Gaussian distribution (having a cumulative area of unity and only two parameters—mean and variance) would model, for instance, the probability density of the elution time of a single molecule. In the three-parameter model given in Eq. 1, the scale factor A may be taken as the number of analyte molecules contributing to a peak multiplied by a response factor.
As is known, the functional form of Eq. 1 produces a symmetric line shape (skew, S, equal to unity) and, thus, step 511 in the method 150 (
Alternatively, the fit may be mathematically anchored to the three points shown in
If S>(1+ε), then the data peak is skewed so as to have an elongated tail on the right-hand side. This type of peak may be well modeled using either a line shape based on either the Gamma distribution function or on an exponentially modified Gaussian (EMG) distribution function. Examples of peaks that are skewed in this fashion (all of which are synthetically derived Gamma distributions) are shown in
The general form of the Gamma distribution function, as used herein, is given by:
in which the dependent and independent variables are x and I, respectively, as previously defined, Γ(M) is the Gamma function, defined by
and are A, x0, M and r are parameters, the values of which are calculated by methods of the invention. Note that references often provide this in a “normalized” form (i.e., a probability density function), in which the total area under the curve is unity and which has only three parameters. However, as noted previously herein, the peak area parameter A may be taken as corresponding to the number of analyte molecules contributing to the peak multiplied by a response factor.
The inventors consider that a chromatographic peak of a single analyte exhibiting peak tailing may be modeled by a four-parameter Gamma distribution function, wherein the parameters may be inferred to have relevance with regard to physical interaction between the analyte and the chromatographic column. In this case, the Gamma function may be written as:
in which t is retention time (the independent variable), A is peak area, t0 is lag time and M is the mixing number. Note that if M is a positive integer then Γ(M)=(M−1)! and the distribution function given above reduces to the Erlang distribution. The adjustable parameters in the above are A, t0, M and r.
Continuing with the discussion of
With the assumptions given above, the total retention time tR of an analyte is a random variable given by the expression tR=L/v+Σm=1M λm in which the summation is taken over a total of M independent exponentially distributed random variables. If M is an integer, then the summation shown in the above equation yields an Erlang-distributed random variable. In fact, the value of M would be Poisson distributed in a population of molecules, so the retention time would be modeled by the superposition of multiple Erlang random variables. A simple closed-form approximation can be constructed by replacing the distribution of values of M with a constant value that may be loosely interpreted as the mean value of M. The generalization of the Erlang distribution to real-valued M is the Gamma distribution (Eq. 2).
The Gamma distribution model as derived above does not specifically account for chemical diffusion. The presence of diffusion is accommodated by values of M which are, in fact, larger than the average number of desorption events. A different model, the exponentially modified Gaussian (EMG) distribution function, may be used to model the retention time as the outcome of one desorption event and the time required for an analyte molecule to diffuse to the end of the column.
The general, four-parameter form of the exponentially modified Gaussian (EMG) distribution, as used in methods according to the present invention, is given by a function of the form:
Thus, the EMG distribution used herein is defined as the convolution of an exponential distribution with a Gaussian distribution. In the above Eq. 3, the independent and dependent variables are x and I, as previously defined and the parameters are A, t0, σ2, and τ. The parameter A is the area under the curve and is proportional to analyte concentration and the parameters t0 and σ2 are the centroid and variance of the Gaussian function that modifies an exponential decay function.
The inventors consider that an exponentially-modified Gaussian distribution function of the form of Eq. 3 may be used to model some chromatographic peaks exhibiting peak tailing. In this situation, the general variable x is replaced by the specific variable time t and the parameter x0 is replaced by t0. The exponential portion may be taken to indicate a hypothetical distribution of residence times of analyte molecules on the stationary phase for a single (or small number of) of adsorption events per molecule and the Gaussian portion may be taken to represent the superimposed effects of diffusion in the mobile phase. The existence of an EMG-distribution best fit, as opposed to a Gamma-function best fit, may be taken to indicate a chromatic separation in which the analyte has lesser tendency to bind to the stationary phase.
When method 512 is entered from step 510 (see
From step 808, the method 512 (
Alternatively, the fit may be mathematically anchored to the three points shown in
The determination of the best fit peak from among several potential line shapes as discussed above with reference to
Returning, once again, to the method 100 as shown in
The refinement process continues until a halting condition is reached. The halting condition can be specified in terms of a fixed number of iterations, a computational time limit, a threshold on the magnitude of the first-derivative vector (which is ideally zero at convergence), and/or a threshold on the magnitude of the change in the magnitude of the parameter vector. Preferably, there may also be a “safety valve” limit on the number of iterations to guard against non-convergence to a solution. As is the case for other parameters and conditions of methods of the invention, this halting condition is chosen during algorithm design and development and not exposed to the user, in order to preserve the automatic nature of the processing. At the end of refinement, the set of values of each peak area along with a time identifier (either the centroid or the intensity maximum) is returned. The entire process is fully automated with no user intervention required.
Finally, the last step, step 190, in the method 100 (
The examples in the foregoing discussion have illustrated the synthesis of a model peak for each detected actual peak and the calculation of a variety of parameters (e.g., center, width, skew, intensity, etc.) for each such model peak. Thus, execution of a non-linear optimization method such as the Marquardt-Levenberg Algorithm (MLA) or the Newton-Raphson algorithm is performed for each peak. This step is frequently the slowest step in the above-described methods. In order to increase the speed of the calculations, it is possible, for some types of spectral data, to restrict full execution of the optimization routine to just the first detected peak and then to re-use, in the fitting of each subsequent peak, the peak shape parameters (e.g., width, skew or type of model function) determined from the first peak. In embodiments employing this strategy, only the parameters relating to the location and intensity of each peak after the first are determined or estimated for the second and all subsequent peaks. For instance, if the first peak is fit with a Gaussian (Eq. 1) and if it may be assumed that neither the Gaussian character nor the parameter σ vary significantly among peaks, then, for each peak after the first, the decision steps 509 and 510 in method 150 may be bypassed and only the parameters A and μ need to be determined. These latter two parameters may perhaps be calculated for each peak by a reduced or constrained version of the optimization algorithm, thereby increasing execution speed.
The foregoing description has mainly dealt with characterizing peaks in spectral data comprising just two variables (e.g., situations in which the spectral data may be plotted as a simple graph). Frequently, however, routine spectral data include three or more variables. One important and common example of such data is the measurement of chromatography-mass spectrometry data—for instance, as derived in the well-known techniques of liquid chromatography-mass spectrometry (LC/MS) and gas chromatography-mass spectrometry (LC/MS). In these latter two analytical techniques, various separated analytes whose elution from a chromatographic column is a function of the variable, time, are submitted for mass analysis by a mass spectrometer. The mass spectrometer, which is used as the detector for the chromatographic column, generates, in real time, detected relative ion abundance data for ions produced from each eluting analyte, in turn. Thus, such data is inherently three-dimensional, comprising the variables of time, mass (or mass-to-charge ratio) and ion abundance.
The data depicted in
For clarity, only a very small number of peaks are illustrated in
Although a full set of GC/MS or LC/MS data may be stored for post-experiment off-line analysis, the great speed at which data is acquired generally only allows a limited subset of the data to be available to an operator or to instrumental monitoring or control systems during the course of the experiment. The following discussion, together with the accompanying
The methods described below generally include performing a search, such as a binary search, of a range of the mass and time dimensions of a chromatographic-mass spectral data set in order to locate and optimize the unique mass and time ranges defining each peak, and subsequent utilization of the peak detection methods, as described above, to provide further detailed analyses of peaks within one or more of the mass ranges. The functionality and efficiency of the methods described above are increased, in some embodiments in accordance with the invention, by an additional procedure whereby an initial estimate of the number of peaks (e.g, determining whether there are no peaks, a single peak, or a plurality of peaks) in a mass and time range of the data set is made, this estimation being approximately ten times faster than actually detecting these peaks according to the methods described above.
Estimating peaks works as follows. First, baseline removal and noise level estimation proceeds as described above (e.g., see
For calculation speed, the estimation of the number of peaks within the given mass and time range may be simplified so as to provide one of only three outcomes—whether no peaks, a single peak or a plurality of peaks occur above the signal-to-noise level within the given range. If there are no peaks (above the signal-to-noise level) within the given range, then that range of data is excluded from further consideration and data processing proceeds to consider another subdivision of the full data range (if any other sub-divisions exist). If the peak number estimation procedure detects a first peak within the given range, then that peak is modeled by a synthetic fit peak as in method 150 (
In the next step 1104 of the method 1100, a baseline is automatically subtracted from the portion of the spectrum so as to generate a baseline-corrected spectrum portion. The baseline may be determined by the procedure outlined in
Optionally, the chromatography/mass spectrometry apparatus may be (see step 1103 of method 1100) generating another portion of the spectrum concurrently with the execution of steps 1102 through 1110. Referring once again to
The peak detection and number estimation routine investigates the data in each quadrant, in turn. This routine notes that peak p7 is the most intense peak in quadrant Q1, peak p8 is the most intense peak in quadrant Q2, peak p9 is the most intense peak in quadrant Q3, and peak p11 is the most intense peak in quadrant Q4. Accordingly, each of these four peaks is fit with a respective synthetic peak as in method 150 (
In the second pass through the data, the peak detection and number estimation routine investigates the data in each of the sixteen sub-quadrants shown in
In a third pass through the data, the two remaining sub-quadrants Q32 and Q42 are further subdivided as shown in
The calculations described above result in just two remaining peaks, p15 and p16, as shown in
The peak detection, enumeration and fitting routines described above for GC/MS or LC/MS or other multidimensional data improve calculation efficiency and speed by performing a structured peak search that rapidly identifies and eliminates from consideration data regions within which no peaks occur. To increase calculation speed still further, it is possible, for some types of spectral data, to restrict full execution of the optimization routine to just the first detected peak and then to re-use, in the fitting of each subsequent peak, the peak shape parameters (e.g., width, skew or type of model function) determined from the first peak. In embodiments employing this strategy, only the parameters relating to the location and intensity of each peak after the first are determined or estimated for the second and all subsequent peaks.
The model fitting of peaks which are functions of two variables may, in many situations, be modeled as simply the product of two functions, each such function of one variable only. For instance, for the GC/MS data illustrated in
In step 1160 of the method 1150, location coordinate parameters, which are determined as part of the above-described peak fitting procedure, are recorded and, in step 1162, the synthetic best-fit synthetic peak in each sub-region is subtracted from the baseline-corrected spectrum portion. This is referred to herein as “removing” the peak from the spectrum. This removal is performed for every first peak (if any) found in a respective sub-region. Subsequently, in step 1164, a determination is made, for each such sub-region for which a first peak was found, if another peak occurs within the respective sub-region.
In step 1166 of the method 1150, a set of remaining regions of the spectrum portion is defined as only those sub-regions created in the most recent execution of step 1154 for which a second peak was determined to occur in the most recent execution of step 1164 and for which time-related and mass-related ranges are greater than resolution of the apparatus. This definition of the set of all remaining regions effectively excludes, from further consideration, all regions or sub-regions within which no peaks occur or for which all peaks have been removed (e.g., see shaded regions of
Alternatively, the model fitting of multidimensional data may, in many situations, be simplified by considering the data as being a function of a reduced number of independent variables—for instance, a single independent variable. As one example,
When the data is fitted as a function of only one variable, then, in contrast to the quadrant divisional search scheme shown in
During each pass, a first peak of the projected data is fit as in method 150 (
After the approximate fitting of the projected data of each sub-region, each remaining non-empty sub-region is divided in half once again. For instance,
The approximate fitting routine described above will effectively locate sub-regions that do not contain any peaks above the signal-to-noise level (such as occurs for sub-region R22) as well as large contiguous non-peak-bearing time ranges within individual sub-regions. The data of such empty sub-regions (such as the sub-regions R22, R112, R115, R122, R125, R212, and R215, see
The iterative process described above halts naturally when the size of the mass range is equal to the mass resolution of the data. Alternatively, the iterative process described above may be continued until identical peak lists (quantity, as well as identical widths, positions, and SIN ratios) are produced for both newly created halves of a region or sub-region that has recently been divided according to mass. The m/z coordinates of peaks may be determined by the m/z coordinates of the respective sub-regions in which they are determined to occur.
In alternative embodiments, the peak locating, enumeration and fitting may be eliminated, during early iterations, and replaced by simple subtraction of the projected data obtained from one half of a region or sub-region from the projected data from the other half of the region or sub-region. In other words, after dividing a region or sub-region into two halves according to mass, the projected data of one half is simply subtracted from the projected data of the other half so as to yield a “difference spectrum”. For instance, returning to
In step 1216 of the method 1200 (
Still referring to
The programmable processor shown in
An interesting and useful feature of methods in accordance with the invention is the possibility of also reporting, in the case of chromatographic data, information related to operational state or physical characterization of the analytical instrumentation that provided the chromatographic data. For instance, derivation of parameters used in fitting Gamma distributions to peak features may provide information on fundamental properties of analyte interaction between analyte molecules and the mobile and stationary phases of a chromatographic column. Such information may include, for instance, the average number of times that molecules of a particular analyte are adsorbed on the stationary phase during their transit through the column and the average time for desorption from the stationary phase back into the mobile phase. The comparison between line shapes for different analytes (for instance, Gamma versus Gaussian versus exponentially-modified Gaussian) may provide a relative measure of the importance of adsorption versus simple diffusion in the elution of compounds from the column. A user may then use such information to adjust the composition or physical characteristics of the mobile or stationary phases so as to facilitate better chromatographic separation of certain pairs of compounds.
The end result of methods described in the preceding text and associated figures is a robust, exhaustive and general method to detect peaks and characterize spectral peaks without user-adjustable parameters. It makes no assumptions about peak shape or width, and thus can detect a wide variety of peaks, even in a single chromatogram. Since it requires no user input, it is suitable for automation, use in high-throughput screening environments or for use by untrained operators. Computer instructions according to any of the methods described above may be supplied as a computer program product or products tangibly embodied on any form of computer readable medium, such computer program product or products themselves being embodiments of the invention.
The discussion included in this application is intended to serve as a basic description. Although the present invention has been described in accordance with the various embodiments shown and described, one of ordinary skill in the art will readily recognize that there could be variations to the embodiments and those variations would be within the spirit and scope of the present invention. The reader should be aware that the specific discussion may not explicitly describe all embodiments possible; many alternatives are implicit. For instance, although various exemplary embodiments described herein refer to peak fitting with curves of Gaussian, exponentially-modified Gaussian and Gamma distribution line shapes and with pairs of Gaussian curves, any suitable form of line shape may be employed, depending on the particular needs of the artisan or on particular data formats or types of experiments employed. One of ordinary skill in the art would readily understand, from the discussions provided herein, how to employ the methods of the invention to fit various peak shapes using any suitable line shape. One of ordinary skill in the art would also readily understand how to modify equations presented in terms of positive and negative skew so as to fit peaks of negative and positive skew, respectively. Accordingly, many modifications may be made by one of ordinary skill in the art without departing from the spirit, scope and essence of the invention. Neither the description nor the terminology is intended to limit the scope of the invention. Any publications or patent applications mentioned herein are hereby incorporated by reference herein in their respective entirety, as if set forth fully herein.
This application claims the benefit, under 35 U.S.C. §119(e), of United States Provisional Application Ser. No. 61/182,953, filed on 1 Jun. 2009 (Jan. 6, 2009), said provisional application incorporated by reference herein in its entirety.
Filing Document | Filing Date | Country | Kind | 371c Date |
---|---|---|---|---|
PCT/US2010/036090 | 5/25/2010 | WO | 00 | 12/1/2011 |
Number | Date | Country | |
---|---|---|---|
61182953 | Jun 2009 | US |