Two-dimensional chromatography is a tool for analyzing complex samples, or for reducing sample complexity prior to analysis in mass spectrometry, for example. After baseline and/or noise removal, peak detection is a prerequisite to subsequent chemometric quantitation steps in one- or two-dimensional chromatography. While there are established methods for peak detection of chromatograms in one dimension, conventional methods for two-dimensional chromatography typically miss identification of highly overlapped peaks or are very complex, requiring a user to supervise the procedure and to interactively optimize several parameters.
For example, one conventional approach for detecting peaks in two-dimensional chromatograms includes a “watershed segmentation” algorithm. Generally, in the watershed segmentation algorithm approach, the two-dimensional signal of interest is effectively treated as a digital image and the measured signal intensity in two dimensions becomes the “pixel” value of the image. However, the watershed segmentation algorithm is not able to deal with highly overlapped peaks. For example, as shown in
Another conventional algorithm includes an extension of peak detection algorithms used in one-dimensional chromatography. In the one-dimensional approach, information from derivatives of the signal is used. For example, in a one-dimensional chromatogram, a peak can be detected as a point where the second derivative (the curvature of the signal of interest) has a local minimum. This is schematically illustrated in
This procedure has been extended to two-dimensional chromatography. For example, one-dimensional peak detections are performed along each of the one-dimensional data sets of points in the second dimension. Then, a peak merging procedure collapses the redundant peaks identified previously in order to eliminate redundancy and to identify peaks in two dimensions. However, the merging procedure uses a decision tree with four parameters that are specified by the user. Since the initial values of these parameters are not known a priori, the user must manually optimize the four variables, which is difficult, time consuming and unreliable.
The example embodiments are best understood from the following detailed description when read with the accompanying drawing figures. It is emphasized that the various features are not necessarily drawn to scale. In fact, the dimensions may be arbitrarily increased or decreased for clarity of discussion. Wherever applicable and practical, like reference numerals refer to like elements.
In the following detailed description, for purposes of explanation and not limitation, illustrative embodiments disclosing specific details are set forth in order to provide a thorough understanding of an embodiment according to the present teachings. However, it will be apparent that other embodiments according to the present teachings that depart from the specific details disclosed herein remain within the scope of the appended claims. Moreover, descriptions of well-known devices and methods may be omitted so as to not obscure the description of the embodiments. Such methods and devices are clearly within the scope of the present teachings.
In the various embodiments, positions of actual or true peaks are identified in two-dimensional separations, such as peaks generated by gas and/or liquid chromatography, for example, using an extension of differential geometry concepts applied to one-dimensional signal processing. The peaks are identified based on a parameter indicating an acceptable noise-to-signal ratio for the peaks. Although the parameter may be specified by a user in an embodiment, the process generally requires minimal user intervention. The position of the peaks detected can then be used as input for deconvolution routines to further determine the precise location and volume of each peak.
While pertinent to gas and liquid chromatography, it is to be understood that various embodiments may be applicable to signals arising from any two-dimensional separation, such as separations from two-dimensional gel electrophoresis and mass spectrometry, which may be performed by a microfluidic device, for example. Liquid chromatography/mass spectrometry (LC/MS), in particular, generally combines separation functionality of liquid chromatography with mass analysis ability of mass spectrometry. LC/MS measurements provide two-dimensional signals, having corresponding three-dimensional data sets that may be graphed along an x-axis indicating spectrum number or intensity, a y-axis indicating elution time, and a z-axis indicating mass-to-charge m/z ratio values, for example. LC/MS may be used to detect and identify molecules, chemicals, proteins and the like, in a complex mixture.
The peak detector 330 performs processing operations on the received signals, e.g., from sample separations, including the peak detection process, in accordance with various embodiments discussed below. In various embodiments, the peak detector 330 performs processing operations, and may be implemented as a microprocessor, a digital signal processor (DSP), or the like, or may be implemented at least in part by hard-wired logic circuits or customizable hardware. As stated above, although depicted separately, the peak detector 330 may be included within the signal generator 310, in various embodiments.
Referring to block 405
At block 405, the two-dimensional signal S(x, y) is received, where x and y represent variables in two dimensions. For example, in a representative embodiment, x represents a discrete variable associated with a first dimension, such as a fraction resulting from strong anion exchange chromatography, and y represents a discrete variable corresponding to a second dimension, such as time along a reversed-phase separation. The signal S(x, y) may be based on samples separated by the signal generator 310 of
At block 410, a process is performed for reducing noise in the signal S(x, y). Since derivatives of a noisy signal generally amplify noise with each successive derivation, it is usually necessary to de-noise the signal S(x, y). In order to reduce noise, filtering may be performed on higher frequencies of the signal S(x, y). Various methods of noise reduction include, for example, Fourier domain filtering and Savitzky-Golay filtering. In an embodiment, the high frequency filtering may include a Fast Fourier Transform (FFT) since implementation of the FFT operation can also be used should signal interpolation (e.g., block 430) be necessary.
At block 513, a cutoff frequencies indicated by s* and t* are determined based on the power spectrum P(s, t) as a direct generalization of a procedure performed in one-dimensional power spectra via a variance calculation. An example of a procedure to determine a cutoff frequency for the case of a one dimensional power spectrum P(s) is disclosed by Felinger, Data Handling in Science and Technology, E
An example of determining a cutoff frequency s* is schematically illustrated in
Using the cutoff frequencies indicated by s* and t*, a smoothing window, such as a Hamming window, is applied to filter the FFT, F(s, t) at block 514. Application of a Hamming window is described, for the case of a one-dimensional signal, by Felinger, Data Handling in Science and Technology, E
An inverse FFT is calculated at block 515 to obtain a smooth signal, Ssmooth(x, y). It is understood that other methods may be used to reduce the noise in the signal S(x, y) without departing from the scope of the present disclosure.
Once the smooth signal Ssmooth(x, y) is obtained at block 515, the process returns to
Also, the smooth signal Ssmooth(x, y), either with or without alignment along the second dimension, may be optionally interpolated along one of the two dimensions at block 430 of
In order to automatically detect uneven resolution along the two dimensions for any given peak, for example, when the cross section of the peak along one dimension shows a much lower number of points than a cross section of the peak along the other dimension, the following procedure may be used, in a representative embodiment. Standard deviations of the differences in consecutive values of the signal along the first dimension are calculated. The procedure is repeated for values of the signal along the second dimension. Then, the largest standard deviation is divided by the smallest standard deviation. Typically, a ratio below two indicates that both dimensions are resolved to an equivalent degree, while a ratio above two is indicative of uneven resolution.
When used, interpolation is performed in both dimensions simultaneously. When one dimension has higher resolution than the other, the data set corresponding to the dimension having the higher resolution is decimated, accordingly. For example, if y is the dimension having the higher resolution (where y=0, 1, 2, . . . m), a new signal S*smooth(x, y*) is created, such that y*=0, 2, 4, . . . m. The decimation can be carried out as many times as necessary until the two dimensions have similar resolutions. When both dimensions have poor resolution, decimation is not required.
However, when at least one of the dimensions has poor resolution, interpolation is performed using a standard procedure of zero filling, as described, for example, by Sundarajan, The Discrete Fourier Transform, W
ω2,max=kωmax
When the new Fourier spectrum is inversely transformed, the number of data points has increased k-fold with respect to the original signal. In the two-dimensional case, any uneven number of points (usually 1 or 3) can be intercalated in the original signal, thereby improving overall resolution. Any combination of these and other interpolation techniques may be used to obtain a usable two-dimensional smooth signal Ssmooth(x, y) having equal resolutions in both dimensions.
At block 440, baseline subtraction or removal may also be optionally performed on the smooth signal Ssmooth(x, y). Baseline subtraction may include methods that use regression to specific functions to the baseline (e.g. polynomials of different degrees and splines) and methods that analyze the signal in frequency space, like Fourier, or Wavelet transforms, as would be apparent.
Although shown following block 430 (signal interpolation) in
In the process indicated at block 450 of
According to the embodiment depicted in
By adding all the normalized vector products from each contiguous pair of tangent vectors (corresponding to each facet of the surface), and normalizing the final vector, the estimate of the unit normal vector at point (x, y) is obtained. A variant of this approach, described by Thürmer et al., Computing Vertex Normals from Polygonal Facet, J
At block 752, two principal curvatures are computed at each point (x, y) of the surface of smooth signal Ssmooth(x, y) using the corresponding normalized vector. The two principal curvatures are averaged at block 753 to define the mean curvature of the surface at each point (x, y). In an embodiment, one or more rounds of convolution of the mean curvature of the surface at each point (x, y) may be performed using a Gaussian filter, for example, in order to reduce noise in the curvature field. The two principal curvatures and the mean curvature at each point may be determined according to any appropriate technique, including the technique described by Batagelo et al. in the context of image processing, for example. That is, the intensity value of each pixel in the image processing application corresponds to the intensity value of the chromatography signal in the two-dimensional separation. In order to arrive at the mean curvature, a 2×2 matrix W forming the Weingarten equation must be diagonalized:
For example, referring to blocks 751 through 753, the Weingarten equation, above, shows how to obtain the normal to the surface at each point (nu, nv) by multiplying W by the normal basis vector (ru, rv). Once W is diagonalized, principal curvatures of the surface at that point are found as the eigenvalues of W, kmax and kmin, for example, where kmax and kmin are the principal curvatures. The mean curvature of the surface at that point, H, is defined as the average of the two principal curvatures, as follows:
Any other method of numerically estimating mean curvature may be used, as long as the estimation does not depart significantly from the actual value of the mean curvature. For example, the mean curvature may be estimated without calculating vectors normal to the surface and/or calculating principal radii of curvature.
At block 754, local maxima of H are determined on the surface Ssmooth(x, y) to generate candidate peaks on the surface. The local maxima may be determined, for example, by identifying a local neighborhood of each point (x, y) on the surface Ssmooth(x, y), and determining whether the value of mean curvature H at point (x, y) is greater than the value of mean curvature H in the other points included in the local neighborhood of the point (x, y). When the value of mean curvature H at the point (x, y) is higher than the values of mean curvature H at the other points, then point (x, y) is identified as a local maximum, and thus a candidate peak. As a result, multiple candidate peaks are detected on the surface Ssmooth(x, y), relative to neighboring points, providing flexibility and internal consistency in defining the candidate peaks.
The detected candidate peaks include true peaks (e.g., peaks 101-105 of
Returning to
Initially, a parameter is received at block 961, referred to as “stringency,” indicating a threshold for determination of the true peaks. In an example, the parameter is defined by a user, and may be received through an input device (e.g., input device 1245 and corresponding I/O 1235 of the representative system 300 shown of
Once the stringency value has been specified, selection of the true peaks is accomplished through iterative elimination. First, a list of all candidate peaks and their signals is built at block 962. In an example, the candidate peaks are in no particular order. The candidate peaks are indicated by the (x, y) coordinates of the corresponding surface points of the two-dimensional smooth signal Ssmooth(x, y). A variance of values of the smooth signal Ssmooth(x, y) at locations corresponding to all the candidate peaks in the list is calculated at block 963. The variance of the signals corresponding to all the candidate peaks will be referred to as an “aggregate variance.”
A candidate peak, indicated by index k (where k is a positive integer), is removed from the list at block 964 and the variance of signals is recalculated based on the signals corresponding to the remaining candidate peaks on the list at block 965. For example, during the first iteration, k=1, and thus a first candidate peak if removed from the list and the variance of signals is recalculated without the signal corresponding to the first candidate peak.
At block 966, the recalculated variance is compared with the aggregate variance previously calculated at block 963, and the difference between the recalculated and aggregate variances is compared with the stringency value at block 967. When the difference between the recalculated and aggregate variances is greater than the stringency value (block 967: Yes), the candidate peak k is labeled a true peak and removed permanently from list at block 968. The true peak k may also be added to a list of true peaks, which may be stored, for example, in results database 348 at block 969, for later reference. When the difference between the recalculated and aggregate variances is not greater than stringency value (block 967: No), the candidate peak k is placed at the end of the list at block 970. Notably, in this process, the larger the value of stringency, the fewer candidate peaks will be identified as true peaks, as shown for example in
At block 971, it is determined whether there are any candidate peaks remaining on the list. For example, the total number of candidate peaks may be represented by n, in which case a determination in made at block 971 of whether k=n. When there are no remaining candidate peaks (e.g., k=n), the peak selection process ends, and the list of true peaks is complete. When there are remaining candidate peaks (e.g., k<n), then the variable k is incremented by one at block 972, and blocks 963 through 971 are repeated. In an alternative embodiment, the peak selection process simply continues without determining whether any candidate peaks are remaining, as long as there are candidate peaks on the list.
Referring again to
The second candidate peak is removed from the list at block 964 and the variance of the remaining signals is recalculated at blocks 965. At block 966, the recalculated variance is compared with the second aggregate variance, and the difference between the recalculated and second aggregate variances is compared with the stringency value at block 967. When the difference between the recalculated and second aggregate variances is greater than the stringency value (968: Yes), the second candidate peak is labeled a true peak and removed permanently from list at block 968. The second candidate peak may also be added to the list of true peaks and stored at block 969. When the difference between the recalculated and aggregate variances is not greater than the stringency value (967: No), the second candidate peak is placed at the end of the list at block 970.
The process is repeated through subsequent iterations until k is equal to the total number of candidate peaks, and/or a full examination of all candidate peaks from the list does lead to new true peaks. For example, in an alternative embodiment, the process of
An example of true peaks which have been identified by the true peak selection of
In an alternative embodiment, each candidate peak may be identified as a true peak or ancillary peak at substantially the same time it is identified as a candidate peak (e.g., at block 755 of
As noted above, the description of
It is further assumed that the area of the chromatogram covered with peaks does not represent the majority of the support of the smooth signal Ssmooth(x, y). This is because there must be enough candidate peaks that do not become true peaks (i.e., the ancillary peaks) for the peak selection process of
Peak positions determined via local maxima of H are sometimes very close to, but not right at a true peak, which coincides with the maximum signal intensity for the case of isolated peaks. Therefore, the peak detection process is intended for subsequent quantitative analysis, where for instance, overlapped peaks are deconvolved by an optimization procedure. As a result of the optimization procedure, precise positions of the peaks are provided. However, in order for the optimization procedure to work, appropriate initial conditions, such as the number of peaks and their approximate parameters, are required, as provided by the peak detection process.
In particular,
In the depicted representative embodiment, the system 1200 includes a two-dimensional separator 1210 and a peak detector 1230. The two-dimensional separator 1210 receives samples, which may include various mixtures of molecules (e.g., peptides, proteins, or the like) to be identified. As stated above, the samples may be separated by the two-dimensional separator via various types of separation processing to reduce the complexity of the mixture and to isolate to the extent possible individual compounds contained within sample. Isolation may occur spatially or temporally, for example. The two-dimensional separator 1210 may perform separation processing in accordance with any appropriate separation system, including two-dimensional gel electrophoresis and LC/MS, and may be implemented in part using a microfluidic device, for example. One example of performing two-dimensional separations of tryptic digests using glass microfluidic devices is described by Ramsey et al., High-Efficiency, Two-Dimensional Separations of Protein Digests on Microfluidic Devices, A
The peak detector 1230 performs processing on the sample separations, including detecting peaks, in according with various embodiments discussed below. The peak detector 1230 may be a computer processor, for example, and includes central processing unit (CPU) 1231, internal memory 1232, bus 1239 and interfaces 1235-1238, and is configured to interface with the two-dimensional sample separator 1210 through a respective interface 1212, which may be a universal serial bus (USB) interface, an IEEE 1394 interface, or a parallel port interface, for example. As stated above, it is understood that, although depicted separately, the peak detector 1230 may be included within the two-dimensional separator 1210, in various embodiments.
With respect to the peak detector 1230, the internal memory 1232 includes at least nonvolatile read only memory (ROM) 1233 and volatile random access memory (RAM) 1234, although it is understood that internal memory 1232 may be implemented as any number, type and combination of ROM and RAM, and may provide look-up tables and/or other relational functionality. In various embodiments, the internal memory 1232 may include a disk drive or flash memory, for example. Further, the internal memory 1232 may store program instructions and results of calculations or summaries performed by CPU 1231.
The CPU 1231 is configured to execute one or more software algorithms, including the peak detection process of the embodiments described herein, in conjunction with the internal memory 1232. In various embodiments, the CPU 1231 may also execute software algorithms to control the basic functionality of the system 1200. The CPU 1231 may include its own memory (e.g., nonvolatile memory) for storing executable software code that allows it to perform the various functions. Alternatively, the executable code may be stored in designated memory locations within internal memory 1232. The CPU 1231 executes an operating system, such as a Windows® operating system available from Microsoft Corporation, a Linux operating system, a Unix operating system (e.g., Solaris™ available from Sun Microsystems, Inc.), or a NetWare® operating system available from Novell, Inc. The operating system may control execution of other programs, including collection and separation of samples and output of corresponding signals by the two-dimensional separator 1210.
In an embodiment, a user and/or other computers may interact with the peak detector 1230 using input device(s) 1245 through I/O interface 1235. The input device(s) 1245 may include any type of input device, for example, a keyboard, a track ball, a mouse, a touch pad or touch-sensitive display, and the like. Also, information may be displayed by the peak detector 1230 on display 1246 through display interface 1236, which may include any type of graphical user interface (GUI), for example. The displayed information includes the processing results obtained by the CPU 1231 executing the method of peak detection, described herein.
The processing results of the CPU 1231 may also be stored in the results database 1248 through memory interface 1238. The database 1248 may include any type and combination of volatile and/or nonvolatile storage medium and corresponding interface, including hard disk, compact disc (e.g., CD-R/CD/RW), universal serial bus (USB), flash memory, or the like. The stored processing results may be viewed, e.g., on the display 1246, and/or further processed at a later time. Also, the processing results may be provided to other computer systems connected to network 1247 through network interface 1237. The network 1247 may be any network capable of transporting electronic data, such as the Internet, a local area network (LAN), a wireless LAN, and the like. The network interface 1237 may include, for example, a transceiver (not shown), including a receiver and a transmitter, that provides functionality for the system 1200 to communicate wirelessly over the data network through an antenna system (not shown), according to appropriate standard protocols. However, it is understood that the network interface 1237 may include any type of interface (wired or wireless) with the communications network, including various types of digital modems, for example.
The various “parts” shown in
While specific embodiments are disclosed herein, many variations are possible, which remain within the scope of the invention. Such variations would become apparent after inspection of the specification, drawings and claims herein. The invention therefore is not to be restricted except within the scope of the appended claims.