The present invention relates to a method, a computer tool and a system for detecting peptide peaks in measurement signals generated by HPLC-MS (High Performance Liquid Chromatography—Mass Spectrometry) instruments, as defined in the preamble of claims 1, 13 and 14 respectively.
More particularly, the invention concerns the detection of peptide peaks, which are combined with a background noise typical of the measurement signals generated by HPLC-MS instruments.
As used herein, the term proteomics is intended to designate the study of the proteome, i.e. the time- and cell-specific complement of the genome.
As is known in the field of proteomics, one of the common methods for analysis of biological tissues and fluids consists in the combination of two instruments, i.e. an HPLC (High Performance Liquid Chromatography) chromatograph and a Mass Spectrometer MS.
Referring to
This is followed by a step 1C of excision of single protein spots from the 2D gel and a step 4 in which the protein is digested by a proteolytic enzyme 5, typically trypsin.
Otherwise, as shown in
Then, the acquisition and filtering chain 1 provides a first chromatographic separation 1D of the resulting peptides based on their hydrophobicity and a second separation 1E of such peptides based on their mass.
Particularly, the chromatographic separation step 1D is carried out using the HPLC chromatograph 2 and the second separation step 1E is carried out using the mass spectrometer MS 3.
Finally, the acquisition and filtering chain 1 provides a step 1F for identification of the biological sample proteins from the detected peptides, with a bottom-up approach known as PMF (Peptide Mass Fingerprinting) 7, as is known to those of ordinary skill in the art.
Nonetheless, the combination of the HPLC chromatograph 2 and the MS spectrometer 3 introduces chemical noise peaks in mass spectra, which are morphologically indistinguishable from peptide peaks and are of such an intensity as to wholly hide low concentration peptides.
Such a situation is, for example, shown in
The causes of this noise are not wholly clear and have not been extensively investigated from a theoretical point of view.
One of the most probable origins seems to lie in the detection, during step 1E by the mass spectrometer 3, both of the mobile phase used for chromatographic peptide separation 1D and of a fraction of ions produced by ESI ionization, which skips the mass analysis process and reaches the detector of the spectrometer at times that are stochastically unrelated to the spectrometer settings.
Nonetheless, regardless of the origin of the introduction of such noise peaks that can hide peptide peaks, little or nothing has been actually proposed heretofore to remove these peaks in the instruments of the acquisition and filtering chain 1.
Even at the software level, noise removal approaches have been based on the analysis of the individual scans by the spectrometer 3, which involves the serious drawback of losing an important part of the information, i.e. the chromatographic peptide separation dimension.
In an attempt to obviate the above drawbacks, a number of solutions have been proposed, including those disclosed in the following patents:
WO200438602: this document discloses a method that describes the whole process of acquisition, compression, processing, analysis and visualization of spectroscopically observed chemical and biological compounds. Nevertheless, this method does not involve processing of chromatographic data.
WO200419003: this document discloses a typical use of wavelet decomposition for denoising and compression of individual mass spectra. No reference is made to chromatographic data.
US20030078739: this document discloses a typical approach to peptide peak detection, which is based on clustering of peaks of equal mass in successive scans. The limitation of this approach is that chemical noise is also represented by peaks of equal mass in successive scans.
US20030130823: This document discloses the use of wavelet decomposition as a denoising and data compression method.
CA2388842: this document discloses the use of spectrographic data to retrieve chromatographic data. The approach for attenuating background noise from chromatograms is not based on wavelet decomposition.
U.S. Pat. No. 5,885,841: this document discloses a typical use of wavelet decomposition for removing the baseline from chromatographic data (obtained from spectrographic data). The approach described in this document simply consists in chromatographic signal smoothing, and is not significantly more effective than a low-pass filter.
In view of the prior art as described above, the object of the present invention is to improve detection of peptide peaks hidden below the noise threshold in measurement signals which result from the combination of a chromatograph and a spectrometer (HPLC-MS).
According to the present invention, this object is fulfilled by a method for detection of at least one peptide peak value combined with background noise in the data of a measurement signal generated by the combination of a chromatograph and a mass spectrometer, as defined in claim 1.
This object is fulfilled by a computer tool for carrying out the method, which is loaded in the computer memory and run in such computer, as defined in claim 13.
Finally, this object is also fulfilled by a processing system that can create filtered spectrographic files in most common formats such as “.txt”, “mzData”, “mzXML”, etc., as defined in claim 14.
This invention provides a method that can retrieve all Single Ion Chromatograms (SIC) of the experiment and filter the signal in the chromatographic dimension.
Furthermore, the method of the present invention allows identification of morphological differences between chemical noise and peptide peaks.
Also, thanks to the method of the present invention, noise heteroscedasticity can be accounted for, and noise can be filtered according to its intensity in each SIC.
Moreover, the method of the present invention is a post-data processing method that can be implemented with any mass spectrometer MS susceptible of being combined with HPLC chromatography instruments.
It shall be noted that the method allows no on-line implementation on spectrometers, because data processing requires all spectrographic scans to be first acquired and then processed in accordance with the inventive method.
Thus it has to be considered as a post-data processing method.
The characteristics and advantages of the invention will appear more clearly from the following description of a practical embodiment, illustrated by way of a non-limiting example in the annexed drawings, in which:
As described above, conventional approaches for removing noise peaks 12 overlapping the peptide peaks 9 in the measurement signal 8 (see
Nonetheless, this causes the loss of an important part of the information, i.e. the chromatographic peptide separation dimension obtained by the chromatograph 2 contained in the acquisition and filtering chain 1 (see
For best utilization of the information that can be retrieved from the chromatographic separation dimension, also referring to
Particularly, such reconstruction requires all the N successive scans to be arranged in adjacent positions to form a 3D map 11, showing all the chromatograms of an experiment:
In other words, in order to prevent any loss in the data generated by the acquisition and filtering chain 1, all the N scans are arranged in adjacent positions to form the 3D map 11.
Particularly, referring to
It shall be noted that the intensity I of peptide peaks 9 in one scan is a function of both the mass value and the elution time in the liquid column of the chromatograph 2.
Nevertheless, once again in the representation of
In prior art techniques, noise can be found in such 3D map 11 during a step in which an average spectrum is determined from multiple successive scans.
This step provides a highly regular pattern of peaks equally spaced by 1 Th, as shown in
Such regular pattern can be confirmed by a transform operation, i.e. by applying a Fourier transform to the spectrum of
Particularly, further referring to
Considering that peptide peak values are not evenly distributed, these components can only represent noise 12, more particularly chemical noise 12a, also known as background b(t).
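The following is a minimal sketch of how a regular peak pattern can be confirmed by a Fourier transform. It is an illustration only: the synthetic averaged spectrum, the sampling density `SAMPLES_PER_TH` and the function names are assumptions introduced here and do not come from the described method.

```python
import cmath
import math


def dft_magnitudes(signal):
    """Return |X[k]| for k = 0 .. n//2 using a direct O(n^2) DFT."""
    n = len(signal)
    mags = []
    for k in range(n // 2 + 1):
        acc = sum(x * cmath.exp(-2j * math.pi * k * i / n)
                  for i, x in enumerate(signal))
        mags.append(abs(acc))
    return mags


def dominant_period(signal, samples_per_th):
    """Period, in Th, of the strongest non-DC frequency component."""
    mags = dft_magnitudes(signal)
    k = max(range(1, len(mags)), key=lambda idx: mags[idx])
    return len(signal) / k / samples_per_th


# Hypothetical averaged spectrum: one Gaussian-shaped bump per Th,
# sampled at 10 points per Th over a 20 Th window.
SAMPLES_PER_TH = 10
spectrum = []
for i in range(200):
    offset = i % SAMPLES_PER_TH
    dist = min(offset, SAMPLES_PER_TH - offset)  # distance to nearest bump centre
    spectrum.append(math.exp(-0.5 * (dist / 1.5) ** 2))

period_th = dominant_period(spectrum, SAMPLES_PER_TH)
```

In this toy case the strongest non-DC component corresponds to a period of 1 Th, which is the kind of evenly spaced component that the text attributes to chemical noise rather than to peptides.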
Nevertheless, the easily recognizable pattern in
Furthermore, prior art methods generate unacceptable artifacts in peptide signal reconstruction.
Such artifacts are most likely caused by the non-linearity of common filters, and by the changing morphology of peaks as mass changes, which is typical for spectra in which Time of Flight of ions is detected.
It should be noted that the spectrometer 3 of the acquisition and filtering chain 1 detects the Times of Flight (TOF) of peptides in the instruments and uses such values to determine their masses.
Particularly, there is a quadratic relation between times and masses, as shown below:
$$t = d\,\sqrt{\frac{m}{2\,z\,e\,V_s}}$$
where:
t=time of flight of the ion;
m=mass of the ion;
z=charge of the ion;
e=charge of an electron;
d=actual length of the TOF analyzer;
Vs=acceleration voltage in the TOF analyzer.
The above formula clearly shows that the mass m to charge z ratio of the ion is proportional to the squared Time of Flight TOF, that is:
$$\frac{m}{z} \propto t^{2}$$
This quadratic relation involves wider peaks at high masses.
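As a hedged numerical illustration of this point, the sketch below uses the standard time-of-flight relation built from the symbols defined above (t, m, z, e, d, Vs). The flight length, acceleration voltage and time spread used in the example are hypothetical values chosen for illustration, not instrument settings taken from this description.

```python
import math

E_CHARGE = 1.602176634e-19   # elementary charge e, in coulombs
DALTON = 1.66053906660e-27   # unified atomic mass unit, in kg


def flight_time(mz, d, vs):
    """Time of flight (s) of an ion with mass-to-charge ratio mz (Da per charge).

    d is the flight length (m) and vs the acceleration voltage (V).
    """
    return d * math.sqrt(mz * DALTON / (2.0 * E_CHARGE * vs))


def mz_from_time(t, d, vs):
    """Inverse relation: m/z is proportional to the squared time of flight."""
    return 2.0 * E_CHARGE * vs * (t / d) ** 2 / DALTON


def peak_width_in_mz(mz, dt, d, vs):
    """Width on the m/z axis produced by a fixed time spread dt."""
    t = flight_time(mz, d, vs)
    return mz_from_time(t + dt / 2, d, vs) - mz_from_time(t - dt / 2, d, vs)


# Hypothetical analyzer: 1 m flight length, 20 kV, 1 ns time spread.
D, VS, DT = 1.0, 20000.0, 1e-9
width_low = peak_width_in_mz(500.0, DT, D, VS)
width_high = peak_width_in_mz(2000.0, DT, D, VS)
```

With a constant time spread, the width on the mass axis grows with the square root of m/z, so in this sketch the peak at 2000 Th is twice as wide as the peak at 500 Th.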
Nonetheless, when the 3D maps 11 as described above with reference to
Thus, as shown in
Obviously, the peptide peaks 9 that can be seen from
However, the purpose of the present invention is to allow detection of those peptide peaks that have such a low concentration as to be totally hidden below the elution profiles of chemical noise 12a.
Therefore, the 3D map 11 is a measurement signal generated by the combination of the chromatograph 2 and the mass spectrometer 3, which has at least one peak value 24 (see
Nonetheless, the detection of such peptide peaks 24 hidden below the elution profiles of chemical noise 12a is hindered by the problem that elution traces cannot be immediately retrieved from the data provided by the mass spectrometer 3, and have to be reconstructed off line.
Therefore, to allow evaluation of elution traces, the individual spectra obtained from the N scans of the mass spectrometer 3 (
Particularly, such reorganization consists in creating a matrix MN×P 14, in which the N rows represent the N successive scans (i.e. specific elution times) and the P columns represent the clock ticks (corresponding to specific mass values), each entry being the intensity value of the ith scan at the jth clock tick.
The intensity value of each vector P depends on what is detected by the spectrometer 3 at a given time and at a given mass and can be either zero or related to noise, to the presence of a peptide or to the sum of the noise and the peptide signal.
This matrix is created through the following steps:
N=total number of scans;
P=total number of clock ticks;
i=scan number;
j=clock tick;
mi,j=intensity value in the scan i and at the clock tick j, where 1≤i≤N and 1≤j≤P;
mī,j=row vector, corresponding to the mass spectrum, where 1≤i≤N and 1≤j≤P;
mi,
Therefore, the matrix MN×P 14 contains all the intensity values retrieved by the acquisition and filtering chain 1 throughout the N scans.
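A minimal sketch of this reorganization is given below, assuming that every scan has already been resampled onto the same P clock ticks. The function names and the toy intensity values are hypothetical and serve only to show how rows map to mass spectra and columns to Single Ion Chromatograms.

```python
def build_intensity_matrix(scans):
    """Stack N scans (each a list of P intensities) into an N x P matrix."""
    if not scans:
        raise ValueError("at least one scan is required")
    p = len(scans[0])
    if any(len(scan) != p for scan in scans):
        raise ValueError("all scans must share the same P clock ticks")
    return [list(scan) for scan in scans]


def mass_spectrum(matrix, i):
    """Row i: the mass spectrum acquired at the ith scan (elution time)."""
    return list(matrix[i])


def single_ion_chromatogram(matrix, j):
    """Column j: the elution trace (SIC) at the jth clock tick (mass value)."""
    return [row[j] for row in matrix]


# Toy data: N = 3 scans, P = 4 clock ticks.
scans = [
    [0.0, 2.0, 0.0, 1.0],
    [0.0, 5.0, 0.0, 1.0],
    [0.0, 3.0, 0.0, 1.0],
]
matrix = build_intensity_matrix(scans)
sic = single_ion_chromatogram(matrix, 1)
```

Here `sic` holds the three intensities recorded at clock tick 1 across the successive scans, which is the vector on which the chromatographic filtering operates.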
It shall be noted that the diagram of the column vector mi,
Once the matrix 14 has been obtained, also with reference to
Particularly, for each jth column vector of intensity values mi,
In a preferred embodiment, the wavelet decompositions 15 and 17 may coincide in a single wavelet decomposition step.
The wavelet decomposition is assumed herein to be known to those of ordinary skill in the art, and will not be described in greater detail below.
It should be noted that the step of performing a wavelet decomposition 15 for said jth vector SIC to generate a first processed vector 16 representative of the smoothed intensity values of the column vector mi,
Particularly, the first processed vector 16 is a vector with a dimension N, where N is the number of scans (or positions), in which each point of such processed vector approximates the baseline of the specific SIC under examination.
More particularly, the wavelet decomposition 15 allows an estimate of the low frequency component of the measurement signal for the specific SIC under examination at the high scales of the decomposition approximations (
For this purpose, also referring to
The selection of the approximation level shall be related to the relationship between scale, frequency and amplitude of peptide signals within the RT domain of each SIC of the matrix 14.
It should be particularly noted that the scale level can be related to the frequency, and hence to the amplitude of signals in the time domain.
It shall be noted that too high an approximation level (e.g. A8, A9, not shown in
In a preferred embodiment of the present method, the preset approximation level ranges from the fifth A5 to the seventh A7 approximation levels.
Preferably, the sixth approximation level A6 (see
In other words, the sixth approximation level A6 best approximates the carrier of the SIC.
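The effect of taking a high approximation level can be sketched as follows. The wavelet family is not specified in this description, so the Haar wavelet is assumed here purely because its level-L approximation reduces to the mean over blocks of 2^L samples; the function name and the test signal are hypothetical.

```python
def haar_approximation(signal, level):
    """Level-`level` Haar approximation of `signal`, at the original length.

    For the Haar wavelet, reconstructing from the approximation
    coefficients alone gives the mean over consecutive blocks of
    2**level samples. The signal is padded by repeating its last value
    so that its length becomes a multiple of the block size.
    """
    block = 2 ** level
    n = len(signal)
    padded = list(signal) + [signal[-1]] * ((-n) % block)
    smooth = []
    for start in range(0, len(padded), block):
        mean = sum(padded[start:start + block]) / block
        smooth.extend([mean] * block)
    return smooth[:n]


# Hypothetical SIC: flat baseline of 10 with one narrow spike of height 74.
sic = [10.0] * 128
sic[64] = 74.0
a6 = haar_approximation(sic, 6)
```

At level 6 the narrow spike is spread over 64 scans and contributes only one unit to the local mean, so the result follows the slowly varying carrier rather than the peak. A smoother wavelet would avoid the staircase shape that the Haar assumption produces.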
Concerning the step of providing a threshold value S to perform the wavelet decomposition 17 and thus generate the second processed vector 18, such step consists, for instance, in using the wavelet decomposition 15 not only for smoothing the jth SIC, but also for denoising it.
As an alternative, to generate the second processed vector 18, representative of the SIC signal cleaned of any oscillations caused by stochastic noise, the step of providing the threshold value S comprises the additional steps of:
Particularly, this additional step consists in determining whether the variance σ2 of the background noise 12 of the values of the intensity column vector SIC changes throughout the period of the measurement signal.
If this is the case, the values of the intensity column vector SIC will be divided in as many intervals or portions as there are changes in the variance σ2.
Furthermore, an additional threshold value Ŝ, other than the threshold value S, is assigned to each of these intervals or portions of values of the intensity column vector SIC.
Advantageously, the step of providing such additional threshold values Ŝ for each portion of values of said intensity column vector SIC in which the stochastic noise 12b has a non-stationary variance σ2 allows detection of any peptide peaks 24 (in the respective portions of the intensity column vector SIC) that would remain below the threshold if a single threshold value S were used.
A dynamic programming algorithm, known to those of ordinary skill in the art and not described any further, may be used for determining the times of change of the variance σ2.
Particularly, with K being the maximum number of Change Points (i.e. the times of change of the variance σ2 of stochastic noise 12b in the values of the intensity column vector SIC) and D being the minimum distance between two Change Points, and assuming that K and D are much smaller than the time length of the values of the intensity column vector SIC, this dynamic programming algorithm provides the above mentioned times of change of the variance σ2 in the values of the intensity column vector SIC.
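Since the dynamic programming algorithm is not detailed in this description, the following is only one possible sketch of such a search. It assumes zero-mean noise, a Gaussian log-likelihood cost per segment, and a hypothetical `penalty` parameter used to choose how many Change Points to keep; `max_changes` and `min_distance` play the roles of K and D.

```python
import math


def variance_change_points(x, max_changes, min_distance, penalty):
    """Return sorted indices at which the variance of zero-mean x changes."""
    n = len(x)
    prefix = [0.0]
    for value in x:
        prefix.append(prefix[-1] + value * value)

    def cost(start, end):
        # Gaussian negative log-likelihood of x[start:end], up to constants.
        length = end - start
        var = max((prefix[end] - prefix[start]) / length, 1e-12)
        return length * math.log(var)

    inf = float("inf")
    # best[k][t]: minimal cost of splitting x[:t] with k change points.
    best = [[inf] * (n + 1) for _ in range(max_changes + 1)]
    back = [[0] * (n + 1) for _ in range(max_changes + 1)]
    for t in range(min_distance, n + 1):
        best[0][t] = cost(0, t)
    for k in range(1, max_changes + 1):
        for t in range((k + 1) * min_distance, n + 1):
            for s in range(k * min_distance, t - min_distance + 1):
                candidate = best[k - 1][s] + cost(s, t)
                if candidate < best[k][t]:
                    best[k][t] = candidate
                    back[k][t] = s
    # Keep the number of change points with the lowest penalised cost.
    k_best = min(range(max_changes + 1),
                 key=lambda k: best[k][n] + penalty * k)
    changes, t = [], n
    for k in range(k_best, 0, -1):
        t = back[k][t]
        changes.append(t)
    return sorted(changes)


# Deterministic toy noise: amplitude 1 for 60 scans, then amplitude 10.
noise = [(-1) ** i * 1.0 for i in range(60)] + [(-1) ** i * 10.0 for i in range(60)]
changes = variance_change_points(noise, max_changes=3, min_distance=10,
                                 penalty=3 * math.log(len(noise)))
```

In this toy case a single Change Point is reported at scan 60. Each resulting portion could then receive its own threshold, as described above; the penalised selection rule is a design choice of this sketch and other criteria could equally be used.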
It shall be noted that the second processed vector 18 is also a vector having a dimension N, with N being the number of scans.
Particularly, the step of denoising the specific jth SIC is based on a so-called thresholding process, which is preferably implemented as hard thresholding, but may also be implemented as soft thresholding.
More particularly, in this thresholding process, for each detail D1, D2, D3, etc. (see
Then, the signal is anti-transformed with the new detail coefficients to generate the second processed vector 18 representative of the SIC signal cleaned of any oscillations generated by stochastic noise 12b.
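The thresholding and anti-transformation steps can be sketched as below. The Haar wavelet is again an assumption made only to keep the example self-contained, the signal length must be divisible by 2^levels, and all function names and values are hypothetical.

```python
import math

SQRT2 = math.sqrt(2.0)


def haar_forward(signal, levels):
    """Return (approximation, [cD1, cD2, ...]) of an orthonormal Haar DWT."""
    approx = list(signal)
    details = []
    for _ in range(levels):
        pairs = list(zip(approx[0::2], approx[1::2]))
        details.append([(a - b) / SQRT2 for a, b in pairs])
        approx = [(a + b) / SQRT2 for a, b in pairs]
    return approx, details


def haar_inverse(approx, details):
    """Anti-transform: rebuild the signal from coarsest to finest level."""
    signal = list(approx)
    for detail in reversed(details):
        rebuilt = []
        for s, d in zip(signal, detail):
            rebuilt.extend([(s + d) / SQRT2, (s - d) / SQRT2])
        signal = rebuilt
    return signal


def hard_threshold(coeffs, threshold):
    """Zero every coefficient whose magnitude does not exceed the threshold."""
    return [c if abs(c) > threshold else 0.0 for c in coeffs]


def soft_threshold(coeffs, threshold):
    """Shrink every coefficient towards zero by the threshold."""
    return [math.copysign(max(abs(c) - threshold, 0.0), c) for c in coeffs]


def denoise(signal, levels, thresholds, rule=hard_threshold):
    """Threshold each detail level separately, then anti-transform."""
    approx, details = haar_forward(signal, levels)
    cleaned = [rule(d, t) for d, t in zip(details, thresholds)]
    return haar_inverse(approx, cleaned)


# Toy SIC with small oscillations and one large jump at index 5.
sic = [4.0, 4.2, 4.0, 3.8, 4.0, 10.0, 4.0, 4.0]
cleaned = denoise(sic, levels=1, thresholds=[1.0])
```

With hard thresholding the small oscillations are averaged out while the large coefficient, and therefore the jump at index 5, is kept unchanged. Soft thresholding would also reduce the height of that jump by the threshold amount.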
It shall be noted that the thresholding process requires the threshold S (or any additional threshold values Ŝ that are set for each portion of values of the intensity column vector SIC) to be set for each detail of the decomposition, such threshold value S being calculated as a multiple of the standard deviation σ of stochastic noise 12b.
As mentioned above, stochastic noise 12b is concentrated in the first detail level cD1 of the wavelet decomposition, wherefore the problem can be associated to the calculation of the standard deviation σ of cD1.
Such standard deviation σ can be estimated by an estimator such as the MAD (Median Absolute Deviation), which is robust to the presence of outliers.
It should be further noted that the MAD may also be used for estimating the variance σ2 of the possible time intervals of the intensity column vector SIC.
It shall be noted that, in this specific case, the signal of interest is the baseline, and the outliers are peptide signals, if any, whose possible presence in cD1 would be detected in few coefficients.
It shall be further noted that the MAD allows the heteroscedasticity of noise to be automatically accounted for.
In other words, during the wavelet transformation step 17 the heteroscedasticity of stochastic noise 12b shall be accounted for and its standard deviation σ shall be determined.
Coefficients of the first detail of the wavelet decomposition cD1, which describe the high-frequency content of the signal and whose standard deviation can be estimated by means of the MAD estimator, are used for this purpose.
It should be noted that the graphical representation of a specific jth vector SIC of the matrix 14, as shown in
The jth column vector mi,
$$f(t) = s(t) + e(t) + \left[\,b(t) + k_b\,e(t)\,\right]$$
where f(t) is the jth SIC under examination, s(t) identifies the peptide signal 8, b(t) identifies chemical noise 12a (or the baseline), e(t) identifies stochastic noise 12b, typically white noise, and k_b identifies a multiplicative constant, proportional to chemical noise b(t), that can account for the heteroscedasticity of stochastic noise 12b, whose variance σ2 increases with chemical noise 12a, i.e. with the baseline.
Thus, the following four cases may be observed, also with reference to
Once the jth column vector mi,
As mentioned above, the threshold value S (or any additional threshold values Ŝ that are set for each portion of values of the intensity column vector SIC) is a multiple of the standard deviation σ.
Thus, if the MAD is zero, then the jth column vector mi,
In this case, stochastic noise 12b may be considered to be equal to a Gaussian white noise N(0,1) with zero mean and standard deviation σ equal to one.
However, if the MAD is other than zero, the jth column vector mi,
In the latter case, we find a chemical noise “hump” and stochastic noise may be considered to be equal to a Gaussian white noise that can be identified, for instance, by a normal distribution N(0, σ) with zero mean and standard deviation σ calculated as follows:
$$\sigma = \frac{\mathrm{MAD}}{0.6745}$$
where 0.6745 is the 75th percentile of the standard normal distribution.
Therefore the threshold S (or any additional threshold values Ŝ that are set for each portion of values of the intensity column vector SIC) required for thresholding in the wavelet decomposition step 17 is a multiple of the standard deviation σ of stochastic noise, that is:
$$S = n\,\sigma$$
where n is a multiplicative factor.
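A minimal sketch of this threshold calculation follows. The MAD is taken here in its usual form, i.e. the median of the absolute deviations from the median; the fallback to a unit standard deviation when the MAD is zero follows the case described above. The function names, the coefficient values and the factor n = 3 are hypothetical.

```python
import statistics


def mad_sigma(coeffs):
    """Robust estimate of the noise standard deviation: MAD / 0.6745.

    When the MAD is zero the noise is treated as N(0, 1), i.e. sigma = 1.
    """
    centre = statistics.median(coeffs)
    mad = statistics.median(abs(c - centre) for c in coeffs)
    if mad == 0:
        return 1.0
    return mad / 0.6745


def threshold_from_details(cd1, n):
    """Threshold S = n * sigma, with sigma estimated from the cD1 coefficients."""
    return n * mad_sigma(cd1)


# Hypothetical first-detail coefficients with one outlier (a peptide signal).
cd1 = [1.0, 2.0, 3.0, 4.0, 100.0]
sigma = mad_sigma(cd1)
S = threshold_from_details(cd1, n=3)
```

The outlier value 100 does not affect the estimate: the median is 3, the MAD is 1, and sigma is 1/0.6745, which is the robustness property that motivates the use of the MAD.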
Once the jth column vector mi,
Particularly, the above comparison step substantially consists in generating the sixth processed vector 19A which identifies a possible intensity value xi, with 1≤i≤N. This intensity value xi identifies the point in which the first processed vector 16 and the second processed vector 18 differ.
It shall be noted that the first processed vector 16 and the second processed vector 18 may differ in certain points only, wherefore the difference vector is composed of zero values, excepting the points in which said first processed vector 16 and said second processed vector 18 differ.
In other words, the baseline of the specific SIC under examination is estimated by retrieving the information shared by the two processed vectors 16 and 18, and next performing an interpolation step 19B at the points xi, if any, in which the two vectors 16 and 18 differ.
For instance, the interpolation step 19B may be preferably carried out using the Piecewise Cubic Hermite Interpolating Polynomial PCHIP, or the spline function.
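One possible reading of this baseline estimation is sketched below. Plain linear interpolation is swapped in for PCHIP or a spline only to keep the example dependency-free, and the tolerance `tol` used to decide whether two values coincide is an assumption, as are the function name and the sample values.

```python
def estimate_baseline(smoothed, denoised, tol=1e-9):
    """Keep the points shared by both vectors and interpolate elsewhere.

    Linear interpolation stands in for PCHIP or a spline in this sketch.
    """
    n = len(smoothed)
    agree = [i for i in range(n) if abs(smoothed[i] - denoised[i]) <= tol]
    if not agree:
        raise ValueError("the two vectors share no point to interpolate from")
    baseline = list(smoothed)
    for i in range(n):
        if abs(smoothed[i] - denoised[i]) <= tol:
            continue  # shared information: keep the value as it is
        left = max((k for k in agree if k < i), default=None)
        right = min((k for k in agree if k > i), default=None)
        if left is None:
            baseline[i] = smoothed[right]
        elif right is None:
            baseline[i] = smoothed[left]
        else:
            w = (i - left) / (right - left)
            baseline[i] = (1 - w) * smoothed[left] + w * smoothed[right]
    return baseline


# The two vectors differ at indices 2 and 3, where a peak is present.
smoothed = [0.0, 1.0, 2.5, 3.5, 4.0]
denoised = [0.0, 1.0, 9.0, 9.0, 4.0]
baseline = estimate_baseline(smoothed, denoised)
```

In this example the baseline under the peak is rebuilt from the nearest shared points on either side, giving the values 2 and 3 at indices 2 and 3.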
As described above, the filtered vector 20 is generated in a step in which the jth intensity vector SIC and said second 18 and third 19 vectors are processed.
Particularly, the above comparison step includes the additional steps of:
In other words, the filtered vector 20 can be obtained through the following steps:
The final result of the implementation of the above method is shown in
Once each column vector mi,
In other words, by the application of the method to all P vectors of the matrix 14, the matrix is wholly cleaned and can provide the P filtered elution traces SIC and allow reconstruction of the N filtered mass scans.
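The column-by-column application and the reconstruction of the filtered scans can be sketched as follows. The per-column filter is passed in as a function; the `subtract_minimum` filter used in the example is a hypothetical stand-in and is not the filtering method of this description.

```python
def filter_matrix(matrix, filter_sic):
    """Apply `filter_sic` to each of the P columns and rebuild the N scans."""
    n_scans = len(matrix)
    n_ticks = len(matrix[0])
    filtered_columns = [
        filter_sic([matrix[i][j] for i in range(n_scans)])
        for j in range(n_ticks)
    ]
    # Transpose back so that each row is again one filtered mass scan.
    return [[filtered_columns[j][i] for j in range(n_ticks)]
            for i in range(n_scans)]


def subtract_minimum(sic):
    """Placeholder filter used only for this example."""
    floor = min(sic)
    return [value - floor for value in sic]


matrix = [
    [5.0, 1.0],
    [7.0, 1.0],
    [6.0, 4.0],
]
filtered = filter_matrix(matrix, subtract_minimum)
```

Each row of `filtered` is a filtered mass scan and each column a filtered elution trace, so the result can be written out scan by scan in whichever file format is required.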
Thus, the acquisition and filtering chain 1 can create spectrographic files in most common formats such as “.txt”, “mzData”, “mzXML”, etc., such files being readable by a personal computer.
Advantageously, the inventive method is also useful for detection, as it avoids false positives.
Most detection algorithms are based on the recognition of the isotopic distribution of a peptide, wherefore a peptide peak is recognized as such only when the peaks of its isotopes are also identified, at a distance that is compatible with the ionization charge of the peptide.
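The spacing test on which such algorithms rely can be sketched as below, assuming that adjacent isotopic peaks of an ion of charge z are separated by about 1.00335/z on the m/z axis (the approximate mass difference between 13C and 12C). The tolerance, the maximum charge and the function name are hypothetical; the two masses are those quoted later in this description.

```python
ISOTOPE_SPACING = 1.00335  # approximate 13C - 12C mass difference, in Da


def charge_from_spacing(mz_a, mz_b, max_charge=4, tol=0.02):
    """Return the charge z for which two peaks look like adjacent isotopes.

    Returns None when the spacing matches no charge up to max_charge.
    """
    delta = abs(mz_b - mz_a)
    for z in range(1, max_charge + 1):
        if abs(delta - ISOTOPE_SPACING / z) <= tol:
            return z
    return None


z_pair = charge_from_spacing(506.827, 507.314)
z_none = charge_from_spacing(500.0, 500.7)
```

The peaks at 506.827 and 507.314 are about 0.49 apart, which is compatible with a doubly charged ion, so a noise peak at 506.827 can be mistaken for the first isotope of the peptide at 507.314 unless it is filtered off beforehand.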
As shown in
Once the detection method has been implemented, the filtered signal 26 does not have the first peak 25A of the isotopic distribution, which has been recognized as chemical noise and filtered off.
Without the implementation of the detection method, the real peptide of mass 507.314, i.e. the peak 25B, would never have been detected, and a false peptide of mass 506.827 would have been detected instead.
The inventive method provides an improvement of two orders of magnitude in the detection of peptide peaks combined with the noise of the measurement signal.
Those skilled in the art will obviously appreciate that a number of changes and variants may be made to the arrangements as described hereinbefore to meet incidental and specific needs, without departure from the scope of the invention, as defined in the following claims.
Number | Date | Country | Kind |
---|---|---|---|
MI2007A001107 | May 2007 | IT | national |

Filing Document | Filing Date | Country | Kind | 371c Date |
---|---|---|---|---|
PCT/IT2008/000360 | 5/30/2008 | WO | 00 | 3/2/2010 |