b is an example of a difficult to interpret spectrum.
The present invention provides methods for detection and transformation of multiply charged peaks into singly charged mono-isotopic peaks, removal of heavy isotopes, random noise removal, and bad spectra recognition. The approach is based on numerical spectral analysis and signal detection methods. These methods may be implemented in a computer program useful for proteomics procedures. The methods rely on application of tools derived from numerical mathematics for the processing of MS/MS spectra with the goal of improving the signal-to-noise ratio.
200 μg of purified anti-human Smc2 rabbit polyclonal antibody, cross-linked to AFFI-GEL® Protein A beads (100 μl bed-volume, Bio-Rad Laboratories, Hercules, Calif.), was used to immunoprecipitate the condensin complexes from 10 mg of clarified interphase HeLa cell extract. Following extensive washing, immunoprecipitated protein complexes were acid-eluted from the beads, and 10% of the total eluate was analysed by SDS-PAGE and silver staining. After reduction and alkylation of cysteine residues using dithiothreitol and iodoacetamide, respectively, the condensin sample was proteolytically digested using Trypsin Gold (Promega, Madison, Wis.), and the digestion was stopped with trifluoroacetic acid.
Tryptic peptides from condensin samples were separated by nano-HPLC on an UltiMate™ HPLC system and PepMap™ C18 column (Dionex-LC Packings, Sunnyvale, Calif.), with a gradient of 5-75% acetonitrile in 0.1% formic acid. Eluting peptides were introduced by electrospray ionisation (ESI) into an LTQ linear ion trap mass spectrometer (Thermo Electron Corporation, Waltham, Mass.), where full-MS and MS/MS spectra were recorded. In another experiment, a mixture of tryptic peptides from standard, commercially acquired bovine serum albumin (bovine, BSA), alcohol dehydrogenase (yeast, ADH), or transferrin (human, TRF) was used for system optimization and testing. 100 fmol of each protein was injected into a NanoHPLC (Dionex-LC Packings, Sunnyvale, Calif.) and MS/MS spectra were acquired using a 3D ion trap mass spectrometer, model LCQ DECA XP (Thermo Electron Corporation, Waltham, Mass.).
The MS/MS output, in the form of an Xcalibur raw-file, was converted into dta-files (53944 files in the case of the condensin sample) using BioWorks software (Thermo Electron Corporation, Waltham, Mass.). The dta-files were merged to generate a single mgf-file (“MASCOT generic format”) using the merge.pl program (Matrix Science Ltd, London, UK). This original mgf-file was then processed using the IMP MS CLEANER program with the default internal parameters, generating two mgf-files containing the cleaned and the bad spectra, respectively.
All three mgf-files (original and two processed) were used to perform MS/MS ion searches using MASCOT (Matrix Science Ltd, London, UK) on a local computing cluster, against the non-redundant database for the three test proteins or, in the case of the condensin sample, against a small curated protein database (146 sequences; 68753 residues), which includes components of the condensin, cohesin, and kinetochore complexes, as well as some common contaminants and trypsin. The MASCOT search parameters were the same in all runs (enzyme: trypsin; fixed modifications: carbamidomethyl (Cys); variable modifications: oxidation (Met); peptide charges: 1+, 2+ and 3+; mass values: monoisotopic; protein mass: unrestricted; peptide mass tolerance: ±3 Da; fragment mass tolerance: ±0.8 Da; max. missed cleavages: 1). The MASCOT search results output html-file was formatted with standard scoring, a significance threshold of p<0.05, and an ion score cut-off for each peptide of 30.
As stated above, for raw protein tandem MS/MS spectra, the present invention provides four independent procedures (i.e., algorithms): (i) detection (or de-convoluting) of multiply charged peaks, (ii) the removal of latent periodic noise including de-isotoping, (iii) the removal of high-frequency random noise, and (iv) the detection of non-interpretable spectra.
Although ionization techniques such as electrospray ionisation (ESI) have the advantage of shifting heavy ions into lower, detectable mass-over-charge ranges by generating multiply charged fragment ions, they can pollute the spectrum by causing replicates of otherwise identical ions at different charge states. In the general case, these multiply charged signals occur as isotope clusters. For the purpose of spectrum interpretation, peak replicates originating from different charge states have to be unified.
The relative spectral intensities of isotope-variant peaks in a cluster are determined by the natural isotope distributions of carbon, hydrogen, oxygen, nitrogen, and sulfur, the predominant chemical elements in peptide fragments. This a priori known form of the intensity pattern from multiply charged replicates is used for searching its re-occurrence in the measured spectrum by correlational analysis. The algorithm is quite robust relative to inaccuracies in the experimental resolution of isotope clusters due to two artifices in processing the mass spectrum: (i) the removal of small peaks very close to major intensities and (ii) the procedure of interpolated peak densification in the mass range of comparison with the predefined pattern.
The algorithm includes several steps (see also
M is the mass corresponding to the first, mono-isotopic peak in the cluster (n=1). The relative intensity of this peak is assumed to be 1. A(n) and B_j(n) are fitting parameters taken from Wehofsky's work. Depending on the charge state z, the mass distance between peaks in the pattern is 1/z Da. The pattern length is (N−1)/z Da. Finally, the pattern is complemented, i.e., densified with 20(N−1)/z−N+1 additional peaks (with a 0.05 Da mass step) whose intensity is linearly interpolated from the two surrounding pattern-defining peaks with masses M+(n−1)/z and M+n/z. The intensity patterns have been tabulated with an accuracy of 100 Da.
Every peak of the experimental spectrum is considered a potential starting point of an isotope cluster pattern. The mass window with the length of the target signal following each peak is densified with linearly interpolated additional peaks (at 0.05 Da steps) up to the last experimental peak in the window. Adding these interpolated peaks (essentially a transformation to a semi-analogue signal) compensates for possible small inaccuracies in resolving the position of isotope-variant peaks by the instrument's software. The correlation coefficient of the observed intensities with those from the pre-computed pattern is then calculated. Very high correlation (above 0.95, or even 0.99 in the case of very accurate data) indicates re-occurrence of the target signal in the pattern. Detected multiply charged peak clusters are removed and converted into a singly charged mono-isotopic peak that is added to the spectrum.
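The densification and correlation step can be illustrated with a minimal Python sketch. It is not the implementation of the invention: the function names, the relative intensities in `rel`, and the choice to assign zero intensity to grid points outside the peak range are assumptions made for illustration only; the 0.05 Da step and the 0.95 correlation threshold are the values named in the text.

```python
import math

STEP = 0.05  # Da, interpolation step named in the text


def densify(peaks, start, length, step=STEP):
    """Linearly interpolate (mass, intensity) peaks onto a regular grid.

    peaks must be sorted by strictly increasing mass. Grid points that are
    not bracketed by two peaks get intensity 0 (an assumption of this sketch).
    """
    eps = 1e-9
    n_points = int(round(length / step)) + 1
    out = []
    for k in range(n_points):
        x = start + k * step
        value = 0.0
        for (m0, i0), (m1, i1) in zip(peaks, peaks[1:]):
            if m0 - eps <= x <= m1 + eps:
                t = (x - m0) / (m1 - m0)
                value = i0 + t * (i1 - i0)
                break
        out.append(value)
    return out


def pearson(a, b):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0
    return cov / math.sqrt(var_a * var_b)


def pattern_peaks(start, rel_intensities, z):
    """Pattern-defining peaks spaced 1/z Da apart, starting at `start`."""
    return [(start + n / z, inten) for n, inten in enumerate(rel_intensities)]


# Hypothetical relative intensities; real values come from the tabulated patterns.
rel = [1.0, 0.6, 0.2]
z = 2
observed = [(500.0, 200.0), (500.5, 120.0), (501.0, 40.0)]
length = (len(rel) - 1) / z  # pattern length (N-1)/z Da

r = pearson(densify(observed, 500.0, length),
            densify(pattern_peaks(500.0, rel, z), 500.0, length))
is_cluster = r > 0.95  # threshold named in the text
```

In this toy case the observed window is a scaled copy of the pattern, so the correlation is essentially 1 and the window would be treated as a doubly charged cluster.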
This procedure works adequately as long as no very low-intensity peaks close to major intensities of an isotope cluster interfere (distance below ~0.2 Da, a measure of machine accuracy). These peaks are typically artifacts that can arise from random noise or from the transformation of the continuous MS/MS spectrum into the centroid form as a discrete signal. Prior to the spectrum densification, the small interfering peaks between main isotope cluster peaks have to be merged with the closest main peak in the cluster; i.e., this is essentially a procedure for reversing the creation of the small interfering peaks. For the peak-merging algorithm, a weighted directed graph G(V,E) is constructed. The set of vertices V is all mass-over-charge values in the window. An edge e_{i,j} ∈ E is added between two vertices v_i, v_j ∈ V if the distance d between peaks v_i and v_j is less than a user-defined value (~0.2 Da). The direction of the edge is defined to be from v_i to v_j if Intensity(v_i) < Intensity(v_j). The weight w_{i,j} of an edge e_{i,j} is defined as the distance between the two vertices v_i and v_j (in 0.01 Da units). If a node v_i giving origin to the edge e_{i,j} is actively removed from the graph (and its intensity is added to the node v_j), then edges to other nodes can also vanish. Via systematic enumeration (for example with topological ordering), an edge-free sub-graph can be computed without large computational cost that fulfills the condition that the sum of weights of actively removed edges is minimal.
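The effect of the merging step can be sketched in Python as follows. This is a simplified greedy stand-in, not the minimal-weight graph enumeration described above: each peak is visited from weakest to strongest and folded into its closest more intense neighbour within the distance limit. The function name and the example peak list are hypothetical.

```python
def merge_close_peaks(peaks, max_dist=0.2):
    """Merge weak peaks into the closest stronger peak within max_dist Da.

    peaks: list of (mz, intensity). Returns the merged list sorted by mz.
    Greedy approximation; the text describes an exact graph-based selection.
    """
    alive = {i: [mz, inten] for i, (mz, inten) in enumerate(peaks)}
    # Visit peaks from weakest to strongest so intensity flows "uphill",
    # mirroring the edge direction from lower to higher intensity.
    for i in sorted(alive, key=lambda k: alive[k][1]):
        mz_i, int_i = alive[i]
        targets = [j for j in alive
                   if j != i
                   and alive[j][1] > int_i
                   and abs(alive[j][0] - mz_i) < max_dist]
        if targets:
            j = min(targets, key=lambda k: abs(alive[k][0] - mz_i))
            alive[j][1] += int_i  # intensity of the removed node is added
            del alive[i]
    return sorted((mz, inten) for mz, inten in alive.values())


window = [(500.00, 100), (500.08, 3), (500.45, 2), (500.50, 60), (501.00, 20)]
merged = merge_close_peaks(window)
```

Here the two small satellites at 500.08 and 500.45 are absorbed by the main peaks at 500.00 and 500.50, leaving three cluster peaks for the subsequent densification.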
In light of the foregoing, referring to
Correlation of the measured MS/MS spectrum with pre-calculated isotopic intensity distributions is efficient only for multiply charged peak clusters since the probability of finding additional, unrelated peaks in the spectrum with distance of 1 Da is high. Therefore, correlation analysis with pre-defined patterns is not really useful for de-isotoping. But if an MS/MS spectrum is treated as a set of signals in time domain where the mass-over-charge axis is the analogue of time and intensity of each peak in MS/MS spectrum as the intensity of a signal at certain time, the single-charged peak signals can be considered as a periodical function (with periodicity of ˜1 Da for singly charged peaks). This periodical function in time domain results in a periodical function in the power spectrum where the reoccurring elements can be recognized more easily.
Besides isotope variants, there can be other sources of spectral contamination with latent periodicity, for example, from the detection system or from accompanying chemical polymer contaminants such as silanes, etc. Re-occurring signals at quasi-constant mass shifts can be seen in the frequency domain, i.e., as characteristic reoccurrences of high amplitudes at multiples of a base frequency in the Fourier transform of the tandem mass spectrum. Performance of yet another Fourier transformation applied at the frequency domain level can be used to determine this base frequency. Suppression of intensities in protein tandem mass spectra arising from these periodicities effectively removes latent periodical noise including minor isotope variant peaks (
Converting to the frequency domain, the discrete Fourier transform Y of the MS/MS spectrum (S) is found by taking the N-point fast Fourier transform Y=FFT(S,N). The value N is calculated as N=2^(n+1), where n is the smallest integer larger than log_2[(x_max−x_min)/0.05]. The values x_max and x_min are the largest and the smallest mass-over-charge values in the spectrum, respectively. The first power spectrum PS, a measurement of the power at various frequencies, is PS=Y·Y*/N (see
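A minimal Python sketch of this conversion is given below. It reads the transform length as a power of two, N = 2^(n+1), and assumes that the spectrum is first placed on a uniform 0.05 Da grid by rounding each mass-over-charge value to the nearest grid point; that gridding rule, the function names, and the pure-Python radix-2 FFT are illustrative assumptions rather than details taken from the text.

```python
import cmath
import math


def fft_length(x_min, x_max, step=0.05):
    """N = 2**(n+1), n = smallest integer larger than log2((x_max-x_min)/step)."""
    n = math.floor(math.log2((x_max - x_min) / step)) + 1
    return 2 ** (n + 1)


def to_grid(peaks, x_min, n_points, step=0.05):
    """Place (mz, intensity) peaks on a uniform grid (assumed sampling rule)."""
    grid = [0.0] * n_points
    for mz, inten in peaks:
        idx = int(round((mz - x_min) / step))
        if 0 <= idx < n_points:
            grid[idx] += inten
    return grid


def fft(x):
    """Recursive radix-2 FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return list(x)
    even, odd = fft(x[0::2]), fft(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        tw = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + tw
        out[k + n // 2] = even[k] - tw
    return out


def power_spectrum(signal, n_points):
    """PS = Y * conj(Y) / N for the zero-padded N-point transform."""
    padded = list(signal[:n_points]) + [0.0] * max(0, n_points - len(signal))
    return [(v * v.conjugate()).real / n_points for v in fft(padded)]


# A signal repeating every 4 samples concentrates its power at multiples of N/4.
ps = power_spectrum([1.0, 0.0, 0.0, 0.0] * 4, 16)
```

For a spectrum spanning 100 to 1700 Da, `fft_length(100.0, 1700.0)` gives 65536 points, and on the 0.05 Da grid a 1 Da isotope spacing corresponds to 20 samples.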
In order to remove the reoccurring elements from the first power spectrum, a multi-band reject filter has to be created for each MS/MS spectrum. The filter is created by the Yulewalk method of autoregressive moving average (ARMA) spectral estimation. Yulewalk designs recursive infinite impulse response (IIR) digital filters using a least squares fit to a specified frequency response. Frequencies required by the Yulewalk method are calculated by applying a median filter to the power spectrum (over 300-500 discrete data points) and by computing a second power spectrum (PSPS) in order to get the most prominent frequency of the first power spectrum. The created IIR filter is used to filter the MS/MS spectrum in time domain. After filtering, the recovered MS/MS spectrum might contain some signals with negative intensity or some new signals with positive intensity. Also, some signals from the original raw spectrum lose considerable intensity (threshold of 95%; this number should be higher for very clean and regular spectra). All three types of signals are corrected to zero in a final step.
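The final correction step can be sketched in Python as follows. The sketch covers only the zeroing of the three signal types and takes the filtered spectrum as given; the filter design itself is not reproduced. The function name is hypothetical, and keeping the raw intensity for surviving peaks is an assumption, since the text does not state which intensity is retained.

```python
def correct_filtered(raw, filtered, loss_threshold=0.95):
    """Zero out artifacts left after filtering a gridded spectrum.

    raw, filtered: equally long intensity sequences on the same grid.
    A grid point is set to zero if the filtered value is negative, if the
    raw value was zero (a "new" signal), or if the relative intensity loss
    reaches loss_threshold. Surviving points keep their raw intensity
    (an assumption of this sketch).
    """
    cleaned = []
    for r, f in zip(raw, filtered):
        if f < 0:                      # negative intensity after filtering
            cleaned.append(0.0)
        elif r == 0:                   # signal that did not exist in the raw data
            cleaned.append(0.0)
        elif (r - f) / r >= loss_threshold:  # lost 95% or more of its intensity
            cleaned.append(0.0)
        else:
            cleaned.append(r)
    return cleaned


example = correct_filtered([100, 0, 50, 80], [90, 5, -3, 2])
```

In the example, only the first peak survives: the second is a new signal, the third turned negative, and the fourth lost 97.5% of its intensity.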
Examination of exemplary spectra has shown that suppression of latent periodicities in the MS/MS spectrum effectively also removes low-intensity peaks originating from higher mass isotopes in isotope clusters (see
In light of the foregoing, referring to
Assuming that the random noise in an MS/MS spectrum exists as signals of high frequency of occurrence, a low-pass filter (e.g., a Butterworth IIR filter) is applied to the spectrum in time domain. The normalized stop frequency of the filter is in the range from 0.5 to 0.9 (the best result was obtained with a stop frequency of 0.8). An empirical threshold of 99.99% is applied: all signals that have lost intensity above this threshold are removed from the raw spectrum.
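As an illustration, the following Python sketch applies a low-pass filter to a gridded spectrum and removes the signals that are almost completely suppressed. The text does not specify the filter order or whether the filter is run once or forwards and backwards; a first-order Butterworth section (derived by the bilinear transform) applied in a single causal pass is assumed here purely to keep the example self-contained, and the function names are hypothetical.

```python
import math


def butter1_lowpass(signal, wn=0.8):
    """First-order Butterworth low-pass; wn is the cut-off as a fraction of Nyquist.

    Difference equation from the bilinear transform with pre-warping:
        y[n] = b * (x[n] + x[n-1]) - a1 * y[n-1]
    """
    k = math.tan(math.pi * wn / 2.0)
    b = k / (k + 1.0)
    a1 = (k - 1.0) / (k + 1.0)
    out, x_prev, y_prev = [], 0.0, 0.0
    for x in signal:
        y = b * (x + x_prev) - a1 * y_prev
        out.append(y)
        x_prev, y_prev = x, y
    return out


def remove_random_noise(raw, wn=0.8, loss_threshold=0.9999):
    """Zero raw signals whose relative intensity loss reaches loss_threshold."""
    filtered = butter1_lowpass(raw, wn)
    return [r if r > 0 and (r - f) / r < loss_threshold else 0.0
            for r, f in zip(raw, filtered)]


cleaned = remove_random_noise([0.0, 100.0, 0.0, 1.0])
```

With these assumptions the strong peak is kept, while the weak signal two grid points later is driven below zero by the filter response and is therefore removed.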
Power spectrum analysis of MS/MS spectra also indicates a criterion that can be used for the identification of bad spectra which are not useful for further study. Two types of irregularities are observed that coincide with hard-to-interpret protein MS/MS spectra: (i) the first power spectrum can exhibit very low amplitudes for low frequencies, and (ii) finding the most prominent frequency in the second power spectrum can be ambiguous (several similarly high peaks).
With the base frequency derived from the second power spectrum (PSPS), it is possible to compute the position of expected maxima and minima in the first power spectrum (PS) and determine whether the real minima and maxima within periods are, on average, closer to the expected positions or closer to the positions with the shift of half a period. If the spectrum is shifted (i.e., if the sum of distances of real maxima and minima from their expected positions is larger than that of the positions with a shift of half a period) away from the expected position of minima/maxima, the procedure for de-isotoping is halted because large spectral shifts away from expected minima/maxima often indicate bad spectra.
For making an appropriate decision, the periodicity of the spectrum is also tested with a similarly elementary criterion as the shift. This is tested with the coefficient of dispersion (Cd) of peak distances in the first power spectrum, calculated as a ratio of standard deviation (s) and the mean value (
A Cd close to zero indicates good coincidence of distances between maxima (and, respectively, minima) of consecutive periods with the expected distance (equal to the period length). Large values of Cd signal distorted periodicity in the power spectrum and a periodicity model appears not applicable. Such spectra are returned to further processing without removal of latent periodic noise.
The case of quasi-periodic but shifted spectra is more complicated. In such a situation, if the coefficient of dispersion is not larger than 3.3 (an empirically derived threshold), the algorithm predicts that the respective MS/MS spectra cannot be reliably analyzed with interpretation software. As will be shown below, spectra flagged with this criterion are indeed not well interpretable even with database search-based software (i.e., no protein hits are found or only hits with very low reliability).
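The two elementary criteria, the half-period shift and the coefficient of dispersion, can be sketched in Python as follows. The sketch assumes that expected maxima lie at integer multiples of the period (plus an optional offset), uses the sample standard deviation for s, and treats the 3.3 threshold exactly as stated above; the function names are hypothetical.

```python
import statistics


def is_shifted(observed, period, offset=0.0):
    """True if observed extrema lie closer to half-period-shifted positions."""
    def dist_to_grid(x, off):
        r = (x - off) % period
        return min(r, period - r)

    expected = sum(dist_to_grid(x, offset) for x in observed)
    shifted = sum(dist_to_grid(x, offset + period / 2.0) for x in observed)
    return expected > shifted


def coefficient_of_dispersion(positions):
    """Cd = s / mean of the distances between consecutive extrema."""
    distances = [b - a for a, b in zip(positions, positions[1:])]
    return statistics.stdev(distances) / statistics.mean(distances)


def is_non_interpretable(observed, period, cd_threshold=3.3):
    """Quasi-periodic but shifted spectrum, as described in the text."""
    return (is_shifted(observed, period)
            and coefficient_of_dispersion(observed) <= cd_threshold)


regular = [10.0, 20.0, 30.0, 40.0]   # maxima at the expected positions
offset = [5.0, 15.0, 25.0, 35.0]     # same spacing, shifted by half a period
```

Both example lists have perfectly regular spacing (Cd of zero), but only the second one is shifted by half a period and would therefore be flagged.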
Referring to
In contrast, referring to
Sequence ladder testing is a simple and efficient alternative with virtually no false positives. At the same time, the rate of spectra recognized as non-interpretable in form of peptide sequences increases up to the order of ˜70%.
Peptide samples that are to be analysed by tandem mass spectrometry often contain other compounds that are not of protein origin. These compounds are different polymers and other impurities that arise as artefacts of the preparation methods. Although these compounds occur in small concentrations, the high sensitivity of modern mass spectrometers allows their detection. The presence of a large number of these unusable non-peptide spectra in the resulting set of mass spectra inordinately consumes CPU time in attempts to interpret them as peptide fragments.
The MS/MS spectra that originate from peptides can be distinguished from non-peptide spectra by the presence of a ladder of peaks with characteristic distances between them, namely the amino acid residue mass. If a spectrum doesn't contain a reliable number of peaks that form an amino acid sequence ladder, this spectrum can be considered as bad, and can be removed from the set of spectra that is to be used for interpretation.
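A minimal Python sketch of such a ladder test is shown below. It assumes singly charged fragment peaks, standard monoisotopic residue masses of the unmodified amino acids, a mass tolerance of 0.5 Da, and a minimum ladder length of three peaks; the tolerance, the minimum length, and the function names are illustrative assumptions, not parameters taken from the text.

```python
# Monoisotopic residue masses in Da (Leu and Ile share one entry).
RESIDUE_MASSES = [
    57.02146, 71.03711, 87.03203, 97.05276, 99.06841, 101.04768,
    103.00919, 113.08406, 114.04293, 115.02694, 128.05858, 128.09496,
    129.04259, 131.04049, 137.05891, 147.06841, 156.10111, 163.06333,
    186.07931,
]


def longest_ladder(mzs, tol=0.5):
    """Length (in peaks) of the longest chain of residue-mass spacings."""
    mzs = sorted(mzs)
    best = [1] * len(mzs)  # longest ladder ending at each peak
    for j in range(len(mzs)):
        for i in range(j):
            diff = mzs[j] - mzs[i]
            if any(abs(diff - m) <= tol for m in RESIDUE_MASSES):
                best[j] = max(best[j], best[i] + 1)
    return max(best, default=0)


def is_peptide_like(mzs, min_ladder=3, tol=0.5):
    """True if the spectrum contains a ladder of at least min_ladder peaks."""
    return longest_ladder(mzs, tol) >= min_ladder


peptide_peaks = [175.119, 246.156, 303.177, 416.261]   # spacings A, G, L/I
polymer_peaks = [100.0, 144.026, 188.052, 232.078]     # constant 44.026 Da spacing
```

The first list forms a four-peak ladder and passes the test, whereas the evenly spaced polymer-like series contains no residue-mass spacing and would be discarded.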
Therefore, referring to
To test the algorithms of the present invention in large-scale practical applications, MS/MS spectra from protein samples with known composition were used. Such spectra are produced for the purpose of quality control of MS instrumentation with low concentrations (100 fmol) of BSA, ADH, or TRF. It should be noted that low concentrations of proteins are used intentionally in order to achieve limiting cases of mass spectra. The results of applying the background removal procedure are presented in Tables 4A and 4B hereinbelow. First, it is evident that protein hits are found from the cleaned MS/MS spectra with considerably increased scores. This is evident for the total protein score (an increase between 10% and 15%, see Table 4A). Scores improve for the majority of all leading peptide hits (about 70%, see Table 4B). A decrease is observed in about 10% of cases, but it did not affect the interpretation except for one case (see below). In general, the likelihood of retrieving the sample protein and the sequence coverage improve (see Table 4A).
MS/MS spectra considered non-interpretable by use of the current invention are indeed bad spectra. In only one out of 626 cases was the original protein recovered by MASCOT. Here, MASCOT assigned a score of 64 (see Table 4A). This height appears unjustified upon visual inspection of the spectrum, because there are almost no significant peaks above background. In contrast, there are a considerable number of spectra (about 10%) that become interpretable for MASCOT only after background removal with our procedures (5 for BSA, 1 for ADH, 8 for TRF, see Table 4B).
An example is shown in
Referring again to Tables 4A and 4B, the MS/MS spectra were interpreted with MASCOT directly (“raw spectra”) or after processing with the background removal procedure (“cleaned spectra”) described in this article. The “score” is the MASCOT score from all successful searches, “match” is the number of searches that recover the peptides from the protein used, and “cov %” reports the sequence coverage. The line “bad spectra” reports the number of files that are considered non-interpretable by the criterion described in the text (n/a=non-applicable). In only one case could MASCOT recognize a peptide from the original protein in a bad spectrum, but with extremely low score.
As can be seen from the data in Table 5, the spectral-analytic criteria (removal of latent periodic and high-frequency noise) are most efficient in reducing the background since their share among the removed peaks is above 90%. In the BSA, ADH, and TRF applications, about 15% of all peaks in the original spectra get removed by our program and the file storage requirement is reduced by the same amount.
Four sources contribute to the peak removal: (1) the number of peaks merged at the start, where all peaks with a spacing smaller than the user-defined accuracy (default: 0.25 Da) are merged; (2) the number of peaks removed by the periodic noise detection procedure (including de-isotoping); (3) the number of peaks identified by the de-convolution of multiply charged replicates; and (4) the number of peaks found by the routine for high-frequency noise removal. Again, it can be seen that the spectral-analytic criteria are most efficient in background reduction. The last three columns present the original spectra (5), the number of peaks removed (6), and the percentage of the total number of peaks (7). Some procedures identify the same peaks as noise. To assess this effect, the arithmetic sum of the numbers from all noise reduction procedures (1-4) is given in parentheses in column 6.
The computational performance of the algorithms of the present invention (denoted IMP MS CLEANER) was tested on a stand-alone PC (under the WINDOWS XP operating system). For the BSA case, 2679 dta-files were cleaned in 4:52 min (0.11 sec per spectrum). The MASCOT time on the same machine reduced from 64 min (for the untreated data) to 57 min (cleaned files). The respective numbers for ADH (2325 files) and TRF (2608 files) are 5:36 (0.14 sec per file), 75, 64 and 4:15 (0.10 sec per file), 58, 50 (all values in minutes). Thus, savings of computational costs are considerable under the condition of increased reliability of spectrum interpretation.
For exemplifying the algorithm for recognizing non-interpretable spectra according to the present invention, the analysis of condensin complex mass spectra is an even more realistic application compared to the analysis of protein samples because, in the latter example, low concentrations of proteins are intentionally applied to achieve limiting cases of mass spectra.
So, for this purpose of analyzing condensin complex mass spectra, the condensin complexes were purified from cultured human HeLa cells and analyzed. Human cells contain two distinct condensin complexes, called condensin I and condensin II, which bind chromosomes specifically in mitosis and contribute to their condensation and structural integrity. Both complexes are hetero-oligomers composed of five subunits. Two ATPase subunits of the structural maintenance of chromosome (SMC) family, called Smc2 and Smc4, are shared between condensin I and condensin II. In addition, each complex contains a set of distinct non-SMC subunits, called kleisin-γ, CAP-G, and CAP-D2 in the case of condensin I, and kleisin-β, CAP-G2, and CAP-D3 in the case of condensin II. Both complexes were immunopurified simultaneously using antibodies to their common Smc2 subunit, and the resulting sample was analyzed both by SDS-PAGE and silver staining (
This MS/MS spectrum is from the condensin sample shown in
A summary of the MASCOT search results for the same experiment is shown in Table 6 hereinbelow. Each of the eight condensin subunits showed an increase in MASCOT score (mean increase of 8.2%) and in the number of peptide matches (mean increase of 4.8%) following the cleaning procedure. As a rule, the percentage sequence coverage obtained was the same or higher for searches using the cleaned spectra than for those using the original spectra. The one exception was kleisin-β, which showed a 2% reduction in the sequence coverage after cleaning. Closer inspection revealed that this reduction was due to one peptide match, which is generated by a single MS/MS spectrum that visually appears to be of low quality. This MS/MS spectrum has very few significant peaks above the baseline, and is classified as ‘non-interpretable’ by the IMP MS Cleaner. However, MASCOT generated a match between this spectrum and the peptide QGEVLASR (within kleisin-β). With a surprisingly high MASCOT score of 45, it was classified as a hit, although the majority of the significant hits do not contribute to this interpretation. Thus, in this case, the removal of just a single non-reliable peptide during the cleaning process resulted in a small reduction in sequence coverage, although the MASCOT score for the protein as a whole was increased as a result of background removal. It should be noted that all cases of peptide detection by MASCOT in spectra classified as non-interpretable by the algorithm of the present invention (14 out of 1318 files) lead to low scores with marginal sequence coverage by MASCOT when there are very few significant peaks above an apparent noise.
The MS/MS spectra were interpreted with MASCOT directly (“raw spectra” from 53944 files totalling 460 MB) or after processing with the background removal procedure (“cleaned spectra” from 52626 files totalling 284 MB) described in this article. The “score” is the MASCOT score from successful searches; “match” is the number of searches that recover the peptides from the protein used; “cov %” reports the sequence coverage. The columns “bad spectra” report cases of files (among 1318 files totalling 7 MB) that are considered non-interpretable by the criterion described in the text (n/a=non-applicable) where MASCOT could, nevertheless, recognize the original protein but with extremely low score and sequence coverage.
In a practical setup, the computational efficiency is also important. IMP MS CLEANER processed the 53944 spectra from the condensin experiment in less than 4 hours on a single standard PC, i.e., in 0.25 seconds per file. The application of the background removal procedure reduces the pure MASCOT computing time for the body of 53944 dta-files in the condensin complex case by about 25%, even in the case of a small database of 146 sequences; the size of the cleaned mgf-file is decreased by 39%. Therefore, application of the IMP MS CLEANER significantly reduces consumption of computing time and storage.
The background from multiply charged replicates, isotope variants, sample-specific and systematic contaminations, and the noise from the electronic detection system create a considerable problem during mass spectrum interpretation. Computation time is wasted for non-interpretable spectra and background peaks occupy a significant share of the storage capacity for mass-spectrometric data. Background removal according to the present invention improves reliability of hit assignments by database search-based methods considerably.