The present invention relates to methods of and systems for obtaining and analyzing spectra of ion products generated from one or more precursor ions.
Structural elucidation of ionized molecules of complex structure, such as proteins, is often carried out using a tandem mass spectrometer, where a particular precursor ion is selected at the first stage of analysis or in a first mass analyzer, the precursor ions are subjected to fragmentation (e.g., in a collision cell), and the resulting fragment (product) ions are transported for analysis in the second stage or second mass analyzer. The method can be extended to provide fragmentation of a selected fragment, and so on, with analysis of the resulting fragments for each generation. This is typically referred to as MSn spectrometry, with n indicating the number of steps of mass analysis and the number of generations of ions. Accordingly, MS2 mass analysis (also known as MS/MS mass analysis) corresponds to two stages of mass analysis with two generations of ions analyzed (precursor and products). A resulting product spectrum exhibits a set of fragmentation peaks (a fragment set) which, in many instances, may be used as a fingerprint to derive structural information relating to the parent peptide or protein.
There is currently a trend towards full-scan MS experiments coupled with “all-ions” fragmentation. Such full-scan approaches utilize high performance time-of-flight (TOF) or electrostatic trap (such as Orbitrap®-type) mass spectrometers—possibly coupled to UHPLC columns—and can facilitate rapid and sensitive detection and/or quantitative screening of analytes. The superior resolving power of the Orbitrap® mass spectrometer (up to 100,000 FWHM) compared to TOF instruments (10,000-20,000) ensures the high mass accuracy required for complex sample analysis.
An example of a mass spectrometer system 15 comprising an electrostatic trap mass analyzer such as an Orbitrap® mass analyzer 25 is shown in
The system 15 (
Higher-energy collisional dissociation (HCD) may take place in the system 15 as follows: Ions are transferred to the curved quadrupole trap 18. The curved quadrupole trap is held at ground potential. For HCD, ions are emitted from the curved quadrupole trap 18 to the octopole of the reaction cell 23 by setting a voltage on a trap lens. Ions collide with the gas in the reaction cell 23 at an experimentally variable energy which may be represented as a relative energy depending on the ion mass, charge, and also the nature of the collision gas (i.e., a normalized collision energy). Thereafter, the product ions are transferred from the reaction cell back to the curved quadrupole trap by raising the potential of the octopole. A short time delay (for instance 30 ms) is used to ensure that all of the ions are transferred. In the final step, ions are ejected from the curved quadrupole trap 18 into the Orbitrap® mass analyzer 25 as described previously.
The mass spectrometer system 15 illustrated in
The system 15 shown in
An MS2 spectrum (a spectrum of fragment or product ions) can provide rich information about the covalent structure of an isolated precursor molecule. The information content is very high in that the MS2 spectrum from one isolated precursor is typically quite different from that of another isolated precursor; furthermore, the MS2 spectrum of a given precursor is highly reproducible. Therefore, in most cases, it is unlikely that (unintentional) experimental and measurement variations in acquiring MS2 spectra would cause one precursor to be mistaken for another. Precursors with similar product ion spectra can often be distinguished by precursor mass or chromatographic retention time. If these additional attributes are insufficient for discrimination, then acquisition of product ion trees (i.e. MS3 and beyond) would be required.
Despite the apparently very high information content of MS2 spectra, the success rate for identifying molecules is very low, ranging from 10-30%. A number of factors may explain the low success rate, including incomplete or poorly curated databases and inadequate software. The spectral decomposition method described herein tolerates an incomplete database and is capable of finding components in a product ion spectrum that exist in the database. A related method of automated database curation adds new entries to the database when it can be determined that the product ion spectrum contains additional components that have not been observed previously.
A more fundamental problem than database quality or completeness is that most software for interpreting product ion spectra begins with the assumption that the observed products are derived from a single isolated precursor. Recent studies have shown that, in typical proteomic studies, the vast majority of product ion spectra are derived from mixtures of precursor ions. Even software packages that address demultiplexing of multiple precursors are closely derived from single-precursor algorithms, with heuristic subtraction-based, computationally-intensive approaches to sequential discovery of multiple precursors. The state-of-the-art in demultiplexing product ion spectra is limited to identification of at most two or three precursors and with very limited abundance dynamic range.
Typical approaches to interpreting product ion spectra consider only the masses of the product ions, and not their relative intensities. Intensity information has been excluded from conventional analyses because it is difficult to predict relative product ion intensities de novo. For an approach where the product ion spectra of all potential precursors are assumed to be stored in a library, it is not necessary to predict intensities. Instead, the requirement is that product ion intensities be reproducible. On modern instruments, where acquisition parameters such as collision gas pressure and collision energy are standardized, product ion intensities are highly reproducible. High reproducibility allows quantitatively accurate interpretation of mixed product ion spectra. The inventors have discovered that the product- or fragment-ion intensities provide a very significant amount of information and are quite reproducible on a given instrument using invariant (i.e. standardized) acquisition parameters.
Disclosed herein is a linear-algebraic approach to analyzing and interpreting multiplex MS2 spectra (fragment- or product-ion spectra) using a spectral library. Also disclosed is a method for automatically constructing a spectral library. In some embodiments, the automatic spectral library construction may be accomplished as an adjunct of the automatic analysis and interpretation process. Specifically, an input spectrum may be decomposed into components of the library, such components comprising previously observed product ion spectra from the library. The residue, or product ions in the observed spectra that cannot be explained in terms of existing library entries, may comprise the basis for adding new entries to the library.
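The decompose-then-extend idea described above can be sketched as follows. This is only a minimal illustration under stated assumptions: library spectra are stored as columns of a NumPy array on a common m/z grid, the function names and the relative tolerance `rel_tol` are hypothetical, and an ordinary least-squares fit stands in for the decomposition. Note that a least-squares residual is orthogonal to the existing library columns, so it approximates the spectrum of a new component only where that component does not overlap existing entries.

```python
import numpy as np

def decompose_with_residual(D, s):
    """Fit s as a linear combination of the library columns of D."""
    x_hat, *_ = np.linalg.lstsq(D, s, rcond=None)
    residual = s - D @ x_hat          # part of s the library cannot explain
    return x_hat, residual

def maybe_extend_library(D, s, rel_tol=0.1):
    """Append the residual as a new library column if it is large enough.

    rel_tol is an assumed acceptance criterion, not taken from the text.
    """
    x_hat, residual = decompose_with_residual(D, s)
    if np.linalg.norm(residual) > rel_tol * np.linalg.norm(s):
        new_entry = np.clip(residual, 0.0, None)   # intensities are non-negative
        D = np.column_stack([D, new_entry])
    return D, x_hat

# Toy example: one known library spectrum plus an unrecorded component.
known = np.array([1.0, 0.0, 0.0, 0.0])
library = known.reshape(-1, 1)
observed = 2.0 * known + np.array([0.0, 0.0, 3.0, 1.0])
library, abundances = maybe_extend_library(library, observed)
```

In this toy case the observed spectrum contains peaks at positions the library cannot explain, so a second column is appended while the abundance of the known entry is still estimated.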
As the term is used herein, a “multiplex” MS2 spectrum contains fragments that arise from a mixture of precursors, in contrast to a “pure” MS2 spectrum in which the fragments come from a single isolated precursor. Further, in the context of this disclosure, “spectral interpretation” means estimating the relative abundance of each precursor represented in the multiplex spectrum. The precursors are assumed to be represented within entries in a database of observed MS2 spectra. Most product ion spectra are interpreted as linear superpositions of library spectra; however, some product ion spectra contain contributions from previously unrecorded precursors that can also be discovered during analysis.
Acquisition of multiplexed product ion spectra may be intentional or unintentional. In some cases, multiple precursors may be isolated and combined prior to fragmentation in order to exploit the channel bandwidth of a high-resolution mass analyzer and thereby increase analytical throughput. Multiplexing may be said to be unintentional when a single isolation window (e.g. 1 Da) happens to contain multiple precursor ions in addition to the one being targeted for selection and fragmentation. These additional precursors may be below the limit of detection in an MS1 (precursor ion) spectrum and yet be detectable in MS2 spectra as a consequence of isolation because isolating narrow mass ranges typically involves significantly longer ion accumulation. Alternatively, a single detected peak in an MS1 spectrum may be hiding multiple precursors with mass differences too small to be resolved, or in fact, structural isomers of identical mass. Therefore, the methods described herein make no assumption about the number of precursors. Instead, all candidate precursors within a given mass range are assumed to be present and their intensities (most often zero) are estimated. If, in fact, only one precursor is represented in the MS2 spectrum, the algorithm will work as expected: identifying that precursor by assigning (essentially) zero abundance to all other candidates.
Optimal estimates of precursor abundances are determined from an observed MS2 spectrum by solving a linear matrix equation. Typically, only a few precursor ion candidates are present. If a candidate is not present, its estimated intensity is expected to be near zero. In general, the threshold for discriminating low abundance precursors from precursors that are, in fact, absent, is determined by measurement noise, reproducibility of the database entries, and the similarity among these entries. In various embodiments, the disclosed methods may consider a precursor whose estimated intensity falls below this threshold to be absent, or more precisely, not detected. A candidate that is not represented will have an abundance estimate that is not significantly different from zero. An appropriately chosen threshold is used to eliminate such ions from consideration. In this context, the interpretation can be viewed as precursor ion identification, but with the generalization that it can make multiple simultaneous identifications as well as determining the relative abundances of the precursors. The utility of multiple identification increases dramatically in complex samples, such as proteomic digests.
Accordingly, in one aspect of the present teachings, there is disclosed a method of acquiring and interpreting data using (i) a mass spectrometer system operated according to a set of operating conditions and (ii) a mass spectral library having a plurality of library entries derived from data previously obtained using said mass spectrometer system operated according to said set of operating conditions, the data relating to a plurality of chemical compounds, said method comprising: (a) generating a multiplexed mass spectrum using the mass spectrometer system, the multiplexed mass spectrum comprising a superposition of a plurality of product-ion mass spectra comprising a plurality of product-ion types, each product-ion mass spectrum corresponding to fragmentation of a respective precursor-ion type formed by ionization of the plurality of chemical compounds, each precursor-ion type having a respective precursor-ion mass-to-charge (m/z) ratio and each product ion type having a respective product-ion m/z ratio; (b) decomposing the multiplexed product-ion mass spectrum so as to calculate relative abundances of previously-observed product-ion mass spectra within the multiplexed product-ion mass spectrum, the decomposing employing the mass-spectral library. Residual spectral components below a threshold level may be assigned an abundance of zero.
In another aspect, there is disclosed an apparatus comprising: (a) a mass spectrometer comprising: an ion source operable to generate precursor ions from a sample, the precursor ions comprising a plurality of precursor-ion types comprising respective mass-to-charge (m/z) ratios; a fragmentation device operable to fragment the plurality of precursor ion types so as to generate product ions comprising a plurality of respective m/z ratios; a mass analyzer operable to separate or discriminate the plurality of precursor-ion or product-ion types according to their respective m/z ratios; and a detector operable to detect the separated or discriminated product ion types and measure the detected intensities thereof; (b) a programmable electronic processor electronically coupled to the mass spectrometer; and (c) a data storage apparatus electronically coupled to the programmable electronic processor and storing thereon a mass spectral library, each entry of which corresponds to a respective precursor-ion type previously observed by the apparatus, wherein each entry comprises a plurality of intensity values corresponding to a previously observed product-ion spectrum produced by fragmentation of the respective previously observed precursor ion type, wherein the programmable processor is configured to: calculate a relative abundance of each previously observed precursor ion type within the plurality of precursor ion types using the m/z ratio and detected intensity of each product ion type and at least a portion of the entries within the mass spectral library.
The present teachings assume that a library—possibly an incomplete library—of observed MS2 spectra has been compiled previously. The library may be updated and, in fact, may be constantly updated as new precursors are discovered as an adjunct to the interpretation process. The methods taught herein do not require that all components in the library have been annotated with identifications. When a component identified in a product ion spectrum is matched to a library entry to which an annotation (such as a compound name) is attached, the component is said to be identified. Alternatively, if a component is matched to a library entry to which no annotation is attached, the component is said to be matched. In the case where the library entry is thought to be significant (for example, its abundance discriminates between two patient groups as a putative biomarker), an offline process can be used to identify the entry and to add an annotation.
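The distinction between "identified" and "matched" components can be illustrated with a small data structure. The class and field names below are hypothetical and serve only to make the terminology concrete; the text does not prescribe any particular library format.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class LibraryEntry:
    """One library spectrum; the annotation is optional (hypothetical layout)."""
    entry_id: int
    annotation: Optional[str] = None   # e.g. a compound name, if known

def component_status(entry: LibraryEntry) -> str:
    """Return 'identified' for annotated entries and 'matched' otherwise."""
    return "identified" if entry.annotation else "matched"

annotated = LibraryEntry(entry_id=1, annotation="example compound")
unannotated = LibraryEntry(entry_id=2)
```

An unannotated entry that later proves significant could have its `annotation` field filled in by the offline process mentioned above.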
The above noted and various other aspects of the present invention will become apparent from the following description which is given by way of example only and with reference to the accompanying drawings, not drawn to scale, in which:
The following description is presented to enable any person skilled in the art to make and use the invention, and is provided in the context of a particular application and its requirements. Various modifications to the described embodiments will be readily apparent to those skilled in the art and the generic principles herein may be applied to other embodiments. Thus, the present invention is not intended to be limited to the embodiments and examples shown but is to be accorded the widest possible scope in accordance with the features and principles shown and described. To fully appreciate the features of the present invention in greater detail, please refer to
Still referring to
The programmable processor shown in
In the hypothetical example illustrated in
In the novel algorithms described herein, the inventors present a mathematical model for multiplex MS2 spectra, pose a mathematical problem, and offer the solution to that problem. The model for a product ion spectrum resulting from a mixture of precursors is a corresponding mixture of product ion spectra from the isolated precursors. This model has the following linearity property. Let X denote a mixture of precursors A, B, and C in the following proportions: $x_A$ parts of A, $x_B$ parts of B, and $x_C$ parts of C. We use the following notation to represent this mixture: $X = x_A A + x_B B + x_C C$. Let s(X) denote the “ideal” product ion spectrum obtained from X. In our model, $s(X) = x_A s(A) + x_B s(B) + x_C s(C)$, where s(A), s(B), and s(C) represent the “ideal” product ion spectra that would be obtained from the isolated (pure) precursors A, B, and C. Each of the “ideal” product ion spectra can be thought of as a vector, and so multiplication and addition have their usual intuitive meanings in the above equation.
Now suppose we are presented with a product ion spectrum S. We wish to interpret the product ion spectrum as a mixture of precursors A, B, and C. Then, we suppose that S=s(X), where $X = x_A A + x_B B + x_C C$, i.e. an arbitrary mixture of these precursors. In this case, we do not have ideal product ion spectra s(A), s(B), and s(C). Instead, we assume that we have measured product ion spectra of isolated precursors A, B, C, which we will denote by d(A), d(B), d(C). We use the symbol d to suggest that these product ion spectra reside in a database. Given d(A), d(B), and d(C), we can generate arbitrary mixtures of product ion spectra: $S' = x'_A d(A) + x'_B d(B) + x'_C d(C)$, by choosing arbitrary values for coefficients $x'_A$, $x'_B$, and $x'_C$. Specifically, we wish to find coefficients $x'_A$, $x'_B$, and $x'_C$ that minimize the difference between observed product ion spectrum S and our model mixture spectrum S′. Considering S and S′ to be vectors, the difference has the geometric interpretation as the length of the difference between two vectors. Geometrically, the set of vectors S′ that can be produced from d(A), d(B), and d(C) can be thought of as a hyperplane, where $(x'_A, x'_B, x'_C)$ identify a point in this hyperplane. Let $(\hat{x}_A, \hat{x}_B, \hat{x}_C)$ denote the optimal values of $(x'_A, x'_B, x'_C)$, the coefficient values that minimize the difference between S and S′. According to our model, these optimal coefficients represent the estimates of the coefficients that characterize unknown X that gave rise to product-ion spectrum S as a mixture of precursors A, B, and C. The value of $(\hat{x}_A, \hat{x}_B, \hat{x}_C)$ is simply the projection of S onto the hyperplane determined by vectors d(A), d(B), d(C) and can be calculated by solving a linear matrix equation as shown below.
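The projection described above can be illustrated numerically. The following is a minimal sketch, assuming three made-up library spectra on a six-point m/z grid and a noise-free observed spectrum; the intensity values are hypothetical and `numpy.linalg.lstsq` is used as a generic least-squares solver.

```python
import numpy as np

# Hypothetical library spectra d(A), d(B), d(C) on a common 6-point m/z grid.
d_A = np.array([10.0, 0.0, 5.0, 0.0, 0.0, 1.0])
d_B = np.array([0.0, 8.0, 5.0, 0.0, 2.0, 0.0])
d_C = np.array([0.0, 0.0, 0.0, 9.0, 2.0, 3.0])
D = np.column_stack([d_A, d_B, d_C])

# Simulated observed spectrum: 2 parts A, no B, 0.5 parts C (noise-free).
x_true = np.array([2.0, 0.0, 0.5])
S = D @ x_true

# Least-squares projection of S onto the span of the library spectra.
x_hat, *_ = np.linalg.lstsq(D, S, rcond=None)
```

Because the simulated spectrum is an exact mixture, the recovered coefficients agree with `x_true`, and the absent precursor B receives an estimate of (numerically) zero.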
In general, we are presented with a product ion spectrum derived from a mixture of precursor ions. The mixture of precursor ions may be generated by applying a filter that selects all ions in a given mass range. These ions can be indirectly visualized in a precursor spectrum (e.g., window 50 of
Suppose that there are K candidate product ion spectra in the database corresponding to a given precursor isolation window. We construct a database matrix D containing N rows and K columns as shown in Eq. 1,
$$D = \begin{pmatrix} d_1 & d_2 & \cdots & d_K \end{pmatrix} = \begin{pmatrix} d_{1,1} & d_{1,2} & \cdots & d_{1,K} \\ \vdots & \vdots & \ddots & \vdots \\ d_{N,1} & d_{N,2} & \cdots & d_{N,K} \end{pmatrix} \qquad (1)$$
in which each column, $d_k$ (for k=1 . . . K), is a database (spectral library) spectrum represented as a (column) vector having N entries, $d_{n,k}$ (for n=1 . . . N). The entry $d_{n,k}$ is the observed intensity of the kth product ion spectrum at the nth m/z “position”, this intensity possibly being an integrated intensity over a narrow mass range. It is necessary that the product ion spectra be normalized so that the entries at corresponding positions in the vectors represent equivalent mass positions.
Suppose that s denotes a product ion spectrum that we wish to interpret in terms of the spectral library D. In particular, we wish to find the weighted sum of database spectra that most closely approximates s. An arbitrary linear combination, s′, of the database spectra is given by Eq. 2, viz.
$$s' = D x' \qquad (2)$$
in which the column vector x′ is an arbitrary vector of K abundances, not necessarily optimal. The goal is to determine an optimal estimate of the abundances, which we denote by $\hat{x}$, where we interpret the product ion spectrum s as a mixture of database spectra. According to our model, $\hat{x}$ is also an estimate of the abundances of the precursors, represented by these database spectra, that occur in the unknown mixture that gave rise to the observed product ion spectrum s. The optimality criterion used here is the sum of squared differences between the components of s and s′, denoted by the scalar quantity e in Eq. 3, below. The sum is over the data samples in the observed spectrum s. The quantity e is the squared length of the vector difference between s and s′. The squared length of a vector can be written as the transpose of the vector times itself:
$$e = \sum_{n=1}^{N} (s_n - s'_n)^2 = (s - s')^T (s - s') = (s - D x')^T (s - D x') \qquad (3)$$
Minimizing the squared length of the vector is equivalent to minimizing the length of the vector, and more convenient mathematically. We assume correspondence between the mass positions represented by the entries of s and the database product ion spectra so that a vector difference between the quantities is meaningful.
Determining the set of parameter values that minimizes the sum of squared differences between a model and observed data is equivalent to maximum-likelihood estimation in the special case where the observed data is assumed to be the outcome of a random process in which an “ideal” model is corrupted by additive, white Gaussian noise. This is a convenient model, but not perfectly applicable to the current problem. Most of the random variation seen in product ion spectra is due to ion counting statistics. Fortunately, the variations of measurements at different mass positions in the spectra are independent. However, the variations are not identically distributed. In counting statistics, the variance is equal to the count intensity. When the ion counts are relatively large (i.e. greater than 50 ions), the distribution of observed values can be accurately approximated by a Gaussian distribution where the variance is set equal to the count intensity. Because the underlying count intensity is unknown, we can approximate the variance by the observed intensity, rather than the count intensity, without introducing significant distortion in our error metric. To take into account the differences in variance across samples, we modify our error metric as in Eq. 4:
$$e = (s - D x')^T\, W\, (s - D x') \qquad (4)$$
where W is an N-by-N (N×N) diagonal matrix of weighting factors wherein each diagonal entry $W_{nn}$ is defined as $W_{nn} = 1/\sigma_n^2 = 1/s_n$ and all off-diagonal entries are zero. Additional sources of variation can be taken into account by modifying $\sigma_n$ appropriately, but in this application, $\sigma_n$ is dominated by counting statistics. In any case, the variation is encapsulated in matrix W, which is constant with respect to the estimated abundances.
Next, we derive the optimal vector of estimated abundances, denoted by $\hat{x}$. To find the optimal value, we evaluate the derivative of e with respect to each component of x′. Because $\hat{x}$ minimizes e, the derivative of e evaluated at $\hat{x}$ must be zero. Therefore, we set the derivative of e evaluated at $\hat{x}$ to zero (specifically, the null vector 0) as indicated in Eq. 5, below.
$$\left.\frac{\partial e}{\partial x'}\right|_{x'=\hat{x}} = -2\, D^T W\, (s - D\hat{x}) = 0 \qquad (5)$$
Rearranging Eq. 5 produces Eq. 6, the desired linear matrix equation for the optimal estimate of the abundance vector.
$$(D^T W D)\,\hat{x} = D^T W s \qquad (6)$$
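A minimal numerical sketch of solving Eq. 6 is given below. It assumes the counting-statistics weights described above (each diagonal entry of W equal to the reciprocal of the observed intensity). The function name and the `floor` parameter, which guards against division by zero at positions with no observed signal, are assumptions added for illustration and are not part of the described method.

```python
import numpy as np

def estimate_abundances(D, s, floor=1.0):
    """Solve (D^T W D) x = D^T W s with W = diag(1 / s_n).

    floor is an assumed lower bound on the variance so that positions with
    zero observed intensity do not produce infinite weights.
    """
    D = np.asarray(D, dtype=float)
    s = np.asarray(s, dtype=float)
    w = 1.0 / np.maximum(s, floor)      # diagonal of W
    DtW = D.T * w                       # D^T W without forming the N x N matrix
    return np.linalg.solve(DtW @ D, DtW @ s)

# Toy library of two overlapping spectra and an exact 3:1 mixture of them.
D_toy = np.array([[100.0, 0.0],
                  [50.0, 60.0],
                  [0.0, 80.0],
                  [10.0, 10.0]])
s_toy = D_toy @ np.array([3.0, 1.0])
x_toy = estimate_abundances(D_toy, s_toy)
```

Multiplying the rows of the transposed library by the weight vector avoids building the full diagonal matrix, which matters when N is large.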
The above derivation is based on decomposing a single observed or unknown multiplex spectrum, represented by vector s, into its various components as recorded in a spectral library. It may be advantageous, in some computing architectures, to perform the decompositions of multiple spectra at once. Let matrix S denote a collection of Q product-ion spectra, all acquired using the same precursor isolation window, formed by stacking the individual product ion spectra as column vectors, i.e. $S = (s_1, s_2, \ldots, s_Q)$. Then we replace vectors s and $\hat{x}$ in Eq. 6 with matrices S and $\hat{X}$, respectively, so as to yield Eq. 7,
$$(D^T W D)\,\hat{X} = D^T W S \qquad (7)$$
in which $\hat{X}$ is a matrix of Q column vectors, each with K entries. The entry $\hat{X}_{kq}$ contains the optimal estimate for the abundance, in observed product ion spectrum $s_q$, of the precursor represented in the database by product ion spectrum $d_k$.
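The batched form of Eq. 7 maps directly onto a single call to a linear solver with a matrix right-hand side. The sketch below is illustrative only: the function name is hypothetical, and it assumes one shared weight vector for all Q spectra (defaulting to W = I), which is a simplification because counting-statistics weights would in general differ from spectrum to spectrum.

```python
import numpy as np

def estimate_abundances_batch(D, S, w=None):
    """Solve (D^T W D) X = D^T W S for all columns of S at once.

    w holds the diagonal of a single W shared by every spectrum (assumed);
    if omitted, W is the identity.
    """
    D = np.asarray(D, dtype=float)
    S = np.asarray(S, dtype=float)
    if w is None:
        w = np.ones(D.shape[0])
    DtW = D.T * w
    return np.linalg.solve(DtW @ D, DtW @ S)   # result has shape (K, Q)

# Toy library (N=4, K=2) and Q=3 exact mixtures stacked as columns.
D_lib = np.array([[1.0, 0.0],
                  [0.5, 0.6],
                  [0.0, 0.8],
                  [0.1, 0.1]])
X_true = np.array([[1.0, 0.0, 2.0],
                   [0.0, 3.0, 0.5]])
S_obs = D_lib @ X_true
X_hat = estimate_abundances_batch(D_lib, S_obs)
```

Each column of `X_hat` holds the abundance estimates for the corresponding column of `S_obs`.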
To simplify analysis, consider the special case where W=I, where I is the N×N identity matrix. In this case, we do not consider the differences in variance between observed values in the product ion spectrum s. Then Eq. 6 reduces to
$$(D^T D)\,\hat{x} = D^T s \qquad (8)$$
Let A denote the K×K matrix $D^T D$ and let b denote the K-vector $D^T s$. The entry $A_{k'k}$ is equal to $d_{k'}^T d_k$, or equivalently the dot product between database product ion spectra $d_k$ and $d_{k'}$. The entry $b_k$ is equal to $d_k^T s$, or equivalently the dot product between database product ion spectrum $d_k$ and the observed product ion spectrum s. If $d_k$ and $d_{k'}$ are appropriately normalized (scaled to unit length), $A_{kk'}$ is the (uncentered) correlation coefficient between vectors $d_k$ and $d_{k'}$. A trivial example is when A is the identity matrix. In this case, there is no overlap between database entries; that is, the product ion spectra have no overlapping peaks. Then $\hat{x}_k = b_k$ or, in other words, the optimal estimate for the abundance of the kth database spectrum, $d_k$, in the observed product ion spectrum is the dot product between $d_k$ and s. In general, the estimated abundance of the kth database spectrum in the observed product ion spectrum depends not only upon its dot product with the observed spectrum, but also upon its dot products with the other spectra in the database. In the extreme case, suppose A has an off-diagonal entry $A_{kk'}$ whose value approaches one. In that case, product ion spectra $d_k$ and $d_{k'}$ are nearly indistinguishable, and so the estimated values are highly sensitive to small amounts of noise. Therefore, to ensure stable estimates, it is important to construct a database that contains distinct product ion spectra, avoiding duplicate or very similar entries.
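The overlap matrix can be computed ahead of time to screen a library for near-duplicate entries. The following is a minimal sketch, assuming every library spectrum has at least one non-zero intensity; the function names and the similarity limit of 0.95 are hypothetical choices for illustration.

```python
import numpy as np

def overlap_matrix(D):
    """Return A = D^T D after scaling each library column to unit length."""
    D = np.asarray(D, dtype=float)
    unit = D / np.linalg.norm(D, axis=0)    # assumes no all-zero column
    return unit.T @ unit

def similar_pairs(D, limit=0.95):
    """List index pairs (j, k), j < k, whose normalized overlap exceeds limit."""
    A = overlap_matrix(D)
    K = A.shape[0]
    return [(j, k) for j in range(K) for k in range(j + 1, K) if A[j, k] > limit]

# Columns 0 and 1 are nearly identical; column 2 shares no peaks with them.
D_check = np.array([[1.0, 1.0, 0.0],
                    [0.0, 0.01, 0.0],
                    [0.0, 0.0, 1.0]])
flagged = similar_pairs(D_check)
```

Pairs flagged in this way are candidates for merging or removal before the library is used for decomposition.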
Error analysis of the abundance estimates is now discussed. If the observed product ion spectrum is exactly a mixture of one or more database product ion spectra, then the estimated abundances will be exactly the coefficients of the mixture. For example, consider s=Dx where x is a vector of the true abundances in the mixture. Then, provided that $D^T D$ is invertible, the vector of estimated abundances $\hat{x}$ is equal to the vector of true abundances, because
$$(D^T D)\,\hat{x} = D^T s = D^T (D x) = (D^T D)\, x \qquad (9)$$
If we assume that the product ion spectrum is the outcome of a random process in which a mixture of product ion spectra from the database is corrupted by additive white, Gaussian noise, then we have Eq. 10, as follows.
$$(D^T D)\,\hat{x} = D^T s = D^T (D x + n) = (D^T D)\, x + D^T n \qquad (10)$$
The vector of estimated abundances can be written as the vector of true abundances plus an error term, as given by Eq. 11.
$$\hat{x} = x + (D^T D)^{-1} D^T n = x + \Delta \qquad (11)$$
The error term is a linear transformation of a zero-mean Gaussian random variable, and therefore, is itself a zero-mean Gaussian random variable. Because the error is zero-mean, we say that the estimator is unbiased. A zero-mean Gaussian random variable is characterized by its covariance matrix. The covariance is given in Eq. 12 below as
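$$K_\Delta = E[\Delta \Delta^T] = (D^T D)^{-1} D^T\, E[n n^T]\, D\, (D^T D)^{-1} = \sigma^2 (D^T D)^{-1} \qquad (12)$$
where $\sigma^2$ is the variance of the noise in each sample, so that $E[n n^T] = \sigma^2 I$ for white noise.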
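Eq. 12 can be evaluated directly from the library, before any spectrum is acquired. The sketch below assumes a known noise standard deviation `sigma`, which is a hypothetical input, and compares a library of non-overlapping spectra with a library of two strongly overlapping spectra.

```python
import numpy as np

def abundance_covariance(D, sigma):
    """Return sigma^2 (D^T D)^{-1}, the covariance of the estimation error."""
    D = np.asarray(D, dtype=float)
    return sigma ** 2 * np.linalg.inv(D.T @ D)

# Non-overlapping unit-length spectra: the errors are uncorrelated.
D_orth = np.eye(4)[:, :2]
cov_orth = abundance_covariance(D_orth, sigma=2.0)

# Two unit-length spectra with a normalized overlap of 0.9.
D_close = np.array([[1.0, 0.9],
                    [0.0, np.sqrt(0.19)]])
cov_close = abundance_covariance(D_close, sigma=2.0)
std_close = np.sqrt(np.diag(cov_close))   # standard error of each abundance
```

For the overlapping pair, the variance of each abundance estimate is larger than the per-sample noise variance by a factor of 1/(1 - 0.81), or roughly 5.3.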
The matrix $(D^T D)^{-1}$ can be considered as a gain factor which expresses how much the input noise gets amplified in producing the abundance estimates. In the simple case, where $D^T D$ is the identity (no overlap between database spectra), the amplification is one, meaning that the abundance estimates have the same variance as the individual samples in the spectrum, and are mutually independent. In general, the inverse eigenvalues of $D^T D$, or equivalently the eigenvalues of $(D^T D)^{-1}$, which can be computed in advance, determine the amplification of noise. When the input noise is expressed in terms of components in the directions of the eigenvectors of $D^T D$, each noise component is amplified by the corresponding inverse eigenvalue. Certain noise components, e.g. in the direction of similar spectra, undergo large amplification, causing unstable estimates of the abundances of such spectra, whereas components in other directions, i.e. in the direction of highly distinct spectra, may undergo essentially no amplification, leading to relatively stable estimates of these abundances. In any case, the estimation errors can be estimated in advance from the spectral database. This admits the possibility of designing optimal experimental protocols that produce tolerable levels of errors or optimal interpretation of existing experimental protocols.
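The eigenvalue picture described above can be checked on a small example. This is a sketch under the assumption that the library columns are scaled to unit length; the function name is hypothetical, and `numpy.linalg.eigh` is used because the overlap matrix is symmetric.

```python
import numpy as np

def noise_amplification(D):
    """Return the inverse eigenvalues of D^T D and the matching eigenvectors."""
    D = np.asarray(D, dtype=float)
    eigvals, eigvecs = np.linalg.eigh(D.T @ D)   # eigenvalues in ascending order
    return 1.0 / eigvals, eigvecs

# Two unit-length spectra with a normalized overlap of 0.9.
D_pair = np.array([[1.0, 0.9],
                   [0.0, np.sqrt(0.19)]])
gains, directions = noise_amplification(D_pair)
```

Here the overlap matrix has eigenvalues 0.1 and 1.9. The gain of 10 belongs to the direction in which the two abundances move in opposite senses, which is the combination that is hard to determine when the spectra are similar, while their sum is estimated with a gain of about 0.53.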
If the matrix $D^T D$ is nearly singular, then the abundance estimates will have large errors. The matrix $D^T D$ will be singular when one of the rows of the matrix can be written as a linear combination of the other rows. A trivial case is a matrix that contains two identical rows. The matrix is said to be ill-conditioned when it is nearly singular. This will happen when the database contains two spectra from the same precursor or contains a multiplex of spectra from precursors represented in the database. Some care must be taken to avoid this condition. If the database matrix D has more entries than sample values per entry, i.e. K>N, the matrix $D^T D$ will be singular.
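A library can be screened for these conditions before it is used. The sketch below is only an illustration: the function name and the condition-number limit `max_condition` are assumptions, and a suitable limit would depend on the noise level and numerical precision of a given application.

```python
import numpy as np

def library_is_well_conditioned(D, max_condition=1e8):
    """Return False if D^T D is singular or ill-conditioned (assumed limit)."""
    D = np.asarray(D, dtype=float)
    n_samples, n_entries = D.shape
    if n_entries > n_samples:            # K > N: D^T D is necessarily singular
        return False
    return bool(np.linalg.cond(D.T @ D) < max_condition)

distinct = np.eye(3)
a = np.array([1.0, 0.0, 2.0])
b = np.array([0.0, 3.0, 1.0])
with_multiplex = np.column_stack([a, b, a + b])   # third entry is a mixture
too_many_entries = np.ones((2, 3))                # K = 3 entries, N = 2 samples
```

The second library contains an entry that is a mixture of two other entries, which is the multiplex situation the text warns against.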
Note that the resolution of spectra in the library has a significant effect on the overlap scores in the matrix $D^T D$. For example, at low resolution, peak overlaps between similar masses increase, resulting in larger overlap scores between distinct database entries, i.e. “off the diagonal” of the matrix $D^T D$. Pairs of spectra that have large overlaps, whether because of inherent similarity or insufficient resolution in the acquired spectra, are difficult to discriminate and also reduce the overall discriminating power of the estimator by amplifying input noise.
The error covariance matrix can be used to determine the appropriate threshold for accepting low abundance ions versus setting noisy fluctuations in estimated abundances to zero. For example, if noise (i.e. a spectrum that does not overlap with any database entries) is presented to the estimator, the desired abundance vector would be zero, but in fact, the computed abundance vector is a random Gaussian vector. The mean value of the vector is zero but component values will fluctuate about zero due to the input noise, which is amplified and propagated by the estimator. The distribution of scores can be calculated for each database entry. The probability that any given score exceeds some arbitrary value (i.e. a threshold) can also be computed.
Therefore, a threshold can be chosen below which component scores can be discarded as noise. The threshold can be chosen so that only a small fraction of false positives are accepted. This false positive rate can be specified and used to compute the relevant threshold for detection.
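One way to turn a specified false positive rate into a detection threshold is sketched below. It assumes zero-mean Gaussian estimation errors with the covariance of Eq. 12, a known noise standard deviation `sigma`, and a one-sided test applied to each candidate separately with no correction for testing many candidates at once; the function names are hypothetical.

```python
import numpy as np
from statistics import NormalDist

def detection_thresholds(D, sigma, false_positive_rate=0.01):
    """Per-entry thresholds from the diagonal of sigma^2 (D^T D)^{-1}."""
    D = np.asarray(D, dtype=float)
    cov = sigma ** 2 * np.linalg.inv(D.T @ D)
    z = NormalDist().inv_cdf(1.0 - false_positive_rate)   # one-sided quantile
    return z * np.sqrt(np.diag(cov))

def apply_thresholds(x_hat, thresholds):
    """Set abundance estimates at or below their threshold to zero."""
    x_hat = np.asarray(x_hat, dtype=float)
    return np.where(x_hat > thresholds, x_hat, 0.0)

D_unit = np.eye(4)[:, :2]                       # two non-overlapping spectra
limits = detection_thresholds(D_unit, sigma=1.0, false_positive_rate=0.025)
kept = apply_thresholds([5.0, 0.5], limits)
```

With unit noise and non-overlapping spectra, a 2.5% false positive rate corresponds to a threshold of about 1.96 for every entry, so the estimate of 0.5 is treated as not detected.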
At a given false positive rate, the detection sensitivity can also be calculated. Sensitivity depends upon signal-to-noise, as expected, but also depends upon the extent to which various components in the database can be discriminated from one another. This rather qualitative description of how sensitivity depends upon the database is exactly specified quantitatively by the error covariance matrix $\sigma^2 (D^T D)^{-1}$.
The techniques described in this disclosure can be applied to either intentional or unintentional multiplexing. Intentional multiplexing refers, for example, to the sequential accumulation of multiple precursors in an ion trap, simultaneous fragmentation of all precursors, and simultaneous analysis. Intentional multiplexing can be used, for example, to exploit the large spectral bandwidth of Orbitrap® mass analyzers (commercially available from Thermo Fisher Scientific of Waltham, Mass., USA), which far exceeds the relatively low complexity of a single MS2 spectrum.
Unintentional multiplexing refers to the isolation of multiple precursors which happen to lie in the same isolation window as a single targeted precursor. This situation is unavoidable in complex mixtures, given that isolation windows are typically one Dalton or wider. The techniques taught in this document can be used to identify additional precursors even when it is believed that a single precursor has been isolated. If only a single precursor is in fact present, only that precursor should have an estimated abundance significantly different from zero.
In some implementations, it may be advantageous to perform MS2 identifications in real-time on an embedded system. Graphics processing units (GPUs) are well suited for carrying out the required linear algebraic computations quickly and at relatively low cost.
There are two general approaches to identifying precursors from product ion spectra. The first involves matching the observed spectrum to entries in a spectral library database. The alternative is product ion spectrum prediction. In theory, observed product ion spectra provide substantial information about the identity of a precursor compound. Product ion spectra typically contain many detectable peaks. The collection of mass positions and intensities of these peaks provides a distinctive fingerprint of the precursor compound. Furthermore, the observed product ion spectrum is highly reproducible. Taking these two properties together, it is unlikely that random variations that affect the acquisition of a product ion spectrum would cause one precursor compound to be mistaken for a different precursor compound.
Despite the potential utility of spectral libraries for identification, conventional spectral libraries do not guarantee accurate, confident precursor identification. Several problems with spectral libraries are addressed in this patent application. Most importantly, spectral libraries are substantially incomplete. While it is trivial to acquire a product ion spectrum on a modern mass spectrometer, it is relatively laborious to prepare purified precursors to submit to the mass spectrometer to generate the pure product ion spectra that are needed for use in a spectral library. General-purpose spectral libraries may contain thousands of compounds, but typically do not provide extensive coverage of the molecules encountered in many highly specific applications. If an analyzed precursor does not appear in the spectral library, the search will yield, at best, no result, and at worst, an incorrect identification.
Even in the case where an analyzed precursor appears in the library, the library entry for that precursor may be a poor match to the observed spectrum because the two spectra have been acquired on different instruments and/or using different fragmentation conditions. For example, resonant collisionally induced dissociation in an ion trap, commonly called CID or less commonly RECID, may produce a significantly different product ion spectrum than collisionally induced dissociation where ions are accelerated to high axial kinetic energy before entering a quadrupole (HCD). In RECID, the precursor resonantly absorbs energy based upon its mass-to-charge ratio and generates primary fragments upon colliding with neutral gas molecules. The primary fragments are not in resonance by virtue of their change in mass-to-charge, quickly lose energy, and do not produce secondary fragments. Conversely, in HCD, primary fragments may retain high kinetic energy and give rise to secondary fragments.
Diverse methods of fragmentation such as electron transfer dissociation (ETD), ultraviolet photodissociation (UVPD), and infrared multiple photon dissociation (IRMPD) rely upon completely different fragmentation mechanisms. Each of these fragmentation methods produces product ion spectra with distinctive properties that depend upon differing aspects of the precursor structure and reactivity.
Even when restricted to the same general type of fragmentation, differences in the experimental parameters can cause significant variations in the distribution of product ions. For example, in CID, increasing the pressure in the collision cell tends to favor multiple, sequential fragmentation events. A similar effect can be seen by lengthening the reaction time in ETD. Increasing the collision energy can favor different fragmentation pathways in CID than at lower energy.
The numerous difficulties with spectral libraries have led many practitioners to favor the alternative approach in which a product ion spectrum or, more commonly, a list of product ion masses is predicted from a precursor molecule of known structure. The primary advantage of this method is that an algorithm can generate a predicted spectrum for essentially any molecule that can be conceived, eliminating the need to synthesize and purify the molecule for analysis. The method is used to its greatest advantage in bottom-up proteomics. In such an application, the product ions formed from tryptic peptides by CID are primarily b- and y-type ions that can be easily and reliably enumerated.
The disadvantages of product ion spectrum prediction, however, are significant: prediction quality is poor. It is difficult to predict the most abundant product ions for most classes of molecules. Even in cases where product ions can be predicted, intensity information in the observed product ion spectrum usually cannot be exploited. Large uncertainty in the prediction and the failure to use product ion intensities to discriminate precursors often result in mistaking one precursor compound for another.
Several deficiencies in spectral libraries that limit their utility in identification can be overcome simultaneously by enabling a mass spectrometer to use each product ion spectrum it acquires to generate its own spectral library. For confident identification, it is critical to collect analytic product ion spectra under essentially the same conditions for which the corresponding spectral library entry was collected. The best way to ensure this correspondence is to construct the spectral library on the same instrument where the subsequent analysis will be performed. In addition, it is necessary to standardize the experimental parameters to eliminate unnecessary variation. For example, the collision energy can be set as a deterministic function of the isolation window. If an instrument is enabled for multiple types of fragmentation, then separate libraries should be maintained for each fragmentation type.
A comprehensive approach in which every spectrum acquired on the mass spectrometer is used for automatic spectral library construction makes it possible to generate libraries that are essentially complete for a given application in a relatively short amount of time and without any burden on the user. For example, if a mass spectrometer is acquiring product ion spectra at a rate of 10 Hz, the number of spectra acquired in an hour is 10×60×60=36,000; the number of spectra that may be acquired in a day is 36,000×24=864,000. It is thus possible to collect a product ion spectrum on nearly every detectable molecule in a class of samples, e.g. a collection of human proteomes in a clinical trial, in a matter of days or weeks.
Although the product ion spectra obtained over the lifetime of a mass spectrometer may number in the billions (i.e. a million a day for thousands of days), the size of the spectral library depends only upon the number of unique precursors that the instrument detects. The number of unique molecules detected by a mass spectrometer is typically several orders of magnitude smaller. If a database contains one million product ion spectra and each spectrum requires a kilobyte of storage (i.e. four bytes for mass and four bytes for intensity for each of a few dozen peaks, plus annotation), the memory required to store the database is one gigabyte. Thus, typical databases that encapsulate a complete record of every precursor a mass spectrometer will ever encounter can be stored locally and accessed rapidly.
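The acquisition and storage estimates above can be reproduced with a few lines of arithmetic. The following sketch is illustrative only; the function names, the choice of 50 peaks per spectrum, and the 600 bytes of annotation per entry are hypothetical values chosen so that one entry occupies one kilobyte (taken here as 1000 bytes).

```python
def spectra_acquired(rate_hz: float, hours: float) -> int:
    """Number of spectra acquired at a constant rate over a number of hours."""
    return int(rate_hz * 3600 * hours)


def entry_bytes(n_peaks: int, annotation_bytes: int) -> int:
    """Storage for one entry: 4 bytes of mass and 4 bytes of intensity per peak,
    plus a block of annotation."""
    return n_peaks * (4 + 4) + annotation_bytes


def library_bytes(n_entries: int, bytes_per_entry: int) -> int:
    """Total storage for a library of n_entries entries."""
    return n_entries * bytes_per_entry


per_hour = spectra_acquired(10, 1)    # 10 Hz for one hour
per_day = spectra_acquired(10, 24)    # 10 Hz for one day
one_entry = entry_bytes(50, 600)      # hypothetical peak count and annotation size
total = library_bytes(1_000_000, one_entry)
```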
A distinctive aspect of the automatically generated spectral library is that it contains an entry for every precursor that is detected, even if an identification has not yet been made. A novel precursor is typically added to the library without an annotation that identifies it. When the precursor is subsequently presented to the mass spectrometer, it is matched to the corresponding library entry, but not, strictly speaking, identified, unless that entry has been annotated.
Most of the precursors in the database may never need to be identified. For example, it is sufficient to be able to match a compound to the unidentified entry in the database to allow comparative analysis of multiple samples. However, if a particular precursor compound appears to have some significance, e.g. as a potential biomarker whose abundance stratifies patient response to some therapeutic intervention, then some additional effort can be taken to annotate the entry. The current inventive method does not address how that annotation is performed. However, once an entry is annotated, the corresponding precursor is identified each time it is encountered, simply by matching it to the annotated entry.
A key enabling aspect of the inventive method is the ability to compile a library in which each precursor that the mass spectrometer encounters is represented by a single entry representing the product ion spectrum that would be acquired if the precursor were purified and subsequently fragmented. The analytic method for interpretation of the product ion spectra derived from mixtures of precursors described above is essential to the automatic construction of a suitable spectral library.
Consider the case where a product ion spectrum is acquired from a mixture of precursors that have been previously seen by the mass spectrometer and for which accurate spectral library entries exist. The acquired spectrum is projected onto the spectral library, and the precursors are correctly identified and quantified. The estimated mixture of identified precursors can be used to form a model product ion spectrum and compared against the observed spectrum. The residual difference between the spectra can be analyzed for the presence of additional novel precursor components. In this case, the residue would be judged to be typical noisy variations and discarded.
Now, consider the case where a product ion spectrum is acquired from a mixture of precursors which includes a compound that is not represented in the spectral library. In this case, the residual difference between a model product ion spectrum constructed from the extracted components and the observed spectrum would be significant. One might then hypothesize that the residue contains one or more novel precursors.
The threshold below which residual components should be discarded depends upon the reproducibility of the product ion spectra. Consider a case where a given known primary precursor is mixed with a small amount (e.g. 5%) of an unknown secondary precursor. Suppose the product ion spectrum of the known primary precursor has a typical variation of 1%. Then, the difference between the product ion spectrum and the library entry for the known primary precursor is significant; the residue is unlikely to be explained by variations in the appearance of the product ion spectrum of the primary precursor. Conversely, if the secondary precursor is present at 1% abundance, its presence may not be detected and the residue may be considered as typical variation in the product ion spectrum of the primary precursor.
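The residue test described in the preceding paragraphs can be sketched as follows. This is a minimal illustration, assuming that spectra are stored as equal-length lists of binned intensities and that the size of the residue is judged by its Euclidean norm relative to that of the observed spectrum; the function names and the form of the comparison are assumptions made for the example.

```python
import math


def model_spectrum(library, abundances):
    """Sum the library spectra weighted by their estimated abundances."""
    model = [0.0] * len(library[0])
    for spectrum, abundance in zip(library, abundances):
        for i, value in enumerate(spectrum):
            model[i] += abundance * value
    return model


def residual_fraction(observed, library, abundances):
    """Return the residue (observed minus model) and its norm as a fraction
    of the norm of the observed spectrum."""
    model = model_spectrum(library, abundances)
    residue = [o - m for o, m in zip(observed, model)]
    norm_observed = math.sqrt(sum(o * o for o in observed))
    norm_residue = math.sqrt(sum(r * r for r in residue))
    fraction = norm_residue / norm_observed if norm_observed > 0 else 0.0
    return residue, fraction


def suggests_novel_precursor(fraction, typical_variation):
    """Flag the residue when it exceeds the typical spectral variation."""
    return fraction > typical_variation
```

A residue flagged in this way is only a hypothesis that a novel precursor is present; the typical variation against which it is compared would have to be established from the reproducibility of the instrument's product ion spectra.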
An LC-MS experiment provides a convenient way to verify putative novel precursors and to purify multiple novel precursors from the residue after known precursors are extracted. Correlation between the time profiles of pairs of product ions or between a product ion and a putative precursor can be used to match product ions to a precursor; such correlation methods have been described in the art.
For example, a product ion spectrum acquired during an LC-MS experiment reflects a mixture of precursors that happen to elute at the same time. If we could obtain the elution profile of each precursor, we would see that all of the precursors were eluting at the time when the product ion spectrum was acquired. Although the profiles overlap at this time, the profiles are, in general, not identical. For example, they may be shifted slightly in time. In addition, each product ion derived from a given precursor has a profile that is essentially identical, except for statistical fluctuations, to the profile of its precursor. Therefore, within a collection of product ion spectra representing a mixture of two or more precursors obtained sequentially over a short duration of time in an LC-MS run, we expect to see one subset of product ions whose abundances move up and down in concert with each other and the precursor elution profile, while another subset of product ions moves up and down together in a different pattern, slightly shifted in time.
To ensure the quality of the spectral library entries, we can enforce the rule that we do not add a putative precursor to the spectral library, unless the elution profiles of its product ions can be matched to or correlated with the precursor profile that can be directly observed in the precursor spectrum. We expect to be able to detect trace compounds in a product ion spectrum even when they are not directly observed in the precursor spectrum. However, one can set a higher standard for including these in the library when they are observed for the first time.
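The grouping of product ions by elution profile can be illustrated with a short sketch. It assumes that each profile is a list of abundances sampled at the same sequence of acquisition times and uses the Pearson correlation coefficient as the similarity measure; the function names and the minimum correlation of 0.9 are hypothetical choices for this example.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length profiles (0.0 if either is flat)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def assign_product_ions(product_profiles, precursor_profiles, min_corr=0.9):
    """Map each product ion to the precursor whose profile it tracks best,
    or to None when no correlation reaches min_corr."""
    assignments = {}
    for ion, profile in product_profiles.items():
        best_name, best_r = None, min_corr
        for name, precursor_profile in precursor_profiles.items():
            r = pearson(profile, precursor_profile)
            if r >= best_r:
                best_name, best_r = name, r
        assignments[ion] = best_name
    return assignments
```

Product ions that remain unassigned in such a scheme would not be attributed to any directly observed precursor.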
We have mentioned that the precursor may or may not be observed in the precursor spectrum that is usually obtained immediately before triggering a product ion spectrum. When one selects an isolation window, there necessarily exists coarse, but definitive information about the precursor mass: that is, the precursor mass is inside the isolation window. Even with this coarse information about the precursor mass, one need not project the product ion spectrum onto the entire spectral library. Instead, one can form a candidate list from the spectral library whose precursor masses lie in the given window. These candidate lists are most conveniently generated by keeping the spectral library entries sorted by precursor mass. In the case of sequential multiplexing, the database would be constructed by concatenating lists of precursors contained in each of multiple isolation windows.
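A candidate list of this kind can be sketched with a binary search over a library kept sorted by precursor mass. The sketch below is illustrative; it assumes the library is represented by a sorted list of precursor masses and that an isolation window is given by a center and a width, and the function names are hypothetical.

```python
from bisect import bisect_left, bisect_right


def candidates_in_window(sorted_masses, center, width):
    """Indices of library entries whose precursor mass lies in the window."""
    lo = bisect_left(sorted_masses, center - width / 2.0)
    hi = bisect_right(sorted_masses, center + width / 2.0)
    return list(range(lo, hi))


def multiplexed_candidates(sorted_masses, windows):
    """Concatenate the candidate lists of several isolation windows,
    dropping duplicates that arise when windows overlap."""
    indices = set()
    for center, width in windows:
        indices.update(candidates_in_window(sorted_masses, center, width))
    return sorted(indices)
```

Because the masses are sorted, each lookup requires only a logarithmic number of comparisons regardless of the size of the library.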
When the set of possible isolation windows can be enumerated in advance, e.g. 1-Da wide windows at each nominal mass between 1 and 2000, the spectral library can be partitioned in advance. For example, a database of one million entries might be divided into 2000 mini-databases (one for each nominal mass up to mass 2000) each containing, on average, 500 entries. For each of these sub-libraries, the matrix $D^{T}D$ can be stored and pre-factored (i.e. by LU decomposition). If there are K entries in the database, the computational complexity of solving the matrix equation for the abundance estimates is reduced from $O(K^{3})$ to $O(K^{2})$ when an LU decomposition has been pre-computed. Together, using the isolation window to reduce the list of candidate precursors and pre-factorization of the matrices make it possible to interpret mixed spectra of arbitrary complexity in real time.
The act of assuming that the only information about the precursor mass is given by the isolation window is a “blind” approach, because it does not consider information in the precursor spectrum. An alternative approach to the “blind” decomposition is constructing a small list of candidate precursors in real time based upon accurate mass measurements of detected peaks that lie in the isolation window in the precursor spectrum. On an accurate mass instrument, the precursor mass can be confined to a mass range that can be three orders of magnitude smaller than the isolation window (i.e. mDa vs. Da), thus reducing the list of precursor candidates by a similar factor. Guided by this information, a database is constructed by concatenating lists of precursors whose masses lie within a confidence interval of one of the estimated masses of detected precursors.
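The benefit of pre-factoring can be illustrated with a small pure-Python sketch. It assumes the unweighted least-squares form in which the abundance estimate solves the normal equations, with the Gram matrix of the library spectra on the left and the projection of the observed spectrum onto the library on the right. The factorization below is a Doolittle LU decomposition without pivoting, which is adequate for a symmetric positive-definite matrix; all function names are hypothetical, and a production implementation would use an optimized linear algebra library.

```python
def gram(columns):
    """Matrix of all pairwise dot products of the library spectra."""
    k = len(columns)
    return [[sum(a * b for a, b in zip(columns[i], columns[j]))
             for j in range(k)] for i in range(k)]


def lu_factor(a):
    """Doolittle LU factorization without pivoting; cubic cost, done once."""
    n = len(a)
    lower = [[0.0] * n for _ in range(n)]
    upper = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i, n):
            upper[i][j] = a[i][j] - sum(lower[i][k] * upper[k][j] for k in range(i))
        lower[i][i] = 1.0
        for j in range(i + 1, n):
            lower[j][i] = (a[j][i] - sum(lower[j][k] * upper[k][i]
                                         for k in range(i))) / upper[i][i]
    return lower, upper


def lu_solve(lower, upper, b):
    """Forward then back substitution; quadratic cost per spectrum."""
    n = len(b)
    y = [0.0] * n
    for i in range(n):
        y[i] = b[i] - sum(lower[i][k] * y[k] for k in range(i))
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (y[i] - sum(upper[i][k] * x[k] for k in range(i + 1, n))) / upper[i][i]
    return x


def estimate_abundances(columns, lower, upper, spectrum):
    """Project the spectrum onto the library and solve with the stored factors."""
    rhs = [sum(a * b for a, b in zip(col, spectrum)) for col in columns]
    return lu_solve(lower, upper, rhs)
```

In use, `lu_factor(gram(columns))` would be computed once per sub-library and stored, so that each newly acquired spectrum requires only the call to `estimate_abundances`.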
Not only is the calculation much faster when a small subset of the spectral library is used, but the false positive rate is also reduced. The disadvantage of this method is that it precludes detection of low abundance precursors that were not detected in the precursor scan. However, this disadvantage is relatively small in cases where an isolated precursor has such high abundance that detection of additional precursors in the product ion spectrum would require excessively high dynamic range.
Regardless of how a list of precursor candidates is generated from the spectral library, the error covariance matrix can be computed (or pre-computed) for any specific list of candidates. The error covariance matrix indicates how much noise or variation in the product ion spectrum will be amplified in generating the abundance estimates. At a certain level of error, one cannot distinguish whether a given precursor is present at low abundance or completely absent, leading to false-positive and false-negative identification errors. A real-time decision can be made regarding how much acquisition time would be necessary to make a correct identification based upon the list of candidates. In some cases, an easier target might be preferred if the acquisition time required for identification at a particular confidence level is judged to be lower.
A test library was constructed by performing the following operations: (a) reading a list of spectra contained in one or more files; (b) sorting the spectra by precursor mass-to-charge (m/z) ratio; (c) building library segments according to a user-defined precursor window step size; (d) assigning ions of the MS2 spectra to various bins according to a user-defined step size (resolution); (e) building qualified library segments by computing the $D^{T}D$ matrix for all spectra in a given precursor range, examining this matrix for pairs of highly correlated spectra (redundant entries), consolidating redundant entries, and re-computing the matrix; (f) computing and storing the LU factorization and the inverse of the $D^{T}D$ matrix. Neither the sample-weighting scheme to account for spectral variations due to ion counting statistics nor the detection threshold for ignoring calculated abundances that are not statistically different from zero was implemented in this example.
After the library was constructed, additional spectra were searched against this library by solving for $\hat{x}$ in Eq. 8, where S is a matrix of observed spectra, and X is a matrix of calculated proportions of the various library components. The graphics processing unit (GPU) was utilized for all matrix operations. We performed the MS2 de-multiplexing calculations using two test cases: Case #1, a set of 1.1 million MS2 spectra from an LC-MS analysis of yeast proteomic samples on a Q-Exactive™ mass analyzer instrument (commercially available from Thermo Fisher Scientific of Waltham, Mass., USA and including a quadrupole mass filter for precursor selection, a higher-energy collisional dissociation (HCD) fragmentation cell and a high-resolution accurate-mass Orbitrap® mass analyzer for analysis and detection); and Case #2, synthetically multiplexed spectra formed by summing observed MS2 spectra and adding random noise.
In the first case, we aggregated a large set of MS2 spectra of yeast samples produced on a Q-Exactive™ instrument in which fragment ions were produced in an HCD cell. Over 1.1 million such MS2 spectra were collected and searched via the Mascot software search engine (a conventional search engine that is able to identify proteins from mass spectrometry data) against a yeast protein database. Because a functional spectral library may be presumed to be curated for quality and relevance, we used the Mascot search as a filter to approximate such curation. Spectra that yielded peptide identifications at a false positive rate of 5% or better were retained and written to text files, yielding about 150,000 such spectra. Of the above spectra, 111,604 spectra were read from files to build a library, while 44,263 spectra were kept in reserve to serve as queries against the library. Spectra were binned at 0.1 m/z resolution and normalized to unit vectors. Each vector spans a mass range from 0 to 2000 Da, and thus contains 20,000 sample values.
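The binning and normalization step can be sketched as follows, assuming that a spectrum is supplied as a list of (m/z, intensity) pairs. The function name, the floor-based bin assignment, and the decision to discard peaks outside the range are assumptions made for this illustration.

```python
import math


def bin_spectrum(peaks, bin_width=0.1, max_mz=2000.0):
    """Accumulate peak intensities into fixed-width m/z bins and scale the
    result to unit Euclidean length."""
    n_bins = round(max_mz / bin_width)
    vector = [0.0] * n_bins
    for mz, intensity in peaks:
        if mz < 0:
            continue  # ignore physically meaningless negative m/z values
        index = int(mz / bin_width)  # floor for non-negative m/z
        if index < n_bins:
            vector[index] += intensity
    norm = math.sqrt(sum(v * v for v in vector))
    if norm > 0:
        vector = [v / norm for v in vector]
    return vector
```

With the default parameters the returned vector has 20,000 elements, matching the count stated above.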
The library was partitioned into sub-libraries representing 1 m/z unit (Da/e) precursor isolation steps. This relatively coarse partitioning of the library enables “blind” de-multiplexing. In blind de-multiplexing, it is possible, in theory, to detect precursors in an MS2 spectrum, even when the precursor is not detected in the MS1 scan. Alternatively, one can use the accurate mass measurement of the detected precursor(s) in the isolation window to limit the search to a very small number of candidate precursors. The difficulty of the de-multiplexing problem grows non-linearly with the number of candidates in the library partition. We chose blind de-multiplexing as a test case to demonstrate the power of the method. The distribution of library sizes is shown as graph 72 in
The library was formed from non-redundant spectra, with the intent of retaining one copy of a spectrum from each distinct precursor. Redundant spectra were consolidated as follows: the matrix $D^{T}D$ containing all pairwise dot products of spectra (correlation coefficients) was examined for entries above a threshold. Highly correlated pairs were aggregated, reduced to distinct sets, and averaged. Single averaged spectra then replaced their “parent” spectra. Consolidation of a very large number of redundant entries explains why the final count of spectra (27,714) is much smaller than the size of the starting set.
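One way to carry out such a consolidation is sketched below. The sketch assumes unit-normalized spectra, so that a dot product serves as a correlation coefficient, and groups highly correlated pairs into distinct sets with a union-find structure. The threshold of 0.9 and the renormalization of each averaged spectrum are assumptions of this example rather than details taken from the test library.

```python
import math


def consolidate(spectra, threshold=0.9):
    """Merge spectra linked by pairwise dot products above the threshold
    and replace each linked set by its renormalized average."""
    n = len(spectra)
    parent = list(range(n))

    def find(i):
        # Follow parent links to the set representative, compressing the path.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            dot = sum(a * b for a, b in zip(spectra[i], spectra[j]))
            if dot > threshold:
                parent[find(j)] = find(i)

    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)

    merged = []
    for members in groups.values():
        length = len(spectra[members[0]])
        average = [sum(spectra[m][k] for m in members) / len(members)
                   for k in range(length)]
        norm = math.sqrt(sum(v * v for v in average))
        merged.append([v / norm for v in average] if norm > 0 else average)
    return merged
```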
The 44,263 test spectra were sorted by precursor m/z and searched against the library, taking about two minutes on a laptop computer. An example results matrix is shown in
The above calculations attempt to express the observed spectrum as a mixture of precursors. Because the spectra were pre-filtered by retaining only those spectra that produced high-confidence Mascot identifications, we expect most of the spectra to contain one (pure) component. The calculated results indicate that, in most cases, the method reported (presumably correctly) that one pure precursor was present. These values can be seen to form a near diagonal wall (the predominant feature of the surface 82) across the contour plot. The continuity between hits reflects the fact that both sets were sorted by precursor m/z prior to searching. Significant hits off the diagonal can represent incorrect precursor assignment in the test data, or the presence of multiple peptides in the test spectra (multiplexed spectra).
A visual inspection of the results illustrated in
Detection of unintentionally multiplexed spectra is one major potential application of this approach; another is the decomposition of intentionally multiplexed spectra, that is, spectra derived from multiple sequential isolations of different precursor species, followed by combined analysis of their fragments. To explore the suitability of our approach for such data, we took a set of MS2 spectra in which fragmentation was accomplished by HCD, such as in the system 15 illustrated in
Although not readily apparent in
The discussion included in this application is intended to serve as a basic description. Although the present invention has been described in accordance with the various embodiments shown and described, one of ordinary skill in the art will readily recognize that there could be variations to the embodiments and those variations would be within the spirit and scope of the present invention. The reader should be aware that the specific discussion may not explicitly describe all embodiments possible; many alternatives are implicit. Accordingly, many modifications may be made by one of ordinary skill in the art without departing from the spirit, scope and essence of the invention. Neither the description nor the terminology is intended to limit the scope of the invention. All patent application disclosures, patent application publications or other publications are hereby explicitly incorporated by reference herein as if fully set forth herein. In any instances in which such incorporated material is in conflict with the present disclosure, the present disclosure shall control.
This application claims the priority benefit, under 35 U.S.C. 120, to U.S. Provisional Application for Patent No. 61/728,600, filed on Nov. 20, 2012 and titled “Interpreting Multiplexed Tandem Mass Spectra Using Local Spectral Libraries” and to U.S. Provisional Application for Patent No. 61/728,611, filed on Nov. 20, 2012 and titled “Methods for Generating Local Mass Spectral Libraries for Interpreting Multiplexed Mass Spectra”, both said applications in the names of the inventors of this application and assigned to the assignee of this application, and incorporated herein by reference in their entireties. Additionally, this application is related to a co-pending U.S. patent application Ser. No. 14/085,356, filed on Nov. 20, 2013 and titled “Methods for Generating Local Mass Spectral Libraries for Interpreting Multiplexed Mass Spectra”, which is in the names of the inventors of this application and is assigned to the assignee of this application.
Number | Date | Country
---|---|---
61728600 | Nov 2012 | US
61728611 | Nov 2012 | US