A portion of the disclosure of this patent document contains material which is subject to copyright protection. The copyright owner has no objection to the facsimile reproduction by anyone of the patent document or the patent disclosure, as it appears in the Patent and Trademark Office patent files or records, but otherwise reserves all copyrights whatsoever.
The invention relates to the fields of mass spectrometry and the identification of polypeptides and other biomolecules.
Mass spectrometry and related techniques have become important tools in the analysis of proteins, peptides, carbohydrates, and other biomolecules and biomolecule fragments, the understanding and identification of which are important in a wide variety of fields. For example, proteomic research programs typically include the identification of the protein content of any given tissue, cell, subcellular organelle or bodily fluid, together with isoforms, splice variants, post-translational modifications, interacting partners, and higher-order complexes under different conditions. In other applications, samples from different study conditions, such as healthy, diseased and disease-treated, are compared with the intent of identifying proteins that are differentially expressed between the conditions. These proteins can be developed into therapeutics, biomarkers or diagnostics of human disease. Such analyses also aid in the fundamental understanding of disease and disease treatment. Indeed, many activities, innovations and decisions in basic biological research and pharmaceutical development depend on the accuracy of protein identification.
In one aspect, for example, the invention provides computer-usable media comprising computer-readable programming code adapted for causing a computer or other data processor to access data representing a plurality of expression patterns of peptides or other biomolecule fragments expressed from one or more samples and, using the accessed data, to identify or otherwise associate at least one protein or other biomolecule associated with the plurality of fragment expression patterns, and to determine coefficients useable for measuring correlations between the pluralities of expression patterns identified as associated with the various biomolecules. Such coefficients can be used, for example, in conjunction with, or without, other data to identify relatively high-confidence and relatively low-confidence associations of fragments with precursor biomolecules.
Thus for example coefficients indicating a relatively low confidence in an association of a peptide or other biomolecule fragment with a protein or other biomolecule can be used to ensure that the association is not considered in subsequent analyses, or is at least identified as indicating a less-reliable identification and used accordingly in subsequent analyses. Furthermore such coefficients representing the correlation of peptide or biomolecule fragments matched to homologous or closely related biomolecules can be used to more accurately interpret the identification data and resolve between previously indistinguishable biomolecules or proteins.
The use of stored data sets representing previously-conducted analyses may be useful, for example, in confirming or improving the results of prior analyses. Stored data sets may be accessed from memory associated with the processor, for example as part of a computer adapted for controlling a mass spectrometer instrument, from a database accessed locally or from a local network source, for example over a local area network (LAN), or remotely over a public or private electronic communications network (ECN) such as the internet or a private subscription service.
Thus, in an aspect of the invention there is a method useful in an identification of proteins. The method may be performed by a data processor and comprise: accessing data representing a plurality of expression patterns of peptides expressed from one or more samples; using the accessed data, identifying at least one protein associated with the plurality of peptide expression patterns; selecting a correlation coefficient useable for determining a correlation between each at least one protein and a plurality of expression patterns of peptides identified as associated therewith; and using at least the correlation coefficient, identifying at least one of a relatively high-confidence association and at least one of a relatively low-confidence association of precursor proteins with the peptides expressed from the one or more samples.
The correlation coefficient may include a correlation threshold value and a coverage threshold value. Identifying the at least one relatively high-confidence and low-confidence associations of precursor proteins may include: identifying a largest subset of the plurality of expression patterns associated with each of the at least one protein, the subset having pairwise correlation above the correlation threshold value; and identifying each of the at least one protein as (i) a relatively high-confidence association of precursor proteins if the subset size is greater than or equal to the coverage threshold value, and (ii) a relatively low-confidence association of precursor proteins if the subset size is smaller than the coverage threshold value.
The method may further comprise accessing second data representing randomized expression patterns of peptides. It may further comprise, using at least the correlation coefficient, identifying from the second data at least one of a relatively high-confidence by-chance association and at least one of a relatively low-confidence by-chance association of the at least one protein with the peptides expressed from the one or more samples. This identifying from the second data may be by: identifying in the second data a largest subset of the plurality of expression patterns by-chance associated with each of the at least one protein, the subset having pairwise correlation above the correlation threshold value; and identifying each of the at least one protein as (i) a relatively high-confidence by-chance association if the subset size is greater than or equal to the coverage threshold value, and (ii) a relatively low-confidence by-chance association if the subset size is smaller than the coverage threshold value.
The method may further comprise determining a false positive rate as a ratio of a total of the at least one relatively high-confidence by-chance association of the at least one protein with the peptides expressed from the one or more samples over a total of the at least one relatively high-confidence association of the precursor proteins. The method may further comprise evaluating whether the false positive rate is unacceptable and, if it is unacceptable, selecting a new correlation threshold to replace the correlation threshold for use in repeating said identifying steps until the false positive rate is acceptable.
The expression patterns may be obtained by liquid-chromatography/mass spectrometry (LC-MS) analysis. The data relating to each expression pattern may be obtained by digesting a corresponding peptide with a protease. The accessing data representing the pluralities of expression patterns of peptides may comprise accessing data obtained using mass spectrometry. The accessing data representing the pluralities of expression patterns of peptides may comprise accessing data obtained using virtual mass spectrometry. The data representing the plurality of expression patterns of peptides expressed from the one or more samples may be accessed at least in part from real time analysis by a mass spectrometry device associated with the processor.
The data representing a plurality of expression patterns of peptides expressed from one or more samples may be accessed at least in part from a stored data set. The stored data set may be stored in persistent media associated with the data processor. The stored data set may be accessed via a public communications network. The correlation may be between expression patterns obtained from a plurality of samples, with at least two of the samples collected from different subjects. The correlation may be between expression patterns from a plurality of samples, with at least two of the samples collected from a same subject at different times.
In another aspect of the invention, there is a method of validating a biomolecule identification from a plurality of peptides. The method may comprise: using at least an assignment of the plurality of peptides to at least one precursor biomolecule from a set of peptide expression profiles, determining a correlation coefficient for correlating the assignment of the plurality of peptides to the at least one precursor biomolecule within a false positive identification rate; and validating the biomolecule identification based on the assignment, if the biomolecule identification is correlated to one or more of the at least one precursor biomolecule within the false positive identification rate.
The false positive identification rate may be determined as a function of an expected random correlation between the plurality of peptides to the at least one biomolecule within the set of peptide expression profiles.
The expected random correlation may be a total number of expected false identifications based on the at least one biomolecule. The false positive identification rate may be determined as a ratio of the total number of expected false identifications over a total number of identifiable biomolecules. The total number of identifiable biomolecules may be based on the at least one biomolecule.
The correlation coefficient may comprise a correlation threshold and a coverage threshold. The total number of identifiable biomolecules may be determined by, for each of the at least one biomolecule, incrementing the total number of identifiable biomolecules if, in the set of peptide expression profiles, a largest subset of peptide assignments to that biomolecule has pairwise correlation above the correlation threshold and the subset has a size above the coverage threshold. The total number of expected false identifications may be determined by, for each of the at least one biomolecule, incrementing the total number of expected false identifications if, in a randomized set of peptide expression profiles, another largest subset of peptide assignments to that biomolecule has pairwise correlation above the correlation threshold and the subset has a size above the coverage threshold. The randomized set of peptide expression profiles may be generated from the set of peptide expression profiles.
The correlation coefficient may be selected on the basis of the false positive identification rate. The biomolecule may be a protein. The correlation coefficient may be selected from a plurality of test correlation coefficients, each of the test correlation coefficients being used to calculate a respective test false identification rate in the same manner that the correlation coefficient is used to determine the false positive identification rate. The test correlation coefficient having a test false identification rate that is closest within the false positive identification rate may be selected as the correlation coefficient.
The correlation coefficient may be selected by initially selecting a test correlation coefficient to determine a test false identification rate in the same manner that the correlation coefficient is used to determine the false positive identification rate. If the test false identification rate is not within the false positive identification rate, the method may iteratively adjust the test correlation coefficient until the test false identification rate is within the false positive identification rate, and then select the test correlation coefficient as the correlation coefficient.
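By way of illustration only, the iterative selection described above may be sketched as follows. The function name `select_threshold`, the callable `fpr_at`, the starting value and the step size are all hypothetical and are not taken from the disclosure; the sketch also assumes that the adjustment is a simple upward step of the correlation threshold, which is only one of many possible adjustment schemes.

```python
def select_threshold(fpr_at, target_fpr, start=0.5, step=0.05, max_threshold=1.0):
    """Raise a test correlation threshold until its false identification
    rate falls within target_fpr.

    fpr_at is a hypothetical callable returning the test false
    identification rate for a given threshold, or None if undefined.
    Returns (threshold, rate), or None if no tested threshold qualifies.
    """
    threshold = start
    while threshold <= max_threshold:
        rate = fpr_at(threshold)
        if rate is not None and rate <= target_fpr:
            return threshold, rate
        # Round to suppress floating-point drift in the accumulated steps.
        threshold = round(threshold + step, 10)
    return None


# Toy stand-in for a real rate computation: the rate falls as the threshold rises.
selected = select_threshold(lambda t: 1.0 - t, target_fpr=0.2)
print(selected)
```

In practice `fpr_at` would wrap a randomization-based rate estimate, and a stricter threshold would generally be expected to reduce the number of accepted identifications as well as the false identification rate.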
In a further aspect of the invention, there is a computer usable medium having computer readable code embodied therein. The computer readable code may cause a computer to: access data representing a plurality of expression patterns of peptides expressed from one or more samples; and, using the accessed data, identify at least one protein associated with the plurality of peptide expression patterns. The computer readable code may further cause the computer to select a correlation coefficient useable for determining a correlation between each at least one protein and a plurality of expression patterns of peptides identified as associated therewith, the correlation coefficient having a correlation threshold value and a coverage threshold value. The computer readable code may further cause the computer to, using at least the correlation coefficient, identify at least one of a relatively high-confidence association and at least one of a relatively low-confidence association of precursor proteins with the peptides expressed from the one or more samples, by: identifying a largest subset of the plurality of expression patterns associated with each of the at least one protein, the subset having pairwise correlation above the correlation threshold value; and identifying each of the at least one protein as (i) a relatively high-confidence association of precursor proteins if the subset size is greater than or equal to the coverage threshold value, and (ii) a relatively low-confidence association of precursor proteins if the subset size is smaller than the coverage threshold value.
The computer readable code may further cause the computer to access second data representing randomized expression patterns of peptides. The computer readable code may further cause the computer to, using at least the correlation coefficient, identify from the second data at least one of a relatively high-confidence by-chance association and at least one of a relatively low-confidence by-chance association of the at least one protein with the peptides expressed from the one or more samples. The identifying from the second data may be by: identifying in the second data a largest subset of the plurality of expression patterns by-chance associated with each of the at least one protein, the subset having pairwise correlation above the correlation threshold value; and identifying each of the at least one protein as (i) a relatively high-confidence by-chance association if the subset size is greater than or equal to the coverage threshold value, and (ii) a relatively low-confidence by-chance association if the subset size is smaller than the coverage threshold value.
The computer readable code may further cause the computer to determine a false positive rate as a ratio of a total of the at least one relatively high-confidence by-chance association of the at least one protein with the peptides expressed from the one or more samples over a total of the at least one relatively high-confidence association of the precursor proteins. The computer readable code may further cause the computer to evaluate whether the false positive rate is unacceptable and, if it is unacceptable, to select a new correlation threshold to replace the correlation threshold for use in repeating said identifying steps until the false positive rate is acceptable.
In another aspect, there is a method for improving and measuring the accuracy of protein identification using peptide expression profiles. The method may comprise: providing a plurality of peptide-to-protein assignments; providing an expression profile over a plurality of samples for a plurality of peptides; for a plurality of correlation coefficient threshold and peptide coverage threshold pairs, determining the false positive protein identification rate for each said pair using randomizations of the peptide expression profiles; and for an optimal selection of the correlation coefficient threshold and peptide coverage threshold as determined by the false positive protein identification rate and number of proteins identified, generating a new peptide-to-protein assignment where all peptides assigned to a protein are pairwise correlated at or above the correlation coefficient threshold and the number of said peptides is at least the peptide coverage threshold.
In another aspect, there is a method of identifying biomolecules. The method may be performed by an automatic data processor and comprise: accessing data representing a plurality of expression patterns of biomolecule fragments expressed from one or more samples; using the accessed data, identifying at least one precursor biomolecule associated with said plurality of peptide expression patterns; determining a coefficient useable for measuring a correlation between a plurality of expression patterns of biomolecule fragments identified as associated with said precursor biomolecule; and based at least partly on the coefficient, identifying at least one of a relatively high-confidence and a relatively low-confidence association of peptides with precursor proteins.
In another aspect, there is an apparatus useful for identifying proteins. The apparatus may comprise a data processor adapted to: access data representing a plurality of expression patterns of peptides expressed from one or more samples; using the accessed data, identify at least one protein associated with said plurality of peptide expression patterns; determine a coefficient useable for measuring a correlation between a plurality of expression patterns of peptides identified as associated with said protein; and based at least partly on the coefficient, identify at least one of a relatively high-confidence and a relatively low-confidence association of peptides with precursor proteins.
The plurality of peptide expression patterns may represent the expression of all peptides detected in a sample. The correlation coefficient may be determined only between expression patterns associated with peptides that are associated with a single protein. The processor may be adapted to access the data representing the expression patterns as signals provided by a liquid-chromatography/mass spectroscopy (LC-MS) analysis device. The processor may be adapted to access the data representing the expression patterns as signals recorded in persistent storage media. The persistent media may be associated with the data processor. The processor may be adapted to access the persistent media via a public communications network. The processor may be adapted to access the data representing the expression patterns as signals stored in volatile memory.
In another embodiment, there is an apparatus useful for identifying biomolecules. The apparatus may comprise a data processor adapted to: access data representing a plurality of expression patterns of biomolecule fragments expressed from one or more samples; using the accessed data, identify at least one precursor biomolecule associated with said plurality of biomolecule fragment expression patterns; determine a coefficient useable for measuring a correlation between a plurality of expression patterns of biomolecule fragments identified as associated with said precursor biomolecule; and based at least partly on the coefficient, identify at least one of a relatively high-confidence and a relatively low-confidence association of peptides with precursor proteins.
The foregoing and other aspects of the invention will become more apparent from the following description of specific embodiments thereof and the accompanying drawings which illustrate, by way of example only, the principles of the invention. In the drawings, where like elements feature like reference numerals (and wherein individual elements bear unique alphabetical suffixes):
The description which follows, and the embodiments described therein, are provided by way of illustration of an example, or examples, of particular embodiments of the principles of the present invention. These examples are provided for the purposes of explanation, and not limitation, of those principles and of the invention. In the description, which follows, like parts are marked throughout the specification and the drawings with the same respective reference numerals.
Bottom-up proteomics covers an approach to proteomics in which biomolecules, such as proteins within a sample, are digested using an enzyme such as trypsin, resulting in a collection of peptides. The digested protein is generally referred to as the parent protein or precursor of the derived tryptic peptides. Protein identification in the context of bottom-up proteomics covers the assignment of peptides to parent proteins using proteomic technologies such as tandem mass spectrometry. The accuracy of protein identification is typically measured by the proportion of true positive to false positive parent protein identifications. See for example,
Advantageously, in embodiments of the invention described below, protein identification in the context of bottom-up proteomics includes a procedure where a peptide-to-protein assignment is filtered by an independent procedure that differentiates the peptides likely to be true positive assignments from those likely to be false positive assignments. Furthermore, this procedure can tend to rigorously quantify the resulting false positive protein identification rate. The procedure, as used in protein identification, is referred to as PRotein IDentification and Expression (PRIDE).
Embodiments of the invention provide systems, methods, apparatus, and programming useful for improving the accuracy of peptide-to-biomolecule, or peptide-to-protein, assignments by utilizing expression profiles for each peptide and defining a procedure for determining the false positive rate of biomolecule identification.
More specifically, in an embodiment of the invention, there is taken as input a plurality of putative peptide-to-protein assignments and, for each peptide, an expression profile across a plurality of samples. The embodiment then measures the correlation of the expression profiles for each pair of peptides. A correlation threshold and a coverage threshold are determined (as described in more detail below), and the largest set of peptides that have pairwise correlation coefficients, or scores, above the correlation threshold is selected as the correct peptide-to-protein assignments. If the size of this set of peptides is less than the coverage threshold, then the protein is determined to be a false positive protein identification. The false positive protein identification rate is determined for multiple correlation and coverage threshold values, which enables the optimization of these two parameters so that the false positive protein identification rate can tend to be minimized, while tending to maximize the number of acceptable protein identifications.
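For purposes of explanation, and not limitation, the optimization over the two thresholds may be sketched as a search over a grid of candidate pairs. The names `choose_thresholds` and `evaluate`, the grids, and the particular notion of "optimal" used here (the most identifications among pairs whose rate does not exceed a ceiling, with ties broken by the lower rate) are assumptions made for illustration and are not specified by the disclosure.

```python
from itertools import product


def choose_thresholds(evaluate, corr_grid, coverage_grid, max_fpr):
    """Pick the (correlation, coverage) pair that identifies the most
    proteins while keeping the false positive rate at or below max_fpr.

    evaluate is a hypothetical callable:
        evaluate(corr_threshold, coverage_threshold) -> (fpr, n_identified)
    Returns a dict describing the chosen pair, or None if none qualifies.
    """
    best = None
    for corr_thr, cov_thr in product(corr_grid, coverage_grid):
        fpr, n_identified = evaluate(corr_thr, cov_thr)
        if fpr > max_fpr:
            continue
        # Prefer more identifications; break ties with the lower rate.
        score = (n_identified, -fpr)
        if best is None or score > best[0]:
            best = (score, corr_thr, cov_thr, fpr, n_identified)
    if best is None:
        return None
    _, corr_thr, cov_thr, fpr, n_identified = best
    return {
        "corr_threshold": corr_thr,
        "coverage_threshold": cov_thr,
        "fpr": fpr,
        "n_identified": n_identified,
    }


# Toy lookup table standing in for a real randomization-based evaluation.
toy_results = {
    (0.8, 2): (0.10, 50),
    (0.8, 3): (0.04, 40),
    (0.9, 2): (0.05, 45),
    (0.9, 3): (0.01, 30),
}
choice = choose_thresholds(lambda c, k: toy_results[(c, k)], [0.8, 0.9], [2, 3], max_fpr=0.05)
print(choice)
```

Other trade-offs between the rate and the number of identifications are equally possible; the ceiling-based rule is used here only because it is simple to state.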
Examples of technologies that generate peptide to biomolecule assignments include tandem mass spectrometry coupled with protein database search engines such as Mascot (Matrix Science, London, UK). Tandem mass spectrometry can also be coupled with de novo sequencing tools such as PEAKS (Bioinformatics Solutions, Waterloo, Canada) followed by protein homology searches. Fingerprinting tools such as Aldente (Expasy, Swiss Institute of Bioinformatics, Geneva, Switzerland) can be used also.
The peptide expression profiles used in the embodiment can originate from mass spectrometric analyses of biological or clinical samples including technologies such as MALDI, ESI and SELDI. Peptide expression levels across samples may also be measured using immunoassays or any other technology that quantifies peptide levels. ICAT and other labeling technologies can also generate peptide expression profiles (see for example Gygi, S P et al., supra).
Correlations between the pluralities of expression profiles of peptides may be determined using any suitable algorithm or method. Examples include the Pearson correlation, Spearman ρ correlation, Kendall's τ correlation, correlation ratio and mutual information, Gamma association, Stuart's tau-c, and Somers' D correlations, as well as other widely-accepted standard definitions employing least-squares curve fitting. See for example, Cohen, J. et al., supra.
The selection of the largest set of pairwise correlating peptides may be performed using various established algorithms including graph theoretic algorithms (largest clique) and hierarchical clustering.
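As one non-limiting illustration of the graph-theoretic (largest clique) option, the sketch below treats each peptide as a vertex, joins two vertices when their correlation exceeds the threshold, and searches for a largest clique using the basic Bron-Kerbosch recursion. The choice of Bron-Kerbosch, the function name `largest_correlated_subset`, and the matrix input format are assumptions for illustration; the disclosure does not name a specific clique algorithm.

```python
def largest_correlated_subset(corr, threshold):
    """Return indices of a largest set of peptides whose pairwise
    correlations all exceed threshold.

    corr is a symmetric square matrix (list of lists) of pairwise
    correlation coefficients for the peptides assigned to one protein.
    """
    n = len(corr)
    # Adjacency: peptides i and j are linked if their correlation passes.
    neighbors = [
        {j for j in range(n) if j != i and corr[i][j] > threshold}
        for i in range(n)
    ]
    best = []

    def expand(clique, candidates, excluded):
        nonlocal best
        if not candidates and not excluded:
            # clique is maximal; keep it if it is the largest seen so far.
            if len(clique) > len(best):
                best = sorted(clique)
            return
        for v in list(candidates):
            expand(clique + [v], candidates & neighbors[v], excluded & neighbors[v])
            candidates = candidates - {v}
            excluded = excluded | {v}

    expand([], set(range(n)), set())
    return best


# Peptides 0-2 correlate strongly with one another; peptide 3 does not.
corr = [
    [1.0, 0.9, 0.9, 0.1],
    [0.9, 1.0, 0.9, 0.1],
    [0.9, 0.9, 1.0, 0.1],
    [0.1, 0.1, 0.1, 1.0],
]
print(largest_correlated_subset(corr, 0.8))
```

The worst-case running time of clique search is exponential, which is generally tolerable here because the number of peptides assigned to any one protein is typically small; hierarchical clustering is an alternative when it is not.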
The false positive rate of protein identification may be determined using methods such as permutation tests on the underlying expression data and other similar randomization techniques.
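A minimal sketch of one such randomization follows, offered by way of example only. It assumes that each peptide's expression values are shuffled independently across samples, that a protein is accepted when its largest correlated subset is at least the coverage threshold, and that the rate is the mean number of by-chance acceptances divided by the number of acceptances in the real data. All function names, the brute-force subset search, and the default of 100 randomizations are hypothetical.

```python
import math
import random
from itertools import combinations


def pearson(x, y):
    """Pearson correlation of two equal-length profiles (0.0 if either is flat)."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    dx = [v - mean_x for v in x]
    dy = [v - mean_y for v in y]
    denom = math.sqrt(sum(d * d for d in dx) * sum(d * d for d in dy))
    return sum(a * b for a, b in zip(dx, dy)) / denom if denom else 0.0


def largest_subset_size(profiles, corr_threshold):
    """Size of the largest subset whose pairwise correlations all exceed the threshold."""
    n = len(profiles)
    for size in range(n, 1, -1):  # brute force: suitable only for small peptide sets
        for subset in combinations(range(n), size):
            if all(pearson(profiles[i], profiles[j]) > corr_threshold
                   for i, j in combinations(subset, 2)):
                return size
    return min(n, 1)


def count_identified(assignments, corr_threshold, coverage_threshold):
    """Number of proteins whose largest correlated subset meets the coverage threshold."""
    return sum(
        largest_subset_size(profiles, corr_threshold) >= coverage_threshold
        for profiles in assignments.values()
    )


def randomize(assignments, rng):
    """Shuffle every peptide profile independently across samples."""
    return {
        protein: [rng.sample(profile, len(profile)) for profile in profiles]
        for protein, profiles in assignments.items()
    }


def false_positive_rate(assignments, corr_threshold, coverage_threshold,
                        n_randomizations=100, seed=0):
    """Mean by-chance identifications divided by identifications in the real data."""
    identified = count_identified(assignments, corr_threshold, coverage_threshold)
    if identified == 0:
        return None  # rate is undefined when nothing is identified
    rng = random.Random(seed)
    by_chance = sum(
        count_identified(randomize(assignments, rng), corr_threshold, coverage_threshold)
        for _ in range(n_randomizations)
    )
    return (by_chance / n_randomizations) / identified


# assignments maps a protein name to the expression profiles of its peptides.
assignments = {
    "protein_A": [[1, 2, 3, 4, 5, 6], [2, 4, 6, 8, 10, 12], [3, 6, 9, 12, 15, 18]],
    "protein_B": [[1, 5, 2, 6, 3, 4], [6, 1, 4, 2, 5, 3]],
}
print(false_positive_rate(assignments, corr_threshold=0.95, coverage_threshold=3))
```

Shuffling within each profile preserves the distribution of each peptide's values while breaking any shared pattern across samples, which is why the randomized data can serve as an estimate of by-chance correlation.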
It is possible that peptides are related biochemically, but in general they are not biochemically related. For the embodiment, the only assumed relationship is that they originate from the same parent protein or biomolecule.
The embodiment does not require that any of the putative peptide-to-protein (or biomolecule) assignments be correct. In some instances, the procedure may find that none of the assigned peptides correlate.
This is based on the observation that peptides originating from the same protein or biomolecule precursor will tend to share the same expression profile across samples in a bottom-up proteomics study. This follows from the fact that the protein expression profile is determined in vivo before the proteins in the samples are digested (say, by trypsin) to obtain peptides.
A distinct but related concept is that peptides exhibiting correlated expression profiles are biochemically or biologically related in vivo; see for example J. Lamerz et al., supra. This latter working assumption is the converse of the working theory upon which PRIDE and the embodiments are based. More specifically, a PRIDE system utilizes a peptide-to-protein assignment which associates peptides together because they are assigned to the same protein by a protein identification procedure. As applied in the embodiments, the PRIDE system then determines whether or not these peptides have correlated expression profiles.
Further details on particular embodiments of PRIDE are now provided. In analyses, the samples may include, for example, multiple samples taken from a single source, such as a human or animal patient or test subject, or samples taken from multiple human or other subjects, such as multiple patients in a clinical program or study. For example, multiple samples may be collected from healthy and diseased individuals.
As described herein, biomolecules include proteins, polypeptides, peptides, and carbohydrates. Biomolecule fragments include proteins, polypeptides, peptides, amino acids, carbohydrates, and any other portions into which biomolecules may be separated. The terms “peptide” and “parent protein” are well understood by a person of skill in the relevant arts and require no further elaboration.
A polypeptide includes a chain of two or more amino acids, regardless of any post-translational modification (e.g., glycosylation or phosphorylation). Polypeptides include proteins and peptides. Source polypeptides may be cleaved by the action of a protease into one or more digestion fragments, or otherwise fragmented by any means compatible with the purposes disclosed herein.
A digestion fragment includes a portion of a polypeptide produced, actually or theoretically, by for example the action of a protease or other agent that reproducibly cleaves or otherwise fragments the polypeptide.
A source polypeptide includes a polypeptide from which a specified digestion fragment is actually or theoretically produced by, for example, the action of a protease or other chemical cleavage agent that reproducibly cleaves or otherwise fragments the source polypeptide. A source polypeptide typically contains at least two potential digestion fragments.
A fraction includes a portion of an analyte or sample separation. A fraction may correspond to a volume of liquid obtained during a defined time interval, for example, as in LC (liquid chromatography). A fraction may also correspond to a spatial location in a separation such as a band in a separation of a biomolecule facilitated by gel electrophoresis, e.g., SDS-PAGE. Furthermore, a fraction may correspond to an elution from a chromatography medium, e.g., strong cation exchange.
In an embodiment, the pairwise correlation between ordered lists of values, X and Y, may be viewed as a measurement of the dependence between the two lists. In a positive correlation, as values in X increase the values in Y also increase. In a negative correlation, as values in X increase the values in Y decrease. If the dependence is linear, then the pairwise correlation between X and Y is often measured using the Pearson correlation, defined as:
$$r_{XY} = \frac{1}{n-1} \sum_{i=1}^{n} \left( \frac{x_i - \bar{x}}{s_x} \right) \left( \frac{y_i - \bar{y}}{s_y} \right)$$
where $x_i$ and $y_i$ are the values of X and Y, $n$ is the number of values in each list, $\bar{x}$ and $\bar{y}$ are the means, and $s_x$ and $s_y$ are the sample standard deviations. The Pearson correlation tends towards 1 if there is a positive linear dependence and tends towards (−1) if there is a negative linear dependence. As the Pearson correlation tends to 0, the linear dependence between X and Y weakens. As such, the Pearson correlation is an indication of the degree of linear dependence between X and Y. In the context of peptide expression profiles, the correlation between pairs of peptide expression profiles may be quantified using the Pearson correlation or other measures of dependence, as described below. In an embodiment, ordered lists of values such as X and Y can be log-transformed or normalized before quantifying the degree of dependence.
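By way of example only, the Pearson correlation between two expression profiles, with an optional log transformation applied beforehand, may be computed as sketched below. The function names, the pseudocount used to avoid taking the logarithm of zero, and the convention of returning 0.0 for a constant profile are assumptions for illustration.

```python
import math


def pearson(x, y):
    """Pearson correlation between two equal-length ordered lists of values."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("profiles must have the same length, at least 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    dx = [v - mean_x for v in x]
    dy = [v - mean_y for v in y]
    sum_sq_x = sum(d * d for d in dx)
    sum_sq_y = sum(d * d for d in dy)
    if sum_sq_x == 0 or sum_sq_y == 0:
        # Assumption: a constant profile is treated as uncorrelated.
        return 0.0
    # The (n - 1) factors of the sample standard deviations cancel out.
    return sum(a * b for a, b in zip(dx, dy)) / math.sqrt(sum_sq_x * sum_sq_y)


def log_transform(values, pseudocount=1.0):
    """Natural-log transform of intensities; pseudocount keeps zeros finite."""
    return [math.log(v + pseudocount) for v in values]


# Two hypothetical peptide intensity profiles across five samples.
peptide_1 = [120.0, 340.0, 90.0, 410.0, 200.0]
peptide_2 = [60.0, 180.0, 50.0, 205.0, 95.0]
print(pearson(log_transform(peptide_1), log_transform(peptide_2)))
```

Log transformation is often helpful for intensity data because it compresses the dynamic range, so that a few very abundant samples do not dominate the coefficient.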
Referring now to
The embodiment of
After digestion there is an optional separation at 102. There are many separation technologies (see, for example, Laemmli, supra and Schagger et al., supra) including SDS-PAGE, SCX (Strong Cation Exchange), IEF (Isoelectric Focusing) among others. Such separation techniques are well known to a person of skill, and are therefore not repeated herein for brevity.
After separation, the fractions are submitted to an LC-MS analysis at 103. At 103, raw expression data is obtained for peptides. Exemplary methods for analyzing polypeptides and other biomolecules using mass spectrometry techniques are well known in the art (see for example, Godovac-Zimmermann et al., supra, Gygi et al. II, supra, Reinders et al., supra and Aebersold et al., supra), and doubtless others will hereafter be developed. The exact type of mass spectrometer used is not critical to the embodiments disclosed herein, and a person of skill will understand, with the descriptions herein, how to operate a mass spectrometer in accordance with the described embodiments.
Although the description of the embodiments herein is focused on polypeptides and other biomolecules, the embodiments are generally applicable to any biological polymers, e.g., oligosaccharides and polysaccharides, lipids, nucleic acids, and metabolites, capable of being detected via mass spectrometry.
After the raw expression information is obtained in 103, at 104 the raw LC-MS data is processed in a series of refinements. Such processing of LC-MS raw data is shown in
Once peptides have been detected, three dimensions of LC-MS data, namely, mass, retention time and intensity, are normalized across the study. For the embodiment, this is accomplished by selecting a standard sample and normalizing to that sample. The next step of data processing is clustering. The goal of clustering is to track the same peptide, within a fraction, across all samples of the study. This is achieved by performing hierarchical clustering on mass and retention time for each fraction.
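The clustering step may be illustrated, without limitation, by the sketch below, which groups detected features from several samples of one fraction by agglomerative (hierarchical) clustering on mass and retention time. The single-linkage rule, the tolerance values, the scaled distance, and the function name `cluster_features` are all assumptions chosen for illustration; the disclosure does not specify a linkage or cut-off.

```python
def cluster_features(features, mz_tol=0.01, rt_tol=0.5):
    """Group (mass, retention_time) features that likely represent one peptide.

    Single-linkage agglomerative clustering: the two closest clusters are
    merged repeatedly until the closest pair is further apart than one
    tolerance unit in either mass or retention time.
    Returns a list of clusters, each a sorted list of feature indices.
    """
    def distance(a, b):
        # Scale each dimension by its tolerance so that 1.0 is the cut-off.
        return max(abs(a[0] - b[0]) / mz_tol, abs(a[1] - b[1]) / rt_tol)

    clusters = [[i] for i in range(len(features))]
    while len(clusters) > 1:
        closest = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = min(distance(features[i], features[j])
                        for i in clusters[a] for j in clusters[b])
                if closest is None or d < closest[0]:
                    closest = (d, a, b)
        if closest[0] > 1.0:
            break  # remaining clusters are too far apart to merge
        _, a, b = closest
        clusters[a].extend(clusters.pop(b))
    return [sorted(c) for c in clusters]


# Hypothetical features from three samples of one fraction: (mass, retention time).
features = [
    (500.000, 10.0),
    (500.004, 10.2),
    (500.001, 9.9),
    (620.300, 25.0),
    (620.305, 25.3),
]
print(cluster_features(features))
```

Each resulting cluster can then be read as one peptide tracked across samples, with the per-sample intensities of its members forming that peptide's expression profile.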
Referring back to
Consequently, for the embodiment every peptide is assigned a unique identifier, the fraction it was detected in, the median m/z ratio and median retention time at which it was detected across the n samples of the study, the charge state and a vector representing the expression profile of the peptide across the study. In a typical plasma proteomic study with 8 SCX fractions, over 35000 highly reproducible peptides are typically found.
Returning to
After peptides have been selected for biomolecule or protein identification, they are submitted to mass and retention time fingerprinting at 106, such as described in co-owned application No. 60/691,414, described and incorporated by reference above, and/or to tandem mass spectrometry at 107 using LC-MS/MS followed by database searches using Mascot or another search engine known in the art or hereafter developed. Irrespective of the methodology used for biomolecule or protein identification, in the context of bottom-up proteomics as utilized in the embodiment, the resulting biomolecule or protein identification is an assignment of peptides in the peptide expression profile database to peptide sequences within a parent biomolecule or protein. A graphical representation of an exemplary association is depicted in
After protein identification is completed at 106 and/or 107, the results of such protein identification efforts are merged and sent to a correlation filter 108, as shown in
Referring to
To select the corr_threshold parameter in a study-independent manner, it is represented as a percentile value rather than an absolute correlation value. The reason for this choice in the embodiment is that peptide expression correlation coefficients depend upon, among other factors, the number of samples analyzed, the samples themselves, and the inherent variability of the underlying proteomic platform. To obtain a percentile value, the distribution of all pairwise correlation coefficients between pairs of peptides in the database is determined using, for example, the Pearson correlation (or some other correlation method known or hereafter developed in the art). This distribution can then be used to determine the percentile value of any raw correlation coefficient. Converting to a percentile thus standardizes the approach used in the embodiment to determine confidence. This tends to be advantageous as it enables comparisons among studies, which comparisons have heretofore not been seen in such studies.
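For illustration only, the conversion between raw correlation coefficients and percentile values might be sketched in Python as follows. The function names are hypothetical, the background list stands in for the distribution of all pairwise coefficients in the database, and the particular percentile convention (the fraction of background values less than or equal to the query) is an assumption rather than a requirement of the embodiment.

```python
import math
from bisect import bisect_right


def percentile_of(value, sorted_coeffs):
    """Percentage of background coefficients that are <= value."""
    return 100.0 * bisect_right(sorted_coeffs, value) / len(sorted_coeffs)


def coefficient_at(percentile, sorted_coeffs):
    """Smallest background coefficient whose percentile rank is >= percentile."""
    n = len(sorted_coeffs)
    k = max(1, math.ceil(percentile * n / 100))
    return sorted_coeffs[k - 1]


# Toy background distribution of pairwise coefficients (made-up values).
background = sorted([-0.4, -0.1, 0.0, 0.2, 0.3, 0.5, 0.6, 0.7, 0.8, 0.95])
rank = percentile_of(0.8, background)
raw_cutoff = coefficient_at(90, background)
```

With such a mapping, a corr_threshold expressed as, say, the 90th percentile corresponds to a different raw coefficient in each study, which is the property the percentile representation is intended to provide.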
Referring to
For example, the Pearson correlation for two sets of measurements X and Y is defined:

$$r_{xy} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{(n-1)\, s_x s_y}$$

where $x_i$ and $y_i$ are the values of X and Y, $\bar{x}$ and $\bar{y}$ are the means, $s_x$ and $s_y$ are the sample standard deviations, and $n$ is the number of paired measurements. The Pearson correlation tends towards 1 if there is an increasing linear relationship and tends towards (−1) if there is a decreasing linear relationship. A Pearson correlation near 0 indicates little or no linear relationship between X and Y. As such, the Pearson correlation is an indication of the degree of linear dependence between X and Y.
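By way of example only, the Pearson correlation, computed from the means and sample standard deviations of two expression profiles as described above, might be implemented in Python as follows. The function name and the error handling are illustrative assumptions.

```python
from statistics import mean, stdev


def pearson(x, y):
    """Pearson correlation of two equal-length sequences of measurements."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mx, my = mean(x), mean(y)
    sx, sy = stdev(x), stdev(y)  # sample standard deviations (n - 1 denominator)
    if sx == 0 or sy == 0:
        raise ValueError("correlation is undefined for a constant profile")
    cross = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return cross / ((n - 1) * sx * sy)


r_up = pearson([1, 2, 3, 4], [2, 4, 6, 8])     # increasing linear relationship
r_down = pearson([1, 2, 3, 4], [8, 6, 4, 2])   # decreasing linear relationship
r_none = pearson([1, 2, 3, 4], [1, -1, -1, 1]) # no linear relationship
```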
The Pearson correlation is a parametric statistic. If the measurements X and Y are not normally distributed, then non-parametric correlation metrics such as Spearman's ρ and Kendall's τ can be used. Even more general correlation measures that may be applied are the correlation ratio and mutual information. The mutual information of measurements X and Y is defined:

$$I(X;Y) = \sum_{y \in Y} \sum_{x \in X} p(x,y) \log \frac{p(x,y)}{p(x)\,p(y)}$$

where p(x,y) is the joint probability distribution of X and Y, and p(x) and p(y) are the marginal probabilities of X and Y. Mutual information measures how much is known about Y if X is known, or vice-versa.
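As a non-limiting sketch, mutual information might be estimated in Python from empirical frequencies as follows. The sketch assumes the measurements have already been discretized (for example, by binning intensities, which the description does not specify), uses a base-2 logarithm so that the result is in bits, and uses a hypothetical function name.

```python
from collections import Counter
from math import log2


def mutual_information(xs, ys):
    """Plug-in estimate of I(X;Y) in bits from paired discrete observations."""
    n = len(xs)
    if n == 0 or n != len(ys):
        raise ValueError("need two non-empty sequences of equal length")
    joint = Counter(zip(xs, ys))  # counts for p(x, y)
    px = Counter(xs)              # counts for p(x)
    py = Counter(ys)              # counts for p(y)
    total = 0.0
    for (x, y), c in joint.items():
        p_xy = c / n
        total += p_xy * log2(p_xy / ((px[x] / n) * (py[y] / n)))
    return total


mi_dependent = mutual_information([0, 0, 1, 1], [0, 0, 1, 1])
mi_independent = mutual_information([0, 0, 1, 1], [0, 1, 0, 1])
```

In the first call Y is fully determined by X, so the estimate equals one bit; in the second the two sequences are empirically independent and the estimate is zero.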
Although standard measures of correlation or dependence between measurements X and Y are utilized in the embodiments described, any measure of correlation or dependence that produces a coefficient quantifying the degree of correlation or dependence can be used in other embodiments.
Referring back to
Other approaches that may be used include graph-theoretic approaches such as finding the maximum clique in a graph (see Garey et al., supra), where each node in the graph is a peptide, and there is an edge between a pair of peptides if their percentile Pearson coefficient is above corr_threshold. Other methods of finding a maximal set of correlating peptides may be used in other embodiments. As described above and below, a wide variety of existing statistical methods may be employed in assessing the significance of correlations. Some such statistical methods may be based, for example, on varying assumptions related to interpretation of the fragment expression patterns, the propriety of the various assumptions and therefore of the use of the various statistical methods depending upon the nature and purpose of the fragment-precursor studies, and the techniques employed therein. Examples of suitable algorithms include the Pearson correlation, Spearman rank correlation, Kendall's rank correlation, Gamma association, Stuart's tau-c, and Somers' D correlations, as well as other widely accepted standard definitions employing least-squares curve fitting.
Thus, at 126, for each protein identified in the initial peptide-to-protein assignment, the largest subset of peptide assignments that have pairwise correlation above the correlation threshold is determined. If the subset size, i.e., the number of peptide assignments having pairwise correlation above the correlation threshold, is less than the coverage threshold value, then the biomolecule is removed from the list of identified proteins. Otherwise, the biomolecule and its corresponding peptides are kept. In the embodiment, the kept biomolecule and its corresponding peptides can be considered a relatively high-confidence association, while the removed biomolecule and its corresponding peptides can be considered a relatively low-confidence association. Of course, it will be appreciated that such associations vary with the correlation coefficient that is selected for the particular analysis.
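By way of non-limiting illustration, the determination at 126 might be sketched in Python as follows, using a basic Bron-Kerbosch search to find the largest subset of mutually correlating peptides. The function names, the dictionary layout keyed by peptide pairs, the treatment of missing pairs as uncorrelated, and the choice to return only the correlating subset for each kept protein are assumptions made for the sketch.

```python
def largest_correlated_subset(peptides, pct, corr_threshold):
    """Largest set of peptides whose pairwise percentiles all exceed corr_threshold."""
    # Build the graph: an edge joins two peptides that correlate above threshold.
    adj = {
        p: {q for q in peptides
            if q != p and pct.get(frozenset((p, q)), 0.0) > corr_threshold}
        for p in peptides
    }
    best = []

    def expand(clique, candidates, excluded):
        nonlocal best
        if not candidates and not excluded:
            if len(clique) > len(best):
                best = list(clique)  # a maximal clique larger than any seen so far
            return
        for v in list(candidates):
            expand(clique + [v], candidates & adj[v], excluded & adj[v])
            candidates = candidates - {v}
            excluded = excluded | {v}

    expand([], set(peptides), set())
    return best


def correlation_filter(assignments, pct, corr_threshold, cov_threshold):
    """Keep proteins whose largest correlating subset meets the coverage threshold."""
    kept = {}
    for protein, peptides in assignments.items():
        subset = largest_correlated_subset(peptides, pct, corr_threshold)
        if len(subset) >= cov_threshold:
            kept[protein] = sorted(subset)
    return kept


# Toy data: percentile correlation values for peptide pairs (made-up numbers).
pct = {frozenset(k): v for k, v in {
    ("a1", "a2"): 97.0, ("a1", "a3"): 96.0, ("a2", "a3"): 99.0,
    ("a1", "a4"): 40.0, ("a2", "a4"): 95.5, ("a3", "a4"): 10.0,
    ("b1", "b2"): 50.0,
}.items()}
assignments = {"ProtA": ["a1", "a2", "a3", "a4"], "ProtB": ["b1", "b2"]}
kept = correlation_filter(assignments, pct, corr_threshold=95.0, cov_threshold=3)
```

Finding a maximum clique is NP-hard in general, so an exhaustive search of this kind is practical only because the number of peptides assigned to any one protein is typically small.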
It will also be appreciated that correlation coefficients can be preset, or determined during an analysis as described above. Until a coefficient is selected as optimal at 131, the correlation coefficients used in the determinations may be considered test coefficients.
Referring back to
At 131, the false positive rate and the total number of proteins identified (at 127 for non-randomized determination by 126) are considered. Depending on the requirements of a particular application, a low false positive rate might be required due to the cost or risk of permitting a false positive protein identification. Other applications may be more tolerant of errors and will thus accept a higher false positive rate in exchange for more proteins identified. Based on the contextual goals of a particular analysis, optimal values for corr_threshold and cov_threshold can be selected at 131. In an embodiment, for example, higher corr_threshold and/or cov_threshold values may be selected to decrease the false positive rate, or lower values may be selected to increase the total number of proteins identified.
Referring back to
Displaying at 133 is typically done via a display unit at a computer terminal, but it will be appreciated that other outputs are possible. Visualization of the correlations among a set of peptides assigned to a protein or biomolecule is generally helpful for manual inspection. For example, in
Another example appears in
In another embodiment of the correlation filter, an acceptable pair of correlation threshold and coverage threshold values can be determined iteratively. For example, the correlation threshold can be initially set to the 90th percentile of the distribution, and the resulting FPR calculated therewith. The FPR and result set are examined to see if they are acceptable, and the correlation threshold and coverage threshold can be adjusted accordingly. For instance, in an embodiment, if one desires the FPR to be decreased, then corr_threshold and cov_threshold values can be adjusted upward; and if one desires that the total number of proteins identified be increased, then corr_threshold and cov_threshold can be adjusted downward. An example of such an iterative coefficient selection process is shown in
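For illustration only, one possible form of the iterative adjustment described above is sketched in Python below. The sketch covers only the case in which the FPR is to be decreased by raising corr_threshold. The evaluate callback (which would run the correlation filter and estimate the FPR), the step size, the acceptable FPR, and the upper limit are all hypothetical and are not specified by the embodiment.

```python
def select_thresholds(evaluate, corr_threshold=90.0, cov_threshold=2,
                      max_fpr=0.05, step=1.0, max_corr=99.0):
    """Raise corr_threshold until the FPR is acceptable or max_corr is reached.

    evaluate(corr_threshold, cov_threshold) -> (fpr, n_proteins) is a
    hypothetical callback supplied by the caller.
    Returns the history of (corr_threshold, cov_threshold, fpr, n_proteins).
    """
    if step <= 0:
        raise ValueError("step must be positive")
    history = []
    while True:
        fpr, n_proteins = evaluate(corr_threshold, cov_threshold)
        history.append((corr_threshold, cov_threshold, fpr, n_proteins))
        if fpr <= max_fpr or corr_threshold >= max_corr:
            return history
        corr_threshold = min(max_corr, corr_threshold + step)


def toy_evaluate(corr, cov):
    """Made-up stand-in: FPR and protein count both fall as the threshold rises."""
    return (100.0 - corr) / 100.0, int(200 - 10 * (corr - 90.0))


history = select_thresholds(toy_evaluate)
final_corr, final_cov, final_fpr, final_count = history[-1]
```

Returning the full history, rather than only the final pair, lets the FPR and the number of proteins identified at each tested threshold be examined together when choosing the values to use.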
In other embodiments, simplified filtering may also be applied so that if a biomolecule does not have enough matches for its size, then it may be eliminated from further consideration. Other filters may further include restricting polypeptides accepted by their size, raw number of hits, and/or other scoring criteria.
Returning to
The results displayed at 130 relating to correlation coefficients can be used for a variety of purposes, depending upon the goals of the analysis. For example:
As an example, the analysis of Brucella virulence is examined below. Brucella virulence is linked to components of the cell envelope and tightly connected to the function of the BvrR/BvrS sensory-regulatory system. In this example, a label-free mass spectrometry-based analysis of spontaneously released outer membrane fragments from four strains of Brucella abortus (wild-type virulent, avirulent bvrR− and bvrS− mutants, and reconstituted virulent bvrR+) was performed to quantify the impact of BvrR/BvrS on cell envelope proteins. In total, 167 differentially expressed proteins were identified, of which 25 were assigned to the outer membrane.
Six samples of each strain were analyzed using the embodiment depicted in
To increase confidence in the protein identification results and to decrease the possibility of wrongly assigned peptides, the correlation filter as described with reference to
In another example, 24 healthy and 24 prostate cancer plasma samples were analyzed using the process depicted in
The process shown in
While the foregoing invention has been described in some detail for purposes of clarity and understanding, it will be appreciated by those skilled in the relevant arts, once they have been made familiar with this disclosure, that various changes in form and detail can be made without departing from the true scope of the invention in the appended claims. The invention is therefore not to be limited to the exact components or details of methodology or construction set forth above. Except to the extent necessary or inherent in the processes themselves, no particular order to steps or stages of methods or processes described in this disclosure, including the Figures, is intended or implied. In many cases the order of process steps may be varied without changing the purpose, effect, or import of the methods described.
This application claims the benefit of U.S. provisional patent application Ser. No. 60/781,720, filed 14 Mar. 2006 and entitled “AUTOMATED IDENTIFICATION OF BIOMOLECULES THROUGH EXPRESSION PATTERNS IN MASS SPECTROMETRY”, the entire contents of which, including any appendices, are incorporated by reference. This application is related to (i) U.S. provisional patent application Ser. No. 60/691,414, filed 16 Jun. 2005 and entitled “VIRTUAL MASS SPECTROMETRY”, the entire contents of which, including any appendices, are incorporated herein by reference, and (ii) U.S. non-provisional patent application Ser. No. 10/293,076, filed 13 Nov. 2002 and entitled “Mass Intensity Profiling System and Uses Thereof”, the entire contents of which, including any appendices, are incorporated herein by reference.
The following are also incorporated by reference: Cohen, J., Cohen, P., West, S. G., and Aiken, L. S. (2003), Applied multiple regression/correlation analysis for the behavioral sciences (3rd ed.), Hillsdale, N.J.: Lawrence Erlbaum Associates; Jimmy K. Eng, Ashley L. McCormack and John R. Yates, III, An approach to correlate tandem mass spectral data of peptides with amino acid sequences in a protein database, JASMS, Volume 5, Issue 11, November 1994, Pages 976-989; Pappin, D. J., Hojrup, P., Bleasby, A. J., Rapid identification of proteins by peptide-mass fingerprinting, Curr Biol. 3 (6), 327-32, 1993; Adkins, J. N., Monroe, M. E., Auberry, K. J., Yufeng, S., et al., A proteomic study of the HUPO Plasma Proteome Project's pilot samples using an accurate mass and time tag strategy, Proteomics, 5, 3454-3466, 2005; Peng, Junmin, et al., Evaluation of multidimensional chromatography coupled with tandem mass spectrometry (LC/LC-MS/MS) for large-scale protein analysis: The yeast proteome, Journal of Proteome Research, 2, 43-50, 2003; Gygi, S. P., Rist, B., Gerber, S. A., Turecek, F., Gelb, M. H., and Aebersold, R., Quantitative analysis of complex protein mixtures using isotope-coded affinity tags, Nature Biotechnology, 17:994-999, 1999; J. Lamerz et al., Correlation-associated peptide networks of human cerebrospinal fluid, Proteomics, 5, 2789-2798, 2005; Laemmli, Nature 1970, 227:680-685; Washburn et al., Nat. Biotechnol. 2001, 19:242-7; Schagger et al., Anal. Biochem. 1991, 199:223-31; Godovac-Zimmermann et al. (2001) Mass Spectrom. Rev. 20: 1-57 (PMID: 10344271); Gygi et al., (2000) Proc. Natl. Acad. Sci. U.S.A. 97: 9390-9395 (PMID: 10920198) [hereinafter “Gygi et al. II”]; Reinders et al., 2004 Proteomics 4: 3686-703; Aebersold et al., 2003 Nature 422: 198-207; Garey, Michael R. and Johnson, David S., (1979) Computers and Intractability: A Guide to the Theory of NP-Completeness, W. H. Freeman; and Brucella abortus, Proteome Research, 2007; ASAP Article; DOI: 10.1021/pr060636a.
Number | Date | Country
---|---|---
60781720 | Mar 2006 | US