The present invention relates to the field of proteomics. More specifically, the invention relates to the identification of proteins in a protein mixture using peptides and protein databases.
A fundamental goal of proteomics is the systematic, simultaneous analysis of large numbers of proteins in biological samples. Automated, high-throughput analyses of complex protein mixtures are presently a matter of routine, made possible by the application of soft-ionization methods to mass spectrometry and by the sequencing of an ever-increasing number of genomes. These innovations permit the identification and characterization of proteins with greater sensitivity, shorter analysis times, more consistency in the analysis process, and the flexibility of multiple assays. Global analyses such as these will provide a comprehensive framework within which more traditional studies directed to individual proteins can be carried out.
In shotgun proteomics, protein samples are generally enzymatically digested into smaller peptide fragments to make them amenable to sequence analysis by mass spectrometry [1]. The resulting complex peptide sample is then separated in time, using liquid chromatography (LC), and coupled to a tandem mass spectrometer so that peptides can be detected and selected for fragmentation as they elute.
Tandem mass spectrometry uses two mass analyzers. The first mass analyzer selects a single peptide mass from the initial mass spectrum (MS) by filtering out all other masses. The selected peptide is then fragmented in a collision cell and the second mass analyzer acquires the resulting fragmentation spectrum (MS/MS). Peptides typically fragment along the polypeptide backbone rather than in the side chains. Consequently, the series of ions generated by fragmentation can be used to determine the amino acid sequence of the peptide. Protein database searches find all candidate peptides by matching the mass of the parent ion to peptides in in silico protein digests, then rank the candidates based on how well the theoretical and experimental fragmentation spectra match [2, 3]. Proteins containing the identified peptides are then considered to have been identified. There is growing evidence that the number of MS/MS mass spectra (queries) associated with a protein identification provides a measure of relative protein abundance [4, 5].
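The candidate-selection step of such a database search can be sketched as follows. This is a minimal illustration rather than the algorithm of any particular search engine. It assumes trypsin as the digestion enzyme (cleavage after K or R, except before P), no missed cleavages, neutral monoisotopic peptide masses, and a toy protein database; the function names and the tolerance value are hypothetical. Ranking of the candidates against the fragmentation spectrum is not shown.

```python
# Standard monoisotopic residue masses, in daltons.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056  # added once per peptide for the free termini


def tryptic_digest(sequence):
    """Split a protein sequence after K or R, except when followed by P."""
    sites = [0]
    for i, residue in enumerate(sequence[:-1]):
        if residue in "KR" and sequence[i + 1] != "P":
            sites.append(i + 1)
    sites.append(len(sequence))
    return [sequence[a:b] for a, b in zip(sites, sites[1:])]


def peptide_mass(peptide):
    """Neutral monoisotopic mass of an unmodified peptide."""
    return sum(RESIDUE_MASS[residue] for residue in peptide) + WATER


def candidates(parent_mass, protein_db, tolerance=0.01):
    """Return (protein_id, peptide) pairs whose mass matches the parent ion."""
    hits = []
    for protein_id, sequence in protein_db.items():
        for peptide in tryptic_digest(sequence):
            if abs(peptide_mass(peptide) - parent_mass) <= tolerance:
                hits.append((protein_id, peptide))
    return hits


# Hypothetical two-entry database; both entries contain the peptide AAK.
toy_db = {"protA": "AAKGGRPLLKMM", "protB": "MMKAAK"}
print(tryptic_digest(toy_db["protA"]))
print(candidates(288.18, toy_db))
```

In this toy database the same peptide is found in both entries, so a single matched spectrum would yield two protein hits.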
Unfortunately, identification of proteins in this way yields a redundant list of proteins due to redundancies in peptide identifications, redundant database entries, and gene products that have long stretches of conserved sequence identity. This redundancy must be eliminated in order to correctly interpret the biological significance of the results or to use peptide counts to estimate abundance. A common approach is to group the protein hits on the basis of sequence similarity (e.g. [6]); this is laborious, time-consuming and subjective, and is based on derived results (protein sequence) rather than primary data (peptide sequence). Another approach uses a probabilistic analysis to select the proteins with the highest likelihood of being present, based on knowledge of the probability that the individual peptide identifications are correct [7].
The present invention provides a simpler, set-based approach to the elimination of redundant protein identifications that yields the minimum number of proteins needed to explain the peptides observed.
In a broad embodiment of the invention, there is provided a method for identifying proteins in a mixture of proteins comprising: providing peptides derived from the mixture of proteins; obtaining mass spectra of the peptides to identify the peptides by comparing the mass spectra with spectra of a standardized database; matching the identified peptides with proteins in a database to generate a protein hits (PHs) list, each of the PHs having an associated peptide set; identifying PHs having an associated peptide set that is included in at least one other PH-associated peptide set; and removing the identified PHs from the list, wherein the remaining PHs provide an identification of the one or more proteins.
In another embodiment there is provided a method as described above further comprising grouping the identified PHs that share a same set of peptides in primary protein groups, wherein each of the primary protein groups identifies a non-redundant PH.
In another aspect the method can also comprise combining all primary protein groups that share at least one common characteristic among the non-redundant PHs to generate secondary protein groups and identifying a non-redundant PH for each of the secondary protein groups based on the characteristic.
In another embodiment there is provided a method for reducing redundancy in a protein hits list, comprising: associating a set of peptides with each protein of the protein hits to generate PH-associated peptide sets; comparing the PH-associated peptide sets; identifying PHs having an associated peptide set that is included in at least one other PH-associated peptide set; and removing the identified PHs from the list, wherein the remaining PHs provide an identification of the one or more proteins.
The invention also provides a device for identifying proteins in a mixture of proteins, the device comprising a data input means for inputting peptide analysis results, a peptide database, a protein database, a first analyzer to identify the peptides, a second analyzer to match the identified peptides with proteins in the protein database to create protein hits (PHs) and to create peptide sets associated with the PHs, a comparator for comparing PH-associated sets of peptides and for eliminating redundancy in the PHs, and a display to display identified PHs substantially free of redundancy.
In another embodiment, the invention also provides a computer readable medium with computer executable instructions for performing a method for identifying proteins comprising matching identified peptides obtained from a protein mixture with proteins in a database to generate protein hits (PH) each of said PHs having an associated peptide set; and eliminating PHs having a peptide set that is included in at least one other PH-associated peptide set thereby producing a set of PHs substantially free of redundancy.
Further features and advantages of the present invention will become apparent from the following detailed description, taken in combination with the appended drawings, in which:
Protein Identification
A. Data Representation
Protein identification algorithms operate in three stages. First, experimental fragmentation (MS/MS) mass spectra are matched to theoretical spectra from an in silico digestion of sequences in a protein database. Next, the matches are examined in some way to determine those which are valid. Finally, the proteins containing identified peptides are determined. Irrespective of the tools used, the results may be considered to consist of a set of protein hits (PHs), each comprising a protein identifier and the associated set of peptides used to identify it. For example, let us assume that the protein hits are stored as a structure array, PH, having the fields defined in
In practice, the protein hits resulting from the analysis of complex mixtures are found to be quite redundant. This is illustrated in
Moreover, there are cases where the peptides from one hit are a subset of those identifying another (e.g. hits 3 and 2 in
[PH(i).PEPTIDEID] ⊂ [PH(j).PEPTIDEID]
Such hits are redundant since postulating the existence of protein j can explain all of the peptides in both hits i and j. There is no evidence that protein i is present although its existence cannot be ruled out.
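The subset relation above maps directly onto set operations. The following is a minimal sketch, assuming each protein hit is held as a dictionary with hypothetical keys `protein_id` and `peptide_ids` (standing in for the PH structure array and its PEPTIDEID field); the identifiers and peptide numbers are invented for illustration.

```python
def is_redundant(hit_i, hit_j):
    """True if every peptide of hit_i also occurs in hit_j and hit_j has more.

    This is the proper-subset test PH(i).PEPTIDEID ⊂ PH(j).PEPTIDEID.
    """
    return hit_i["peptide_ids"] < hit_j["peptide_ids"]


# Hypothetical hits: hit_i is explained entirely by hit_j.
hit_j = {"protein_id": "protein_j", "peptide_ids": {101, 102, 103}}
hit_i = {"protein_id": "protein_i", "peptide_ids": {101, 103}}

print(is_redundant(hit_i, hit_j))  # True
print(is_redundant(hit_j, hit_i))  # False
```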
B. Redundant Peptide Identifications
The first source of redundant protein identifications is that a single mass spectrum may be matched to more than one peptide. Search algorithms, such as Mascot™ and Sequest™ [2, 3], identify peptides by matching fragmentation spectra to an in silico digest and evaluating the goodness of fit in some way. There are a number of amino acids whose masses cannot be distinguished by mass spectral data (e.g. isoleucine and leucine are structural isomers, while lysine and glutamine have the same nominal mass). Consequently, peptides whose sequences differ only by interchanges of such amino acids cannot be distinguished by their mass spectra and so will result in redundant peptide identifications. In addition, there may also be cases in which an experimental spectrum matches more than one theoretical spectrum well. Examination of a number of data sets from rat liver organelles revealed that approximately 5% of the mass spectra match two or more peptides.
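The mass ambiguity described above can be illustrated numerically. This sketch uses standard monoisotopic residue masses for a handful of residues only; the peptide strings, the function names, and the tolerance values are hypothetical and chosen purely for illustration.

```python
# Standard monoisotopic residue masses (daltons) for the residues used here.
RESIDUE_MASS = {
    "A": 71.03711,
    "I": 113.08406,  # isoleucine
    "L": 113.08406,  # leucine, a structural isomer of isoleucine
    "K": 128.09496,  # lysine
    "Q": 128.05858,  # glutamine, same nominal mass as lysine
}
WATER = 18.01056


def peptide_mass(peptide):
    """Neutral monoisotopic mass of an unmodified peptide."""
    return sum(RESIDUE_MASS[residue] for residue in peptide) + WATER


def indistinguishable(peptide_a, peptide_b, tolerance):
    """True if the two peptide masses differ by no more than the tolerance."""
    return abs(peptide_mass(peptide_a) - peptide_mass(peptide_b)) <= tolerance


# I/L interchange: identical masses at any tolerance.
print(indistinguishable("AIK", "ALK", tolerance=0.001))  # True
# K/Q interchange: about 0.036 Da apart, so the outcome depends on accuracy.
print(indistinguishable("AKA", "AQA", tolerance=0.5))    # True
print(indistinguishable("AKA", "AQA", tolerance=0.01))   # False
```

The leucine and isoleucine case cannot be resolved by mass at any accuracy, whereas the lysine and glutamine case depends on the mass accuracy of the measurement.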
C. Redundant Peptide to Protein Mapping
A second source of redundant protein identifications is that a particular peptide may occur in more than one protein sequence in the database. This can result from database inconsistencies including redundant entries in the database, partial sequences, and splice variants. It may also arise biologically from closely related gene products that have long stretches of conserved sequence. An in silico analysis of all the tryptic peptides in the NCBI nr database [8], with taxonomy restricted to rat, suggests that only about 15% of peptides occur in more than one protein sequence. However, tandem mass spectrometry only identifies peptides between 6 and 30 amino acids in length. These shorter peptides are much less specific and as
In the present invention there is provided a set-based algorithm that eliminates or reduces redundancy in protein identification. The method can be applied to an already established list of PHs or may include the preparation of peptides using enzymatic digestion and mass spectrometry to identify the peptides and the proteins using standardized databases. In one embodiment all PHs that have a peptide set that is included in that of any other PH are eliminated from the PH list. The remaining PHs provide an identification of the protein(s) in the mixture of proteins.
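A minimal sketch of this elimination step is given below, assuming the protein hits are supplied as a dictionary mapping a protein identifier to its set of peptide identifiers (a hypothetical representation). One detail is an assumption of this sketch rather than something stated above: when two hits have identical peptide sets, one representative is retained instead of both being discarded.

```python
def eliminate_redundant_hits(protein_hits):
    """Keep only hits whose peptide set is not contained in another hit's set.

    protein_hits: dict mapping protein id -> set of peptide ids.
    Hits with identical peptide sets are represented by the first one seen
    (an assumption of this sketch).
    """
    all_sets = [frozenset(peptides) for peptides in protein_hits.values()]
    kept = {}
    represented = set()
    for protein_id, peptides in protein_hits.items():
        peptide_set = frozenset(peptides)
        if peptide_set in represented:
            continue  # identical set already has a representative
        if any(peptide_set < other for other in all_sets):
            continue  # proper subset of another hit: redundant
        represented.add(peptide_set)
        kept[protein_id] = set(peptides)
    return kept


# Hypothetical hits: P2 is a subset of P1, and P3 duplicates P1.
hits = {"P1": {1, 2, 3}, "P2": {1, 2}, "P3": {1, 2, 3}, "P4": {4}}
print(eliminate_redundant_hits(hits))  # {'P1': {1, 2, 3}, 'P4': {4}}
```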
Protein hits, PHs, that share the same set of peptides can be grouped together to form a protein group PG. For a PG,
[PH(i).PEPTIDEID] ⊂ [PG.PEPTIDEID] ∀ i in PG
In the present description a group defined based on the above definition is referred to as a primary Protein Group or PG1.
The algorithm used to define the protein identification group is illustrated in
Groups can be defined iteratively by first sorting the protein hits by the number of peptides they each contain. Then all hits defined by sets of peptides contained within the initial set are found and merged into the first group. Hits assigned to a group are eliminated from the list of protein hits and the procedure repeated until all hits have been assigned.
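The iterative procedure can be sketched as follows. This is one possible reading of the steps described above, not a definitive implementation: hits are assumed to be a dictionary of protein identifier to peptide set, sorting is from the largest peptide set downwards, and ties are broken by input order. All names and data are hypothetical.

```python
def primary_protein_groups(protein_hits):
    """Assign every hit to a group seeded by the largest remaining hit.

    protein_hits: dict mapping protein id -> set of peptide ids.
    Returns a list of groups, each with the seed's peptide set and the ids
    of all member hits whose peptide sets are contained in it.
    """
    # Largest peptide sets first; sorted() is stable, so ties keep input order.
    remaining = sorted(protein_hits.items(),
                       key=lambda item: len(item[1]), reverse=True)
    groups = []
    while remaining:
        _, seed_peptides = remaining[0]
        members = [pid for pid, peps in remaining if peps <= seed_peptides]
        groups.append({"peptide_ids": set(seed_peptides), "members": members})
        # Drop the hits just assigned and repeat on what is left.
        remaining = [(pid, peps) for pid, peps in remaining
                     if not peps <= seed_peptides]
    return groups


hits = {"A": {1, 2, 3}, "B": {1, 2}, "C": {3, 4}, "D": {1, 2, 3}, "E": {4}}
for group in primary_protein_groups(hits):
    print(group["members"], sorted(group["peptide_ids"]))
```

With these hypothetical hits, A, D and B fall into one group and C and E into another. Note that peptide 3 is shared between the two groups.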
Redundancy can be further reduced by performing an adjacency analysis of the primary protein groups. This analysis joins primary protein groups that share at least one peptide among themselves into secondary protein groups. That is, primary protein groups for which the non-redundant PHs share at least one peptide are placed in a secondary protein group. Then the connectivity of each primary protein group within a secondary protein group is established. By connectivity it is meant the number of primary protein groups with which a given primary protein group shares at least one peptide. Referring back to
The redundant PHs of a secondary protein group can be determined based on the connectivity. Thus for example, the primary protein group having the highest connectivity can be identified as the non-redundant PH of a secondary protein group. All other primary protein group associated non-redundant PHs would be eliminated from the list of PHs.
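The adjacency analysis can be sketched as a small graph computation, under the following assumptions: primary protein groups are given as a dictionary of group name to peptide set, secondary groups are taken to be the connected components of the peptide-sharing graph, and ties in connectivity are broken by the order in which groups are visited. These choices, and all names used, are illustrative only.

```python
def secondary_protein_groups(primary_groups):
    """Join primary groups that share peptides and pick a representative.

    primary_groups: dict mapping group name -> set of peptide ids.
    Returns one record per secondary group with its members, the
    connectivity of each member, and the most connected member.
    """
    names = list(primary_groups)
    # Two groups are adjacent if they share at least one peptide.
    neighbours = {
        name: {other for other in names
               if other != name and primary_groups[name] & primary_groups[other]}
        for name in names
    }
    unvisited = set(names)
    result = []
    for name in names:
        if name not in unvisited:
            continue
        # Depth-first walk to collect one connected component.
        unvisited.discard(name)
        stack, component = [name], []
        while stack:
            current = stack.pop()
            component.append(current)
            for neighbour in neighbours[current]:
                if neighbour in unvisited:
                    unvisited.discard(neighbour)
                    stack.append(neighbour)
        connectivity = {member: len(neighbours[member]) for member in component}
        representative = max(component, key=lambda member: connectivity[member])
        result.append({"members": sorted(component),
                       "connectivity": connectivity,
                       "representative": representative})
    return result


groups = {"G1": {1, 2}, "G2": {2, 3}, "G3": {3, 4}, "G4": {9}}
for record in secondary_protein_groups(groups):
    print(record["members"], record["representative"])
```

Here G1, G2 and G3 form one secondary group whose most connected member is G2, while G4 shares no peptide with any other group and stands alone.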
It will be appreciated that proteins that are identified as being redundant using the adjacency analysis are proteins for which the sequences are potentially highly related. Examples include the same protein obtained from different species, proteins exhibiting allelic variations, proteins entered in a database with sequencing errors, and the like.
It will also be appreciated that criteria other than or in addition to peptide sharing among primary protein groups could also be applied in the adjacency analysis. For example, secondary grouping could be based on protein function, protein length and other such protein characteristics.
Query Counting
There is growing evidence that the number of MS/MS mass spectra (queries) associated with a protein identification is related in some way to the protein abundance [4, 5]. Consequently, the mass spectra information underlying the identification of each group is summarized by counting the associated peptides. Three peptide counts can be determined for each group. Thus,
Thus the relative abundance of a non-redundant PH can be determined by providing a count of all the queries (peptides) associated with the corresponding primary or secondary protein group.
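A total query count for a group can be sketched as below. The sketch computes only the overall number of queries assigned to peptides of a group; it does not attempt to reproduce the three separate counts mentioned above. The representation of query assignments as (query id, peptide id) pairs and all identifiers are hypothetical.

```python
def query_count(group_peptides, query_assignments):
    """Count the queries (MS/MS spectra) assigned to peptides of one group.

    group_peptides: set of peptide ids belonging to the protein group.
    query_assignments: iterable of (query_id, peptide_id) pairs.
    """
    return sum(1 for _, peptide_id in query_assignments
               if peptide_id in group_peptides)


# Hypothetical assignments: two spectra matched pepA, one each pepB and pepC.
assignments = [("q1", "pepA"), ("q2", "pepA"), ("q3", "pepB"), ("q4", "pepC")]
print(query_count({"pepA", "pepB"}, assignments))  # 3
```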
The method of the invention can be implemented in part using computer-based systems and methods as would be known to one skilled in the art.
We evaluated the algorithm by analyzing a representative data set from an organellar proteomics experiment using methods similar to those described in [4]. The raw data comprised 13,587 tandem mass spectra acquired from 93 bands from a 1D gel of a sample of rat rough microsomes. Mass spectra were first subjected to peak detection using a commercial product (Mascot Distiller from Matrix Science) and the resulting peak lists were searched against the NCBI nr database [8], with taxonomy limited to rat, using a probability-based search engine (Mascot from Matrix Science). A total of 5,685 mass spectra were assigned to peptides with a probability of a random hit of less than 5%. There were 3,498 distinct peptide identifications. The search results were loaded into CellMapBase, our relational database for proteomics analysis [9], and analyzed using the method of the invention.
Table II provides the quantitative support for this information. Grouping decreased the number of proteins identified by more than 40% and increased the number of proteins identified by unique peptides from 512 to 600. Taken together, the percentage of identifications using only unique peptides increased from 35.2% to 80.1%.
This grouping algorithm provides an objective, automated means to eliminate redundancy in protein identifications in high throughput proteomic experiments. However, as
The Association of Biomolecular Resource Facilities (ABRF) recently circulated two samples containing 8 proteins in different amounts to assist laboratories in evaluating their ability to identify and quantify unknown proteins. This example describes the analysis of these samples using the proteomics pipeline.
Analysis Methods
The two ABRF samples were resolved on separate 1D SDS-PAGE gel lanes and subjected to standard band slicing, in-gel trypsinization and LC-coupled mass spectrometry. Peak lists were generated using Mascot Distiller with optimized parameter values. Peptides were identified using Mascot to search the NCBI nr database with taxonomy limited to mammals. Peptides identified in the two samples were used to identify the proteins present and to group them, according to the method described above, into distinct sets that define the minimal set of proteins necessary to explain the observed peptides.
Table 2 shows the 59 protein groups defined by distinct sets of peptides initially identified.
Adjacency Analysis (Secondary Grouping)
Sets of closely related protein groups were determined by adjacency analysis to generate secondary protein groups.
Related Proteins
Each “island” in
This confirms that proteins in each island are highly related, probably as a result of sequence redundancy among species.
Final Results
Groups in each island were collapsed together and the grouping was repeated. Seven of the eight most abundant proteins corresponded to those in the ABRF samples. One ABRF protein, horseradish peroxidase, was not identified since the search taxonomy was limited to mammals (Table 3).
Relative Abundance
Relative abundance of six of the eight ABRF proteins was estimated from the ratio of spectral counts. Estimates were not possible for horseradish peroxidase, since it was not identified, or for beta casein, which was only identified in Sample I, where it was in the highest abundance.
These estimates corresponded well to the relative abundances provided by ABRF.
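The ratio estimate can be sketched in a few lines. The spectral counts used below are invented for illustration and are not taken from the ABRF analysis; the function name is hypothetical. Returning no estimate when a protein has no spectra in either sample mirrors the two cases described above where an estimate was not possible.

```python
def relative_abundance(count_sample_1, count_sample_2):
    """Ratio of spectral counts between two samples, or None if undefined.

    No estimate is returned when the protein was not observed in one of the
    samples, since a ratio cannot then be formed.
    """
    if count_sample_1 == 0 or count_sample_2 == 0:
        return None
    return count_sample_1 / count_sample_2


# Hypothetical spectral counts for two proteins.
print(relative_abundance(40, 10))  # 4.0
print(relative_abundance(12, 0))   # None: seen in one sample only
```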
Conclusions
Seven of the eight proteins in the ABRF samples were identified conclusively. Estimates of their relative abundances in the two samples based on spectral counts agreed well with expected values. Protein identification by database search is complex if taxonomy is unrestricted.
All references cited herein are incorporated by reference.
While the invention has been described in connection with specific embodiments thereof, it will be understood that it is capable of further modifications and this application is intended to cover any variations, uses, or adaptations of the invention following, in general, the principles of the invention and including such departures from the present disclosures as come within known or customary practice within the art to which the invention pertains and as may be applied to the essential features herein before set forth, and as follows in the scope of the appended claims.
Species column recovered from a table whose remaining columns were not preserved, in original row order: Bos taurus; Bos taurus; Bos taurus; Homo sapiens; Ovis aries; Elephas maximus; Bos taurus; Bos taurus; Bos taurus; Homo sapiens; Homo sapiens; Rattus norvegicus; Bos taurus; Bos indicus; Homo sapiens; Bos taurus; Canis familiaris; Bos taurus; Rattus norvegicus; O. cuniculus; O. cuniculus; Bos taurus; Canis familiaris; Ovis aries; Mus musculus; Mus musculus; Felis catus; Capra hircus; Bison bison; Gazella thomsonii; Canis familiaris; M. auratus; Mus musculus; Homo sapiens; B. tragocamelus; Tragelaphus oryx; Bos taurus; Bos taurus; Rattus norvegicus; Bos taurus; Rattus norvegicus; Mus musculus
Numeric column recovered from a table whose remaining columns were not preserved, in original row order: 21478; 69613; 258; 3320409; 87406; 41849; 196
This application claims priority from U.S. provisional application No. 60/713,373 filed Sep. 2, 2005 and entitled METHOD FOR IDENTIFYING PROTEIN.
Number | Date | Country | |
---|---|---|---|
60713373 | Sep 2005 | US |