The comprehensive characterization of proteins at trace levels in a biological sample is a significant challenge. Two strategies are currently widely available for protein analysis by mass spectrometry. The first, the bottom-up or shotgun approach,1,2 begins with the digestion of a protein (or proteome) with an enzyme, such as trypsin, followed by separation of the resulting peptides and analysis by mass spectrometry. A typical experimental design uses electrospray ionization with an ion trap mass spectrometer.1,3 However, many peptides are not detected either because they are too small (less than 500 Da) or too hydrophilic (multiple phosphorylation or glycosylation sites) to be well-retained on reversed phase LC columns. Conversely, large peptides are not recognized by most types of mass spectrometers because fragments over 5,000 Da require a resolution of 25,000 or higher to resolve the charge states and determine monoisotopic masses. Ion suppression can be a problem, and data-dependent analysis in on-line LC-MS often fails to detect co-eluting peptides.3 Because of these issues, the traditional bottom-up approach does not generally provide comprehensive characterization, due to limited sequence coverage and failure to identify posttranslational modifications.4,5
The second strategy, the top-down approach, involves the analysis of intact proteins introduced by direct infusion into a Fourier transform mass spectrometer (FTMS).6,7 Using electrospray ionization, a number of multiply-charged ions of each protein are observed.8 Because these charge states are of a single protein, their m/z peaks can be mathematically deconvoluted to obtain the molecular weight. More recently, with high resolution, high mass-accuracy instruments such as FTMS, the charge can be directly determined from the ion envelope detected at high mass resolution, rather than by deconvolution.9 This approach is particularly useful when the species of interest are present in several isoforms, making determination of discrete charge states at low mass resolution highly problematic.9 Moreover, the high mass accuracy in FTMS matches fewer candidate peptide sequences and provides higher probabilities of correct assignments. The top-down approach has been used to obtain high sequence coverage of proteins as well as to determine alternative splicing or posttranslational modifications.10-14
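The two ways of assigning charge described above can be sketched in a few lines. This is a minimal illustration, not the procedure used by any particular instrument software: the function names are hypothetical, the ions are assumed to be protonated, and the constants are the standard proton mass and the approximate 13C/12C mass difference.

```python
PROTON = 1.007276       # proton mass in Da
ISOTOPE_GAP = 1.003355  # approximate 13C - 12C mass difference in Da


def charge_from_isotope_spacing(mz_spacing):
    """Charge read directly from the m/z spacing of adjacent isotopic peaks.

    Requires a resolved isotopic envelope (high mass resolution).
    """
    return round(ISOTOPE_GAP / mz_spacing)


def charge_from_adjacent_peaks(mz_lower_charge, mz_higher_charge):
    """Charge of the lower-charge peak, deconvoluted from two adjacent
    charge states (z and z + 1) of the same protein."""
    return round((mz_higher_charge - PROTON) / (mz_lower_charge - mz_higher_charge))


def neutral_mass(mz, charge):
    """Neutral molecular mass of a protonated ion."""
    return charge * (mz - PROTON)


if __name__ == "__main__":
    # Hypothetical 22 kDa protein observed at 15+ and 16+.
    mass = 22000.0
    mz15 = mass / 15 + PROTON
    mz16 = mass / 16 + PROTON
    z = charge_from_adjacent_peaks(mz15, mz16)
    print(z, neutral_mass(mz15, z))
    # Isotopic peaks about 0.059 m/z apart indicate a 17+ ion.
    print(charge_from_isotope_spacing(0.059))
```

The deconvolution route needs at least two charge states known to belong to the same species, which is why it becomes unreliable when several isoforms overlap; the isotope-spacing route needs only one resolved envelope.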
However, the direct top-down approach (i.e., no enzymatic digestion) is generally applicable only for small to moderate size proteins (<50 kDa) with relatively homogeneous modifications.10 It is also generally difficult in the infusion of an intact protein to resolve the isotopic patterns of the multiple isoforms, each with a wide charge distribution, even when using the very high resolution of FTMS. Also, the top-down approach can require large amounts of material (often >10 pmole per protein),13,15,16 and the time required to isolate, fragment, detect and then remove the ions from the FTICR cell (>10 seconds per scan) is not compatible with on-line LC-MS. Moreover, the LC separation of intact protein mixtures, particularly with heterogeneous modifications, generally is much poorer than that of peptide mixtures.
Recently, a new on-line, top-down LC-MS strategy has been developed. This method uses a hybrid linear ion trap-FTMS instrument (LTQ-FTMS), which has been applied to the characterization of the disulfide linkages of intact human growth hormone, hGH (22 kDa), in a complex matrix of E. coli inclusion bodies, with detection at the 200 fmole level.17 In this method, the high mass resolution (typically 100,000) of the FTICR cell allowed the high charge states of intact human growth hormone precursor ions (15+ to 19+) and their large peptide fragments (10+, 11+) to be on-line determined with high mass accuracy (below 2 ppm).17 In this work, the LTQ-FTMS instrument was able to isolate in an on-line LC-MS experiment low abundance, highly charged (17+) ions of the intact protein in the FTICR cell, and to fragment these ions in the external linear ion trap cell using collisionally-induced dissociation (CID). Importantly, the fragment ions were transferred to and detected in the FTICR cell. The design of this hybrid instrument system permits such experiments to be performed on a chromatographic timescale.18,19
In the above LC-MS top-down strategy,17 the MS and MS2 spectra had to be measured in the FTICR cell in order to determine the high charge states associated with the high molecular weights of both the precursor and fragment ions. This is a drawback in an on-line LC-MS analysis, however, because the measurement in the FTICR cell requires 30- to 40-fold more time than measurement in the linear ion trap to achieve the same detection sensitivity. Furthermore, hGH, although relatively small, is about at the upper limit of proteins that can be analyzed accurately by this method. As noted above, the difficulty of resolving isotopic patterns of heterogeneous modifications increases with the size of the protein, becoming highly problematic above about 50 kDa. Thus, improved methods of characterizing proteins at trace levels continue to be desirable.
The invention is directed to a new and sensitive LC-MS platform, Extended Range Proteomic Analysis, which is able to achieve very high sequence coverage and comprehensive characterization of posttranslational modifications in complex proteins at the trace level (e.g., low pmole to fmole). The platform according to the invention provides advantages of both the top-down and bottom-up proteomic approaches by combining, in a preferred embodiment of the method, (i) digestion of the protein with an enzyme, such as Lys-C, that cuts less frequently than trypsin, or limited digestion with, e.g., trypsin, leading to, on average, a higher molecular weight peptide size with greater than 90% of the protein's peptide backbone sequence contained in fragments that are between 500 and 25,000 Da; (ii) high-performance LC separation of these resulting fragments; (iii) a new data acquisition strategy using on-line coupling of specific separated fragments to analysis in, e.g., the LTQ-FTMS, a hybrid mass spectrometer that couples a linear ion trap with a Fourier transform ion cyclotron resonance (FTICR) cell, for peptide analysis, preferably of the fragments in the range of 3000 to 10,000 Da; and (iv) new data analysis methods for assigning large peptide structures and determining the site of attachment of posttranslational modifications as well as structural features from the accurate precursor mass together with MS2 and MS3 fragmentations.
The LC retention of the (e.g., Lys-C) fragments is increased, relative to a tryptic digest, due to the generally greater hydrophobicity of the larger peptides, a result that is particularly important for peptides containing hydrophilic modifications such as glycosylation and phosphorylation. Furthermore, additional positively charged arginine and lysine residues, which might be included in these larger fragments, could enhance the sensitivity of the posttranslationally modified peptides by at least 10-fold relative to tryptic fragments.
In a typical operation following production of these larger peptide fragments, the FTICR cell provides a survey scan with the high mass resolution (>100,000-200,000) and accurate mass (<2 ppm) needed to characterize the higher charge-state precursor ions of the larger peptides. In parallel, the linear ion trap provides MS2 and MS3 fragmentation spectra, with a scan speed sufficiently fast for on-line LC-MS. Together, these data provide multiple means to determine or enhance the confidence of assignment of large or complicated peptides.
Using the method of the invention, we have demonstrated >95% sequence coverage in the analysis of the heavily phosphorylated protein, bovine beta-casein (at the 5-50 fmole level) and the heavily phosphorylated and glycosylated protein epidermal growth factor receptor (EGFR) (at the 20-200 fmole level). The detectability range is also a function, e.g., of the size of the column used for fragment separation. This combination of digestion strategy, high-performance separation and use of the hybrid LTQ-FTMS instrument according to the method of the invention enables comprehensive characterization of large proteins, including their posttranslational modifications.
Thus, in general, the invention is directed to a method of protein characterization comprising providing an aliquot of a sample that includes a protein or a mixture of proteins whose identity is to be determined; carrying out digestion of the protein or mixture of proteins in the aliquot so that the digestion product comprises at least one fragment having a peptide backbone sequence of greater than or equal to 3000 (preferably, greater than or equal to 4000) Da in mass; separating the digestion product; and analyzing the structure of one or more of the fragments by mass spectrometry using a mass spectrometer system that includes a mass spectrometer having a mass resolution of at least 25,000 and a mass spectrometer having an electron multiplier detector. Preferably, the separating and analyzing steps are coupled on-line and the mass spectrometer system includes a mass spectrometer having a mass resolution of at least 50,000. The mass spectrometer system may include separate mass spectrometer instruments each having one of the indicated properties. Alternatively, the mass spectrometer system may include a single mass spectrometer instrument, such as a hybrid instrument.
The digestion step may be accomplished by proteolytic enzyme digestion, such as by one of the enzymes Lys-C, Arg-C and Asp-N, which tend to produce larger fragments, or by limited digestion with a more frequent cutter, such as trypsin or Glu-C, to also produce larger fragments. Alternatively, digestion may be by a chemical reaction, with such compounds as dilute acid, cyanogen bromide and hydroxylamine. In a related aspect, in the digestion product, greater than 90% of the peptide backbone sequence of the protein or proteins is contained in fragments that are between 500 and 25,000 Da in mass, or more preferably between 1000 and 10,000 Da in mass. In particular embodiments, the digested fragments are separated using liquid chromatography or capillary electrophoresis, e.g., on a microfluidic chip. Alternatively, capillary electrochromatography is the method of choice.
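As an illustration of the mass-window criterion, the following sketch estimates what fraction of a protein's backbone falls in fragments between 500 and 25,000 Da. It assumes that "fraction of the backbone" is counted in residues, that masses are unmodified monoisotopic masses, and that the fragment list is already known; the function names and the toy fragment list are hypothetical.

```python
# Monoisotopic residue masses (Da) of the 20 standard amino acids.
MONO = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.010565  # added once per peptide for the free termini


def peptide_mass(peptide):
    """Monoisotopic neutral mass of an unmodified peptide."""
    return sum(MONO[aa] for aa in peptide) + WATER


def backbone_fraction_in_window(peptides, low=500.0, high=25000.0):
    """Fraction of residues found in fragments whose mass lies in [low, high]."""
    total = sum(len(p) for p in peptides)
    inside = sum(len(p) for p in peptides if low <= peptide_mass(p) <= high)
    return inside / total


if __name__ == "__main__":
    # Hypothetical fragment list, for illustration only.
    fragments = ["MK", "TAYIAK", "QRQISFVK", "SHFSRQLEERLGLIEVQ"]
    print(round(backbone_fraction_in_window(fragments), 3))
```

In this toy case only the dipeptide falls below 500 Da, so 31 of 33 residues lie inside the window.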
Preferably, the mass spectrometer having an electron multiplier detector is an ion trap or quadrupole mass spectrometer and the mass spectrometer with a mass resolution of at least 25,000 is a Fourier transform mass spectrometer, a time-of-flight mass spectrometer or an Orbitrap. More preferably, the mass spectrometer system is a hybrid mass spectrometer that couples an ion trap with a Fourier transform ion cyclotron resonance cell. Alternatively, the mass spectrometer system is a hybrid mass spectrometer that couples a quadrupole mass spectrometer with a time-of-flight mass spectrometer. Most preferably, the detectors of the hybrid mass spectrometer are operated in parallel.
In another aspect, the method of the invention is carried out to determine the identity of different posttranslationally modified isoforms of a protein. For this use of the method, the sample includes a protein or mixture of proteins whose posttranslationally modified isoforms are to be determined. Following the digestion and separation steps, the position of fragments containing the common backbone of the posttranslationally modified isoforms is determined and the structure of the fragments containing the common backbone of the posttranslationally modified isoforms is analyzed according to the method of the invention. Preferably, the level of the individualized posttranslationally modified isoforms is quantitatively determined. In one embodiment, the posttranslational modification includes glycosylation; in another, the posttranslational modification includes phosphorylation. Other possible posttranslational modifications include sulfation, acetylation or methylation. In a further embodiment, the glycosylated posttranslational modification is further modified with sulfation and/or phosphorylation.
The examples described herein are provided to illustrate advantages of the invention, including those that have not been previously described, and to further assist a person of ordinary skill in the art with using the methods of the invention. The examples can include or incorporate any of the variations or inventive embodiments as described herein. The embodiments that are described herein also can each include or incorporate the variations of any or all other embodiments of the invention. The following examples are not intended in any way to otherwise limit or otherwise narrow the scope of the disclosure as provided herein.
Other features and advantages of the invention will be apparent from the following description of the preferred embodiments thereof and from the list of embodiments, taken in conjunction with the accompanying drawings, in which:
To overcome the limitations on protein size and heterogeneity and the long detection times required to achieve FTICR sensitivity in the top-down approach, we have developed a new intermediate and sensitive strategy called Extended Range Proteomic Analysis, a method that combines key features of the top-down and bottom-up approaches along with more productive use of the LTQ-FTMS instrument.
This new platform, for the first time, allows for the characterization of the complete structure of a protein present in a complex biological mixture. In the past, such analyses were only possible, in a limited sense, for a protein that had been extensively purified and was available in substantial amounts. Even in that situation, such an analysis was problematic in that one would not know if a particular set of modifications were indeed present in a given species. For example, if one characterizes a specific phosphorylation in a peptide and then in a separate analysis characterizes a sulfation in a carbohydrate side chain in another region of the molecule (as found in another peptide fragment), there is no guarantee that both modifications are present in the same protein species. In fact, the situation may well be that the two modifications reside in separate protein molecules as a consequence of the biological heterogeneity of protein biosynthesis.
This lack of ability to characterize the structure of a protein completely, and in trace amounts, has important consequences in the biotechnology industry. Currently, the FDA regulates a protein pharmaceutical by both final product testing and characterization of the production process. Although protein products can be termed as being adequately characterized, the current state of technology does not allow for the production of “a well characterized biological.” Such a situation has clear ramifications when a pharmaceutical is produced in different countries with materials from different suppliers or when one wishes to produce a generic biotechnology drug. In a similar manner, the lack of complete characterization of proteins has major implications in the pharmaceutical industry both in terms of characterizing drug targets and in development of small molecule pharmaceuticals based on inhibition of specific protein interactions. One can also see that the monitoring of patients for clinical trials is hindered by this lack of knowledge as well.
The problem is that current technology has been unable to characterize large peptide fragments, particularly in proteins available in small amounts, such as those present in complex biological samples. The major thrust of this invention, then, is a technology that allows the characterization of overlapping peptide fragments and includes a process by which a nested set of fragments can be put together to describe the complete protein sequence with detailed knowledge of the occurrence of post-translational modifications.
In the method according to the invention, a protein to be analyzed is cleaved, preferably, using an enzyme that cuts less frequently than trypsin (or by incomplete digestion, e.g., with trypsin) to obtain a smaller number of peptides, many of which are of higher molecular weight than the tryptic fragments produced in the traditional bottom-up approach. Thus, the problems of the heterogeneity of intact proteins as well as the complexity of tryptic peptides are both decreased. Another important point is that trypsin will generate peptides that carry the positively-charged K or R residues only at their C-terminus, while peptides generated by, e.g., Lys-C digestion frequently include one or more additional R residues in their internal sequence. As will be shown, when glycosylated or phosphorylated peptides carry this additional positive charge, their MS signal intensity is enhanced 10-fold or higher and additional b ions are found in the fragmentation spectrum. In the method of the invention, the FTICR cell is used to resolve the higher charge states and obtain accurate mass measurements of the larger peptides (2- to 3-fold larger than typical tryptic peptides), while concurrently taking advantage of the speed and sensitivity of the linear ion trap for CID measurements on these peptides. In addition, because current software (e.g., Sequest, Mascot) does not address significance thresholds for peptides above a 3+ charge state, and because only limited software exists for automatically recognizing glycopeptides or assigning glycans, the method of the invention includes new strategies for these tasks.
The method of the invention will be very useful in the pharmaceutical and biotechnology industries, e.g., for applications such as the following: characterization of a protein product in a fermenter; tracking of even trace amounts of impurities during the purification process; final product testing and characterization for an FDA submission; improved QC testing using peptide maps (fewer fragments give a less complicated map with better coverage of the N and C-terminus of the protein).
These applications will be especially valuable for antibody drugs. For example, a fermentation sampler can be coupled to an antibody capture column that will extract the product away from the other fermentation proteins. The captured protein can then be released, and digested and analyzed according to the method of the invention.
Earlier in the drug development process, the method will find use for, e.g., drug target characterization, especially in the definition of the full sequence of a target including combinations of post translational modifications (PTMs). Specifically, cell line extracts can be analyzed to look at the effect of drugs on specific protein targets. Another use can be for signalling pathway definition, e.g., by providing for a detailed analysis of the level of phosphorylation of a receptor on the inner surface of a plasma membrane and of glycosylation motifs on the outer surface.
In general, in the method of the invention, the posttranslationally modified fragments produced are sufficiently large, and, thus, have a sufficiently large signal, that all isoforms of a modification at a specific site will elute at the same position. Thus, a quantitative determination can be made of the individualized isoforms. Exemplary modifications include glycosylation, phosphorylation, sulfation, acetylation, methylation and other forms suitable for a specific purpose, such as a general membrane anchor. The modified groups attached to the peptide backbone of the protein have been observed to themselves be modified.
The method will also allow for the characterization of large peptide fragments without any enzyme digestion. This is an invaluable resource for the study of the peptidome or fragmentome. This application would involve a MW separation step (e.g., gel permeation chromatography or a membrane filter step) and characterization of the low MW fraction.
The use of the method of the invention for biomarker discovery in complex fluids such as plasma, where low level proteins are often identified with a single peptide, will be very helpful. With the large fragment analysis approach of the invention, a sample can be split into several (typically three) aliquots and digested with enzymes that produce large fragments. The separate aliquots can be analyzed individually according to the method of the invention and the peptide identifications pooled. In this way also, the forms, e.g., glycosylated forms, of the biomarker in a specific patient can be monitored, for individualized treatment with the appropriate therapy.
The method of the invention would also find application in the discovery of interacting proteins. This application would involve an affinity system (e.g., magnetic particles or binding agent such as an antibody on an affinity column) to isolate the target protein, e.g., EGFR, complexed with a ligand. The recovered complex can then be crosslinked with a crosslinking agent, such as glutaraldehyde. The crosslinked sample is then digested, and the fragments are characterized using the method of the invention, which will give the region of the structure of the target protein and the corresponding region of the interacting protein.
Using the method of the invention, improved protocols can be developed for MS identification of protein mixtures and proteomic samples. For example, this approach can be used to reduce the rate of false positives and can be combined with genomic databases to allow for unique identification of a protein member of a family. The method of the invention will also support the development of improved methods of determining missing regions in a protein as a result of proteolysis and/or alternative splicing.
In summary, by providing the ability to characterize large overlapping fragments, the method of the invention allows for the first time the determination of the full structure of a protein with very high sequence coverage and comprehensive characterization of posttranslational modifications, even at the trace level. High sequence coverage is particularly important for determining the tissue or sub-cellular compartment of origin of a specific posttranslationally modified variant.
The following examples are presented to illustrate the advantages of the present invention and to assist one of ordinary skill in making and using the same. These examples are not intended in any way otherwise to limit the scope of the disclosure.
To demonstrate the power of this method, the examination of two proteins with extensive posttranslational modifications is described herein: phosphorylated beta-casein (23 kDa), and the glycosylated and phosphorylated epidermal growth factor receptor (EGFR) (180 kDa). The advantages of the method according to the invention are shown in a comprehensive characterization of these proteins, including the identity and attachment sites for all significant phosphorylated and glycosylated peptides, with high sequence coverage (>95%) and a high sensitivity for beta-casein and EGFR. The approach will be directly applicable to the comprehensive analysis of protein biomarkers or protein complexes that have been isolated from biological matrices, for example, by immunoprecipitation.
As discussed above, the two most common mass spectrometric approaches for the characterization of proteins are direct analysis of intact proteins (top-down), or analysis of a mixture of peptides resulting from a tryptic digest (bottom-up). In the top-down approach, if the intact protein is larger than 50 kDa or has heterogeneous modifications, comprehensive analysis is highly challenging if not impossible. The bottom-up approach has low detection sensitivity for glycosylated or phosphorylated peptides in a mixture of non-modified tryptic peptides.21,22 The method of the invention, Extended Range Proteomic Analysis, described herein is an alternative and sensitive approach. Given below are, first, important general considerations in this approach and, then, analysis of several complex proteins as examples of using this method.
Trypsin is the most commonly used proteolytic enzyme today, in part because it is “well behaved”23 and in part because the majority of fragments tend to be below 3 kDa, which is compatible with present day ESI-MS instruments. However, with the new generation of instruments, such as the LTQ-FTMS, larger fragments with higher charge states can be routinely analyzed. This opens up the possibility of directly using enzymes other than trypsin that cut proteins less frequently.
We examined the theoretical digest of the human SwissProt database using 3 different enzymes (Glu-C, trypsin and Lys-C) and one chemical reagent (cyanogen bromide). The majority of the resulting digest fragments were below 3 kDa for Glu-C and trypsin, 3 to 10 kDa for Lys-C, and 5 to 20 kDa for cyanogen bromide. In terms of the number of peptides, Glu-C and trypsin generated 2- to 3-fold more peptides than Lys-C, and 5-fold more than cyanogen bromide. Because Lys-C created many peptides in the desired 3-10 kDa size range, it was selected for initial practice of the method of the invention. Further examination of the human proteome revealed that the occurrence of R and K residues in proteins was roughly equivalent at 6%. From these percentages, it can be predicted that most Lys-C peptides will have at least one additional R or K residue, relative to the corresponding tryptic peptides. As shown in the following results, this positive charge can improve the electrospray ionization efficiency at least 10-fold for posttranslationally modified peptides carrying the negative charge of phosphorylation or sialylation, when operating under the typical conditions of positive ion mode and low pH buffers. In addition, Lys-C is a robust enzyme which can digest proteins even under harsh conditions, such as 6 M urea or 0.5% SDS. As discussed below, we also explored the strategy of further digesting a portion of the Lys-C digest with trypsin, as necessary, if a Lys-C fragment was much larger than 10 kDa.
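A theoretical digest of the kind described above can be sketched as follows. This is a simplified illustration, not the procedure actually used: trypsin is assumed to cleave after every K or R and Lys-C after every K, with no missed cleavages and no special handling of proline, and the input is a short hypothetical sequence rather than a database entry.

```python
def digest(sequence, cleave_after):
    """Split a sequence C-terminal to every residue listed in cleave_after."""
    peptides, start = [], 0
    for i, residue in enumerate(sequence):
        if residue in cleave_after:
            peptides.append(sequence[start:i + 1])
            start = i + 1
    if start < len(sequence):
        peptides.append(sequence[start:])  # C-terminal remainder
    return peptides


TRYPSIN = "KR"  # simplified rule: cleave after K or R
LYS_C = "K"     # simplified rule: cleave after K only

if __name__ == "__main__":
    # Hypothetical toy sequence, for illustration only.
    toy = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"
    tryptic = digest(toy, TRYPSIN)
    lysc = digest(toy, LYS_C)
    print(len(tryptic), tryptic)
    print(len(lysc), lysc)
    # Arginine residues retained inside each Lys-C peptide.
    print([p.count("R") for p in lysc])
```

On this toy sequence Lys-C gives fewer and longer peptides than trypsin, and the longer Lys-C peptides retain internal arginine residues, which is the property the text associates with improved ionization of modified peptides.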
As shown in
The FTICR preview was followed by 8 MSn scans (typically 4 pairs of linear ion trap MS2 and MS3 scans). The total cycle time was 2.7 seconds. The acquisition time for the full FTICR scan required 1.8 seconds with an ion target value of 2×10⁶ and a resolution of 100,000, as in the scheme above. Five MSn linear ion trap scans could be acquired concurrently within this 1.8 sec (parallel acquisition), with each additional MSn scan adding 0.3 sec to the cycle time. Each MS2 CID fragmentation (target value of 30,000 ions) was performed on a precursor ion that was isolated using the data-dependent acquisition mode to select ions automatically in sequence from the highest intensities of the preview scan. From the MS2 fragmentation spectrum, a subsequent MS3 CID fragmentation (at a target value of 5,000 ions) was performed on a precursor ion from the MS2 scan, isolated using the data-dependent acquisition mode to select automatically ions with the highest intensity.
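The cycle-time arithmetic stated above can be checked with a short sketch. The function name and parameter names are hypothetical; the default values are the timings given in the text.

```python
def cycle_time(n_msn_scans, ft_scan_s=1.8, parallel_scans=5, extra_scan_s=0.3):
    """Total cycle time in seconds for one FTICR scan plus n ion trap MSn scans.

    Up to `parallel_scans` MSn scans fit inside the FTICR acquisition window;
    each further scan adds `extra_scan_s` to the cycle.
    """
    extra = max(0, n_msn_scans - parallel_scans)
    return ft_scan_s + extra * extra_scan_s


if __name__ == "__main__":
    # 8 MSn scans: 5 run in parallel, 3 add 0.3 s each.
    print(round(cycle_time(8), 2))
```

With eight MSn scans, three fall outside the parallel window, giving 1.8 + 3 × 0.3 = 2.7 seconds, in agreement with the stated total cycle time.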
Central to the strategy in the method of the invention is the ability to identify, with high confidence, peptides that can be 2- to 3-fold greater in length than normally produced in tryptic digests. Current database search software (e.g., Mascot and Sequest) has been optimized for use with tryptic peptides (<3,000 Da) and, although the current version of Sequest will process ions with charge states up to 8+, no statistics are available correlating the probability of correct identification to the Xcorr scores of ions above the 3+ charge state. In the absence of this information, our present strategy for analysis of large peptides with charge states of 4+ or higher is to use the current version of Sequest (BioWorks 3.1 SR1) as a means to select and rank the most likely candidates, then to confirm the sequences manually using (i) accurate mass measurement (within 2 ppm) provided by the FTMS, (ii) agreement between the MS2 and MS3 identifications, and (iii) expected cleavage ions in the CID spectra. As the database search software improves, less manual confirmation will be needed.
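Criterion (i), agreement of the measured and candidate masses within 2 ppm, amounts to a one-line calculation. The sketch below uses hypothetical function names; the example values are the measured and theoretical masses quoted for the large beta-casein Lys-C peptide in the example that follows (6359.264 Da and 6359.255 Da).

```python
def ppm_error(measured, theoretical):
    """Signed mass error in parts per million."""
    return (measured - theoretical) / theoretical * 1e6


def within_tolerance(measured, theoretical, tol_ppm=2.0):
    """True if the measured mass agrees with the candidate within tol_ppm."""
    return abs(ppm_error(measured, theoretical)) <= tol_ppm


if __name__ == "__main__":
    err = ppm_error(6359.264, 6359.255)
    print(round(err, 2), within_tolerance(6359.264, 6359.255))
```

A difference of 0.009 Da at about 6359 Da corresponds to roughly 1.4 ppm, which passes the 2 ppm criterion.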
Interestingly, we have found that CID fragmentation of large precursor ions with higher charge states leads to a low number of high-intensity preferred-cleavage fragments rather than a complete ion series. Also, although Xcorr thresholds of significance increase with charge states from 1+ to 3+ (e.g. >1.9 for 1+ ions, >2.2 for 2+ ions, and >3.75 for 3+ ions), our preliminary findings suggest that Xcorr threshold scores of ions of 4+ or higher charge are similar to those for 3+ ions. This trend could reflect the fact that larger precursor ions have fewer potential peptide candidates in the database, especially when using strict precursor ion mass tolerance and proteolytic enzyme specificity. For example, searching against the SwissProt database using Lys-C specificity with zero missed cleavages and ±10 ppm mass accuracy, we found roughly 60 potential peptide candidates at MH+=2,000 Da, as compared to only 5 candidates at MH+=8,000 Da. Thus, with high mass accuracy FTMS data, the assignment of spectra of large peptides can be based on fewer fragment ion assignments than are required for small peptides. The analysis of two commercially available complex proteins, beta-casein and epidermal growth factor receptor, is given below to demonstrate the principles and effective results of this method.
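The charge-dependent Xcorr cut-offs mentioned above could be applied as a simple first-pass filter before manual confirmation. This is only a sketch: the function name is hypothetical, and applying the 3+ threshold to ions of 4+ and higher charge is an assumption based on the preliminary finding stated in the text, not an established significance level.

```python
# Xcorr significance thresholds quoted in the text for 1+ to 3+ ions.
XCORR_THRESHOLDS = {1: 1.9, 2: 2.2, 3: 3.75}


def passes_xcorr(xcorr, charge):
    """True if the Xcorr score exceeds the threshold for this charge state.

    Charge states of 4+ or higher reuse the 3+ threshold (assumption).
    """
    threshold = XCORR_THRESHOLDS[min(charge, 3)]
    return xcorr > threshold


if __name__ == "__main__":
    # Hypothetical (Xcorr, charge) pairs, for illustration only.
    for score, z in [(2.0, 1), (2.0, 2), (3.8, 5)]:
        print(score, z, passes_xcorr(score, z))
```

Candidates that pass such a filter would still need the manual confirmation steps described earlier, since no validated statistics exist for the higher charge states.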
Bovine beta-casein, a 23 kDa protein containing 5 known serine phosphorylation sites, was chosen as a model phosphoprotein. 20, 26 Carrying out the method of the invention by conducting LC-MS on Lys-C fragments of the protein provided 97% sequence coverage (202 out of 209 amino acid residues), including the identification of all phosphorylation sites, at the 50 fmole level (see Table 1 and
+e
aSee FIG. 3 caption and Experimental Section for details.
b“*” indicates phosphorylated amino acids (S*).
cTheoretical monoisotope with a hydrogen mass [M + H]+ shown, including the molecular weight of phosphorylation if indicated.
d“+” = identified, “−” = not identified.
eBoth phosphorylated and nonphosphorylated peptides found.
Referring to
Experimentally, after digestion with Lys-C, beta-casein (50 fmole) was analyzed online by the method of the invention following LC separation over a 75 μm i.d. C-18 reversed phase column (300 Å pore). The total ion chromatogram is shown in
The MS2 and MS3 database search results were then confirmed manually. First and foremost, the accurate mass measurement of the precursor ion (6359.264 Da) matched within 2 ppm the molecular weight of the peptide identified in the MS2 assignment (6359.255 Da). Additionally, we compared the masses of the high intensity ions observed in the fragmentation spectra with those that would be generated by cleavage at preferential sites of the candidate sequence, specifically: cleavage between L and P (y18 and b38), between M and F (b43), and between F and P (y12 and b44), as indicated in the insert to Panel B and the MS2 spectrum in Panel C of
Another large peptide with a molecular weight of 5316 Da (residues 49-97 in
We next turn to examination of the tetraphosphorylated peptide. At an elution time of 26.80 min (
As in
An important finding from the results shown in
Finally, the fifth phosphorylation site for beta-casein was seen in the peptide sequence FQS*EEQQQTEDELQDK, eluting at 15.41 min. This site was readily identified with the same neutral loss analysis, and both the non-phosphorylated as well as phosphorylated forms of this peptide were observed (Table 1). Both trypsin and Lys-C digestion produced this same monophosphopeptide fragment.
Thus, 97% sequence coverage of beta-casein (202 out of 209 amino acid residues) was obtained using the approach according to the method of the invention at the 50 fmole level, including identification of all five phosphorylation sites. The only missing residues were the three di- and tri-peptides (IEK, VK, and HK), as shown in Table 1. In contrast, the typical tryptic bottom-up approach with the same quantity of protein would not have detected the tetraphosphorylated peptide nor the two large peptides in beta-casein, resulting in a sequence coverage of 36% at best.
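The coverage figure follows directly from the lengths of the undetected peptides. A minimal sketch, with a hypothetical function name and assuming coverage is counted in residues:

```python
def sequence_coverage(protein_length, missing_peptides):
    """Return (covered residues, percent coverage) given the missed peptides."""
    missing = sum(len(p) for p in missing_peptides)
    covered = protein_length - missing
    return covered, 100.0 * covered / protein_length


if __name__ == "__main__":
    # Beta-casein: 209 residues; IEK, VK and HK were not detected.
    covered, percent = sequence_coverage(209, ["IEK", "VK", "HK"])
    print(covered, round(percent, 1))
```

The three missed peptides account for 7 residues, leaving 202 of 209 covered, or about 97%.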
We now turn to the ability of the method of the invention to comprehensively characterize a large protein with heterogeneous phosphorylation and glycosylation structures. EGFR is a transmembrane glycoprotein comprising 1186 amino acids with a molecular weight of 132 kDa, based on the amino acid sequence. In SDS-PAGE, however, the receptor migrates with an apparent molecular weight of 180 kDa, which suggests the presence of posttranslational modifications, particularly glycosylation.27-29 EGFR is composed of three domains: an extracellular ligand-binding domain (residues 1-621), a transmembrane region (residues 622-644) and an intracellular cytoplasmic domain (residues 645-1186).30 When activated, the cytoplasmic kinase domain of the receptor triggers signaling cascades within the cell that are implicated in a number of diseases.31,32 EGFR overexpression is a well-known biomarker in several cancers.33 The capability to analyze such receptors comprehensively, including the posttranslational modifications which are indicative of their activation states, could thus offer important insights into a number of disease processes.34,35 Multiple phosphorylation sites have been reported for EGFR,29,36-38 and there are 12 theoretically possible N-linked glycosylation sites.30,39 Because of the large size and heterogeneous modifications of this protein, direct top-down analysis would be difficult and insensitive, if it could be carried out at all.40
A theoretical digest of EGFR using Lys-C yielded 63 peptide fragments, with 12 peptides less than 500 Da and 16 peptides greater than 3500 Da (one as large as 10,266 Da). Trypsin, in contrast, yielded 106 peptide fragments, with 25 peptides less than 500 Da and only 4 peptides over 3500 Da. Thus, relative to trypsin, Lys-C digestion would generate roughly 40% fewer peptides overall (63 vs. 106), about half as many peptides of less than 500 Da (12 vs. 25), and four times as many peptides over 3500 Da (16 vs. 4).
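A theoretical digest of this kind can be sketched as follows. This is an illustrative simplification, not the software used in this work: cleavage is assumed to occur C-terminal to every K (Lys-C) or every K and R (trypsin), with no missed cleavages and no proline rule. The residue masses are standard monoisotopic values, and the function names are hypothetical. The example sequence is the EGFR peptide spanning residues 653-684 that is quoted elsewhere in this description.

```python
# Standard monoisotopic residue masses in Da (not taken from this document).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056

def digest(sequence, cleave_after):
    """Cleave C-terminal to every residue listed in `cleave_after`."""
    peptides, start = [], 0
    for i, aa in enumerate(sequence):
        if aa in cleave_after:
            peptides.append(sequence[start:i + 1])
            start = i + 1
    if start < len(sequence):
        peptides.append(sequence[start:])  # C-terminal remainder
    return peptides

def peptide_mass(peptide):
    """Neutral monoisotopic mass of an unmodified peptide."""
    return sum(RESIDUE_MASS[aa] for aa in peptide) + WATER

def size_bins(peptides, small=500.0, large=3500.0):
    """Count peptides overall, below `small` Da, and above `large` Da."""
    masses = [peptide_mass(p) for p in peptides]
    return {
        "total": len(masses),
        "small": sum(m < small for m in masses),
        "large": sum(m > large for m in masses),
    }

EXAMPLE = "RTLRRLLQERELVEPLTPSGEAPNQALLRILK"  # EGFR residues 653-684
lysc = digest(EXAMPLE, "K")       # Lys-C: cleave after K only
trypsin = digest(EXAMPLE, "KR")   # trypsin: cleave after K and R
print(size_bins(lysc))     # one peptide, above 3500 Da
print(size_bins(trypsin))  # six peptides, four of them below 500 Da
```

On this single stretch the same trend appears as in the whole-protein digest: Lys-C leaves one large fragment, whereas trypsin breaks it into several pieces, most of them too small to be retained or sequenced.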
EGFR (1 pmole) was digested with Lys-C and then analyzed by the method of the invention using several injections of 200 fmole each onto a 75 μm i.d. Biobasic C-4 column (300 Å pore). A total of 95% sequence coverage of EGFR, including the identification and location of 10 glycosylation sites and three phosphorylation sites, was found (see Table 2). In the following, we illustrate key points of this analysis. As in the discussion of the beta-casein analysis, because the identification of unmodified peptides less than 3500 Da was similar to conventional methods, we present details of analysis of some of the large peptides: one unmodified, two phosphorylated, and two glycosylated.
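[Table 2: only the identification column of the table body survives; entries in row order: +e, +g, +f, +g, +f, +f, +e, +f, +f, +e, +e, +f, +h, +h. Footnotes follow.]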
(a) See FIG. 5 caption and Experimental Section for details.
(b) “*” indicates glycosylated (N*) or phosphorylated amino acids (S* or T*).
(c) Theoretical unmodified monoisotopic mass with a hydrogen, [M + H]+, shown, including the mass of carbamidomethylation for cysteine residues.
(d) “+” = identified, “−” = not identified.
(e) Both glycosylated and nonglycosylated peptides found.
(f) Only glycosylated peptide found.
(g) Only nonglycosylated peptide found.
(h) Both phosphorylated and nonphosphorylated peptides found.
We first examine the analysis of one of the large unmodified peptides. The elution profile of the Lys-C digest of EGFR is shown in the total ion chromatogram of
It is noteworthy in
We next turn to an examination of the ability of this method to identify an important phosphorylated peptide in the same Lys-C digest of EGFR. Phosphorylation at T669 has been shown to have a significant effect on intracellular protein kinase activity.41 At the elution time of 35.82 min, shown in Panel A of
Further examination of
An important difference was seen between the Lys-C and trypsin digests in the sensitivity of detecting the peptide containing phosphothreonine 669. Loading 200 fmole on the column, we were able to detect the Lys-C phosphopeptide RTLRRLLQERELVEPLT*PSGEAPNQALLRILK (residues 653-684) but could not observe the corresponding tryptic fragment (underlined and in bold) even when loading as much as 2 pmole under similar separation conditions. The tryptic fragment of this phosphopeptide was detected principally as a 2+ charge ion (only 5% as a 3+ ion), while this Lys-C phosphopeptide was principally a 5+ charge ion (with 25% as a 4+ and 25% as a 6+ charge ion). An approximately 30-fold difference in the limit of detection was found between the Lys-C and the corresponding tryptic phosphopeptide ions, when summing the intensities from all observed charge states. Even when comparing only the most intense ion from each series of charges, this difference was still 10- to 15-fold. The improved electrospray ionization efficiency was again attributed to the Lys-C phosphorylated peptide carrying additional R residues, relative to the corresponding tryptic fragment. Moreover, the longer Lys-C peptide eluted 10 minutes later than the corresponding tryptic fragment, and peptides eluting with less aqueous solvent should have improved electrospray ionization efficiency.
We now discuss the analysis of a peptide which was both very large (10,266 Da) and phosphorylated (see Table 2). Since Sequest does not support searches for peptides over 10 kDa, manual searching was necessary. For this peptide, we used the accurate precursor mass to narrow down the peptide candidate list. Indeed, employing Lys-C specificity with ±5 ppm mass accuracy for such a large molecular weight resulted in only one peptide candidate in the SwissProt database, EGFR residues 947-1037. We then examined the preferred cleavage sites and confirmed that the high intensity ions in the MS2 and MS3 spectra were in agreement with the predicted cleavages. A second ion with a mass of 10,346 Da (addition of 80 Da, phosphate) was also observed, and MS2 and MS3 fragmentation spectra showed phosphorylation at either S967 or S1002 in this large Lys-C peptide. Thus, this Lys-C fragment was observed in three forms: non-phosphorylated, phosphorylated at S967, or phosphorylated at S1002. Importantly, no doubly-phosphorylated form was observed. The high mass resolution and accuracy of FT-MS was crucial in this analysis to resolve and determine the molecular weight and high charge states of the 10,266 Da peptide.
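The accurate-mass criteria used here and in the beta-casein example can be expressed as a short calculation. The following is a sketch only, with hypothetical function names. It shows the relative mass error in ppm and the absolute width of a ppm tolerance window at a given mass.

```python
def ppm_error(observed, theoretical):
    """Relative mass error in parts per million."""
    return (observed - theoretical) / theoretical * 1e6

def mass_window(mass, tol_ppm):
    """Lower and upper mass bounds for a symmetric ppm tolerance."""
    delta = mass * tol_ppm / 1e6
    return mass - delta, mass + delta

# Beta-casein precursor: measured 6359.264 Da vs. 6359.255 Da for the assigned peptide.
err = ppm_error(6359.264, 6359.255)
print(f"{err:.2f} ppm")  # about 1.4 ppm, i.e. within the 2 ppm stated

# A 5 ppm window around the nominal 10,266 Da mass of the large EGFR peptide.
low, high = mass_window(10266.0, 5.0)
print(f"{high - low:.3f} Da wide")  # roughly 0.1 Da in total
```

A window of about 0.1 Da at 10 kDa is narrow enough that, combined with Lys-C specificity, very few database peptides can match, which is consistent with the single candidate reported above.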
For additional confirmation of the 10,266 Da peptide assignment, a portion of the Lys-C digest was further cleaved with trypsin. The 10.3 kDa fragment was reduced to seven smaller peptides. Two of the peptides were too small to be analyzed; the other five were identified by Sequest and supported our assignment of the Lys-C fragment. Partial phosphorylation at the same two serine residues was observed in two of these tryptic peptides, but it could not be determined from the tryptic digest whether the phosphorylation occurred at only one site in a given EGFR molecule, or whether both sites were simultaneously phosphorylated. The accurate mass measurement of the large intact Lys-C peptide (residues 947-1037) did, however, reveal that only one of S967 and S1002 was phosphorylated at any one time. The ability to make this biologically relevant distinction is an important advantage of the method of the invention for large peptide assignment and has the potential to provide a more comprehensive characterization of the protein than tryptic digestion.
With respect to the LC-MS analysis of the Lys-C digest of EGFR, we note that the selection of the capillary LC stationary phase was an important consideration. It is well-known that the recovery of large fragments from LC columns is a function of the pore diameter (and pore shape), along with the n-alkyl chain length.42-45 In our initial studies to assess recovery, we found that 200 Å or even 300 Å pore commercial C-18 packing materials did not yield full recovery of fragments over 6,000 Da, as evidenced by ghost peaks in subsequent blank runs. In contrast, a C-4, 300 Å pore (Biobasic) solid phase performed well for the separation of the Lys-C digest of EGFR.
We next examined the capabilities of this strategy in determining the glycosylation pattern of EGFR. In brief, using a total of one pmole of EGFR (200 fmole/injection), we were able to determine which of the 12 potential N-glycosylation sites were fully glycosylated (N151, N328, N337, N420, N504, and N559), partially glycosylated (N32, N389, N544, and N579) or unglycosylated (N104 and N172) (see Table 2). Ongoing characterization of these glycan structures indicates that most are high-mannose or complex-type and so, as representative examples, we will illustrate the analysis of a high-mannose-type glycan and a complex-type glycan in the following discussions of
In contrast to assigning peptides or phosphopeptides, which can be aided by an automated Sequest search, software is limited for recognizing glycopeptides, and therefore the mass spectra at present had to be inspected manually. The search was based on several criteria. Initially, we examined the MS, MS2 and MS3 spectra for characteristic features, such as a ladder of fragments differing by the mass of a hexose unit (Hex, Δm=162), typically encountered in the high-mannose type glycans, or for the loss of sialic acid (SA, Δm=292) or N-acetylhexosamine (HexNAc, Δm=203), typically observed for complex-type glycans. In this context, we searched for ladders of masses with charge states as high as the charge state of the precursor ions. In addition, we examined the MS2 and MS3 spectra for ions corresponding to the loss of specific oligosaccharide chains, such as antennae (Hex-Hex-Hex, m/z=486; Hex-Hex-HexNAc, m/z=528; or SA-Hex-HexNAc, m/z=657). Since these signature ions are singly charged, their m/z values would be constant regardless of the precursor ion charge state, and therefore, they could be easy to recognize in a spectrum. After a glycopeptide was found by one of the above signatures, the peptide sequence was manually assigned, using one of the strategies discussed in the examples below.
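The ladder criterion described above lends itself to a simple automated pre-screen. The sketch below is an assumption-laden illustration, not the manual procedure actually used. It takes a list of peak m/z values and an assumed charge state and reports pairs of peaks whose spacing matches one of the nominal residue masses quoted above divided by the charge. The function name, the tolerance, and the example peak list are all hypothetical.

```python
# Nominal mass differences quoted in the text (Da).
RESIDUE_DELTAS = {"Hex": 162.0, "HexNAc": 203.0, "SA": 292.0}

def find_ladder_steps(peaks, charge, tol=0.3):
    """Return (low_mz, high_mz, residue) for peak pairs spaced by delta/charge."""
    steps = []
    ordered = sorted(peaks)
    for i, low in enumerate(ordered):
        for high in ordered[i + 1:]:
            for name, delta in RESIDUE_DELTAS.items():
                if abs((high - low) - delta / charge) <= tol:
                    steps.append((low, high, name))
    return steps

# Hypothetical 3+ fragment series: consecutive peaks are 162/3 = 54 m/z apart,
# plus one unrelated peak at m/z 1500.
peaks = [1200.0, 1254.0, 1308.0, 1362.0, 1500.0]
for low, high, name in find_ladder_steps(peaks, charge=3):
    print(f"{low:.1f} -> {high:.1f}: {name}")
```

Running the screen once per plausible charge state, up to that of the precursor, mirrors the search for ladders "with charge states as high as the charge state of the precursor ions" described above. Any hit would still need manual confirmation.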
As shown in
The analysis in
The ability to identify the glycan moiety depended critically on the determination of the peptide sequence. If peptide backbone fragments of the glycopeptide were not observed in the MS2 or MS3 CID spectrum, then two alternative approaches could be taken. First, a higher stage fragmentation such as MS4 or MS5 could be employed to determine the peptide sequence. However, more material may be required (e.g., 10 fold) for additional LC-MS analyses to target specific precursor masses and to acquire a sufficient number of precursor ions for MS4 or MS5 fragmentation. A second and more general approach involves performing a parallel analysis according to the method of the invention of a deglycosylated sample (i.e., further treating an aliquot of the Lys-C digest with PNGase F or A), which was used in the next example.
We next turn to the identification of a complex-type glycan structure. A glycopeptide was found at an elution time of 36.70 min (Panel A in
Further practice of the method of the invention can include examination of the deglycosylated digest to obtain the sequences of the peptide portions. Deglycosylation can also provide an unambiguous determination of the glycosylation sites, since a glycosidase converts N to D in the removal of the glycan from the peptide.
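For illustration, the N-to-D conversion can be modeled in a few lines. The sketch below makes two assumptions that are standard knowledge rather than statements from this description: that candidate N-glycosylation sites follow the N-X-S/T sequon (X being any residue except proline), and that the monoisotopic residue masses of N and D are 114.04293 and 115.02694 Da. The example sequence and function names are hypothetical.

```python
import re

# Lookahead pattern so that overlapping sequons are all reported.
SEQUON = re.compile(r"(?=N[^P][ST])")

# Mass increase per converted site (Da): residue mass of D minus that of N.
N_TO_D_SHIFT = 115.02694 - 114.04293

def find_sequons(sequence):
    """1-based positions of N residues within an N-X-S/T sequon (X != P)."""
    return [m.start() + 1 for m in SEQUON.finditer(sequence)]

def deglycosylate(sequence, glycosylated_positions):
    """Sequence after glycan removal: each formerly glycosylated N reads as D."""
    residues = list(sequence)
    for pos in glycosylated_positions:
        if residues[pos - 1] != "N":
            raise ValueError(f"position {pos} is not asparagine")
        residues[pos - 1] = "D"
    return "".join(residues)

peptide = "MNGSANPTQNVTK"          # hypothetical example sequence
sites = find_sequons(peptide)      # N6 is skipped because it is followed by P
print(sites)
print(deglycosylate(peptide, sites))
print(f"+{N_TO_D_SHIFT * len(sites):.3f} Da relative to the unmodified peptide")
```

A deglycosylated peptide that is heavier than its unmodified sequence by about 0.984 Da per site therefore marks the asparagine that carried the glycan.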
The detailed glycan structure analysis of each glycopeptide at this stage can be time-consuming because there is at present only limited software that can unambiguously recognize glycosylated peptides or determine their glycan structures, and therefore, individual spectra must be inspected manually. Determining the elution times of deglycosylated peptides greatly reduced the amount of manual examination, as it was found that the glycopeptide and its corresponding deglycosylated form eluted close to each other. Although interpretation was aided by GlycoMod,46 the type and branching of the glycans still required manual interpretation based on likely gas-phase cleavages. The development of software to automate the recognition process would be helpful.
Achromobacter protease I (Lys-C) was obtained from Wako (Richmond, Va.), and trypsin (sequencing grade) was purchased from Promega (Madison, Wis.). The proteins, beta-casein from milk and human epidermal growth factor receptor (EGFR) from an A431 cancer cell line, as well as dithiothreitol (DTT), iodoacetamide (IAA), guanidine hydrochloride and ammonium bicarbonate, were obtained from Sigma-Aldrich (St. Louis, Mo.). Formic acid, acetone and acetonitrile were purchased from Fisher Scientific (Fair Lawn, N.J.), and the HPLC-grade water used in all experiments was from J. T. Baker (Bedford, Mass.).
To beta-casein (1 mg/mL), the endoproteinase Lys-C was added in a 1:100 (w/w) ratio, and the mixture was incubated for 4 hr at 37° C. Half of the digest was directly analyzed by LC-MS (see below), and the other half of the sample was further digested by trypsin (1:100 w/w) for an additional 16 hr at 37° C., followed by LC-MS analysis. EGFR was received as a lyophilized powder containing 500 units of the protein. The powder (˜1 pmole of EGFR) was reconstituted with 200 μL of 6 M guanidine hydrochloride, reduced with 20 mM DTT for 30 min at 37° C. and alkylated with 50 mM IAA in the dark for 1.5 hr at room temperature. After desalting over a Microcon spin column (10 kDa MWCO; Millipore, Bedford, Mass.), the endoproteinase Lys-C (1:100 w/w) was added to digest the protein for 4 hr at 37° C. Digestion was stopped by addition of 1% formic acid. As with beta-casein, half of the digest was directly analyzed by LC-MS, and the other half was further digested by trypsin (1:100 w/w) for an additional 16 hr at 37° C., then analyzed by LC-MS. For deglycosylation, a portion of the above Lys-C digest (10 μL) was treated with 0.5 unit of PNGase F at 37° C. for 2 hr. In order to avoid possible interactions between silanol groups in the LC stationary phase and phosphate groups in the phosphopeptides, 1% phosphoric acid was added to all samples prior to the LC-MS analysis.20
LC-MS experiments were performed on an LTQ-FTMS instrument (Thermo Electron, San Jose, Calif.) with an Ultimate nanoLC pump (Dionex, Mountain View, Calif.), using a reversed phase column (75 μm i.d. × 10 cm, BioBasic C18 or C4, 5 μm particle size, Thermo Electron). The flow rate was 400 nL/min for sample loading and 200 nL/min for separation. Mobile phase A was 0.1% formic acid in water, and mobile phase B was 0.1% formic acid in acetonitrile. A shallow gradient was used for all analyses: (i) 5 minutes at 2% B for sample loading at 400 nL/min, (ii) flow rate lowered to 200 nL/min for 5 minutes, (iii) linear gradient to 65% B over 50 min, then (iv) to 80% B over 10 min, and finally (v) constant 80% B for 10 min.
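The gradient program can be written down as a small table and interpolated to give the programmed mobile phase composition at any time. This is a sketch under stated assumptions: step (ii) is assumed to hold 2% B, since only the flow change is stated, and system delay volume is ignored, so the composition actually reaching the column would lag these values. The names are hypothetical.

```python
# (duration_min, start_%B, end_%B, flow_nL_per_min) for steps (i) to (v).
SEGMENTS = [
    (5.0, 2.0, 2.0, 400.0),     # (i) sample loading
    (5.0, 2.0, 2.0, 200.0),     # (ii) flow lowered; 2% B is an assumption
    (50.0, 2.0, 65.0, 200.0),   # (iii) shallow linear gradient
    (10.0, 65.0, 80.0, 200.0),  # (iv) ramp to 80% B
    (10.0, 80.0, 80.0, 200.0),  # (v) hold at 80% B
]

TOTAL_RUN_TIME = sum(segment[0] for segment in SEGMENTS)

def percent_b(t_min):
    """Programmed %B at time t_min (minutes from injection)."""
    if t_min < 0 or t_min > TOTAL_RUN_TIME:
        raise ValueError("time outside the gradient program")
    start = 0.0
    for duration, b_start, b_end, _flow in SEGMENTS:
        if t_min <= start + duration:
            return b_start + (b_end - b_start) * (t_min - start) / duration
        start += duration
    return SEGMENTS[-1][2]

print(TOTAL_RUN_TIME)   # total programmed time in minutes
print(percent_b(35.0))  # programmed %B midway through step (iii)
```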
The ion transfer tube of the linear ion trap was held at 245° C.; the normalized collision energy was 28% for MS2 and 20% for MS3; and the spray voltage was set at 2.2 kV. Briefly, the mass spectrometer was operated in the data-dependent mode to switch automatically between MS, MS2, and MS3 acquisition. Survey full-scan MS spectra with 2 microscans (m/z 400-2000) were acquired in the FTICR cell with mass resolution of 100,000 at m/z 400 (after accumulation to a target value of 2×10^6 ions in the linear ion trap), followed by 4 pairs of sequential MS2 and MS3 scans (see
Peptides and proteins were identified by automated searching of all MS2 and MS3 spectra against spectra of theoretical fragmentations of a human proteomic database (SwissProt), using the Sequest algorithm incorporated into the BioWorks software (version 3.1 SR1, Thermo Electron). The Sequest search was conducted with a mass tolerance of ±4 Da. Peptide ions (≤3+ ions) were assigned automatically with Xcorr scores above the following thresholds: >3.75 for 3+ ions, >2.2 for 2+ ions, and >1.9 for 1+ ions; with Lys-C or trypsin specificity, as appropriate, and up to 3 missed cleavages. For larger ions (≥4+ charge), Sequest was used to assign the most likely peptide sequence, and the assignment was then confirmed manually by (i) comparing the accurate precursor mass in the survey scan with the predicted mass of the candidate peptide, (ii) using combined MS2 and MS3 peptide assignments and (iii) confirming preferred fragmentation patterns in the observed MS2 and MS3 fragmentation, as discussed in the Results and Discussion Section. For peptide identification by MS3 spectra, no enzyme specificity restriction was applied as long as the parent peptides from the MS2 assignments were based on full enzyme specificity. In addition, loss of water (−18 Da) from the C terminus was included as a potential modification in the MS3 spectral search. Glycopeptides were assigned manually as described above.
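The charge-dependent acceptance rule can be summarized in code. The following is a sketch only: the thresholds are those stated above, while the function name and return labels are hypothetical, and the manual confirmation steps (i) to (iii) are not modeled.

```python
# Minimum Xcorr (exclusive) for automatic assignment, keyed by charge state.
XCORR_THRESHOLDS = {1: 1.9, 2: 2.2, 3: 3.75}

def triage(charge, xcorr):
    """Classify a Sequest hit as 'accept', 'reject' or 'manual'."""
    if charge < 1:
        raise ValueError("charge must be a positive integer")
    if charge >= 4:
        # Larger ions: Sequest only proposes a sequence; confirmation is manual.
        return "manual"
    return "accept" if xcorr > XCORR_THRESHOLDS[charge] else "reject"

print(triage(2, 2.5))  # accept
print(triage(3, 3.0))  # reject
print(triage(5, 4.1))  # manual
```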
While the present invention has been described in conjunction with a preferred embodiment, one of ordinary skill, after reading the foregoing specification, will be able to effect various changes, substitutions of equivalents, and other alterations to the compositions and methods set forth herein. It is therefore intended that the protection granted by Letters Patent hereon be limited only by the definitions contained in the appended claims and equivalents thereof.
This application claims the priority of U.S. Provisional Application No. 60/605,058 filed Aug. 27, 2004 entitled, CHARACTERIZATION OF PROTEINS USING LARGE PEPTIDE FRAGMENT ANALYSIS, the whole of which is hereby incorporated by reference herein.
Part of the work leading to this invention was carried out with United States Government support provided under a grant from the National Institutes of Health, Grant No. GM-15847. Therefore, the U.S. Government has certain rights in this invention.
Filing Document | Filing Date | Country | Kind | 371c Date |
---|---|---|---|---|
PCT/US2005/030713 | 8/29/2005 | WO | 00 | 2/27/2007 |
Number | Date | Country | |
---|---|---|---|
60605058 | Aug 2004 | US |