Aging is a complex and ubiquitous biological process that leads to accumulation of molecular, cellular, and organ damage, resulting in reduced health, increased vulnerability to disease, and eventually to death. The chronological and biological age of individuals can vary. For example, lifestyle choices such as smoking may increase the rate of biological aging relative to chronological aging. While various biomarkers have been used to estimate biological age, there remains a need for accurate and easily measured biomarkers for determining the age of a subject using a biological sample.
The Summary is provided to introduce a selection of concepts that are further described below in the Detailed Description. This Summary is not intended to identify key or essential features of the claimed subject matter, nor is it intended to be used as an aid in limiting the scope of the claimed subject matter.
The present disclosure is based in part on the novel application of mass spectrometry to measure glycopeptides in biological samples, as well as the finding that chronological age correlates strongly with the relative abundance of one or more measured glycopeptides.
In one aspect, provided herein are methods for determining the age of a biological sample from a subject. In some embodiments, the age of the subject is determined based on the age of the biological sample. In some embodiments, the methods comprise measuring a relative abundance of at least one glycopeptide in the biological sample. In some embodiments, the at least one glycopeptide comprises any of the glycopeptides in Table 2 herein. In some embodiments, the at least one glycopeptide comprises IgG1-3510, IgG1-5410, IgM-209-5411, IgM-J-5412, Haptoglobin (Hp)-241-7602, or a combination thereof. In some embodiments, the at least one glycopeptide comprises IgG1-3510, IgG1-5410, IgG2-3410, IgM-209-5411, IgM-J-5412, Hp-241-7602, or a combination thereof.
In some embodiments, the methods herein further comprise measuring a concentration of at least one protein in the biological sample. In some embodiments, the at least one protein comprises any of the proteins in Table 2. In some embodiments, the at least one protein comprises IgG3.
In some embodiments, the methods comprise comparing the relative abundance of the at least one glycopeptide and/or the concentration of the at least one protein to an age prediction model, wherein the age prediction model comprises the relative abundance of the at least one glycopeptide and/or the concentration of the at least one protein in at least one control biological sample. In some embodiments, each control biological sample is from a control individual of a known age. In some embodiments, the age prediction model comprises the relative abundance of the at least one glycopeptide in a plurality of control biological samples. In some embodiments, the age prediction model comprises a linear regression model or a multiple linear regression model based on a correlation between the relative abundance of the at least one glycopeptide in the at least one control biological sample and the age of the control individual. In some embodiments, the age prediction model comprises one of the multiple linear regression models of Table 5 herein.
In some embodiments, the biological samples and the control biological samples are liquid samples. In some embodiments, the samples are blood samples, serum samples, plasma samples, or a combination thereof.
In some embodiments of the methods herein, measuring the relative abundance of at least one glycopeptide and/or measuring the concentration of at least one protein comprises mass spectrometry (e.g., multiple reaction monitoring mass spectrometry). In some embodiments, measuring the relative abundance of the at least one glycopeptide comprises calculating the relative response of the at least one glycopeptide as the area under the mass spectrometry curve of the at least one glycopeptide divided by the area under the curve of a non-glycosylated reference peptide from the same protein as the at least one glycopeptide.
In some embodiments, the subject is male or female. In some embodiments, the biological sample is from a criminal forensics investigation.
The present application includes the following figures. The figures are intended to illustrate certain embodiments and/or features of the compositions and methods, and to supplement any description(s) of the compositions and methods. The figures do not limit the scope of the compositions and methods, unless the written description expressly indicates that such is the case.
The following description recites various aspects and embodiments of the present compositions and methods. No particular embodiment is intended to define the scope of the compositions and methods. Rather, the embodiments merely provide non-limiting examples of various compositions and methods that are at least included within the scope of the disclosed compositions and methods. The description is to be read from the perspective of one of ordinary skill in the art; therefore, information well known to the skilled artisan is not necessarily included.
The following definitions are provided to assist the reader. Unless otherwise defined, all terms of art, notations, and other scientific or medical terms or terminology used herein are intended to have the meanings commonly understood by those of skill in the chemical and medical arts. In some cases, terms with commonly understood meanings are defined herein for clarity and/or for ready reference, and the inclusion of such definitions herein should not be construed as representing a substantial difference over the definition of the term as generally understood in the art.
Articles “a” and “an” are used herein to refer to one or to more than one (i.e. at least one) of the grammatical object of the article. By way of example, “an element” means at least one element and can include more than one element.
The use herein of the terms “including,” “comprising,” or “having,” and variations thereof, is meant to encompass the elements listed thereafter and equivalents thereof as well as additional elements. Embodiments recited as “including,” “comprising,” or “having” certain elements are also contemplated as “consisting essentially of” and “consisting of” those certain elements. As used herein, “and/or” refers to and encompasses any and all possible combinations of one or more of the associated listed items, as well as the lack of combinations where interpreted in the alternative (“or”).
As used herein, the transitional phrase “consisting essentially of” (and grammatical variants) is to be interpreted as encompassing the recited materials or steps “and those that do not materially affect the basic and novel characteristic(s)” of the claimed invention. See, In re Herz, 537 F.2d 549, 551-52, 190 U.S.P.Q. 461, 463 (CCPA 1976) (emphasis in the original); see also MPEP § 2111.03. Thus, the term “consisting essentially of” as used herein should not be interpreted as equivalent to “comprising.”
Recitation of ranges of values herein are merely intended to serve as a shorthand method of referring individually to each separate value falling within the range, unless otherwise indicated herein, and each separate value is incorporated into the specification as if it were individually recited herein. For example, if a concentration range is stated as 1% to 50%, it is intended that values such as 2% to 40%, 10% to 30%, or 1% to 3%, etc., are expressly enumerated in this specification. These are only examples of what is specifically intended, and all possible combinations of numerical values between and including the lowest value and the highest value enumerated are to be considered to be expressly stated in this disclosure.
The terms “about” and “approximately” as used herein shall generally mean an acceptable degree of error for the quantity measured given the nature or precision of the measurements. Exemplary degrees of error are within 20%; preferably, within 10%; and more preferably, within 5% of a given value or range of values. Any reference to “about X” or “approximately X” specifically indicates at least the values X, 0.95X, 0.96X, 0.97X, 0.98X, 0.99X, 1.01X, 1.02X, 1.03X, 1.04X, and 1.05X. Thus, expressions “about X” or “approximately X” are intended to teach and provide written support for a claim limitation of, for example, “0.98X.” Alternatively, in biological systems, the terms “about” and “approximately” may mean values that are within an order of magnitude, preferably within 5-fold, and more preferably within 2-fold of a given value. Numerical quantities given herein are approximate unless stated otherwise, meaning that the term “about” or “approximately” can be inferred when not expressly stated. When “about” is applied to the beginning of a numerical range, it applies to both ends of the range.
“Polypeptide,” “peptide,” and “protein” are used interchangeably herein to refer to a polymer of amino acid residues. As used herein, the terms encompass amino acid chains of any length, including full-length proteins, wherein the amino acid residues are linked by covalent peptide bonds.
The amino acids in the polypeptides described herein can be any of the 20 naturally occurring amino acids, D-stereoisomers of the naturally occurring amino acids, unnatural amino acids, and chemically modified amino acids. Unnatural amino acids (that is, those that are not naturally found in proteins) are also known in the art, as set forth in, for example, Zhang et al. “Protein engineering with unnatural amino acids,” Curr. Opin. Struct. Biol. 23(4): 581-587 (2013); Xie et al. “Adding amino acids to the genetic repertoire,” 9(6): 548-54 (2005); and all references cited therein. Beta and gamma amino acids are known in the art and are also contemplated herein as unnatural amino acids.
As used herein, a chemically modified amino acid refers to an amino acid whose side chain has been chemically modified. For example, a side chain can be modified to comprise a signaling moiety, such as a fluorophore or a radiolabel. A side chain can also be modified to comprise a new functional group, such as a thiol, carboxylic acid, or amino group. Post-translationally modified amino acids are also included in the definition of chemically modified amino acids.
Also contemplated are conservative amino acid substitutions. By way of example, conservative amino acid substitutions can be made in one or more of the amino acid residues, for example, in one or more lysine residues of any of the polypeptides provided herein. One of skill in the art would know that a conservative substitution is the replacement of one amino acid residue with another that is biologically and/or chemically similar. The following eight groups each contain amino acids that are conservative substitutions for one another:
By way of example, when an arginine to serine substitution is mentioned, also contemplated is a conservative substitution for the serine (e.g., threonine). Nonconservative substitutions, for example, substituting a lysine with an asparagine, are also contemplated.
Provided herein are methods for measuring and using the relative abundance of glycopeptides in biological samples from subjects to estimate the age of the subjects. As demonstrated herein, glycopeptides can be efficiently and accurately measured in biological samples, and the relative abundances of certain glycopeptides correlate strongly with chronological age. Along with nucleic acids, proteins, and lipids, glycans (oligosaccharides) are one of the four fundamental classes of molecules that make up all living systems (1). Traditionally, the information stream of a cell is viewed as starting in the genome and ending with a set of expressed proteins, representing the cell's phenotype. However, in order for a protein to function appropriately, it often requires post-translational modifications, of which glycans are one of the most commonly added modifiers. They can function as protein “on and off” switches or as “analog regulators” to fine-tune and direct protein function (2). The process that synthesizes and enzymatically attaches glycans to organic molecules is called glycosylation, and it can produce thousands of unique glycan structures by linking together a finite set of sugar monomers (3). However, unlike DNA, RNA, and protein synthesis, there is no template to guide the production of glycans. The process is thus immensely complex and impossible to predict from gene expression profiles alone. In fact, when one considers the massive 3-dimensional structural diversity of glycans combined with their variation in attachment sites, the complexity of the glycome parallels that of the genome (2).
As part of their glycoscience “Roadmap” (2), the National Research Council of the U.S. National Academies highlighted the importance of developing a site-specific map of the serum glycome, which would aid in the development of glycans as biomarkers of human diseases. One reason for the excitement around the use of glycans as disease-specific biomarkers is that glycosylation is a process influenced by a variety of factors including: the type of cell and its activation state; environmental factors, such as the presence of available metabolites; the age of the cell, as glycan moieties can be lost over time; and inflammatory mediators, such as cytokines and chemokines. All these factors can be altered in the setting of human diseases, making the glycome an expression of the overall health status of an individual. Furthermore, it has been hypothesized that glycans not only become altered in the setting of human disease but that they actually play a major role in the etiology of all human diseases (2). It is therefore not surprising that alterations in the glycome have already been linked to a variety of human diseases, especially cancer and autoimmunity (4-16). Most of these prior studies used labor-intensive methodologies to characterize glycans released from purified proteins and, perhaps for this reason, detailed analyses have only been conducted on a relatively small number of patients. Lower resolution techniques, which yield limited structural information or no site-specific information, have been used to characterize larger patient cohorts, but such analyses are not ideally suited for biomarker discovery research. As a result, the sensitivity and specificity of site-specific glycosylations as disease-specific multi-analyte classifiers of autoimmunity are currently unknown.
In comparison to the advances made in the fields of genomics and proteomics, glycoscience remains relatively understudied, which is due to a lack of the analytical tools needed to drive the field forward (2). In this regard, glycoscience is similar to where the field of genetics was during the initial stages of the human genome project (2). Mass spectrometry (MS)-based technologies remain very appealing for glycan biomarker research because glycans are ionizable molecules. Also, the potential to accurately profile and quantitate thousands of glycan structures from a relatively small amount of starting material (e.g. 2 μl of serum) makes glycans superior to other molecules traditionally used as biomarkers of human diseases. For example, a site-specific glycoprofiling method could theoretically increase the accuracy of a serum protein biomarker by subdividing it into its different glycoforms.
With the goal of deploying glycan biomarkers clinically, Multiple Reaction Monitoring (MRM) has been developed to site-specifically characterize the human glycome in a rapid and reproducible fashion (17). Although MRM MS is mainly used in the fields of metabolomics and proteomics (18-21), its high sensitivity and linear response over a wide dynamic range make it especially suited for glycan detection (22). In the studies described herein, MRM MS is used to construct a detailed site-specific structural map of the human plasma glycome of healthy individuals and to characterize the glycans' inter- and intra-molecular correlations. Glycan alterations associated with age and gender (common covariates in biomarker research and discovery) were also identified, and multi-analyte classifiers capable of predicting age were constructed and validated.
In one aspect, provided herein is a method for determining the age of a biological sample from a subject. As used herein, the term “subject” refers to animals such as mammals, including, but not limited to, humans, non-human primates, cows, sheep, goats, horses, dogs, cats, rabbits, rats, mice and the like. In some embodiments, the biological samples used in the methods provided herein are obtained from a human subject. In some embodiments, the subject is male or female. In some embodiments, the biological samples are obtained as part of a forensics investigation (e.g., criminal forensics). As used herein, the term “age” and its grammatical equivalents may refer to either chronological age, i.e., the length of time that a living organism has been alive, or biological age (also referred to as physiological age), i.e., how old the body of a living organism seems to be, based on any of a number of biological factors. The methods herein may be used to determine or predict chronological age, biological age, or both chronological age and biological age.
A biological sample of the present disclosure may be any suitable sample from a subject (e.g., a solid sample, a liquid sample, a tissue sample, a cellular sample, a waste sample, etc.). In some embodiments, the sample is a blood sample. In some embodiments, the blood sample is a whole blood sample. In some embodiments, the whole blood sample is processed (e.g., by centrifugation or filtration) to enrich one or more blood components. In some embodiments, the blood sample has been processed to deplete one or more blood components. In some embodiments, the blood sample comprises plasma, serum, buffy coat, or any other blood fraction. In some embodiments, the blood sample comprises venous and/or capillary blood. In some embodiments, the biological sample is a blood sample, a serum sample, a plasma sample, or a combination thereof.
In some embodiments, the methods provided herein comprise measuring a relative abundance of at least one glycopeptide (e.g., one glycopeptide, two glycopeptides, three glycopeptides, four glycopeptides, five glycopeptides, six glycopeptides, seven glycopeptides, eight glycopeptides, nine glycopeptides, ten glycopeptides, or more) in a biological sample. In some embodiments, the at least one glycopeptide comprises any of the glycopeptides in Table 2. In some embodiments, the at least one glycopeptide comprises at least one (e.g., one, two, three, four, five, or all six) of the glycopeptides shown in
In the present disclosure, glycopeptides are designated using the format [protein]-[glycosylation site (optional)]-[glycan structure]. The protein is generally indicated using the common name (e.g., as indicated in UNIPROT), but abbreviations and/or alternative names may be used as indicated. When present, the glycosylation site (e.g., the amino acid residue to which the glycan structure is connected) is indicated following UNIPROT numbering. When there is no position indicated, the glycosylation occurs at the immunoglobulin constant heavy chain domain 2 (CH2)-84.4 glycosylation site (IMGT numbering system). Glycan structures are presented as four-digit codes. The first digit represents the total number of hexose sugars (e.g., the number of mannose and galactose residues combined); the second digit represents the total number of N-acetylglucosamine residues; the third digit represents the number of fucose residues; and the fourth digit represents the number of sialic acid moieties. In some embodiments (e.g., in humans), sialic acid is N-acetylneuraminic acid (Neu5Ac or NANA). As an example, Hp-241-7602 refers to haptoglobin (protein name) with a glycan at residue 241 (glycosylation site) having 7 hexose sugar residues, 6 N-acetylglucosamine residues, 0 fucose residues, and 2 sialic acid residues.
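The designation format described above can be illustrated with a short parsing sketch. This is only an illustration under stated assumptions: the class and function names are hypothetical, any field between the protein name and the glycan code (e.g., “241” or “J”) is kept as an opaque site label, and each count in the four-digit code is assumed to be a single digit.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Glycopeptide:
    """Hypothetical container for a parsed designation (names are illustrative)."""
    protein: str
    site: Optional[str]  # None when no glycosylation site is given
    hexose: int          # first digit: hexose residues (e.g., mannose + galactose)
    hexnac: int          # second digit: N-acetylglucosamine residues
    fucose: int          # third digit: fucose residues
    sialic_acid: int     # fourth digit: sialic acid moieties


def parse_glycopeptide(designation: str) -> Glycopeptide:
    """Split '[protein]-[site (optional)]-[glycan code]' into its fields."""
    parts = designation.split("-")
    if len(parts) < 2:
        raise ValueError(f"expected at least protein and glycan code: {designation!r}")
    code = parts[-1]
    if len(code) != 4 or not (code.isascii() and code.isdigit()):
        raise ValueError(f"glycan code must be four digits: {code!r}")
    # Assumption: everything between protein and code is the site label.
    site = "-".join(parts[1:-1]) or None
    hexose, hexnac, fucose, sialic_acid = (int(ch) for ch in code)
    return Glycopeptide(parts[0], site, hexose, hexnac, fucose, sialic_acid)


if __name__ == "__main__":
    print(parse_glycopeptide("Hp-241-7602"))
    print(parse_glycopeptide("IgG1-3510"))
```

For a designation without a site, such as IgG1-3510, the sketch leaves the site empty, which corresponds to the CH2-84.4 glycosylation site under the convention stated above.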
In the present disclosure, glycopeptides and glycans may also be depicted schematically (e.g., in
Various methods may be used to measure the relative abundance of the glycopeptides described herein. In some embodiments, the methods comprise a mass spectrometry (MS) technique. In some embodiments, the methods comprise multiple reaction monitoring mass spectrometry (MRM MS). In some embodiments, the methods comprise isolating the biological sample (e.g., serum or plasma) from a subject. In some embodiments, the methods comprise digesting the proteins in the biological sample (e.g., with trypsin), which creates a mixture of peptides and glycopeptides. In some embodiments, measuring the relative abundance of a glycopeptide (or a peptide) comprises calculating the relative response of each glycopeptide as the MS area under the curve of the glycopeptide divided by the MS area under the curve of a non-glycosylated reference peptide from the same protein. This is different from absolute protein concentrations, which are determined by a calibration curve (also called a standard curve). To create the calibration curve, standard proteins are digested with trypsin and a dilution series is made. The dilution series is then analyzed by mass spectrometry.
In some embodiments, the methods provided herein comprise comparing the relative abundance of at least one glycopeptide to an age prediction model. In some embodiments, the age prediction model comprises the relative abundance of the at least one glycopeptide in at least one (e.g., at least two, at least three, at least five, at least 10, at least 20, at least 50, at least 75, at least 100, or more) control biological sample(s), wherein each control biological sample is from a control individual of a known age, thereby determining the age of the biological sample. In some embodiments, the age of the subject is determined based on the age of the biological sample. In some embodiments, the age prediction model comprises the relative abundance of the at least one glycopeptide in a plurality of control biological samples. In some embodiments, a control population of individuals of different ages is used to identify glycopeptides that are associated with age. For example, for each glycopeptide, a scatter plot may be created by plotting the relative abundance of the glycopeptide against age for each control individual. From this scatter plot, a correlation coefficient and p value may be calculated. In some embodiments, a control population of individuals comprises individuals of any age. For example, a control population may be selected to represent the general age distribution of a larger population (e.g., the population the subject of interest is part of).
In some embodiments, the age prediction model comprises a linear regression model or a multiple linear regression model based on a correlation between the relative abundance of the at least one glycopeptide in the at least one control biological sample and the age of the control individual. For example, a single or multiple glycopeptide age prediction classifier (i.e., an age prediction model) may be constructed from the glycopeptides that correlate with age (e.g., as described above). Such an age prediction model can be represented as [Age = X1G1 + X2G2 + . . . + XnGn + C], where X1, X2 . . . Xn represent coefficients, G1, G2 . . . Gn represent glycopeptide abundances, and C represents a constant. In some embodiments, the age prediction model comprises one of the multiple linear regression models described in Table 5.
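Applying a model of the form Age = X1G1 + X2G2 + . . . + XnGn + C can be sketched as below. The function name is hypothetical, and the coefficients, constant, and abundances are arbitrary placeholders chosen only to show the arithmetic; they are not the values of Table 5 and carry no biological meaning.

```python
def predict_age(abundances: dict, coefficients: dict, constant: float) -> float:
    """Evaluate Age = sum(X_i * G_i) + C for the glycopeptides in the model.

    abundances   maps glycopeptide name -> measured relative abundance (G_i)
    coefficients maps glycopeptide name -> regression coefficient (X_i)
    constant     is the intercept term (C)
    """
    missing = sorted(set(coefficients) - set(abundances))
    if missing:
        raise KeyError(f"no measurement for: {', '.join(missing)}")
    return sum(coefficients[name] * abundances[name] for name in coefficients) + constant


if __name__ == "__main__":
    # Placeholder model parameters (assumptions, not values from Table 5).
    example_coefficients = {"IgG1-3510": 120.0, "IgG1-5410": -80.0}
    example_constant = 40.0
    # Placeholder measurements for one biological sample.
    example_abundances = {"IgG1-3510": 0.10, "IgG1-5410": 0.05}
    print(predict_age(example_abundances, example_coefficients, example_constant))
```

In practice the coefficients and constant would be obtained by fitting the regression to the control biological samples of known age, and measurements not used by the model are simply ignored by this sketch.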
In some embodiments, the age prediction models further comprise peptide or protein abundances in addition to glycopeptide relative abundances. As such, in some embodiments, the methods provided herein further comprise measuring a concentration of at least one protein in the biological sample and comparing the concentration of the at least one protein to the age prediction model, wherein the age prediction model further comprises the concentration of the at least one protein in the at least one control biological sample. In some embodiments, the at least one protein comprises any of the proteins in Table 2 herein. In some embodiments, the at least one protein comprises IgG3. Protein or peptide concentrations may be measured using any suitable method. In some embodiments, measuring protein or peptide concentration comprises MS (e.g., MRM MS).
The following embodiments are contemplated. All combinations of features and embodiments are contemplated.
Embodiment 1: A method for determining the age of a biological sample from a subject, the method comprising measuring a relative abundance of at least one glycopeptide in the biological sample and comparing the relative abundance of the at least one glycopeptide to an age prediction model, wherein the age prediction model comprises the relative abundance of the at least one glycopeptide in at least one control biological sample, wherein each control biological sample is from a control individual of a known age, thereby determining the age of the biological sample.
Embodiment 2: An embodiment of embodiment 1, wherein the age of the subject is determined based on the age of the biological sample.
Embodiment 3: An embodiment of embodiment 1 or 2, wherein the at least one glycopeptide comprises any of the glycopeptides in Table 2.
Embodiment 4: An embodiment of any of the embodiments of embodiment 1-3, wherein the at least one glycopeptide comprises IgG1-3510, IgG1-5410, IgM-209-5411, IgM-J-5412, Haptoglobin (Hp)-241-7602, or a combination thereof.
Embodiment 5: An embodiment of any of the embodiments of embodiment 1-4, wherein the at least one glycopeptide comprises IgG1-3510, IgG1-5410, IgM-209-5411, IgM-J-5412, and Haptoglobin (Hp)-241-7602.
Embodiment 6: An embodiment of any of the embodiments of embodiment 1-5, wherein the method further comprises measuring a concentration of at least one protein in the biological sample and comparing the concentration of the at least one protein to the age prediction model, and wherein the age prediction model further comprises the concentration of the at least one protein in the at least one control biological sample.
Embodiment 7: An embodiment of embodiment 6, wherein the at least one protein comprises any of the proteins in Table 2.
Embodiment 8: An embodiment of embodiment 6 or 7, wherein the at least one protein comprises IgG3.
Embodiment 9: An embodiment of embodiment 8, wherein the at least one glycopeptide comprises IgG1-3510, IgG1-5410, IgG2-3410, IgM-209-5411, IgM-J-5412, Hp-241-7602, or a combination thereof.
Embodiment 10: An embodiment of embodiment 8 or 9, wherein the at least one glycopeptide comprises IgG1-3510, IgG1-5410, IgG2-3410, IgM-209-5411, IgM-J-5412, and Hp-241-7602.
Embodiment 11: An embodiment of any of the embodiments of embodiment 1-10, wherein the age prediction model comprises the relative abundance of the at least one glycopeptide in a plurality of control biological samples.
Embodiment 12: An embodiment of any of the embodiments of embodiment 1-11, wherein the biological sample and the control biological sample are liquid samples.
Embodiment 13: An embodiment of any of the embodiments of embodiment 1-12, wherein the biological sample and the control biological sample are blood samples, serum samples, plasma samples, or a combination thereof.
Embodiment 14: An embodiment of any of the embodiments of embodiment 1-13, wherein measuring the relative abundance of the at least one glycopeptide comprises mass spectrometry.
Embodiment 15: An embodiment of any of the embodiments of embodiment 1-14, wherein measuring the relative abundance of the at least one glycopeptide comprises multiple reaction monitoring mass spectrometry.
Embodiment 16: An embodiment of embodiment 15, wherein measuring the relative abundance of the at least one glycopeptide comprises calculating the relative response of the at least one glycopeptide as the area under the mass spectrometry curve of the at least one glycopeptide divided by the area under the curve of a non-glycosylated reference peptide from the same protein as the at least one glycopeptide.
Embodiment 17: An embodiment of any of the embodiments of embodiment 1-16, wherein the age prediction model comprises a linear regression model or a multiple linear regression model based on a correlation between the relative abundance of the at least one glycopeptide in the at least one control biological sample and the age of the control individual.
Embodiment 18: An embodiment of embodiment 17, wherein the age prediction model comprises one of the multiple linear regression models of Table 5.
Embodiment 19: An embodiment of any of the embodiments of embodiment 1-18, wherein the subject is male or female.
Embodiment 20: An embodiment of any of the embodiments of embodiment 1-19, wherein the biological sample is from a criminal forensics investigation.
Disclosed herein are materials, compositions, and methods that can be used for, can be used in conjunction with or can be used in preparation for the disclosed embodiments. These and other materials are disclosed herein, and it is understood that when combinations, subsets, interactions, groups, etc. of these materials are disclosed that while specific reference of each various individual and collective combinations and permutations of these compositions may not be explicitly disclosed, each is specifically contemplated and described herein. For example, if a method is disclosed and discussed, and a number of modifications that can be made to a number of molecules included in the method are discussed, each and every combination and permutation of the method, and the modifications that are possible are specifically contemplated unless specifically indicated to the contrary. Likewise, any subset or combination of these is also specifically contemplated and disclosed. This concept applies to all aspects of this disclosure including, but not limited to, steps in methods using the disclosed compositions. Thus, if there are various additional steps that can be performed, it is understood that each of these additional steps can be performed with any specific method steps or combination of method steps of the disclosed methods, and that each such combination or subset of combinations is specifically contemplated and should be considered disclosed.
Publications cited herein and the material for which they are cited are hereby specifically incorporated by reference in their entireties. The following description provides further non-limiting examples of the disclosed compositions and methods.
The following examples are offered to illustrate, but not to limit the claimed invention.
With knowledge of the collision induced dissociation (CID) behavior of the most abundant serum glycoforms (17,23) (
After the relative contribution of each of the glycopeptides that make up the bulk of the plasma glycome was calculated (
Firstly, it was not uncommon for a glycan at one glycosylation site to positively correlate with the same or highly similar glycans at another distant glycosylation site within the same glycoprotein. In other words, structurally similar glycans often occur at different sites within the same protein. For example, the presence of glycan 5402 at position 176 of Alpha-2-HS-glycoprotein (A2HSG) positively correlated (PPMCC 0.974) with the presence of glycan 5402 at site 156 of A2HSG (P<2E-16) (
In addition to the same or structurally similar glycans tending to occupy different sites within the same protein, glycans of similar structure also tended to occupy the same glycosylation site. For example, the presence of glycan 5411 strongly correlated (PPMCC 0.908) with glycan 5410 at the same site of IgG1 (P<2E-16) (
Although the above examples might seem intuitive, the opposite was also possible, i.e., the relative abundance of a glycan at two different sites within the same glycoprotein can be negatively correlated. For example, glycan 5402 at position 55 of A2MG negatively correlated (PPMCC −0.463) with 5402 at A2MG position 1424 (P=1.84E-06) (
Apart from the intra-protein glycan correlations just described, there were also inter-protein glycan correlations that were of significance, i.e., glycans on different proteins can correlate (positively or negatively) with one another. This was especially true for the different immunoglobulin subclasses. For example, the abundance of glycan modifiers on IgG1 correlated with their identical counterparts on IgG2 (
Finally, in many cases, the relative abundance of a particular glycan at a defined site correlated with the protein's serum concentration. One interesting example is glycan 5402, which had a small positive correlation (PPMCC 0.28) with A1AT's serum concentration when present at A1AT site 70 (P=0.006) but had a strong, highly significant negative correlation (PPMCC −0.81) with the serum concentration of A1AT when present at A1AT site 271 (P<2E-16) (FIG. 5F). Other examples were the non-sialylated N-glycan 7600 and O-glycan 2200 occurring at sites 176 and 346 of A2HSG, respectively. Both glycans had a strong negative correlation with A2HSG serum concentration (PPMCC −0.87, P<2E-16, and PPMCC −0.98, P<2E-16) (
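The PPMCC values reported above are Pearson product-moment correlation coefficients. As a minimal sketch of how such a coefficient can be computed for a pair of analytes, the Python below uses only the standard library. The function names `pearson_r` and `t_statistic` and the four-sample data are hypothetical, and the t statistic shown is the conventional test of a zero correlation; the source does not state which test produced its P values.

```python
import math


def pearson_r(x, y):
    """Pearson product-moment correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def t_statistic(r, n):
    """t statistic (n - 2 degrees of freedom) for the null hypothesis r = 0.

    Assumption: this is the textbook test, not necessarily the one used in the study.
    Undefined for |r| = 1.
    """
    return r * math.sqrt((n - 2) / (1.0 - r * r))


# Hypothetical relative abundances of one glycan at two sites, one value per sample.
site_a = [1.0, 2.0, 3.0, 4.0]
site_b = [1.0, 3.0, 2.0, 4.0]

r = pearson_r(site_a, site_b)
print(round(r, 3), round(t_statistic(r, len(site_a)), 3))
```

A P value would follow from comparing the t statistic against a t distribution with n − 2 degrees of freedom, which requires a statistics library and is omitted here.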
Previous studies conducted mainly on either released glycans or tryptic peptides of purified IgG have demonstrated that age and gender can alter the glycosylation of serum proteins (24-28). Thus, the site-specific glycan alterations that could be attributed to the effects of age and gender were characterized (
Importantly, the specific glycan modifications affected by age were consistent across the different IgG subclasses. For example, for IgG1 and IgG2 subclasses, the non-galactosylated 3510 Fc glycan modification was positively correlated with age (PPMCCs 0.43 and 0.49, respectively) (
Many biological processes are altered by gender and, ultimately, this leads to differences in disease frequencies and treatment outcomes (29,30). Thus, characterizing gender-specific alterations in glycosylation is an important step in developing glycans as biomarkers of human disease.
Since there were 41 statistically significant glycopeptides that correlated with age (Table 2), the question arose whether enough information was held within the human glycome to construct an age prediction model. Linear regression models comprised of either glycopeptides only or a mixture of glycopeptides and proteins were thus constructed utilizing a forward stepwise selection method. A resulting “glycan only” model revealed that five sites of glycosylation (IgG1-3510, IgG1-5410, IgM-209-5411, IgM-J-5412, and Haptoglobin (Hp)-241-7602) were sufficient to accurately predict age (PPMCC 0.81) (
Because model constituents IgG1-5410 and IgM-J-5412 had been previously monitored, a meta-analysis was also conducted to determine the weighted averages of their respective glycan-age correlations. These meta-analyses yielded averages that were highly significant (P<2E-16 and P=8.4E-06, respectively) with no evidence (P=0.27 and P=0.93, respectively) of any substantial residual heterogeneity (i.e. there was no remaining variability in effect sizes that was unexplained) (
A second combined age-prediction model, which included serum protein concentrations as additional variables, was also constructed. The resulting model contained six glycopeptides (IgG1-3510, IgG1-5410, IgG2-3410, IgM-209-5411, IgM-J-5412, Hp-241-7602) and one serum protein (IgG3). This model was also highly accurate in its ability to predict age (PPMCC 0.85; r² = 0.67 ± 0.05, 5-fold CV) (
Study design. The objective of this study was to identify the relative abundance of site-specific glycosylations within the most abundant plasma proteins and then to use this information to make multianalyte classifiers capable of predicting age. Healthy individuals were recruited from the University of California (UC) Davis Medical Center. The University of California, Davis Institutional Review Board (Committee B) approved this study. Research was performed in accordance with relevant guidelines and regulations. All participants provided their written informed consent.
Sample preparation. For each individual enrolled, plasma was separated from whole blood using a Ficoll gradient. From each plasma preparation, a 2-μL aliquot was reduced, alkylated, and then subjected to trypsin digestion at 37° C. (35). To allow for absolute quantification, 100 μg of IgG, IgA and IgM (all from Sigma-Aldrich, St. Louis, MO) was digested according to the same protocol and a dilution series was made prior to sample injection.
UPLC-ESI-QqQ-MS analysis. The neat enzymatically prepared samples containing both peptides and glycopeptides were then directly analyzed without further hands-on sample cleanup or dilution using an Agilent 1290 Infinity liquid chromatography (LC) system coupled to an Agilent 6490 triple quadrupole (QqQ) mass spectrometer (Agilent Technologies, Santa Clara, CA), as previously described (23,35,36). Briefly, an Agilent Eclipse Plus C18 column (RRHD 1.8 μm, 2.1×100 mm) coupled with an Agilent Eclipse Plus C18 pre-column (RRHD 1.8 μm, 2.1×5 mm) was used for UPLC separation. A 1.0-μL aliquot of each digested plasma sample was injected and analyzed using a 25-minute binary gradient of solvent A (3% acetonitrile, 0.1% formic acid) and solvent B (90% acetonitrile, 0.1% formic acid), both in nano-pure water (v/v), at a flow rate of 0.5 mL/min.
The MRM MS method used for this study requires predetermined knowledge of the peptide or glycopeptide's LC retention time and its collision induced dissociation (CID) behavior, which were previously determined for all the non-glycosylated peptides and glycopeptides used in this study (
Statistical analysis. All statistical analyses were done using R software (37). For each analyte, skewness was calculated, and data were log transformed when necessary to remove excessive skewness. Outliers were identified using the R package “extremevalues” (38) and, when present, were winsorized, i.e., set equal to the nearest non-outlier value. All analytes were detected in all samples; thus, there was no need for imputation of missing data. ANCOVA and linear regression assumptions about the normality of residuals were examined with the Shapiro-Wilk test. Collinearity of variables in the multivariate models was examined by calculating the variance inflation factor (considered excessive if >2.5) with the R package “car” (39). Nonlinear relationships between the analytes and the outcome were evaluated with the R package “mfp” using a multiple fractional polynomial method (40). Variable selection in the multiple linear regression analyses was performed by forward stepwise exhaustive search using the “leaps” R package (41). The algorithm searched for the best models of all sizes up to the specified maximum number of variables. To identify the best number of variables, each model's performance was estimated by leave-one-out cross-validation using the “caret” R package (42), and the number with the minimum root-mean-square error (RMSE) was selected. Logistic regression models were fitted using Firth's bias reduction method with the R package “logistf” (43). This package was also used for automated variable selection based on penalized likelihood ratio tests. Model performance estimated by 5-fold cross-validation was calculated using the R package “HandTill2001” (44). Meta-analyses were conducted to assess findings across the multiple datasets using the R package “metafor” (45). A weighted random-effects model was used to estimate a summary effect size. The restricted maximum-likelihood estimator was selected to estimate between-study variance. Weighted estimation with inverse-variance weights was used to fit the model. To present the correlations between all analytes simultaneously, the dimensionality reduction algorithm “t-distributed stochastic neighbor embedding” (t-SNE) was used, as implemented in the R package “Rtsne” (46).
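The variable-selection step described above (build the best model of each size by forward selection, then choose the size with the lowest leave-one-out RMSE) can be sketched as follows. The study used the R packages “leaps” and “caret”; the Python below is only an illustrative stand-in written against the standard library. All function names (`solve`, `fit_ols`, `forward_path`, `loocv_rmse`, `select_model`) and the synthetic data are hypothetical, and the greedy residual-sum-of-squares criterion is an assumption about how each size's model is chosen.

```python
import random


def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def fit_ols(rows, y):
    """Ordinary least squares with an intercept, via the normal equations."""
    z = [[1.0] + list(r) for r in rows]
    p = len(z[0])
    xtx = [[sum(v[i] * v[j] for v in z) for j in range(p)] for i in range(p)]
    xty = [sum(v[i] * t for v, t in zip(z, y)) for i in range(p)]
    return solve(xtx, xty)


def predict(beta, row):
    return beta[0] + sum(b * v for b, v in zip(beta[1:], row))


def take(X, cols):
    """Keep only the analyte columns listed in cols."""
    return [[r[c] for c in cols] for r in X]


def rss(X, y, cols):
    """Residual sum of squares of the model fitted on the given columns."""
    rows = take(X, cols)
    beta = fit_ols(rows, y)
    return sum((predict(beta, r) - t) ** 2 for r, t in zip(rows, y))


def forward_path(X, y, max_vars):
    """One model per size: each step adds the analyte that lowers the RSS the most."""
    selected, remaining, path = [], list(range(len(X[0]))), []
    while remaining and len(selected) < max_vars:
        best = min(remaining, key=lambda c: rss(X, y, selected + [c]))
        selected.append(best)
        remaining.remove(best)
        path.append(list(selected))
    return path


def loocv_rmse(X, y, cols):
    """Leave-one-out cross-validated root-mean-square error."""
    rows = take(X, cols)
    squared = []
    for i in range(len(y)):
        beta = fit_ols(rows[:i] + rows[i + 1:], y[:i] + y[i + 1:])
        squared.append((predict(beta, rows[i]) - y[i]) ** 2)
    return (sum(squared) / len(squared)) ** 0.5


def select_model(X, y, max_vars):
    """Return (RMSE, columns) for the model size with the lowest leave-one-out RMSE."""
    scored = [(loocv_rmse(X, y, cols), cols) for cols in forward_path(X, y, max_vars)]
    return min(scored, key=lambda s: s[0])


# Synthetic data (not study data): 40 samples, 4 analytes, "age" driven by analytes 0 and 2.
rng = random.Random(0)
X = [[rng.gauss(0, 1) for _ in range(4)] for _ in range(40)]
age = [50 + 2 * r[0] - 3 * r[2] + rng.gauss(0, 0.1) for r in X]

best_rmse, best_cols = select_model(X, age, max_vars=4)
print(best_cols, round(best_rmse, 3))
```

Scoring each model size on held-out samples rather than on the training fit is what keeps the procedure from simply preferring the largest model, since training RSS can only decrease as variables are added.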
Described herein, e.g., in Examples 1-4, is a detailed site-specific map of the human serum glycome, which reveals many novel features of glycosylation. In some cases, glycosylation varied with protein abundance, such that a particular site-specific glycosylation became less probable as the serum concentration of the protein increased (
Other interesting phenomena that came to light from the experiments described herein include the observed correlations of site-specific glycosylations across different proteins. This was especially true for IgG1 and IgG2 glycosylations (
Importantly, the MRM MS method described in the Examples herein is substantially different from methods previously employed for analysis of serum IgG glycans (31,32). Specifically, the prior methods required purification of IgG and enzymatic release of the modifying glycans. In contrast, the method described herein was site-specific and required no protein purification. Thus, the glycan mapping results herein differ from those previously reported (31,32). Furthermore, some amount of glycan structural information is inevitably lost during the ionization process. Thus, different ionization and analysis methods will yield different efficiencies of detection for different glycan structures. The methods herein were not used to definitively determine that a certain glycan structure was more prevalent than another at a specific glycosylation site. Rather, they were used to develop a highly precise method of site-specific glycan detection (i.e., a method with high reproducibility;
Age and gender are the covariates most commonly accounted for in biomarker research and discovery. As an aid for future glycan biomarker discovery research, glycan alterations associated with these common covariates were identified. Analysis of a large control group, representing healthy individuals ages 21 to 84 years old, demonstrated that IgM was negatively correlated with age (
The study described herein is unique for a variety of reasons: 1) glycan quantification was site-specific across multiple serum proteins including different Ig classes and subclasses, while previous studies typically focused on characterizing released glycans or glycoprofiled only a few serum proteins (4-16,31,32); 2) the MRM approach eliminated the need for additional protein purification or chemical processing, which allowed for large patient cohorts to be rapidly characterized; 3) the analysis was precise, rapid, and automated for high throughput; 4) it required only 2 μL of serum or plasma and little sample preparation, while current techniques require several mL of blood to quantitate Ig levels; and 5) in addition to total protein quantification, the technique provided the relative abundance of each glycopeptide, making it more suitable for biomarker research and discovery. For these reasons, the use of this approach as a clinical diagnostic tool is very appealing, especially when compared to its more labor-intensive alternatives (4-16,31,32). Glycan analysis may thus be advantageously applied to the diagnosis and management of human diseases, especially diseases of the immune system and cancer.
35 Hong, Q., Lebrilla, C. B., Miyamoto, S. & Ruhaak, L. R. Absolute quantitation of immunoglobulin G and its glycoforms using multiple reaction monitoring. Anal Chem 85, 8585-8593, doi:10.1021/ac4009995 (2013).
The present application claims priority to U.S. Provisional Application No. 63/255,850 filed Oct. 14, 2021, the full disclosure of which is incorporated by reference in its entirety for all purposes.
Filing Document | Filing Date | Country | Kind |
---|---|---|---|
PCT/US2022/046769 | 10/14/2022 | WO |
Number | Date | Country | |
---|---|---|---|
63255850 | Oct 2021 | US |