The disclosure relates to methods for making species-specific phosphorylation site databases and arrays and species-specific kinome arrays.
Protein phosphorylation is believed to be the most widespread mechanism of cellular signalling, with approximately one-third of all proteins in the eukaryotic cell estimated to undergo this post-translational modification (Johnson and Hunter, 2005). A recently developed technology for studying phosphorylation-mediated cellular signalling is the kinome microarray. Each spot on a kinome microarray contains a peptide representing a phosphorylation site (the actual phosphorylated residue, and several surrounding residues) from a given protein. These peptides are capable of being phosphorylated with similar kinase-catalyzed kinetics as the corresponding intact protein (Zetterqvist et al., 1976; Kemp et al., 1977). First proposed and tested in 2002 (Houseman and Mrksich, 2002; Houseman et al., 2002), kinome microarrays have since been used to study signalling in a number of biological systems (e.g. Löwenberg et al., 2005; Sikkema et al., 2009; Schrage et al., 2009).
The abundance of phosphorylation data for human, rat, and mouse in online databases like PhosphoSitePlus (Hornbeck et al., 2004) makes it relatively straightforward to design kinome microarrays for studying these species. Unfortunately, little phosphorylation data are available for other species.
A method of preparing one or more species-specific phosphorylation site database entries for a target organism comprising:
a) selecting a first known non-target organism (NTO) phosphorylation site sequence of a first non-target organism, the first known NTO phosphorylation site sequence comprising at least 5 residues and less than 30 residues and/or 30 or fewer residues;
b) obtaining for the first known NTO phosphorylation site sequence a first cognate known NTO phosphorylation polypeptide sequence corresponding to the first known NTO phosphorylation site sequence, the cognate known NTO phosphorylation polypeptide sequence comprising the first known NTO phosphorylation site sequence;
c) identifying a matching target organism (TO) phosphorylation site sequence for the first known NTO phosphorylation site sequence;
d) obtaining for the matching TO phosphorylation site sequence a cognate TO phosphorylation polypeptide sequence corresponding to the matching TO phosphorylation site sequence, the cognate TO phosphorylation polypeptide sequence comprising the matching TO phosphorylation site sequence;
e) determining a plurality of output values, each output value indicative of a degree of matching between the TO phosphorylation site sequence and the NTO phosphorylation site sequence; and
f) determining a similarity value between the first known NTO phosphorylation polypeptide sequence and the cognate TO phosphorylation polypeptide sequence, wherein the similarity value provides an indication of whether the first known NTO phosphorylation polypeptide sequence and the cognate TO phosphorylation polypeptide sequence are orthologues of each other.
In an embodiment, identifying a matching TO phosphorylation site sequence comprises:
a) retrieving a proteome of the target organism;
b) creating a dataset of target organism polypeptide sequences using the retrieved proteome of the target organism; and
c) querying the dataset of target organism polypeptide sequences.
In another embodiment, a processor executes a software program to retrieve the proteome of the target organism from an electronic database of protein sequence data and wherein the dataset of proteins of the target organism is a BLAST database created using the makeblastdb program.
In yet another embodiment, identifying a matching TO phosphorylation site sequence comprises:
a) comparing the first known NTO phosphorylation site sequence against a plurality of sequences of residues of the dataset of target organism proteins; and
b) determining, as the matching TO phosphorylation site sequence, the sequence of the plurality of sequences of residues of the dataset of target organism proteins that has the greatest number of residues identical to the first known NTO phosphorylation site sequence.
In an embodiment, the identifying of the matching TO phosphorylation site sequence comprises running a blastp search using the first known NTO phosphorylation site sequence as the query and the dataset of target organism proteins as the queried database.
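As an illustration only, the database creation and search described in the embodiments above can be sketched as two BLAST+ command lines assembled in Python. The file names and function names are hypothetical, and the options shown (tabular output via -outfmt 6) are one possible choice rather than the settings of the disclosed software.

```python
def makeblastdb_command(proteome_fasta, db_name):
    """Build the command that creates a protein BLAST database from a proteome FASTA file."""
    return ["makeblastdb", "-in", proteome_fasta, "-dbtype", "prot", "-out", db_name]


def blastp_command(query_fasta, db_name, out_path):
    """Build the command that searches NTO site peptides against the TO database."""
    return [
        "blastp",
        "-query", query_fasta,   # FASTA file of known NTO phosphorylation site sequences
        "-db", db_name,          # database created from the target organism proteome
        "-outfmt", "6",          # tabular output, one hit per line
        "-out", out_path,
    ]


if __name__ == "__main__":
    # Hypothetical file names, used only to show the resulting command lines.
    print(" ".join(makeblastdb_command("target_proteome.fasta", "target_db")))
    print(" ".join(blastp_command("nto_sites.fasta", "target_db", "site_hits.tsv")))
```

The command lists can be passed to subprocess.run on a machine where the BLAST+ programs are installed.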
In another embodiment, the plurality of output values comprises one or more of: a sequence difference, a non-conservative sequence difference, a matching TO phosphorylation site, a 9-mer sequence difference, and a 9-mer non-conservative sequence difference.
In yet another embodiment, the sequence difference is equal to the difference between the number of residues in the first known NTO phosphorylation site sequence and the number of identical residues between the first known NTO phosphorylation site sequence and the matching TO phosphorylation site sequence;
wherein the non-conservative sequence difference is equal to the difference between the number of residues in the first known NTO phosphorylation site sequence and the sum of the number of identical residues between the first known NTO phosphorylation site sequence and the hit sequence and the number of residues of the hit sequence that are conservative substitutions of the corresponding residue of the first known NTO phosphorylation site sequence;
wherein the matching TO phosphorylation site corresponds to a start position of the TO phosphorylation site sequence in the cognate TO phosphorylation polypeptide sequence;
wherein the 9-mer sequence difference is equal to the number of sequence differences, i.e. the count of positions where the two residues are different, in a gapless alignment between a 9-amino-acid long peptide corresponding to the first known NTO phosphorylation site sequence and a 9-amino-acid long peptide corresponding to the matching TO phosphorylation site sequence;
and wherein the 9-mer non-conservative sequence difference is equal to the number of non-conservative sequence differences, i.e. the count of positions where the two residues have a non-positive score, in a gapless alignment between the 9-amino-acid long peptide corresponding to the first known NTO phosphorylation site sequence and the 9-amino-acid long peptide corresponding to the matching TO phosphorylation site sequence.
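The sequence difference and non-conservative sequence difference defined above can be illustrated with a minimal sketch for two gaplessly aligned peptides of equal length. The residue groups below are a simplified, hypothetical stand-in for a substitution matrix such as BLOSUM62 and do not reproduce its scores; the function names and example peptides are likewise illustrative.

```python
# Assumption: residues in the same group count as conservative substitutions.
# This is a simplification for illustration, not the BLOSUM62 matrix.
CONSERVATIVE_GROUPS = [set("AST"), set("DE"), set("NQ"), set("RK"), set("ILMV"), set("FYW")]


def is_conservative(a, b):
    """True if a and b are different residues that fall in the same group."""
    return a != b and any(a in group and b in group for group in CONSERVATIVE_GROUPS)


def sequence_differences(query, hit):
    """Return (sequence difference, non-conservative sequence difference)."""
    if len(query) != len(hit):
        raise ValueError("gapless comparison needs peptides of equal length")
    identical = sum(q == h for q, h in zip(query, hit))
    conservative = sum(is_conservative(q, h) for q, h in zip(query, hit))
    seq_diff = len(query) - identical
    non_conservative_diff = len(query) - (identical + conservative)
    return seq_diff, non_conservative_diff


if __name__ == "__main__":
    # Hypothetical 9-mers: S/T and E/D are conservative here, L/G is not.
    print(sequence_differences("AKRSDSPEL", "AKRTDSPDG"))  # (3, 1)
```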
In an embodiment, determining of the similarity value comprises:
a) retrieving a proteome of the first known non-target organism;
b) creating a dataset of first known NTO phosphorylation polypeptide sequences using the retrieved non-target organism proteome;
c) comparing the first known NTO phosphorylation polypeptide sequence to each of the TO phosphorylation polypeptide sequences of the dataset of TO phosphorylation polypeptide sequences to generate a plurality of TO dataset similarity values;
d) identifying a best TO dataset similarity value (E1B) from the plurality of TO dataset similarity values and identifying a first TO dataset similarity value (E1F) of the match between the first known NTO phosphorylation polypeptide sequence (QF) and the cognate TO phosphorylation polypeptide sequence (HF) from the plurality of TO dataset similarity values;
e) comparing the TO phosphorylation polypeptide sequence to each of the first known NTO phosphorylation polypeptide sequences in the dataset of first known NTO phosphorylation polypeptide sequences to generate a plurality of NTO dataset similarity values;
f) identifying a best NTO dataset similarity value (E2B) from the plurality of NTO dataset similarity values and identifying a first NTO dataset similarity value (E2F) of the match between the first known NTO phosphorylation polypeptide sequence (QF) and the cognate TO phosphorylation polypeptide sequence (HF) from the plurality of NTO dataset similarity values; and
g) if the first TO dataset similarity value equals the best TO dataset similarity value and if the first NTO dataset similarity value equals the best NTO dataset similarity value, determining the first known NTO phosphorylation polypeptide sequence and the cognate TO phosphorylation polypeptide sequence are orthologues of each other.
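The reciprocal comparison in steps c) to g) can be sketched as follows, assuming the similarity values are E-values so that the best value is the lowest one. The dictionaries, accession strings and function name are hypothetical; in practice the E-values would come from the two searches described above.

```python
def is_reciprocal_best_hit(to_evalues, nto_evalues, qf, hf):
    """Decide whether QF and HF are each other's best match.

    to_evalues:  E-values from comparing QF against the TO dataset,
                 keyed by TO accession.
    nto_evalues: E-values from comparing HF against the NTO dataset,
                 keyed by NTO accession.
    """
    if hf not in to_evalues or qf not in nto_evalues:
        return False  # no match between QF and HF in one of the directions
    e1b = min(to_evalues.values())   # best TO dataset similarity value
    e1f = to_evalues[hf]             # value for the QF-to-HF match
    e2b = min(nto_evalues.values())  # best NTO dataset similarity value
    e2f = nto_evalues[qf]            # value for the HF-to-QF match
    return e1f == e1b and e2f == e2b


if __name__ == "__main__":
    # Hypothetical accessions and E-values.
    forward = {"TO_1": 1e-50, "TO_2": 1e-10}
    reverse = {"NTO_1": 1e-48, "NTO_2": 1e-5}
    print(is_reciprocal_best_hit(forward, reverse, "NTO_1", "TO_1"))  # True
```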
In an embodiment, one or more of the similarity values comprises an E-value. In another embodiment, the E-value is selected at less than 10⁻³.
In an embodiment, the method comprises:
In an embodiment, each of the first known NTO phosphorylation site sequence and the matching TO phosphorylation site sequence is at least 8 residues and less than 15 residues in length.
In yet another embodiment, the plurality of output values is displayed. In an embodiment, the plurality of output values is outputted electronically in a delimited plain text format.
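A delimited plain text output of this kind could, for example, be produced as sketched below. The column names and the choice of a tab delimiter are assumptions made for illustration and are not taken from the disclosure.

```python
import csv
import io


def format_output(rows, fieldnames, delimiter="\t"):
    """Serialise output values as delimited plain text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=fieldnames, delimiter=delimiter, lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


if __name__ == "__main__":
    # Hypothetical column names and values.
    fields = ["query_accession", "hit_accession", "sequence_difference"]
    rows = [{"query_accession": "NTO_1", "hit_accession": "TO_1", "sequence_difference": 1}]
    print(format_output(rows, fields), end="")
```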
A further aspect includes a method of making a species-specific array comprising selecting a plurality of matching target organism phosphorylation site sequences according to a method described herein, synthesizing a plurality of peptides, each peptide comprising a sequence of one of the matching target organism phosphorylation site sequences, and attaching the plurality of peptides to a substrate surface.
A further aspect includes a plurality of peptides, each of which comprises a sequence of about 5 to about 100 amino acids, for example about 5 to about 50 amino acids or about 5 to about 30 amino acids, wherein each sequence comprises a contiguous sequence of at least 5 amino acids present in a peptide sequence selected from the group of SEQ ID NOs: 1 to 292, wherein the contiguous sequence comprises a chicken phosphorylation site sequence.
In an embodiment, the plurality of peptides comprises about 5, 10, 15, 20, 25, 50, 75, 100, 125, 150, 175, 200, 225, 250, 275 or 292 peptides each comprising all or part of a sequence selected from SEQ ID NOs: 1-292.
Yet a further aspect includes a species-specific array comprising a plurality of peptides attached to a support surface, each peptide comprising a sequence of about 5 to about 100 amino acids, for example about 5 to about 50 amino acids or about 5 to about 30 amino acids or about 8 to about 15 amino acids, wherein the sequence is a matching target organism phosphorylation site sequence selected as described herein, wherein the similarity is below a preselected threshold.
In an embodiment, the plurality of array peptides comprises at least 100, 200, or 292 matching target organism phosphorylation site sequences.
In an embodiment, the array further comprises one or more negative control peptides and/or one or more positive control peptides.
In a further embodiment, the array is a chicken species array and the plurality of peptides are chicken peptides.
In an embodiment, the plurality of array peptides comprises about 5, 10, 15, 20, 25, 50, 75, 100, 125, 150, 175, 200, 225, 250, 275 or 292 peptides each comprising all or part of a sequence selected from SEQ ID NOs: 1-292.
In another embodiment, each of a plurality of the array peptides is 8-15 contiguous amino acids of a sequence selected from SEQ ID NOs: 1-292.
In yet another embodiment, a plurality of the array peptides is spotted in replicates of 2, 3, 4, 5, 6, 7, 8, or 9 or more.
Another aspect includes a method of determining kinome activity of a test sample comprising:
a) incubating an array described herein with the test sample to provide a test array and optionally incubating a second array described herein with a comparator sample to provide a comparator array; and
b) measuring a phosphorylation level signal intensity for each of the plurality of peptides for the test array and optionally the comparator array, wherein the phosphorylation level signal intensity results from the interaction of the sample with each of the plurality of peptides;
wherein the kinome activity is determined by identifying an increased or decreased phosphorylation level of one or more of the plurality of peptides on the test array compared to the comparator or an internal control.
A further aspect includes a method of determining a phosphorylation profile of a test sample comprising:
a) incubating a species-specific array described herein with the test sample to provide a test array; and
b) measuring a phosphorylation level signal intensity for each of the plurality of peptides for the test array providing a test array phosphorylation profile, wherein the phosphorylation level signal intensity results from the interaction of the sample with each of the plurality of peptides.
In an embodiment the method further comprises incubating a species-specific array with a comparator sample to provide a comparator array; measuring a phosphorylation level signal intensity for each of the plurality of peptides for the comparator array wherein the phosphorylation level signal intensity results from the interaction of the sample with each of the plurality of peptides and comparing the test array phosphorylation profile to the comparator phosphorylation profile to detect one or more differentially phosphorylated peptides.
In an embodiment, the comparator sample is a control that can correspond to background. In an embodiment, the comparator sample is a test sample.
A further aspect includes a non-transitory computer-readable storage medium upon which a plurality of instructions are stored, the instructions for performing the steps of:
a) querying a dataset comprising a plurality of target organism (TO) polypeptide sequences with a selected plurality of known NTO phosphorylation site sequences (query phosphorylation site sequences) to identify for each of the plurality a matching TO phosphorylation site sequence;
b) obtaining for each of the matching TO phosphorylation site sequences a cognate TO phosphorylation polypeptide sequence corresponding to the matching TO phosphorylation site sequence, the cognate TO phosphorylation polypeptide sequence comprising the matching TO phosphorylation site sequence;
c) determining a plurality of output values, one or more of the output values being indicative of a degree of matching between the TO phosphorylation site sequence and the NTO phosphorylation site sequence; and
d) determining a similarity value between the first known NTO phosphorylation polypeptide sequence and the cognate TO phosphorylation polypeptide sequence, wherein the similarity value provides an indication of whether the first known NTO phosphorylation polypeptide sequence and the cognate TO phosphorylation polypeptide sequence are orthologues of each other.
In an embodiment, the instructions are further for performing the steps of displaying the matching TO phosphorylation site sequences and/or cognate TO sequence accession numbers when the similarity value is below a preselected threshold.
Another aspect includes a system for preparing one or more species-specific phosphorylation site database entries for a target organism, the system comprising:
a) a memory for storing a plurality of instructions; and
b) a processor coupled to the memory for:
A further aspect includes a kit comprising a plurality of peptides described herein, an array described herein, and/or a kit control and/or package housing the peptides, array and/or kit control.
Other features and advantages of the present disclosure will become apparent from the following detailed description. It should be understood, however, that the detailed description and the specific examples while indicating preferred embodiments of the disclosure are given by way of illustration only, since various changes and modifications within the spirit and scope of the disclosure will become apparent to those skilled in the art from this detailed description.
An embodiment of the disclosure will now be discussed in relation to the drawings in which:
The kinome microarray is a relatively new technology for studying phosphorylation-mediated cellular signalling. Other than for human, rat, and mouse, relatively little phosphorylation data are available for most organisms, making it difficult to design kinome microarrays suitable for studying them. Recently a protocol was developed for leveraging known phosphorylation sites from one organism to identify putative sites in a different organism. While effective, this procedure is time-consuming, tedious, and cannot feasibly make use of even a small fraction of the known phosphorylation sites. Methods and systems for identifying putative phosphorylation sites in an organism of interest are provided. In an embodiment, the disclosure includes a collection of Perl scripts called Design Array for PhosPhoryLation Experiments (DAPPLE) that automates the identification of putative phosphorylation sites in an organism of interest, improving and accelerating the process of designing kinome microarrays, for example for species other than human, rat, and mouse.
As used in this specification and the appended claims, the singular forms “a”, “an” and “the” include plural references unless the content clearly dictates otherwise.
The term “accession number” as used herein refers to a code such as a Genbank accession number that uniquely identifies a particular polypeptide sequence (e.g. protein or part thereof) and/or DNA encoding said polypeptide or part thereof.
The term “corresponds to” as used herein, in the context of a first sequence and a second sequence from the same species, refers to sequences that derive from the same protein, e.g. a phosphorylation site sequence and a full length polypeptide which contains the phosphorylation site sequence. Similarly, a “corresponding protein identifier” for a first sequence from the same species refers to a protein identifier, such as an accession number, that identifies the protein containing the first sequence. As another example, reference to a “matching target organism (TO) phosphorylation site sequence that corresponds to an orthologue polypeptide of the known non-target organism (NTO) phosphorylation polypeptide sequence” means that the matching TO phosphorylation site sequence is found in the same protein which is an orthologue of the NTO phosphorylation polypeptide sequence protein.
The term “E-value” or “Expect value” as used herein has the same meaning as provided by National Center for Biotechnology Information (NCBI) and means a parameter that describes the number of hits one can “expect” to see by chance when searching a database of a particular size. It decreases exponentially as the Score (S) of the match increases. For example, an E-value of 1 assigned to a hit can be interpreted as meaning that in a database of the current size one might expect to see 1 match with a score equal to or greater than the score actually observed simply by chance. The smaller the E-value, or the closer it is to zero, the more “significant” the match is. However, keep in mind that virtually identical short alignments have relatively high E-values. This is because the calculation of the E-value takes into account the length of the query sequence. These high E-values make sense because shorter sequences have a higher probability of occurring in the database purely by chance.
The phrase “cognate TO phosphorylation polypeptide sequence” and/or HF as used herein means a polypeptide sequence that comprises the TO phosphorylation site sequence and has for example the same accession number as the TO phosphorylation site sequence e.g. they relate to the same protein. The cognate TO phosphorylation polypeptide sequence is longer in length than the TO phosphorylation site sequence, and can for example comprise the full length sequence of the protein or a part thereof. For example, each TO phosphorylation site sequence is identified by screening a database of polypeptides and accordingly its sequence is contained within a protein and is understood to correspond to the protein from which it derives. Accordingly, the TO phosphorylation site sequence and the TO phosphorylation polypeptide sequence correspond to the same protein, for example as defined by a protein identifier such as an accession number. The TO phosphorylation polypeptide sequence can for example consist of at least 10%, 20%, 30%, 40%, 50%, 60%, 70%, 80%, 90% or the full length of the protein.
The phrase “cognate known NTO phosphorylation polypeptide sequence” and/or QF as used herein means a polypeptide sequence that comprises the NTO phosphorylation site sequence and has for example the same accession number as the NTO phosphorylation site sequence e.g. they relate to the same protein. The cognate NTO phosphorylation polypeptide sequence is longer in length than the NTO phosphorylation site sequence and can for example comprise the full length sequence of the protein or a part thereof. For example, NTO phosphorylation site sequence is identified by screening a database of polypeptides and accordingly its sequence is contained within a protein and is understood to correspond to the polypeptide/protein from which it derives. Accordingly, the NTO phosphorylation site sequence and the NTO phosphorylation polypeptide sequence correspond to the same protein, for example as defined by a protein identifier such as an accession number. The NTO phosphorylation polypeptide sequence can for example consist of at least 10%, 20%, 30%, 40%, 50%, 60%, 70%, 80%, 90% or the full length of the protein.
The term “low-throughput references” as used herein indicates the number of references to peer-reviewed papers in which the authors employed low-throughput biological techniques (techniques capable of analyzing only one or a few phosphorylation sites at a time) to characterize the known phosphorylation site and the term “high-throughput references” indicates the number of references to peer-reviewed papers in which the authors used high-throughput biological techniques (techniques capable of analyzing many phosphorylation sites at a time, like mass spectrometry) to characterize the known phosphorylation site. The number of low-throughput references and high-throughput references are provided for example by the PhosphoSitePlus database for each known phosphorylation site.
The phrase “matching TO phosphorylation site sequence” and/or “H” refers to a TO polypeptide sequence consisting of at least 5 residues and less than 30 residues or at least 5 residues and 30 or fewer residues (or corresponding nucleotide residues) that has the highest similarity of a plurality of TO polypeptide sequences, for example the highest percent identity of the TO proteome polypeptides over for example a portion of H, with a corresponding NTO phosphorylation site sequence, and which is identified for example by querying a TO polypeptide database with a known NTO phosphorylation site sequence query. As an example, a matching TO phosphorylation site sequence may have 0, 1, 2, 3, 4, 5 or 6 or more residues that are different from the NTO query sequence, for example depending on the length of the query sequence. The phosphorylation site or site phosphorylated in H is HC, the description of the protein/polypeptide comprising H can be HD, HL can be the number of residues in H (e.g. length) and HA can be the accession number associated with H, for example a Genbank accession number.
The phrase “matching TO phosphorylation polypeptide sequence” as used herein refers to a TO polypeptide sequence consisting of all or part of the corresponding protein and that has the highest similarity of a plurality of TO polypeptide sequences, for example the highest percent identity of the TO proteome polypeptides, with a corresponding NTO phosphorylation polypeptide sequence, which is identified for example by querying a TO polypeptide database with a known NTO phosphorylation polypeptide sequence. As an example, a first polypeptide sequence (e.g. a TO phosphorylation polypeptide sequence) will match a second polypeptide sequence (e.g. a NTO phosphorylation polypeptide sequence) if the E-value is less than a preselected value, for example 10⁻³.
As used herein, “NTO phosphorylation site sequence” and “Q” refer to a known phosphorylation site sequence, which can be for example from 5 amino acid residues (or corresponding nucleotide residues e.g. 15 nucleotides) to about and including 30 amino acids (or corresponding nucleotides e.g. about 90 nucleotides) and which is used as a “query” sequence in the methods described. The NTO phosphorylation site sequence can be any string of amino acids (or corresponding nucleotides) found in the NTO that is known (or suspected) to have a residue that is phosphorylated. For example, any string of any amino acids comprising at least one of “serine”, “threonine” or “tyrosine” or encoding at least one of these, can be suspected of having a residue that is phosphorylated. The phosphorylation residue can be for example in the middle position of Q (e.g. amino acid residue 8 for a 15 amino acid query sequence) or for example any position. The phosphorylation site or site phosphorylated in Q is QC, the organism can be QO and the description of the protein/polypeptide comprising Q can be QD, QL can be the number of residues in Q (e.g. length) and QA can be the accession number associated with Q, for example a Genbank accession number.
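As an illustration of a query sequence with the phosphorylated residue in the middle position, the sketch below cuts a 15-residue window (7 residues on each side) around a given site in a protein sequence. The function name, the example protein and the decision to truncate the window at the protein termini are assumptions made for illustration.

```python
def site_window(protein, site_pos, flank=7):
    """Return the peptide around a phosphorylation site.

    site_pos is 1-based. With flank=7 the site is residue 8 of a 15-mer,
    unless the window is truncated at either end of the protein.
    """
    if not 1 <= site_pos <= len(protein):
        raise ValueError("site position lies outside the protein sequence")
    start = max(0, site_pos - 1 - flank)
    end = min(len(protein), site_pos + flank)
    return protein[start:end]


if __name__ == "__main__":
    # Hypothetical protein sequence with a serine at position 11.
    protein = "MKTAYIAKQRSPSLEDVKGHWT"
    print(site_window(protein, 11))  # AYIAKQRSPSLEDVK
```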
The term “non-conservative sequence change” as used herein means when referring to an amino acid sequence, a corresponding (e.g. aligned) amino acid residue between a first sequence and a second sequence, wherein the amino acid residue in the first sequence is not a conservative or semi-conservative substitution of the corresponding amino acid in the second sequence, e.g. the polarity of the amino acid residue (or other biochemical property) in the first sequence is markedly different from the polarity (or other biochemical property) of the corresponding amino acid residue in the second sequence. For example, replacing one amino acid residue with another having similar hydrophobicity and/or molecular side chain bulk can be considered a conservative sequence change. As an example, blastp as a default uses the substitution matrix BLOSUM62 to assess conservative and non-conservative substitutions. However the user can specify a substitution matrix that fits a particular sequence comparison context. As examples, alanine, serine and threonine are considered conservative substitutions, as are aspartic acid and glutamic acid, or asparagine and glutamine. Similarly, arginine and lysine are commonly considered conservative substitutions, as are isoleucine, leucine, methionine and valine. Phenylalanine, tyrosine and tryptophan are also considered conservative changes. Non-conservative changes would include for example alanine and aspartic acid; serine and aspartic acid; or arginine and valine.
Homologues are proteins that have shared evolutionary ancestry. Most homologues are orthologues or paralogues. Orthologues are proteins from different species that evolved from a common ancestral gene by speciation, and which typically retain the same function in the course of evolution. The term “orthologous polypeptide” refers to a protein that is the orthologue of the protein in another species. Paralogues are proteins in the same species, one of which resulted from a genetic duplication of the other.
As used herein, “peptide array” or “array” means a plurality of peptides coupled to a support, wherein each peptide comprises a putative or known phosphorylation motif, e.g. a phosphorylation site sequence. An array can be for example a two-dimensional arrangement of a plurality of peptide molecules, each peptide comprising a known or putative phosphorylation site, attached on a support surface such as a slide or a bead. Arrays are generally comprised of regular, ordered peptide molecules, as in for example, a rectilinear grid, parallel stripes, spirals, and the like, but non-ordered arrays may be advantageously used as well. The arrays generally comprise in the range of about 2 to about 3000 different peptides, more typically about 2 to about 1,200 different peptides. The array can for example comprise 25, 50, 100, 150, 200, 250, 300, 400, 500, 1000, 1200 or more different peptides, spotted in a single replicate, or in replicates of 2, 3, 4, 5, 6, 7, 8, or 9 or greater. For example, depending on the dataset to be obtained, the peptide array can comprise peptides with known phosphorylation motifs (e.g., phosphorylation site sequences), optionally phosphorylation motifs for proteins that are found in a signaling pathway or related pathways. Such peptide arrays can be useful for deciphering peptides phosphorylated or signaling pathways activated by a stressor such as an infectious agent or a macromolecule. The peptide molecules comprise for example peptides or parts thereof, selected from the peptides listed in Table 6.
The term “attached,” as in, for example, a support surface having a peptide molecule “attached” thereto, includes covalent binding, adsorption, and physical immobilization. The terms “binding” and “bound” are identical in meaning to the term “attached.” The peptide can for example be attached via a flexible linker.
Alternatively, the peptide array can comprise random peptide sequences comprising putative phosphorylation sites wherein the plurality of peptides or a subset thereof comprises at least one of a serine, threonine or tyrosine residue.
The term “peptide molecule” or “peptide” as used herein includes a molecule comprising a chain of 5 or more amino acids comprising optionally a known or putative phosphorylation site or optionally in the case of a control peptide, the lack of a phosphorylation site. A peptide in the context of a peptide array typically comprises a peptide having from about 5 to about 21 amino acid residues or any number in between. The peptide can also be longer, for example up to 30 amino acids, up to 50 amino acids or up to 100 amino acids. For example, the peptide can comprise a sequence listed in Table 6 and additional surrounding cognate protein sequence which can be identified according to the corresponding accession number. An amino acid linker can also be included. A polypeptide and/or protein can comprise any length of amino acid residues. In an embodiment, the term “peptide” for example when used as a probe on an array refers to a peptide comprising at least 5 residues and less than 30 residues and/or 30 or fewer residues.
The phrase “phosphorylation site sequence” means a polypeptide sequence consisting of at least 5 residues and less than 30 residues and/or 30 or fewer residues (for example 15 residues) and that comprises at least one serine, threonine or tyrosine residue phosphorylatable by one or more kinases.
For example, the peptide or phosphorylation site sequence can be 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29 or 30 residues.
As used herein, the term “plurality of peptides” means at least 2, for example at least 3 peptides, at least 4 peptides, at least 5 peptides, at least 10, at least 15, at least 25 peptides, at least 50 peptides, at least 100 peptides, at least 200 peptides, at least 300 peptides, at least 400, at least 500 or at least 1000 or any number in between 2 and 1000.
The term “proteome” as used herein refers to the set of polypeptides expressed by a particular organism, optionally under control or test conditions. The term “subproteome” refers to a subset of the set of polypeptides comprised in a proteome, for example, a subset expressed under a specified test condition e.g. stimulated, or a subset that corresponds to a group of proteins e.g. immune system proteins.
The term “phosphorylation profile” or “subject phosphorylation profile” as used herein refers to, for a plurality (e.g. at least 2, for example 5) of peptides and/or their corresponding proteins, phosphorylation signal intensities detectable after contacting a sample from a subject with the plurality of peptides under conditions that permit peptide phosphorylation as would be known to a person skilled in the art (e.g. temperature, buffer constituents, presence of ATP and/or other suitable ATP source etc.). The plurality of peptides optionally comprises at least 2, at least 3, at least 4, at least 5, or more of the peptides listed in Table 6, including for example any number of peptides between 2 and 292.
The term “determining a phosphorylation level” or “determining a phosphorylation profile” as used herein means contacting a reagent such as a peptide, or a plurality of peptides, with a sample, for example a sample of the subject chicken and/or a control sample, for ascertaining or measuring quantitatively, semi-quantitatively or qualitatively the amount of peptide phosphorylation signal intensity. For example, the plurality of peptides can be comprised in an array (e.g. on a slide or beads) as described herein and phosphorylation specific stains such as fluorescent ProQ Diamond Phosphoprotein Stain (Invitrogen) and “Stains-All” (1-ethyl-2-[3-(3-ethylnaphtho [1,2]thiazolin-2 ylidene)-2-methylpropenyl]-naphtha [1,2]thiazolium bromide) and/or labeled ATP such as radiolabelled ATP can be used to detect phosphorylation. The phosphorylation signal can be detected by a number of methods known in the art such as using phosphospecific antibodies directly or indirectly labeled and/or using a method disclosed herein. For example, a phosphospecific detection agent such as an antibody, for example a labeled antibody, which specifically binds the phosphorylated forms of peptides, can be used to detect relative or absolute amounts of peptide phosphorylation.
The term “difference in the level” as used herein in comparison to a control (e.g. to a phenotype reference phosphorylation profile) or an internal control refers to a measurable difference in the level or quantity of peptide phosphorylation in a test sample compared to the control that is of sufficient magnitude to allow assessment, for example, of a statistically significant difference. For example, a difference in a level of peptide phosphorylation is detected if a ratio of the level in a test sample as compared with a control is greater than 1.2, for example greater than 1.3, 1.4, 1.5, 1.6, 1.7, 2, 2.5 or 3 or more, and/or if the difference has a p-value of less than 0.1, 0.05 or 0.01.
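By way of non-limiting illustration, the ratio and p-value criteria above can be sketched as follows. The function name `difference_detected`, its parameter names and the reading of “and/or” as “either criterion suffices” are assumptions made for this sketch; the p-value is taken as an input computed elsewhere.

```python
from typing import Optional


def difference_detected(
    test_level: float,
    control_level: float,
    min_ratio: float = 1.2,
    p_value: Optional[float] = None,
    max_p: float = 0.05,
) -> bool:
    """Return True if the test/control ratio exceeds min_ratio, or if a
    supplied p-value is below max_p (assumption: either criterion suffices)."""
    if control_level <= 0:
        raise ValueError("control_level must be positive to form a ratio")
    ratio_ok = (test_level / control_level) > min_ratio
    p_ok = p_value is not None and p_value < max_p
    return ratio_ok or p_ok
```

The thresholds 1.2 and 0.05 are two of the example values recited above; a stricter design could require both criteria at once.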
The term “phosphorylation level” as used herein in reference to a peptide phosphorylation refers to a phosphorylation signal intensity that is detectable or measurable in a sample and/or control.
The term “measuring” or “measurement” as used herein refers to the application of an assay to assess the presence, absence, quantity or amount (which can be a relative or absolute amount) of a given substance within a subject-derived sample, including the derivation of qualitative or quantitative concentration levels of such substances.
The term “sequence identity” as used herein refers to the percentage of sequence identity between two polypeptide sequences or two nucleic acid sequences. To determine the percent identity of two amino acid sequences or of two nucleic acid sequences, the sequences are aligned for optimal comparison purposes (e.g., gaps can be introduced in the sequence of a first amino acid or nucleic acid sequence for optimal alignment with a second amino acid or nucleic acid sequence). The amino acid residues or nucleotides at corresponding amino acid positions or nucleotide positions are then compared. When a position in the first sequence is occupied by the same amino acid residue or nucleotide as the corresponding position in the second sequence, then the molecules are identical at that position. The percent identity between the two sequences is a function of the number of identical positions shared by the sequences (i.e., % identity = (number of identical overlapping positions/total number of positions) × 100%). In one embodiment, the two sequences are the same length. The determination of percent identity between two sequences can also be accomplished using a mathematical algorithm. A preferred, non-limiting example of a mathematical algorithm utilized for the comparison of two sequences is the algorithm of Karlin and Altschul, 1990, Proc. Natl. Acad. Sci. U.S.A. 87:2264-2268, modified as in Karlin and Altschul, 1993, Proc. Natl. Acad. Sci. U.S.A. 90:5873-5877. Such an algorithm is incorporated into the blastn and blastp programs of Altschul et al., 1990, J. Mol. Biol. 215:403. BLAST nucleotide searches can be performed with the blastn nucleotide program parameters set to default parameters or, e.g., wordlength=28. BLAST protein searches can be performed with the blastp program parameters set to default parameters or, e.g., wordlength=3 to obtain amino acid sequences homologous to a polypeptide molecule of the present disclosure.
To obtain gapped alignments for comparison purposes, Gapped BLAST can be utilized as described in Altschul et al., 1997, Nucleic Acids Res. 25:3389-3402. Alternatively, PSI-BLAST can be used to perform an iterated search which detects distant relationships between molecules (Id.). When utilizing BLAST, Gapped BLAST, and PSI-Blast programs, the default parameters of the respective programs (e.g., of blastp and blastn) can be used (see, e.g., the NCBI website). The percent identity between two sequences can be determined using techniques similar to those described above, with or without allowing gaps. In calculating percent identity, typically only exact matches are counted.
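As a minimal sketch of the percent identity calculation described above, the following function counts exact matches over two sequences that have already been aligned to equal length. The function name `percent_identity` and the use of “-” as the gap character are assumptions for illustration; the alignment itself is taken as given.

```python
def percent_identity(aligned_a: str, aligned_b: str, gap: str = "-") -> float:
    """Percent identity = identical positions / total aligned positions * 100.

    Only exact matches are counted; a gap never counts as a match.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    if not aligned_a:
        raise ValueError("aligned sequences must not be empty")
    identical = sum(
        1 for x, y in zip(aligned_a, aligned_b) if x == y and x != gap
    )
    return 100.0 * identical / len(aligned_a)
```

For two aligned 7-residue sequences differing at one position, the function returns 6/7 × 100.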
The phrase “similarity value” as used herein means a value that indicates that two sequences are likely orthologues based on the similarity of the sequences. The similarity value can for example be a reciprocal blast hit (RBH) value, which for example is identified by taking a known sequence, BLAST searching it against the gene sequences of the target organism, taking the highest scoring hit (e.g. lowest E-value) and BLAST searching the hit against a database of gene sequences of the known organism to determine if the known sequence is the best match (e.g. lowest E-value) and therefore a putative orthologue; or an E-value, percent similarity or other similar value. The similarity value is in an embodiment an E-value which gives for example an indication of whether the blast hit is a homologue and/or orthologue. For example, when comparing two sequences, a small E-value, for example below a selected threshold, indicates that the sequences are likely orthologues and/or homologues. The smaller the E-value, the greater the likelihood of similarity. Correspondingly, a large E-value, for example above a selected threshold, indicates the two sequences are likely not orthologues. The larger the E-value, the less likely the two sequences are orthologues. As another example, a high percentage identity can be indicative that the two sequences are orthologues. The higher the percentage identity, the greater the likelihood the two sequences are orthologues and the lower the percentage identity, the greater the likelihood the two sequences are not orthologues. Although percent identity can also be used, E-value is preferable as the E-value takes into account sequence length, database size, etc. In embodiments where the similarity value is an E-value, the smaller the similarity value, the greater the likelihood the sequences are orthologues.
In embodiments when comparing a similarity value to a preselected threshold, a person skilled in the art would understand that if other similarity parameters are used (e.g. other than E-value) such as percent identity where the larger the value the greater the likelihood two sequences are orthologues, the inverse number e.g. 1/(percent identity), can be used to compare to the preselected threshold e.g. such that a similarity value below a preselected threshold is indicative of the two sequences being orthologues.
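The threshold comparison described above can be illustrated with a short sketch in which an E-value is used directly and a percent identity is inverted so that, in both cases, a smaller score indicates likely orthologues. The names `as_threshold_score` and `likely_orthologues` and the string labels for the kind of value are hypothetical.

```python
def as_threshold_score(value: float, kind: str) -> float:
    """Map a similarity parameter to a score where smaller means more similar."""
    if kind == "evalue":
        return value
    if kind == "percent_identity":
        # Inverse of percent identity, as described above; zero identity
        # maps to infinity so that it never falls below a threshold.
        return float("inf") if value <= 0 else 1.0 / value
    raise ValueError(f"unsupported similarity parameter: {kind}")


def likely_orthologues(value: float, kind: str, threshold: float) -> bool:
    """True when the (possibly inverted) similarity value is below threshold."""
    return as_threshold_score(value, kind) < threshold
```

With this convention a threshold of 1/80 on the inverted scale corresponds to requiring more than 80% identity.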
The phrase “species-specific phosphorylation site” as used herein means a sequence of amino acid residues which comprise a known or putative phosphorylation site of a specific target organism. The species-specific phosphorylation sites are identified for example by comparison to known phosphorylation sites of another species in orthologous polypeptides.
The phrase “species-specific phosphorylation site database” as used herein means a plurality of polypeptide sequences and corresponding annotations of a particular organism, wherein each sequence comprises a putative phosphorylation site. The sequences and annotations can be digitized and stored for retrieval, for example on a storage medium.
As used herein “target organism” means the species for which the user wants to design a database or a kinome array.
The term “sample” as used herein means any biological fluid or tissue sample from a subject, or fraction thereof which can be assayed for kinase activity, including for example, a lysate of a part of an organism or cell population wherein the cell population is obtained from a subject. The sample can be an experimental sample treated with a stressor or a control that is optionally untreated or treated with a control treatment (e.g. vehicle only). Depending on the stressor, an appropriate control treatment can be a vehicle only treatment (e.g. stressor dissolution agent) or a control treatment that is similar in composition to the stressor treatment but lacking the specificity of the stressor. For example a control treatment for a macromolecule, such as a peptide or RNA that induces a sequence specific cell response, can comprise a scrambled macromolecule, e.g. sequence scrambled peptide or RNA molecule. Similarly an isotype control antibody can be used as a control treatment wherein the stressor is an antibody. Any population of cells can be treated. For example, the cell or population of cells can comprise subject cells from multiple subjects, each sample optionally corresponding to a different subject, wherein one or more subsets of cells from each subject are treated with a stressor, optionally in vivo (e.g. an animal challenge) or in vitro (e.g. ex vivo treated primary cells). The cells are optionally clonal cells (e.g. cell culture experiment) and comprise propagated cells under defined conditions. Wherein multiple stressors are being compared or when using cells from one or more subjects, a biological control dataset for the same subject and/or sample treatment is optionally obtained and optionally subtracted from an experimental dataset (e.g. a control dataset comprising phosphorylation signal intensities corresponding to an unstimulated level of kinase activity is subtracted from each treatment dataset).
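The optional subtraction of a biological control dataset from an experimental dataset can be sketched as below, assuming each dataset is held as a mapping from a peptide identifier to a phosphorylation signal intensity. The function name `subtract_control` and the dictionary representation are assumptions for this sketch only.

```python
from typing import Dict


def subtract_control(
    treatment: Dict[str, float], control: Dict[str, float]
) -> Dict[str, float]:
    """Subtract the control intensity from the treatment intensity per peptide."""
    missing = treatment.keys() - control.keys()
    if missing:
        raise KeyError(f"no control intensity for peptides: {sorted(missing)}")
    return {peptide: treatment[peptide] - control[peptide] for peptide in treatment}
```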
The term “phenotype” as used herein means a physical, behavioural, developmental, physiological, or biochemical characteristic of an organism, determined by genetic makeup and/or environmental influences.
The term “reference phosphorylation profile” or “phenotype reference phosphorylation profile” as used herein refers to a suitable comparison profile, for example which comprises the phosphorylation characteristics of a plurality of peptides, for example selected from the peptides listed in Table 6, associated with a particular phenotype. The reference phosphorylation profiles are compared to subject phosphorylation profiles for a plurality of peptides. A subject can be classified by comparing to a phenotype reference phosphorylation profile, where the phenotype reference phosphorylation profile most similar to the subject profile is indicative that the subject is likely to express the phenotype associated with the phenotype reference phosphorylation profile.
The term “similar” in the context of a phosphorylation level as used herein refers to a subject phosphorylation level for a peptide that falls within the range of levels associated with a particular class. Accordingly, “detecting a similarity” refers to detecting a phosphorylation level (or levels) that falls within the range of levels associated with a particular class. In the context of a reference phosphorylation profile, a subject profile is “similar” to a reference phosphorylation profile associated with a phenotype if the subject profile shows a number of identities and/or degree of changes (e.g. in terms of direction of phosphorylation (increased or decreased) and/or magnitude) with the reference phosphorylation profile.
The term “most similar” in the context of a reference phosphorylation profile refers to a reference phosphorylation profile that shows the greatest number of identities and/or degree of changes with the subject phosphorylation profile.
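One possible way to identify the “most similar” reference phosphorylation profile is sketched below. It assumes, for illustration only, that each profile is reduced to a direction of change per peptide (+1 increased, -1 decreased, 0 unchanged) and that similarity is the count of peptides with identical direction; magnitude is ignored in this simplified sketch, and the names `count_identities` and `most_similar_reference` are hypothetical.

```python
from typing import Dict

Profile = Dict[str, int]  # peptide identifier -> direction of change


def count_identities(subject: Profile, reference: Profile) -> int:
    """Count peptides present in both profiles with the same direction."""
    shared = subject.keys() & reference.keys()
    return sum(1 for peptide in shared if subject[peptide] == reference[peptide])


def most_similar_reference(subject: Profile, references: Dict[str, Profile]) -> str:
    """Return the name of the reference profile with the most identities.

    Ties resolve to the first reference in insertion order.
    """
    if not references:
        raise ValueError("at least one reference profile is required")
    return max(references, key=lambda name: count_identities(subject, references[name]))
```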
The term “kit control” as used herein means a suitable assay standard or reference reagent useful when determining a phosphorylation level of a peptide, for example a peptide that is known to be phosphorylated or not phosphorylated under the conditions of the assay or for example a peptide corresponding to a substrate of a kinase with constitutive activity.
In understanding the scope of the present disclosure, the term “comprising” and its derivatives, as used herein, are intended to be open ended terms that specify the presence of the stated features, elements, components, groups, integers, and/or steps, but do not exclude the presence of other unstated features, elements, components, groups, integers and/or steps. The foregoing also applies to words having similar meanings such as the terms, “including”, “having” and their derivatives. Finally, terms of degree such as “substantially”, “about” and “approximately” as used herein mean a reasonable amount of deviation of the modified term such that the end result is not significantly changed. These terms of degree should be construed as including a deviation of at least ±5% of the modified term if this deviation would not negate the meaning of the word it modifies.
In understanding the scope of the present disclosure, the term “consisting” and its derivatives, as used herein, are intended to be close ended terms that specify the presence of stated features, elements, components, groups, integers, and/or steps, and also exclude the presence of other unstated features, elements, components, groups, integers and/or steps.
The recitation of numerical ranges by endpoints herein includes all numbers and fractions subsumed within that range (e.g. 1 to 5 includes 1, 1.5, 2, 2.75, 3, 3.90, 4, and 5). It is also to be understood that all numbers and fractions thereof are presumed to be modified by the term “about.” Further, it is to be understood that “a,” “an,” and “the” include plural referents unless the content clearly dictates otherwise. The term “about” means plus or minus 0.1 to 50%, 5-50%, or 10-40%, preferably 10-20%, more preferably 10% or 15%, of the number to which reference is being made.
Further, the definitions and embodiments described in particular sections are intended to be applicable to other embodiments herein described for which they are suitable as would be understood by a person skilled in the art. For example, in the following passages, different aspects of the invention are defined in more detail. Each aspect so defined may be combined with any other aspect or aspects unless clearly indicated to the contrary. In particular, any feature indicated as being preferred or advantageous may be combined with any other feature or features indicated as being preferred or advantageous.
The embodiments of the systems and methods described herein may be implemented in hardware or software, or a combination of both. However, preferably, these embodiments are implemented in computer programs executing on programmable computers each comprising at least one processor, a data storage system (including volatile and non-volatile memory and/or storage elements), at least one input device, and at least one output device. For example and without limitation, the programmable computers may be a personal computer, laptop, workstation, or network of a plurality of computers. Program code is applied to input data to perform the functions described herein and generate output information. The output information is applied to one or more output devices, in known fashion.
Each program is preferably implemented in a high level procedural or object oriented programming and/or scripting language to communicate with a computer system. However, the programs can be implemented in assembly or machine language, if desired. In any case, the language may be a compiled or interpreted language. Each such computer program is preferably stored on a storage medium or a device (e.g. ROM or magnetic diskette) readable by a general or special purpose programmable computer, for configuring and operating the computer when the storage medium or device is read by the computer to perform the procedures described herein. The subject system may also be considered to be implemented as a computer-readable storage medium, configured with a computer program, where the storage medium so configured causes a computer to operate in a specific and predefined manner to perform the functions described herein.
Furthermore, the system, processes and methods of the described embodiments are capable of being distributed in a computer program product comprising a computer readable medium that bears computer usable instructions for one or more processors. The medium may be provided in various forms, including one or more diskettes, compact disks, tapes, chips, wireline transmissions, satellite transmissions, internet transmission or downloadings, magnetic and electronic storage media, digital and analog signals, and the like. The computer useable instructions may also be in various forms, including compiled and non-compiled code.
Disclosed herein are methods and products for identifying putative phosphorylation sites in a target organism.
Referring now to
In an aspect, the disclosure includes a method of preparing one or more species-specific phosphorylation site database entries for a target organism comprising:
a) at step 102 selecting a first known non-target organism (NTO) phosphorylation site sequence of a first non-target organism, the first known NTO phosphorylation site sequence comprising at least 5 residues and less than 30 residues and/or 30 or fewer residues;
b) at step 104 obtaining for the first known NTO phosphorylation site sequence a first cognate known NTO phosphorylation polypeptide sequence corresponding to the first known NTO phosphorylation site sequence, the cognate known NTO phosphorylation polypeptide sequence comprising the first known NTO phosphorylation site sequence;
c) at step 106 identifying a matching target organism (TO) phosphorylation site sequence for the first known NTO phosphorylation site sequence;
d) at step 108 obtaining for the matching TO phosphorylation site sequence a cognate TO phosphorylation polypeptide sequence corresponding to the matching TO phosphorylation site sequence, the cognate TO phosphorylation polypeptide sequence comprising the matching TO phosphorylation site sequence;
e) at step 110 determining a plurality of output values, one or more of the output values being indicative of a degree of matching between the TO phosphorylation site sequence and the NTO phosphorylation site sequence; and
f) at step 112 determining a similarity value between the first known NTO phosphorylation polypeptide sequence and the cognate TO phosphorylation polypeptide sequence, wherein the similarity value provides an indication of whether the first known NTO phosphorylation polypeptide sequence and the cognate TO phosphorylation polypeptide sequence are orthologues of each other.
In an embodiment, a database is populated with one or more values corresponding to the TO phosphorylation site sequence (e.g. when the first known NTO phosphorylation polypeptide sequence and the cognate TO phosphorylation polypeptide sequence are determined to be orthologues of each other).
The method may be repeated for a plurality of known non-target organism phosphorylation site sequences such that a plurality of database entries for the target organism can be prepared. The plurality of entries form a species-specific phosphorylation site database for the target organism which may then be used to facilitate the design of species-specific kinome microarrays.
In an embodiment, the first known non-target organism (NTO) phosphorylation site sequence is downloaded from a database, for example from PhosphoSitePlus (Hornbeck, P. V., Chabra, I., Kornhauser, J. M., Skrzypek, E., and Zhang, B. (2004). PhosphoSite: A bioinformatics resource dedicated to physiological protein phosphorylation. Proteomics, 4(6), 1551-61). In an embodiment, the NTO phosphorylation site sequence is obtained from PhosphoSitePlus data files. Where the method is repeated for a plurality of known NTO phosphorylation site sequences, each of the NTO phosphorylation site sequences is downloaded from the database.
Depending on the database used for the download, the plurality of known non-target organism (NTO) phosphorylation site sequences may comprise duplicate phosphorylation site sequences from one or more NTO. For example, the PhosphoSitePlus data file contains entries with identical sequences (from different organisms).
In an embodiment, a processor executes a software program to download the first known non-target organism (NTO) phosphorylation site sequence from a database.
In an embodiment, the processor is operatively linked to an electronic database of phosphorylation site sequence data.
In an embodiment, the plurality of non-target organism (NTO) phosphorylation site sequences are depleted of duplicate or redundant known NTO phosphorylation site sequences to provide a set of non-redundant phosphorylation site sequences and the set of non-redundant phosphorylation site sequences are used to query the dataset comprising a plurality of TO polypeptide sequences.
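A minimal sketch of depleting duplicate NTO phosphorylation site sequences is given below. It assumes each downloaded entry is a dictionary with a `sequence` key and keeps the first entry seen for each distinct sequence; the function name `deduplicate_sites` and the record layout are assumptions, not the format of any particular data file.

```python
from typing import Dict, Iterable, List


def deduplicate_sites(records: Iterable[Dict[str, str]]) -> List[Dict[str, str]]:
    """Keep the first record for each distinct site sequence, preserving order."""
    seen = set()
    unique: List[Dict[str, str]] = []
    for record in records:
        key = record["sequence"].upper()  # compare sequences case-insensitively
        if key not in seen:
            seen.add(key)
            unique.append(record)
    return unique
```

Keeping the first occurrence is one design choice among several; an implementation could instead retain the entry with the most supporting references.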
While methods herein have been described for a single known non-target organism phosphorylation site sequence, it will be understood that where the method is repeated for a plurality of known NTO phosphorylation site sequences, one or more steps of the method for creating database entries for each of the plurality of known NTO phosphorylation site sequences may be performed simultaneously. For example, the plurality of known non-target organism (NTO) phosphorylation site sequences downloaded from a database may be simultaneously entered as queries into a search program for identifying one matching target organism phosphorylation site sequence for each of the plurality of known non-target phosphorylation site sequences.
In an embodiment, the non-target organism (NTO) phosphorylation site sequence comprises sequences from one, two, three or more NTOs. In an embodiment, the sequences are from 4, 5, 6, 7, 8, 9 or 10 NTOs. In an embodiment, the NTO is selected from human, mouse, rat and bovine.
In an embodiment, the phosphorylation site sequence (e.g. NTO and/or TO) comprises at least 5 residues. In another embodiment, the phosphorylation site sequence (e.g. NTO and/or TO) comprises at least 6 residues. In another embodiment, the phosphorylation site sequence consists of 30 or fewer than 30 residues. In another embodiment, the number of phosphorylation site sequence residues is equal to or less than 20 residues in length. In an embodiment, the number of phosphorylation site sequence residues is at least or equal to 7, at least or equal to 8 residues, at least or equal to 9 residues, at least or equal to 10 residues, at least or equal to 11 residues, at least or equal to 12 residues, at least or equal to 13 residues, or at least or equal to 14 residues. In another embodiment, the phosphorylation site sequence is equal to or less than 18, 17, 16 or 15 residues. In an embodiment, the phosphorylation site sequence is at or equal to 21, 22, 23, 24, 25, 26, 27, 28, 29 or 30 residues.
In an embodiment, the NTO polypeptide sequence is/comprises full length protein sequences. In another embodiment NTO polypeptide sequences comprise at least 30%, 40%, 50%, 60%, 70%, 80% of the corresponding protein sequence and/or for example at least 30, at least 40, at least 50, at least 75, at least 100, at least 150, at least 200, at least 250, at least 300 or more residues.
In an embodiment, information pertaining to the first known NTO phosphorylation site sequence is retrieved when the sequence file is downloaded from the database. For example, the sequence file may contain the NTO phosphorylation site sequence accession number, the NTO phosphorylation site sequence, NTO phosphorylation site sequence description, NTO phosphorylation site sequence organism, NTO phosphorylation site sequence site, and NTO phosphorylation site sequence length. When the sequence file is downloaded from the PhosphoSitePlus data, the file may further contain the NTO phosphorylation site sequence low throughput references and/or the NTO phosphorylation site sequence high throughput references. One or more of these pieces of information may be included in the plurality of output values that are then displayed or included in the species-specific phosphorylation site database entry created according to the method.
The numbers of low-throughput references and high-throughput references are provided, for example, by the PhosphoSitePlus database for each known phosphorylation site.
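The pieces of information listed above can be held in a simple record, sketched here as a Python dataclass. The class name `NtoSiteRecord` and its field names are hypothetical and do not reflect the column names of any actual data file; the example values in the usage note are placeholders.

```python
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass(frozen=True)
class NtoSiteRecord:
    """One known NTO phosphorylation site and its annotations."""

    accession: str
    sequence: str
    description: str
    organism: str
    site: str
    low_throughput_refs: Optional[int] = None   # absent in some sources
    high_throughput_refs: Optional[int] = None  # absent in some sources

    @property
    def length(self) -> int:
        """Length of the site sequence, derived rather than stored."""
        return len(self.sequence)

    def as_output_values(self) -> dict:
        """Flatten the record into values suitable for a database entry."""
        values = asdict(self)
        values["length"] = self.length
        return values
```

For example, `NtoSiteRecord("ACC0001", "AAAAAAASAAAAAAA", "placeholder protein", "human", "S8")` yields a record of length 15 whose reference counts are left unset.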
Referring now to
In an embodiment, identifying the matching TO phosphorylation site sequence and its cognate TO phosphorylation polypeptide sequence comprises, for example at step 206 of method 200, querying a dataset comprising a plurality of target organism (TO) polypeptide sequences with the known NTO phosphorylation site sequence (e.g. query phosphorylation site sequence) to identify a matching TO phosphorylation site sequence, and obtaining the accession number of the matching TO phosphorylation site sequence to thereby identify the cognate TO phosphorylation polypeptide sequence.
In an embodiment, the method of preparing a species-specific phosphorylation site database entry for a target organism comprises:
a) selecting a first known non-target organism (NTO) phosphorylation site sequence (Q) from a first NTO (QO), the known NTO phosphorylation site sequence comprising a length (QL) of at least 5 residues and less than 30 residues and/or 30 or fewer residues;
b) obtaining for the first known NTO phosphorylation site sequence, a first cognate known NTO phosphorylation polypeptide sequence (QF) and/or accession number (QA) corresponding to the known NTO phosphorylation site sequence, wherein the known NTO phosphorylation polypeptide sequence comprises the known NTO phosphorylation site sequence;
c) identifying a matching TO phosphorylation site sequence (H) for the known NTO phosphorylation site sequence, the matching TO phosphorylation site sequence comprising a length (HL) of at least 5 residues and less than 30 residues and/or 30 or fewer residues;
d) obtaining for the matching TO phosphorylation site sequence an accession number (HA) and/or cognate TO phosphorylation polypeptide sequence (HF);
e) identifying for the cognate known NTO phosphorylation polypeptide sequence (QF) (e.g. query polypeptide sequence) a matching TO phosphorylation polypeptide sequence for example by querying the dataset comprising the plurality of TO polypeptide sequences (TP);
f) determining a plurality of output values, one or more of the output values being indicative of a degree of matching between the TO phosphorylation site sequence and the NTO phosphorylation site sequence;
g) determining a similarity value between the first cognate NTO phosphorylation polypeptide sequence (QF) and the cognate TO phosphorylation polypeptide sequence (HF), for example by determining if the matching TO phosphorylation polypeptide sequence and the cognate TO phosphorylation sequence are the same sequence and/or have the same accession number;
wherein the similarity value provides an indication of whether the first known NTO phosphorylation polypeptide sequence and the cognate TO phosphorylation polypeptide sequence are orthologues of each other.
For example, a similarity value is calculated between the cognate known NTO polypeptide sequence, and the cognate TO phosphorylation polypeptide sequence (corresponding to the TO phosphorylation site sequence identified for example by BLAST searching a NTO phosphorylation site sequence in the TO proteome).
In an embodiment, step c) comprises querying a dataset comprising a plurality of target organism (TO) polypeptide sequences (TP) with the first known NTO phosphorylation site sequence (e.g. query phosphorylation site sequences) to identify a matching TO phosphorylation site sequence (H).
In an embodiment, step d) comprises querying the dataset comprising the plurality of TO polypeptide sequences with the matching TO phosphorylation site sequence to obtain an accession number (HA) and/or cognate TO phosphorylation polypeptide sequence (HF).
In an embodiment, the method further comprises populating a database with matching TO phosphorylation site sequences and/or related information, optionally when the known NTO polypeptide sequence (QF) and the cognate TO phosphorylation polypeptide sequence (HF) are orthologues, e.g. reciprocal polypeptides.
In an embodiment, the database is populated with the matching TO phosphorylation site sequences and cognate TO sequence accession numbers when the similarity value is below a preselected threshold.
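The flow of steps a) to g) above can be illustrated with a toy sketch that keeps the symbols Q, QA, QF, H, HA and HF. To stay self-contained, the sketch replaces BLAST with two much simpler stand-ins: an ungapped sliding-window comparison for the site search and `difflib.SequenceMatcher` for the polypeptide search. These stand-ins, the function names and the dictionary keys are assumptions for illustration and are not the search method of the disclosure.

```python
from difflib import SequenceMatcher
from typing import Dict, Optional, Tuple


def best_window(query: str, proteome: Dict[str, str]) -> Optional[Tuple[str, str, int]]:
    """Return (HA, H, identical positions) for the best ungapped window."""
    best: Optional[Tuple[str, str, int]] = None
    n = len(query)
    for accession, sequence in proteome.items():
        for start in range(len(sequence) - n + 1):
            window = sequence[start:start + n]
            identical = sum(a == b for a, b in zip(query, window))
            if best is None or identical > best[2]:
                best = (accession, window, identical)
    return best


def best_protein(query_protein: str, proteome: Dict[str, str]) -> str:
    """Return the accession of the most similar TO polypeptide (stand-in score)."""
    return max(
        proteome,
        key=lambda acc: SequenceMatcher(None, query_protein, proteome[acc]).ratio(),
    )


def make_entry(Q: str, QA: str, QF: str, to_proteome: Dict[str, str]) -> Optional[dict]:
    """Build one candidate database entry for a known NTO site."""
    hit = best_window(Q, to_proteome)               # steps c) and d)
    if hit is None:
        return None
    HA, H, identical = hit
    same_protein = best_protein(QF, to_proteome) == HA  # steps e) and g)
    return {
        "QA": QA, "Q": Q, "H": H, "HA": HA,
        "HF": to_proteome[HA],
        "identical_positions": identical,           # an output value of step f)
        "same_protein": same_protein,
    }
```

An entry would be added to the database only when `same_protein` is true, mirroring the check that the matching and cognate TO polypeptides share an accession number.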
The phosphorylation site sequences of a TO that correspond to NTO phosphorylation site sequences can be selected for inclusion in an array, such as a kinome array. Accordingly, in an aspect, the disclosure provides a method of selecting sequences for preparing a species-specific phosphorylation site array for a target organism comprising:
a) selecting a first known non-target organism (NTO) phosphorylation site sequence of a first non-target organism, the first known NTO phosphorylation site sequence comprising at least 5 residues and less than 30 residues and/or 30 or fewer residues;
b) obtaining for the first known NTO phosphorylation site sequence a first cognate known NTO phosphorylation polypeptide sequence corresponding to the first known NTO phosphorylation site sequence, the cognate known NTO phosphorylation polypeptide sequence comprising the first known NTO phosphorylation site sequence;
c) identifying a matching target organism (TO) phosphorylation site sequence for the first known NTO phosphorylation site sequence;
d) obtaining for the matching TO phosphorylation site sequence a cognate TO phosphorylation polypeptide sequence corresponding to the matching TO phosphorylation site sequence, the cognate TO phosphorylation polypeptide sequence comprising the matching TO phosphorylation site sequence;
e) determining a plurality of output values, one or more of the output values being indicative of a degree of matching between the TO phosphorylation site sequence and the NTO phosphorylation site sequence;
f) determining a similarity value between the first known NTO phosphorylation polypeptide sequence and the cognate TO phosphorylation polypeptide sequence, wherein the similarity value provides an indication of whether the first known NTO phosphorylation polypeptide sequence and the cognate TO phosphorylation polypeptide sequence are orthologues of each other; and
g) selecting the matching TO phosphorylation site sequences determined to correspond to orthologue polypeptides for inclusion in the array.
In an embodiment, the matching TO phosphorylation site sequence is selected for the array when the similarity value is below a preselected threshold.
The dataset comprising the plurality of target organism (TO) polypeptide sequences (TP) comprises, for example, the TO proteome, optionally the full proteome or a subset thereof, e.g. a subproteome. A subproteome may be desired if, for example, the database and/or array is to be limited to a particular subset (e.g. immune system proteins). Alternatively, the desired subset can be selected subsequently, for example by filtering a set of identified matching target organism phosphorylation site sequences for a desired subset such as immune system proteins.
In an embodiment, the dataset comprising the plurality of TO phosphorylation polypeptide sequences is prepared by first retrieving, for example at step 202 of method 200, a proteome of the target organism, for example from an available database of proteomes. The dataset of TO phosphorylation polypeptide sequences is then created, for example at step 204 of method 200, using the retrieved proteome of the target organism. It will be understood that the dataset of TO phosphorylation polypeptide sequences is a database of sequences that may be queried. For example, the dataset of TO phosphorylation sequences can be a BLAST database that is created using the makeblastdb program being run on the retrieved proteome of the target organism.
In an embodiment, the dataset is the TO proteome and is optionally downloaded. For example, where the target organism is bovine, a proteome of the target organism can be downloaded from The International Protein Index (IPI) (Citation for IPI: P J Kersey, J Duarte, A Williams, Y Karavidopoulou, E Birney, and R Apweiler. The International Protein Index: an integrated database for proteomics experiments. Proteomics, 4(7):1985-8, 2004). Integr8 can also be used (citation for Integr8: P Kersey, L Bower, L Morris, A Horne, R Petryszak, C Kanz, A Kanapin, U Das, K Michoud, I Phan, A Gattiker, T Kulikova, N Faruque, K Duggan, P Mclaren, B Reimholz, L Duret, S Penel, I Reuter, and R Apweiler. Integr8 and Genome Reviews: integrated views of complete genomes and proteomes. Nucleic Acids Res, 33(Database issue):D297-302, 2005).
In an embodiment, a processor executes a software program to retrieve the proteome of the target organism from an electronic database of polypeptide sequence data.
In an embodiment, the processor is operatively linked to an electronic database of polypeptide sequence data.
In an embodiment, the dataset comprising the plurality of TO phosphorylation site sequences is created by first downloading the TO proteome from one of the sources listed above and then running the makeblastdb program on the TO proteome in order to create a BLAST database comprising a plurality of TO phosphorylation polypeptide sequences. The created BLAST database can then be queried using other functions and programs, such as blastp, in order to identify a matching TO phosphorylation site sequence for the first known NTO phosphorylation site sequence.
In an embodiment, one or more data sets are obtained in nucleotide format and translated in one or all reading frames to provide a database containing polypeptide sequences. For example, if nucleotide TO sequence data is obtained, for example as a collection of cDNAs, the cDNA is translated to polypeptide sequence—if a start codon is unknown, the cDNA sequence can be translated in all reading frames.
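The translation step described above can be illustrated with a minimal Python sketch. It assumes the standard genetic code and treats "all reading frames" as the three forward frames plus the three frames of the reverse complement; the function names are hypothetical and are not part of DAPPLE or ABC.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order (assumption: no
# organism-specific or mitochondrial code is needed).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def translate(dna, frame=0):
    """Translate one forward reading frame (0, 1 or 2); unknown codons become 'X'."""
    dna = dna.upper()
    codons = (dna[i:i + 3] for i in range(frame, len(dna) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def reverse_complement(dna):
    """Return the reverse complement of a DNA string."""
    return dna.upper().translate(str.maketrans("ACGT", "TGCA"))[::-1]


def six_frame_translations(dna):
    """Translate the three forward frames and the three reverse-complement frames."""
    strands = (dna, reverse_complement(dna))
    return [translate(strand, frame) for strand in strands for frame in range(3)]
```

When the start codon is known, only `translate` with the appropriate frame would be needed.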
Alternatively in another embodiment, nucleotide databases can be employed where the query sequences are for example nucleotide sequences corresponding to polypeptide sequences.
In an embodiment, the sequences (e.g. the NTO phosphorylation site sequences, the NTO phosphorylation polypeptide sequences, the TO polypeptide sequences, the TO phosphorylation site sequences and/or the TO phosphorylation polypeptide sequences or any other sequences described herein) are in FASTA format. In another embodiment, the sequences are in raw, GCG, GenPept, XML, EMBL, Swiss-PROT, PIR and/or PDB formats. Other formats can also be used.
The NTO phosphorylation site sequence is compared, for example at step 208 of method 200, against each of the TO polypeptide site sequences of the dataset in order to identify a matching TO polypeptide site sequence. The known NTO phosphorylation sites are for example compared against the full proteins in the target proteome (using, for example, a local alignment such as BLAST). As another example, the comparing comprises comparing, for example at step 210 of method 200, the alignment of residues of the NTO phosphorylation site sequence against the residues of each of the plurality of TO polypeptide site sequences to find the number of identical residues between the NTO phosphorylation site sequence and each of the TO phosphorylation site sequences. The TO phosphorylation site sequence that contains the best match in terms of the number of identical residues is identified as the matching TO phosphorylation site sequence.
In an embodiment, the matching TO phosphorylation site sequences and/or the matching TO phosphorylation polypeptide sequences are identified using a blastp search. For example, the nonredundant phosphorylation site sequences, optionally in FASTA format, are used as queries in a BLAST search, for example using the stand-alone version of blastp (ftp://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/LATEST), with the dataset of TO phosphorylation polypeptide sequences as the queried database. The blastp search may be performed using the -ungapped option in order to produce an ungapped alignment of residues.
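As an illustration only, the database creation and ungapped search could be driven from Python roughly as follows. The file names and helper names are hypothetical. The sketch also assumes that composition-based statistics must be switched off (`-comp_based_stats F`) for blastp to accept `-ungapped`, which may depend on the BLAST+ version installed.

```python
import subprocess


def makeblastdb_cmd(proteome_fasta, db_name):
    """Build the argument list that creates a protein BLAST database."""
    return ["makeblastdb", "-in", proteome_fasta, "-dbtype", "prot", "-out", db_name]


def blastp_cmd(query_fasta, db_name, out_path, evalue=10):
    """Build the argument list for an ungapped blastp search.

    Assumption: composition-based statistics are disabled because they are
    not compatible with ungapped searches in blastp.
    """
    return [
        "blastp",
        "-query", query_fasta,
        "-db", db_name,
        "-ungapped",
        "-comp_based_stats", "F",
        "-evalue", str(evalue),
        "-out", out_path,
    ]


def run(cmd):
    """Execute a command and raise an error if it exits with a non-zero status."""
    subprocess.run(cmd, check=True)
```

For example, `run(makeblastdb_cmd("target_proteome.fasta", "target_db"))` followed by `run(blastp_cmd("sites.fasta", "target_db", "sites_vs_target.txt"))` would perform the two steps, provided the BLAST+ executables are on the search path.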
After identifying a matching TO phosphorylation site sequence, a cognate TO phosphorylation polypeptide sequence corresponding to the matching TO phosphorylation site sequence is obtained. The corresponding cognate TO phosphorylation polypeptide sequence comprises the matching TO phosphorylation site sequence.
In an embodiment, the querying of the dataset of TO phosphorylation polypeptide sequences with the known NTO phosphorylation site sequence to identify a matching TO phosphorylation site sequence generates a query output that comprises information about the matching TO phosphorylation site sequence including, for example, matching TO phosphorylation site sequence accession number, matching TO phosphorylation site sequence description, number of sequence identities in the residue alignment between the known NTO phosphorylation site sequence and the matching TO phosphorylation site, matching TO phosphorylation site sequence, and the matching TO phosphorylation site sequence start position relative to the cognate TO phosphorylation polypeptide sequence. One or more of these pieces of information may be included in the plurality of output values that are then displayed or included in the species-specific phosphorylation site database entry created according to the method.
In an embodiment, the query output is parsed to extract information about the matching TO phosphorylation site sequence. Where the query is performed as a blast search using blastp, the query output may be parsed using BioPerl module SearchIO which parses the text output from BLAST, allowing the relevant information for the query output to be easily extracted in an automated fashion.
The matching polypeptide or matching phosphorylation site sequence can for example be the best match, e.g. the one with the smallest E-value. For example, no match is identified if the smallest E-value is larger than 10, which is the default “Expect threshold” used by BLAST. In an embodiment, matches with E-values greater than the expect threshold are not reported at all. In another embodiment, more than one “match” is selected, e.g. the best two, three, four etc. matches are selected. In an embodiment, each of the selected matches is compared to the cognate TO phosphorylation sequence and the match with, for example, the same accession number is selected.
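The selection rule described above can be sketched in Python as follows. The representation of a hit as an (accession, E-value) pair and the function names are assumptions made for illustration.

```python
def best_match(hits, expect_threshold=10.0):
    """Return the hit with the smallest E-value, or None if no hit qualifies.

    hits: list of (accession, evalue) pairs (hypothetical representation).
    Hits whose E-value exceeds the expect threshold are ignored.
    """
    kept = [hit for hit in hits if hit[1] <= expect_threshold]
    return min(kept, key=lambda hit: hit[1]) if kept else None


def best_n_matches(hits, n, expect_threshold=10.0):
    """Return up to n qualifying hits ordered by increasing E-value."""
    kept = sorted((hit for hit in hits if hit[1] <= expect_threshold), key=lambda hit: hit[1])
    return kept[:n]
```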
In an embodiment, the number of sequence differences (e.g. number of non-exact residue matches) between the NTO phosphorylation site sequence (e.g. the entire query sequence and not just the portion of the query sequence that matched) and the matching TO phosphorylation site sequence (e.g. best hit sequence and only the portion that matched) is then calculated. The number of sequence differences is indicative of a degree of matching between the TO phosphorylation site sequence and the NTO phosphorylation site sequence. In an embodiment, the Levenshtein edit distance is calculated. In an embodiment, the number of sequence differences is up to 80%, 70%, 60%, 50%, 40%, 35%, 30%, 25%, 20%, 15% or 10%. In an embodiment, the matching sequence has 0 sequence differences. For example, where the input sequence is 8 amino acid residues, the matching sequence may have 6 or less, 5 or less, 4 or less, 3 or less, 2 or less, 1 or no sequence differences (e.g. 0 sequence differences) and where the input sequence is 15 amino acid residues, the matching sequence may have 12 or less, 11 or less, 10 or less, 9 or less, 8 or less, 7 or less, 6 or less, 5 or less, 4 or less, 3 or less, 2 or less, 1 or no sequence differences (0 sequence differences). Sequences having more than for example 60% different residues are not considered matches. The number of sequence differences may be calculated as the difference between the NTO phosphorylation site sequence length and the number of sequence identities described above. The number of sequence differences may be included in the plurality of output values that are displayed in the species-specific phosphorylation site database entry created according to the method.
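A minimal Python sketch of the two ways of counting sequence differences mentioned above is given below: the Levenshtein edit distance between the entire query and the matched portion of the hit, and the simpler subtraction of the number of identities from the query length. The function names are hypothetical.

```python
def levenshtein(a, b):
    """Levenshtein edit distance between strings a and b (row-by-row dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,               # deletion
                cur[j - 1] + 1,            # insertion
                prev[j - 1] + (ca != cb),  # substitution, or match at no cost
            ))
        prev = cur
    return prev[-1]


def differences_from_identities(query_length, identities):
    """Sequence differences as query length minus the number of identical residues."""
    return query_length - identities
```

With an ungapped alignment, residues of the query that fall outside the matched portion count as differences under both calculations.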
In an embodiment, the number of non-conservative sequence differences between the NTO phosphorylation site sequence (e.g. the entire query sequence and not just the portion of the query sequence that matched) and the matching TO phosphorylation site sequence (e.g. best hit sequence and only the portion that matched) is then calculated. The number of non-conservative sequence differences is indicative of a degree of matching between the TO phosphorylation site sequence and the NTO phosphorylation site sequence. In an embodiment, the number of non-conservative sequence differences is up to 90%, 80%, 70%, 60%, 50%, 40%, 35%, 30%, 25%, 20%, 15% or 10%. For example, where the input sequence is 8 amino acid residues, the matching sequence may have 8 or less, 7 or less, 6 or less, 5 or less, 4 or less, 3 or less, 2 or less, 1 or no non-conservative sequence differences and where the input sequence is 15 amino acid residues, the matching sequence may have 6 or less, 5 or less, 4 or less, 3 or less, 2 or less, 1 or no non-conservative sequence differences. Sequences having more than for example 60% different residues are not considered non-conservative matches. The number of non-conservative sequence differences may be calculated as the difference between the NTO phosphorylation site sequence length and the sum of the number of sequence identities mentioned above and the number of conservative substitutions. The number of non-conservative sequence differences may be included in the plurality of output values that are displayed in the species-specific phosphorylation site database entry created according to the method.
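As a small worked illustration, the non-conservative count can be obtained by subtraction once the identities and conservative substitutions are known. The sketch below assumes that these counts come from the alignment report, for example from BLAST's "Positives" figure, which counts identities together with conservative substitutions; the function names are hypothetical.

```python
def non_conservative_differences(query_length, identities, conservative):
    """Query length minus the residues that are identical or conservatively substituted."""
    return query_length - (identities + conservative)


def non_conservative_from_positives(query_length, positives):
    """Same quantity, starting from a 'positives' count (identities plus conservative substitutions)."""
    return query_length - positives
```

For a 15-residue query with 11 identities and 2 conservative substitutions, both functions give 2 non-conservative differences.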
In an embodiment, the method comprises comparing the full protein sequence of for example the mature (or other species) TO phosphorylation sequence polypeptide with the full protein of for example the mature (or other species) NTO phosphorylation sequence polypeptide.
In an embodiment, the identifying of the matching TO phosphorylation site sequence further comprises determining the hit site of the TO phosphorylation site sequence. This hit site corresponds to the site of the phosphorylated residue within the cognate TO phosphorylation polypeptide sequence. In an embodiment, where the length of the known phosphorylation site sequence is for example equal to 15, and the phosphorylation site residue in the known phosphorylation site sequence is at position 8, the hit site can be calculated according to the expression Hs−Qs+8 where Hs is the start position of the hit in the matching TO phosphorylation site sequence and Qs is the start position in the known NTO phosphorylation site sequence for example as reported by local alignment (e.g. BLAST). A person skilled in the art would understand that if the phosphorylated residue in the known phosphorylation site is at another position, e.g. position 9 of the known phosphorylation site sequence of length 17, the hit site can be calculated according to the expression Hs−Qs+9. The hit site may be included in the plurality of output values that are displayed in the species-specific phosphorylation site database entry created according to the method.
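The hit site expression can be written as a short Python function. The parameter names are hypothetical; positions are 1-based, as in BLAST reports.

```python
def hit_site(hs, qs, phospho_pos=8):
    """Position of the phosphorylated residue within the cognate TO polypeptide.

    hs: start position of the hit in the TO polypeptide (1-based).
    qs: start position of the alignment in the query site sequence (1-based).
    phospho_pos: position of the phosphorylated residue within the query
                 (8 for a 15-residue site, 9 for a 17-residue site).
    """
    return hs - qs + phospho_pos
```

For example, if the alignment starts at query position 3 and at position 50 of the hit protein, a 15-residue query gives a hit site of 50 − 3 + 8 = 55.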
In an embodiment, the identifying of the matching TO phosphorylation site sequence further comprises calculating the n-mer, optionally 9-mer, sequence differences and the n-mer, optionally 9-mer, non-conservative sequence differences. For example, a 9-mer or 9-amino-acid-long substring of a 15 amino acid NTO phosphorylated site sequence (Q9), where the phosphorylated residue is its central residue, is identified by locating the phosphorylated residue of the NTO phosphorylated site sequence and the 4 residues on either side of the phosphorylated residue (e.g. residues 4 to 12 inclusive). Similarly, a 9-amino-acid-long substring of the TO phosphorylated site sequence (e.g. H9, where HL is at least 9 residues long), where the phosphorylated residue is at its centre, is identified by locating the phosphorylated residue of the TO phosphorylated site sequence and, for example, by taking the substring between indices (5−Qs) and (13−Qs) inclusive. A person skilled in the art would recognize that if QL, the length of the known NTO phosphorylation site sequence, is not 15, the indices will vary accordingly. Depending on the selected n-mer, selected query lengths and identified hit lengths, a person skilled in the art would be able to modify the above equations accordingly.
In an embodiment, the 9-mer sequence differences value is calculated as the number of sequence differences between the 9-amino-acid-long substring of the TO phosphorylated site sequence (H9) and the 9-amino-acid-long substring of the query (Q9). In an embodiment, the 9-mer non-conservative sequence differences value is calculated as the number of non-conservative sequence differences between the 9-amino-acid-long substring of the TO phosphorylated site sequence and the 9-amino-acid-long substring of the query.
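The 9-mer extraction and comparison can be sketched in Python as follows, assuming a 15-residue query with the phosphorylated residue at position 8, 1-based positions, and an ungapped alignment. The function names are hypothetical, and returning None when the matched portion does not cover the whole 9-mer is a design choice of this sketch rather than something the description prescribes.

```python
def query_9mer(query, phospho_pos=8, flank=4):
    """Residues phospho_pos - flank .. phospho_pos + flank of the query (4..12 by default)."""
    start, end = phospho_pos - flank, phospho_pos + flank
    if start < 1 or end > len(query):
        return None
    return query[start - 1:end]


def hit_9mer(hit_portion, qs, phospho_pos=8, flank=4):
    """Corresponding substring of the matched hit portion.

    With the defaults, the 1-based indices are (5 - qs) .. (13 - qs) inclusive,
    where qs is the start of the alignment in the query.
    """
    start = phospho_pos - flank - qs + 1
    end = phospho_pos + flank - qs + 1
    if start < 1 or end > len(hit_portion):
        return None  # the matched portion does not cover the full 9-mer
    return hit_portion[start - 1:end]


def count_differences(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))
```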
As described, the NTO phosphorylation site sequences are at least 5 amino acid residues and up to for example 30 amino acid residues in length. Due to the short length of the query sequences, the cognate TO phosphorylation polypeptide sequence corresponding to the matching TO phosphorylation site sequence may not be orthologous to the first known NTO phosphorylation polypeptide sequence.
Orthology can be assessed for example by identifying reciprocal blast hits as outlined herein (e.g. below and under description of DAPPLE and Detailed description of DAPPLE methodology). Orthology can also be assessed by selecting a threshold and sequences sharing an E-value below the threshold are likely to be orthologues (e.g. ABC).
In an embodiment, reciprocal blast hits are identified and the following further comparisons are made. A comparison is made between the first known NTO phosphorylation polypeptide sequence and each of the TO phosphorylation polypeptide sequences of the dataset of TO phosphorylation polypeptide sequences, to generate a plurality of TO dataset similarity values. A best TO dataset similarity value, E1B, is identified. The comparison step also includes, in an embodiment, identifying from the plurality of TO dataset similarity values a first TO dataset similarity value, E1F, of the match between the first known NTO phosphorylation polypeptide sequence (QF) and the cognate TO phosphorylation polypeptide sequence (HF). The similarity rank (e.g. E-value rank) (S) of the match between the first known NTO phosphorylation polypeptide sequence and the cognate TO phosphorylation polypeptide is further determined. The comparison may be performed, for example, as a blastp search using the first cognate known NTO phosphorylation polypeptide sequence as the query and the dataset of TO phosphorylation polypeptides as the queried database, in which case the TO dataset similarity values are a plurality of E-values, wherein the smallest E-value is identified as the best TO dataset similarity value.
Another comparison is made between the cognate TO phosphorylation polypeptide sequence and each of the NTO phosphorylation polypeptide sequences of the dataset of NTO phosphorylation polypeptide sequences, to generate a plurality of NTO dataset similarity values. A best NTO dataset similarity value, E2B, is identified. In an embodiment, the method further comprises identifying from the plurality of NTO dataset similarity values a first NTO dataset similarity value, E2F, of the match between the first known NTO phosphorylation polypeptide sequence (QF) and the cognate TO phosphorylation polypeptide sequence (HF).
In some embodiments, the non-target proteome is downloaded from The International Protein Index (IPI) for example from ftp://ftp.ebi.ac.uk/pub/databases/IPI/current/ipi.BOVIN.fasta.gz (Citation for IPI: P J Kersey, J Duarte, A Williams, Y Karavidopoulou, E Birney, and R Apweiler. The International Protein Index: an integrated database for proteomics experiments. Proteomics, 4(7):1985-8, 2004). Integr8 can also be used (citation for Integr8: P Kersey, L Bower, L Morris, A Horne, R Petryszak, C Kanz, A Kanapin, U Das, K Michoud, I Phan, A Gattiker, T Kulikova, N Faruque, K Duggan, P Mclaren, B Reimholz, L Duret, S Penel, I Reuter, and R Apweiler. Integr8 and Genome Reviews: integrated views of complete genomes and proteomes. Nucleic Acids Res, 33(Database issue):D297-302, 2005). The makeblastdb program is then run on the NTO proteome in order to create a BLAST database comprising a plurality of NTO phosphorylation polypeptide sequences. The BLAST database forms the dataset of NTO phosphorylated polypeptide sequences. In this case, the second comparison may be performed as a blastp search using the cognate TO phosphorylation polypeptide sequence as the query and the dataset of NTO phosphorylation polypeptides as the queried database, in which case, the NTO dataset similarity values are a plurality of E-values, wherein the smallest E-value is identified as the best NTO dataset similarity value.
The best TO dataset similarity value E1B is then compared against the first TO dataset similarity value E1F and the best NTO dataset similarity value E2B is compared against the first NTO dataset similarity value E2F, wherein if the first TO dataset similarity value E1F equals the best TO dataset similarity value E1B and the first NTO dataset similarity value E2F equals the best NTO dataset similarity value E2B, the cognate TO phosphorylation polypeptide sequence is determined to be an orthologue of, or reciprocal blast hit of, the first known NTO phosphorylation polypeptide sequence. An indication of whether the TO phosphorylation polypeptide sequence is an orthologue of the first known NTO phosphorylation polypeptide sequence is included in the plurality of output values. In some embodiments, the plurality of output values may further include the first TO and/or NTO dataset similarity value. In some embodiments, the plurality of output values may further include the hit polypeptide sequence rank, which is determined as the rank of the first TO and NTO dataset similarity values amongst the plurality of TO and NTO dataset similarity values.
An example of the above steps for performing the reciprocal blast hit comparison is outlined in steps 332-340 under the heading Detailed description of DAPPLE methodology.
In an embodiment, the reciprocal blast hit comparison comprises the following:
a) run blastp using QF as the query and DTP as the database. Determine the E-value E1B of the best BLAST hit, and also the E-value E1F of the match between QF and HF. Also, let S be the E-value rank of the E1F. In other words, if E1F is the nth smallest E-value, then S=n.
b) run blastp using HF as the query and DQoP as the database. Determine the E-value E2B of the best BLAST hit, and also the E-value E2F of the match between QF and HF.
c) let R=“yes” if QF and HF are reciprocal BLAST hits, and “no” otherwise; if E1B=E1F and E2B=E2F, then R=“yes”; otherwise, R=“no”.
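Steps a) to c) above can be sketched in Python as follows. The sketch assumes that the two blastp runs have already been parsed into dictionaries mapping accession numbers to E-values; this representation, the function name, and the handling of ties in the rank S are assumptions for illustration.

```python
def reciprocal_blast_hit(forward_hits, reverse_hits, qf_acc, hf_acc):
    """Evaluate the reciprocal BLAST hit test for QF and HF.

    forward_hits: {accession: E-value} from blastp with QF against the target proteome.
    reverse_hits: {accession: E-value} from blastp with HF against the query-organism proteome.
    qf_acc, hf_acc: accession numbers of QF and HF.
    """
    e1b = min(forward_hits.values(), default=None)  # best forward E-value
    e1f = forward_hits.get(hf_acc)                  # E-value of QF against HF
    e2b = min(reverse_hits.values(), default=None)  # best reverse E-value
    e2f = reverse_hits.get(qf_acc)                  # E-value of HF against QF

    # Rank of E1F among forward E-values; tied values share the better rank (assumption).
    s = None if e1f is None else 1 + sum(e < e1f for e in forward_hits.values())

    is_rbh = e1f is not None and e2f is not None and e1b == e1f and e2b == e2f
    return {"E1B": e1b, "E1F": e1f, "E2B": e2b, "E2F": e2f, "S": s,
            "R": "yes" if is_rbh else "no"}
```

If HF does not appear among the forward hits, or QF does not appear among the reverse hits, this sketch reports R as "no".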
The series of comparisons can also be understood for example according to the following:
Assume, to start, that the NTO phosphorylation polypeptide sequence and the cognate TO phosphorylation polypeptide sequence are reciprocal BLAST hits. This assumption is maintained until proven otherwise. Blastp is executed using the NTO phosphorylation polypeptide sequence as the query and the TO proteome as the database. If the E-value against the cognate TO phosphorylation polypeptide sequence is not equal to the best (smallest) E-value of all the hits against the TO proteome, then the two proteins are not reciprocal BLAST hits. Then, blastp is executed using the TO phosphorylation polypeptide sequence as the query and the NTO proteome as the database. If the E-value against the NTO phosphorylation polypeptide sequence is not equal to the best (smallest) E-value, then the two proteins are not reciprocal BLAST hits.
The comparisons may be performed, for example, as a blastp search, in which case the first similarity value is the recorded E-value of the match; similar sequences will have a small E-value and dissimilar sequences will have a large E-value. If the similarity value is large, for example if the E-value is large, then the two proteins may not be orthologues. Although percent identity can be used, E-value is preferred for determining orthology as it takes into account sequence length, database size, etc.
In an embodiment, the similarity value comprises an E-value. In another embodiment, the E-value is selected at less than 10^-2, 10^-3, 10^-4, or 10^-5, or between 10^-2 and 10^-5 or any number in between.
The comparison may be performed, for example, as a blastp search using the first known NTO phosphorylation polypeptide sequence as a query, in which case the first similarity value is the recorded E-value of the match; similar sequences will have a small E-value and dissimilar sequences will have a large E-value. If the similarity value is large, for example if the E-value is large, then the two proteins may not be orthologues. Although percent identity can be used, E-value is preferred for determining orthology as it takes into account sequence length, database size, etc.
In an embodiment, the similarity value comprises an E-value. In another embodiment, the E-value is selected at less than 10^-2, 10^-3, 10^-4, or 10^-5, or between 10^-2 and 10^-5 or any number in between.
In an embodiment, the plurality of output values is displayed.
In an embodiment, the plurality of output values is outputted to form an entry for the species-specific phosphorylation site database. The plurality of output values may be outputted electronically to allow easy importing into a spreadsheet program. For example, the output may be in tab-delimited plain text format, comma-delimited plain text format, or any other delimited format for easy importing.
Since the method may be repeated for a large number of NTO phosphorylated site sequences, for example thousands of sequences, output values for a large number of database entries may be prepared. In an embodiment, the method further includes a method of filtering the table so that one can intelligently choose which peptides for example to include on the kinome array. For example, the user may wish to view only entries where the number of low-throughput references is greater than two, or to eliminate entries where the similarity value is greater or lesser than a certain threshold.
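The filtering described above might look like the following Python sketch. The dictionary keys and default thresholds are hypothetical; the defaults mirror the examples in the text (more than two low-throughput references, and a protein E-value at or below a chosen threshold).

```python
def filter_entries(rows, min_low_throughput_exclusive=2, max_evalue=1e-3):
    """Keep entries with enough low-throughput references and a small enough E-value.

    rows: list of dicts with hypothetical keys 'low_throughput_refs' and 'protein_evalue'.
    """
    return [
        row for row in rows
        if int(row["low_throughput_refs"]) > min_low_throughput_exclusive
        and float(row["protein_evalue"]) <= max_evalue
    ]
```

Because values are converted with `int` and `float`, the rows may come directly from a parsed tab-delimited file in which every field is a string.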
In an embodiment, the method is computer implemented. In an embodiment, the method is carried out using the “DAPPLE” program described herein, which uses, for example, a reciprocal BLAST hit (RBH) component to ascertain orthology, or the ABC program described under the heading ABC, which specifies an E-value threshold for determining orthology. The DAPPLE program also allows selection of an E-value threshold. In another embodiment, a computerized system implements the method described above. In an embodiment, the computerized system carries out the “DAPPLE” program, for example as more particularly described under the headings DAPPLE and Detailed description of DAPPLE methodology, or the ABC program described below under the heading ABC.
In an embodiment, the BLAST searches can be parallelized and the computer method (e.g. DAPPLE) can be run on a workstation cluster or computer grid to reduce its computational time.
In another embodiment, a non-first match is used, especially if the full protein corresponding to one of these matches is orthologous to the full protein corresponding to the query.
In another embodiment, the substitution matrix is based on the evolutionary relatedness between the target organism and the organism corresponding to a given known phosphorylation site.
A further aspect comprises a non-transitory computer-readable storage medium comprising a plurality of instructions, wherein the instructions, when executed, cause a processor to perform the following:
a) querying a dataset comprising a plurality of target organism (TO) polypeptide sequences with a selected plurality of known NTO phosphorylation site sequences (query phosphorylation site sequences) to identify for each of the plurality of NTO phosphorylation site sequences a matching TO phosphorylation site sequence;
b) obtaining for each of the matching TO phosphorylation site sequences a cognate TO phosphorylation polypeptide sequence corresponding to the matching TO phosphorylation site sequence, the cognate TO phosphorylation polypeptide sequence comprising the matching TO phosphorylation site sequence;
c) determining a plurality of output values, one or more of the output values being indicative of a degree of matching between the TO phosphorylation site sequence and the NTO phosphorylation site sequence; and
d) determining a similarity value between the first known NTO phosphorylation polypeptide sequence and the cognate TO phosphorylation polypeptide sequence, wherein the similarity value provides an indication of whether the first known NTO phosphorylation polypeptide sequence and the cognate TO phosphorylation polypeptide sequence are orthologues of each other.
In an embodiment, the instructions are further for performing the step of displaying the matching TO phosphorylation site sequences and/or cognate TO sequence accession numbers when the similarity value is below a preselected threshold.
In an embodiment, the instructions are further for performing the step of displaying one or more of these pieces of information of the plurality of output values. In an embodiment, the instructions are further for performing the steps of creating a species-specific phosphorylation site database entry.
In an embodiment, the instructions stored on the non-transitory computer-readable medium are further for performing the step of carrying out the steps of any one or more of the methods described herein.
A further aspect comprises a system for preparing one or more species-specific phosphorylation site database entries for a target organism, the system comprising:
a) a memory for storing a plurality of instructions; and
b) a processor coupled to the memory for:
In an embodiment, the similarity value is an E-value and the preselected threshold is 10^-3.
In an embodiment, the program comprises the DAPPLE scripts described below under the heading Detailed description of DAPPLE methodology.
Another aspect includes a computerized control system for controlling and receiving data, the computerized control system comprising at least one processor and memory configured to provide:
a) a control module for:
b) an analysis module for:
c) a display module for displaying the matching TO phosphorylation site sequences and/or cognate TO sequence accession numbers when the similarity value is below a preselected threshold.
Using the cow as a test species, a protocol for designing kinome microarrays for species with few known phosphorylation sites was recently proposed (Jalal et al., 2009). Taking advantage of sequence homology between human proteins and bovine proteins, this study used known human phosphorylation sites as BLAST (Altschul et al., 1997) queries in order to identify probable bovine sites. If a given query's best match in the bovine proteome had few sequence differences relative to the query, it was a candidate for inclusion on a bovine-specific kinome microarray. While useful, several aspects of this protocol could be improved.
First, the manual nature of the protocol makes it time-consuming and tedious to perform, and also limits the amount of known phosphorylation data that can be used. Second, the protocol uses only known phosphorylation sites from human. This is problematic because it is possible, for instance, that a given bovine phosphorylation site might be homologous to a known rat phosphorylation site, but not to any known human site. By using only known phosphorylation sites from human, this bovine site would be missed. Third, the method used by the protocol to identify possible non-orthologous proteins (comparing the annotations of those proteins) has several drawbacks, including the subjective nature of comparing annotations, the difficulty of automating these comparisons, and the fact that protein annotations are often inaccurate or incomplete. Fourth, the protocol described in Jalal et al. (2009) has no facility for choosing which peptides should be included on the array once the BLAST searches have been performed.
ABC is a collection of Perl scripts that addresses these concerns, ultimately allowing the user to easily, quickly, and accurately identify potential phosphorylation sites in an organism of interest.
To test ABC, it was used to identify phosphorylation sites in the cow (Bos taurus), just as was done in Jalal et al. (2009). The PhosphoSitePlus database was downloaded on Feb. 14, 2011, and contained 97679 known phosphorylation sites (83860 of them unique). The International Protein Index (IPI) (Kersey et al., 2004) bovine proteome was downloaded from ftp://ftp.ebi.ac.uk/pub/databases/IPI/current/ipi.BOVIN.fasta.gz on Dec. 20, 2010 in FASTA format and contained 29384 protein sequences. These two files were then used as input to ABC. For comparison purposes, the output table produced by ABC was used to generate summary data similar to that in Table 1 of Jalal et al. (2009). Note that the methodology employed by ABC is not identical to that employed by Jalal et al., so the results that it produces are not expected to be exactly the same. Table 1 compares the summary results given by Jalal and coauthors in Table 1 of their paper with the results produced by ABC.
As can be seen, the percentages of known phosphorylation sites that had a given number of sequence differences with their best bovine BLAST match were similar between the two approaches. The percentage of queries for which no homology was found in the bovine proteome was also similar, despite the different approaches used for detecting non-homology.
As kinome microarrays become a more popular tool for studying cellular signaling, the ability to design kinome microarrays suitable for studying different species will become increasingly important.
ABC improves upon an already-successful method for designing kinome microarrays. Compared to the previous protocol, it is far less time-consuming and tedious, yet is able to make use of 100 times more information. Through its use of all known phosphorylation sites in the PhosphoSitePlus database, rather than just those from human, ABC is more robust and thorough. Finally, the program greatly improves the ability to identify non-orthologous matches. As such, ABC should prove to be a useful tool for designing species-specific kinome microarrays.
ABC requires two input files: the proteome of the target organism (for which the user wants to design the kinome microarray) in FASTA format, and the phosphorylation site data from PhosphoSitePlus, which can be obtained from www.phosphosite.org/downloads/Phosphorylation_site_dataset.gz. As the PhosphoSitePlus data file contains entries with identical sequences (from different organisms), duplicate sequences are first removed. A FASTA file containing the nonredundant phosphorylation sites is then created, and the sequences in this file are used as queries to the stand-alone version of blastp (ftp://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/LATEST), with the target organism's proteome as the database. Unlike in Jalal et al., the queries are not limited to those from human. The output from blastp is then parsed using the BioPerl (Stajich et al., 2002) module SearchIO, and the accession number and sequence of the best match, if any, for each query are saved. The number of sequence differences (more formally, the Levenshtein edit distance) between the entire query sequence (not just the portion of the query sequence that matched) and the best hit sequence (only the portion that matched) is then calculated.
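The duplicate removal and FASTA creation steps can be illustrated with the short Python sketch below. ABC itself is described as a collection of Perl scripts, so this is not its implementation; the function names and the FASTA identifier scheme are hypothetical, and duplicates are taken to mean exactly identical strings.

```python
def unique_sites(sites):
    """Remove exact duplicate site sequences while preserving first-seen order."""
    seen = set()
    unique = []
    for site in sites:
        if site not in seen:
            seen.add(site)
            unique.append(site)
    return unique


def to_fasta(sequences, prefix="site"):
    """Format sequences as FASTA text with sequential, hypothetical identifiers."""
    return "".join(f">{prefix}_{i}\n{seq}\n" for i, seq in enumerate(sequences, 1))
```

The resulting text can be written to a file and used as the query input to blastp.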
Due to the short length of the query sequences (between eight and fifteen amino acids), the full protein corresponding to the best match may not be orthologous to the full protein corresponding to the query sequence. In Jalal et al. (2009), this problem was addressed by manually comparing the annotations of the proteins corresponding to the query and the match. However, this approach suffers from the drawbacks described in the introduction; thus, ABC instead uses the full protein corresponding to each known phosphorylation site (i.e. each of the original queries) as a blastp query against the target proteome. The match against the same accession number as was matched by the corresponding phosphorylation site is then identified, and the E-value of this match recorded. If this E-value is large, then the two proteins may not be orthologues. The output of ABC is a table in which each row represents the result of a BLAST search using, as a query, one of the phosphorylation sites in the PhosphoSitePlus data file. The table is in a tab-delimited plain text format that can easily be imported into a spreadsheet program. 
This table contains several columns, including: query accession (the accession number of the protein corresponding to the known phosphorylation site), query description (a description of that protein), query organism (the organism corresponding to that protein), query sequence (the amino acid sequence of the known phosphorylation site), hit accession (the accession number of the best match in the target proteome), hit sequence (the amino acid sequence of this match), sequence differences (the number of sequence differences between the entire query sequence and the portion of the hit protein that matched), protein E-value (the E-value between the entire protein corresponding to the query accession, and the entire protein corresponding to the hit accession), low-throughput references (the number of low-throughput references corresponding to this phosphorylation site), and high-throughput references. The rows are listed in increasing order of sequence differences.
Since the output table will contain thousands of possible phosphorylation sites, the user needs some method of filtering the table so that he or she can intelligently choose which peptides to include on the array. For example, the user may wish to view only rows where the number of low-throughput references is greater than two, or to eliminate rows where the E-value is greater than a certain threshold. ABC contains a number of scripts allowing the output table to be filtered in these and other ways, further aiding the user in designing species-specific kinome microarrays.
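A filter of this kind can be sketched in Python as follows. This is an illustration only, not one of the ABC scripts: the column headers (low_throughput_refs, protein_evalue and so on) and the sample rows are hypothetical, since the exact header names of the output table are not given here.

```python
import csv
import io

# Hypothetical three-row table in the tab-delimited layout described above.
SAMPLE = (
    "query_accession\thit_accession\tsequence_differences\t"
    "protein_evalue\tlow_throughput_refs\n"
    "P00001\tB00001\t0\t1e-120\t5\n"
    "P00002\tB00002\t1\t0.5\t7\n"
    "P00003\tB00003\t0\t1e-80\t1\n"
)


def filter_rows(table_text, min_ltr=None, max_evalue=None):
    """Keep rows with more than min_ltr low-throughput references and a
    protein E-value no greater than max_evalue (either test may be omitted)."""
    reader = csv.DictReader(io.StringIO(table_text), delimiter="\t")
    kept = []
    for row in reader:
        if min_ltr is not None and int(row["low_throughput_refs"]) <= min_ltr:
            continue
        if max_evalue is not None and float(row["protein_evalue"]) > max_evalue:
            continue
        kept.append(row)
    return kept
```

Calling filter_rows(SAMPLE, min_ltr=2, max_evalue=1e-10) keeps only the first row of the sample: the second row fails the E-value threshold and the third has too few low-throughput references.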
DAPPLE (Design Array for PhosPhoryLation Experiments) is a collection of Perl scripts that addresses concerns such as those noted in the description of ABC, allowing the user to identify potential phosphorylation sites in an organism of interest easily, quickly, and accurately.
DAPPLE requires several input files: the proteome of the target organism (for which the user wants to design a kinome microarray) in FASTA format; the proteomes of the organisms represented in the database of phosphorylation sites, also in FASTA format; and the phosphorylation site data. If a particular organism represented in the phosphorylation site data does not have a proteome available, then the known phosphorylation sites from that organism can still be used; however, DAPPLE will be unable to output information for the “RBH?” column of the output table (see below). The phosphorylation site data could be obtained from a number of sources, including the PhosphoSitePlus database (Hornbeck et al., 2004), Phospho.ELM (Diella et al., 2004, 2008), or the literature. This study used data from PhosphoSitePlus, which can be obtained from www.phosphosite.org/downloads/Phosphorylation site dataset.gz. As the PhosphoSitePlus data file contains entries with identical sequences (from different organisms), duplicate sequences are first removed. The sequences of the non-redundant phosphorylation sites are used as queries to the standalone version of blastp (ftp://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/LATEST), with the target organism's proteome as the database. Unlike in Jalal et al. (2009), the queries are not limited to those from human. The output from blastp is then parsed using the BioPerl (Stajich et al., 2002) module SearchIO, and the accession number and sequence of the best match, if any, for each query are saved. If there are multiple matches with the same E-value as the best match, then only the first result returned by BLAST is used. Additional information about the match is then saved or computed, and ultimately presented in the DAPPLE output table (described below).
Due to the short length of the query sequences (between eight and fifteen amino acids), the full protein corresponding to the best match may not be orthologous to the full protein corresponding to the query sequence. In Jalal et al. (2009), this problem was addressed by manually comparing the annotations of the proteins corresponding to the query and the match. However, this approach suffers from the drawbacks described in the introduction; thus, DAPPLE uses the well-established reciprocal BLAST hits (RBH) method to ascertain orthology (Moreno-Hagelsieb and Latimer, 2008). For a given known phosphorylation site X from organism A with best match Y in organism B (the target organism), let X′ be the full protein corresponding to X, and Y′ be the full protein corresponding to Y. DAPPLE will declare X′ and Y′ as orthologues if and only if Y′ is the best match when X′ is used as a query sequence and the proteome of organism B is used as the database, and X′ is the best match when Y′ is used as a query sequence and the proteome of organism A is used as the database. In this case, “the best match” is defined as any protein that has the smallest E-value. For instance, if X′ is not the first result returned by BLAST when Y′ is used as a query sequence and the proteome of organism A is used as the database, then X′ and Y′ can still be declared as orthologues if the E-value of the match against X′ is equal to that of the first result returned by BLAST.
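The reciprocal test, including the tie rule on E-values, can be sketched in Python. This is a minimal illustration and not DAPPLE's Perl code; it assumes the BLAST results have already been parsed into lists of (accession, E-value) pairs, and the names best_matches and is_rbh are hypothetical.

```python
def best_matches(hits):
    """Return the accessions of all hits that share the smallest E-value.

    hits is a list of (accession, evalue) pairs from one BLAST search.
    """
    if not hits:
        return set()
    smallest = min(evalue for _, evalue in hits)
    return {acc for acc, evalue in hits if evalue == smallest}


def is_rbh(x_acc, y_acc, hits_x_vs_b, hits_y_vs_a):
    """True if X' and Y' are reciprocal BLAST hits.

    hits_x_vs_b: hits for X' searched against the proteome of organism B.
    hits_y_vs_a: hits for Y' searched against the proteome of organism A.
    """
    return (y_acc in best_matches(hits_x_vs_b)
            and x_acc in best_matches(hits_y_vs_a))
```

Using a set of tied best matches, rather than only the first hit, reflects the rule that X′ still qualifies when it is not listed first but its E-value equals that of the first result.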
The output of DAPPLE is a table in which each row represents the result of a BLAST search using, as a query, one of the known phosphorylation sites in the PhosphoSitePlus data file. The table is in a tab-delimited plain text format that can easily be subsequently manipulated. This table contains many columns. The following list describes each column, with X, Y, X′, and Y′ having the same meaning as above.
To test DAPPLE, phosphorylation sites in the cow (Bos taurus) were identified, just as was done in Jalal et al. (2009). The files described below, all of which were downloaded on Jun. 7, 2011, were used as input to DAPPLE. The PhosphoSitePlus database was downloaded from the URL given earlier, and contained 122031 known phosphorylation sites (104386 of them unique). The International Protein Index (IPI) (Kersey et al., 2004) bovine proteome was downloaded from ftp.ebi.ac.uk/pub/databases/IPI/current in FASTA format and contained 34273 protein sequences. If available, the proteome for each organism represented in the PhosphoSitePlus database was retrieved. The proteomes were downloaded from various sources depending on data availability: the human, mouse, and rat proteomes were downloaded from IPI; the fruit fly proteome was downloaded from UniProtKB; and the dog, ferret, goat, guinea pig, horse, pig, and sheep proteomes were downloaded from GenBank. No proteomes could be downloaded for the remaining organisms represented in the PhosphoSitePlus database (frog, hamster, monkey, quail, rabbit, starfish, and torpedo fish), either because the organism had few or no protein sequences available, or because the organism name refers to a group of organisms (e.g. frog) rather than a single species.
Table 2 compares the summary results given by Jalal and coauthors in Table 1 of their paper with the results produced by DAPPLE. Note that the methodology employed by DAPPLE is not identical to that employed by Jalal et al., so the results that it produces are not expected to be exactly the same. Nevertheless, the percentages of known phosphorylation sites that had a given number of sequence differences with their best bovine BLAST match were similar between the two approaches, with the greatest discrepancies occurring in the percentage of peptides having zero sequence differences. For DAPPLE, the percentage of peptides under the “no homology” category differed depending on the criterion for declaring two proteins as orthologues (see Table 1 caption), with the RBH method being less likely to declare two proteins as orthologues than the E-value method.
The gain in efficiency using DAPPLE compared to manually performing the procedure in Jalal et al. (2009) was considerable. DAPPLE took 63 hours (elapsed time) to run on a Mac OS X machine with a 2.4 GHz Intel Core 2 Duo processor and 4 GB of memory using all 104386 unique phosphorylation sites from the PhosphoSitePlus database. In contrast, manually running the web-based version of BLAST and then recording the results might take five minutes for a single peptide, or over 8,000 hours of labour for all of these known sites. Even the time taken to manually process a small subset of the PhosphoSitePlus data—say, 800 peptides, which was approximately what was used in Jalal et al. (2009)—is around 66 hours, exceeding the time required for DAPPLE to process the entire dataset.
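The labour estimate follows from simple arithmetic, shown here as a short Python sketch. The five-minutes-per-peptide figure is the rough estimate given above, and the function name manual_hours is hypothetical.

```python
MINUTES_PER_PEPTIDE = 5  # rough manual estimate quoted in the text


def manual_hours(n_peptides):
    """Hours of manual labour needed to process n_peptides one at a time."""
    return n_peptides * MINUTES_PER_PEPTIDE / 60


all_sites_hours = manual_hours(104386)  # about 8,699 hours
subset_hours = manual_hours(800)        # about 66.7 hours
```

Both figures exceed the 63 hours of elapsed time reported for the automated run over the full dataset.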
Whereas manually processing 800 peptides would result in a few hundred peptides to choose from for a kinome microarray, the amount of useful information produced by DAPPLE is far greater. For instance, DAPPLE outputs more than 29000 peptides in the cow that have zero mismatches with a known phosphorylation site and for which “RBH?”=“yes”. Downstream selection criteria can therefore be much more restrictive.
The superiority of the orthologue detection procedure employed by DAPPLE can be illustrated using the following example. The human protein with accession number Q9NV56 has the annotation “MRG-binding protein”. A known phosphorylation site from this protein has, as its best match in the bovine proteome, a segment of the protein with accession number E1BHM1, which has the description “C13H20orf20 hypothetical protein LOC616297”. These two proteins are reciprocal BLAST hits and thus orthologues—a fact that would be difficult to ascertain by comparing the annotations. The use of reciprocal BLAST hits also eliminates the subjectivity inherent in comparing annotations. For instance, the two annotations “Guanylyl cyclase-activating protein 2” and “GUCA1B Uncharacterized protein” appear similar, but the two proteins corresponding to the above annotations are not reciprocal BLAST hits. DAPPLE's orthologue detection procedure also has the advantage that the output can easily be filtered so that peptides for which “RBH?”=“no” are eliminated, saving the user a great deal of time comparing annotations.
As kinome microarrays become a more popular tool for studying cellular signaling, the ability to design kinome microarrays suitable for studying different species will become increasingly important. DAPPLE improves upon an already-successful method for designing kinome microarrays. Compared to the previous protocol, it is far less time-consuming and tedious, yet is able to make use of 100 times more information. Through its use of all known phosphorylation sites in the PhosphoSitePlus database, rather than just those from human, DAPPLE is more robust and thorough. Finally, the program greatly improves the ability to identify non-orthologous matches. As such, DAPPLE will be a useful tool for designing species-specific kinome microarrays.
The following and
Let K denote the set of known phosphorylation sites. These could be derived from one or more of the following sources: PhosphoSitePlus [Hornbeck et al., 2004], Phospho.ELM [Diella et al., 2004, 2008], PHOSIDA [Gnad et al., 2007], the literature, or any other source of known phosphorylation data. Let Q ∈ K be a known phosphorylation site (i.e. sequence of amino acids) from organism QO, QL be the length of Q, QA be the accession number of the full protein corresponding to Q, QF be the sequence of the full protein with accession number QA, QC be the site (residue name and position in QF; e.g. Y352) of the phosphorylated residue, QLTR be the number of low-throughput references associated with Q, and QHTR be the number of high-throughput references associated with Q. Finally, let T be the target organism (the organism for which the user wants to obtain putative phosphorylation sites).
Depending on the source of a given phosphorylation site, some information may not be available. In such cases the information is recorded in the DAPPLE output table as “ND” (“not determined”). For example, currently QLTR and QHTR are available only if Q is from the PhosphoSitePlus database.
DAPPLE performs the following procedure for each Q ∈ K. Referring now to
It will be understood that steps of DAPPLE do not necessarily have to be performed in the order shown in method 300 and according to various embodiments one or more steps may be performed out of order or omitted.
Step 302 Obtain Information from the Phosphorylation Database File.
Referring to
As shown in
Referring to
Use the makeblastdb program with TP 416 as input to create a BLAST database DTP (target proteome BLAST database 418), if DTP 418 does not already exist.
Referring to
Referring back to
The number of sequence differences U (sequence differences 438) is equal to QL 410 − I 412.
The number of non-conservative sequence differences V (non-conservative sequence differences 440) is equal to QL 410 − P 430.
The position of this residue can be calculated using the expression HS−QS+8. As mentioned above, HC 442 cannot be determined if QL<15.
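The three quantities U, V and the position underlying HC can be written as a short Python sketch. It assumes that I and P are the identity and positive counts reported by BLAST for the alignment, and that HS and QS are the 1-based start positions of the alignment in the hit protein and in the query. These readings are assumptions, and the function names are hypothetical.

```python
def sequence_differences(ql, identities):
    """U = QL - I: query residues not identically matched in the hit."""
    return ql - identities


def nonconservative_differences(ql, positives):
    """V = QL - P: query residues without a positive-scoring match."""
    return ql - positives


def hit_site_position(hs, qs, ql):
    """Position in the hit protein aligned with the central (8th) residue
    of a 15-residue query, or None when QL < 15."""
    if ql < 15:
        return None
    return hs - qs + 8
```

The expression HS−QS+8 holds because, in a gapless alignment, query position 8 lies 8−QS residues past the start of the alignment, which begins at position HS of the hit protein.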
Step 320 Determine the 9-Amino-Acid-Long Peptide Corresponding to Q 406 with the Phosphorylated Residue as its Central Residue.
The 9-amino-acid-long substring of Q 406 with the phosphorylated residue at its center, denoted Q9 (9-mer corresponding to query sequence 444), can be found by taking the substring between indices 4 and 12, inclusive. For example, if Q=ABCDEFGHIJKLMNO, then Q9=DEFGHIJKL.
Step 322 Determine the 9-Amino-Acid-Long Peptide Corresponding to H with the Phosphorylated Residue as its Central Residue.
The 9-amino-acid-long substring of H 424 with the phosphorylated residue at its center, denoted H9 (9-mer corresponding to hit sequence 446), can be found by taking the substring between indices (5−QS) and (13−QS), inclusive. For example, if H=CZEFGHIJKLMN and QS=3, then H9=ZEFGHIJKL. If H is less than nine residues long, then H9 cannot be computed, and neither can U9 (9-mer sequence differences 448) or V9 (9-mer non-conservative sequence differences 450) (see below).
The number of sequence differences U9 448 is the count of positions where the two residues are different in a gapless alignment between Q9 444 and H9 446. U9 448 cannot be determined if QL<15 or H is less than nine residues long.
The number of non-conservative sequence differences V9 450 is the count of positions where the two residues have a non-positive score in the BLOSUM62 matrix in a gapless alignment between Q9 444 and H9 446. V9 450 cannot be determined if QL<15 or H is less than nine residues long.
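The 9-mer steps can be sketched in Python using the 1-based index ranges given above. This is an illustration under stated assumptions, not DAPPLE's code: the function names are hypothetical, returning None when the window falls outside H is an assumed handling of the undetermined case, and the substitution scores are supplied by the caller through a score function rather than an embedded BLOSUM62 table.

```python
def centered_9mer_query(q):
    """Q9: residues 4..12 (1-based, inclusive) of a 15-residue query."""
    if len(q) < 15:
        return None
    return q[3:12]


def centered_9mer_hit(h, qs):
    """H9: residues (5-QS)..(13-QS) (1-based, inclusive) of the hit sequence."""
    start, end = 5 - qs, 13 - qs
    if len(h) < 9 or start < 1 or end > len(h):
        return None  # assumption: treat an out-of-range window as undetermined
    return h[start - 1:end]


def count_differences(q9, h9):
    """U9: positions at which the two gaplessly aligned 9-mers differ."""
    return sum(1 for a, b in zip(q9, h9) if a != b)


def count_nonconservative(q9, h9, score):
    """V9: positions whose substitution score is non-positive.

    score(a, b) should return the BLOSUM62 score for residues a and b.
    """
    return sum(1 for a, b in zip(q9, h9) if score(a, b) <= 0)
```

With the examples from the text, centered_9mer_query("ABCDEFGHIJKLMNO") gives DEFGHIJKL and centered_9mer_hit("CZEFGHIJKLMN", 3) gives ZEFGHIJKL, which differ at one position.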
QoP 452 may be downloaded from any online source of protein sequence data, such as GenBank, UniProt, or IPI.
Use the makeblastdb program with QoP 452 as input to create a BLAST database DQoP 454 (if DQoP 454 does not already exist and QoP 452 exists). If no proteome QoP 452 exists, then R 466, which denotes whether or not QF 413 and HF 436 are reciprocal BLAST hits (see steps 332-340), cannot be computed.
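The create-only-if-missing logic for the BLAST databases can be sketched in Python around the makeblastdb command line. The -in, -dbtype and -out options are standard makeblastdb options. The file names, the function names, and the test for an existing database (looking for a .pin or .pal file) are assumptions made for illustration.

```python
import os
import subprocess


def makeblastdb_command(fasta_path, db_name):
    """Argument list for building a protein BLAST database from a FASTA file."""
    return ["makeblastdb", "-in", fasta_path, "-dbtype", "prot", "-out", db_name]


def protein_db_exists(db_name):
    """Assumed check: a protein database leaves a .pin index (or a .pal alias)."""
    return any(os.path.exists(db_name + ext) for ext in (".pin", ".pal"))


def ensure_protein_db(fasta_path, db_name):
    """Build the database if needed; return False if no proteome is available."""
    if fasta_path is None or not os.path.exists(fasta_path):
        return False  # without a proteome, the reciprocal search cannot be run
    if not protein_db_exists(db_name):
        subprocess.run(makeblastdb_command(fasta_path, db_name), check=True)
    return True
```

A False return corresponds to the case in which R cannot be computed for phosphorylation sites from that organism.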
Variations of the methods include one or more of the following.
First, BLAST searches can be parallelized and the computer method (e.g. DAPPLE) can be run on a workstation cluster or computer grid to reduce its computational time.
Second, DAPPLE currently uses only the first match when running BLAST using a known phosphorylation site as the query. However, other matches might be of interest and could be used, especially if the full protein corresponding to one of these matches is orthologous to the full protein corresponding to the query.
Third, DAPPLE currently uses the BLOSUM62 substitution matrix to calculate non-conservative sequence differences. This could be improved by choosing the substitution matrix based on the evolutionary relatedness between the target organism and the organism corresponding to a given known phosphorylation site.
Comparison of ABC and DAPPLE
In ABC, the method for ascertaining orthology (or lack thereof) is based on the E-value of the match to the TO phosphorylation polypeptide sequence when the NTO phosphorylation polypeptide sequence is used as a query against the TO proteome. DAPPLE contains this information as part of its output, so the user can still use the ABC method of ascertaining orthology. DAPPLE additionally comprises a reciprocal BLAST hits method of ascertaining orthology. Table 2 above provides information gathered using a reciprocal BLAST search and the E-value method. The E-value method can be, for example, a more sensitive method of ascertaining orthology, and the RBH method can be more specific.
Peptides corresponding to the TO phosphorylation site sequences can, for example, be used to make species-specific arrays such as kinome arrays. Accordingly, in another aspect, the disclosure includes a method of making a plurality of species-specific isolated peptides comprising selecting a plurality of matching target organism phosphorylation site sequences according to the method described herein, and synthesizing a plurality of peptides, each peptide comprising a sequence of one of the matching target organism phosphorylation site sequences.
The arrays can be for any species, optionally other than for human, rat and mouse. Species-specific arrays designed using methods described herein can be used to address specific biological questions including economically important biological questions. For example, a chicken species-specific array is disclosed comprising a plurality of peptides identified using the methods disclosed herein. Use of such an array is demonstrated.
Temperature stresses which occur during the transport of poultry are important from the perspectives of animal welfare and meat quality. Hot and cold stresses negatively impact the quality of both breast and thigh meat. As the mechanisms of phenotypes cannot be fully explained through traditional biochemical indicators, we developed a tool, a chicken-specific kinome peptide array, to provide global insight into cellular signal transduction responses to temperature stress, including post-mortem activities, in chickens. Unique kinomic profiles are observed in breast and thigh tissues, reflecting their distinct cellular phenotypes. Against these backgrounds, in both breast and thigh tissues, greater changes are observed in response to cold stress than to heat stress, although the specifics of these responses differ in a tissue-specific manner. Metabolic pathways appear upregulated in thigh, and downregulated in breast, in response to cold stress in living birds. Post mortem time course analysis of these tissues from the temperature-stressed birds again verifies the greater impact of cold stress. Collectively, this investigation brings forth a valuable tool for characterizing cellular responses in chickens and provides specific information on the cellular mechanisms by which chickens respond to temperature stress.
Transportation of broiler chickens is a stressful, but essential, component of the poultry processing industry. The temperature fluctuations which can occur during transport are of significant consequence to the industry from the perspectives of both animal welfare and meat quality. Both heat and cold stress have been shown to compromise the quality of both breast and thigh meat. Previous research from our group has shown that breast and thigh meat with dark, firm and dry (DFD) characteristics can develop as a result of extreme cold exposure during transportation. DFD incidence in the breast and thigh muscles of cold-stressed birds is counted as a quality defect by the poultry meat industry and results in economic loss.
Furthermore, particularly in Canada, the number of birds dead on arrival (DOA) is often higher in winter, when natural ventilation in trailers is limited to maintain heat within the load. Paradoxically, this can lead to birds in the middle of the trailers experiencing heat stress while those near cold air ingress points must try to cope with the cold. The high DOA numbers in winter have both welfare and economic implications. The DOA value in Ontario for January 2009 was double the yearly national average, representing a loss of over 93,000 birds.
Recent work has shown that the incidence of dark, firm, dry (DFD) breast meat was up to 8% of broilers that experienced cold conditions during transport. The value was even higher in thigh meat which is more sensitive to transportation stresses than breast meat. As both of these meat cuts are of equal value in the marketplace, the resulting inconsistencies in color and eating quality from pale, soft, and exudative (PSE) and DFD can decrease consumer confidence. Heat-stress induced PSE meat is also of lower quality for further processing as the impaired protein functionality leads to poor water holding capacity, cook yield and textural properties.
Traditional metabolic investigations have failed to offer a clear explanation of the mechanisms behind the dramatic drop in core body temperature, survival of birds and incidence of DFD breast and thigh meat in broilers. Specifically, investigations have failed to identify clear mechanisms or markers that explain these responses to temperature stress in birds. This indicates that novel, likely global, approaches are required to understand these complex, multi-faceted host responses.
There is considerable debate over the most appropriate level at which to characterize cellular responses. Transcriptional analysis, owing to the experimental maturity of the approach and the relative ease with which arrays can be produced for novel species, is widely used, but there are concerns that descriptions of cellular responses at the level of transcription fail to accurately predict or describe cellular responses due to a multitude of post-transcriptional regulatory events. In contrast, protein post-translational modifications, in particular phosphorylation events, occur closer to the phenotype and are often more reliable indicators of phenotypes.
Peptide arrays have proven a valuable tool for high-throughput characterization of cellular kinase activity but have been limited to species with well-defined phosphoproteomes. The vast majority of characterized phosphorylation events are for human and mouse, which represents a significant obstacle to applying this approach to non-traditional research animals, including livestock.
The development of a chicken specific peptide array is described. This array consists of 292 peptides representing critical phosphorylation events associated with a broad spectrum of signaling pathways but with particular emphasis on pathways and processes associated with metabolic regulation. Application of these arrays revealed distinctive kinomic profiles associated with breast and thigh tissues and offered specific insight into the cellular changes which occur in these tissues upon exposure of birds to hot and cold stress including a time course investigation of changes which occur post mortem. In both breast and thigh tissues, greater changes are observed in response to cold, than heat, stress although the specifics of these responses differ in a tissue-specific manner. Metabolic pathways appear upregulated in thigh, and downregulated in breast, in responses to cold stress in living birds. Post mortem time course analysis of these tissues from the temperature stressed birds again verifies the greater impact of cold stress.
Peptide Arrays: Design, construction and application of the peptide arrays were based upon a previously reported protocol with modifications (Jalal et al., 2009). Notably, the kinome experiments for all the animals were performed simultaneously in a single run, minimizing the possibility of technical variance in the analysis. Briefly, approximately 10×10⁶ cells were collected, pelleted and lysed by addition of 100 μL lysis buffer (20 mM Tris-HCl pH 7.5, 150 mM NaCl, 1 mM EDTA, 1 mM EGTA, 1% Triton, 2.5 mM sodium pyrophosphate, 1 mM Na3VO4, 1 mM NaF, 1 μg/mL leupeptin, 1 μg/mL aprotinin, 1 mM PMSF) (all products from Sigma Aldrich unless indicated). Cells were incubated on ice for 10 minutes and spun in a microcentrifuge for 10 minutes at 4° C. A 70 μL aliquot of this supernatant was mixed with 10 μL of activation mix (50% glycerol, 500 μM ATP (New England Biolabs, Pickering, ON), 60 mM MgCl2, 0.05% v/v Brij-35, 0.25 mg/mL BSA) and incubated on the array for 2 hours at 37° C. Arrays were then washed with PBS-(1%) Triton.
Slides were submerged in phospho-specific fluorescent ProQ Diamond Phosphoprotein Stain (Invitrogen) with agitation for 1 hour. Arrays were then washed three times in destain containing 20% acetonitrile (EMD Biosciences, VWR distributor, Mississauga, ON) and 50 mM sodium acetate (Sigma) at pH 4.0 for 10 minutes. A final wash was done with distilled deionized H2O. Arrays were air dried for 20 min then centrifuged at 300×g for 2 minutes to remove any remaining moisture from the array. Arrays were read using a GenePix Professional 4200A microarray scanner (MDS Analytical Technologies, Toronto, ON) at 532-560 nm with a 580 nm filter to detect dye fluorescence. Images were collected using the GenePix 6.0 software (MDS), and the spot intensity signal was collected as the mean pixel intensity, using the local feature background intensity calculation with the default scanner saturation level.
Datasets:
The dataset contains the signal intensities associated with each of 292 peptides for the animals under different treatments. For each animal and each treatment, there are nine intra-array replicates. All data processing and analysis was done as per Li, et al. 2012, with the following study specifics.
Animal-Animal Variability Analysis:
For each of the 292 peptides, an F-test was used to determine whether there are significant differences among the three animals under the same treatment condition.
Treatment-Treatment Variability Analysis:
Peptides identified by the F-test as having consistent patterns of response to the various treatments across the three animals were subjected to a paired t-test to compare their signal intensities under a treatment condition with those under control conditions. For each animal-independent peptide, the responses from all three animals were pooled to increase the statistical confidence. Peptides with significant (p<0.10) changes in phosphorylation were identified. This level of significance was chosen to retain as much data as possible and thus facilitate subsequent pathway analysis.
Cluster Analysis:
The preprocessed data were subjected to hierarchical clustering and Principal Component Analysis (PCA) to cluster peptide response profiles across animal-treatment combinations. For each of the 292 peptides in a single treatment and animal, the average was taken over the nine VSN-transformed replicates. For hierarchical clustering, each animal/treatment vector was considered as a singleton (i.e. a cluster with a single element) at the initial stage of the clustering. The two most similar clusters were merged and the distances between the newly merged clusters and the remaining clusters were updated, iteratively. The “Average Linkage+(1−Pearson Correlation)” (Pearson 1996) is the method used, as described by Eisen et al. 1998. It takes the average over the merged (i.e. the most correlated) kinome profiles and updates the distances between the merged clusters and other clusters by recalculating the correlations between them. The first two principal components, namely PC1 and PC2, which account for the largest variability within the sample data, were used to cluster the animal/treatment data points.
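The merge-and-recompute clustering loop described above can be illustrated with a minimal pure-Python sketch. It is not the analysis code used in the study: the function names are hypothetical, each profile is assumed to be non-constant (so that the Pearson correlation is defined), and weighting the merged average by cluster size is an assumption about how the averaging is done.

```python
import math


def pearson(x, y):
    """Pearson correlation between two equal-length numeric vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)


def cluster(profiles):
    """Agglomerative clustering with distance 1 - Pearson correlation.

    profiles maps a sample label to its vector of peptide intensities.
    Returns the merges in order as (members_a, members_b, distance).
    """
    # Every sample starts as a singleton cluster keyed by a tuple of labels.
    clusters = {(label,): list(vec) for label, vec in profiles.items()}
    merges = []
    while len(clusters) > 1:
        keys = list(clusters)
        best = None
        for i in range(len(keys)):
            for j in range(i + 1, len(keys)):
                d = 1.0 - pearson(clusters[keys[i]], clusters[keys[j]])
                if best is None or d < best[0]:
                    best = (d, keys[i], keys[j])
        d, a, b = best
        wa, wb = len(a), len(b)
        # Replace the two clusters by their size-weighted average profile.
        merged = [(wa * u + wb * v) / (wa + wb)
                  for u, v in zip(clusters[a], clusters[b])]
        del clusters[a], clusters[b]
        clusters[a + b] = merged
        merges.append((a, b, d))
    return merges
```

Because distances are recomputed from the averaged profile after each merge, the sketch follows the description of recalculating correlations between the merged cluster and the remaining clusters.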
Pathway Analysis of Differentially Phosphorylated Peptides:
InnateDB is a publicly available resource which, based on levels of either differential expression or phosphorylation, predicts biological pathways from experimental fold-change datasets (Lynn et al., 2008). Pathways are assigned a probability value (p) based on the number of proteins present for a particular pathway as well as the degree to which they are differentially expressed or modified relative to a control condition. For our investigation, input data were limited to those peptides selected in the Treatment-Treatment Variability Analysis (above). Since InnateDB requires fold-change (FC) values as input (with p-values optional), the differences between the VSN-transformed intensities under control and treatment are converted to fold-change values by the formula FC = 2^d, where d = average_treatment − average_control.
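The fold-change conversion can be written as a short Python sketch. The function name fold_change is hypothetical, and the inputs are assumed to be the VSN-transformed replicate intensities for one peptide under each condition.

```python
from statistics import mean


def fold_change(treatment, control):
    """Convert a difference of mean transformed intensities to a fold change.

    d is the treatment average minus the control average, and the fold
    change is 2 raised to the power d.
    """
    d = mean(treatment) - mean(control)
    return 2.0 ** d
```

A difference of one unit on the transformed scale therefore corresponds to a two-fold change, and a difference of minus one to a fold change of 0.5.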
Development of a Chicken-Specific Peptide Array:
The chicken-specific peptide arrays were developed through a bioinformatics approach developed by our group termed “Design Array for Phosphorylation-Mediated Experiments (DAPPLE)”. DAPPLE uses genomic information from the species of interest, in this case chicken, as well as publicly available information on defined phosphoproteomes to predict phosphorylation sites within the species of interest. There is a moderate degree of conservation of phosphorylation sites between chickens and humans; approximately one quarter of the phosphorylation sites from human were perfectly conserved over a peptide of 15 amino acids (seven residues flanking each side of the phosphoacceptor site) [Table 4]. For the chicken array, 292 peptides were selected on the basis of conservation of the phosphorylation sites as well as interest in the associated biological events [Table 4]. For the final array, each peptide is printed in triplicate within each block and each block is printed in triplicate to provide nine technical replicates of each peptide.
Cellular Responses to Temperature Stress:
Groups of chickens (n=5) were exposed to either hot or cold stress. A control group of birds was maintained at room temperature. After the indicated time period, birds were sacrificed and tissue samples were collected from the breast and thigh for kinome analysis. Kinome data were processed through PIIKA, an in-house kinome peptide array data processing pipeline described in PCT/CA2011/000764, titled Methods for Kinome Analysis and filed Jun. 30, 2011, which is hereby incorporated by reference in its entirety.
Cluster analysis of the kinome data demonstrates an absolute tendency for the samples to segregate on the basis of tissue type. This suggests that the cellular phosphorylation-mediated signal transduction occurring within thigh and breast tissue is sufficiently distinct that the samples can be discriminated on the basis of the tissue of origin with a high degree of confidence [
A closer examination of the clustering results within the breast and thigh samples reveals a strong tendency for the heat-stressed and control samples to cluster together while the samples corresponding to the cold-stressed birds cluster distinctly. This occurred within both the breast and thigh samples. This suggests that the breast and thigh samples show a greater cellular response to cold stress than to heat stress [
Pathway Analysis:
To identify conserved cellular responses initiated in chicken breast and thigh muscle following exposure of the animal to hot and cold stress the responses across the five birds were averaged to generate a representative bird for each of the control, cold stressed and heat stressed conditions. Pathway overrepresentation analysis was then performed utilizing InnateDB ( ). Pathways were evaluated based upon the p values for confidence that the pathway is differentially influenced (activated or repressed) under the treatment condition relative to the control as well as the number of differentially phosphorylated peptides within the pathway that supported the involvement of the pathway.
Within the breast and thigh tissues of birds exposed to the heat stress there was a greater number of pathways which were found to be activated rather than repressed [Table 5A]. Within this general common trend there was a unique complement of pathways which were activated within each tissue. Within breast tissue, heat stress resulted in the activation of a number of calcium regulated events including phosphorylation of CREB through CAMKII as well as calmodulin dependent kinase activation. PGC-1 alpha is a transcriptional coactivator that regulates the genes involved in energy metabolism. This protein interacts with the nuclear receptor PPAR-γ, which permits the interaction of this protein with multiple transcription factors. This protein can interact with, and regulate the activities of, cAMP response element-binding protein (CREB) and nuclear respiratory factors (NRFs). It provides a direct link between external physiological stimuli and the regulation of mitochondrial biogenesis.
Within thigh tissues heat stress activates a distinct complement of pathways which are also involved in metabolic events including adipocytokine signaling, insulin signaling and mTOR activation [Table 5A]. The mammalian target of rapamycin (mTOR), an evolutionarily conserved serine-threonine kinase, promotes anabolic cellular processes such as protein synthesis in response to growth factors, nutrients (amino acids and glucose), and stress (Biondi et al., 2004; Wullschleger et al., 2006).
In response to cold stress, the responses of breast and thigh are more divergent. In breast tissue, cold stress results in downregulation of a number of pathways associated with metabolic activity, including insulin receptor signaling and leptin induced signaling [Table 5B]. In contrast, the cold stress induced responses in thigh tissues are associated with greater metabolic activity (carbohydrate digestion and absorption) as well as activation of cell cycle regulation and stress responses [Table 5B].
AMPK:
The importance of understanding thermal stress at the level of phosphorylation-mediated signal transduction activity is supported by the observations that a number of kinases have been specifically implicated in responses to thermal stress. For example, AMP-activated protein kinase (AMPK), which is subject to regulation through phosphorylation, serves to increase the rate of glycolysis. Several studies with a mouse model have also shown that a decrease in AMPK activity resulted in a slower rate of glycolysis due to a slower release of glucose residues from the glycogen stores, resulting in a higher ultimate pH (Shen and Du, 2005). Thus, AMPK has a very important role in both living tissue and in postmortem events. To date, only one study (Sibut et al., 2008) has looked at AMPK activity in poultry, but its results were opposite to those observed in work with rats or pigs.
Livestock researchers are faced with highly complex biological problems and are often disadvantaged by an absence of cutting edge research technologies. As a disproportionate amount of research is devoted towards humans and mice, the traditional species of laboratory investigations, so too are the available research tools. Unfortunately, the species-specificity of many of these tools limits their application to investigations of other species. For example, there is an ongoing trend within the field of human medicine to monitor and influence cellular responses at the level of phosphorylation-mediated signal transduction. These phosphorylation reactions are mediated by a class of enzymes called kinases. Kinome analysis, as it has been dubbed, is proving a highly effective strategy for understanding complex biological responses.
Unfortunately, the species-specificity of the kinome research tools has made it exceedingly difficult to apply this perspective to investigations of livestock. To address this limitation our group developed a protocol which enables creation of peptide arrays for kinome analysis of non-traditional animal species (Jalal et al., 2009). The genome sequence of the species of interest is the only required prerequisite information. Since then, peptide arrays for cattle, described herein, and honeybees have been developed (for example see PCT [BEE ARRAY] herein incorporated by reference in its entirety). These arrays will prove to be highly valuable and cost-effective tools in investigations of production-limiting diseases and/or phenotypes of priority to these industries.
The immediate application of these arrays is to understand cellular changes associated with transport and post-slaughter events, for example by describing patterns of signal transduction resulting from hot and cold stresses as well as cellular changes which occur post mortem.
Preslaughter transport and handling could increase stress on the birds by decreasing muscle glycogen reserves and therefore affecting the rate and extent of pH decrease, which could affect the resultant meat quality (Owens and Sams 2000; Debut et al., 2003). It is reported that preslaughter temperature affects the postmortem metabolism of muscle via adrenal or other physiological responses or simply by fatigue of the birds (Petracci et al., 2001). Preslaughter heat stress has been reported to accelerate the rate and extent of rigor mortis development (Sams 1999), postmortem glycolysis, and postmortem metabolism and biochemical changes in the muscle, resulting in undesirable changes in meat characteristics similar to the pale, soft, and exudative (PSE) condition (McKee and Sams 1997; Sams 1999; Sandercock et al., 2001). Exposure of chickens to heat stress before slaughter results in breast meat with lower ultimate pH (pHu; Holm and Fletcher 1997; Sandercock et al., 1999), reduced water-binding capacity (WBC; Sandercock et al., 1999; Petracci et al., 2001), and reduced tenderness (Froning et al., 1978; Holm and Fletcher 1997; Petracci et al., 2001). On the other hand, a cold environment before slaughter also causes stress to the bird and may affect meat quality, resulting in meat with dark, firm, dry (DFD) characteristics (Dadgar et al., 2010).
The first column indicates the number of sequence differences between a known phosphorylation site from the PhosphoSitePlus database and its best match in the chicken proteome. The second column represents, for all sites in this database, the percentage that had that number of sequence differences. The third column represents the percentage of peptides actually chosen for inclusion on the array having a given number of sequence differences.
Accordingly, as demonstrated above with the chicken species array, peptides corresponding to the TO phosphorylation site sequences can for example be used to make species-specific arrays such as kinome arrays. Accordingly, in another aspect, the disclosure includes a method of making a plurality of species-specific isolated peptides comprising selecting a plurality of matching target organism phosphorylation site sequences according to the method described herein, and synthesizing a plurality of peptides, each peptide comprising a sequence of one of the matching target organism phosphorylation site sequences.
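The counts summarized in the table can be illustrated with a short sketch. It assumes a simple ungapped sliding-window comparison, which is a simplification and not necessarily the search used to build the table; the function names and toy sequences are hypothetical.

```python
from collections import Counter


def count_differences(a, b):
    """Number of mismatched positions between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(1 for x, y in zip(a, b) if x != y)


def best_match(site, protein):
    """Closest ungapped window of a protein to a known site sequence.

    Returns (differences, offset, window), or None if the protein is
    shorter than the site. Ties keep the first window found.
    """
    n = len(site)
    best = None
    for offset in range(len(protein) - n + 1):
        window = protein[offset:offset + n]
        d = count_differences(site, window)
        if best is None or d < best[0]:
            best = (d, offset, window)
    return best


def difference_distribution(difference_counts):
    """Percentage of sites having each number of sequence differences."""
    tally = Counter(difference_counts)
    total = len(difference_counts)
    return {d: 100 * c / total for d, c in sorted(tally.items())}
```

Applying `best_match` to every known site and passing the resulting difference counts to `difference_distribution` would give percentages of the kind reported in the second column.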
In another aspect, the disclosure includes a method of making a species-specific array comprising selecting a plurality of matching target organism phosphorylation site sequences according to the method described herein, synthesizing a plurality of peptides each peptide comprising a sequence of one of the matching target organism phosphorylation site sequences and attaching the plurality of peptides to a substrate surface.
In an embodiment, the method is for making a plurality of bovine specific peptides and/or a bovine specific array. In another embodiment, the method is for making a plurality of chicken specific peptides and/or a chicken specific array.
The methods were used to identify a number of bovine and chicken specific peptides and design a bovine specific and a chicken specific array. Accordingly the plurality of peptides and/or array can be determined or designed for any species for which proteome sequence exists. A bee species-specific peptide array and uses thereof is described in PCT/IB2012/001254 filed Jun. 24, 2012 titled METHODS AND COMPOSITIONS FOR CHARACTERIZING PHENOTYPES USING KINOME ANALYSIS, which is hereby incorporated by reference.
Species-specific isolated peptides and species-specific arrays are useful for identifying economically important traits. For example, the chicken species-specific array was demonstrated to be useful for probing responses to shipping stress and could be used to identify markers associated with desirable traits (e.g. increased resistance to shipping stress). The arrays can be used to obtain phosphorylation profiles and for classifying chickens with desirable characteristics.
Accordingly, in other aspects, the disclosure includes an isolated peptide whose sequence is identified using a method described herein, a plurality of said peptides (for example a plurality of isolated peptides) that are specific for a species and a species-specific array comprising a plurality of peptides attached to a substrate surface, each peptide comprising a sequence of one of a matching target organism phosphorylation site sequence selected according to a method described herein, wherein the similarity corresponds to or is below a preselected threshold.
In an embodiment, the isolated peptide comprises an isolated chicken peptide (e.g. peptides found in chicken). In another embodiment, the plurality of peptides is a plurality of chicken species peptides. In an embodiment, the array is a chicken specific array.
In an embodiment, each isolated peptide comprises a sequence of about 5 to about 100 amino acids, for example about 5 to about 50 amino acids or about 5 to about 30 amino acids, optionally wherein the sequence comprises a contiguous sequence present in a peptide sequence selected from the group of SEQ ID NOs: 1 to 292, said contiguous sequence comprising a chicken phosphorylation site sequence. For example, each of the sequences in Table 6 (SEQ ID NOs: 1-292) comprises a chicken phosphorylation site sequence. The isolated peptide for example comprises minimally about 6 amino acids and the portion of a sequence in Table 6 that comprises said phosphorylation site sequence.
Each peptide for example comprises at least one serine, threonine or tyrosine amino acid residue.
Each of the peptides comprising sequences selected from Table 6 can, for example, comprise 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29 or 30 or more amino acids. For example, if SEQ ID NO:1 is selected, the peptide can comprise 8, 9, 10, 11, 12, 13, 14 or 15 amino acids of SEQ ID NO:1 as long as the phosphorylation site is included. Preferably, the phosphorylation site is centered or about centered in the peptide length selected. Typical phosphorylatable amino acids include serine, threonine and tyrosine residues.
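Selecting a shorter peptide with the phosphorylation site centered or about centered can be sketched as below. This is an illustrative sketch only; the function name is hypothetical and the example sequence is a made-up string, not a sequence from Table 6.

```python
PHOSPHO_RESIDUES = {"S", "T", "Y"}


def centered_peptide(sequence, site_index, length):
    """Return a window of `length` residues with the site about centered.

    site_index is the 0-based position of the phosphorylatable residue.
    Near the ends of the sequence the window is shifted so that it stays
    inside the sequence while still including the site.
    """
    if not 0 <= site_index < len(sequence):
        raise ValueError("site_index is outside the sequence")
    if sequence[site_index] not in PHOSPHO_RESIDUES:
        raise ValueError("residue at site_index is not S, T or Y")
    if not 0 < length <= len(sequence):
        raise ValueError("length must be between 1 and the sequence length")
    start = site_index - (length - 1) // 2
    start = max(0, min(start, len(sequence) - length))
    return sequence[start:start + length]
```

For a hypothetical 15-residue sequence `"ACDEFGHSIKLMNPQ"` with the serine at index 7, `centered_peptide("ACDEFGHSIKLMNPQ", 7, 9)` yields `"EFGHSIKLM"`, with the serine at the middle position.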
The peptides can also for example comprise linkers (e.g. flexible linkers) or other sequence not present in the surrounding sequence, for example for attaching to a support surface.
In another aspect, the disclosure includes a plurality of peptides (e.g. a collection), each peptide comprising a sequence of about 5 to about 100 amino acids, for example about 5 to about 50 amino acids or about 5 to about 30 amino acids, optionally wherein the sequence comprises a contiguous sequence present in an amino acid sequence selected from the group of SEQ ID NOs: 1 to 292, said contiguous sequence comprising a chicken phosphorylation site sequence.
In an embodiment, the plurality of peptides comprises at least 25 peptides, at least 50 peptides, at least 100 peptides, at least 200 peptides, at least 300 peptides, at least 400 peptides, at least 500 peptides or at least 1000 peptides or any number in between. In an embodiment, each peptide has a sequence of a matching target organism phosphorylation site sequence.
In an embodiment, the plurality of peptides comprises a subset (e.g. two or more) of the peptides or parts thereof (the parts comprising a chicken phosphorylation site sequence) listed in Table 6, for example, about 5, 10, 15, 20, 25, 50, 75, 100, 125, 150, 175, 200, 225, 250, 275 or 292 of the peptides listed in Table 6. In an embodiment, the plurality of peptides comprises a subset (e.g. 2 or more) of the peptides listed in Table 6. In a further embodiment, the plurality of peptides comprises the set of peptides in Table 6.
Each of the plurality of peptides is for example an isolated peptide, for example an isolated synthetic peptide chemically synthesized using for example commercially available methods and equipment. Methods of synthesizing peptides are well known in the art, and include for example liquid phase peptide synthesis and solid phase peptide synthesis (SPPS), including for example Fmoc SPPS and Boc SPPS.
In another embodiment, the plurality of peptides (e.g. also referred to as peptide targets) is attached to a support surface, each peptide comprising a sequence of a chicken phosphorylation site sequence selected for example according to a method described herein, wherein the similarity is below a preselected threshold.
Additional chicken specific sequences (e.g. not listed in Table 6) identified using the described methods can also be included in the plurality of peptides. Further specific subsets of phosphorylation targets can be selected for inclusion in the plurality.
A further aspect includes a composition comprising one or more peptides listed in Table 6 and a diluent. The peptide can for example be attached to a bead or spotted on a slide and can for example be used in methods described herein. In an embodiment, the composition comprises 1 to 292 peptides listed in Table 6, or any number of peptides between 1 and 292.
In another aspect, the disclosure includes an array comprising a plurality of peptides. In an embodiment, the array comprises a plurality of peptides, each comprising an amino acid sequence of about 5 to about 100 amino acids, for example about 5 to about 50 amino acids or about 5 to about 30 amino acids, optionally wherein the sequence comprises a contiguous sequence present in an amino acid sequence selected from the group of SEQ ID NOs: 1 to 292, said contiguous sequence comprising a chicken phosphorylation site sequence.
Generally, since the peptide molecules are typically pre-formed and spotted onto the support as intact molecules, they are comprised of 5 or more amino acids, and are peptides, polypeptides or proteins. For the most part, the peptide molecules in the present arrays comprise about 5 to 100 amino acids, for example 5 to 50 amino acids, preferably about 5 to 30 amino acids. A phosphorylation motif comprises for example 4 amino acids. The amino acids forming all or a part of a peptide molecule may be any of the twenty conventional, naturally occurring amino acids, i.e., alanine (A), cysteine (C), aspartic acid (D), glutamic acid (E), phenylalanine (F), glycine (G), histidine (H), isoleucine (I), lysine (K), leucine (L), methionine (M), asparagine (N), proline (P), glutamine (Q), arginine (R), serine (S), threonine (T), valine (V), tryptophan (W), and tyrosine (Y).
In an embodiment, each of the plurality of peptides of the array comprises a sequence that is about 8 to about 15 amino acids of a peptide sequence selected from SEQ ID NO: 1-292.
In an embodiment, the peptide array comprises at least 2 peptides, at least 3 peptides, at least 4 peptides, at least 5 peptides, at least 25 peptides, at least 50 peptides, at least 100 peptides, at least 200 peptides, at least 300 peptides, at least 400, at least 500 or at least 1000 or any number in between 2 and 1000. Each peptide is optionally spotted in at least two replicates, or at least 3 replicates per array, optionally as replicate blocks. For example, the peptide can be spotted 4, 5, 6, 7, 8 or 9 times or more, for example up to 15 replicates.
In another embodiment, the array comprises a plurality of peptides each peptide comprising a peptide sequence selected from the group listed in Table 6.
Each peptide (e.g. target peptide) corresponds to a protein which can be identified for example by an accession number.
Subsets of the plurality of peptides can be selected for inclusion on the array. For example, depending on the dataset to be obtained, the plurality of peptides can comprise peptides with known phosphorylation motifs, optionally phosphorylation motifs for proteins that are found in a signaling pathway or related pathways. For example, as indicated for the chicken specific array, peptides corresponding to proteins involved in metabolic pathways were selected.
The plurality of peptides can also comprise for example peptide sequences of a selected group of molecules, for example proteins involved in immune responses, specific signaling cascades or can be related molecules, e.g. sharing a particular sequence identity.
Such peptide arrays can be useful for deciphering peptides phosphorylated or signaling pathways activated by a stressor such as a physical treatment (e.g. cold/hot stress), an infectious agent or a macromolecule. Alternatively, the peptide array can comprise random peptide sequences comprising putative phosphorylation sites wherein the plurality of peptides or a subset thereof comprises at least one of a serine, threonine or tyrosine residue.
In an embodiment, the array further comprises a negative control peptide and/or a positive control peptide. In an embodiment, the negative control peptides do not contain any Ser, Thr or Tyr residues. Positive control peptides could include for example peptides comprising phosphorylation sites of histones 1 through 4, bovine myelin basic protein (MBP), and/or α/β casein. Alternatively, the peptides can be either random sequences (e.g. control peptide), not necessarily always containing a Ser/Thr or Tyr, or represent known or predicted phosphorylation sites (e.g. peptides comprising Ser/Thr or Tyr residues).
In an embodiment the control peptide is selected according to a selected test condition. For example, a negative control could be an irrelevant peptide sequence optionally containing a T, Y, or S amino acid at the centre position. A positive control could be for example a peptide corresponding to a protein known to be phosphorylated by a given treatment in the experiment. The positive controls can be any length for example, they can be full length proteins.
Any of the non-phosphorylation site amino acids in the peptide molecules may be replaced by a non-conventional amino acid. In general, conservative replacements are preferred. Conservative replacements substitute the original amino acid with a non-conventional amino acid that resembles the original in one or more of its characteristic properties (e.g., charge, hydrophobicity, steric bulk; for example, one may replace Val with Nval). The term “non-conventional amino acid” refers to amino acids other than conventional amino acids, and includes, for example, isomers and modifications of the conventional amino acids, e.g., D-amino acids, non-protein amino acids, post-translationally modified amino acids, enzymatically modified amino acids, constructs or structures designed to mimic amino acids (e.g., α,α-disubstituted amino acids, N-alkyl amino acids, lactic acid, β-alanine, naphthylalanine, 3-pyridylalanine, 4-hydroxyproline, O-phosphoserine, N-acetylserine, N-formylmethionine, 3-methylhistidine, 5-hydroxylysine, and nor-leucine). The peptidic molecules may also contain nonpeptidic backbone linkages, wherein the naturally occurring amide —CONH— linkage is replaced at one or more sites within the peptide backbone with a non-conventional linkage such as N-substituted amide, ester, thioamide, retropeptide (—NHCO—), retrothioamide (—NHCS—), sulfonamido (—SO2NH—), and/or peptoid (N-substituted glycine) linkages. Accordingly, the peptide molecules of the array include pseudopeptides and peptidomimetics. The peptides can be (a) naturally occurring, (b) produced by chemical synthesis, (c) produced by recombinant DNA technology, (d) produced by biochemical or enzymatic fragmentation of larger molecules, (e) produced by methods resulting from a combination of methods (a) through (d) listed above, or (f) produced by any other means for producing peptides.
A peptide can for example comprise up to 1, 2, 3, 4, or up to 5 conservative changes for every 15 amino acids of sequence. For example, each peptide can comprise at least 70%, 75%, 80%, 85%, 90% or 95% sequence identity with a peptide selected from Table 6.
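As a worked illustration of the identity figures, the sketch below computes percent identity between two equal-length, ungapped peptides. The ungapped comparison and the function name are assumptions made for illustration; the example peptides are hypothetical and not taken from Table 6.

```python
def percent_identity(a, b):
    """Percent of positions that are identical in two equal-length peptides."""
    if len(a) != len(b) or not a:
        raise ValueError("peptides must be non-empty and of equal length")
    matches = sum(1 for x, y in zip(a, b) if x == y)
    return 100 * matches / len(a)
```

Under this measure, a 15-residue peptide carrying 3 changes relative to its reference has 12 of 15 positions unchanged, that is, 80% identity.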
The chicken specific array can be used to measure protein kinase activity in a chicken sample, for example for analyzing cellular signaling events, for example under test conditions. The array enables for example investigation of phosphorylation-mediated signal transduction activity in a sample from a chicken and can be used to identify biomarkers for marker assisted selection and/or to understand some of the biology associated with particular phenotypes. For example, the arrays can be used to identify chicken phenotypes that have increased tolerance to a stressor and/or to identify strategies that reduce stress response. For example, it is demonstrated that signaling changes occur upon cold and heat stress in chickens. Chickens exhibit differences in cellular signalling pathways discernable using an array comprising chicken specific peptides comprising known or putative phosphorylation sites. The arrays can be used to identify conditions that minimize the stress for example what time and temperatures can be tolerated with minimal stress induction. The methods can also be used to identify phenotypes that are more resistant to a stressor. For example, the profiles obtained for a specific phenotype are reproducible and specific profiles can be obtained for use in identifying chickens of unknown or otherwise unconfirmed characteristics. Chickens having the desired phenotype can then be cross-bred according to the desired traits.
For example the technology can be applied to chicken breeding programs and used to identify phenotypes of interest for example susceptibility/resistance to pathogenic organisms and/or cellular responses to infection or stressors.
A further aspect comprises a method of determining a phosphorylation profile of a test sample comprising:
a) incubating a species-specific array comprising a plurality of peptides, wherein the plurality of peptides are selected according to a method described herein, with the test sample to provide a test array and optionally incubating a second array with a comparator sample such as a control sample or a second test sample to provide a comparator array; and
b) measuring a phosphorylation level signal intensity for each of the plurality of peptides for the test array and optionally the comparator array, the phosphorylation level signal intensity resulting from the interaction of the sample with each of the plurality of peptides;
to provide the phosphorylation profile.
In an embodiment, the phosphorylation profile comprises a plurality of data values, for example, each value representing a phosphorylation level of a peptide and/or the direction of change (e.g. an indication of increased or decreased phosphorylation level of one or more of the plurality of peptides on the test array compared to the comparator array or internal control) and/or the magnitude of said increase or decrease.
The increase or decrease can for example be relative to an internal control or controls, e.g. relative to background. Alternatively, the increase or decrease can be relative to a comparator array such as a control array contacted with a suitable control sample or a different test sample, e.g. which is treated differently or comprises a different test subject.
In an embodiment, the method for determining a phosphorylation profile for a sample, optionally from a subject, comprises the steps of: a) incubating a sample optionally obtained from said subject with ATP and/or other suitable ATP source and a plurality of peptides, for example, wherein each of the plurality comprises a sequence of about 5 to about 100 amino acids, for example about 5 to about 50 amino acids or about 5 to about 30 amino acids, wherein the sequence comprises a contiguous sequence present in a peptide sequence selected from Table 6, wherein said contiguous sequence comprises a chicken phosphorylation site sequence; and b) measuring for each peptide a phosphorylation level signal intensity resulting from the interaction of the sample with the plurality of peptides, thereby providing a phosphorylation profile for the sample.
In an embodiment, the method further comprises calculating the direction and/or magnitude of change compared to an internal control or a comparator array.
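A phosphorylation profile holding, for each peptide, a level, a direction of change and a magnitude of change relative to a comparator array can be sketched as follows. The log2 ratio used for magnitude, the dictionary layout, and the peptide identifiers are assumptions made for illustration.

```python
from math import log2


def phosphorylation_profile(test, comparator):
    """Build a per-peptide profile from two arrays.

    test, comparator: dicts mapping a peptide identifier to a positive
    signal intensity. Magnitude is expressed as a log2 ratio, which is an
    illustrative choice rather than a requirement.
    """
    profile = {}
    for peptide, test_signal in test.items():
        log_ratio = log2(test_signal / comparator[peptide])
        if log_ratio > 0:
            direction = "increased"
        elif log_ratio < 0:
            direction = "decreased"
        else:
            direction = "unchanged"
        profile[peptide] = {
            "level": test_signal,
            "direction": direction,
            "log2_fold_change": log_ratio,
        }
    return profile
```

For example, a peptide measured at 800 on the test array and 200 on the comparator array would be recorded as increased with a log2 fold change of 2.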
In an embodiment, the sample is from a subject and the method further comprises first obtaining a sample from the subject.
The plurality of peptides incubated with the sample can for example be any plurality of peptides described herein, including for example peptides attached to a solid support such as in an array. Accordingly in an embodiment, the plurality of peptides is comprised in an array described herein.
In another embodiment, the plurality of peptides is comprised in a composition that is contacted with ATP and/or other suitable ATP source and the level of phosphorylation is detected by a method known in the art. For example, the composition can be separated electrophoretically and probed with a phosphospecific antibody, or visualized using labeled ATP or a phospho-specific stain. Slot blots, immunohistochemistry and the like can also be used. This method can be used for example when a subset of peptides and/or corresponding proteins is being assessed, for example about 2, 3, 4, 5, 6 to 10, 11-15 or more peptides or corresponding proteins.
A compound that functions as ATP can also be used instead of ATP in the methods described. For example, other suitable ATP sources such as ATP analogs can be used. GTP can also be used in place of ATP or ATP source.
The sample from the subject can alternatively be a cell sample from a cell line, for example treated with a stressor.
Kinotyping can be used for identifying cell, tissue and organism level phenotypes. Accordingly, in an embodiment, an array comprising a plurality of peptides or parts thereof selected from Table 6 is used to identify a phenotype at the chicken cell, chicken tissue or chicken organism level.
In an embodiment, the method comprises: a) determining a detectable phosphorylation profile of a sample obtained from the subject, said phosphorylation profile resulting from the interaction of said sample with a plurality of peptides described herein; b) comparing said phosphorylation profile to one or more reference phosphorylation profiles, each reference phosphorylation profile corresponding to a known phenotype and c) classifying the subject according to the probability of said phosphorylation profile falling within a class defined by said reference phosphorylation profile.
In an embodiment, the method for classifying a subject for example as having or not having a phenotype, comprises a) obtaining a sample of the subject; b) incubating said sample with ATP and/or other suitable ATP source and a plurality of peptides, for example comprising sequences or parts thereof selected from Table 6 and/or other peptides, each peptide comprising a phosphorylation site sequence; and c) determining a detectable phosphorylation profile, said phosphorylation profile resulting from the interaction of the sample with the plurality of peptides; d) comparing said phosphorylation profile to one or more reference phosphorylation profiles of a known phenotype (e.g. one or more phenotype reference phosphorylation profiles); wherein a difference or a similarity in the phosphorylation profile of the plurality of peptides between the sample and said one or more reference phosphorylation profiles is used to classify the subject for example as having or not having the phenotype.
In an embodiment, the similarity is assessed by calculating a measure of similarity.
The subject is identified as having or likely having the phenotype of the phenotype reference phosphorylation profile most similar to said subject phosphorylation profile. For example, if a subject has a higher similarity to a first phenotype reference phosphorylation profile, the subject is identified as having said first phenotype; if a subject has a higher similarity to a second phenotype reference phosphorylation profile, the subject is identified as having said second phenotype. The phosphorylation levels can also be used to determine a threshold, wherein if a subject is above or below a threshold, the subject is identified as having the phenotype corresponding to above or below the threshold.
In an embodiment, the method of classifying a subject comprises: (i) calculating a first measure of similarity between a first phosphorylation profile, said first phosphorylation profile comprising the phosphorylation levels of a plurality of peptides described herein, in a cell sample taken from said subject and a first phenotype reference phosphorylation profile, said first phenotype reference phosphorylation profile comprising phosphorylation levels of said plurality of peptides that are for example, average levels of said respective peptides in cells of a plurality of subjects having said first phenotype; and (ii) classifying said subject as having the first phenotype if said first phosphorylation profile has a similarity to said first phenotype reference phosphorylation profile that is above a predetermined threshold, and classifying said subject as not having said first phenotype if said first phosphorylation profile has a similarity to said first phenotype reference phosphorylation profile that is below a predetermined threshold.
In an embodiment, step (i) further comprises: calculating a second measure of similarity between said first phosphorylation profile and a second phenotype reference phosphorylation profile, said second phenotype reference phosphorylation profile comprising phosphorylation levels of said plurality of peptides that are average phosphorylation levels of the respective peptides in cells of a plurality of subjects having said second phenotype; and classifying said subject as having said second phenotype if said first phosphorylation profile has a similarity to said first phenotype reference phosphorylation profile that is below a predetermined threshold and said first phosphorylation profile has a similarity to said second phenotype reference phosphorylation profile that is above a predetermined threshold.
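The threshold-based classification described in steps (i) and (ii) can be sketched as below. Pearson correlation is used here as the measure of similarity purely as an assumption, since the text does not fix a particular measure; the function names, the single shared threshold and the handling of values exactly at the threshold are likewise illustrative.

```python
from math import sqrt


def mean_profile(profiles):
    """Average phosphorylation level per peptide across several subjects."""
    return [sum(values) / len(values) for values in zip(*profiles)]


def pearson(x, y):
    """Pearson correlation between two equal-length profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def classify(profile, first_reference, second_reference, threshold):
    """Assign a phenotype label from similarity to two reference profiles."""
    if pearson(profile, first_reference) > threshold:
        return "first phenotype"
    # Reaching this point means similarity to the first reference is not above the threshold.
    if pearson(profile, second_reference) > threshold:
        return "second phenotype"
    return "unclassified"
```

Each reference profile would typically be built with `mean_profile` from subjects of known phenotype, and a subject resembling neither reference closely enough is left unclassified.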
Similarity can be determined for example using clustering analysis.
Similarity can also be determined by calculating a similarity score or threshold.
In a further embodiment, the method includes displaying, or outputting to a user interface device, a computer-readable storage medium, or a local or remote computer system, the classification produced by said classifying step.
The phosphorylation profile can be determined using known methods for example methods for array analysis. In particular, the phosphorylation profile can be determined using methods described in PCT/CA2011/000764 titled Methods for Kinome Analysis filed Jun. 30, 2011, which is hereby incorporated by reference in its entirety.
PCT/CA2011/000764 describes, for example, that the signal intensities measuring specific phosphorylation events of the peptides on a kinome array are subjected to a variance stabilization transformation to bring all the data onto the same scale while alleviating variance-mean dependence. Spot-spot and subject-subject variability are examined using χ² and F-tests to identify and eliminate peptides that are inconsistently regulated due to technical and biological factors of the experiments, respectively. A one-sided paired t-test is used to identify differentially phosphorylated peptides relative to the control from the preprocessed kinome data. The information from the differential peptides can be used to probe gene ontology (GO) annotations and known signal transduction pathways from online databases to discover treatment-specific cellular events from various biological aspects. For comparative visualization of the global kinome profiles induced by selected stimuli, hierarchical clustering and principal component analysis are applied to the data after averaging the replicate intensities. The results from the differential analyses and clustering are compared to draw further insights from the data and/or to classify subjects. The results can be presented for example in pseudo-images generated based on the p-values from the one-sided t-tests for phosphorylation or dephosphorylation of each peptide. Each peptide is represented for example by one small colored circle. The depth of coloration, for example in red and green, is inversely related to the corresponding p-value.
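Two of the steps summarized above, the transformation and the paired t-test, can be sketched in simplified form. This is not the PIIKA implementation: the inverse hyperbolic sine is used here only as a simple stand-in for a variance stabilization transformation, and the function names are hypothetical.

```python
from math import asinh, sqrt
from statistics import mean, stdev


def stabilize(intensities):
    """Apply an arcsinh transform (a simple stand-in, assumed for illustration)."""
    return [asinh(value) for value in intensities]


def paired_t_statistic(treatment, control):
    """Paired t statistic for one peptide across matched samples.

    A positive value points toward increased phosphorylation under
    treatment. A one-sided p-value would be read from a t distribution
    with n - 1 degrees of freedom, which is not computed here. Identical
    differences (zero standard deviation) raise ZeroDivisionError.
    """
    differences = [t - c for t, c in zip(treatment, control)]
    n = len(differences)
    return mean(differences) / (stdev(differences) / sqrt(n))
```

In practice the transformed replicate intensities for each peptide would be passed to `paired_t_statistic`, one peptide at a time, before any pathway or clustering analysis.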
A further aspect includes a kit comprising a plurality of peptides described herein comprising sequences present in a peptide selected from Table 6, an array comprising a support and the plurality of peptides, and/or a kit control.
In an embodiment, the kit further comprises instructions for use.
In an embodiment, the kit comprises about 5, 10, 15, 20, 25, 30, 35, 40, 50, 60, 70, 80, 90, 100, 125, 150, 175, 200, 225, 250, 275, 300 or more peptides, optionally selected from Table 6. In an embodiment, the peptides are comprised in a composition, or attached to a solid support such as in a microarray.
Another aspect includes a phosphorylation profile comprising for each of a plurality of peptides selected from Table 6, one or more phosphorylation characteristics, for example signal intensities, fold change, and/or phosphorylation status, associated with a phenotype and/or treatment.
While the present disclosure has been described with reference to what are presently considered to be the preferred embodiments, it is to be understood that the disclosure is not limited to the disclosed embodiments. To the contrary, the disclosure is intended to cover various modifications and equivalent arrangements included within the spirit and scope of the appended claims.
All publications, patents and patent applications are herein incorporated by reference in their entirety to the same extent as if each individual publication, patent or patent application was specifically and individually indicated to be incorporated by reference in its entirety.
Number | Date | Country | Kind |
---|---|---|---|
PCT/IB2012/001254 | Jun 2012 | IB | international |
This application is a PCT application claiming priority to U.S. provisional application 61/537,941 filed Sep. 22, 2011, U.S. provisional application filed Apr. 3, 2012, and PCT application PCT/IB2012/001254 filed Jun. 24, 2012, all of which are herein incorporated by reference.
Filing Document | Filing Date | Country | Kind | 371c Date |
---|---|---|---|---|
PCT/CA2012/000893 | 9/21/2012 | WO | 00 | 3/21/2014 |
Number | Date | Country |
---|---|---|
61537941 | Sep 2011 | US |
61619902 | Apr 2012 | US |