The invention relates to a method of determining the identity of an entity from mass spectra. The method is useful in proteomics, metabolomics, genomics and transcriptomics and in their applications.
Discovery proteomics contains a wealth of rare information that often resists reliable interpretation. In shotgun proteomics, a discovery-oriented subfield of bottom-up proteomics, proteins are enzymatically cleaved into peptides and the digested samples are gradually introduced into an analyser, most commonly into a mass spectrometer coupled to liquid chromatography. In mass spectrometry analysis, typically, in each cycle the masses of intact molecules are analyzed, those of further interest are isolated and fragmented, and a second mass analysis is performed on the fragments, giving MS/MS spectra. The goal of identification is to determine the peptide producing an observed MS/MS spectrum; mapping of peptides to proteins concludes the protein identification task.
Because of the immense complexity of the proteome, the number of potential interpretations of each fragment spectrum is enormous. Size apart, even unbiased consideration of all interpretations is of limited use, as high spectral homology renders many interpretations equally plausible, a phenomenon often witnessed in de novo peptide sequencing. In practice, identification is intentionally biased towards a much smaller database of reference peptides, commonly using search engines such as Sequest or X!Tandem. Reference search leaves many spectra uninterpreted, around 75% as recently evaluated on projects in the PRIDE repository. Although many unmatched spectra were confidently found to contain peptides with post-translational modifications, the majority still remains without interpretation. Several explanations remain open, e.g., splicing alterations and re-arranged or novel genes; however, unmatched spectra often exhibit properties of peptides. As human proteomes differ from the reference in around ten thousand peptide amino acid sites, some spectra might naturally be attributed to variants, and their identification is the aim of the present invention.
The same problems with interpretation of outcomes of analyses of complex samples are encountered in many other fields of chemical, biochemical and biological research dealing with complex mixtures, irrespective of the analytical methods used and irrespective of the structural character of the samples.
The present invention relates to a method for determination of identity of at least one entity from a mass spectrum of said at least one entity and optionally from additional data from chemical, biochemical or biological analysis of said at least one entity, for each entity comprising the steps of:
The candidate identities selected in step b) comprise candidate identities which are a possible or admissible interpretation of the mass spectrum and optionally of the additional data.
The score calculated in step c) and used for finally determining the identity in step d) may have a form of a numerical value (then in step d), usually the highest value of the score determines the identity which is finally determined to be the correct one for the analyzed entity), or another form, such as an interval of numbers, a non-numerical entity, entities with established order, a number with probabilistic interpretation. The skilled person will appreciate that when a form of the score is selected, also the score which would correspond to the true identity of the entity (the ideal score) is selected or determined by the form of the score or by its calculation. E.g., for a number with probabilistic interpretation, 100% probability (or value 1) corresponds to the true identity of the entity. The “true identity” is meant as the real identity of the entity, which is however unknown at the beginning of the process.
Preferably, in the step c), the calculation involves calculating maximal probability of candidate identity. The maximal probability may be the score, or it may be a variable in the calculation of the score.
Preferably, in the step c), the calculation involves calculating probability of candidate identity. The probability may be the score, or it may be a variable in the calculation of the score.
Preferably, in the step c), the calculation involves calculating probability of candidate identity using Bayes' Theorem.
In one preferred embodiment, in the step b) the value of prevalence is calculated based on at least one of population frequency of said entity, probability of modification of said entity in the environment, probability of modification of said entity during the analysis step.
Preferably, in the steps b) and c), the value of prevalence is expressed as prior probability or as prior-like probability.
In a preferred embodiment, in the step d) the determination of identity comprises evaluating whether multiple forms of isotopically labeled peptides were present.
Preferably, the entity is selected from a molecule having the molecular weight of up to 2000 g/mol, a peptide, a protein, a lipid, a nucleic acid, a metabolite.
Preferably, the entity is a peptide, and the method used to obtain the mass spectrum is tandem mass spectrometry (also referred to as MS/MS).
In preferred embodiments which are further referred to as comprising “enumeration”, in step b), the obtaining of the candidate entities and/or of prevalence of the candidate identities comprises enumeration which comprises the steps of:
In the enumeration embodiments, preferably said candidate identities are peptides; said prevalence is expressed as a prior-like probability; said initial entities are N-terminally-cleaved linear subsequences of reference proteins; said applicable events comprise modification, substitution and cleavage; said limiting condition is minimal prior-like probability of given form of peptide.
In the enumeration embodiments, preferably said candidate identities are proteins; said prevalence is expressed as a prior-like probability; said initial entities are reference exon-based protein models; said applicable events comprise exon exclusion and exon inclusion; said limiting condition is minimal prior-like probability of exon-based model; said transformation of entities into hypotheses is concatenation of exons into protein-coding sequence and translation in silico.
The method of the present invention has a number of potential utilizations, which may involve additional steps upstream or downstream, or may involve the utilization of the determined identity of one or more entities by the method of the present invention in known methods.
The method of the present invention, wherein the entities are proteins, wherein the step of obtaining the candidate identities of an entity in step b) includes database search in database of peptide variants, may be used for identification of mutant and polymorphic proteins from mass spectra of proteome, with alterations already observed globally on nucleotide level.
The method of the present invention, wherein the entities are peptides, further comprising the steps of: e) matching of entities determined as polymorphic peptides to database of origins, may be used for determination of identity on the basis of variability of known prevalence, in particular for authentication of cell lines or identification of an individual from mass spectra of proteome.
The method of the present invention, wherein the entities are non-host peptides, wherein in the step b) the prevalence is expressed as prior or prior-like probability and wherein the prevalence of non-host peptides is scaled down according to prevalence of non-host organism, may be used for identification of non-host organism of known prevalence from mass spectra of proteome of host organism.
The method of the present invention, wherein the entities are non-host peptides, wherein in the step b) in obtaining the candidate identities, peptides uniquely mapping to non-host organism are added to enumerated peptides of host organism and prevalence of non-host peptides is lower than of any host peptide, may be used for identification of non-host organism of unknown prevalence from mass spectra of proteome of host organism.
The method of the present invention, wherein the entities are donor peptides, wherein in the step b) the prevalence of donor peptides is scaled according to their prevalence among recipient peptides, may be used for identification of proteins originating from grafted tissue in allograft or xenograft.
The method of the present invention, wherein the entities are peptides, the method further comprising the step of: e) selecting somatic mutant peptides attributable to tumour, may be used for identification of presence of a tumour and estimation of its stage from mass spectra of circulating proteins.
The method of the present invention, wherein the entities are peptides, the method further comprising the step of: e) selection and quantification of polymorphic peptides attributable to donor, may be used for monitoring organ transplantation and early detection of transplant rejection from mass spectra of blood plasma or serum of recipient.
The method of the present invention, wherein the entities are peptides, said method further comprising the step of: e) calculating significance of match between two individuals based on polymorphic peptides, may be used for determination of presence of genetic relationship between two individuals from measured mass spectra of proteome.
Furthermore, the invention encompasses a data processing system comprising means for carrying out the steps of the method of any one of the preceding claims.
The invention also encompasses a computer program comprising instructions which, when the program is executed by a computer, cause the computer to carry out the steps of the method of any one of the preceding claims.
Yet furthermore, the invention encompasses a computer-readable medium comprising instructions which, when executed by a computer, cause the computer to carry out the steps of the method of any one of the preceding claims.
In the drawings which schematically represent the method or its steps and sub-steps, lines with arrows refer to direct or indirect connection between individual units. Dotted lines with arrows correspond, in general, to alternative embodiments. Alternative embodiments are further indicated with addition of alphabetical letters grouping particular alternative embodiment. Reference numbers of subunits within units are formed as a concatenation of the reference number of the main unit, period and reference number of the subunit. Units depicted on drawings are assumed to be either standalone, or part of some larger units. Dotted outlines of blocks correspond to steps.
“Entity” herein refers to a chemical or biological entity, such as a molecule, substance or organelle. In particular, entity may be selected from a substance, a compound, a lipid, a metabolite, a peptide, a protein and a nucleic acid.
“Prevalence” herein refers to frequency of occurrence of an entity. The frequency of occurrence of an entity refers to its frequency of occurrence in the nature, or in a specific part of the nature which was the source of the measured sample, such as organism, part of organism, specific environment, etc.
Prevalence can be expressed in relative terms, e.g., entity A being more prevalent than entity B, or in absolute terms, such as percentages or amounts of the entity per unit of the sample or of the part of the nature. Prevalence also includes prior probabilities of entities. Prevalence also includes relative probabilistic terms, referred here as prior-like probabilities, wherein the relative differences between entities are the same as for prior probabilities of entities.
“Identity of an entity” herein refers to the determination of structural information about the entity, such as its chemical structure, sequence of amino acids or nucleotides. The structural information may refer to assigning a known structure to the entity, or determining its structure or part of its structure even though previously unknown.
“Candidate identity” herein refers to possible or admissible explanation (or interpretation) of the observed mass spectra and optionally additional chemical or biological data.
“Enumeration” herein refers to method of construction of candidate identities and their prevalence which is based on initial candidate entities and events for their combination. Such events include modifications of the initial entities which may have occurred.
“Score” is a value calculated for each candidate identity. The score may have a form of a numerical value, vector or array of numerical values, interval of numbers, non-numerical entity, entities with established order. Score also includes a number with probabilistic interpretation, for example, probability of correctness, p-value, E-value, q-value, maximal probability and their intervals. The skilled person will appreciate that when determining the form of the score, its value which would correspond to the true identity of the entity is also determined. E.g., for a score corresponding to probability, the value corresponding to the true identity of the entity is 1, i.e., 100%.
“Mass spectrum” refers to mass spectrum (MS) obtained by introducing the entity into a mass spectrometer and performing the mass spectrum measurement, or to MS/MS spectrum. Analytical data from mass spectrum are typically the data about the fragment peaks shown in the spectrum (m/z values, intensities). Additional criteria from the mass spectrum may also be used, such as precursor mass difference, isotopic distribution difference, protein evidence.
“Chemical, physical, biochemical or biological analysis” includes any analytical method that yields data useful for the determination of identity of the entity. Such methods include spectroscopic analytical methods such as NMR spectroscopy, X-ray diffraction spectrometry, IR spectroscopy; immunochemical methods; optical observation methods; methods relying on interaction with further agents, such as antibodies, labels.
“Explanation” and “interpretation” are herein used to designate the assignment of the identity of at least one entity to an analytical method outcome, i.e. to a mass spectrum and optionally additional data.
The present invention describes a method of determination of identity of an entity based on its mass spectrum data, and optionally additional data from other analytical methods, said method utilizing prevalence data and prevalence or probabilistic calculations. Use of prevalence provides an additional layer of discrimination and thus helps in resolution of otherwise indistinguishable situations. For instance, it is often the case that there are many explanations which agree equally well with an observed mass spectrum and additional data. The use of prevalence models might enable discrimination between these explanations if one explanation is much more prevalent than the rest. In effect, the utilization of prevalence reduces the complexity of the identification task.
If the way the prevalence is modelled, which is herein referred to as a “prevalence model”, exhibits desirable properties (e.g., the prevalence is expressed in relative probabilistic terms), the candidate interpretations might be often assigned a probability of correctness, or a maximal probability of correctness. Probability of correctness of an explanation has in turn the advantage of being usable in real-life scenarios as it enables long-term modelling of decision-making processes. Similarly, maximal probability of correctness provides strong grounds on which to rule out candidate explanations with direct real-life applicability. This might be shown in contrast to statistical significance of agreement (e.g., p-value or E-value), which does not possess such quality and even highly significant agreement might be often assigned to incorrect interpretations. This behaviour is practically impossible for a properly derived probability of correctness and thus gives practically much more desirable guarantees. The utilization of prevalence models especially helps in reliable identification of rare events. In case the prevalence of candidate explanations varies significantly (e.g., spanning multiple orders of magnitude as is the case in bottom-up proteomics), incorporation of prevalence might be necessary to obtain reliable results.
In other embodiments 102, the identification system 102.1 is separate from a system 102.2 comprising a prevalence model; in this configuration, the system 102.2 comprising the prevalence model processes results from said identification system 102.1. Such embodiments are usable, for instance, to derive the maximal probability of candidate identity or the probability of candidate identity. More specific embodiments of this kind are further illustrated on
Yet other embodiments 103 comprise an identification system 103.2 and a system 103.1 comprising prevalence model in which the identification system 103.2 works with the selection of candidate identities influenced by the prevalence model 103.1. Such embodiments can be used to preselect candidate identities in a way which improves the behavior of the identification system. More specific embodiments of this kind are illustrated on
In alternative embodiments 403.B, the selected candidate identities are at least as prevalent as the candidate identities accepted initially for testing (hypotheses 401). An example in bottom-up proteomics is when candidate identities 401 for testing correspond to variant peptides, and candidate identities which are at least as prevalent as the variant peptides 402 are selected in step 403.B (based on particular assumptions over prevalence of individual candidate identities).
The first step of the present invention comprises collecting analytical data. The methods for collecting analytical data, in particular mass spectrometry data, are well known to those skilled in the art. For example, in shotgun proteomics, the sample preparation protocols are well established and in general process samples into mixture of proteolytic peptides; see for instance an article comparing three protocols FASP, SP3 and iST (Sielaff et al. (2017): Journal of Proteome Research, 16(11): 4060-4072). Identification of the substances starts with their physical separation using liquid chromatography, coupled with mass spectrometer. Substances eluting at a particular time (retention time) enter mass spectrometer, undergo ionization and their masses are measured, giving a precursor spectrum. In data dependent acquisition, after each such measured precursor spectrum, masses of several most abundant precursors are selected, the ions are isolated, fragmented and MS/MS spectra (also fragment, or product spectra) acquired. These fragment spectra are of interest because calculation of the score may comprise evaluation of the agreement of theoretical spectra of candidate identities and the observed spectra of the entities.
In the second step of the method of the invention, candidate identities of the analyzed entity are obtained. This step can be performed in multiple ways.
In the usual scenario, candidate identities are obtained through a database search of entities for the given samples. For example, the search may be for peptides or nucleic acids or lipids or compounds or metabolites for given analyzed organism. Often, candidate identities are obtained through reference database search containing reference entities (e.g., peptides) for the analyzed organism. Examples of such databases are UniProt and ENSEMBL. If the analyzed entities are proteins or peptides, then proteins from these databases are in silico digested with a protease used in the experiment. As reference proteolytic peptides are of highest prevalence, they are self-contained in the sense that all more prevalent peptides (than the peptide of the lowest prevalence) are considered as well. However, if some modifications of the reference entities are considered, care must be taken such that all modifications at least as prevalent as the modification of lowest prevalence are considered as well.
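The in silico digestion mentioned above can be illustrated by a minimal sketch. It assumes trypsin with the commonly used rule of cleavage after K or R unless the next residue is P; the function name `digest_tryptic`, its parameters and the toy sequence in the usage note are hypothetical and are not part of the described method.

```python
def digest_tryptic(protein, missed_cleavages=0, min_len=1):
    """Return tryptic peptides of `protein`.

    Assumed rule (not taken from the text): cleave after K or R,
    except when the following residue is P.
    """
    # Positions after which the chain is cut.
    sites = [i + 1 for i, aa in enumerate(protein[:-1])
             if aa in "KR" and protein[i + 1] != "P"]
    bounds = [0] + sites + [len(protein)]
    peptides = []
    for i in range(len(bounds) - 1):
        # Joining up to `missed_cleavages` adjacent fragments models missed cleavages.
        for j in range(i + 1, min(i + 2 + missed_cleavages, len(bounds))):
            peptide = protein[bounds[i]:bounds[j]]
            if len(peptide) >= min_len:
                peptides.append(peptide)
    return peptides
```

For example, `digest_tryptic("MKAPRPLKGR")` yields the three peptides `MK`, `APRPLK` and `GR`; the R followed by P is not cleaved.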
In some embodiments, the candidate identities may be obtained using enumeration of candidate identities.
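One possible reading of such an enumeration is sketched below. It is an assumption, not the procedure of the invention: initial entities carry a prior-like probability, each applicable event multiplies that probability by an event probability of at most 1, and expansion stops once the product falls below a limiting value. The names `enumerate_candidates` and `oxidize_met`, the `M*` notation and the value 0.05 are hypothetical.

```python
def enumerate_candidates(initial, events, min_prob):
    """Expand `initial` {entity: prior-like probability} by applying `events`.

    Each event is a function that takes an entity and yields
    (new_entity, event_probability) pairs, with event_probability <= 1.
    A branch is abandoned once its prior-like probability drops below
    `min_prob` (the limiting condition).
    """
    candidates = {}
    stack = [(e, p) for e, p in initial.items() if p >= min_prob]
    while stack:
        entity, prob = stack.pop()
        # Keep only the most probable derivation of each entity.
        if candidates.get(entity, 0.0) >= prob:
            continue
        candidates[entity] = prob
        for event in events:
            for new_entity, event_prob in event(entity):
                new_prob = prob * event_prob
                if new_prob >= min_prob:
                    stack.append((new_entity, new_prob))
    return candidates


def oxidize_met(peptide, prob=0.05):
    """Hypothetical event: oxidise one unmodified methionine, written as 'M*'."""
    for i, aa in enumerate(peptide):
        if aa == "M" and peptide[i + 1:i + 2] != "*":
            yield peptide[:i + 1] + "*" + peptide[i + 1:], prob
```

With `enumerate_candidates({"AMMK": 1.0}, [oxidize_met], min_prob=0.01)` the doubly oxidised form is pruned, because its prior-like probability 0.05 · 0.05 lies below the limit.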
Example of embodiments related to shotgun proteomics, i.e., wherein the analyzed entities are proteins, is used to describe some preferred features of the enumeration. The enumeration shown in
In some embodiments, prior-like probabilities are involved in the prevalence models and/or in the calculation of the score. Prior-like probabilities are also referred to in literature as relative probabilities. For prior-like probabilities, the relative proportions between individual prior-like probabilities are the same as for prior probabilities. Thus the following holds: suppose selection of large number n of outcomes (e.g., MS/MS spectra) and their interpretation by candidate identities q and r with prior-like probabilities $\mathrm{Pr}_q$ and $\mathrm{Pr}_r$, respectively. The proportion of correct interpretations by candidate identity q as compared to r is then
$$\mathrm{Pr}_q \cdot \mathrm{Pr}_r^{-1}. \tag{1}$$
Thus prior-like probabilities preserve relative differences between prior probabilities.
As an example of preferred features and prior-like probabilities, the establishment of prior-like probabilities in proteomics is described. Prior-like probabilities can be derived from experimental data under these assumptions: the measured data represent the whole population; and the subset of data which is assumed to be correctly interpreted does not change the distribution.
A function Γ from measurements to peptides is used,
$$\Gamma\colon \{\text{measurements}\} \to \{\text{peptides}\}, \tag{2}$$
which gives for each measurement s (MS/MS spectrum) a true interpretation q (peptide) of the spectrum (assuming that there is exactly one true interpretation).
For including the modifications of the peptides, the following is used: Suppose peptide q represented as a tuple
$$q = \langle a_1 \oplus m_1, \ldots, a_n \oplus m_n \rangle \tag{3}$$
where each $a_i$ is a coded amino acid residue and each $m_i$ is a modification applicable to the residue $a_i$. The set of applicable modifications to $a_i$ is denoted as $\Phi(a_i)$ and, for technical brevity, the existence of an empty modification is considered.
Denoting by uppercase Q the set of all modified forms of peptide q, the proportion $r_q$ of a particular form of peptide q to all its forms Q is expressed as
In practice, there is not enough data to model each peptide separately. It is assumed that their behaviour is independent of peptide sequence and depends only on the modification m itself. For this purpose, peptides can be considered with exactly one residue to which m is applicable, denoting such set as $H_m$. Denote
$$H_m^{+} \subseteq H_m \tag{5}$$
the peptides with the modification. Then, for a particular modification m, the
is equal to (4) for each peptide q and can be calculated from larger set of data.
Furthermore, the approach can be extended to account for peptides with varying number of modifiable residues. Such extension behaves in the same way on peptides with exactly one residue and enables the utilization of the whole set of interpretations. Specifically, the proportion for modification m on residue a is derived as the total number of a residues modified with m to the total number of a residues with any applicable modification (also the empty one). Suppose
$$n(a \oplus m) \tag{7}$$
refers to the number of a residues with modification m. Then the proportion $r_m$ can be derived as
$$r_m = \frac{n(a \oplus m)}{\sum_{m' \in \Phi(a)} n(a \oplus m')}. \tag{8}$$
As another example, probabilities of DNA/RNA substitution are derived. Derivation is analogous as for modifications but with the following difference in the modelling approach. Due to the low ratio of substitutions in the data, the substitution event is modelled in an aggregated manner (independent of the residue). Specifically, the proportion r of all altered residues to all residues is obtained
and is interpreted as a probability of amino acid substitution.
As yet another example, for derivation of cleavage probabilities (after a particular amino acid), peptides with missed cleavages and semi-specific cleavage (specific at N-terminus and not specific at C-terminus) were utilized. Denoting $n_{\mathrm{cleavage}}(a)$ as the number of residues a followed by cleavage, and $n(a)$ as the total number of residues a, the proportion $r_a$ of cleavages after residue a is
$$r_a = \frac{n_{\mathrm{cleavage}}(a)}{n(a)}. \tag{10}$$
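The three proportions described in the preceding paragraphs (modification, substitution and cleavage) can be computed from residue counts as in the following sketch. The formulas follow the prose descriptions only; the function names, the layout of `counts` and the use of `None` for the empty modification are assumptions made for illustration.

```python
def modification_proportion(counts, residue, modification):
    """Residues `residue` carrying `modification`, relative to all observed
    `residue` counts with any applicable modification (None = empty one).

    `counts` maps (residue, modification) to a number of observations.
    """
    total = sum(n for (a, _), n in counts.items() if a == residue)
    return counts.get((residue, modification), 0) / total


def substitution_proportion(n_altered, n_total):
    """All altered residues relative to all residues (aggregated over residue types)."""
    return n_altered / n_total


def cleavage_proportion(n_cleaved_after, n_residue):
    """Residues followed by a cleavage relative to all residues of that type."""
    return n_cleaved_after / n_residue
```

For instance, with 50 oxidised and 950 unmodified methionines in the interpreted data, `modification_proportion` returns 0.05 for methionine oxidation.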
As yet another example, calculation of prevalence of peptides in allografts or xenografts is shown, for identification of peptides originating from grafted tissue in tumour xenograft model. In this case, the prevalence of peptides from different organisms (donor and recipient) is different and needs to be taken into account. Often the recipient would be an animal model and the donor would be a human; peptides from the animal are expected to be more prevalent than those of the human, for instance to monitor tissue/organ grafting or rejection or identification of donor's peptides in the recipient. Alternatively, the recipient can be a patient undergoing tissue transplantation and the donor can be a tissue/organ donor. In the following, it is shown how to estimate the difference in the prevalence of donor and recipient peptides.
The relative difference in the prevalences of donor peptides to recipient peptides can be estimated through derivation of the origin of homologous peptides of the donor and the recipient. Suppose a homologous peptide attributable to both donor and recipient was identified. The interest is in knowing whether the peptide is from the donor or the recipient. For this purpose, protein evidence (of donor proteins and of recipient proteins) of a given peptide can be used which provides the evidence of the origin of the peptide. The proportion p is estimated as the proportion of the homologous peptides with the donor protein evidence as compared to those with the recipient protein evidence. In the construction of the protein evidence, the protein evidence is restricted to heterologous peptides only. In another approach, the proportion p is estimated as the ratio of detected heterologous peptides. Both approaches can be used when there is a limited homology between donor and recipient, which is often the case in xenografts. In allografts, the proportion can be set equal. From a practical perspective, the relative difference between prevalence of donor and recipient peptides is rather small; for instance the number of the donor peptides is in the order of tens of percent of those from recipient. This is important to note as it simplifies identification of donor peptides as there is no other organism (other than donor) expected to be of higher prevalence than that of recipient.
As another option, determination of prevalence of peptides of non-host organisms is described. The situations when identification of non-host organism is of interest include for example detection of microbial presence in an organism, for example for diagnosis of microbial infection.
In some situations, prevalence is known. In general, it is necessary to take the prevalence (or its estimate) of the non-host organism into account. The situation is partially similar to allografts or xenografts, however with the difference that the prevalence of peptides of the non-host organism is generally lower than that of a grafted tissue and non-host peptides are phylogenetically more distant. This has some consequences, notably that all non-host organisms of higher prevalence need to be considered as well (among other at least as prevalent peptides). If the estimate of prevalence ($p_o$) of the non-host organism o is known and also the prevalences of all organisms q at least as prevalent as the organism o, $p_o \le p_q$, then the prevalence model can be easily configured as follows. The prevalence should be expressed in prior or prior-like probabilities and then the prevalence of non-host peptides of the organism o is multiplied with the value of prevalence $p_o$.
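A minimal sketch of this configuration is given below, assuming that peptide prevalences are held as prior-like probabilities in dictionaries. The function name `scale_non_host` and the rule of keeping the larger value for a peptide shared by host and non-host organism are assumptions, not statements of the method.

```python
def scale_non_host(host, non_host, organism_prevalence):
    """Merge host and non-host peptide prior-like probabilities.

    Non-host values are multiplied by the organism prevalence p_o.
    If a peptide occurs in both organisms, the larger value is kept
    (an assumption made for this sketch).
    """
    merged = dict(host)
    for peptide, prob in non_host.items():
        scaled = prob * organism_prevalence
        merged[peptide] = max(merged.get(peptide, 0.0), scaled)
    return merged
```

With a host peptide of prior-like probability 1.0 and a non-host peptide of 0.5 in an organism of prevalence 0.01, the non-host peptide enters the candidate set with 0.005.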
Often, the prevalence of the non-host organism is unknown, which complicates the identification task. It is nevertheless possible to increase the specificity of identification if it can be ascertained that all at least as prevalent organisms are considered as well. In case of unknown prevalence, one possible resolution is to consider all known organisms. This is because it is rather unlikely that organisms which have not been described yet are more prevalent than the one whose identification is of interest.
In the strictest scenario, all peptides of the non-host organism of interest are assumed to be of the lowest prevalence among all peptides of all considered organisms. An example of realization of such a scenario is to enumerate all organisms with a limiting prior-like probability r (e.g., an estimated prior-like probability, such as $r = 4 \cdot 10^{-6}$) and further scale down the prevalences of non-host organisms. In particular, the prevalence of the host remains identical, but the prevalence of all non-host organisms except the one of interest is multiplied by r, and the prevalence of the non-host organism of interest is multiplied by $r^2$. In such case, the non-host peptides of interest are of strictly the lowest prevalence. As the prior-like probabilities are only very roughly established, it is preferable to calculate the maximal probability $P_{\max}$ instead of the probability; furthermore, in the calculation of $P_{\max}$, it is preferred to assume only the relative order of prior-like probabilities. This scenario requires enumeration of peptides for all known organisms. In such case, the database for $r \approx 4 \cdot 10^{-6}$ will likely occupy thousands of terabytes.
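The scaling described in this scenario can be sketched as follows. The data layout (a dictionary of peptide prior-like probabilities per organism) and the function name `scale_unknown_prevalence` are hypothetical; the factors 1, r and r squared are those named in the text.

```python
def scale_unknown_prevalence(peptides_by_organism, host, organism_of_interest, r=4e-6):
    """Scale prior-like probabilities per organism.

    Host peptides are unchanged, peptides of other non-host organisms are
    multiplied by r, and peptides of the organism of interest by r**2.
    """
    scaled = {}
    for organism, peptides in peptides_by_organism.items():
        if organism == host:
            factor = 1.0
        elif organism == organism_of_interest:
            factor = r ** 2
        else:
            factor = r
        scaled[organism] = {pep: prob * factor for pep, prob in peptides.items()}
    return scaled
```

When every enumerated peptide has a prior-like probability between r and 1, the peptides of the organism of interest end up no more prevalent than any other peptide, which is the ordering needed for a maximal-probability calculation.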
A more preferred approach to the problem of identifying organisms of unknown prevalence was developed. The computational aspect can be significantly simplified under the following assumptions: i. if a peptide of the non-host organism was measured, it was a fully specific (e.g., tryptic) reference peptide, and ii. all non-host organisms are of equal prevalence which is lower than that of any enumerated peptide of the host. Under these circumstances, the peptides exclusively attributable to the non-host organism of interest can be preselected and added to obtained peptides in enumeration of host.
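Under these two assumptions, the preselection could look like the sketch below. The function name `add_unique_non_host` and the choice of half the lowest host value as the non-host prevalence are assumptions; any value below the lowest host prevalence would serve when only the order of prior-like probabilities is used.

```python
def add_unique_non_host(host_enumerated, non_host_reference, other_organisms=()):
    """Add peptides found only in the non-host organism of interest to the host candidates.

    host_enumerated: dict of enumerated host peptides to prior-like probabilities.
    non_host_reference: fully specific reference peptides of the organism of interest.
    other_organisms: iterable of peptide collections of further non-host organisms.
    """
    shared = set(host_enumerated)
    for peptides in other_organisms:
        shared |= set(peptides)
    unique = set(non_host_reference) - shared
    # Assumed value: anything strictly below the lowest host prevalence.
    low = min(host_enumerated.values()) / 2
    merged = dict(host_enumerated)
    merged.update({peptide: low for peptide in unique})
    return merged
```

Peptides shared with the host or with another non-host organism are left out, since they are not exclusively attributable to the organism of interest.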
In the third step of the method of the invention, the score is calculated for each candidate identity.
First, the agreement metrics used for observed and theoretical spectra are described. Various metrics of agreement between observed and theoretical spectra exist, such as the simple number of matching peaks consisting of singly charged ions (b, y ions for CID and HCD). The matching of peaks occurs for a prespecified fragment mass tolerance depending on the instrumentation used (e.g., 0.3 Th for linear ion trap). The skilled person is aware of the available options and is capable of choosing the suitable one. Derivation of the maximal probability of candidate identity can be based on adequacy of the agreement model. The following corresponds to the assertion that a higher agreement of predicted and observed spectrum leads to an increase in the probability of the interpretation being correct. Thus for every spectrum o ∈ O, given two agreements
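$$\alpha \le \beta \tag{11}$$
it is assumed that
$$P(p = \Gamma(o) \mid \Phi(p,o) = \alpha) \le P(q = \Gamma(o) \mid \Phi(q,o) = \beta) \tag{12}$$
where
$$\Gamma\colon O \to H \tag{13}$$
is a function giving correct interpretation for a spectrum (O being the set of spectra and H the set of candidate identities) and
$$\Phi\colon H \times O \to X \tag{14}$$
an agreement model (for some set X, on which there is order). Furthermore, it is assumed that the same agreement for different candidate identities gives an equal probability of being the correct interpretation from the viewpoint of agreement:
$$P(p = \Gamma(o) \mid \Phi(p,o) = \alpha) = P(q = \Gamma(o) \mid \Phi(q,o) = \alpha) \tag{15}$$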
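As an illustration of one simple agreement model, the sketch below counts theoretical fragment peaks that have an observed peak within a given tolerance. It assumes that the theoretical m/z values of a candidate identity are supplied by the caller; the function name `count_matching_peaks` is hypothetical, and the default tolerance of 0.3 Th is the linear ion trap example value mentioned earlier.

```python
from bisect import bisect_left


def count_matching_peaks(theoretical_mz, observed_mz, tolerance=0.3):
    """Number of theoretical m/z values with an observed peak within `tolerance` (Th)."""
    observed = sorted(observed_mz)
    matches = 0
    for mz in theoretical_mz:
        # First observed peak that is not below the lower edge of the window.
        i = bisect_left(observed, mz - tolerance)
        if i < len(observed) and observed[i] <= mz + tolerance:
            matches += 1
    return matches
```

The returned count is an element of an ordered set, so it can play the role of the agreement value compared in (11) and (12).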
For the candidate interpretation q of o, with the prior-like probability $\mathrm{Pr}_q$, there is a need to obtain a set R of all at least as good interpretations. R is of the following form
$$R = \{ \langle p, \mathrm{Pr}_p \rangle \mid p \in H \text{ and } \mathrm{Pr}_p \ge \mathrm{Pr}_q \} \tag{16}$$
wherein p is a candidate identity from the set H of candidate identities and $\mathrm{Pr}_p$ is its prior-like (or prior) probability. Thus for the establishment of $P_{\max}$, only candidate identities which are a priori equal or more likely need to be considered.
If the correct order of prior-like probabilities is considered, then the maximal probability Pmax of q is inversely proportional to the number of the at least as good interpretations, thus
P≤|R|⁻¹=Pmax. (17)
If numerical aspects of prior-like probabilities are assumed, then Pmax is the proportion of Prq among all the at least as good interpretations, thus:
The value of Pmax is independent of search space size.
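The two variants of Pmax described above can be sketched in a few lines. The sketch rests on two assumptions that are not confirmed by the text as printed: that the set R of at least as good interpretations contains the candidates whose agreement and prior-like probability are both at least those of q (the agreement condition follows the spirit of (12) and (15)), and that the numerical variant is Prq divided by the sum of prior-like probabilities over R. The function name and the tuple layout are hypothetical.

```python
def p_max(candidates, q, numeric=True):
    """Maximal probability of candidate q.

    candidates: list of (agreement, prior_like) tuples for one spectrum.
    q: index of the candidate under evaluation.
    numeric=False uses only the order of priors (1/|R|);
    numeric=True uses the share of Pr_q among the priors in R.
    """
    agr_q, pr_q = candidates[q]
    # Assumed reading of R: at least as good agreement AND prior; q is included.
    rivals = [pr for agr, pr in candidates if agr >= agr_q and pr >= pr_q]
    if numeric:
        return pr_q / sum(rivals)
    return 1.0 / len(rivals)
```

Because only candidates at least as good as q enter the calculation, adding further, less likely candidates to the search space does not change the result.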
Preferred methods for derivation of the probability of candidate identity, with prior-like (or prior) probabilities, are described herein. The probability that a particular candidate identity h is the true candidate identity h=Γ(o) of the spectrum o, given its agreement d=Φ(h,o), is:
P(h=Γ(o)|Φ(h,o)=d). (19)
Such probability can be derived, for example, using Bayes' Theorem, wherein (19) equals to
The equation (20) can be easily changed to incorporate prior-like probabilities, instead of prior probabilities. In case of prior probabilities, the following holds for each o ∈O, if exhaustively all candidate identities h ∈H are considered:
In case of prior-like probabilities, the sum (21) might be different, but because prior-like probabilities preserve relative differences, they can always be normalized by rescaling.
Prior-like probabilities are easier to establish; however, it might not be clear how they should be rescaled. If candidate identities are selected such that the true candidate identity is among them, then the prior-like probabilities can be rescaled to sum to 1 and are then equivalent to prior probabilities. If the true identity of the analyzed entity is not within the given candidate identities
H0⊆H (22)
and therefore
The variable c in (23) then corresponds to the probability that the true identity of the analyzed entity is within the selected candidate identities H0. Then, prior-like probabilities of selected candidate identities H0 can be rescaled (their sum) to c and will be equal to prior probabilities.
For instance, in shotgun proteomics, not all candidate interpretations of the spectrum are considered and therefore the true interpretation might not be among them. However, around 25% of spectra are usually correctly interpreted in a standard bottom-up proteomics experiment. Thus 0.25≤c≤1. This also limits the possible range of prior probabilities in shotgun proteomics and in turn the range of the probability of correctness.
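The rescaling of prior-like probabilities to the value c, and a subsequent application of Bayes' Theorem, might be sketched as follows. The two-hypothesis form of the posterior (true agreement model against random agreement model) is one common formulation and is an assumption here; it is not necessarily identical to formula (20). All names are hypothetical.

```python
def rescale_priors(prior_like, c=1.0):
    """Rescale prior-like probabilities so that they sum to c (0 < c <= 1)."""
    total = sum(prior_like)
    return [c * p / total for p in prior_like]


def posterior(prior, lik_true, lik_random):
    """P(candidate is correct | agreement d), two-hypothesis Bayes.

    lik_true:   P(d | candidate is the true interpretation)
    lik_random: P(d | candidate is a random interpretation)
    """
    numerator = lik_true * prior
    return numerator / (numerator + lik_random * (1.0 - prior))
```

With c=1.0 the candidates are assumed to contain the true interpretation; in shotgun proteomics any value between 0.25 and 1 would be consistent with the range given above.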
In mass spectrometry, multiple additional (supporting) criteria, e.g., precursor mass difference, can be directly used in identification. Furthermore, these criteria are useful for identification of rare events, for instance variant peptides. For practical purposes, it is often desirable for these criteria to have statistical interpretation for simplification of making decisions. In particular, the probability was modelled that the true interpretation of a spectrum has a specific additional/supporting criterion at least as extreme as was observed. This in effect enables removal of interpretations.
An example of selecting true interpretations in proteomics experiments on a particular organism (for example, human) is given here. The interpretations need not all be correct; rather, the distribution of these criteria needs to be the same as for correct interpretations. Therefore, some degree of incorrect results (e.g., 10%) will very likely not invalidate the results. It is assumed that selecting all tryptic reference peptides of the expected organism (e.g., human) with sufficient spectral significance (herein, an E-Value of 0.1 in X!Tandem in the reference peptide database search) yields interpretations suitable for modelling of the supporting criteria.
In some embodiments, precursor mass difference is used as an additional criterion. Distribution of differences between observed and calculated mass of peptide for true interpretations can be readily calculated. Further, association of probabilistic interpretation to differences enables their direct use in identification.
It is assumed that the distribution D of n differences between observed and calculated precursor mass for correct spectral interpretations matches
D=(d1, . . . ,dn). (24)
The number n is often rather large (order of thousands, or tens of thousands) for a particular sample, or even for a single run on modern instruments (such as Orbitrap). Therefore it is not even necessary to model the distribution and thus it is possible to work directly with data, e.g., through percentiles. For probabilistic interpretation of the difference d, D is utilized to calculate pd as proportion of true matches having at least as extreme difference as is d. Thus for difference d, we are interested in
Thus if it is unlikely (e.g., pd at most 0.01) for true match to have at least as extreme precursor mass difference, it provides probabilistic grounds for removal of interpretation.
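Working directly with the data, the proportion pd can be obtained by counting. The sketch below assumes that "at least as extreme" refers to the absolute value of the difference; the function name is hypothetical.

```python
def empirical_p(reference_diffs, d):
    """Proportion of assumed-true matches whose precursor mass difference
    is at least as extreme (in absolute value) as d."""
    extreme = sum(1 for x in reference_diffs if abs(x) >= abs(d))
    return extreme / len(reference_diffs)
```

An interpretation with, for example, empirical_p(D, d) of at most 0.01 would then be a candidate for removal.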
Mass spectrometry is in modern settings coupled to liquid chromatography which enables utilization of the predicted and observed retention time, similarly as the precursor mass difference. In practice, it is also beneficial to have a statistical interpretation of the difference between those two. In the simplest cases, the retention time difference can be modelled exactly as a precursor mass difference explained above. The prediction of the retention time can be done, for example, via BioLCCC (Liquid Chromatography of Biomacromolecules at Limiting Conditions; http://theorchromo.ru/).
It is preferred, however, to model the behaviour more locally, especially because the observed retention time often shows a non-linear relationship to the predicted time. The modelling can be performed for each predicted time ti separately by constructing a distribution Di for each ti. Each Di consists of experimental times (the experimental counterparts of tj, tj being neighbors of ti). Each Di contains 2·w neighbors, where 2·w is the window size (the preferred size is 500):
Di=(ei−w, . . . ,ei, . . . ,ei+w). (26)
Then for some theoretical time t and experimental time e the position of e within distribution Di is obtained wherein Di is selected such that its corresponding ti is closest to t. A percentile q is obtained
representing where e is within distribution Di. It is of interest to remove an expected proportion of true results based on their retention time. It is assumed that a symmetric removal of those which are on both sides of the distribution is needed and then q is converted to
p=2·(½−|½−q|), (28)
which gives the required proportion. Thus, for instance, selecting results with p≤0.1 is then expected to give 10% of results with the largest differences (on both sides).
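The local retention time model can be sketched as follows. It is assumed here that the percentile q is the fraction of experimental times in Di that do not exceed e, and that windows are simply truncated at the ends of the data; both are choices of this sketch. The function name and the (predicted, observed) pair layout are hypothetical.

```python
import bisect


def retention_p(pairs, t, e, w=250):
    """Two-sided proportion p from (28) for predicted time t and observed time e.

    pairs: (predicted, observed) times of assumed-true matches.
    w: half window; 2*w = 500 is the preferred window size.
    """
    pairs = sorted(pairs)
    predicted = [p for p, _ in pairs]
    # Index of the predicted time t_i closest to t.
    i = bisect.bisect_left(predicted, t)
    if i == len(predicted) or (i > 0 and t - predicted[i - 1] <= predicted[i] - t):
        i -= 1
    lo, hi = max(0, i - w), min(len(pairs), i + w + 1)
    window = sorted(obs for _, obs in pairs[lo:hi])  # distribution D_i
    q = bisect.bisect_right(window, e) / len(window)  # percentile of e in D_i
    return 2 * (0.5 - abs(0.5 - q))
```

Results with retention_p(...) at most 0.1 then correspond to the roughly 10% of true results with the largest deviations on either side.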
In tandem mass spectrometry, precursor spectra are often also measured and thus difference between the theoretical isotopic distribution and the observed one can be readily calculated as well. The difference can be also associated to statistical interpretation, analogously as for precursor mass difference. The software Isotopic Pattern Calculator (http://isotopatcalc.sourceforge.net/) can be used for the prediction of theoretical isotopic distributions.
Although multiple ways to calculate the difference between distributions exist, a very simple one is used here. For the calculation of the difference, peaks of the theoretical and experimental distributions are first mass-matched within some precursor mass tolerance (for instance, 5 ppm on Orbitrap). Intensities of both distributions (experimental and theoretical) are normalized to sum to one, and from this alignment the sum of squared differences of intensities is calculated. The proportion of true results having at least as extreme a difference as observed is then calculated. The calculation can be carried out in the same way as for the precursor mass difference.
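A possible implementation of this simple difference is sketched below. The treatment of unmatched peaks (compared against zero intensity), the greedy nearest-peak matching and the normalization over all peaks are assumptions of the sketch; the function name and the (m/z, intensity) pair layout are hypothetical.

```python
def isotope_difference(theoretical, experimental, tol_ppm=5.0):
    """Sum of squared intensity differences between two isotopic distributions.

    Each argument is a list of (mz, intensity) pairs.
    """
    def normalise(peaks):
        total = sum(i for _, i in peaks)
        return [(mz, i / total) for mz, i in peaks]

    theo, expt = normalise(theoretical), normalise(experimental)
    used = set()
    ssq = 0.0
    for mz_t, int_t in theo:
        best = None
        for j, (mz_e, _) in enumerate(expt):
            if j in used:
                continue
            ppm = abs(mz_e - mz_t) / mz_t * 1e6
            if ppm <= tol_ppm and (best is None or ppm < best[0]):
                best = (ppm, j)
        if best is None:
            ssq += int_t ** 2  # theoretical peak without experimental partner
        else:
            used.add(best[1])
            ssq += (int_t - expt[best[1]][1]) ** 2
    # Experimental peaks without theoretical partner.
    ssq += sum(i ** 2 for j, (_, i) in enumerate(expt) if j not in used)
    return ssq
```

The resulting value can be converted to a proportion of true results in the same counting manner as the precursor mass difference.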
In the specific example of bottom-up proteomics, proteins are enzymatically digested into peptides and therefore in the resulting mixture it is expected that all peptides (of a particular protein) are present. This is called “protein evidence”. It is therefore unlikely that just one peptide of a protein is identified, and this behavior may be modelled. Although multiple options for modelling of protein evidence exist, the modelling is restrained just to the presence or absence of different protein evidence (e.g., by assigning zero and one, respectively). Thus the probability of true match having p as the extreme protein evidence is:
In practice, p≤0.1 for no protein evidence and p=1 for protein evidence. The task can be performed even before the step of protein inference, by stating whether there exists a particular reference protein isoform for which there exists another peptide.
Additional/supporting criteria (e.g. precursor mass difference, retention time, isotopic distribution difference, protein evidence) can be combined to obtain a single criterion with a desirable statistical interpretation. This criterion is built in such a way that it is expected to remove a desired proportion of true matches.
For the particular criteria ci of interest (e.g. precursor mass difference and retention time) of some peptide-spectrum match
c1, . . . ,cn (30)
a single value is calculated
c=Πci. (31)
For each c the proportion f(c) of the retained true matches among the results is calculated:
For a new instance of peptide-spectrum match with criteria
d1, . . . ,dn (33)
its single value d is calculated as in (31), and then its value of f(d). This can be done by looking up the closest value c to d, and obtaining f(c) (this, for example, equals the behavior of k-nearest neighbor with one neighbor). Of course, in this step various options exist, however, as often a lot of data is available, k-nearest neighbor with one neighbor is preferred due to simplicity and obvious interpretation. Then we can filter out the results with the expected loss of a desired proportion of true interpretations; based, however, on all additional/supporting criteria used.
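The combination of criteria and the nearest-neighbor lookup might be sketched as follows. Since the formula for f is not reproduced here, the sketch assumes that the looked-up value is the proportion of assumed-true matches whose combined value is at most c, that is, the proportion expected to be lost (the complement of the retained proportion) when everything at or below c is discarded. All names are hypothetical.

```python
import bisect
from math import prod


def build_lookup(reference_criteria):
    """reference_criteria: per-match tuples (c_1, ..., c_n) of assumed-true matches."""
    combined = sorted(prod(c) for c in reference_criteria)  # formula (31)
    n = len(combined)
    # Share of reference matches with a combined value at most c.
    fractions = [bisect.bisect_right(combined, c) / n for c in combined]
    return combined, fractions


def expected_loss(lookup, criteria):
    """Look up the value for a new match via its single nearest neighbor."""
    combined, fractions = lookup
    d = prod(criteria)
    i = bisect.bisect_left(combined, d)
    if i == len(combined) or (i > 0 and d - combined[i - 1] <= combined[i] - d):
        i -= 1
    return fractions[i]
```

Discarding new matches with expected_loss(...) of at most 0.1 is then expected to remove about 10% of true interpretations, based on all supporting criteria at once.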
The fourth step of the method of the invention relates to determining the identity of the analyzed entity.
Maximal probability Pmax of interpretations provides a rationale for removal of unlikely matches with a predictable long-term behaviour. For instance, selection of a large number n of candidate interpretations with Pmax=p is expected to result in at most n·p correct interpretations. Therefore, the knowledge that a particular interpretation is less likely than a pre-determined value provides a rationale for its removal.
The strictest setting of Pmax corresponds to interpretations with Pmax=1 and is a preferred embodiment.
The probability P of candidate interpretations provides a rationale for selection of matches with a predictable long-term behaviour. For instance, selection of a large number n of candidate interpretations with a probability higher than p, is expected to result in at least n·p correct interpretations.
The probabilistic interpretation for the additional/supporting criteria is built in a way to express how likely it is that the true interpretation has the supporting criteria as extreme as observed. If it is therefore unlikely (e.g., up to 10%) that the true interpretation would have as extreme criteria, then by removal of these interpretations it is expected that the same proportion (e.g., up to 10%) of correct matches can be removed.
In the following paragraphs, several uses of the method of the invention are described.
In one preferred embodiment, the method of the present invention can be utilized for matching to databases of origins. The following section describes matching of identified peptide or nucleic acid variants of known prevalence to a database of origins, with each origin containing a set of variants.
For an analyzed sample s, we are interested in its true origin Γ(s) and agreement Φ(s,Ci) of the sample s and a candidate origin Ci can be used for its establishment. Further, the sample s is considered as a set of variants {v1, . . . , vk} identified in the sample s, denoted as
v(s)={v1, . . . ,vk}. (34)
The agreement Φ(s,Ci) can be, for instance, a number of matching variants. However, it is more preferable to define the agreement as
wherein φ represents the prevalence of a variant, because it provides the probabilistic interpretation over the obtained results. Rescaling of (35) such that the sum over all Ci equals one (the true origin is within the considered origins), gives the probability of the origin Ci being the true origin:
P(Γ(s)=Ci|Φ(Ci,s)=x). (36)
Herein, a rationale for the use of the formula (35) for derivation of (36) is provided. Foremost, consider the agreement Φ(s,Ci) to be the actual matching variants:
Φ(Ci,s)=Ci∩v(s). (37)
Said agreement (37) is used to enable derivation of the probability of determination of origin. Suppose, v(s)={va,vb} variants identified in sample s and suppose they are identified correctly (the true origin has these variants). Consider two origins
Ca=C∪{va}, Cb=C∪{vb}, va∉C, vb∉C, va≠vb. (38)
Let us first set up the following notation: denote C+ all origins in a population having variants C. Probability of the true origin being within the sets of origins Ca+ and Cb+ might be considered equal, given the same number of matching variants
P(Γ(s)∈Ca+|Φ(Ca,s)={va})=P(Γ(s)∈Cb+|Φ(Cb,s)={vb}) (39)
Thus if the true origin is within Ca+ or Cb+ with the same probability, the size of both sets can be investigated. Suppose that the individual origins (within Ca+ and Cb+) are equally likely a priori. Then the relative differences in the probability of the origins Ca, Cb are inverse to the number of entities having the corresponding variants:
Furthermore, the relative difference between the number of the entities might be derived using population frequencies as:
The relative differences between the individual origins are preserved, and if the true origin is within the considered origins, they can be normalized to the value of one and thus (36) is readily established. Although, the assumption (39) was based on same number of matching variants, it might be assumed to hold, in general, for any sets of matching variants.
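The matching of a sample to a database of origins can be illustrated by the following sketch. It assumes that the preferred agreement amounts to the product of inverse prevalences of the matching variants, which is the reading suggested by the derivation above (origins sharing a rare variant are fewer and therefore individually more likely), and that the true origin is among the considered origins. The function name and the data layout are hypothetical.

```python
from math import prod


def origin_probabilities(sample_variants, origins, prevalence):
    """Probability of each candidate origin being the true origin.

    sample_variants: set of variants identified in the sample, v(s).
    origins: dict mapping origin name to its set of variants.
    prevalence: dict mapping variant to its population frequency.
    """
    scores = {}
    for name, variants in origins.items():
        matching = sample_variants & variants
        # Rare matching variants contribute more than common ones.
        scores[name] = prod(1.0 / prevalence[v] for v in matching)
    total = sum(scores.values())  # rescale so the probabilities sum to one
    return {name: s / total for name, s in scores.items()}
```

An origin without any matching variant receives the neutral score of one before rescaling, so it is not excluded outright but becomes unlikely as soon as another origin matches a rare variant.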
Another use of the method of the invention is in diagnosis of cancer by identifying somatic mutant peptides attributable to tumour in a sample taken from the body of a patient, e.g., blood or other fluids. The identification of the somatic mutant peptides attributable uniquely to tumour can be used for non-invasive diagnosis and monitoring of progression and recurrence of the disease.
For determination of the status of a variant (somatic or germline), various criteria can be used. Herein, global nucleotide alterations for the purpose are used.
Germline variants are considered as follows: a variant is present in dbSNP (v. 147) or ExAC (the version of the ExAC compilation without TCGA) and preferably has a population frequency higher than 1·10⁻⁴ (in either dbSNP or ExAC). Somatic variants are defined as those present in COSMIC, ICGC or TCGA, but present neither in dbSNP nor in ExAC.
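The classification rule can be written down compactly. In the sketch below, the databases are represented by plain dictionaries and sets, which is an assumption for illustration; the preferred frequency threshold is applied strictly, so that a variant listed in dbSNP or ExAC below the threshold is left unclassified. All names are hypothetical.

```python
def variant_status(variant, dbsnp, exac, somatic, min_frequency=1e-4):
    """Classify a variant as 'germline', 'somatic' or 'unclassified'.

    dbsnp, exac: dicts mapping variant to population frequency.
    somatic: set of variants listed in COSMIC, ICGC or TCGA.
    """
    freqs = [db[variant] for db in (dbsnp, exac) if variant in db]
    if freqs:
        # Listed in a population database: never somatic; germline only
        # above the preferred frequency threshold in either database.
        return "germline" if max(freqs) > min_frequency else "unclassified"
    if variant in somatic:
        return "somatic"
    return "unclassified"
```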
The presence of the somatic mutant protein variants (e.g., in blood of individual) can be in itself a sign of presence of a tumour. This is especially true for tumours with a high mutation rate, e.g., a melanoma.
For a more precise identification of the somatic mutant proteins exclusively attributable to tumour, the sample (e.g. of blood or blood plasma or blood serum or tears or urine or saliva or stool or breath condensate or lavages or effusions or liquors, etc.) of the patients may be analyzed before and after treatment (e.g., surgery, radiation, chemotherapy, biological therapy, immunotherapy, etc.). The drop in somatic mutant proteins after treatment establishes their exclusive correspondence to tumour and ultimately the tumour response. This can be done for establishing standards for such measurements or for the monitoring of a patient.
Another possible use of the method of the invention is in monitoring the response of a recipient after transplantation by selection and quantification of peptides of the donor in a sample taken from the body of the recipient. Identification of increasing quantities of donor peptides in the sample (e.g. of blood or blood plasma or blood serum or tears or urine or saliva or stool or breath condensate or lavages or effusions or liquors, etc.) of the recipient is a sign of rejection or a risk of rejection of the transplanted organ.
Analysis of polymorphisms in the proteins is performed both on donor and recipient separately. Once these polymorphisms are associated with donors and recipients, identification of polymorphic peptides is uniquely linked to donor and recipient as well. Non-exclusive polymorphisms are not considered.
The quantification can be done using any label-free quantification method, for example by integration of area under curve in LC/MS spectra. For precise quantification, targeted quantitative methods such as SRM/MRM can be used. Once polymorphisms are established and transitions of polymorphic peptides available, their monitoring in individuals can be readily performed.
Furthermore, calculation of pair-wise variant-based agreement between individuals is performed. The method is schematically shown in
The agreement may be based plainly on number of matching variants identified using particular methods ma,mb, for example, as follows:
Φ(sa,sb)=|ma(sa)∩mb(sb)| (42)
Alternatively, the agreement may be in probabilistic terms. For the purpose, Γ is a function from sample to its true origin, wherein the origin e is a subset of all variants (neglecting the possibility that two distinct origins have the same variants). The probability of two samples having the same origin, given the observed agreement, is:
P(Γa(sa)=Γb(sb)|Φ(sa,sb)=x). (43)
Yet alternatively, the probability of at least as extreme match as x at random may be used:
P(Φ(sa,sb)≥x). (44)
Method m of identification of variants applied to a sample may identify exactly the variants in origin, and the origin is the same if and only if the identified variants are equal in both samples, however such situation is less likely in practice.
In some embodiments, the method m applied to the sample identifies a proportion r of variants in the sample. The proportion might be unknown in advance (or it might depend on the concentration of the sample, etc.), but the fact that samples are drawn from a known population may be utilized for its derivation. In such case, the expected number of variants in the sample with known population frequencies is
Σϕ(v)=n. (45)
The probability of both the presence of variant v and its identification using method m is designated as Pm+(v).
In some embodiments, identification of a variant might be independent of the variant itself, and therefore the probability of the identification is equal for each variant. In other embodiments, the probabilities may be different. Nevertheless, if n variants were identified using the method m in a sample, then the probabilities of identification can be expressed as the expected number of identified variants being the actual number of identified variants:
ΣPm+(v)=n. (46)
In shotgun proteomics, it is beneficial to model the probability of identification of a variant. Such identification might be modelled as a function of variant abundance because more abundant proteins are more likely to be measured. This is also important for establishment of agreement due to chance, because identification of variants of high population frequency in highly abundant proteins is quite likely even for random individuals. The probability of identification of a variant can be modelled on per-protein basis as its abundance. However, it is preferred to model it as a coverage C(p) of protein p (by identified reference peptides), which effectively normalizes the abundance within zero-one range. Proteins which are highly covered by peptides give a high probability of identification of a variant (if the variant is present) as opposed to proteins with a low coverage. For further simplification, the coverage might be calculated over genes and restricted to peptides uniquely alignable to genes (around 90%). The coverage for a gene might be then defined as the average of coverages of proteins (corresponding to the gene). This is followed by a further normalization of the probabilities of identification, such that (46) holds, as follows:
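The coverage-based model with the final normalization can be sketched as follows. It is assumed that the unnormalized probability of a variant is the product of its population frequency and the coverage of its gene, and that the normalization is a plain rescaling so that the probabilities sum to the number n of identified variants, as required by (46). The function name and the data layout are hypothetical.

```python
def identification_probabilities(variants, n_identified):
    """Pm+(v) for each variant, rescaled to sum to n_identified.

    variants: dict mapping variant to (population_frequency, gene_coverage),
    with gene_coverage in the range 0..1.
    """
    raw = {v: freq * cov for v, (freq, cov) in variants.items()}
    total = sum(raw.values())
    return {v: n_identified * r / total for v, r in raw.items()}
```

Plain rescaling does not cap individual values at one; when n is large relative to the number of variants considered, a capped or iterative normalization may be needed.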
Depending on the agreement model, at least as good match (44) may be calculated using different approaches. In general, the probability may be numerically calculated using viable methods, for example, Monte Carlo simulation. The following paragraphs focus on number of matching variants (42).
In general, the probability of a match of particular variant v at random using methods ma,mb is
Pma+(v)·Pmb+(v) (48)
if these events are statistically independent. The situation is, however, more complicated in shotgun proteomics, because some peptides are more suitable for identification (e.g. due to ionisation characteristics) than others. In other words, identification of variant in first sample changes, usually increases, the probability of identification of a variant in the second sample.
This effect can be modelled on pairwise comparison of coverage of matching proteins in both samples. Suppose Ca(p) is a coverage of protein p in sample a and Cb(p) in sample b. The expected shared coverage if distributed uniformly is Ca(p)·Cb(p). However, as there are preferences for individual peptides, the actual shared coverage is usually higher. The relationship could be modelled in variety of ways. Given the large set of available data, it can be modelled also using k-nearest neighbor regression. Here, the regression model is represented as a function k (5 neighbors, Euclidean distance). Therefore instead of multiplication of probabilities as in (48), it is calculated as
k(Pma+(v),Pmb+(v)). (49)
If the agreement model is a number of matching variants (42), the calculation of (44) may be approximated, for example, using Binomial distribution, with the probability of success in one trial being equal to mean value of (49) over all variants.
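The Binomial approximation of the probability of an at least as extreme match can be computed directly. In this sketch the number of trials n is taken to be the number of variants considered and p the mean per-variant probability of a match at random; both interpretations are assumptions, and the function name is hypothetical.

```python
from math import comb


def binomial_tail(x, n, p):
    """P(X >= x) for X ~ Binomial(n, p): chance of at least x matching
    variants when each of n variants matches at random with probability p."""
    return sum(comb(n, k) * p ** k * (1 - p) ** (n - k) for k in range(x, n + 1))
```

A small tail probability for the observed number of matching variants indicates that the agreement between the two samples is unlikely to be due to chance.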
The determination of the identity of the entity solves the problems of interpretation of mass spectra commonly encountered in shotgun proteomics and many other fields.
The method of the present invention may be also used for determination of identity, in particular for authentication of cell lines or identification of an individual from mass spectra of proteome.
The method may be also used for identification of non-host organism from mass spectra of proteome of host organism, in particular for diagnosis of microbial infection or colonization.
The method may be also used for identification of presence of a tumour from mass spectra of body fluid proteins or estimation of tumour characteristics through presence or absence of somatic mutations.
The method may be also used for monitoring of organ transplantation and early detection of transplant rejection from mass spectra of biological materials of recipient.
This example illustrates fragment mass spectrum of analytical data collected for unknown peptide in shotgun proteomics. This particular example of MS/MS spectrum is shown on
The candidate identities for spectrum on
In the next step, all candidate identities of prevalence higher than 4·10⁻⁶ were obtained; only candidates within 5 ppm (parts per million) of the calculated precursor mass were considered, ordered from the highest prevalence. An excerpt is shown in the following table:
The precursor mass difference of 5 ppm was selected due to accuracy of the employed mass spectrometer (Orbitrap Elite). Depending on experimental conditions, the precursor tolerance can be much wider (e.g., 500 Da) as is the case of open search or total (all candidate identities independent of precursor mass are considered). In these cases, the mass difference is further localized (or decomposed into multiple modifications and their localization) as is usual in open search, but prevalences of candidate identities with localized masses are further updated by corresponding prevalences of modifications.
The limiting prevalence of 4·10⁻⁶ should be low enough for most experiments at present. The number of spectra per experiment is in the order of hundreds of thousands; in such a case, at most a few peptides would be expected to be identified if a lower minimal prevalence were taken into account.
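The selection of candidate identities used in this example (precursor tolerance of 5 ppm, minimal prevalence of 4·10⁻⁶, ordering by prevalence) can be sketched as follows. The tuple layout and the function name are hypothetical, and the tolerance is computed relative to the calculated mass, which is an assumption of the sketch.

```python
def select_candidates(candidates, observed_mass, tol_ppm=5.0, min_prevalence=4e-6):
    """Keep candidates above the minimal prevalence and within the precursor
    tolerance, ordered from the highest prevalence.

    candidates: list of (identity, calculated_mass, prevalence) tuples.
    """
    kept = [
        (ident, mass, prev)
        for ident, mass, prev in candidates
        if prev > min_prevalence
        and abs(observed_mass - mass) / mass * 1e6 <= tol_ppm
    ]
    return sorted(kept, key=lambda c: c[2], reverse=True)
```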
This section describes agreement of theoretical spectrum of peptide and experimental (measured) spectrum. Number of matching peaks (of experimental and theoretical spectrum) is used as particular agreement model (
The following table illustrates determination of identity using maximal probability of candidate identities (Pmax column), calculated from agreement and prior-like probabilities.
The following description explains the association of probability of correctness (P) using Bayes' Theorem. For this purpose, model of true agreement and random agreement is defined.
Agreement of true interpretations is modelled as follows. Agreement is evaluated on interpreted spectra from spectral database of X!Hunter, which are assumed to be true interpretations. The behaviour (
Agreement of random interpretations for this spectrum is visualized on
In derivation of prior probability from prior-like probabilities, the c in (23) equals 1.0 (this is assumption that true interpretation is within the candidates), which enables establishment of probability. The associated probability of candidate identities is shown in the following table:
The run corresponding to the analytical data was used for selection of true interpretation as explained earlier (tryptic reference human peptides, 5 ppm precursor mass difference, statistical significance of 0.1).
The
The
Similarly as for precursor mass difference,
In case of protein evidence, 8.129% of assumed true interpretations were without existence of other peptide from the same protein.
The
The ROC curve on
The following table contains associated evidence from supporting criteria and their combined values.
The combined p can be used for removal of matches which are unlikely to be correct. In this case, selecting expected removal of 10% of correct results, the first interpretation (highest scoring from viewpoint of spectral match) is not removed.
Determination of identity in this example is based on selecting interpretation which is of maximal agreement and has Pmax=1. Such interpretation can be at most one, and it is the best candidate for given agreement and prevalence modeling; it is the first one in the previous table.
Determination of identity in this example is based on selection of the interpretation which has higher probability than 0.5; such interpretation can be at most one and it is the most likely interpretation. In this example, it is again the first one and the identity determined is the same as in previous example using Pmax and highest agreement.
The system overview (
In general, the identification system 1901 corresponds to 102.1 and the rejection system 1902 corresponds to system comprising prevalence model 102.2. The search database for X!Tandem is represented in form of variant peptide fasta file constructed by translation of variant mRNA, and excerpt from it looks as follows:
The deep database 1902.1 corresponds to prevalence model, was obtained through enumeration (
Further, it is preferred to construct the database first for a wide range of masses (for example, 700-2500 Da) and further index peptides into smaller ranges (for example, 0.01 Da), to save computational time.
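The indexing of peptides into small mass ranges might look as follows. The bin width of 0.01 Da and the mass range of 700 to 2500 Da follow the examples given above; the dictionary-based index, the function names and the 5 ppm lookup tolerance are assumptions of this sketch.

```python
from collections import defaultdict

BIN = 0.01  # bin width in Da


def build_index(peptides, low=700.0, high=2500.0):
    """Index (peptide, mass) pairs by mass bin, keeping only masses in range."""
    index = defaultdict(list)
    for pep, mass in peptides:
        if low <= mass <= high:
            index[int(mass / BIN)].append((pep, mass))
    return index


def lookup(index, mass, tol_ppm=5.0):
    """Return indexed peptides within tol_ppm of mass, visiting only the
    bins that overlap the tolerance window."""
    delta = mass * tol_ppm * 1e-6
    hits = []
    for b in range(int((mass - delta) / BIN), int((mass + delta) / BIN) + 1):
        hits.extend(p for p in index.get(b, []) if abs(p[1] - mass) <= delta)
    return hits
```

Only a handful of bins are visited per spectrum, so the cost of a lookup does not grow with the total size of the database.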
The rejection system 1902 is an example of incorporation of prevalence model for reevaluation of candidate identities (
The process is illustrated step by step on identification of variant peptides on samples measured on colorectal cancer cell line HCT116. The steps can be split into three phases: i) spectral match using database search, ii) assignment of additional information, iii) obtaining additional candidate identities. In the first step, the variant peptide database is searched using database search method, herein using X!Tandem. Matching of spectra and variant peptides gives initial results, an example is illustrated in the following table, ordered by most significant matches first (E-Value).
There are many highly significant matches of variant peptides, which, however, does not mean that they are correct interpretations.
In the second step, variant peptides are aligned to reference protein-coding sequences (ENSEMBL, human genome), their distance to reference genome is calculated and additional information attached. Only reference peptides which can be result of a single nucleotide variation are considered in this example (this is also because prevalence of such peptides is much higher and simplifies identification task). Furthermore, here, only peptides which can be aligned to one genomic location are considered (such decision has some benefits, for example, it is easier to establish peptide-derived nucleotide variation, which has further benefits of deriving population frequency, or calculating correspondence to nucleotide sequencing of matching sample). Excerpt of results of this procedure is illustrated by the following table:
In the third step, for each spectrum with variant peptide interpretation, all candidate identities (minimal prior-like probability of 4·10⁻⁶) are enumerated (as described previously).
Maximal probability of interpretations is established. Results of this procedure are illustrated in the following table:
Further, column “support” contains cell-line matched sequencing support; showing whether such variant was observed in sequencing. It can be seen that many of statistically significant results are not supported by sequencing. However, they have also low Pmax and can be rejected. The only two sequencing supported results in the example have Pmax=1.
The criterion for determination of identity used here was maximal spectral match and Pmax=1.
The method was used for identification of variants in human family members (
Note that the exome sequencing of particular sample was not used in construction of the global database. The evaluation of sequencing support against exome sequencing is most meaningful for germline variants as those are always present in substantial proportion.
Comparison with Proteome from Translated Exome
The previous table also shows a comparison of the number of identified variants when knowledge of exome sequencing was used to create a proteome with all variants. In that case, germline variants were derived from exome sequencing as follows: the variant was found in at least one parent and one child. The results suggest that even if sequencing of the sample is available, its benefits are limited, as around 80% of germline variants are identified with the use of the global nucleotide database (at around 95% sequencing correspondence).
In some cases when sequencing support is evaluated, the results might contain some correct interpretations that are not supported by sequencing. This is because some variants likely lack sequencing support owing to low sequencing coverage of the surrounding area, as illustrated on
This example shows utilization of claimed method for identification of cell line. The analysis is performed on publicly available data of NCI60 panel (Gholami et al. (2013) Cell Reports, 4(3): 609-620). Variants were identified as in previous example (the system architecture on
The identification of origin was performed against NCI60 exome database (Shankavaram et al. (2009) BMC Genomics, 10(1):277) and it was assumed that the true origin is within considered origins (thus within NCI60 exome database).
The process of matching of database of origins is further illustrated on a specific sample (P0001751) from NCI60 proteomes. An excerpt of results of variant peptide identification restricted to polymorphic peptides is shown in the following table:
Polymorphic peptides were used to calculate match with exome sequencing data and used for calculation of probability of correct determination of origin, with an excerpt of results illustrated in the following table:
The data indicate that the most likely cell line within the considered origins is PR:PC 3. The metadata of P0001751 state that the cell line is RE:SN12C; it can therefore be concluded that the cell line is likely mislabeled.
The procedure was further performed over all proteomes in the dataset; the results are visualized in the following table and point out two likely mislabeled cell lines:
Identification of mislabeled cell lines is of high importance, because it prevents propagation of incorrectly drawn conclusions and contributes to reproducibility of science outputs.
This example shows the use of the method for identification of a person. The analysis is performed on in-house data from a family of a particular structure (
The example is analogous to the matching of cell lines. The database of origins corresponds to the sequencing database of the family members. The same methods are used for assignment.
Results
The results of identification are shown in the following table:
The only misidentification occurred in the case of a monozygotic twin and was indicated by a high probability of error (0.4).
This example illustrates the use of the identified variants for determination of genetic relationship. For this purpose, variants were identified in blood lymphocytes of family members (
The calculation of the significance of a match is further illustrated here. As in the identification of genetic origin, only variants with a population frequency above 1% in dbSNP were used for calculation of the match (such variants are more likely to be germline variants).
For the calculation of a pairwise match, it is beneficial to establish a tabular structure that helps organize the data. A small subset of such a tabular structure is shown in the following table:
In the table, each line corresponds to a particular variant. “p+” refers to the population frequency of the variant from the database, and Pma+(v) to the product of the coverage of the individual gene in sample a and the population frequency; Pmb+(v) is defined analogously for sample b. k(Pma+(v), Pmb+(v)) refers to the probability that the variant is identified in both samples.
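By way of a non-limiting illustration, the tabular structure and a binomial significance estimate can be sketched as follows. Several points are assumptions made only for this sketch: coverage is taken as a number between 0 and 1 per gene, k is taken as the product of the two per-sample values (independence of the samples), the binomial success probability is taken as the mean of k over the retained variants, and the significance is taken as the upper tail of that distribution. All names and numbers are hypothetical.

```python
from math import comb

MIN_POPULATION_FREQUENCY = 0.01  # 1% dbSNP threshold stated in the text


def build_rows(variants, coverage_a, coverage_b):
    """Build one table row per variant more prevalent than the threshold."""
    rows = []
    for v in variants:
        if v["p_plus"] <= MIN_POPULATION_FREQUENCY:
            continue  # rare variants are not used for matching
        pm_a = coverage_a.get(v["gene"], 0.0) * v["p_plus"]
        pm_b = coverage_b.get(v["gene"], 0.0) * v["p_plus"]
        rows.append({
            "variant": v["id"],
            "p_plus": v["p_plus"],
            "pm_a": pm_a,
            "pm_b": pm_b,
            "k": pm_a * pm_b,  # ASSUMPTION: independent identification in a and b
        })
    return rows


def binomial_tail(n, p, observed):
    """P(X >= observed) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i)
               for i in range(observed, n + 1))


# Hypothetical input data.
variants = [
    {"id": "v1", "gene": "G1", "p_plus": 0.5},
    {"id": "v2", "gene": "G2", "p_plus": 0.2},
    {"id": "v3", "gene": "G3", "p_plus": 0.005},  # below 1%, dropped
]
coverage_a = {"G1": 1.0, "G2": 0.5}
coverage_b = {"G1": 0.8, "G2": 0.5}

rows = build_rows(variants, coverage_a, coverage_b)
mean_k = sum(r["k"] for r in rows) / len(rows)
observed_shared = 2  # hypothetical count of variants identified in both samples
p_value = binomial_tail(len(rows), mean_k, observed_shared)
print(rows, mean_k, p_value)
```

A small tail probability would indicate that the number of shared variants is unlikely to arise by chance between unrelated samples, under the assumptions listed above.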
The significance of the match was approximated by a binomial distribution with a mean value of
k(Pm
The results of this method (the approximation by the binomial distribution) are further illustrated here:
The method was further applied pairwise to all family members (
This example illustrates an embodiment for identification of tumour-specific circulating proteins in blood serum. In this example, publicly available data accessible on PRIDE (identifiers: PXD004624, PXD004625, PXD004626) were used for identification of mutant proteins. For identification of variants, the same method which corresponds to
In this example, all identified somatic mutations were assumed to be attributable to the tumour.
The results (
In this example, human reference and variant proteins are identified in blood serum from murine xenografts. The configuration of the experiment is based on
In the enumeration of candidates, peptides are enumerated for both organisms (here, mouse and human), the limiting condition being a prior-like probability of 4·10−6. The prior-like probabilities of peptides enumerated for human are multiplied by the relative prevalence of human to mouse (herein, and in practice, this is a linear scaling down). This number is derived for the particular experimental circumstances.
The proportion used in this example, p=0.25, was estimated from homologous peptides and heterologous protein evidence, as explained earlier. The tabular structure corresponding to homologous peptides and their heterologous protein evidence is illustrated here:
Peptides with Pmax=1 and maximal spectral agreement are retained.
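By way of a non-limiting illustration, the candidate enumeration with prevalence scaling described in this example can be sketched as follows. The limit of 4·10−6 and the proportion 0.25 are taken from the text; applying the limit after scaling the human prior-like probabilities is an assumption of this sketch, and the peptide sequences and their values are hypothetical.

```python
PRIOR_LIMIT = 4e-6       # limiting prior-like probability from the text
HUMAN_TO_MOUSE = 0.25    # proportion p used in this example


def enumerate_candidates(mouse_peptides, human_peptides,
                         scale=HUMAN_TO_MOUSE, limit=PRIOR_LIMIT):
    """Return (sequence, organism, prior-like probability) candidates.

    Human priors are linearly scaled down by `scale`.
    ASSUMPTION: the limiting condition is checked after scaling.
    """
    candidates = []
    for sequence, prior in mouse_peptides.items():
        if prior >= limit:
            candidates.append((sequence, "mouse", prior))
    for sequence, prior in human_peptides.items():
        scaled = prior * scale
        if scaled >= limit:
            candidates.append((sequence, "human", scaled))
    return candidates


# Hypothetical peptides and prior-like probabilities.
mouse = {"PEPTIDEK": 1e-3, "RAREK": 1e-6}
human = {"HUMANK": 2e-5, "LOWK": 1e-5}

candidates = enumerate_candidates(mouse, human)
for sequence, organism, prior in candidates:
    print(sequence, organism, prior)
```

In this sketch a human peptide must have an unscaled prior-like probability of at least four times the limit to be enumerated, which reflects the lower prevalence of human material in the xenograft serum.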
The identification method was used to identify human protein biomarkers across a wide range of cancer tissues transplanted into mice. The results (
This example illustrates the use of prevalence for diagnosis of mycoplasma in a host organism. In this case, the prevalence of the non-host organism is assumed to be unknown, which corresponds to the more complicated situation described earlier.
For a particular mass spectrum, peptides mapping exclusively to reference mycoplasmal peptides (among all organisms) and all human peptides (prior-like probability of 4·10−6) were obtained. As previously described, mycoplasmal peptides were defined to be of strictly lower prevalence than any enumerated human peptide.
Only mycoplasmal peptides with the highest agreement and Pmax=1 were retained.
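By way of a non-limiting illustration, the retention step can be sketched as a filter over the candidates of one spectrum. The record fields ("organism", "agreement", "p_max") are hypothetical names; it is assumed here that "highest agreement" means the maximal spectral agreement among all candidates of the spectrum, human ones included, and that Pmax has already been computed as described earlier.

```python
def retain_mycoplasma(candidates):
    """Keep mycoplasmal candidates with top agreement and Pmax equal to 1.

    ASSUMPTION: the top agreement is taken over all candidates of the
    spectrum, so a better-matching human peptide excludes the spectrum.
    """
    if not candidates:
        return []
    best = max(c["agreement"] for c in candidates)
    return [
        c for c in candidates
        if c["organism"] == "mycoplasma"
        and c["agreement"] == best
        and c["p_max"] == 1
    ]


# Hypothetical candidates for a single spectrum.
spectrum_candidates = [
    {"sequence": "MYCOK", "organism": "mycoplasma", "agreement": 12, "p_max": 1},
    {"sequence": "HUMANK", "organism": "human", "agreement": 9, "p_max": 0},
    {"sequence": "MYCOR", "organism": "mycoplasma", "agreement": 7, "p_max": 0},
]

retained = retain_mycoplasma(spectrum_candidates)
print([c["sequence"] for c in retained])
```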
This identification approach was applied to a subset of projects in the PRIDE repository, and the results of the Mycoplasma diagnostics are illustrated on
The following example illustrates the utility of the presence of both light and heavy isotopic forms, obtained by stable isotope labeling with amino acids in cell culture (SILAC) of the sample, for identification of variants. The variant identification was performed as in previous examples (
The additional criterion in this case is the identification of both the light and heavy forms of the peptide of interest.
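By way of a non-limiting illustration, the pairing criterion can be sketched as a check that a peptide sequence was identified in both isotopic forms. The input layout (pairs of sequence and label) and the label strings "light" and "heavy" are assumptions of this sketch, and the peptide sequences are hypothetical.

```python
from collections import defaultdict


def silac_paired(identifications):
    """Return sequences identified in both the light and the heavy form.

    `identifications` is an iterable of (sequence, label) pairs, where
    label is "light" or "heavy" (hypothetical input layout).
    """
    forms = defaultdict(set)
    for sequence, label in identifications:
        forms[sequence].add(label)
    return {seq for seq, labels in forms.items()
            if {"light", "heavy"} <= labels}


# Hypothetical identifications.
identified = [
    ("VARIANTPEPK", "light"),
    ("VARIANTPEPK", "heavy"),
    ("UNPAIREDR", "light"),
]

paired = silac_paired(identified)
print(paired)
```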
SILAC pairing and its effect on the sequencing support of variants were analyzed under two criteria: initial significance (E-value ≤0.1) and after rejection (Pmax=1). The results in the following table show that peptides identified in pairs have much higher sequencing support (when evaluated against sequencing of the matching sample).
This difference was most pronounced for variants of low population frequency (more likely somatic mutations). Isotopic labels can therefore be used to increase the specificity of identification of somatic mutations.
The scheme (
The enumeration for a particular protein starts with a reference exon-based protein model 2701, in which individual exons of the corresponding gene are either present or absent. Such a model can be represented by a binary vector indicating the presence of each exon in the model. Different protein models 2702 are constructed by exon inclusion or exon exclusion events 2703, each with an associated effect on prevalence.
In some embodiments, prevalence may be expressed as prior-like probabilities, in which case exon inclusion and exon exclusion events are assigned probabilities.
The enumeration process continues until the limiting minimal prevalence condition 2704 is met. The protein models are transformed by concatenation of the individual exons and translated into proteins 2705 with their associated prevalence, which then constitute the prevalence model 2706. Proteins constructed in this way may be used, for example, directly for identification in top-down proteomics, or may be further digested for use in bottom-up proteomics.
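By way of a non-limiting illustration, the enumeration of exon-based protein models can be sketched as follows. The sketch assumes that the reference model has a prevalence of 1, that each inclusion or exclusion event multiplies the prevalence by a fixed event probability, that identical proteins arising from different models have their prevalences summed, and that every exon length is a multiple of three so that the reading frame is preserved. The exon sequences, probabilities and function names are hypothetical.

```python
from itertools import product

# Standard genetic code, codons ordered with bases T, C, A, G.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODONS = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def translate(dna):
    """Translate DNA in frame 0 until the first stop codon."""
    protein = []
    for i in range(0, len(dna) - 2, 3):
        amino_acid = CODONS[dna[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


def enumerate_models(reference, p_include, p_exclude, min_prevalence):
    """Map each binary exon vector to its highest reachable prevalence."""
    best = {tuple(reference): 1.0}  # ASSUMPTION: reference prevalence is 1
    frontier = [tuple(reference)]
    while frontier:
        model = frontier.pop()
        prevalence = best[model]
        for i, present in enumerate(model):
            new_prevalence = prevalence * (p_exclude if present else p_include)
            if new_prevalence < min_prevalence:
                continue  # limiting minimal prevalence condition
            new_model = model[:i] + (0 if present else 1,) + model[i + 1:]
            if new_prevalence > best.get(new_model, 0.0):
                best[new_model] = new_prevalence
                frontier.append(new_model)
    return best


def prevalence_model(exons, models):
    """Concatenate present exons, translate, and collect protein prevalence."""
    proteins = {}
    for model, prevalence in models.items():
        dna = "".join(e for e, present in zip(exons, model) if present)
        protein = translate(dna)
        if protein:
            proteins[protein] = proteins.get(protein, 0.0) + prevalence
    return proteins


# Hypothetical three-exon gene; each exon length is a multiple of three.
exons = ["ATGGCT", "GAAGAA", "TGGTAA"]
models = enumerate_models((1, 1, 1), p_include=0.01, p_exclude=0.01,
                          min_prevalence=1e-3)
proteins = prevalence_model(exons, models)
print(proteins)
```

With these hypothetical values only single-exon exclusions pass the prevalence limit, because two consecutive events would give a prevalence of 1·10−4. Real exon boundaries frequently split codons, so a practical implementation would need to handle frame shifts, which this sketch does not.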
This example illustrates an embodiment for identification of tumor size and associated disease stage. Herein, publicly available data from the Clinical Proteomic Tumor Analysis Consortium, specifically TCGA Colorectal Cancer, were used for identification of mutant proteins using the same method which corresponds to
The parameters for determination of somatic and germline variants are as follows. A variant is considered germline if it is present in dbSNP (v. 147) or ExAC (the version of the ExAC compilation without TCGA) and preferably has a population frequency higher than 1·10−4 (in either dbSNP or ExAC).
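By way of a non-limiting illustration, the germline criterion can be sketched as a predicate over database lookups. The function name and argument layout are hypothetical; a frequency of None stands for a variant absent from the respective database, and the frequency condition can be switched off to reflect that it is stated as a preference.

```python
GERMLINE_MIN_FREQUENCY = 1e-4  # population frequency threshold from the text


def is_germline(dbsnp_frequency=None, exac_frequency=None,
                require_frequency=True):
    """Decide whether a variant meets the germline criterion.

    A frequency of None means the variant is absent from that database
    (hypothetical convention). With require_frequency=False, presence
    in either database is sufficient.
    """
    known = [f for f in (dbsnp_frequency, exac_frequency) if f is not None]
    if not known:
        return False  # absent from both dbSNP and ExAC
    if not require_frequency:
        return True
    return max(known) > GERMLINE_MIN_FREQUENCY


print(is_germline(dbsnp_frequency=0.02))
print(is_germline(exac_frequency=5e-5))
print(is_germline())
```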
The results on
Number | Date | Country | Kind |
---|---|---|---|
18184710.4 | Jul 2018 | EP | regional |
Filing Document | Filing Date | Country | Kind |
---|---|---|---|
PCT/EP2019/069552 | 7/19/2019 | WO | 00 |