The contents of the electronic sequence listing (W-4379-US02.xml; Size: 80,150 bytes; and Date of Creation: Nov. 4, 2022) is herein incorporated by reference in its entirety.
The present disclosure relates to methods, techniques, and processes for selecting one or more peptides for virus detection in clinical samples using mass spectrometry. More specifically, this technology relates to methods, techniques, and processes for selecting peptides for detection by mass spectrometry that are indicative of positive infection of a disease state, such as influenza, in a subject or in a pooled sample of multiple subjects. This technology also relates to methods, techniques, and processes for selecting peptides for detection by mass spectrometry that are indicative of a number of virus protein variants.
Seasonal influenza is caused by influenza A and B viruses which circulate globally. In temperate and many tropical and subtropical regions, seasonal influenza viruses are typically transmitted during widespread outbreaks coinciding with winter or rainy periods. Other tropical or subtropical countries may experience prolonged seasonal influenza epidemics or year-round circulation. A hallmark of influenza viruses is their ability to undergo antigenic drift in response to population immunity. Because of antigenic drift, influenza vaccine antigen composition is reformulated regularly to match the vaccine strains as closely as possible to currently circulating influenza viruses.
Occasionally, influenza A viruses undergo antigenic shift. These changes may be caused by reassortment between different influenza A subtypes, such as between animal and human subtypes. Pandemic influenza can result if there is very limited or no immunity in the population, if sustained person-to-person transmission occurs, and if infection causes clinical illness (e.g., SARS-CoV-2).
In addition to influenza viruses and coronaviruses, other viruses, such as human rhinoviruses, cause distress for the respiratory system. These viruses, while common and responsible for more than one-half of cold-like illnesses, can be particularly troublesome for patients with asthma, infants, the elderly, and the immunocompromised.
It is not uncommon for a human subject to be infected with multiple viruses at the same time. And during times of pandemic or at the height of flu season, treatment options and contact tracing efforts may require rapid identification of the source of a subject's symptoms. With rapidly changing viruses and variants, and the possibility of infection with multiple different viruses, detection protocols and tools are needed to be able to rapidly test and identify the source of infection.
In general, the present technology is directed to the selection of peptides for detection by mass spectroscopy that are indicative of positive infection of a disease state, such as a viral disease state, e.g., influenza. The present technology, in some instances, can be used to determine which peptides to analyze from a protein sample to clinically determine using mass spectrometry the presence or absence of a disease state. For example, the present technology can be used to provide a workflow extending from sample preparation through mass spectrometry analysis of a single or pooled sample to analyze the presence of one or more peptides as an indication of the disease/infection state. In some embodiments, the workflow identifies two or more peptides (e.g., 2, 3, 4, 5, 6, 7, or more) that when detected together in a sample indicate the presence of the disease/infection state.
In general, the present technology can be utilized to identify a combination of peptides and even processing conditions for sample preparation and analysis of the selected peptides for an efficient mass spectrometry based workflow for the detection of a diseased state providing broad coverage across different variants of disease. The present technology can be conducted in silico without the need for performing trial and error experimentation in a mass spectrometer. In some embodiments, the present technology is directed to the selection of peptides and creation of a workflow based on the selected peptides for a yes/no result in a clinical analysis of a disease state of an individual subject. In certain embodiments, the present technology encompasses the selection of peptides and creation of a workflow for a yes/no result in a clinical analysis of a disease state present in a pooled sample (i.e., a sample including digested proteins from a number of subjects pooled together). In other embodiments, the present technology can be utilized to determine the workflow for detecting whether one or more of multiple, different disease states (e.g., the presence of more than one virus or infection, such as SARS-CoV-2 and/or influenza) are present in a subject. The technology could be used to select workflow peptides and create workflows that indicate the presence or absence of the one or more viruses or infections (e.g., a simple yes that at least one virus or infection is present, or no virus or infection is detected). In other embodiments, the technology could be utilized to select workflow peptides and create workflows that indicate, if a virus or infection is detected, which virus variant (or variant group) or infection is detected.
Some embodiments of the present technology include collection steps followed by one or more selection steps to select peptides for the particular disease/viral state and then to create the workflow for efficient mass spectrometry detection of the disease state. The collection steps can include use of knowledge databases to identify known or partially known protein sequences of specific viruses, such as influenza. From this collected information, appended with meta data, the primary protein amino acid sequence can be used to conduct in silico tryptic digestion of the proteins to form possible workflow peptides. Ionization efficiencies, retention times, and/or other physicochemical properties for the possible workflow peptides are then determined (e.g., calculated using algorithms based upon physicochemical properties of the peptides and proteins) and added to the collection of information. Using the data (both collected and calculated) from these initial collection steps, selection steps (e.g., filtering) are then applied to select the actual workflow peptides.
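By way of illustration only, the in silico tryptic digestion described above can be sketched as follows. The sketch assumes the commonly used rule that trypsin cleaves C-terminal to lysine (K) or arginine (R) except when the next residue is proline (P); the function name `tryptic_digest` and the `missed_cleavages` parameter are hypothetical and are not taken from the present disclosure.

```python
def tryptic_digest(sequence, missed_cleavages=0):
    """Return candidate peptides from a protein sequence.

    Assumed rule: cleave after K or R unless the following residue is P.
    """
    fragments = []
    start = 0
    for i, residue in enumerate(sequence):
        next_residue = sequence[i + 1] if i + 1 < len(sequence) else ""
        if residue in "KR" and next_residue != "P":
            fragments.append(sequence[start:i + 1])
            start = i + 1
    if start < len(sequence):
        # Trailing fragment that does not end in K or R.
        fragments.append(sequence[start:])

    peptides = []
    for i in range(len(fragments)):
        # Join up to `missed_cleavages` additional adjacent fragments.
        last = min(i + missed_cleavages, len(fragments) - 1)
        for j in range(i, last + 1):
            peptides.append("".join(fragments[i:j + 1]))
    return peptides
```

The resulting list of possible workflow peptides would then be annotated with calculated properties (e.g., retention time, ionization efficiency) before any selection step is applied.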
Selection steps can include selecting the peptides and proteins based on one or more physicochemical properties and/or homology. For example, length of peptides, amino acids or amino acid motif of the possible workflow peptides, retention time of the possible workflow peptides in LC, and ionization and separation of the possible workflow peptides in a mass spectrum can be used to filter (potential elimination of some, increase importance of others) possible workflow peptides. Further selection steps can also be applied to filter data for the selection of workflow peptides. For example, a frequency (% of sequences containing a given peptide) filter can be applied to analyze the number of peptides that are shared between strains and types of disease states (e.g., strains and variant types of influenza or other virus) and/or filtering based on available meta data associated with the protein (variants) of interest. This filter can also be used to assess a minimal number of peptides to find the largest number of indicative (virus) proteins of the disease state to monitor.
In a final selection step, statistical analysis, such as Markov Chain Monte Carlo techniques, genetic algorithms, Bayesian Inference, nested sampling or other statistical analysis tools including those employed in Bayesian Inference, can be applied to the filtered possible workflow peptides, to select the actual workflow peptides for determining the broadest variant coverage of the selected disease state (e.g., for the yes/no determination of influenza). Once the actual workflow peptides have been identified (e.g., coverage determined), the physicochemical properties of these peptides together with information from the in silico tryptic digestion for those actual workflow peptides can be provided for use in the final workflow (i.e., sample preparation (digestion, clean-up, etc.), sample separation (liquid chromatography), and mass spectrometry (ionization conditions and m/z information)). In other embodiments, once the actual workflow peptides have been identified/selected, the values for a given set of experimental conditions will be calculated based upon particular equipment and experimental set-up to be utilized as well as other techniques known to those of ordinary skill in the art.
In some embodiments, once the workflow has been formed using the methods and processes of the present technology, external resources and expert input (e.g., virologists) can be used for validation.
One aspect of the technology is directed to a method for selecting a combination of peptides to identify one or more disease states in a subject using mass spectrometry. The method includes: selecting the one or more disease states; collecting information on proteins associated with the one or more disease states being present within the subject; in silico digesting individual proteins associated with the one or more disease states to obtain possible workflow peptides; collecting filtering data (e.g., physicochemical data, homology data, abundance of peptide data, data associated with the one or more disease states, any combination of the foregoing, etc.) associated with the possible workflow peptides; analyzing the possible workflow peptides for coverage of the one or more disease states and for mass spectrometry detection and resolution; and selecting the combination of peptides from the possible workflow peptides based on the analyzing step.
The above aspect can include one or more of the following features. In some embodiments, the method features selecting two or more disease states (e.g., two disease states, three disease states, four disease states, etc.). In certain embodiments, the method features selecting two or more variants of the one or more disease states. In some embodiments, the filtering data comprises physicochemical data. The physicochemical data can comprise ionization efficiency data and/or retention time data. In some embodiments, the physicochemical data comprises one or more of (a) length of possible workflow peptides; (b) MRM transition data on co-eluting or close eluting possible workflow peptides; and (c) amino acid sequences contained within the possible workflow peptides. In some embodiments, the filtering data comprises data on methionine being present within the amino acid sequences contained within the possible workflow peptides. Some embodiments of the method can further include eliminating possible workflow peptides using the filtering data prior to analyzing the possible workflow peptides for coverage. Other embodiments can further include eliminating possible workflow peptides after analyzing the possible workflow peptides for coverage. The step of analyzing the possible workflow peptides for coverage can include applying a statistical approach, such as, for example, Bayesian inference or a Markov Chain Monte Carlo algorithm. In some embodiments, the step of analyzing the possible workflow peptides for coverage includes analyzing coverage for a yes/no result for the disease state. In certain embodiments, the step of analyzing the possible workflow peptides for coverage includes analyzing coverage for a determination of a particular variant of the disease state (e.g., Influenza A versus Influenza B).
In some embodiments, the step of analyzing the possible workflow peptides for coverage includes analyzing coverage for a determination of a particular disease state from a group of possible disease states (influenza versus SARS-CoV-2 versus RSV).
In general, the present technology for the selection of peptides for determination of the presence or absence of a disease state through mass spectrometry provides many advantages. For example, the present technology can be used to capture, calculate, and evaluate hundreds of thousands of possible peptides for identification of a particular disease state. By applying the collection and selection techniques to the calculated data, efficient mass spectrometry based workflows can be generated to positively identify a disease state. The technology eliminates the trial and error timeline and the costs of conducting laboratory testing to identify and select appropriate surrogate peptides that are indicative of a disease state and can also be easily resolved and/or processed in an efficient workflow.
The technology will be more fully understood from the following detailed description taken in conjunction with the accompanying drawings, in which:
In general, the present technology is directed to the selection of peptides for detection by mass spectrometry that are indicative of a disease state (e.g., influenza infection, coronavirus infection, human rhinovirus infection, etc.). The present technology provides processes in which data are first collected and/or calculated and then filtered using physicochemical properties and other characteristics and traits of the diseased state to provide an efficient workflow for the mass spectrometry detection of the diseased state in a sample.
One challenge associated with efficiently diagnosing an influenza infection is the variability that exists among influenza types and subtypes. Due to this variability, reagents and methods designed to detect one type or subtype of the virus may not detect an infection with a different influenza type or subtype. Three types of influenza viruses infect human subjects: influenza A, influenza B, and influenza C. Influenza A and B are typically associated with seasonal flu, while influenza C generally causes mild disease. A fourth type, influenza D, can infect certain non-human mammals such as cattle. Influenza viruses are further divided into subtypes/strains based on the composition of surface proteins that make up the viral capsid. For influenza A, the subtypes are based on the expression of particular variants of the surface proteins hemagglutinin (H) and neuraminidase (N). There are 18 known different variants of hemagglutinin (H1-H18), and 11 known different variants of neuraminidase (N1-N11), combinations of which result in 198 possible different influenza A subtypes (over 130 of which have been detected in nature). In view of the prevalence of influenza infection, and the overlap of influenza symptoms with other common pathogenic infections (e.g., SARS-CoV-2 infection, human rhinovirus infection), methods of accurately and efficiently detecting and diagnosing viral infection are needed.
One objective of the present technology is to identify a set of peptides that can be used to detect the presence of an influenza virus in a sample (e.g., a biological sample such as nasal swab, saliva, sputum, blood, plasma, etc.), derived from a subject known or suspected of having an influenza infection. In exemplary embodiments, the set of peptides can include peptides having sufficient homology commonality among influenza types or subtypes to allow detection of multiple influenza types or subtypes. In addition, in some embodiments, the peptides have minimal to no homology with non-influenza proteins, thereby serving as specific marker(s) of influenza infection. By way of example, the present technology can be used to identify one or more (e.g., two or more, three or more, four or more, five or more, six or more, seven or more, eight or more, nine or more, ten or more, etc.) peptides that comprise a sequence that is shared among one or more influenza types and/or one or more influenza subtypes. In some embodiments, the present technology can be used to identify peptides that have significant homology (i.e., having at least 80%, at least 85%, at least 90%, at least 95%, at least 97%, at least 99%, or 100% sequence identity) in two or more influenza types (e.g., peptides that have significant homology in influenza A and influenza B). In some embodiments, the present technology can be used to identify peptides that have significant homology (i.e., having at least 80%, at least 85%, at least 90%, at least 95%, at least 97%, at least 99%, or 100% sequence identity) in two or more influenza subtypes (e.g., peptides that have, for example, significant homology in influenza A H1N1 and influenza A H3N2). 
In some embodiments, the technology provides a set of peptides, wherein each peptide in the set of peptides has significant homology in two, three, or four influenza types selected from influenza A, influenza B, influenza C, and influenza D, or even novel, mutated variants. In some embodiments, the technology provides a set of peptides, wherein each peptide in the set of peptides has significant homology in two or more, three or more, four or more, five or more, six or more, seven or more, eight or more, nine or more, or ten or more influenza subtypes. In some embodiments, each peptide in the set of peptides has minimal homology (e.g., 60% or less, 50% or less, 40% or less, 30% or less, 20% or less, or 10% or less sequence identity) to non-influenza peptides or proteins. For example, in some embodiments, each peptide in the set of peptides can have minimal homology to a human protein, and/or a protein of animal origin, e.g., a canine protein or a feline protein. Selection of peptides in accordance with the methodology described herein can be used to identify a minimal set of peptides that can be used to detect multiple types or subtypes of influenza in a sample, e.g., a patient sample.
In general, processes of the present technology include collection/calculation steps and selection/weighting steps.
As an initial matter, processes of the present technology begin with the identification of the viruses or disease states for detection in a clinical analysis as well as the type of information required for a result from the clinical analysis. For example, in one embodiment, the virus to be detected is influenza in humans, and the result desired from the clinical analysis is a simple yes/no for any type of seasonal influenza infection.
As described above, there are 4 types of influenza virus (types A, B, C, and D). Only types A-C are known to infect humans. Type D primarily affects cattle. Further, type C infections generally result in mild symptoms. Thus, type C is generally not considered to be associated with seasonal pandemic outbreaks. As the present embodiment is directed to detecting influenza in humans (and more particularly to harmful influenza), the result of the clinical analysis of the sample would indicate the presence or absence of an influenza infection (either or both type A/type B) in a sample from the human subject. In other embodiments, the workflow need not be restricted to types A and B. In still yet further embodiments, the workflow is directed to other viruses or other viruses in combination with influenza (e.g., human rhinovirus and/or SARS-CoV-2).
Influenza viruses can be further categorized by subtype (for type A) or lineage (for type B). Additional information and classification of the virus can be discerned through genetic clades and sub-clades as shown in
Returning to
From this first step of collecting data relating to the proteins associated with viral infection (e.g., influenza), additional meta data is collected on each of the proteins. The meta data is then associated with the in silico tryptic digestion of the proteins identified in the first collection step to determine the possible workflow peptides. (See Beynon, R. J.; Bond, J. S.; Proteolytic Enzymes: A Practical Approach, 2nd Edn. Oxford University Press, Oxford, UK, 2001, pp. 149-183). With this list of possible workflow peptides available, further calculations based on the physicochemical properties of the peptides are performed to determine their ionization efficiencies and other peptide characteristics (for mass spectrometry) and retention times (for LC separation prior to MS). For example, Geromanos et al., in “Simulating and validating proteomics data and search results” Proteomics 2011, 11, 1189-1211 describe an in silico method for the generation of proteomes, which utilizes the underlying physicochemical properties of peptides and proteins to compute peptide characteristics. Additionally, M. Gilar et al. describe various prediction models for peptide separation in their published paper entitled “Utility of retention prediction model for investigation of peptide separation selectivity in reverse-phase liquid chromatography: impact of concentration of trifluoroacetic acid, column temperature, gradient slope and type of stationary phase” published in Analytical Chemistry, Vol. 82, No. 1, Jan. 1, 2010 and also in M. Gilar et al. “Solvent selectivity and strength in reversed-phase liquid chromatography separation of peptides” published in Journal of Chromatography A, 1337, (2014), pages 140-146. All of this information is calculated and collected so that it can be used in the selection/weighting steps applied in the second stage of the process shown in
As mentioned above, the processes of the present technology include both collection/calculation steps followed by selection/weighting steps. In the selection stage of the process, the information calculated and collected on possible workflow peptides is evaluated and ranked to arrive at actual workflow peptides for viral detection using mass spectrometry.
After calculation and collection of ionization efficiency and retention times, as well as other physicochemical properties, filtering/weighting based on physicochemical properties and homology determination begins. Specifically, information regarding the length of the peptides, the amino acid sequences therein (i.e., their motif), retention times (to determine co-eluting or near-eluting (isobaric) peptides), and MRM transitions of the co-eluting or near-eluting peptides is evaluated and used to discount (i.e., filter out, down score, reduce weighting, etc.) at least a portion of the possible workflow peptides generated from the previous collection and calculation steps. For example, the length of the possible workflow peptides is evaluated. Possible workflow peptides having a longer length are disfavored due to difficulties in liquid chromatography or mass spectrometry analysis. Therefore, a threshold cut-off is applied as a filter (e.g., peptides having a sequence length longer than 15 amino acids are eliminated in this example). Other criteria include retention time (e.g., selecting peptides that elute at different times) and/or MRM transition differences, i.e., non-isobaric peptides and/or MRM transitions, should the peptides co-elute and potentially give rise to detection interference.
In addition to the physicochemical properties and homology filter criteria described above, other filtering or scoring functions can be applied. For example, the frequency % of a peptide (i.e., the % of protein amino acid sequences containing a given peptide) can be used to generate a score function for selecting possible workflow peptides. As a simple example, a score function (scaled between 0 and 1) is calculated by multiplying the frequency of a particular possible workflow peptide by the number of publications describing the function of the protein with which the peptide is associated. As a result, peptides that are common between strains and types of influenza (high frequency) and that are described extensively can be taken into account, such that a minimal number of possible workflow peptides can be selected as a way to positively detect one or more proteins indicative of viral infection.
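As a non-limiting sketch, the simple score function described above (peptide frequency multiplied by publication count, scaled between 0 and 1) could be computed as shown below. The scaling by the maximum raw score is an assumption made for illustration, as are the names `frequency` and `scaled_scores` and the publication counts supplied by the caller.

```python
def frequency(peptide, protein_sequences):
    """Fraction of protein sequences that contain the given peptide."""
    hits = sum(1 for seq in protein_sequences if peptide in seq)
    return hits / len(protein_sequences)


def scaled_scores(publication_counts, protein_sequences):
    """Score each peptide as frequency x publication count, scaled to 0..1.

    `publication_counts` maps peptide -> number of publications describing
    the associated protein function (hypothetical metadata).
    Assumption: scaling divides by the largest raw score.
    """
    raw = {
        peptide: frequency(peptide, protein_sequences) * n_publications
        for peptide, n_publications in publication_counts.items()
    }
    top = max(raw.values(), default=0.0)
    return {p: (s / top if top > 0 else 0.0) for p, s in raw.items()}
```

For three sequences `["AAKGGR", "AAKTTR", "CCKGGR"]` and publication counts `{"AAK": 4, "GGR": 2, "TTR": 10}`, the raw scores are 8/3, 4/3, and 10/3, so that the rarer but more extensively described peptide receives the highest scaled score in this toy case.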
Using one or more of physicochemical properties, homology, and/or frequency scoring factors, a scoring function resulting from the inference analysis, which accumulates these factors into a comprehensive, final, multi-factorial, weighted score, can be applied to filter the results. In some embodiments, instead of using a scoring factor to capture all aspects, a more simplistic threshold cut-off can be applied, such as eliminating all possible workflow peptides having more than 15 amino acids. In other embodiments, the scoring factor can be refined to capture all aspects used in the selection steps (e.g., used in addition to or in replacement of cut-off values). Other ways of calculating a scoring function or different scoring functions are also within the scope of the present technology. For example, some data can be excluded from the analysis due to expert or domain knowledge regarding sample preparation, behavior/properties, and detection of certain peptides. For example, peptides that contain the amino acid methionine (M), an amino acid with a hydrophobic side chain, are particularly challenging to resolve as they may exist in two forms, an oxidized form and a non-oxidized form. Thus, using filtering/weighting steps of the present technology, these peptides can be excluded.
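The threshold cut-off and the methionine exclusion described above can be illustrated with the following minimal sketch. The 15-residue limit and the exclusion of methionine come from the example in the text; the function names and the `exclude_residues` parameter are hypothetical.

```python
def passes_filters(peptide, max_length=15, exclude_residues="M"):
    """Return True if the peptide survives the simple cut-off filters."""
    if len(peptide) > max_length:
        # Longer peptides are disfavored for LC and MS analysis.
        return False
    # Exclude peptides containing any flagged residue (methionine by default).
    return not any(residue in peptide for residue in exclude_residues)


def filter_peptides(peptides, **filter_options):
    """Keep only the peptides that pass every filter, preserving order."""
    return [p for p in peptides if passes_filters(p, **filter_options)]
```

In a weighted rather than threshold-based embodiment, the same tests could instead lower a peptide's score without eliminating it.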
In one embodiment, a Bayesian statistics based analysis is utilized for the final selection of the actual workflow peptides. Once the actual workflow peptides are determined, the data used in modeling these actual peptides through the in silico tryptic digestion, ionization efficiency and retention time, and other meta data is provided, such that the workflow for the mass spectrometry detection of all actual workflow peptides can be reviewed and applied.
In another embodiment, a Markov Chain Monte Carlo (MCMC) algorithm is utilized for targeting peptides for selection of the actual workflow peptides. MCMC is a method that allows for the efficient exploration of high-dimensional probability distributions by obtaining random samples (Monte Carlo) from the distribution using an iterative process in which each iteration depends only on the properties of the distribution at the current position and possible destinations (Markov Chain). This method can be used to limit the enormous number of possible experiments in checking each individual target list of peptides to arrive at the selected or actual workflow peptides.
After eliminating peptide b and peptide c using the filtering methods described above, statistical analysis regarding the coverage of detection of the four variants with the fewest peptides is undertaken to make the final selection of the actual workflow peptides. In the illustration provided in
As mentioned above,
The process of the present technology can optionally include a final step after the collection/computational steps and selection steps. In the embodiments shown in
The above illustrations of the present technology are directed to obtaining a Yes/No result of virus detection from a single subject. However, the present technology is not limited to just a single class of infection, such as, for example, influenza. For example, the present technology can be applied to a clinical analysis (a mass spectrometry based analysis) of a sample to determine if more than one virus is present in a single sample. That is, the processes of the present technology can be utilized to select actual workflow peptides for an MS determination of whether the subject is infected with SARS-CoV-2, influenza (type A or type B, or novel variant) and/or any other coronavirus or seasonal viral infection (e.g., human rhinovirus). In addition, the present technology is also applicable to providing a more detailed result (i.e., not just Yes or No). For example, the present technology can be used to select the actual workflow peptides whose MS presence will indicate not just the presence of an influenza infection, but also whether the subject is infected with type A or type B, and potentially which variant (e.g., which subclass, etc., by incorporating information regarding the variant of interest in the coverage selection steps).
Further, the present technology is not limited to detection of viral infections. For example, the processes of the present technology can be used to select actual workflow peptides for the detection of metabolic diseases (e.g., Gaucher disease, phenylketonuria) or other types of disorders that can be detected using clinical LC/MS.
While the above illustrative embodiments were directed to a sample of a single subject, the present technology can also be employed to detect infection in pooled samples (samples stemming from the collection/pooling of numerous subjects to form a pool of subjects). Pooling can be useful when testing large sections of a population for the presence of a disease state. By pooling a number of individuals together, the hope is that a greater segment of the population can be tested regularly and discounted from having the infection, thus saving resources.
In embodiments of the technology, statistical Bayesian Inference analysis is utilized to identify target peptides that represent the protein variants of interest (e.g. coverage). Technology of the present invention (see
For example, each peptide from a candidate list of peptides can be either excluded or included in an experiment for arriving at or determining the selection. There are therefore 2^N possible experimental target lists L_i, where N is the number of candidate peptides and 1≤i≤2^N, that provide varying degrees of protein variant coverage C(i), where 0≤C(i)≤1 and C(i)=1 corresponds to 100% coverage. In order to assign relative merit to these potential target lists, other constraints must be considered, such as peak capacity of the separation system and duty cycle of the MS/MS detection system along with the requirement to minimize the number of peptides utilized. In reality, because there is an enormous number of possible experiments, the target lists cannot all be checked individually (to arrive at selection of actual workflow peptides). Instead, embodiments of the present technology apply a statistical approach in the selection step of the method.
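To make the combinatorics concrete, the following sketch computes the coverage of a target list and enumerates every subset of a very small candidate list. It assumes, for illustration only, that a protein variant counts as covered when its sequence contains at least one targeted peptide; the names `coverage` and `exhaustive_best` are hypothetical. Exhaustive enumeration is shown only to illustrate why it does not scale, since the number of subsets doubles with each added candidate.

```python
from itertools import combinations


def coverage(target_list, variant_sequences):
    """Fraction of variants containing at least one targeted peptide.

    Assumption: substring presence stands in for detectability.
    """
    if not variant_sequences:
        return 0.0
    covered = sum(
        1 for seq in variant_sequences
        if any(peptide in seq for peptide in target_list)
    )
    return covered / len(variant_sequences)


def exhaustive_best(candidates, variant_sequences):
    """Check all 2^N target lists; practical only for very small N."""
    best, best_key = (), (0.0, 0)
    for size in range(len(candidates) + 1):
        for subset in combinations(candidates, size):
            # Prefer higher coverage, then fewer peptides.
            key = (coverage(subset, variant_sequences), -len(subset))
            if key > best_key:
                best, best_key = subset, key
    return best
```

With four candidates there are only 16 target lists, but with 100 candidates there are 2^100, which motivates the statistical approach described in the text.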
In some embodiments, such as some of the embodiments discussed above, one or more statistical approaches may be used, such as those employed in Bayesian inference. Other approaches are also possible. For example, a Markov Chain Monte Carlo (MCMC) algorithm can be utilized in the selection step of the present technology. MCMC methods allow for efficient exploration of high-dimensional probability distributions by obtaining random samples (Monte Carlo) from the distribution using an iterative process in which each iteration depends on the properties of the distribution at the current position and possible destinations (Markov Chain).
When employing an MCMC approach in the present technology, the figure of merit, which must simultaneously encompass the goals of increasing protein coverage, minimizing the number of peptides and experimental compatibility, is not strictly a probability as such and there is considerable freedom in how to define it. As a result, in some embodiments, the figure of merit is split into two concepts: a “likelihood” function which is the protein coverage C(i) raised to some power S and an exponential “prior probability distribution” for the number of peptides in the target list having a mean that controls the relative importance of this parameter. The experimental constraints are incorporated by imposing a maximum on the number of simultaneously eluting target peptides in the method. For a given target list L_i this can be calculated by producing a scoring function in the form of a “virtual chromatogram” for the target list using the retention time of the targeted peptides and the system peak capacity.
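One possible reading of this figure of merit is sketched below in logarithmic form. Several details are assumptions made purely for illustration: each peptide is taken to occupy a window of width (gradient length / peak capacity) centered on its predicted retention time, the prior is taken as an exponential density in the number of peptides, and all parameter names and default values are hypothetical.

```python
import math


def max_coeluting(retention_times, gradient_minutes, peak_capacity):
    """Largest number of peptides eluting at once in a virtual chromatogram."""
    width = gradient_minutes / peak_capacity  # assumed peak width
    events = []
    for rt in retention_times:
        events.append((rt - width / 2, 1))   # peak starts
        events.append((rt + width / 2, -1))  # peak ends
    # At equal times, ends (-1) sort before starts (+1),
    # so peaks that merely touch are not counted as overlapping.
    events.sort()
    current = worst = 0
    for _, delta in events:
        current += delta
        worst = max(worst, current)
    return worst


def log_figure_of_merit(cov, retention_times, *, power=10.0, mean_peptides=5.0,
                        gradient_minutes=30.0, peak_capacity=100,
                        max_simultaneous=3):
    """log(likelihood) + log(prior), or -inf if a constraint is violated."""
    crowded = max_coeluting(retention_times, gradient_minutes, peak_capacity)
    if crowded > max_simultaneous or cov <= 0:
        return -math.inf
    n_peptides = len(retention_times)
    log_likelihood = power * math.log(cov)  # coverage raised to the power S
    log_prior = -n_peptides / mean_peptides - math.log(mean_peptides)
    return log_likelihood + log_prior
```

A larger `power` places more emphasis on coverage, while a smaller `mean_peptides` penalizes long target lists more strongly.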
In some embodiments the MCMC sampling of the included/excluded statuses of peptides may comprise Gibbs sampling, or Metropolis-Hastings sampling.
In some embodiments, the objective is to uniquely identify as many proteins as possible using a minimal number of peptides. However, owing to similarity and redundancy in the list of proteins provided, it may not be possible to find peptides that can uniquely identify certain proteins. It is therefore useful to introduce a numerical measure of “degeneracy” that can be included in the figure of merit employed in the optimization process. To give some examples, “degeneracy” could be the average number of proteins identified by a peptide sequence, the variance of this quantity or some combination of these.
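The two example measures of “degeneracy” mentioned above (the average number of proteins identified by a peptide sequence, and the variance of that quantity) can be sketched as follows. Combining them with a single `variance_weight` is only one hypothetical choice, as are the function names and the use of substring matching to decide whether a peptide identifies a protein.

```python
from statistics import mean, pvariance


def proteins_per_peptide(target_list, proteins):
    """For each peptide, count the proteins whose sequence contains it.

    `proteins` maps a protein identifier to its amino acid sequence.
    """
    return [
        sum(1 for seq in proteins.values() if peptide in seq)
        for peptide in target_list
    ]


def degeneracy(target_list, proteins, variance_weight=0.0):
    """Mean proteins-per-peptide, optionally plus a weighted variance term."""
    counts = proteins_per_peptide(target_list, proteins)
    if not counts:
        return 0.0
    return mean(counts) + variance_weight * pvariance(counts)
```

A value close to 1 indicates that each targeted peptide tends to point to a single protein, whereas larger values indicate peptides shared among several proteins.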
In line with standard Bayesian analysis, certain embodiments employ a “posterior probability distribution” which is given by the product of the “prior probability distribution” with the “likelihood” function. In the early stages of the statistical analysis, the “likelihood” is softened (a procedure often referred to as simulated annealing) to reduce the probability of the analysis becoming trapped in local minima.
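The following is one possible sketch, under stated assumptions, of Metropolis-Hastings sampling over the included/excluded statuses of peptides with a softened likelihood in the early stages. The function name, the linear annealing schedule, the single-peptide toggle proposal, and all default parameter values are illustrative assumptions, not the implementation used in the Examples.

```python
import math
import random

def select_peptides(peptide_to_proteins, n_proteins, S=50.0,
                    mean_peptides=2.0, n_steps=5000, seed=0):
    """Return the best target list visited and its log posterior score."""
    rng = random.Random(seed)
    peptides = sorted(peptide_to_proteins)  # assumed non-empty

    def log_likelihood(selection):
        covered = set()
        for p in selection:
            covered.update(peptide_to_proteins[p])
        if not covered:
            return -math.inf
        return S * math.log(len(covered) / n_proteins)  # log(C**S)

    def log_prior(selection):
        return -len(selection) / mean_peptides  # exponential prior

    current = set(peptides)  # start with every candidate included
    cur_ll, cur_lp = log_likelihood(current), log_prior(current)
    best, best_score = set(current), cur_ll + cur_lp

    for step in range(n_steps):
        # Annealing: the likelihood exponent rises linearly to 1 over the
        # first half of the run and stays at 1 afterwards.
        beta = min(1.0, 2.0 * (step + 1) / n_steps)
        # Symmetric proposal: toggle one uniformly chosen peptide.
        proposal = current ^ {rng.choice(peptides)}
        new_ll, new_lp = log_likelihood(proposal), log_prior(proposal)
        if new_ll == -math.inf:
            continue  # zero coverage is always rejected
        log_accept = beta * (new_ll - cur_ll) + (new_lp - cur_lp)
        if log_accept >= 0 or rng.random() < math.exp(log_accept):
            current, cur_ll, cur_lp = proposal, new_ll, new_lp
            if cur_ll + cur_lp > best_score:
                best, best_score = set(current), cur_ll + cur_lp
    return best, best_score
```

Calling the function with different values of `seed` gives independent runs, which is one way of obtaining alternative candidate target lists.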
Another advantage of utilizing MCMC-based approaches in selection steps is their ability to provide many alternative solutions to the problem, either by re-running the analysis several times with different random seeds, or by taking several representative samples in the final stages of the analysis. This provides some flexibility in the event that a promising-looking method performs less well in practice than is predicted through simulation. To give just one example, actual retention times of peptides will differ from simulated values, which could lead to the experimental capacity being exceeded.
In some embodiments, several explorations may be carried out in parallel, and the individual exploration “objects” may interact with each other at certain times during the exploration, for example, in a genetic algorithm, nested sampling or a particle swarm optimization.
The following Examples illustrate an MCMC approach utilized in the methods of the present technology during the selection steps. Prior to the selection steps, collection steps including in silico digestion were performed.
Example 1: Selection of Peptides for Differentiation of Protein Virus Variants—Analysis Using MCMC Approach. In this example, initial selection criteria were simulated to identify actual workflow peptides for the differentiation analysis of known human coronaviruses (e.g., SARS-CoV-2, SARS-CoV, MERS-CoV, CoV 229E, CoV OC43, CoV NL63, and CoV HKU1). The protein virus variant amino acid sequences were obtained from the UniRef section (100% sequence identity clusters) of the UniProt Protein Knowledge Database and processed using the collection steps illustrated in
Example 2: Selection of Peptides for Detection of all Known Protein Virus Variants. In this example, the target of detection is the presence or absence of a disease state, specifically influenza. That is, the question is whether a protein complement of influenza A or B is detected. This method can be adopted for other disease states, such as the presence of one or more corona virus protein variants. In the present example, reviewed (i.e., manually annotated) amino acid sequences were obtained from the UniProt Protein Knowledge Database and were analyzed and pre-processed using the same filter criteria as described in Example 1.
Two additional protein virus subsets were created to demonstrate the effect of variant inclusion on peptide sample sets, restricting the analysis to only the most frequently observed and the gene translation products that undergo mutation, requiring 100% and 95% variant coverage respectively. The top three panels (
Example 3: Selection of Peptides for Detection of all Known Protein Virus Variants. In this example, the analysis of Example 2 is extended to include circulating influenza virus proteins based on WHO vaccine development recommendations (https://www.who.int/influenza/vaccines/virus/recommendations/202002_recommendation.pdf). The proteins and amino acid sequences from predicted circulating influenza viruses were obtained from the UniRef section (100% sequence identity clusters) and the reviewed entries of the UniProt Protein Knowledge Database.
In this particular example, which demonstrates that the pre-analysis step can readily be tailored based on, for example, expert domain knowledge, the filtering step prior to the MCMC analyses included limiting the in silico generated peptides (digestion) to a subset of peptides with a sequence length of 5 to 20 amino acids and allowing for one missed cleavage.
In silico determined physicochemical analyte properties included normalized retention time and relative ionization efficiency (predicted abundance of doubly (2+) and triply (3+) charged peptide ions of over 10% of total abundance). For the peptide selection of both circulating influenza (A and B) variants and circulating variants plus UniProt reviewed influenza (A and B) variants, 100% and 95% variant coverage were considered, representing four cases in total. For all these cases, three possible solutions were determined. For the 100% variant coverage, each possible solution (i.e., Solution 1, Solution 2, and Solution 3) included 146 or 149 sequences. For the 95% variant coverage, each possible solution included 84 or 87 sequences.
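A length-filtered in silico digestion of this kind might be sketched as shown below. The disclosure does not name the protease here, so the trypsin-like rule used in the sketch (cleavage after K or R unless the next residue is P) is purely an assumption, as are the function name and the example sequence.

```python
def digest(sequence, missed_cleavages=1, min_len=5, max_len=20):
    """Return the set of peptides passing the length filter.

    Assumed cleavage rule (trypsin-like): cut after K or R,
    except when the following residue is P.
    """
    sites = [0]
    for i, residue in enumerate(sequence[:-1]):
        if residue in "KR" and sequence[i + 1] != "P":
            sites.append(i + 1)
    sites.append(len(sequence))

    peptides = set()
    for i in range(len(sites) - 1):
        # j = i + 1 is a fully cleaved peptide; each further step
        # spans one additional missed cleavage site.
        last = min(i + 1 + missed_cleavages, len(sites) - 1)
        for j in range(i + 1, last + 1):
            peptide = sequence[sites[i]:sites[j]]
            if min_len <= len(peptide) <= max_len:
                peptides.add(peptide)
    return peptides

# Hypothetical sequence, for illustration only.
print(sorted(digest("ACDEFKGHILMRNPQK")))
```

With the hypothetical sequence above, the four-residue fragment NPQK is removed by the length filter but still appears as part of the one-missed-cleavage peptide GHILMRNPQK.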
Additional information reflected in the graph of
The above examples illustrate the effects that the filtering criteria and the statistical analysis approaches used to derive a scoring function have on the number of selection results. This approach can be used to simulate an enormous number of laboratory experiments in order to arrive at actual workflow peptides for the desired analysis.
Example 4: Selection of Peptides for Detection of all Known Protein Virus—Respiratory Syncytial Virus (RSV). As another example, solutions were determined for RSV with the proteins obtained from the UniRef section (100% sequence identity clusters) and the reviewed entries of the UniProt Protein Knowledge Database. The pre-analysis filtering steps were identical to those of Example 3, and 95% and 100% variant coverage solutions were determined. Three possible solutions were determined for both cases. For the 100% variant coverage, each possible solution (i.e., Solution 1, Solution 2, and Solution 3) included 62 sequences. For the 95% variant coverage, each possible solution included 53 or 54 sequences.
Table 1 below summarizes Examples 1-4 and identifies the optional protein subsets.
†abbreviations (nucleocapsid (N) and nucleoprotein (NP) are interchangeably used in literature and protein knowledge databases):
In additional embodiments, further separation or filtering steps may be employed in the analysis, including, but not limited to, ion mobility separation. These separations may be modelled as part of an in silico experimental design workflow. In the case of ion mobility separation, arrival times of peptides or peptide fragments at a particular point in the instrument (for example, a mass filter) may be determined using calibration information and/or values from previous experiments or literature.
This application claims priority to and benefit of U.S. provisional patent application No. 63/276,783, filed Nov. 8, 2021 entitled “Virus Peptide and Protein Variant Selection Workflow,” the content of which is incorporated herein by reference in its entirety.
Number | Date | Country
---|---|---
63276783 | Nov 2021 | US