This invention was made with government support under Grant No. AI114958 awarded by the National Institutes of Health. The government has certain rights in the invention.
Exposure to infectious agents results in expansion of specific B- and T-cell clones, characterized by epitope-specific antibody and T-cell receptor (TCR) genes. High-throughput single-cell repertoire sequencing can be used to determine the frequency of each clone in a blood sample, but not which expanded clones are specific to a given exposure, as opposed to past/intercurrent exposures, or to bystander activation. If the specific clones were known, their presence and frequency could be used to detect specific exposures. Because clonal expansion is a signal amplifier, such an approach to diagnosis may be more sensitive than direct pathogen detection, especially in chronic infections such as, for example, tuberculosis (TB), in which the immune response plays a prominent role in disease pathogenesis, and in infections such as Lyme disease, in which the organism is difficult to grow in the laboratory. If the specific clones were known for many different exposures, those exposures could be diagnosed simultaneously from a single sequence-based test. In addition to diagnostic utility, the sequences that define specific clones could be explored as potential reagents and therapeutics. This is the promise of high-throughput repertoire sequencing.
The challenge is to determine which clones are specific to a given exposure. One strategy is to sequence repertoires from many exposed individuals to identify sequences that are seen more often than by chance, using repertoires from unexposed individuals as controls. However, the diversity of TCR and especially antibody sequences means that repertoires from different exposed individuals rarely contain the same expanded clones, necessitating very large cohorts.
There is, therefore, an ongoing need for automated techniques for finding patterns in, and for classifying, immune repertoires.
In accordance with an embodiment of the invention, there is provided a computer-based system and method for associating immune system repertoires with specific stimuli (exposures) based on the biophysical properties of the repertoire's receptors. Sequences of a training repertoire are converted into a set of biophysical properties, and a computer-based compact representation of the training repertoire is built using maximum entropy modeling. In one version, an “immunome-wide association study” is performed by computer scoring a test repertoire using several such models to classify the test repertoire as being associated with a biological condition or not. In another version, one or more sets of parameters from the computer-based models are found that together classify each model as being from an individual that has the condition or from an individual that does not.
In one embodiment according to the invention, there is provided a computer-implemented method of classifying an immune system repertoire. The computer-implemented method comprises providing a data structure representing a plurality of training biological sequences that are included in at least one training immune system repertoire; and, for the training biological sequences represented by the data structure, associating, in a manner automated by a processor, one or more biophysical properties and operatively indicating the biophysical properties in a plurality of training repertoire biophysical feature data structures. The training repertoire biophysical feature data structures computationally represent the one or more biophysical properties of the training biological sequences based on expectation values of at least one biophysical composite measure for each of a plurality of feature components. The plurality of feature components includes feature components corresponding to an amino acid sequence of the training biological sequences. A maximum entropy model is formed, in an automated fashion by the digital processor, based on the training repertoire biophysical feature data structures. The formed maximum entropy model comprises a bias parameter for each feature component of the plurality of feature components. A data structure is provided representing a plurality of test biological sequences that are included in at least one test immune system repertoire. Based on the formed maximum entropy model and the data structure representing the plurality of test biological sequences, the test immune system repertoire is classified, in an automated fashion by the processor. The classifying includes classifying the test immune system repertoire as being associated with at least one biological condition or as not being associated with the at least one biological condition.
In further, related embodiments, classifying the test immune system repertoire may comprise scoring, in an automated fashion by the processor, the data structure representing the plurality of test biological sequences against both (i) at least one biological condition-positive maximum entropy model determined based on a training repertoire biophysical feature data structure that is known to be associated with the at least one biological condition, and (ii) at least one biological condition-negative maximum entropy model determined based on a training repertoire biophysical feature data structure that is known not to be associated with the at least one biological condition. The method may further comprise forming, in an automated fashion by the processor, an all-model score classifier module implemented by the processor, the forming of the all-model score classifier module comprising determining with the processor a plurality of all-model scores against both the at least one biological condition-positive maximum entropy model and the at least one biological condition-negative maximum entropy model. The all-model classifier module may permit generating, in an automated fashion by the processor, data structures representing at least one of: a histogram of the plurality of all-model scores versus a fraction of the test biological sequences, and a two or more dimensional cloud of the all-model scores. Forming the all-model score classifier module may comprise dividing, in an automated fashion by the processor, the plurality of scores against the at least one biological condition-positive maximum entropy model by the plurality of scores against the at least one biological condition-negative maximum entropy model, the dividing comprising desired weighting and normalizing. 
The computer-based method may further comprise classifying, in an automated fashion by the processor, the test immune system repertoire based on an increased probability density beyond expected probability density determined based on at least a portion of at least one of: the data structure representing the histogram of the plurality of all-model scores, and the data structure representing the two or more dimensional cloud of the all-model scores.
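By way of non-limiting illustration, the all-model scoring described above can be sketched as follows. The sketch assumes that each maximum entropy model can be reduced to a function returning a log-probability for a sequence. The names `all_model_score`, `score_histogram` and `make_toy_model` are hypothetical, the toy models are stand-ins rather than trained models, and the log-ratio of mean probabilities is only one possible choice of weighting and normalizing.

```python
import math
from collections import Counter

def log_mean_exp(values):
    """Numerically stable log of the mean of exp(values)."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values) / len(values))

def all_model_score(seq, pos_models, neg_models):
    """Log-ratio of the mean probability under condition-positive models
    to the mean probability under condition-negative models (assumed form)."""
    pos = log_mean_exp([model(seq) for model in pos_models])
    neg = log_mean_exp([model(seq) for model in neg_models])
    return pos - neg

def score_histogram(scores, bin_width=1.0):
    """Fraction of test sequences falling in each score bin."""
    counts = Counter(math.floor(s / bin_width) for s in scores)
    return {b * bin_width: c / len(scores) for b, c in sorted(counts.items())}

# Toy stand-in models (hypothetical): the log-probability depends only on
# how many 'W' residues a sequence contains.
def make_toy_model(w_bias):
    return lambda seq: w_bias * seq.count("W") - len(seq)

pos_models = [make_toy_model(0.9), make_toy_model(1.1)]
neg_models = [make_toy_model(-0.1), make_toy_model(0.1)]

test_repertoire = ["CASSWWF", "CASSLGF", "CAWWWYF", "CASSPGF"]
scores = [all_model_score(s, pos_models, neg_models) for s in test_repertoire]
histogram = score_histogram(scores)
```

In this sketch, a test repertoire whose histogram shows more probability density at high scores than expected would be a candidate for classification as associated with the condition.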
In other related embodiments, classifying the test immune system repertoire may comprise determining, in an automated fashion by the processor, a reduced subset of the bias parameters of the maximum entropy model that permit classifying the test immune system repertoire with a desired level of accuracy as being systematically associated with, or not systematically associated with, the at least one biological condition. The reduced subset of bias parameters may be determined in an automated fashion by the processor based at least on the bias parameters of the maximum entropy model using a Metropolis-Hastings Markov-Chain Monte-Carlo procedure. The reduced subset of bias parameters may be determined, in an automated fashion by the processor, based at least on the bias parameters of the maximum entropy model, using at least one of a principal component analysis procedure, an independent component analysis procedure, a maximum accuracy separator module, such as a linear support-vector machine classifier, or other cost-minimizing procedure, implemented in an automated fashion by the processor, to separate at least one biological condition-positive maximum entropy model from at least one biological condition-negative maximum entropy model.
In further related embodiments, the at least one biophysical composite measure may comprise a result of a dimensionality reduction of a plurality of individual amino acid measures. The dimensionality reduction may comprise at least one of: a principal components analysis dimensionality reduction, an independent components analysis dimensionality reduction, a t-distributed stochastic neighbor embedding dimensionality reduction, a non-negative matrix factorization dimensionality reduction, a linear discriminant analysis dimensionality reduction, a generalized discriminant analysis dimensionality reduction, and an autoencoder dimensionality reduction. The plurality of individual amino acid measures may comprise physical measures and chemical measures of each of twenty naturally-occurring amino acids, or of at least one artificial amino acid. The at least one biophysical composite measure may comprise ten or fewer biophysical composite measures. The plurality of feature components may further include a plurality of feature components corresponding to at least one of: nearest neighbor pairs of the amino acid sequence of the training biological sequences; next-nearest neighbor pairs of the amino acid sequence of the training biological sequences; third-nearest neighbor pairs of the amino acid sequence of the training biological sequences; fourth-nearest neighbor pairs of the amino acid sequence of the training biological sequences; symmetric cross pairs of the amino acid sequence of the training biological sequences; asymmetric cross pairs of the amino acid sequence of the training biological sequences; amino acid triples of the amino acid sequence of the training biological sequences; a complementarity-determining region length distribution of the amino acid sequence of the training biological sequences; consecutive quadruples of amino acids of the amino acid sequence of the training biological sequences; at least one stem property of the amino acid sequence of the training biological sequences; at least one loop property of the amino acid sequence of the training biological sequences; and at least one complementarity-determining region property of the amino acid sequence of the training biological sequences.
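By way of non-limiting illustration, the following sketch shows one plausible way in which nearest neighbor and next-nearest neighbor pair feature components could be expressed as expectation values of biophysical composite measures. The table `COMPOSITE` is hypothetical and holds placeholder numbers rather than real biophysical data, only two composite measures and four amino acids are used for brevity, and the mean-of-products form of each feature is an assumption.

```python
from itertools import product

# Hypothetical composite measures per amino acid (placeholder numbers,
# NOT real biophysical data); a reduced alphabet is used for brevity.
COMPOSITE = {
    "A": (0.5, -0.2),
    "C": (0.1, 0.9),
    "S": (-0.4, 0.3),
    "F": (1.2, -0.7),
}
N_MEASURES = 2

def pair_features(seq, gap):
    """Mean product of composite measures for residues `gap` apart.
    gap=1 gives nearest neighbor pairs, gap=2 next-nearest, and so on."""
    pairs = [(seq[i], seq[i + gap]) for i in range(len(seq) - gap)]
    feats = {}
    for a, b in product(range(N_MEASURES), repeat=2):
        total = sum(COMPOSITE[x][a] * COMPOSITE[y][b] for x, y in pairs)
        feats[(gap, a, b)] = total / len(pairs) if pairs else 0.0
    return feats

def repertoire_expectations(repertoire, gaps=(1, 2)):
    """Average each feature component over all sequences in a repertoire."""
    sums = {}
    for seq in repertoire:
        for gap in gaps:
            for key, value in pair_features(seq, gap).items():
                sums[key] = sums.get(key, 0.0) + value
    return {key: value / len(repertoire) for key, value in sums.items()}

expectations = repertoire_expectations(["CASSF", "CAFSF", "CSSAF"])
```

With two composite measures and two pair types, the sketch yields 2 x 2 x 2 = 8 feature components; the same construction extends to more distant pairs by adding entries to `gaps`.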
In other, related embodiments, the training biological sequences may comprise at least one of antibodies and T-cell receptors, and may comprise both antibodies and T-cell receptors. The at least one biological condition may comprise at least one of: a vaccination, an infection, an autoimmune condition, a disease, a transfusion reaction, a transplant rejection, aging, a cancer, a gender, a geographical background and a species, strain or genotype. The method may further comprise determining, in an automated fashion by the processor, a probability of the test immune system repertoire having been generated by the maximum entropy model. The method may further comprise determining, in an automated fashion by the processor, similarity scores comparing at least two different test immune system repertoires with each other based on the maximum entropy model, or similarity scores comparing at least two different sequences with each other based on the maximum entropy model. Forming the maximum entropy model may comprise training, in an automated fashion by the processor, the maximum entropy model on the plurality of feature components using a Metropolis-Hastings Markov-Chain Monte-Carlo procedure.
In another embodiment according to the invention, there is provided a computer-implemented method of generating a biological sequence data structure corresponding to an immune system repertoire, using a maximum entropy model previously generated by: providing a data structure representing a plurality of training biological sequences that are included in at least one training immune system repertoire; for the training biological sequences represented by the data structure, associating, in a manner automated by a processor, one or more biophysical properties and operatively indicating the biophysical properties in a plurality of training repertoire biophysical feature data structures; the training repertoire biophysical feature data structures computationally representing the one or more biophysical properties of the training biological sequences based on expectation values of at least one biophysical composite measure for each of a plurality of feature components, the plurality of feature components including feature components corresponding to an amino acid sequence of the training biological sequences; and forming, in an automated fashion by the processor, a maximum entropy model based on the training repertoire biophysical feature data structures, the formed maximum entropy model comprising a bias parameter for each feature component of the plurality of feature components. The computer-implemented method comprises, based on a maximum entropy model so determined, forming, in an automated fashion with a processor, a new biological sequence data structure representing an immune system repertoire comprising similar biophysical properties to the at least one training immune system repertoire, based on at least the bias parameters of the maximum entropy model.
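By way of non-limiting illustration, generating new sequences from the bias parameters of a maximum entropy model can be sketched with a Metropolis-Hastings sampler. The sketch assumes a model of the form P(s) proportional to exp(sum of bias times feature), with one feature component per composite measure. The table `COMPOSITE`, the function names, the bias values and the sampler settings are all hypothetical.

```python
import math
import random

# Hypothetical composite measures (placeholder values, not real data).
COMPOSITE = {"A": (0.5, -0.2), "C": (0.1, 0.9), "S": (-0.4, 0.3), "F": (1.2, -0.7)}
ALPHABET = sorted(COMPOSITE)

def log_weight(seq, biases):
    """Unnormalized log-probability: sum over feature components of
    bias * feature value (here, per-sequence sums of each measure)."""
    return sum(
        b * sum(COMPOSITE[aa][k] for aa in seq) for k, b in enumerate(biases)
    )

def sample_sequences(biases, length, n_samples, burn_in=200, thin=20, seed=0):
    """Metropolis-Hastings with symmetric single-residue proposals."""
    rng = random.Random(seed)
    current = [rng.choice(ALPHABET) for _ in range(length)]
    current_lw = log_weight(current, biases)
    samples = []
    step = 0
    while len(samples) < n_samples:
        proposal = list(current)
        proposal[rng.randrange(length)] = rng.choice(ALPHABET)
        proposal_lw = log_weight(proposal, biases)
        delta = proposal_lw - current_lw
        # Accept with probability min(1, exp(delta)).
        if delta >= 0 or rng.random() < math.exp(delta):
            current, current_lw = proposal, proposal_lw
        step += 1
        if step > burn_in and step % thin == 0:
            samples.append("".join(current))
    return samples

generated = sample_sequences(biases=(2.0, 0.0), length=6, n_samples=200)
```

Because the first bias is positive, the generated sequences are enriched in residues with a high first composite measure ("F" in this placeholder table), so the new repertoire shares the biophysical tendency encoded by the biases rather than any particular sequence.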
In another embodiment according to the invention, there is provided a computer system for classifying an immune system repertoire. The computer system comprises a training sequence module configured to provide, in a manner automated by a processor, a data structure representing a plurality of training biological sequences that are included in at least one training immune system repertoire. A feature translator module is configured to associate, for the training biological sequences represented by the data structure, in a manner automated by a processor, one or more biophysical properties and to operatively indicate the biophysical properties in a plurality of training repertoire biophysical feature data structures. The training repertoire biophysical feature data structures computationally represent the one or more biophysical properties of the training biological sequences based on expectation values of at least one biophysical composite measure for each of a plurality of feature components, the plurality of feature components including feature components corresponding to an amino acid sequence of the training biological sequences. A modeling module is configured to form, in an automated fashion by the processor, a maximum entropy model based on the training repertoire biophysical feature data structures, the formed maximum entropy model comprising a bias parameter for each feature component of the plurality of feature components. A test sequence module is configured to provide, in a manner automated by a processor, a data structure representing a plurality of test biological sequences that are included in at least one test immune system repertoire. 
A classifier module is configured to, based on the formed maximum entropy model and the data structure representing the plurality of test biological sequences, classify, in an automated fashion by the processor, the test immune system repertoire, the classifying including classifying the test immune system repertoire as being associated with at least one biological condition or as not being associated with the at least one biological condition.
In further, related embodiments, the classifier module may be further configured to classify the test immune system repertoire by scoring, in an automated fashion by the processor, the data structure representing the plurality of test biological sequences against both (i) at least one biological condition-positive maximum entropy model determined based on a training immune system repertoire that is known to be associated with the at least one biological condition, and (ii) at least one biological condition-negative maximum entropy model determined based on a training immune system repertoire that is known not to be associated with the at least one biological condition. The system may further comprise an all-model score generator configured to form, in an automated fashion by the processor, an all-model score classifier module implemented by the processor, the forming of the all-model score classifier module comprising determining with the processor a plurality of all-model scores against both the at least one biological condition-positive maximum entropy model and the at least one biological condition-negative maximum entropy model. The all-model classifier module may permit generating, in an automated fashion by the processor, a data structure representing at least one of: a histogram of the plurality of all-model scores versus a fraction of the test biological sequences, and a two or more dimensional cloud of the all-model scores. The all-model score generator may be further configured to form the all-model score classifier by dividing, in an automated fashion by the processor, the plurality of scores against the at least one biological condition-positive maximum entropy model by the plurality of scores against the at least one biological condition-negative maximum entropy model, the dividing comprising desired weighting and normalizing. 
The classifier module may be further configured to classify, in an automated fashion by the processor, the test immune system repertoire based on an increased probability density beyond expected probability density determined based on at least a portion of at least one of: the data structure representing the histogram of the plurality of all-model scores, and the data structure representing the two or more dimensional cloud of the all-model scores.
In other, related embodiments, the classifier module may be further configured to classify the test immune system repertoire based on determining, in an automated fashion by the processor, a reduced subset of the bias parameters of the maximum entropy model that permit classifying the test immune system repertoire with a desired level of accuracy as being systematically associated with, or not systematically associated with, the at least one biological condition. The classifier module may be further configured to determine the reduced subset of bias parameters, in an automated fashion by the processor, based at least on the bias parameters of the maximum entropy model using a Metropolis-Hastings Markov-Chain Monte-Carlo procedure. The classifier module may be further configured to determine the reduced subset of bias parameters, in an automated fashion by the processor, based at least on the bias parameters of the maximum entropy model using at least one of a principal component analysis procedure, an independent component analysis procedure, a linear support-vector machine classifier, or other cost-minimizing procedure. The system may further comprise a maximum accuracy separator module configured to separate, in an automated fashion by the processor, at least one biological condition-positive maximum entropy model from at least one biological condition-negative maximum entropy model. The maximum accuracy separator module may comprise a linear support-vector machine classifier.
In further related embodiments, the at least one biophysical composite measure may comprise a result of a dimensionality reduction of a plurality of individual amino acid measures. The dimensionality reduction may comprise at least one of: a principal components analysis dimensionality reduction, an independent components analysis dimensionality reduction, a t-distributed stochastic neighbor embedding dimensionality reduction, a non-negative matrix factorization dimensionality reduction, a linear discriminant analysis dimensionality reduction, a generalized discriminant analysis dimensionality reduction and an autoencoder dimensionality reduction. The plurality of individual amino acid measures may comprise physical measures and chemical measures of each of twenty naturally-occurring amino acids, and may comprise at least one artificial amino acid. The at least one biophysical composite measure may comprise ten or fewer biophysical composite measures. The plurality of feature components may further include a plurality of feature components corresponding to at least one of: nearest neighbor pairs of the amino acid sequence of the training biological sequences; next-nearest neighbor pairs of the amino acid sequence of the training biological sequences; third-nearest neighbor pairs of the amino acid sequence of the training biological sequences; fourth-nearest neighbor pairs of the amino acid sequence of the training biological sequences; symmetric cross pairs of the amino acid sequence of the training biological sequences; asymmetric cross pairs of the amino acid sequence of the training biological sequences; amino acid triples of the amino acid sequence of the training biological sequences; a complementarity-determining region length distribution of the amino acid sequence of the training biological sequences; consecutive quadruples of amino acids of the amino acid sequence of the training biological sequences; at least one stem property of the amino acid sequence of the training biological sequences; at least one loop property of the amino acid sequence of the training biological sequences; and at least one complementarity-determining region property of the amino acid sequence of the training biological sequences.
In other related embodiments of the computer system, the training biological sequences may comprise at least one of antibodies and T-cell receptors, such as both antibodies and T-cell receptors. The at least one biological condition may comprise at least one of: a vaccination, an infection, an autoimmune condition, a disease, a transfusion, a transplant, aging, a cancer, a gender, a geographical background and a species, strain or genotype. The classifier module may further comprise a probability determination module configured to determine, in an automated fashion by the processor, a probability of the test immune system repertoire having been generated by the maximum entropy model. The classifier module may be further configured to determine, in an automated fashion by the processor, similarity scores comparing at least two different test immune system repertoires with each other based on the maximum entropy model. The modeling module may be configured to form the maximum entropy model by training, in an automated fashion by the processor, the maximum entropy model on the plurality of feature components using a Metropolis-Hastings Markov-Chain Monte-Carlo procedure.
In another embodiment according to the invention, there is provided a non-transitory computer-readable medium configured to store instructions for classifying an immune system repertoire. The instructions, when loaded and executed by a processor, cause the processor to classify the immune system repertoire by: providing a data structure representing a plurality of training biological sequences that are included in at least one training immune system repertoire; for the training biological sequences represented by the data structure, associating, in a manner automated by a processor, one or more biophysical properties and operatively indicating the biophysical properties in a plurality of training repertoire biophysical feature data structures; the training repertoire biophysical feature data structures computationally representing the one or more biophysical properties of the training biological sequences based on expectation values of at least one biophysical composite measure for each of a plurality of feature components, the plurality of feature components including feature components corresponding to an amino acid sequence of the training biological sequences; forming, in an automated fashion by the processor, a maximum entropy model based on the training repertoire biophysical feature data structures, the formed maximum entropy model comprising a bias parameter for each feature component of the plurality of feature components; providing a data structure representing a plurality of test biological sequences that are included in at least one test immune system repertoire; and based on the formed maximum entropy model and the data structure representing the plurality of test biological sequences, classifying, in an automated fashion by the processor, the test immune system repertoire, the classifying including classifying the test immune system repertoire as being associated with at least one biological condition or as not being associated with the at least one biological condition.
The foregoing will be apparent from the following more particular description of example embodiments, as illustrated in the accompanying drawings in which like reference characters refer to the same parts throughout the different views. The drawings are not necessarily to scale, emphasis instead being placed upon illustrating embodiments.
A description of example embodiments follows.
The adaptive immune system consists of B cells, which make antibodies, and T cells, which target infected or cancerous cells. Its power is in being able to respond to almost any stimulus. As a result, it plays important roles in many conditions and health interventions, including vaccination, infection, autoimmunity, cardiovascular disease, transfusion, transplant, aging, and cancer. The key to this breadth is that each new B or T cell makes a unique receptor (the antibody in B cells and the T-cell receptor in T cells) that targets that cell to specific parts of specific molecules, called antigens or epitopes, related to the stimulus. When cells encounter their stimuli, they divide, producing more cells with their specific receptors. Repeat encounters thus result in a signal related to this specific stimulus, which should be detectable against the background of non-specific B and T cells that make up the rest of the B- and T-cell repertoire. The primary obstacle to detection is that a person's cells can make so many different receptors (in fact, orders of magnitude more than the number of B or T cells that a person actually has) that the same stimulus may stimulate different receptors in different people. Thus, there may be no common signal at the level of detail of receptors' nucleotide or amino-acid sequences. Yet because receptors work through binding, i.e., through having shapes complementary to their antigens or epitopes, it is reasonable to expect a signal in the biophysical properties that determine the shapes of receptors, properties that receptors differing at the sequence level may share.
In accordance with an embodiment of the invention, there is provided a method for associating B- and T-cell repertoires with specific stimuli based on the biophysical properties of the repertoire's receptors. Specifically, a machine-learning approach of maximum entropy modeling is used that: (i) takes as input a list of antibody or T-cell receptor sequences (a “training repertoire”); (ii) converts each sequence into a set of biophysical properties; and then (iii) builds a compact representation (“model”) of the training repertoire that can be used to score a test repertoire. In one embodiment (iv), referred to as an immunome-wide association study (IWAS), the test repertoire is scored by several models, some of which were trained on repertoires from individuals who have a certain condition (e.g. cytomegalovirus [CMV] infection) and others of which were trained on repertoires from individuals who do not have that condition (e.g. uninfected controls), to classify the test repertoire as being associated with that condition or not. In a second embodiment, (v) one or more sets of parameters from the models are found that together (including through transformations such as principal components analysis or independent components analysis) classify each model as being from an individual that has the condition or from an individual that does not.
In accordance with an embodiment of the invention, the maximum entropy model 122 is composed of a set of parameters called biases, indicated as b1, b2 . . . bN.
It should be noted that, in accordance with an embodiment of the invention, the bias for a feature differs from the measurement of the feature in the training repertoire 118. Datasets as complex as immune repertoires require many features to describe; for example, a model in accordance with an embodiment of the invention can contain on the order of 10^3 to 10^4 features.
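By way of non-limiting illustration, the distinction between a bias and the measurement of its feature can be sketched on a toy model small enough to enumerate exactly. The sketch assumes a model of the form P(s) proportional to exp(sum of bias times feature) and fits the biases by repeatedly nudging each one until the model's expectation matches the measured expectation. Exact enumeration is used here in place of a Metropolis-Hastings Markov-Chain Monte-Carlo procedure purely for brevity; the table `COMPOSITE`, the training sequences and the function names are hypothetical.

```python
import math
from itertools import product

# Hypothetical composite measures (placeholder values, not real data).
COMPOSITE = {"A": (0.5, -0.2), "C": (0.1, 0.9), "S": (-0.4, 0.3), "F": (1.2, -0.7)}
N_MEASURES = 2
ALL_SEQS = ["".join(p) for p in product(sorted(COMPOSITE), repeat=3)]

def features(seq):
    """Feature components: mean of each composite measure over the sequence."""
    return [sum(COMPOSITE[aa][k] for aa in seq) / len(seq) for k in range(N_MEASURES)]

ALL_FEATS = [features(s) for s in ALL_SEQS]

def model_expectations(biases):
    """Exact feature expectations under P(s) proportional to exp(b . f(s))."""
    weights = [math.exp(sum(b * f for b, f in zip(biases, fs))) for fs in ALL_FEATS]
    z = sum(weights)
    return [
        sum(w * fs[k] for w, fs in zip(weights, ALL_FEATS)) / z
        for k in range(N_MEASURES)
    ]

def fit_biases(training, rate=1.0, steps=2000):
    """Adjust each bias until model expectations match the training data."""
    target = [
        sum(features(s)[k] for s in training) / len(training)
        for k in range(N_MEASURES)
    ]
    biases = [0.0] * N_MEASURES
    for _ in range(steps):
        current = model_expectations(biases)
        biases = [b + rate * (t - c) for b, t, c in zip(biases, target, current)]
    return biases, target

biases, measured = fit_biases(["FAF", "FFA", "AFC", "FSF"])
```

After fitting, `model_expectations(biases)` reproduces `measured`, yet the values in `biases` are not equal to the values in `measured`: the bias is the parameter that makes the model reproduce the measurement, not the measurement itself.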
In accordance with an embodiment of the invention, one or more sets of parameters from the models are found that together (including through transformations such as principal components analysis or independent components analysis) classify each model as being from an individual that has a biological condition or from an individual that does not. Together, the biases that comprise a given model describe a given repertoire. Exposure to an infection, e.g., cytomegalovirus (CMV), will result in changes to the sequence composition of a repertoire. The biases describe this composition (indeed, the biases can be used to generate a repertoire that is statistically indistinguishable from the original repertoire, and in this sense, as a shorthand, the model is a generative model that can re-create its repertoire). Therefore, exposure affects the biases. However, all sorts of other interpersonal differences will also affect the biases, so some of the biases will differ systematically between people exposed to, e.g., CMV and people who are not, and other biases will differ randomly between those people. In accordance with an embodiment of the invention, the subset of biases that differ systematically is found. (For example, it might be found that in CMV bias #1 is >0.32, bias #2 is ≤0.4, bias #3 is either less than 0.4 or greater than 1.2, etc.) This subset is used as a classifier to classify unknown repertoires' models as being positive or negative for a biological condition, e.g., CMV+ or CMV−. In an investigation of CMV performed in accordance with an embodiment of the invention, an independent-components analysis (ICA) was performed to reduce a set of 5-20 biases to two dimensions. It was found that a linear support-vector machine classifier (the line that best separates CMV+ from CMV− models) gives an uncorrected accuracy of ~90% relative to the existing gold-standard test (serology).
However, because the gold standard itself is imperfect, with an accuracy of 92-97% depending on the study, the corrected accuracy of an embodiment according to the invention was found to be also 92-97%—as good as the industry standard.
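By way of non-limiting illustration, using a subset of biases as a classifier can be sketched with the illustrative interval rules quoted above for biases #1 to #3. This is a simple rule-based stand-in and not the independent-components analysis and linear support-vector machine procedure of the CMV investigation. The bias vectors, the reference labels and the function names are hypothetical.

```python
def is_condition_positive(biases):
    """Apply the illustrative interval rules to a model's first three
    biases (numbered from 1 in the text, indexed from 0 here)."""
    b1, b2, b3 = biases[:3]
    return b1 > 0.32 and b2 <= 0.4 and (b3 < 0.4 or b3 > 1.2)

def agreement_with_reference(models, reference_labels):
    """Fraction of models whose predicted label matches the reference test."""
    hits = sum(
        is_condition_positive(m) == ref for m, ref in zip(models, reference_labels)
    )
    return hits / len(models)

# Hypothetical bias vectors and hypothetical reference labels
# (for example, serology results); these are not study data.
models = [
    (0.50, 0.10, 0.20),
    (0.10, 0.30, 0.20),
    (0.40, 0.35, 1.50),
    (0.45, 0.20, 0.80),
]
reference = [True, False, True, True]
uncorrected_accuracy = agreement_with_reference(models, reference)
```

The agreement computed this way is an uncorrected figure in the sense used above, because a disagreement may reflect an error in the reference test rather than in the classifier.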
An embodiment according to the invention applies to finding patterns in any ensemble of biological sequences based on biophysical patterns.
In accordance with an embodiment of the invention, the training repertoire biophysical data structures 120 are based on biophysical properties, instead of on amino acids or nucleotide sequences. There are many fewer features (each requiring a parameter that must be fit) than in conventional techniques. For example, conventional methods that modeled 20 amino acids required 20 parameters (19 independent) to represent amino-acid frequencies, another 400 (399 independent) to represent nearest-neighbor amino-acid pairs, yet another 400 for next-nearest-neighbor pairs, and so on. Accurate sampling of large numbers of features requires impractically large training sets. In contrast, one embodiment according to the invention uses five parameters for amino-acid properties (which themselves summarize dozens of specific biophysical measurements) in place of the 20 for amino-acid frequencies, resulting in 5×5=25 for nearest-neighbor pairs, another 25 for next-nearest-neighbors, and so forth. This allows an embodiment according to the invention to consider more features (e.g., more distant pairs; sets of three or four positions) with training sets that are sufficiently small to usefully be derived from clinical samples. Models in accordance with an embodiment of the invention can easily generate new sequences that have similar biophysical properties to those in training repertoires. This includes generation of sequences that have similar properties to multiple training repertoires (e.g., from different infections), and that differ from multiple others (e.g., autoimmune diseases), simultaneously. An embodiment according to the invention outputs the probability of each sequence being generated by a model or set of models. (The sum of probabilities for all sequences equals 1.)
Having probabilities makes it possible to calculate relative probabilities that any given sequence is consistent with one or another repertoire, which is potentially useful for generating candidate sequences with desired properties. In addition, an embodiment according to the invention permits various similarity scores between different repertoires, e.g. repertoires from two different people or over time, which may be useful for discovering new relationships with various health conditions.
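By way of non-limiting illustration, the relative-probability calculation described above can be sketched as follows. The function name `relative_probability`, the equal weighting of the two models, and the example log-probabilities are assumptions made for illustration only and are not taken from any embodiment.

```python
import math

def relative_probability(log_p_a: float, log_p_b: float) -> float:
    """Probability that a sequence came from model A rather than model B.

    Assumes (for illustration) equal prior weight on the two models.
    Working with log-probabilities avoids underflow, since the probability
    of any single sequence under a repertoire model is typically very small.
    """
    return 1.0 / (1.0 + math.exp(log_p_b - log_p_a))

# Hypothetical log-probabilities of one sequence under two repertoire models.
log_p_exposed = -42.0
log_p_unexposed = -45.0
print(relative_probability(log_p_exposed, log_p_unexposed))
```

A value near 1 would indicate that the sequence is more consistent with the first model, and a value near 0 that it is more consistent with the second.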
Techniques in accordance with an embodiment of the invention can, for example, be used to provide information that assists with diagnostics, therapeutics, and reagents. In diagnostics, for example, an embodiment can be used to assist with classifying a test repertoire as being consistent with certain conditions. In this use case, a repertoire is obtained from a test subject and sequenced. The repertoire is then scored by sets of models that have been trained on repertoires from subjects with various conditions. If the test repertoire scores highly against the models for a given condition, the information can be used to assist with diagnosing the test subject with that condition. Note that the test subject can be tested simultaneously for any condition for which models exist. Thus, a single test could indicate whether, e.g., the test subject's vaccinations are achieving their desired effect, whether the test subject has been exposed to or is infected with any of a wide range of agents, whether they have an immune response to cancer or cancer therapy, whether they are at risk for a transfusion reaction or transplant rejection, and whether their immune system indicates premature aging.
Other embodiments can be used for therapeutic products. For example, to identify biological drug candidates, antibodies that score well according to models of repertoires from a given viral infection would serve as candidates for a drug that could prevent or treat that infection. In another embodiment, a system and method can be used in the same way to generate potential reagents, since antibodies are a major class of reagents in biomedical research and in clinical diagnostic testing.
By contrast, for diagnosis, the presently available standard of care in the field, for most infectious and autoimmune conditions, is use of a kit that contains an antibody-based reagent that is used to stain cells in a blood or tissue sample, or a reagent (which may be cells or a protein) derived from the agent that is mixed with patient serum to detect antibodies to the agent. For leukemias, the standard of care diagnostic is flow cytometry, usually following the appearance of unusual white cells on microscopy and disturbances in counts of white-cell subsets (again on routine flow cytometry). For lymphomas and most other cancers, it is biopsy and staining, usually with antibody-based reagents, occasionally supplemented by narrow-target sequence-based testing. By contrast with such conventional techniques, an embodiment according to the invention provides (i) the ability to provide information that assists with diagnosis of many conditions in a single “universal” test and (ii) the ability to propose many new potential candidate drugs or reagents based on biophysical properties.
As used herein, a “biological sequence” is a sequence including a protein (such as, for example, a protein of a T-cell receptor or an antibody), or a nucleic acid.
As used herein, a “protein” is a biological molecule consisting of one or more chains of amino acids. Proteins differ from one another primarily in their sequence of amino acids, which is dictated by the nucleotide sequence of the encoding gene. A peptide is a single linear polymer chain of two or more amino acids bonded together by peptide bonds between the carboxyl and amino groups of adjacent amino acid residues; multiple peptides in a chain can be referred to as a polypeptide. Proteins can be made of one or more polypeptides. Shortly after or even during synthesis, the residues in a protein are often chemically modified by posttranslational modification, which alters the physical and chemical properties, folding, stability, activity, and ultimately, the function of the proteins. Sometimes proteins have non-peptide groups attached, which can be called prosthetic groups or cofactors. It will be appreciated, in addition, that a biological sequence can include non-natural bases and residues, for example, non-natural amino acids inserted into a biological sequence.
As used herein, “nucleic acid” refers to a macromolecule composed of chains (a polymer or an oligomer) of monomeric nucleotides. The most common nucleic acids are deoxyribonucleic acid (DNA) and ribonucleic acid (RNA). It should be further understood that the present invention can be used for biological sequences containing artificial nucleic acids such as peptide nucleic acid (PNA), morpholino, locked nucleic acid (LNA), glycol nucleic acid (GNA) and threose nucleic acid (TNA), among others. In various embodiments of the present invention, nucleic acids can be derived from a variety of sources such as bacteria, virus, humans, and animals, as well as sources such as plants and fungi, among others. The source can be a pathogen. Alternatively, the source can be a synthetic organism. Nucleic acids can be genomic, extrachromosomal or synthetic. Where the term “DNA” is used herein, one of ordinary skill in the art will appreciate that the methods and devices described herein can be applied to other nucleic acids, for example, RNA or those mentioned above. In addition, the terms “nucleic acid,” “polynucleotide,” and “oligonucleotide” are used herein to include a polymeric form of nucleotides of any length, including, but not limited to, ribonucleotides or deoxyribonucleotides. There is no intended distinction in length between these terms. Further, these terms refer only to the primary structure of the molecule. Thus, in certain embodiments these terms can include triple-, double- and single-stranded DNA, PNA, as well as triple-, double- and single-stranded RNA. They also include modifications, such as by methylation and/or by capping, and unmodified forms of the polynucleotide.
More particularly, the terms “nucleic acid,” “polynucleotide,” and “oligonucleotide,” include polydeoxyribonucleotides (containing 2-deoxy-D-ribose), polyribonucleotides (containing D-ribose), any other type of polynucleotide which is an N- or C-glycoside of a purine or pyrimidine base, and other polymers containing nonnucleotidic backbones, for example, polyamide (e.g., peptide nucleic acids (PNAs)) and polymorpholino (commercially available from Anti-Virals, Inc., Corvallis, Oreg., U.S.A., as Neugene) polymers, and other synthetic sequence-specific nucleic acid polymers providing that the polymers contain nucleobases in a configuration which allows for base pairing and base stacking, such as is found in DNA and RNA.
In an embodiment according to the invention, processes described as being implemented by one processor may be implemented by component processors, and/or a cluster of processors, configured to perform the described processes, which may be performed in parallel synchronously or asynchronously. Such component processors may be implemented on a single machine, on multiple different machines, in a distributed fashion in a network, or as program module components implemented on any of the foregoing.
In one embodiment, the processor routines 92 and data 94 are a computer program product (generally referenced 92), including a non-transitory computer-readable medium (e.g., a removable storage medium such as one or more DVD-ROM's, CD-ROM's, diskettes, tapes, etc.) that provides at least a portion of the software instructions for the invention system. The computer program product 92 can be installed by any suitable software installation procedure, as is well known in the art. In another embodiment, at least a portion of the software instructions may also be downloaded over a cable communication and/or wireless connection 107. In other embodiments, the invention programs are a computer program propagated signal product embodied on a propagated signal on a propagation medium (e.g., a radio wave, an infrared wave, a laser wave, a sound wave, or an electrical wave propagated over a global network such as the Internet, or other network(s)). Such carrier medium or signals may be employed to provide at least a portion of the software instructions for the present invention routines/program 92.
In alternative embodiments, the propagated signal is an analog carrier wave or digital signal carried on the propagated medium. For example, the propagated signal may be a digitized signal propagated over a global network (e.g., the Internet), a telecommunications network, or other network. In one embodiment, the propagated signal is a signal that is transmitted over the propagation medium over a period of time, such as the instructions for a software application sent in packets over a network over a period of milliseconds, seconds, minutes, or longer.
In other embodiments, the software instructions 92 and data 94 are provided on a cloud platform, as SaaS (Software as a Service), and the like.
In investigations conducted in accordance with an embodiment of the invention, directed to cytomegalovirus (CMV), for which data is shown in
Principal-component dimensionality reduction was performed on a set of 26 standard physical and chemical measures of each of the 20 amino acids (Sandberg, 1998) to obtain five composite measures that together explained 92% of the overall variance; the top three of these corresponded roughly to side-chain surface area, size, and charge. Expectation values for each composite measure were calculated across productively recombined CDR3s of a given input repertoire. In addition, expectation values for the product for each nearest, next-nearest, third-nearest, and cross-loop pair and for each amino-acid triple, as well as for the CDR3 length distribution, were also calculated. In some experiments additional expectation values (e.g. fourth-nearest neighbor pairs, consecutive quadruples of amino acids, properties of CDRs 1 and 2, stems vs. loops) were also used.
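By way of non-limiting illustration, the calculation of expectation values for single-position properties and for products over neighboring pairs can be sketched as follows. The property table `TOY_PROPS` holds made-up two-component values for four residues (not the composite measures described above), and pooling over all positions of all sequences is an assumption of this sketch; the function name `expectation_values` and its `offset` parameter are likewise hypothetical.

```python
from itertools import product

# Hypothetical two-component property vectors (illustrative numbers only,
# NOT the composite measures derived in the text).
TOY_PROPS = {
    "C": (0.2, -0.5),
    "A": (-0.8, 0.1),
    "S": (-0.3, 0.6),
    "F": (1.1, -0.2),
}

def expectation_values(cdr3s, props, offset=1):
    """Mean property values and mean pairwise property products.

    offset=1 pairs nearest neighbors, offset=2 next-nearest, and so on.
    Averages are pooled over all positions of all sequences (an assumption).
    """
    n_comp = len(next(iter(props.values())))
    single_sums = [0.0] * n_comp
    pair_sums = {ij: 0.0 for ij in product(range(n_comp), repeat=2)}
    n_sites = n_pairs = 0
    for seq in cdr3s:
        vecs = [props[aa] for aa in seq]
        for vec in vecs:
            for i in range(n_comp):
                single_sums[i] += vec[i]
        n_sites += len(vecs)
        # Pair each position with the position `offset` residues downstream.
        for left, right in zip(vecs, vecs[offset:]):
            for i, j in pair_sums:
                pair_sums[(i, j)] += left[i] * right[j]
            n_pairs += 1
    single = [s / max(n_sites, 1) for s in single_sums]
    pairs = {ij: s / max(n_pairs, 1) for ij, s in pair_sums.items()}
    return single, pairs

single, pairs = expectation_values(["CASSF", "CAF"], TOY_PROPS)
print(len(single), len(pairs))
```

With five composite measures in place of the two toy components, the same calculation would yield 5 single-position features and 5×5=25 pair features per neighbor offset.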
For each repertoire, a maximum-entropy model was trained on these features using a Metropolis-Hastings Markov-chain Monte-Carlo (MHMCMC) approach, testing Damerau-Levenshtein distances to confirm no autocorrelation in the sampling chains (R² > 0.999 for observed vs. uncorrelated distance distributions), and with a stopping condition of the size of the sample not exceeding that of the training set to avoid overfitting. Validity was tested statistically on both toy and real repertoires by confirming decreasing root-mean-squared distance between expectation values of the training set and MHMCMC samples, and functionally by confirming significant overlap in sequence identity/similarity between the training set and samples. Robustness was confirmed statistically by comparing all biases of repeat fits (R² > 0.999) and confirming no outliers, and empirically by confirming that the final samples from repeat fits contained similar sequences.
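By way of non-limiting illustration, a heavily simplified maximum-entropy fit using Metropolis-Hastings sampling can be sketched as follows. The sketch assumes a single feature (the mean of a made-up one-dimensional property), fixed-length sequences, a symmetric single-residue proposal, and a simple moment-matching update of the bias; none of these choices, and none of the names `PROP`, `mh_sample`, or `fit_bias`, are taken from the embodiments described above, which use many features and include the CDR3 length distribution.

```python
import math
import random

# Hypothetical one-dimensional property per residue (illustrative values only).
PROP = {"A": -1.0, "S": 0.0, "F": 1.0}
ALPHABET = sorted(PROP)

def feature(seq):
    """Single feature: mean property value over the sequence."""
    return sum(PROP[aa] for aa in seq) / len(seq)

def mh_sample(bias, length, n_samples, rng, burn_in=200, thin=10):
    """Draw sequences with P(seq) proportional to exp(bias * feature(seq))."""
    seq = [rng.choice(ALPHABET) for _ in range(length)]
    samples = []
    for step in range(burn_in + n_samples * thin):
        proposal = seq.copy()
        # Symmetric proposal: replace one randomly chosen residue.
        proposal[rng.randrange(length)] = rng.choice(ALPHABET)
        log_ratio = bias * (feature(proposal) - feature(seq))
        # Metropolis-Hastings acceptance rule for a symmetric proposal.
        if log_ratio >= 0 or rng.random() < math.exp(log_ratio):
            seq = proposal
        # Keep every `thin`-th state after burn-in to reduce autocorrelation.
        if step >= burn_in and (step - burn_in) % thin == thin - 1:
            samples.append("".join(seq))
    return samples

def fit_bias(training, n_iter=60, lr=2.0, seed=0):
    """Adjust the bias until sampled and training expectation values agree."""
    rng = random.Random(seed)
    target = sum(feature(s) for s in training) / len(training)
    length = len(training[0])  # assumes all training sequences share a length
    bias = 0.0
    for _ in range(n_iter):
        # The sample size is capped at the training-set size, echoing the
        # stopping condition described in the text.
        samples = mh_sample(bias, length, len(training), rng)
        model_mean = sum(feature(s) for s in samples) / len(samples)
        bias += lr * (target - model_mean)
    return bias

training = ["FFSA", "FSFF", "SFFA", "FFFS"] * 5
print(fit_bias(training))
```

In this sketch a positive fitted bias indicates that the training sequences are enriched for residues with high values of the toy property; because each update relies on a small stochastic sample, the fitted value fluctuates from run to run unless the seed is fixed.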
Each repertoire from an exposed individual was thought of as consisting of a disease-specific signal superimposed on background processes, with the majority of variation in model biases most likely due to background processes. An MHMCMC search was performed to find sets of biases that classified repertoires by disease status with high accuracy, using 10-fold cross-validation at each step to decrease the risk of overfitting and repeating this search on hundreds of randomly relabeled datasets to reject the null hypothesis that the resulting accuracy of such a classifier could be achieved by chance. Robustness was confirmed by repeat searches finding the same set. To measure the expected accuracy on unseen data, exhaustive leave-one-out testing was performed in which a classifier was trained on all but one model and tested on the holdout, with the accuracy measured over all tests (similar leave-n-out/explicit validation-set testing can be applied when datasets are sufficiently large).
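By way of non-limiting illustration, leave-one-out testing and a random-relabeling test of the kind described above can be sketched as follows. A nearest-centroid classifier is substituted here for the MHMCMC search over sets of biases, purely to keep the sketch short; the bias vectors, labels, and function names are hypothetical.

```python
import random

def nearest_centroid_predict(train_x, train_y, x):
    """Predict the label whose mean training vector is closest to x."""
    centroids = {}
    for label in set(train_y):
        rows = [v for v, y in zip(train_x, train_y) if y == label]
        centroids[label] = [sum(col) / len(rows) for col in zip(*rows)]
    def dist2(a, b):
        return sum((p - q) ** 2 for p, q in zip(a, b))
    # Sorting the labels makes tie-breaking deterministic.
    return min(sorted(centroids), key=lambda lab: dist2(centroids[lab], x))

def leave_one_out_accuracy(xs, ys):
    """Train on all but one model, test on the holdout, and repeat for each."""
    correct = 0
    for i in range(len(xs)):
        train_x = xs[:i] + xs[i + 1:]
        train_y = ys[:i] + ys[i + 1:]
        correct += nearest_centroid_predict(train_x, train_y, xs[i]) == ys[i]
    return correct / len(xs)

def permutation_p_value(xs, ys, n_perm=200, seed=0):
    """Fraction of random relabelings that match or beat the observed accuracy."""
    rng = random.Random(seed)
    observed = leave_one_out_accuracy(xs, ys)
    hits = 0
    for _ in range(n_perm):
        shuffled = ys[:]
        rng.shuffle(shuffled)
        hits += leave_one_out_accuracy(xs, shuffled) >= observed
    # Add-one correction so the estimate is never exactly zero.
    return (hits + 1) / (n_perm + 1)

# Hypothetical bias vectors for six repertoire models and their disease status.
xs = [[0.0, 0.1], [0.1, 0.0], [0.2, 0.1], [1.0, 0.9], [0.9, 1.0], [1.1, 1.0]]
ys = ["neg", "neg", "neg", "pos", "pos", "pos"]
print(leave_one_out_accuracy(xs, ys))
print(permutation_p_value(xs, ys))
```

A small p-value from the relabeling test would support rejecting the null hypothesis that the observed accuracy could be achieved by chance; with only six models, as in this toy example, the test has little power.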
The teachings of all patents, published applications and references cited herein are incorporated by reference in their entirety.
While example embodiments have been particularly shown and described, it will be understood by those skilled in the art that various changes in form and details may be made therein without departing from the scope of the embodiments encompassed by the appended claims.
This application claims the benefit of U.S. Provisional Application No. 62/693,252, filed on Jul. 2, 2018. The entire teachings of the above application are incorporated herein by reference.
Filing Document | Filing Date | Country | Kind |
---|---|---|---|
PCT/US2019/038660 | 6/24/2019 | WO | 00 |

Number | Date | Country |
---|---|---|
62693252 | Jul 2018 | US |