The following are incorporated by reference for all purposes as if fully set forth herein, and should be considered part of, this patent filing:
U.S. patent application Ser. No. 17/947,053, titled, “INTER-MODEL PREDICTION SCORE RECALIBRATION,” filed on Sep. 16, 2022 (Attorney Docket No. ILLM 1067-2/IP-2301-US1);
Sundaram, L. et al. Predicting the clinical impact of human mutation with deep neural networks. Nat. Genet. 50, 1161-1170 (2018) (referred to herein as “PrimateAI”);
Jaganathan, K. et al. Predicting splicing from primary sequence with deep learning. Cell 176, 535-548 (2019) (referred to herein as “SpliceAI”);
U.S. Patent Application No. 62/573,144, titled “TRAINING A DEEP PATHOGENICITY CLASSIFIER USING LARGE-SCALE BENIGN TRAINING DATA,” filed Oct. 16, 2017 (Attorney Docket No. ILLM 1000-1/IP-1611-PRV) (referred to herein as “PrimateAI”);
U.S. Patent Application No. 62/573,149, titled “PATHOGENICITY CLASSIFIER BASED ON DEEP CONVOLUTIONAL NEURAL NETWORKS (CNNs),” filed Oct. 16, 2017 (Attorney Docket No. ILLM 1000-2/IP-1612-PRV) (referred to herein as “PrimateAI”);
U.S. Patent Application No. 62/573,153, titled “DEEP SEMI-SUPERVISED LEARNING THAT GENERATES LARGE-SCALE PATHOGENIC TRAINING DATA,” filed Oct. 16, 2017 (Attorney Docket No. ILLM 1000-3/IP-1613-PRV) (referred to herein as “PrimateAI”);
U.S. Patent Application No. 62/582,898, titled “PATHOGENICITY CLASSIFICATION OF GENOMIC DATA USING DEEP CONVOLUTIONAL NEURAL NETWORKS (CNNs),” filed Nov. 7, 2017 (Attorney Docket No. ILLM 1000-4/IP-1618-PRV) (referred to herein as “PrimateAI”);
U.S. patent application Ser. No. 16/160,903, titled “DEEP LEARNING-BASED TECHNIQUES FOR TRAINING DEEP CONVOLUTIONAL NEURAL NETWORKS,” (referred to herein as “PrimateAI”);
U.S. patent application Ser. No. 16/160,986, titled “DEEP CONVOLUTIONAL NEURAL NETWORKS FOR VARIANT CLASSIFICATION,” filed on Oct. 15, 2018 (Attorney Docket No. ILLM 1000-6/IP-1612-US) (referred to herein as “PrimateAI”);
U.S. patent application Ser. No. 16/160,968, titled “SEMI-SUPERVISED LEARNING FOR TRAINING AN ENSEMBLE OF DEEP CONVOLUTIONAL NEURAL NETWORKS,” filed on Oct. 15, 2018 (Attorney Docket No. ILLM 1000-7/IP-1613-US) (referred to herein as “PrimateAI”);
U.S. patent application Ser. No. 16/160,978, titled “DEEP LEARNING-BASED SPLICE SITE CLASSIFICATION,” filed on Oct. 15, 2018 (Attorney Docket No. ILLM 1001-4/IP-1610-US) (referred to herein as “SpliceAI”);
U.S. patent application Ser. No. 16/407,149, titled “DEEP LEARNING-BASED TECHNIQUES FOR PRE-TRAINING DEEP CONVOLUTIONAL NEURAL NETWORKS,” filed May 8, 2019 (Attorney Docket No. ILLM 1010-1/IP-1734-US) (referred to herein as “PrimateAI”);
U.S. patent application Ser. No. 17/232,056, titled “DEEP CONVOLUTIONAL NEURAL NETWORKS TO PREDICT VARIANT PATHOGENICITY USING THREE-DIMENSIONAL (3D) PROTEIN STRUCTURES,” filed on Apr. 15, 2021, (Atty. Docket No. ILLM 1037-2/IP-2051-US) (referred to herein as “PrimateAI 3D”);
U.S. Patent Application No. 63/175,495, titled “MULTI-CHANNEL PROTEIN VOXELIZATION TO PREDICT VARIANT PATHOGENICITY USING DEEP CONVOLUTIONAL NEURAL NETWORKS,” filed on Apr. 15, 2021, (Atty. Docket No. ILLM 1047-1/IP-2142-PRV) (referred to herein as “PrimateAI 3D”);
U.S. Patent Application No. 63/175,767, titled “EFFICIENT VOXELIZATION FOR DEEP LEARNING,” filed on Apr. 16, 2021, (Atty. Docket No. ILLM 1048-1/IP-2143-PRV) (referred to herein as “PrimateAI 3D”);
U.S. patent application Ser. No. 17/468,411, titled “ARTIFICIAL INTELLIGENCE-BASED ANALYSIS OF PROTEIN THREE-DIMENSIONAL (3D) STRUCTURES,” filed on Sep. 7, 2021, (Atty. Docket No. ILLM 10373/IP-2051A-US) (referred to herein as “PrimateAI 3D”);
U.S. Provisional Patent Application No.: 63/253,122, titled “PROTEIN STRUCTURE-BASED PROTEIN LANGUAGE MODELS,” filed Oct. 6, 2021 (Attorney Docket No. ILLM 1050-1/IP-2164-PRV) (referred to herein as “JigsawAI”);
U.S. Provisional Patent Application No.: 63/281,579, titled “PREDICTING VARIANT PATHOGENICITY FROM EVOLUTIONARY CONSERVATION USING THREE-DIMENSIONAL (3D) PROTEIN STRUCTURE VOXELS,” filed Nov. 19, 2021 (Attorney Docket No. ILLM 1060-1/IP-2270-PRV) (referred to herein as “JigsawAI”); and
U.S. Provisional Patent Application No.: 63/281,592, titled “COMBINED AND TRANSFER LEARNING OF A VARIANT PATHOGENICITY PREDICTOR USING GAPED AND NON-GAPED PROTEIN SAMPLES,” filed Nov. 19, 2021 (Attorney Docket No. ILLM 1061-1/IP-2271-PRV) (referred to herein as “JigsawAI”).
The technology disclosed relates to artificial intelligence type computers and digital data processing systems and corresponding data processing methods and products for emulation of intelligence (i.e., knowledge-based systems, reasoning systems, and knowledge acquisition systems); and including systems for reasoning with uncertainty (e.g., fuzzy logic systems), adaptive systems, machine learning systems, and artificial neural networks. In particular, the technology disclosed relates to using deep neural networks to analyze ordered data.
Protein analysis often presents a so-called small data problem, in contrast to natural language processing. The size of the Internet, including digitized libraries, presents a virtually unlimited data set for training natural language models of all types. In contrast, fewer than 1000 species, at the time of this application, have reliably sequenced full proteomes. Compounding the small data problem, researchers have found that protein-by-protein analysis has advantages over trying to generalize across a whole proteome. In contrast to the linear structure of natural language, which progresses in words and sentences from the top to the bottom of a text, proteins undergo three-dimensional folding of the amino acid sequence, described at tertiary and even quaternary levels of organization.
Protein synthesis begins with genes. Each human has a unique genetic code, though a large portion of the human genetic code is common for all humans. DNA code is transcribed into RNA, which is then translated into a chain of amino acids, followed by three-dimensional folding of the sequence to form the final protein. In some cases, a human genetic code may include an outlier, called a genetic variant, that may be common among individuals of a relatively small group of the human population. The genetic variant may affect the translation of amino acids, which causes an amino acid variant. For example, a particular human protein may comprise a specific sequence of amino acids, whereas a variant of that protein may differ by one amino acid in the otherwise same specific sequence. Genetic variants may be pathogenic, leading to diseases. Though most of such genetic variants have been depleted from genomes by natural selection, an ability to identify which genetic variants are likely to be pathogenic can help researchers focus on these genetic variants to gain an understanding of the corresponding diseases and their diagnostics, treatments, or cures. The clinical interpretation of millions of human genetic variants remains unclear. Some of the most frequent pathogenic variants are single nucleotide missense mutations that change the amino acid of a protein. Consider a protein: its amino acid sequence determines its structure and function. When an amino acid variant replaces its original amino acid in the sequence, it may change the structure (e.g., protein folding), biochemical function, and evolvability of the protein.
It is widely recognized that further developments are needed to encode biological information and to encapsulate the complexity of the sequence space, structural space, and functional relationships of biomolecules and biomolecular systems. See, e.g., Bepler, T., & Berger, B. Learning the protein language: Evolution, structure, and function. Cell Systems, 2019, 12(6), 654-669. In 2020, Facebook's AI Laboratory produced a transformer model, DeepSequence, that it trained on 250 million protein sequences. See, e.g., Rives, A., Meier, J., Sercu, T., Goyal, S., Lin, Z., Liu, J., Guo, D., Ott, M., Zitnick, C.L., Ma, J., Fergus, R. Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences. Proceedings of the National Academy of Sciences of the United States of America, 2021, 118(15). But as Bepler & Berger explained, much work remains to refine and combine such divergent approaches to protein analysis.
For example, an important question in molecular evolution is whether an amino acid that occurs at a given site makes an independent contribution to fitness, or whether its contribution depends on the state of other sites in the organism's genome. The phenotypic effect of a mutation at one genetic site often depends on alleles at other sites, a phenomenon known as epistasis. As a result, any mutation is expected to be contingent on earlier mutations, and the fate of this mutation depends on the evolutionary history of the population. Therefore, epistasis can profoundly influence the process of evolution in populations and shape the patterns of divergence across species. Understanding the nature of epistatic interactions between sites will allow us to address basic questions in biology at the molecular scale, such as how large a role history plays in evolution, whether later events depend critically on specific earlier events, or whether all events occur more or less independently.
The central properties of proteins—folding, biochemical function, and evolvability—arise from a global pattern of cooperative energetic interactions between amino acid residues. When introducing amino acid substitutions in a protein, cooperativity manifests itself as context-dependence of the effects of those mutations, or epistasis. Knowledge of the extent and distribution of epistasis in a protein is essential for understanding its evolution. For example, when a certain functional improvement requires a combination of mutations that are individually unfavorable, no single-mutation trajectory exists that increases fitness at each step, and evolution towards the new functionality will be hampered. Being able to uncover epistasis is relevant for the reconstruction of phylogenetic trees and for estimating the evolutionary potential of antibiotic resistance genes and viruses, but also for protein engineering efforts that make use of directed evolution: information on epistatic architectures should prove useful in the selection of evolvable templates, in focusing mutations to highly epistatic regions of a protein, or in identifying cooperative units for DNA shuffling experiments. Epistasis can be co-evolutionary, where epistatic interactions lead to co-evolution among the loci.
Models that can predict molecular phenotypes directly from biological sequences can be used as in silico perturbation tools to probe the associations between genetic variation and phenotypic variation and have emerged as new methods for quantitative trait loci identification and variant prioritization. These approaches are of major importance given that the majority of variants identified by genome-wide association studies of complex phenotypes are non-coding, which makes it challenging to estimate their effects and contribution to phenotypes. Moreover, linkage disequilibrium results in blocks of variants being co-inherited, which creates difficulties in pinpointing individual causal variants. Thus, sequence-based deep learning models that can be used as interrogation tools for assessing the impact of such variants offer a promising approach to finding potential drivers of complex phenotypes. More importantly, an opportunity arises to use deep learning models, not only for predicting variant pathogenicity for a single variant introduced at a given location of the protein sequence, but also for considering the context-dependence of the effects of the variant, in other words, epistasis.
The patent or application file contains at least one drawing executed in color. Copies of this patent or patent application publication with color drawing(s) will be provided by the Office upon request and payment of the necessary fee.
The color drawings also may be available in PAIR via the Supplemental Content tab. In the drawings, like reference characters generally refer to like parts throughout the different views. Also, the drawings are not necessarily to scale, with an emphasis instead generally being placed upon illustrating the principles of the technology disclosed. In the following description, various implementations of the technology disclosed are described with reference to the following drawings, in which:
The following discussion is presented to enable any person skilled in the art to make and use the technology disclosed and is provided in the context of a particular application and its requirements. Various modifications to the disclosed implementations will be readily apparent to those skilled in the art, and the general principles defined herein may be applied to other implementations and applications without departing from the spirit and scope of the technology disclosed. Thus, the technology disclosed is not intended to be limited to the implementations shown but is to be accorded the widest scope consistent with the principles and features disclosed herein.
The detailed description of various implementations will be better understood when read in conjunction with the appended drawings. To the extent that the figures illustrate diagrams of the functional blocks of the various implementations, the functional blocks are not necessarily indicative of the division between hardware circuitry. Thus, for example, one or more of the functional blocks (e.g., modules, processors, or memories) may be implemented in a single piece of hardware (e.g., a general-purpose signal processor or a block of random-access memory, hard disk, or the like) or multiple pieces of hardware. Similarly, the programs may be stand-alone programs, may be incorporated as subroutines in an operating system, may function in an installed software package, and the like. It should be understood that the various implementations are not limited to the arrangements and instrumentality shown in the drawings.
The processing engines and databases of the figures, designated as modules, can be implemented in hardware or software, and need not be divided up in precisely the same blocks as shown in the figures. Some of the modules can also be implemented on different processors, computers, or servers, or spread among a number of different processors, computers, or servers. In addition, it will be appreciated that some of the modules can be combined, operated in parallel or in a different sequence than that shown in the figures without affecting the functions achieved. The modules in the figures can also be thought of as flowchart steps in a method. A module also need not necessarily have all its code disposed contiguously in memory; some parts of the code can be separated from other parts of the code with code from other modules or other functions disposed in between.
This application uses the terms “recalibration”, “reassignment” and “re-ranking” interchangeably.
Logic as used herein can be rule-based and implemented in the form of a computer product including a non-transitory computer readable storage medium with computer usable program code for performing the method steps described herein. The “logic” can be implemented in the form of an apparatus including a memory and at least one processor that is coupled to the memory and operative to perform exemplary method steps. The rule-based reassignment and rescaling logics can be implemented in the form of means for carrying out one or more of the method steps described herein; the means can include (i) hardware module(s), (ii) software module(s) executing on one or more hardware processors, or (iii) a combination of hardware and software modules; any of (i)-(iii) implement the specific techniques set forth herein, and the software modules are stored in a computer readable storage medium (or multiple such media). In one implementation, the logic implements a data processing function. The logic can be a general purpose, single core or multicore, processor with a computer program specifying the function, a digital signal processor with a computer program, configurable logic such as an FPGA with a configuration file, a special purpose circuit such as a state machine, or any combination of these. Also, a computer program product can embody the computer program and configuration file portions of the logic.
Genetic variants can be pathogenic, leading to diseases. Though most of such genetic variants have been depleted from genomes by natural selection, an ability to identify which genetic variants are likely to be pathogenic can help researchers focus on these genetic variants to gain an understanding of the corresponding diseases and their diagnostics, treatments, or cures. The clinical interpretation of millions of human genetic variants remains unclear. Some of the most frequent pathogenic variants are single nucleotide missense mutations that change the amino acid of a protein.
The technologies disclosed can be used to improve the quality of pathogenic variant identification. Our models output probabilities for protein missense variants. They can be applied to score all possible variants at every position in most human proteins, achieving high performance on our pathogenicity prediction benchmarks. Our models are applicable to proteins from any species, and so are useful for protein engineering. In addition, intermediate model features can be used for protein structure prediction. The model performance smoothly increases with model size and compute, so other researchers will benefit from transfer learning from a large model trained with more compute.
In one implementation, the first input 102 includes sequence information, structural information, and evolutionary profiles of a reference target protein sequence. Sequence information may include amino acid sequences of the reference target protein sequence and non-target protein sequences aligned with the reference target protein sequence in a multiple sequence alignment. Structural information may represent the three-dimensional (3D) structure of the reference target protein sequence due to the interactions between amino acids that make up the protein. Structural information may also represent locally folded structures that form within the protein due to interactions between atoms of the backbone chain (e.g., α-helix, β-sheet). When the reference target protein sequence includes multiple polypeptide chains known as subunits, structural information may also represent the quaternary structure of the protein, i.e., the association of subunits into a closely packed arrangement. Evolutionary profiles may include evolutionary conservation summary statistics of amino acids in the target protein sequence. Evolutionary conservation refers to the presence of similar genes, portions of genes, or chromosome segments in different species, reflecting both the common origin of species and important functional properties of the conserved elements. The first input 102 may also include an alternative protein sequence with one or more amino acid variants substituting the original amino acids at certain locations of the reference target protein sequence. Different types of the first input 102 will be described in more detail in
The first model 104 can be a site-independent model that factorizes single-position variations in a plurality of aligned sequences. In other words, the first model 104 predicts the pathogenicity score of a single variant located in a particular location of the protein sequence without considering the context-dependence of the effects of the variant. In another implementation, the first model 104 can be a pairwise-interaction model that factorizes two-position variations in a plurality of aligned sequences. The incorporation of inter-site dependencies in the pairwise interaction model can lead to a more accurate prediction of the effects of the variant.
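The site-independent factorization above can be illustrated with a minimal sketch (the MSA, function name, and pseudocount are illustrative assumptions, not part of the disclosure): each aligned column is scored on its own amino acid frequencies, so no inter-site dependence is captured.

```python
import math

# Hypothetical toy MSA: each string is one aligned homologous sequence.
msa = [
    "MKTA",
    "MKSA",
    "MRTA",
    "MKTA",
    "MKTG",
]

def site_independent_score(msa, position, variant_aa, pseudocount=1.0):
    """Score a variant from per-column amino acid frequencies only,
    ignoring all other positions (i.e., no epistasis)."""
    column = [seq[position] for seq in msa]
    count = column.count(variant_aa)
    # Laplace-smoothed frequency over the 20 standard amino acids.
    freq = (count + pseudocount) / (len(column) + 20 * pseudocount)
    return math.log(freq)  # log-likelihood of observing the variant

# A never-observed substitution scores lower than a frequently observed one.
common = site_independent_score(msa, 1, "K")  # 'K' appears in 4 of 5 sequences
rare = site_independent_score(msa, 1, "W")    # 'W' never appears at this column
assert rare < common
```

A pairwise-interaction model would extend this by scoring joint frequencies of amino acid pairs at two columns, capturing the first order of inter-site dependence.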
In one implementation, the first set of pathogenicity scores 126 generated from the first model 104 has a score scale, for example, ranging from 0 to 1, to differentiate benign variants from pathogenic ones. That is, a lower score indicates a benign variant mapped to a higher ranking while a higher score indicates a pathogenic variant mapped to a lower ranking.
The second model 114, based on the second input 112, generates a second pathogenicity score-to-rank mapping 116 that includes a second set of pathogenicity scores 146 mapped to a second set of pathogenicity score rankings 156.
In one implementation, the second input 112 includes sequence information of a target protein sequence from a target species and a plurality of non-target protein sequences aligned with the target protein sequence in a multiple sequence alignment (MSA). MSA is a sequence alignment of multiple homologous protein sequences to a target protein. It is an important step in comparative analyses and property prediction of biological sequences, since a great deal of information, for example, evolution and coevolution clusters, is generated from the MSA and can be mapped to the target sequence of choice or onto the protein structure. Compared to pairwise alignments, MSA provides more information since it shows conserved regions within a protein family which are of structural and functional importance. MSA used as the second input 112 can be a collection of tens to thousands of aligned protein sequences spanning from a target species (e.g., human) to a variety of homologous non-target species (e.g., primates, mammals, vertebrates). The second model 114 can distinguish between proteins that belong to the same protein family defined by MSA and those that do not. Proteins with pathogenic variants are predicted as outside of the protein family and those with benign variants as inside of the family.
In one implementation, the second set of pathogenicity scores 146 generated from the second model 114 also has a score scale, for example, ranging from a maximum real number represented digitally to a minimum real number represented digitally, to differentiate pathogenic variants from benign variants. That is, each score may range anywhere from −∞ to +∞, where a lower score indicates a pathogenic variant mapped to a lower ranking while a higher score indicates a benign variant mapped to a higher ranking. In yet other implementations, a lower score may indicate a benign variant mapped to a higher ranking and a higher score indicates a pathogenic variant mapped to a lower ranking. The second set of pathogenicity scores can also be re-scaled to a preset scale, in order to compare with the first set of pathogenicity scores.
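The re-scaling step can be sketched as a simple min-max normalization onto a preset [0, 1] scale (the function name and sample scores are illustrative; note that because the two models' score orientations may differ, a comparison could additionally invert the scale, e.g., 1 − s):

```python
def rescale_to_unit_interval(scores):
    """Min-max rescale raw scores (from any real range) onto [0, 1]."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.5 for _ in scores]  # degenerate case: all scores equal
    return [(s - lo) / (hi - lo) for s in scores]

raw = [-12.7, 0.35, 0.6, 8.1]   # hypothetical unbounded second-model scores
scaled = rescale_to_unit_interval(raw)
assert scaled[0] == 0.0 and scaled[-1] == 1.0
assert all(0.0 <= s <= 1.0 for s in scaled)
```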
A person skilled in the art will appreciate that the present disclosure does not intend to limit the scales of the pathogenicity scores generated from the two models, or the correlation between the numerical values of pathogenicity scores and corresponding score rankings.
A person skilled in the art will also appreciate that the present disclosure does not intend to limit the types of inputs that are fed into the two models or the architectures of the models. The first model 104 and the second model 114 may have different architectures while the first input 102 and the second input 112 are substantially similar. Alternatively, the first model 104 and the second model 114 may have substantially similar architectures while the first input 102 and the second input 112 are different. In one implementation, the first input 102 used to feed the first model 104 may be distinct from the second input 112 used for the second model 114, and because of this input distinction, for the same set of variants in a target protein sequence, the two models may generate different outputs of pathogenicity scores and corresponding rankings. In another implementation, the first model 104 may have a different architecture from the second model 114, and because of this distinction, for the same set of variants in a target protein sequence, the two models may generate different outputs of pathogenicity scores and corresponding rankings.
As further illustrated in
In another implementation, the score recalibration logic 108 may re-assign pathogenicity scores to the set of variants based on the first score-to-rank mapping and the second score-to-rank mapping. The score recalibration logic 108 combines the originally assigned pathogenicity score and the reassigned score to generate a combined pathogenicity score. The combined pathogenicity score can be an average, a weighted average, or a weighted sum of the originally assigned pathogenicity score and the reassigned score. Still using the aforementioned variants A and B as an example, instead of directly reassigning the score of 0.3 to A, the score recalibration logic 108 may combine variant A's original score of 0.2 and its reassigned score of 0.3, and generate a combined pathogenicity score of 0.25 as an average of the two scores. In another implementation, the combined pathogenicity score (Scombined) is a weighted average of the original score (Soriginal) and the reassigned score (Sreassigned). That is, Scombined=w1×Soriginal+w2×Sreassigned, where w1+w2=1. The weights used to generate the weighted combination can be preset (e.g., user-defined).
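The weighted combination Scombined=w1×Soriginal+w2×Sreassigned can be sketched directly (the function name is ours; the scores 0.2 and 0.3 are the variant A values from the example above):

```python
def combined_score(s_original, s_reassigned, w1=0.5, w2=0.5):
    """Weighted average of the original and reassigned pathogenicity scores."""
    assert abs(w1 + w2 - 1.0) < 1e-9  # weights must sum to one
    return w1 * s_original + w2 * s_reassigned

# Variant A from the example: original score 0.2, reassigned score 0.3.
assert abs(combined_score(0.2, 0.3) - 0.25) < 1e-9            # plain average
assert abs(combined_score(0.2, 0.3, 0.8, 0.2) - 0.22) < 1e-9  # weighted average
```

With equal weights the combination reduces to the plain average of 0.25 described in the text; user-defined weights shift the result toward whichever model is trusted more.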
Alternatively, the weights can be differentiable. In one implementation, the first model is a deep neural network that utilizes multiple nonlinear and complex transforming layers to successively model high-level features. A deep neural network learns to map inputs to outputs given a training dataset of examples; training progresses step by step with small updates to the model weights and, in turn, a change in the performance of the model during each iteration. This iterative training process solves an optimization problem that finds model weights that result in a minimum error or loss when evaluating the examples in the training dataset. The minimization of loss requires a loss function to calculate the error by comparing the difference between observed and predicted output during each iterative training cycle and, accordingly, adjusting model weights via backpropagation. The first model provides feedback via backpropagation, which carries the difference between observed and predicted output to adjust parameters using stochastic gradient descent. The weights used for reassigning the pathogenicity scores, as part of the adjustable parameters, can be optimized during backpropagation.
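A minimal sketch of learning a differentiable combination weight by gradient descent (entirely illustrative: the toy data, learning rate, and sigmoid parameterization are our assumptions, not the disclosed training loop). The constraint w1+w2=1 is enforced by parameterizing w1=sigmoid(a) and w2=1−w1, and the squared error against target scores is minimized:

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

# Toy triples: (original score, reassigned score, target combined score).
# Targets were generated with w1 = 0.2, so training should recover ~0.2.
data = [(0.2, 0.3, 0.28), (0.7, 0.5, 0.54), (0.9, 0.6, 0.66)]

a = 0.0      # unconstrained parameter; w1 = sigmoid(a), w2 = 1 - w1
lr = 25.0    # learning rate (hand-tuned for this toy problem)
for _ in range(500):
    w1 = sigmoid(a)
    grad = 0.0
    for s_o, s_r, t in data:
        pred = w1 * s_o + (1 - w1) * s_r
        # Chain rule: d/da of (pred - t)^2, with dw1/da = w1 * (1 - w1).
        grad += 2 * (pred - t) * (s_o - s_r) * w1 * (1 - w1)
    a -= lr * grad / len(data)

w1 = sigmoid(a)
assert abs(w1 - 0.2) < 0.05  # recovers the weight used to make the targets
```

In the disclosed system, the same weight would simply be one more parameter adjusted by backpropagation alongside the network's other weights.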
In one implementation, the score recalibration logic 108 further includes a rank loss determination logic 118, a loss function reconfiguration logic 128, and a training logic 138. As described above, for a set of variants (e.g., A, B, C, D, . . . ), the two models 104 and 114 generate two sets of pathogenicity scores 126 and 146 mapped to corresponding rankings 136 and 156. The rank loss determination logic 118 compares the scores and rankings of two variants and calculates a rank loss during the score re-ranking process using a loss function. For example, variants A and B have first pathogenicity scores of 0.1 and 0.8 with corresponding rankings, where a scale of the first pathogenicity scores ranges from 0 (benign) to 1 (pathogenic). The order of the pathogenicity score ranking of variant A is higher than the order of the ranking of variant B, indicating the first model predicts variant A is more benign than variant B. The same variants have second pathogenicity scores of 0.35 and 0.6, respectively, where the order of the ranking of variant A is lower than that of variant B, indicating the second model predicts variant A is more pathogenic compared to variant B. The rank loss determination logic 118 calculates an absolute difference between their first pathogenicity scores of 0.1 and 0.8, which is 0.7, applies a loss function, and generates a rank loss.
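One concrete choice for the loss function applied to such a score pair is a pairwise logistic loss (the function name and signature below are illustrative): it is small when the first model's score difference agrees with the second model's ranking of the pair, and large when the two models disagree.

```python
import math

def pairwise_logistic_rank_loss(s_i, s_j, second_ranks_i_more_benign):
    """Pairwise logistic rank loss on two first-model scores (0=benign..1=pathogenic).

    second_ranks_i_more_benign: True if the second model ranks variant i as
    more benign than variant j, so s_i should be the lower first-model score."""
    diff = (s_j - s_i) if second_ranks_i_more_benign else (s_i - s_j)
    # log(1 + e^{-diff}): near zero when the orderings agree, large otherwise.
    return math.log1p(math.exp(-diff))

# Variants A and B from the example: first-model scores 0.1 and 0.8.
# If the second model ranks A as MORE pathogenic than B, the first model's
# ordering disagrees and the pair incurs a larger rank loss.
agree = pairwise_logistic_rank_loss(0.1, 0.8, second_ranks_i_more_benign=True)
disagree = pairwise_logistic_rank_loss(0.1, 0.8, second_ranks_i_more_benign=False)
assert disagree > agree > 0.0
```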
In another implementation, the rank loss determined by the rank loss determination logic 118 is one part of the loss that is taken into consideration during the training process of the first model. This is because the first model uses one or more large training datasets as its input during the iterative training process, and for different sets of training data, a loss can be calculated via a loss function. For example, the first model uses human common variants and non-human primate common variants as training samples, because variations from different non-human primates contribute significantly more unique missense variants that do not overlap common human variations and thus are valuable as a training dataset to improve the prediction accuracy of the first model. Accordingly, a loss obtained by comparing the observed output and predicted output using human and primate common variants will be calculated and used for adjusting the weights of the first model.
Another example of the training dataset is the structural information of proteins, as a missense base variant causes the substitution of an amino acid (i.e., amino acid variant) in the protein sequence, which not only changes the sequence information but may also alter the structure and function of the protein. By including the structural information of proteins as a training dataset for the first model, a loss by comparing the observed output and predicted output will also be calculated and used for adjusting the model weights.
The loss function reconfiguration logic 128 takes into consideration different types of losses and reconfigures the loss function of the first model. As will be described in the following, the different types of losses may include Pairwise Logistic Loss or MSE loss, Pairwise Hinge Loss, Pairwise Soft Zero One Loss, Circle Loss, Softmax Loss, Unique Softmax Loss, Sigmoid Cross Entropy Loss, Mean Square Loss, List MLE Loss, Approx. NDCG Loss, Approx. MRR Loss, Gumbel Approx. NDCG Loss, Neural Sort Cross Entropy Loss, Gumbel Neural Sort Cross Entropy Loss, Neural Sort NDCG Loss, Gumbel Neural Sort NDCG Loss, etc. The training logic 138, based on the reconfigured loss function, trains the first model. For example, the first model may use common non-human primate variants and structural information of the variants as two training datasets and uses a loss function to adjust the model weights during the iterative training process. The loss function reconfiguration logic 128 uses the rank loss determined by the rank loss determination logic 118 to reconfigure the loss function, such that the rank loss is also included in the optimization process of the model.
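The reconfiguration described above can be sketched as adding the rank loss as a weighted term in the model's overall training objective (the function name, example values, and the rank_weight hyperparameter are our illustrative assumptions):

```python
def total_loss(classification_loss, rank_loss, rank_weight=0.1):
    """Reconfigured objective: the first model's original loss plus a
    weighted rank-consistency term from the rank loss determination logic."""
    return classification_loss + rank_weight * rank_loss

base = 0.42   # e.g., loss on human/primate common variants and structural data
rank = 0.7    # rank loss computed from the two models' score-to-rank mappings
assert abs(total_loss(base, rank) - 0.49) < 1e-9
```

Because the rank term enters the objective, gradients from rank disagreements flow back through the first model during training, alongside the gradients from its original training datasets.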
In another implementation, the first model includes one or more re-ranking layers as one part of the neural network architecture, where the parameters of the re-ranking layers can also be optimized along with other parameters within the model architecture during the training process. The re-ranking layers and the training process of the first model will be described in more detail below, in accordance with
Next, we turn to the advantages of the presently disclosed technology of using the pathogenicity scores and rankings generated from one model to re-rank the pathogenicity scores generated from another model and including the re-ranking in the model's training process. As described above, the two models may use different types of input as their training datasets and therefore generate pathogenicity scores with different scales and values, mapped to different rankings. The two models may also have different architectures with focuses and strengths in different aspects. For example, the first model can be a discriminative model that makes predictions on unseen data based on conditional probability and can be used either for classification or regression problem statements. The discriminative model can be a site-independent or pairwise-interaction model that utilizes evolutionary profiles and 3D structural information of protein sequences primarily from humans and primates. Thus, the model excels when predicting the pathogenicity of variants across protein domains and even across proteins, but is not effective in determining high-order epistasis. On the other hand, the second model can be a generative model that focuses on the distribution of a dataset to return a probability for a given example. Here, it can be used to quantify the effects of variants by a ratio of mutant likelihood to non-mutant (i.e., wildtype) likelihood. The generative model can be a non-linear latent variable model that posits hidden variables to jointly detect global patterns and local patterns of sequence variations across windows spanning multiple positions and multiple sequences in the plurality of aligned sequences. Thus, it may provide epistasis-related information and reliable prediction within a local protein domain, but not across domains or across proteins.
In other words, a pathogenicity score in one protein domain may indicate high pathogenicity while the same score in another protein domain may indicate the variant is benign.
The proteome has ten million positions. Accordingly, there are nineteen combinatorically generated amino acid substitutions as variants for each of the ten million positions. Each model can generate a pathogenicity score for each variant located in a particular microenvironmental context. By combining the prediction results from different models and including them in the training process, the optimized neural network is significantly improved in the accuracy of predicting the pathogenicity of variants. More importantly, the technology disclosed improves not only the prediction of variants across protein domains and across proteins but also the prediction of variants in the local mutational neighborhood in the context of high-order epistasis. As an example, the technology disclosed can use the rankings of the variant pathogenicity scores generated by one model to re-rank the scores generated by another model. The rankings generated by one model accurately reflect the effects of the variants in the surrounding context, and the re-ranked scores remain optimal and consistent across different protein domains and different proteins. The trainable re-ranking process, described in more detail below, takes into consideration the context of high-order epistasis while outputting consistent pathogenicity scores. It can effectively avoid undesired situations where the same pathogenicity score indicates high pathogenicity in one protein domain but a benign variant in another.
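As a minimal, non-trainable sketch of the re-ranking idea, the scores of one model can be mapped onto the scale of another by rank (a simple quantile mapping); the trainable re-ranking layers described elsewhere in this disclosure generalize this idea. The `rerank_scores` helper and its example data are hypothetical illustrations:

```python
import numpy as np

def rerank_scores(scores_a, scores_b):
    """Map model B's scores onto model A's scale by rank (quantile mapping).

    The variant ranked k-th by model B receives the score that model A
    assigned to its own k-th ranked variant, so the re-ranked scores
    preserve model B's ordering but lie on model A's consistent scale.
    """
    scores_a = np.asarray(scores_a, dtype=float)
    scores_b = np.asarray(scores_b, dtype=float)
    order_b = np.argsort(scores_b)   # variant indices, ascending by model B
    sorted_a = np.sort(scores_a)     # model A's scores, ascending
    reranked = np.empty_like(sorted_a)
    reranked[order_b] = sorted_a     # k-th ranked variant gets A's k-th score
    return reranked

# Model A scores lie on a consistent scale; model B orders variants differently.
a = [0.1, 0.4, 0.7, 0.9]
b = [0.30, 0.10, 0.80, 0.60]
print(rerank_scores(a, b))           # ordering follows b, values come from a
```

Because the output values are drawn from model A's score distribution, the same score threshold retains a consistent meaning across protein domains.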
Before going into further details, we first explain the role of epistasis in the context of protein evolution, and the intuition as to why the prediction of the pathogenicity of variants in the context of both low-order and high-order epistasis is critical.
Mathematically, epistasis is a transform (Ω) in which phenotypes (y) of combinations of mutations are mapped to epistatic coefficients (ω), i.e., ω = Ωy. For N positions with a single substitution at each position, there are 2^N possible combinations of mutations, and the transform yields epistatic coefficients of all orders, from first-order effects of individual substitutions up to an N-th order term that depends on all N substitutions simultaneously.
Some examples below can help to explain the concept of epistasis. Taking the variant L63M/S168G/A174L/N207K as an arbitrary reference state (yref=0.85, blue fluorescence), the data show that introducing the mutation Y197R results in reduced brightness (y=0.28). The difference between these two values defines a first-order epistasis (ω1=yY197R−yref=−0.57). However, in the background of F143S, the effect of Y197R is entirely different; it shows increased brightness (ω1|F143S=+0.49), with conversion to red fluorescence. This indicates a large second-order epistatic term (ω2=ω1|F143S−ω1=1.06), meaning that the effect of Y197R is context-dependent on F143S. This second-order term is itself dependent on other mutations. For example, in the background of V45A, the second-order epistasis between Y197R and F143S nearly vanishes (ω2|V45A=−0.01), indicating a large third-order epistasis (ω3=ω2|V45A−ω2=−1.07). These findings show that Y197R, F143S, and V45A work as a cooperative unit whose contribution to the phenotype cannot be broken down into a simple, additive contribution of the underlying mutational effects. Instead, the prediction of phenotypes involving these mutations requires knowledge of their individual effects and epistatic interactions at all orders.
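The epistatic terms in this example can be reproduced directly from the brightness values, illustrating that each higher-order coefficient is the difference of the next-lower-order coefficient evaluated in two mutational backgrounds:

```python
def first_order(y_mut, y_ref):
    """First-order epistasis: effect of a single mutation versus a reference."""
    return y_mut - y_ref

# Brightness values from the fluorescent-protein example above.
y_ref = 0.85                      # reference background (blue fluorescence)
w1 = first_order(0.28, y_ref)     # Y197R alone: -0.57
w1_given_F143S = 0.49             # effect of Y197R in the F143S background
w2 = w1_given_F143S - w1          # second-order term: 1.06
w2_given_V45A = -0.01             # same second-order term in the V45A background
w3 = w2_given_V45A - w2           # third-order term: -1.07
print(round(w1, 2), round(w2, 2), round(w3, 2))
```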
Understanding the phenotypic effects of variations is a central challenge for bioengineering and basic biology, especially considering the complexity in the sequence-structure-function relationship in a protein space and the context-dependence of the effects of mutations. Contemporary deep learning models have become powerful tools for exploring the evolutionary, structural and functional organization across protein space. Deep learning models have strengths and weaknesses, based on their input for training and validation, as well as model architectures. The disclosed technology maintains the strengths of different models while avoiding weaknesses of the models when used individually. Therefore, the disclosed technology can lead to improvement in predicting the pathogenicity of variants and prediction of phenotypes, both of which require knowledge of individual effects of the variants as well as high-order epistatic interactions.
A person skilled in the art will appreciate
Next, we turn to the different inputs that can be used to train, validate and test the two models and examples of different architectures of the two models.
As illustrated in
In one implementation, when a given protein sequence is used as an input to the first model, for each amino acid located in a given position of the protein sequence, the output is a vector with 20 floating-point numbers. Each floating-point number corresponds to the pathogenicity of a particular alternative amino acid, as there are 20 types of amino acids.
As illustrated in
The first input 202 may further include three-dimensional (3D) structural information of the reference target protein sequence and the alternative protein sequence 232. Protein sites are microenvironments within a protein structure, distinguished by their structural or functional role. A site can be defined by a 3D location and a local neighborhood around this location in which the structure or function exists. Since it has been established that structure is far more conserved than sequence, the protein structural data provides an opportunity to systematically study the underlying pattern governing the structural-functional relationships using data-driven approaches.
In one implementation, proteins are represented by a collection of atoms and their coordinates in 3D space. An amino acid has a variety of atoms, such as carbon (C) atoms, oxygen (O) atoms, nitrogen (N) atoms, and hydrogen (H) atoms. The atoms can be further classified as side-chain atoms and backbone atoms. The backbone carbon atoms can include alpha-carbon (Cα) atoms and beta-carbon (Cβ) atoms. The 3D atomic coordinates can be attributed to 21 amino acid categories, including 20 categories of amino acids that share a similar basic structure but differ in their side chains, and 1 stop or gap amino acid category. For example, an amino acid-wise classification of carbon atoms can respectively list alpha-carbon atoms or beta-carbon atoms under each of the 21 amino acid categories. In other examples, an amino acid-wise classification can respectively list oxygen atoms, nitrogen atoms, or hydrogen atoms under each of the 21 amino acid categories.
In another implementation, the 3D structural information of the reference target protein sequence and the alternative protein sequence 232 includes voxelized features. For example, a voxel grid is centered at the reference amino acid experiencing a target variant at the amino acid level. In particular, the voxel grid can be centered at an atomic coordinate of a particular atom of the reference amino acid experiencing the target variant. The voxel grid can be centered at the 3D atomic coordinate of one of the alpha-carbon atoms, beta-carbon atoms, oxygen atoms, nitrogen atoms, or hydrogen atoms of the reference amino acid experiencing the target variant. The voxel grid can have any resolution, for example, 3×3×3, 5×5×5, 7×7×7, and so on. Voxels in the voxel grid can be of any size, for example, one angstrom (Å) on each side, two Å on each side, three Å on each side, and so on.
The voxels in the voxel grid can have a plurality of channels (or features). In one implementation, the voxels in the voxel grid have a plurality of distance channels, for example, twenty-one distance channels for the twenty-one amino acid categories, respectively (including a stop or gap amino acid category). The distance channels are independently generated for each of the twenty-one amino acid categories. Consider amino acid Alanine (A) category using a 3×3×3 voxel grid with 27 voxels as an example. An Alanine distance channel includes 27 distance values for the 27 voxels in the voxel grid, respectively. The 27 distance values are measured from respective centers of the 27 voxels in the voxel grid to the respective nearest atoms (e.g., alpha-carbon (Cα) atoms, beta-carbon (Cβ) atoms, oxygen (O) atoms, nitrogen (N) atoms, and hydrogen (H) atoms) in the Alanine amino acid category. The distance values are measured between the 3D coordinates of the voxel centers and the 3D atomic coordinates of the atoms.
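The distance channel computation described above can be sketched as follows, assuming hypothetical voxel-center and atom coordinates; the `distance_channels` helper is an illustration, not the disclosed implementation:

```python
import numpy as np

def distance_channels(voxel_centers, atoms_by_category, n_categories=21):
    """Amino acid category-wise distance channels for a voxel grid.

    voxel_centers: (V, 3) array of voxel-center coordinates.
    atoms_by_category: dict mapping category index -> (M, 3) atom coordinates.
    Returns an (n_categories, V) array: channel c holds, for each voxel, the
    distance from its center to the nearest atom of amino acid category c.
    """
    V = len(voxel_centers)
    channels = np.full((n_categories, V), np.inf)  # inf where no atoms exist
    for cat, atoms in atoms_by_category.items():
        if len(atoms) == 0:
            continue
        # (V, M) pairwise distances between voxel centers and this category's atoms
        d = np.linalg.norm(voxel_centers[:, None, :] - atoms[None, :, :], axis=-1)
        channels[cat] = d.min(axis=1)              # nearest atom per voxel
    return channels

# 3x3x3 grid of 27 voxels, one angstrom per side, centered at the origin.
grid = np.stack(np.meshgrid(*[[-1.0, 0.0, 1.0]] * 3, indexing="ij"), -1).reshape(-1, 3)
alanine_atoms = np.array([[0.0, 0.0, 0.0], [1.5, 0.0, 0.0]])  # toy Alanine atoms
ch = distance_channels(grid, {0: alanine_atoms})
print(ch.shape)   # (21, 27): 21 amino acid categories, 27 voxels
```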
In yet another implementation, the 3D structural information of the reference target protein sequences and the alternative protein sequences 232 includes a "gapped spatial representation" of a protein that excludes at least one gap amino acid in the protein. "Spatial representation" of a protein characterizes structural information about amino acids in the protein, based on the shapes, locations, positions, patterns, and/or arrangement of the amino acids. Amino acid-wise distance channels as discussed above can be used as a spatial representation of a protein. When a particular amino acid in the protein is removed or masked from the protein, this amino acid is called a "gap amino acid", and the resulting protein that lacks the gap amino acid is called a "gapped protein" or a "vacancy-containing protein." Therefore, a "gapped spatial representation" is such a spatial representation of a vacancy-containing protein. In one implementation, when generating a gapped spatial representation, a gap amino acid is "masked" by excluding (or not considering or ignoring) one or more atoms or atom-types of the gap amino acid. The spatial representations of non-gap amino acids are encoded as amino acid class-wise distance channels, where each of the amino acid class-wise distance channels has voxel-wise distance values for voxels. The voxel-wise distance values specify distances from the corresponding voxels in a plurality of voxels to the atoms of non-gap amino acids. The spatial representations of non-gap amino acids are determined based on spatial proximity between the corresponding voxels and the atoms of these non-gap amino acids. However, the spatial representations of the gap amino acid are excluded from the gapped spatial representation by disregarding distances from the corresponding voxels to atoms of the gap amino acid when determining the voxel-wise distance values.
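The masking of the gap amino acid can be sketched as simply excluding its atoms before the nearest-atom distance computation; the coordinates and residue identifiers below are hypothetical:

```python
import numpy as np

def gapped_nearest_distances(voxel_centers, atoms, atom_residue_ids, gap_residue_id):
    """Nearest-atom distances that ignore atoms of the gap amino acid.

    atoms: (M, 3) coordinates; atom_residue_ids: (M,) residue index per atom.
    Atoms belonging to gap_residue_id are masked out, so the returned
    distances describe the vacancy-containing ("gapped") protein.
    """
    keep = atom_residue_ids != gap_residue_id   # boolean mask over atoms
    kept_atoms = atoms[keep]
    d = np.linalg.norm(voxel_centers[:, None, :] - kept_atoms[None, :, :], axis=-1)
    return d.min(axis=1)

# One voxel at the origin; the nearer atom belongs to the (masked) gap residue 7,
# so the reported distance is to the remaining atom of residue 8.
centers = np.zeros((1, 3))
atoms = np.array([[0.0, 0.0, 0.0], [2.0, 0.0, 0.0]])
res_ids = np.array([7, 8])
print(gapped_nearest_distances(centers, atoms, res_ids, gap_residue_id=7))  # [2.]
```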
In one implementation, when a reference gap amino acid at a given position in a reference target protein sequence is excluded, a target alternate amino acid may fill the vacancy at the given position. The pathogenicity classifier 204 can use both the spatial representations of the vacancy-containing protein (i.e., the “gapped spatial representation”) and the target alternate amino acid as input to predict a pathogenicity score for the target alternate amino acid. In another implementation, each of the 20 amino acid categories can be used as a target alternate amino acid that fills the vacancy at the given position in the reference target protein sequence. Thus, the pathogenicity classifier 204 can predict a pathogenicity score for each of the 20 amino acid categories (e.g., by using a 20-way softmax classification).
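The 20-way softmax classification mentioned above can be sketched as follows; the classifier logits are hypothetical stand-ins for the network's pre-activation outputs:

```python
import numpy as np

def softmax(logits):
    """Exponentially normalized scores: one probability per amino acid class."""
    z = np.asarray(logits, dtype=float)
    e = np.exp(z - z.max())          # subtract max for numerical stability
    return e / e.sum()

# Hypothetical logits for the 20 candidate fill-in amino acids at the vacancy.
logits = np.random.default_rng(0).normal(size=20)
probs = softmax(logits)
print(probs.shape, round(float(probs.sum()), 6))   # (20,) 1.0
```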
As illustrated in
In one implementation, the evolutionary profiles of a target protein sequence include position weight matrix (PWM) conservation profiles calculated from a plurality of species. For example, to predict a pathogenicity score of a target variant in a protein sequence from human, three different position weight matrices (PWMs) can be generated from multiple sequence alignments (MSA) of 99 vertebrates for this target variant, including one for 12 primates, one for 47 mammals excluding primates, and one for 40 vertebrates excluding primates and mammals. Each of the PWMs has a dimension of L×20, where L is the length of flanking sequences around the target variant. It comprises counts of amino acids seen in each category of species.
In another implementation, the evolutionary profiles of the target protein sequence include position-specific frequency matrices (PSFMs) 262 and position-specific score matrices (PSSMs) 272. PSFMs represent the conservation of amino acids in a human protein sequence across aligned protein sequences of other species by determining, on a location-by-location basis, the frequency of occurrence of an amino acid in the human protein sequence across the aligned protein sequences of other species. On the other hand, PSSMs represent the evolutionary conservation of alternate amino acids that are candidates for filling an amino acid vacancy in a vacancy-containing protein. PSSMs represent evolutionary conservation of alternate amino acids based at least in part on structural (or spatial) compatibility between substitute amino acids and adjacent amino acids in a neighborhood of the amino acid vacancy (e.g., the right and left flanking amino acids around the amino acid vacancy). PSFMs and PSSMs can be used to identify the evolutionary conservation profiles of homologous proteins of the plurality of species and differentiate most-conserved amino acids from non-conserved amino acids.
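As a toy sketch of these evolutionary profiles, the counts underlying a PWM and the position-wise frequencies of a PSFM can be derived from a multiple sequence alignment as follows; the three-sequence alignment and both helpers are hypothetical illustrations:

```python
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"   # 20 canonical amino acid categories

def pwm_counts(aligned_seqs):
    """L x 20 matrix of amino acid counts per alignment column (a PWM)."""
    index = {aa: i for i, aa in enumerate(AMINO_ACIDS)}
    counts = np.zeros((len(aligned_seqs[0]), 20))
    for seq in aligned_seqs:
        for pos, aa in enumerate(seq):
            if aa in index:                  # gaps ('-') are ignored
                counts[pos, index[aa]] += 1
    return counts

def psfm(aligned_seqs):
    """Position-specific frequency matrix: counts normalized per column."""
    counts = pwm_counts(aligned_seqs)
    totals = counts.sum(axis=1, keepdims=True)
    return np.divide(counts, totals, out=np.zeros_like(counts), where=totals > 0)

msa = ["MKV", "MKL", "M-V"]                  # toy alignment, query sequence first
freqs = psfm(msa)
print(freqs[2, AMINO_ACIDS.index("V")])      # V observed in 2 of 3 sequences
```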
A protein sequence (e.g., a reference protein sequence) can be used as a seed to search and align homologous sequences from a protein database (e.g., SWISSPROT) using, for example, a PSI-BLAST program. The aligned sequences share some homologous segments and belong to the same protein family. The aligned sequences are further converted into two profiles, namely, PSSM and PSFM, to express their homologous information. Consider in
The first input 202 may further include the secondary structure and solvent accessibility of the reference target protein sequence and the alternative protein sequence. For example, a three-state secondary structure profile at each amino acid position of the protein sequence can be represented as alpha helix (H), beta sheet (B), and coils (C). A three-state solvent accessibility profile at each amino acid position of the protein sequence can be represented as buried (B), intermediate (I), and exposed (E). In one implementation, the secondary structure and solvent accessibility of the reference target protein sequence and alternative protein sequence are predicted by other deep learning networks, and the predicted results are used here as indirect input to the pathogenicity classifier 204. Given the small number of human proteins that have been successfully crystallized, inferring structure from the primary sequence and using it as an input to the pathogenicity classifier 204 has the advantage of avoiding biases due to incomplete protein structure and functional domain annotations.
In one implementation, the training dataset used for the first model includes common human variants and common non-human primate variants. For example, the training dataset includes over 80,000 common variants (>0.1% allele frequency) and over 300,000 unique primate variants from chimpanzee, bonobo, gorilla, orangutan, rhesus, and marmoset. Existing databases of human expert-curated variants do not represent the entire genome, with ˜50% of the variants in the ClinVar database coming from only 200 genes (˜1% of human protein-coding genes). Moreover, systematic studies identify that many human expert annotations have questionable supporting evidence, underscoring the difficulty of interpreting rare variants that may be observed in only a single patient. Although human expert interpretation has become increasingly rigorous, classification guidelines are largely formulated around consensus practices and are at risk of reinforcing existing tendencies. Variation from the six non-human primates (chimpanzee, bonobo, gorilla, orangutan, rhesus, and marmoset) contributes over 300,000 unique missense variants that are non-overlapping with common human variation, and largely represent common variants of benign consequence that have been through the sieve of purifying selection, greatly enlarging the training dataset available for machine learning approaches. On average, each primate species contributes more variants than the whole of the ClinVar database (˜42,000 missense variants as of November 2017, after excluding variants of uncertain significance and those with conflicting annotations). Additionally, this content is free from biases in human interpretation.
In another implementation, the training dataset used for the first model includes a “gapped spatial representation” of a vacancy-containing protein that excludes at least one gap amino acid in the protein as well as a spatial representation of an alternate amino acid. For example, the training set includes respective benign protein samples for respective reference amino acids at respective positions in a proteome. The reference amino acids are major allele amino acids of the proteome. In one implementation, the proteome has ten million positions, and therefore the benign training set has ten million benign protein samples. The respective benign protein samples have respective gapped spatial representations generated by using the respective reference amino acids as respective gap amino acids. The respective benign protein samples have respective representations of the respective reference amino acids as respective alternate amino acids. The proteome can include human proteome and non-human proteome, for example, non-human primate proteome.
In another implementation, the training set used for the first model includes respective pathogenic protein samples for respective combinatorically generated amino acid substitutions for each of the reference amino acids at each of the respective positions in the proteome. For example, the respective combinatorically generated amino acid substitutions are confined to alternate amino acids reachable from the reference codon of the reference amino acid by single nucleotide polymorphisms (SNPs), excluding alternate amino acid classes that are unreachable by a single nucleotide change. The combinatorically generated amino acid substitutions for a particular reference amino acid of a particular amino acid class at a particular position in the proteome include respective alternate amino acids of respective amino acid classes that are different from the particular amino acid class. The respective pathogenic protein samples have respective gapped spatial representations generated by using the respective reference amino acids as respective gap amino acids. The respective pathogenic protein samples have respective representations of the respective combinatorically generated amino acid substitutions as respective alternate amino acids created by respective combinatorically generated nucleotide variants at the respective positions in the proteome. The proteome can have ten million positions, and there are nineteen combinatorically generated amino acid substitutions for each of the ten million positions. Thus, the pathogenic training set can have one hundred and ninety million pathogenic protein samples.
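The confinement of substitutions to those reachable by a single nucleotide change can be sketched with the standard genetic code; the helper below is an illustration of the reachability concept, not the disclosed implementation:

```python
# Standard genetic code, built from compact strings (codons ordered T, C, A, G).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {a + b + c: AMINO[16 * i + 4 * j + k]
               for i, a in enumerate(BASES)
               for j, b in enumerate(BASES)
               for k, c in enumerate(BASES)}

def snp_reachable_substitutions(codon):
    """Amino acid substitutions reachable from `codon` by one nucleotide change."""
    ref_aa = CODON_TABLE[codon]
    reachable = set()
    for pos in range(3):
        for base in BASES:
            if base != codon[pos]:
                mutant = codon[:pos] + base + codon[pos + 1:]
                alt_aa = CODON_TABLE[mutant]
                if alt_aa not in ("*", ref_aa):   # skip stop codons and synonymous
                    reachable.add(alt_aa)
    return reachable

# Missense neighbors of Methionine (ATG): fewer than the 19 combinatoric classes.
print(sorted(snp_reachable_substitutions("ATG")))
```

As the example shows, only a subset of the nineteen combinatoric substitution classes at a position is reachable from any single codon, which is the constraint the paragraph above describes.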
During the training process, the first model generates, using these benign and pathogenic variants from humans and non-human primates as input, pathogenicity scores and corresponding rankings for these variants. By comparing the predicted output with the ground-truth labels, the first model calculates the loss via a loss function such that the loss can be used to adjust model parameters via backpropagation. Meanwhile, the second model also generates pathogenicity scores and corresponding rankings for these variants. In one implementation, the pathogenicity scores and rankings generated by the second model are different from those of the first model. The rank loss determination logic 118 determines a rank loss parameter based on the comparison of the two sets of scores mapped to corresponding rankings. This rank loss parameter becomes an additional input to be used by the loss function reconfiguration logic 128 and training logic 138 in adjusting model parameters via backpropagation.
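The disclosure does not fix a particular formula for the rank loss parameter; one possible form, offered only as a sketch, is a pairwise hinge loss that penalizes the first model for inverting variant pairs that the second model's ranking orders the other way:

```python
import numpy as np

def pairwise_rank_loss(scores_model1, scores_model2, margin=0.0):
    """Hinge-style pairwise rank loss between two sets of variant scores.

    For every variant pair ordered one way by model 2 (the reference
    ranking), penalize model 1 when it orders that pair the other way.
    """
    s1 = np.asarray(scores_model1, dtype=float)
    s2 = np.asarray(scores_model2, dtype=float)
    loss, n_pairs = 0.0, 0
    for i in range(len(s1)):
        for j in range(len(s1)):
            if s2[i] > s2[j]:                         # model 2 ranks i above j
                loss += max(0.0, margin - (s1[i] - s1[j]))
                n_pairs += 1
    return loss / max(n_pairs, 1)

# Perfect agreement in ordering yields zero loss.
print(pairwise_rank_loss([0.2, 0.5, 0.9], [0.1, 0.6, 0.7]))  # 0.0
```

Because the loss is differentiable almost everywhere in the first model's scores, it can serve as an additional term alongside the primary classification loss during backpropagation.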
In one implementation, the first model 200 is a deep neural network, for example, a convolutional neural network (CNN) or recurrent neural network (RNN), that uses multiple non-linear and complex transforming layers to successively model high-level features. Given that sequence data (e.g., DNA or RNA sequences, amino acid sequences) can be multi- and high-dimensional, deep neural networks have great promise for bioinformatics research because of their broad applicability and enhanced prediction power. In particular, convolutional neural networks have been adapted to solve sequence-based problems in genomics such as motif discovery, pathogenic variant identification, and gene expression inference. CNNs use a weight-sharing strategy that is especially useful for studying DNA because it can capture sequence motifs, which are short, recurring local patterns in DNA that are presumed to have significant biological functions.
In one implementation, the first model 200 is a deep neural network that predicts the pathogenicity of missense variants from protein sequence and sequence conservation data (See Sundaram, L. et al. Predicting the clinical impact of human mutation with deep neural networks. Nat. Genet. 50, 1161-1170 (2018), referred to herein as “PrimateAI”). PrimateAI uses deep neural networks trained on variants of known pathogenicity with data augmentation using cross-species information. In particular, PrimateAI uses sequences of wild-type and mutant proteins to compare the difference and decide the pathogenicity of mutations using the trained deep neural networks. PrimateAI learns important protein domains, conserved amino acid positions, and sequence dependencies directly from the training data consisting of about 120,000 human samples.
In one implementation, the inputs to PrimateAI include one-hot-encoding of a reference target protein sequence, one-hot-encoding of an alternative protein sequence with a target variant, evolutionary profiles represented by three PWMs from primate, mammal, and vertebrate alignments, respectively, as well as the prediction of the secondary structure and solvent accessibility of the reference target protein sequence. The pathogenicity classifier 204 of PrimateAI is a convolutional neural network that uses different types of convolutions such as 1D convolutions, 2D convolutions, 3D convolutions, dilated or atrous convolutions, transposed convolutions, separable convolutions, and depthwise separable convolutions. Some layers also use the ReLU activation function, which greatly accelerates the convergence of stochastic gradient descent compared to saturating non-linearities such as sigmoid or hyperbolic tangent. Other examples of activation functions that can be used in PrimateAI include parametric ReLU, leaky ReLU, GeLU, and exponential linear unit (ELU).
Some layers in PrimateAI also use batch normalization. Regarding batch normalization, the distribution of each layer in a CNN changes during training and it varies from one layer to another. This reduces the convergence speed of the optimization algorithm. Batch normalization is a technique to overcome this problem. Denoting the input of a batch normalization layer with x and its output using z, batch normalization applies the following transformation on x:
z = γ · ((x − μ)/σ) + β
Batch normalization applies mean-variance normalization on the input x using μ and σ and linearly scales and shifts it using γ and β. The normalization parameters μ and σ are computed for the current layer over the training set using a method called exponential moving average. In other words, they are not trainable parameters. In contrast, γ and β are trainable parameters. The values for μ and σ calculated during training are used in the forward pass during inference.
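The batch normalization transformation described above can be sketched as follows; the small epsilon term added under the square root for numerical stability is a standard addition, and here μ and σ are computed from the batch itself rather than from running averages:

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    """Batch normalization: z = gamma * (x - mu) / sqrt(sigma^2 + eps) + beta."""
    mu = x.mean(axis=0)                        # per-feature batch mean
    var = x.var(axis=0)                        # per-feature batch variance
    x_hat = (x - mu) / np.sqrt(var + eps)      # mean-variance normalization
    return gamma * x_hat + beta                # trainable scale and shift

# Toy batch of 3 examples with 2 features each.
x = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
z = batch_norm(x, gamma=1.0, beta=0.0)
print(z.mean(axis=0))   # approximately zero per feature after normalization
```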
In another implementation, the first model 200 is a deep neural network to predict variant pathogenicity based on 3D structural information of protein sequences, referred to herein as “PrimateAI 3D”. In one implementation, PrimateAI 3D uses as input voxelized one-hot-encodings of amino acids in a reference target protein sequence and one-hot-encodings of amino acids in an alternative protein sequence with a target variant. PrimateAI 3D also uses as input twenty-one amino acid-wise distance channels, where each amino acid-wise distance channel includes a distance value for each of the 27 voxels within a 3×3×3 voxel grid. In another implementation, PrimateAI 3D also uses as input the evolutionary profiles of the reference target protein sequence, for example, PWMs and PSFMs as described above.
In one implementation, PrimateAI 3D is a cascade of processing modules, each of which can include 1D convolutions, 2D convolutions, 3D convolutions, 4D convolutions, 5D convolutions, dilated or atrous convolutions, transpose convolutions, depthwise separable convolutions, pointwise convolutions, 1×1 convolutions, group convolutions, flattened convolutions, spatial and cross-channel convolutions, shuffled grouped convolutions, spatial separable convolutions, and deconvolutions. It can use one or more loss functions such as logistic regression/log loss, multi-class cross-entropy/softmax loss, binary cross-entropy loss, mean-squared error loss, L1 loss, L2 loss, smooth L1 loss, and Huber loss. It can use any parallelism, efficiency, and compression schemes such as TFRecords, compressed encoding (e.g., PNG), sharding, parallel calls for map transformation, batching, prefetching, model parallelism, data parallelism, and synchronous/asynchronous stochastic gradient descent (SGD). It can include upsampling layers, downsampling layers, recurrent connections, gates and gated memory units (like an LSTM or GRU), residual blocks, residual connections, highway connections, skip connections, peephole connections, activation functions (e.g., non-linear transformation functions like rectifying linear unit (ReLU), leaky ReLU, exponential linear unit (ELU), sigmoid and hyperbolic tangent (tanh)), batch normalization layers, regularization layers, dropout, pooling layers (e.g., max or average pooling), global average pooling layers, attention mechanisms, and Gaussian error linear units. The processing modules may also include a final softmax layer that produces exponentially normalized scores for the target variant belonging to a benign class and a pathogenic class.
In yet another implementation, the first model 200 is a deep neural network to predict variant pathogenicity based on “gapped spatial representation” of a vacancy-containing protein that excludes at least one gap amino acid in the protein, and a spatial representation of an alternate amino acid. The first model 200 is referred to herein as “JigsawAI”. The spatial representation of the vacancy-containing protein can be realized by using voxelized amino acid category-wise distance calculations that exclude the reference gap amino acid (or atoms thereof). In one implementation, an alternate amino acid may fill the vacancy at the given position in the reference target protein sequence. The representation of the alternate amino acid, along with the “gapped spatial representation” of the vacancy-containing protein are input to the pathogenicity classifier which generates an output pathogenicity score of the alternate amino acid. In another implementation, JigsawAI also uses evolutionary profiles of protein sequences as input.
In one implementation, the pathogenicity classifier 204 of JigsawAI is a cascade of processing modules, each of which can include 1D convolutions (1×1×1 CONV), 3D convolutions (3×3×3 CONV), ReLU non-linearity, and batch normalization (BN). Other examples of the processing modules include fully-connected layers, a dropout layer, a flattening layer, and a final softmax layer that produces exponentially normalized scores for the target variant belonging to a benign class and a pathogenic class.
A person skilled in the art will appreciate that for the architecture of the first model, other current and future artificial intelligence, machine learning, and deep learning models, datasets, and training techniques can be incorporated into the disclosed model without deviating from the spirit of the technology disclosed.
As described above, the second model can have a different architecture from the first network. For example, the second model can be a generative model that uses probability estimates and likelihood to model data points and differentiate between different class labels present in a dataset. Generative models are often used in natural language processing. In recent years, generative models have emerged as powerful machine-learning tools to discover evolutionary, structural, and functional information across protein space and are thus referred to as protein language models. The second model can be a protein language model that is trained on large and diverse protein sequence databases and captures information about the sequence-structure-function relationship of the protein, including, e.g., the pathogenicity of variants in the context of epistasis.
In one implementation, the second model can capture higher-order epistatic interactions between mutations that span multiple positions and multiple sequences in the MSA. In another implementation, the higher-order epistatic interactions between the mutations may form an undirected graph trajectory. The second model can use an undirected graphical model to identify highly evolutionarily coupled loci given the MSA. See Schubert, B., Maddamsetti, R., Nyman, J. & Marks, D. S. Genome-wide discovery of epistatic loci affecting antibiotic resistance using evolutionary couplings. Nat. Microbiol. 4, 328-338 (2019). In another implementation, the second model can capture two-way epistatic interactions (e.g., at the pairwise level) between mutations. In yet another implementation, the second model can capture higher-order (e.g., three-way, four-way) epistatic interactions that describe the full extent of possible interactions.
In one implementation, the second model processes the MSA as an input and generates a reconstruction of the MSA as an output. For a given alternate protein sequence, the second model may generate a reconstruction of the alternate protein sequence as a base-wise probability score for each amino acid in the protein sequence. In other implementations, a joint probability is determined from the base-wise probability score corresponding to each amino acid and, in turn, is used as a pathogenicity score for a given variant that mutates the reference target protein sequence to the given alternate protein sequence.
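A simplified sketch of scoring a variant by the ratio of mutant to wildtype likelihood, using randomly generated per-position probabilities as a stand-in for the model's reconstruction (the `variant_score` helper and scoring convention are illustrative assumptions):

```python
import numpy as np

def variant_score(probs, position, ref_aa, alt_aa, aa_index):
    """Log-ratio of mutant to wildtype likelihood at a single position.

    probs: (L, 20) per-position amino acid probabilities reconstructed by
    the generative model; under this simplified convention, a more
    negative score suggests a more damaging variant.
    """
    p_alt = probs[position, aa_index[alt_aa]]
    p_ref = probs[position, aa_index[ref_aa]]
    return float(np.log(p_alt / p_ref))

AA = "ACDEFGHIKLMNPQRSTVWY"
idx = {aa: i for i, aa in enumerate(AA)}
rng = np.random.default_rng(1)
probs = rng.dirichlet(np.ones(20), size=5)   # toy reconstruction for 5 positions
s = variant_score(probs, position=2, ref_aa="A", alt_aa="W", aa_index=idx)
print(type(s).__name__)   # float
```

For a multi-position variant, the per-position log-ratios could be summed to form the joint score described in the paragraph above.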
In yet another implementation, the second model is pre-trained. In other words, the respective coefficients and latent space configurations of the second model are pre-trained to process and reconstruct the MSA that has a respective reference target protein sequence as a query sequence. For the given reference target protein sequence, the second model may have a particular or preset coefficient and latent space configuration. Considering there are twenty thousand reference protein sequences in the human proteome, the second model may accordingly have a particular or preset coefficient and latent space configuration for each of the twenty thousand reference protein sequences. When a given protein sequence is used as an input to the second model, for each amino acid located in a given position of the protein sequence, the output is a vector with 20 floating-point numbers. Each floating-point number corresponds to the pathogenicity of a particular alternative amino acid, as there are 20 types of amino acids.
In one implementation, the second model is a multilayer perceptron (MLP). In another implementation, the second model is a feedforward neural network. In yet another implementation, the second model is a fully-connected neural network. In a further implementation, the second model is a fully convolutional neural network. In a yet further implementation, the second model is a semantic segmentation neural network. In yet another further implementation, the second model is a generative adversarial network (GAN) (e.g., CycleGAN, StyleGAN, pixelRNN, text-2-image, DiscoGAN, IsGAN).
In one implementation, the second model is a convolution neural network (CNN) with a plurality of convolution layers. In another implementation, the second model is a recurrent neural network (RNN) such as a long short-term memory network (LSTM), bi-directional LSTM (Bi-LSTM), or a gated recurrent unit (GRU). In yet another implementation, the second model includes both a CNN and an RNN.
In yet other implementations, the second model can use 1D convolutions, 2D convolutions, 3D convolutions, 4D convolutions, 5D convolutions, dilated or atrous convolutions, transpose convolutions, depthwise separable convolutions, pointwise convolutions, 1×1 convolutions, group convolutions, flattened convolutions, spatial and cross-channel convolutions, shuffled grouped convolutions, spatial separable convolutions, and deconvolutions. The second model can use one or more loss functions such as logistic regression/log loss, multi-class cross-entropy/softmax loss, binary cross-entropy loss, mean-squared error loss, L1 loss, L2 loss, smooth L1 loss, and Huber loss. The second model can use any parallelism, efficiency, and compression schemes such as TFRecords, compressed encoding (e.g., PNG), sharding, parallel calls for map transformation, batching, prefetching, model parallelism, data parallelism, and synchronous/asynchronous stochastic gradient descent (SGD). The second model can include upsampling layers, downsampling layers, recurrent connections, gates and gated memory units (like an LSTM or GRU), residual blocks, residual connections, highway connections, skip connections, peephole connections, activation functions (e.g., non-linear transformation functions like rectifying linear unit (ReLU), leaky ReLU, exponential linear unit (ELU), sigmoid and hyperbolic tangent (tanh)), batch normalization layers, regularization layers, dropout, pooling layers (e.g., max or average pooling), global average pooling layers, and attention mechanisms (e.g., self-attention).
The second model can be a rule-based model, linear regression model, a logistic regression model, an Elastic Net model, a support vector machine (SVM), a random forest (RF), a decision tree, and a boosted decision tree (e.g., XGBoost), or some other tree-based logic (e.g., metric trees, kd-trees, R-trees, universal B-trees, X-trees, ball trees, locality sensitive hashes, and inverted indexes). The second model can be an ensemble of multiple models, in some implementations.
The second model is trained using backpropagation-based gradient update techniques. Example gradient descent techniques that can be used for training the second model include stochastic gradient descent, batch gradient descent, and mini-batch gradient descent. Some examples of gradient descent optimization algorithms that can be used to train the second model include Momentum, Nesterov accelerated gradient, Adagrad, Adadelta, RMSprop, Adam, AdaMax, Nadam, and AMSGrad.
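As an illustrative sketch of one of the optimizers named above, a single Momentum update can be written as follows; the learning rate, momentum coefficient, and toy objective are assumptions for demonstration, not training settings from this filing:

```python
import numpy as np

def momentum_step(w, grad, velocity, lr=0.1, gamma=0.9):
    """One Momentum update: v <- gamma*v + lr*grad; w <- w - v."""
    velocity = gamma * velocity + lr * grad
    return w - velocity, velocity

# Toy objective f(w) = 0.5*||w||^2, whose gradient is simply w itself.
w = np.array([1.0, -2.0])
v = np.zeros_like(w)
for _ in range(200):
    w, v = momentum_step(w, w, v)  # repeated updates drive w toward the minimum
```

The velocity term accumulates past gradients, which damps oscillation and speeds convergence relative to plain SGD on this kind of objective.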
In one implementation, input to the second model includes epistasis-related information expressed by amino acid patterns spanning a target protein sequence and a plurality of non-target protein sequences aligned with the target protein sequence in the MSA. The amino acid patterns may be variant or mutation patterns and may include multiple amino acids, e.g., at least three amino acids. In another implementation, input to the second model includes sequence information of a target protein sequence and a plurality of non-target protein sequences that are homologous with the target protein sequence.
In one implementation, the target protein sequence belongs to a target species (e.g., human), and the non-target protein sequences belong to a group of species that shares a genus with the target species. The shared genus is, for example, Genus Homo. In another implementation, the non-target protein sequences belong to a group of species that shares a family with the target species. The shared family is, for example, hominids. In yet another implementation, the non-target protein sequences belong to a group of species that shares a class with the target species. The shared class is, for example, mammals. In yet another implementation, the non-target protein sequences belong to a group of species that shares a phylum with the target species. The shared phylum is, for example, chordates. In yet another implementation, the non-target protein sequences belong to a group of species that shares a kingdom with the target species. The shared kingdom is, for example, animals.
The following discussion provides different examples of machine learning architectures that can be used to implement the second model. The following example machine learning architectures can take as input machine-processable or vectorized representations of sequence data, for example, one-hot encodings of nucleotides and/or amino acids, process the machine-processable representations through a plurality of hidden layers and weights of the machine learning architectures, produce learned or alternative or intermediate or compressed representations of the machine-processable representations, and generate one or more outputs based on the learned or alternative or intermediate or compressed representations. These outputs can be genotype predictions identifying one or more attributes or identities of the genomic data, such as the identity of the nucleotides and/or amino acids, evolutionary conservation states of the nucleotides and/or amino acids, the pathogenicity of the nucleotides and/or amino acids, and so on.
We start from example architectures of generative models. Generative models are often used in natural language processing. In recent years, generative models have emerged as powerful machine-learning tools to discover evolutionary, structural, and functional information across protein space and thus are referred to as protein language models. The second model, as a protein language model, can be trained on large and diverse protein sequence databases, and captures information of the sequence-structure-function relationship of the protein including, e.g., the pathogenicity of variants in the context of epistasis.
In one implementation, the second model is a Transformer model that relies on a self-attention mechanism to compute a series of context-informed vector-space representations of elements in the input sequence and the output sequence, which are then used to predict distributions over subsequent elements as the model predicts the output sequence element-by-element. Not only is this mechanism straightforward to parallelize, but as each input's representation is also directly informed by all other inputs' representations, this results in an effectively global receptive field across the whole input sequence. This stands in contrast to, e.g., convolutional architectures which typically only have a limited receptive field.
The second model can include self-attention mechanisms like Transformer, Vision Transformer (ViT), Bidirectional Transformer (BERT), Detection Transformer (DETR), Deformable DETR, UP-DETR, DeiT, Swin, GPT, iGPT, GPT-2, GPT-3, BERT, SpanBERT, RoBERTa, XLNet, ELECTRA, UniLM, BART, T5, ERNIE (THU), KnowBERT, DeiT-Ti, DeiT-S, DeiT-B, T2T-ViT-14, T2T-ViT-19, T2T-ViT-24, PVT-Small, PVT-Medium, PVT-Large, TNT-S, TNT-B, CPVT-S, CPVT-S-GAP, CPVT-B, Swin-T, Swin-S, Swin-B, Twins-SVT-S, Twins-SVT-B, Twins-SVT-L, Shuffle-T, Shuffle-S, Shuffle-B, XCiT-S12/16, CMT-S, CMT-B, VOLO-D1, VOLO-D2, VOLO-D3, VOLO-D4, MoCo v3, ACT, TSP, Max-DeepLab, VisTR, SETR, Hand-Transformer, HOT-Net, METRO, Image Transformer, Taming transformer, TransGAN, IPT, TTSR, STTN, Masked Transformer, CLIP, DALL-E, Cogview, UniT, ASH, TinyBert, FullyQT, ConvBert, FCOS, Faster R-CNN+FPN, DETR-DCS, TSP-FCOS, TSP-RCNN, ACT+MKDD (L=32), ACT+MKDD (L=16), SMCA, Efficient DETR, UP-DETR, ViT-B/16-FRCNN, PVT-Small+RetinaNet, Swin-T+RetinaNet, Swin-T+ATSS, PVT-Small+DETR, TNT-S+DETR, YOLOS-Ti, YOLOS-S, and YOLOS-B.
Machine learning is the use and development of computer systems that can learn and adapt without following explicit instructions, by using algorithms and statistical models to analyze and draw inferences from patterns in data. Some of the state-of-the-art models use Transformers, a more powerful and faster model than neural networks alone. Neural networks process input in series (e.g., time series data including sequencing-by-synthesis (SBS) sequencing data) and weight relationships by distance in the series. Transformers can process input in parallel and do not necessarily weight by distance. Transformers can be used in addition to neural networks. This architecture is described here.
The context vector is then passed to the second building block, the decoder. Conditioned on the input context vector, the decoder generates an output sequence. At each time step, t, the decoder is fed the hidden state of time step, t−1, and the output generated at time step, t−1. The first hidden state in the decoder is the context vector, generated by the encoder.
The whole model is optimized end-to-end by using backpropagation, a method of training a neural network in which the initial system output is compared to the desired output and the system is adjusted until the difference is minimized. In backpropagation, the encoder is trained to extract the right information from the input sequence, while the decoder is trained to capture the grammar and vocabulary of the output language. This results in a fluent model that uses context and generalizes well. When training an encoder-decoder model, the real output sequence is used to train the model to prevent mistakes from stacking. When testing the model, the previously predicted output value is used to predict the next one.
When performing a translation task using the encoder-decoder architecture, all information about the input sequence is forced into one vector, the context vector. Information connecting the beginning of the sentence with the end is lost; this is the vanishing gradient problem. Also, different parts of the input sequence are important for different parts of the output sequence, which is information that cannot be learned using only RNNs in an encoder-decoder architecture.
Attention Mechanism
Attention mechanisms distinguish Transformers from other machine learning models. The attention mechanism provides a solution for the vanishing gradient problem.
To weight encoder hidden states, a dot product between the decoder hidden state of the current time step, and all encoder hidden states, is calculated. This results in an attention score for every encoder hidden state. The attention scores are higher for those encoder hidden states that are similar to the decoder hidden state of the current time step. Higher values for the dot product indicate the vectors are pointing more closely in the same direction. The attention scores are converted to fractions that sum to one using the SoftMax function.
The SoftMax scores provide an attention distribution. The x-axis of the distribution is position in a sentence. The y-axis is attention weight. The scores show which encoder hidden states are most closely related. The SoftMax scores specify which encoder hidden states are the most relevant for the decoder hidden state of the current time step.
The elements of the attention distribution are used as weights to calculate a weighted sum over the different encoder hidden states. The outcome of the weighted sum is called the attention output. The attention output is used to predict the output, often in combination (concatenation) with the decoder hidden states. Thus, both information about the inputs, as well as the already generated outputs, can be used to predict the next outputs.
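The dot-product scoring, SoftMax normalization, and weighted sum described above can be sketched numerically as follows; the hidden-state values are toy stand-ins, not states of a trained model:

```python
import numpy as np

def attention_output(decoder_state, encoder_states):
    """Score each encoder state against the decoder state, then blend them."""
    scores = encoder_states @ decoder_state          # one dot-product score per encoder state
    scores = scores - scores.max()                   # subtract max for numerical stability
    weights = np.exp(scores) / np.exp(scores).sum()  # SoftMax: fractions that sum to one
    return weights, weights @ encoder_states         # weighted sum = attention output

dec = np.array([1.0, 0.0])
enc = np.array([[1.0, 0.0],   # points in the same direction as dec -> high score
                [0.0, 1.0]])  # orthogonal to dec -> lower score
weights, output = attention_output(dec, enc)
```

As the text notes, the encoder state most similar to the current decoder state receives the largest attention weight.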
By making it possible to focus on specific parts of the input in every decoder step, the attention mechanism solves the vanishing gradient problem. By using attention, information flows more directly to the decoder. It does not pass through many hidden states. Interpreting the attention step can give insights into the data. Attention can be thought of as a soft alignment. The words in the input sequence with a high attention score align with the current target word. Attention describes long-range dependencies better than RNN alone. This enables analysis of longer, more complex sentences.
The attention mechanism can be generalized as: given a set of vector values and a vector query, attention is a technique to compute a weighted sum of the vector values, dependent on the vector query. The vector values are the encoder hidden states, and the vector query is the decoder hidden state at the current time step.
The weighted sum can be considered a selective summary of the information present in the vector values. The vector query determines on which of the vector values to focus. Thus, a fixed-size representation of the vector values can be created, in dependence upon the vector query.
The attention scores can be calculated by the dot product, or by weighting the different values (multiplicative attention).
For most machine learning models, the input to the model needs to be numerical. The input to a translation model is a sentence, and words are not numerical. Multiple methods exist for the conversion of words into numerical vectors. These numerical vectors are called the embeddings of the words. Embeddings can be used to convert any type of symbolic representation into a numerical one.
Embeddings can be created by using one-hot encoding. The one-hot vector representing the symbols has the same length as the total number of possible different symbols. Each position in the one-hot vector corresponds to a specific symbol. For example, when converting colors to a numerical vector, the length of the one-hot vector would be the total number of different colors present in the dataset. For each input, the location corresponding to the color of that value is one, whereas all the other locations are valued at zero. This works well for working with images. For natural language processing (NLP), this becomes problematic, because the number of words in a language is very large. This results in enormous models and the need for a lot of computational power. Furthermore, no specific information is captured with one-hot encoding. From the numerical representation, it is not clear that orange and red are more similar than orange and green. For this reason, other methods exist.
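A minimal sketch of one-hot encoding, using the color example above; the vocabulary and helper function are illustrative:

```python
import numpy as np

def one_hot(symbol, vocabulary):
    """Return a vector with a 1 at the symbol's position and 0 elsewhere."""
    vec = np.zeros(len(vocabulary))
    vec[vocabulary.index(symbol)] = 1.0
    return vec

colors = ["red", "orange", "green"]
v = one_hot("orange", colors)
```

Note that the dot product between any two distinct one-hot vectors is zero, which illustrates the point above: one-hot encoding captures no similarity between symbols such as orange and red.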
A second way of creating embeddings is by creating feature vectors. Every symbol has its specific vector representation, based on features. With colors, a vector of three elements could be used, where the elements represent the amount of yellow, red, and/or blue needed to create the color. Thus, all colors can be represented by only using a vector of three elements. Also, similar colors have similar representation vectors.
Embedding based on context can be trained. Words with similar meanings occur in similar contexts. At the nucleotide level, particular combinations of three DNA or RNA nucleotides correspond to specific amino acids or stop signals during protein synthesis. In addition, homologous proteins or genes have sequence similarity that reflects common ancestry. Different methods take the context into account. For natural language processing, some methods, like GloVe, base their context embedding on co-occurrence statistics from corpora (large texts) such as Wikipedia. Words with similar co-occurrence statistics have similar word embeddings. Other methods use neural networks to train the embeddings. For example, they train their embeddings to predict the word based on the context (Continuous Bag of Words), and/or to predict the context based on the word (Skip-Gram). Training these contextual embeddings is time intensive. For this reason, pre-trained libraries exist. Other deep learning methods can be used to create embeddings. For example, the latent space of a variational autoencoder (VAE) can be used as the embedding of the input. Another method is to use 1D convolutions to create embeddings. This causes a sparse, high-dimensional input space to be converted to a denser, low-dimensional feature space.
Transformer models are based on the principle of self-attention. Self-attention allows each element of the input sequence to look at all other elements in the input sequence and search for clues that can help it to create a more meaningful encoding. It is a way to look at which other sequence elements are relevant for the current element. The Transformer can grab context from both before and after the currently processed element.
When performing self-attention, three vectors need to be created for each element of the encoder input: the query vector (Q), the key vector (K), and the value vector (V). These vectors are created by performing matrix multiplications between the input embedding vector using three unique weight matrices.
After this, self-attention scores are calculated. When calculating self-attention scores for a given element, the dot products between the query vector of this element and the key vectors of all other input elements are calculated. To make the model mathematically more stable, these self-attention scores are divided by the root of the size of the vectors. This has the effect of reducing the importance of the scalar, thus emphasizing the importance of the direction of the vector. Just as before, these scores are normalized with a SoftMax layer. This attention distribution is then used to calculate a weighted sum of the value vectors, resulting in a vector z for every input element. In the attention principle explained above, the vector used to calculate attention scores and the vector used to perform the weighted sum were the same; in self-attention, two different vectors are created and used. As the self-attention needs to be calculated for all elements (thus a query for every element), one formula can be created to calculate a Z matrix. The rows of this Z matrix are the z vectors for every sequence input element, giving the matrix a size of (sequence length)×(dimension of the v vectors).
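The Z-matrix computation described above can be sketched as follows; the weight matrices are random stand-ins for learned parameters, and the sizes are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d = 4, 8
X = rng.normal(size=(seq_len, d))                         # input embeddings, one row per element
WQ, WK, WV = (rng.normal(size=(d, d)) for _ in range(3))  # three unique weight matrices

Q, K, V = X @ WQ, X @ WK, X @ WV
scores = Q @ K.T / np.sqrt(d)                     # dot products, scaled by root of vector size
scores = scores - scores.max(axis=-1, keepdims=True)
A = np.exp(scores)
A = A / A.sum(axis=-1, keepdims=True)             # SoftMax over each row
Z = A @ V                                         # one z vector per input element
```

Each row of `Z` is the z vector for the corresponding input element, computed as a weighted sum over all value vectors.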
Multi-headed attention is executed in the Transformer.
When performing self-attention, information about the order of the different elements within the sequence is lost. To address this problem, positional encodings are added to the embedding vectors. Every position has its unique positional encoding vector. These vectors follow a specific pattern, which the Transformer model can learn to recognize. This way, the model can consider distances between the different elements.
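The description above leaves the specific positional-encoding pattern open; the sinusoidal scheme below is one common concrete choice, shown only for illustration:

```python
import numpy as np

def positional_encoding(num_positions, d_model):
    """Sinusoidal positional encodings: a unique vector per position."""
    pos = np.arange(num_positions)[:, None]
    i = np.arange(d_model)[None, :]
    angle = pos / np.power(10000.0, (2 * (i // 2)) / d_model)
    # Even indices use sine, odd indices use cosine.
    return np.where(i % 2 == 0, np.sin(angle), np.cos(angle))

pe = positional_encoding(num_positions=16, d_model=8)
# These vectors would be added to the embedding vectors before self-attention.
```

Because the encodings follow a fixed, position-dependent pattern, the model can learn to recognize relative distances between elements.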
As discussed above, in the core of self-attention are three objects: queries (Q), keys (K), and values (V). Each of these objects has an inner semantic meaning of their purpose. One can think of these as analogous to databases. A user-defined query of what the user wants to know is used as input. Then we have the relations in the database, i.e., the values which are the weights. More advanced database management systems create some apt representation of its relations to retrieve values more efficiently from the relations. This can be achieved by using indexes, which represent information about what is stored in the database. In the context of attention, indexes can be thought of as keys. So instead of running the query against values directly, the query is first executed on the indexes to retrieve where the relevant values or weights are stored. Lastly, these weights are run against the original values to retrieve data that are most relevant to the initial query.
Examples of attention calculation include scaled dot-product attention and additive attention. There are several reasons why scaled dot-product attention is used in the Transformers. Firstly, the scaled dot-product attention is relatively fast to compute, since its main parts are matrix operations that can be run on modern hardware accelerators. Secondly, it performs similarly well for smaller dimensions of the K matrix, dk, as the additive attention. For larger dk, the scaled dot-product attention performs a bit worse because dot products can cause the vanishing gradient problem. This is compensated via the scaling factor, which is defined as √dk.
As discussed above, the attention function takes as input three objects: key, value, and query. In the context of Transformers, these objects are matrices of shapes (n, d), where n is the number of elements in the input sequence and d is the hidden representation of each element (also called the hidden vector). Attention is then computed as:
Attention(Q, K, V)=SoftMax(Q·KT/√dk)·V,
where Q, K, V are computed as:
Q=X·WQ, K=X·WK, V=X·WV
X is the input matrix and WQ, WK, WV are learned weights to project the input matrix into the representations. The dot products appearing in the attention function are exploited for their geometrical interpretation, where higher values of their results mean that the inputs are more similar, i.e., pointing in the geometrical space into the same direction. Since the attention function now works with matrices, the dot product becomes matrix multiplication. The SoftMax function is used to normalize the attention weights so that they sum to 1 prior to being multiplied by the values matrix. The resulting matrix is used either as input into another layer of attention or becomes the output of the Transformer.
Transformers become even more powerful when multi-head attention is used. Queries, keys, and values are computed the same way as above, though they are now projected into h different representations of smaller dimensions using a set of h learned weights. Each representation is passed into a different scaled dot-product attention block called a head. The head then computes its output using the same procedure as described above.
Formally, the multi-head attention is defined as
MultiHeadAttention(Q, K, V)=[head1, . . . , headh]·WO,
where headi=Attention(Q·WiQ, K·WiK, V·WiV).
The outputs of all heads are concatenated together and projected again using the learned weights matrix WO to match the dimensions expected by the next block of heads or the output of the Transformer. Using the multi-head attention instead of the simpler scaled dot-product attention enables Transformers to jointly attend to information from different representation subspaces at different positions.
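Multi-head attention as defined above can be sketched as follows; the number of heads, the dimensions, and the random weight matrices are illustrative assumptions:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention for one head."""
    return softmax(Q @ K.T / np.sqrt(K.shape[-1])) @ V

rng = np.random.default_rng(1)
n, d, h = 5, 8, 2              # sequence length, hidden size, number of heads
d_head = d // h                # each head works in a smaller dimension
X = rng.normal(size=(n, d))

heads = []
for _ in range(h):
    # Per-head learned projections (random stand-ins here).
    WQ, WK, WV = (rng.normal(size=(d, d_head)) for _ in range(3))
    heads.append(attention(X @ WQ, X @ WK, X @ WV))

WO = rng.normal(size=(h * d_head, d))       # final learned projection
out = np.concatenate(heads, axis=-1) @ WO   # concatenate heads, project with WO
```

The concatenation followed by the WO projection restores the dimensions expected by the next block, as described above.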
As shown in
Assuming the naive matrix multiplication algorithm, which has a complexity of:
a·b·c
for matrices of shapes (a, b) and (b, c), to obtain the values Q, K, V, the operations are computed as:
X·WQ, X·WK, X·WV
The matrix X is of shape (n, d) where n is the number of patches and d is the hidden vector dimension. The weights WQ, WK, WV are all of shape (d, d). Omitting the constant factor 3, the resulting complexity is:
n·d2
It is then proceeded to the estimation of the complexity of the attention function itself, i.e., of:
SoftMax(Q·KT/√dk)·V
The matrices Q and K are both of shape (n, d). The transposition operation does not influence the asymptotic complexity of computing the dot product of matrices of shapes (n, d)·(d, n), therefore its complexity is:
n2·d
Scaling by a constant factor of √dk, where dk is the dimension of the keys vector, as well as applying the SoftMax function, both have the complexity of a·b for a matrix of shape (a, b), hence they do not influence the asymptotic complexity. Lastly, the dot product SoftMax(Q·KT/√dk)·V is between matrices of shapes (n, n) and (n, d), and so its complexity is:
n2·d
The final asymptotic complexity of scaled dot-product attention is obtained by summing the complexities of computing Q, K, V, and of the attention function
n·d2+n2·d.
The asymptotic complexity of multi-head attention is the same, since the original input matrix X is projected into h matrices of shapes (n, d/h), where h is the number of heads. From the view of asymptotic complexity, h is constant; therefore, the same estimate of asymptotic complexity can be obtained using a similar approach as for the scaled dot-product attention.
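The cost estimate derived above can be sanity-checked with a simple operation count; the function and the sample sizes are illustrative only:

```python
def attention_cost(n, d):
    """Rough multiply count for scaled dot-product attention: n*d^2 + n^2*d."""
    projections = 3 * n * d * d    # X·WQ, X·WK, X·WV, each (n, d)·(d, d)
    scores = n * n * d             # Q·K^T, (n, d)·(d, n)
    weighted_sum = n * n * d       # SoftMax(...)·V, (n, n)·(n, d)
    return projections + scores + weighted_sum

# Doubling the sequence length n roughly quadruples the n^2·d terms,
# reflecting the quadratic dependence on sequence length.
small = attention_cost(n=128, d=64)
large = attention_cost(n=256, d=64)
```

This quadratic growth in n is the same property that, as discussed later, makes the patch size of a Vision Transformer so influential on training and inference time.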
Transformer models often have the encoder-decoder architecture, although this is not necessarily the case. The encoder is built out of different encoder layers which are all constructed in the same way. The positional encodings are added to the embedding vectors. Afterward, self-attention is performed.
Just like the encoder, the decoder is built from different decoder layers. In the decoder, a modified version of self-attention takes place. The query vector is only compared to the keys of previous output sequence elements. The elements further in the sequence are not known yet, as they still must be predicted. No information about these output elements may be used.
For some tasks other than translation, only an encoder is needed. This is true for both document classification and named entity recognition. In these cases, the encoded input vectors are the input of the feed-forward layer and the SoftMax layer. These models have applications in the field of biology as well for predicting protein structure and function and labeling DNA sequences.
There are extensive applications of transformers in vision including popular recognition tasks (e.g., image classification, object detection, action recognition, and segmentation), generative modeling, multi-modal tasks (e.g., visual-question answering, visual reasoning, and visual grounding), video processing (e.g., activity recognition, video forecasting), low-level vision (e.g., image super-resolution, image enhancement, and colorization) and 3D analysis (e.g., point cloud classification and segmentation).
In image classification, we often have a single input image in which the pixels are in a sequence. To reduce the computation required, Vision Transformers (ViTs) cut the input image into a set of fixed-sized patches of pixels. The patches are often 16×16 pixels. ViTs are depicted in
The computations of the ViT architecture can be summarized as follows. The first layer of a ViT extracts a fixed number of patches from an input image (
When the input image is split into patches, a fixed patch size is specified before instantiating a ViT. Given the quadratic complexity of attention, patch size has a large effect on the length of training and inference time. A single Transformer block comprises several layers. The first layer implements Layer Normalization, followed by the multi-head attention that is responsible for the performance of ViTs. In the depiction of a Transformer block in
ViTs can be pretrained and fine-tuned. Pretraining is generally done on a large dataset. Fine-tuning is done on a domain specific dataset.
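The patch-extraction step described above can be sketched as follows; the toy image contents and sizes are assumptions for illustration:

```python
import numpy as np

def extract_patches(image, patch_size=16):
    """Cut an image into fixed-size, non-overlapping patches of pixels,
    flattening each patch into one input element for the Transformer."""
    h, w = image.shape[:2]
    patches = [image[r:r + patch_size, c:c + patch_size].ravel()
               for r in range(0, h, patch_size)
               for c in range(0, w, patch_size)]
    return np.stack(patches)

image = np.zeros((64, 64, 3))     # toy 64x64 RGB image
patches = extract_patches(image)  # (64/16)^2 = 16 patches of 16*16*3 values each
```

Because attention cost grows quadratically with the number of input elements, larger patches (fewer elements) reduce compute, as noted above.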
Domain-specific architectures, like convolutional neural networks (CNNs) or long short-term memory networks (LSTMs), have been derived from the usual architecture of MLPs and suffer from so-called inductive biases that predispose the networks towards a certain output. ViTs stepped in the opposite direction of CNNs and LSTMs and became more general architectures by eliminating inductive biases. A ViT can be seen as a generalization of MLPs because MLPs, after being trained, do not change their weights for different inputs. On the other hand, ViTs compute their attention weights at runtime based on the particular input.
The following discussion describes some implementations of how the Transformer models process a genomic sequence and produce position-wise nucleotide classification of the genomic sequence.
The Transformer models include convolutional layers that can detect local patterns, and thereby enhance the detection of nucleotide motifs. The Transformer models process a genome sequence in consecutive segments of length l. Every input nucleotide x ∈{A, C, G, T} is first transformed into a vector embedding h(0), after which it is transformed k times through addition (residual connection) with another vector, obtained by the multi-head attention function present in each layer (h(0)→ . . . →h(k)).
A set of fully connected layers transforms h(k) into a model output ŷ(k). For each residual block, the vector that is summed with the input (to obtain h(1), . . . , h(k)) is calculated using the hidden states of l upstream positions.
The multi-head attention applied in each residual block is methodologically identical. From each input hidden state h, a query (q), key (k), and value (v) vector of equal shapes are calculated. The output z of the attention head, applied on the hidden state at position n, is calculated as follows:
z(n)=SoftMax(q(n)·KT/√dhead)·V,
where K, V ∈ ℝl×dhead are the matrices that are composed from the l upstream hidden states (e.g., K=[k(n−l), . . . , k(n)]).
The denominator is used to stabilize the scores based on the dimensions of q, k, and v (dhead). The multiplication of the query vector with all the key vectors results in a vector of scores that is normalized for all input values using the softmax function. These scores are multiplied to the v vectors for the calculation of z (i.e., a linear combination). The attention scores denote the relevance of information present between two positions, where the multiplication of the q and k vectors function as a lock and key encoding, which returns goodness-of-fit scores for the information embedded in two hidden states (defined by v).
In each residual block, multiple attention heads are present (hence, multi-head attention), each featuring their own unique sets of model weights to calculate q, k, and v. As such, multiple types of information can be extracted from the input hidden states. The outcome of different attention heads within the same layer is further processed into a single vector, which is summed with h to obtain the hidden state of the next layer (e.g., h(1)→h(2)).
Contextual information embedded within the hidden states derived from single nucleotides is limited. Motifs formed from multiple neighboring nucleotides are deemed of greater importance towards biological processes. The addition of a convolutional layer allows the q, k, and v vectors to be derived from multiple neighboring hidden states without affecting the input/output resolution. Thereby, the retrieval of relevant information using attention is improved, resulting in improved predictive performances on a variety of tasks.
Positional information is used within the vectors q, k, and v by superimposing (i.e., through summation) a positional encoding vector to h. The added signal is a function of the vector index and the relative positioning with respect to the other input hidden states.
The annotation of DNA is a sequence labeling task that has correspondences in natural language processing. The DNA sequence is a data set of n nucleotides, i.e., X ∈{x(1), x(2), . . . , x(n)}, where x ∈{A, C, T, G}. The task comprises predicting a label y ∈{0, 1} for each position x, where a positive label denotes the occurrence of an event at that position.
The Transformer models process the genome in sequential segments of l nucleotides. During training, a non-linear transformation function E is optimized that maps the input classes {A, C, T, G} to a vector embedding h of length dmodel. For nucleotide x(i) on the genome:
h=E(x(i)), x(i)∈{A, T, C, G},
where h ∈ ℝdmodel.
The hidden states of each segment H ∈ ℝl×dmodel, [h(1), . . . , h(l)], are processed through k layers. As such, the data propagation through the network for any input x follows multiple transformations: x→h(0,:)→ . . . →h(k,:)→ŷ.
Within each layer, multi-head attention is calculated for each hidden state. Next, for each hidden state of h, the output of the multi-head attention step (MultiHead) is summed with the input, i.e., a residual connection, with the final step being layer normalization. The calculations of the output for all hidden states h in layer t at position m of segment s are performed in parallel:
h(s,t+1,m)=LayerNorm(h(s,t,m)+MultiHead(H(s,t))),
or
H(s,t+1)=LayerNorm(H(s,t)+MultiHead(H(s,t))),
where t ∈[0, k[ and m ∈[1, l].
After a forward pass through k layers, a final linear combination reduces the dimension of the output hidden state (dmodel) to the number of output classes. In one implementation, only binary classification is performed. In another implementation, a softmax layer is applied before obtaining the prediction value ŷi for nucleotide xi.
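One layer update of the form H(s,t+1)=LayerNorm(H(s,t)+MultiHead(H(s,t))) can be sketched as follows, with a toy single-head attention standing in for MultiHead and a plain, non-learned layer normalization; all values are illustrative:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def layer_norm(H, eps=1e-5):
    """Normalize each hidden state to zero mean and unit-ish variance."""
    mu = H.mean(axis=-1, keepdims=True)
    sigma = H.std(axis=-1, keepdims=True)
    return (H - mu) / (sigma + eps)

def layer(H):
    # Single-head self-attention as a stand-in for MultiHead(H).
    attn = softmax(H @ H.T / np.sqrt(H.shape[-1])) @ H
    # Residual connection followed by layer normalization.
    return layer_norm(H + attn)

H = np.random.default_rng(2).normal(size=(6, 16))  # l=6 hidden states of size 16
H_next = layer(H)                                  # e.g., h(1) -> h(2)
```

Stacking k such layers yields the propagation x→h(0,:)→ . . . →h(k,:)→ŷ described above.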
Particular Implementation of using Second Model to Predict Variant Pathogenicity
An example of using variational autoencoder (VAE) as the second model to predict pathogenicity scores of variants will be described in more detail in accordance with
In particular, the second model is used to implicitly capture higher-order interactions between positions in a sequence in a protein family. For every observed sequence x, unobserved latent variables z are posited, together with a generative process p(z)p(x|z) that specifies a joint distribution over hidden variables and observed variables. As the marginal probability of the observed data, p(x), is intractable and difficult to compute directly, variational inference is used here to form a lower bound, namely the Evidence Lower Bound (ELBO), on the log probability. The ELBO used in this model takes the form of:
log p(x|θ) ≥ Eq[log p(x|z, θ)] − DKL(q(z|x, φ) || p(z)),
where q(z|x, φ) is an approximate posterior for the hidden variables given the observed variables, approximating the true posterior p(z|x).
Thus, given a starting target wild-type sequence, the pathogenicity effect caused by a single point mutation can be computed as the difference in ELBOs and predicted as:
log [p(xmutant)/p(xwildtype)] ≈ ELBO(xmutant) − ELBO(xwildtype).
As illustrated in
The target wild-type sequence (Swt) is fed into an encoder and compressed to a vector of hidden variables z (steps 1606 and 1608). In one implementation, the encoder has an architecture of 1500-1500-30 with fully connected layers and ReLU non-linearity functions (See Exhibit 7). The vector of hidden variables z is further decoded via a decoder, thus generating a reconstructed sequence Ŝwt. In one implementation, the decoder has two hidden layers: the first with size 100 and a ReLU non-linearity, and the second with size 2,000 and a sigmoid non-linearity (See Exhibit 7). Thus, p(xwildtype) can be calculated as:
p(xwildtype)=p(Swt)
With softmax (step 1612) and probabilistic encoding (step 1614), the computed loss via ELBO as described above (step 1616) is used to update the weights in the encoder and decoder (step 1618).
When a single point mutation is introduced, steps 1602 to 1618 are repeated. In other words, an alternative protein sequence with a single point mutation (Smut) is encoded and compressed to a vector of hidden variables z. The vector of hidden variables z is further decoded via the decoder, generating a reconstructed sequence Ŝmut. Thus, p(xmutant) can be calculated as:
p(xmutant)=p(Smut)
The effect of the single point mutation (e.g., pathogenicity) can be quantified as the log ratio of mutant likelihood to wild-type likelihood, as approximated by the ELBO:
log [p(xmutant)/p(xwildtype)] ≈ ELBO(xmutant) − ELBO(xwildtype).
As described above, this non-linear latent variable model learns in an unsupervised manner, solely from sequence information in large multiple sequence alignments. It captures beyond-pairwise features in high-order epistasis, although the model requires high computation power (e.g., four hours per protein on a GPU) and an input sequence alignment that is as deep as possible.
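The ELBO computation above can be sketched as follows, assuming a diagonal Gaussian approximate posterior q(z|x) and a standard normal prior p(z) (common choices for this kind of VAE); the `encoder` and `decoder` callables are hypothetical stand-ins for the networks described above:

```python
import numpy as np

rng = np.random.default_rng(0)

def kl_diag_gaussian(mu, log_var):
    """Closed-form KL( N(mu, diag(exp(log_var))) || N(0, I) )."""
    return 0.5 * np.sum(np.exp(log_var) + mu**2 - 1.0 - log_var)

def elbo(x_onehot, encoder, decoder, n_samples=16):
    """Monte-Carlo ELBO: E_q[log p(x|z)] - KL(q(z|x) || p(z)).

    `encoder` maps a one-hot sequence to (mu, log_var) of q(z|x);
    `decoder` maps a sample z to per-position amino-acid probabilities.
    """
    mu, log_var = encoder(x_onehot)
    recon = 0.0
    for _ in range(n_samples):
        # Reparameterized sample z ~ q(z|x).
        z = mu + np.exp(0.5 * log_var) * rng.normal(size=mu.shape)
        p = decoder(z)
        recon += np.sum(x_onehot * np.log(p + 1e-9))
    return recon / n_samples - kl_diag_gaussian(mu, log_var)

# The mutation effect is then the difference of ELBOs:
# delta = elbo(x_mutant, enc, dec) - elbo(x_wildtype, enc, dec)
```

The difference `delta` approximates the log ratio of mutant to wild-type likelihood described above.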
In one implementation, the rank loss determination logic 1700 compares the two sets of rankings and determines whether there is a rank loss, in accordance with
An example cycle of rank comparisons is listed in 1720 of
Variant A in Residue 1 is more benign than Variant B in Residue 1. However, the same variants have a second pathogenicity score of 2000 mapped to a ranking of 8 and a second pathogenicity score of 0.0003 mapped to a ranking of 2, respectively, indicating that the second model predicts Variant A in Residue 1 to be much more pathogenic than Variant B in Residue 1. In view of the disagreement between the two models, the rank loss determination logic 1700 calculates an absolute value of the difference (d) in the first pathogenicity scores, namely, |0.8−0.1|.
The rank loss determination logic 1700 further calculates a rank loss based on the score difference using a loss function. In one implementation, the loss function is Pairwise Logistic Loss. The rank loss Loss can be calculated as:
Loss = d + logistic(d) = d + log(1 + e^d)
A person skilled in the art will appreciate that the Pairwise Logistic Loss or MSE Loss is an example without limiting the scope of the disclosed technology. Other loss functions can be used, including Pairwise Hinge Loss, Pairwise Soft Zero One Loss, Circle Loss, Softmax Loss, Unique Softmax Loss, Sigmoid Cross Entropy Loss, Mean Square Loss, List MLE Loss, Approx. NDCG Loss, Approx. MRR Loss, Gumbel Approx. NDCG Loss, Neural Sort Cross Entropy Loss, Gumbel Neural Sort Cross Entropy Loss, Neural Sort NDCG Loss, Gumbel Neural Sort NDCG Loss, etc.
In another implementation, each rank loss can be parameterized in different ways. For example, for the logistic loss above, the curvature of the logistic curve can be changed.
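A minimal sketch of the rank loss in the form given above; the `curvature` parameter is a hypothetical knob illustrating how the shape of the logistic curve can be changed:

```python
import numpy as np

def pairwise_logistic_rank_loss(score_a, score_b, curvature=1.0):
    """Rank loss from the absolute score difference d = |s_a - s_b|,
    following the form in the text: Loss = d + log(1 + e^d).
    `curvature` (illustrative) scales the exponent, changing the
    curvature of the logistic term."""
    d = abs(score_a - score_b)
    return d + np.log1p(np.exp(curvature * d))
```

For the example above, with first pathogenicity scores 0.8 and 0.1, d = 0.7 and the loss is 0.7 + log(1 + e^0.7).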
The ranking comparison can be repeated (e.g., 1720 of
S=Loss(Comparison 1)+Loss(Comparison 2)+Loss(Comparison 3)+ . . . +Loss(Comparison n×(n−1)/2)
The rank loss determination logic 1700 further generates an averaged rank loss, or a rank loss parameter (step 1860). For example, the rank loss determination logic 1700 calculates the number (L) of all non-zero rank losses and obtains a rank loss parameter (M) as follows:
M=S/L
The rank loss parameter M can be used by the loss function reconfiguration logic 128 to reconfigure the loss function of the first model during backpropagation.
In another implementation, when the rank loss determination logic 1700 performs a rank comparison where the scores of two variants are too close, the corresponding rank loss is determined to be zero. For example, Variant C in Residue 7 has a pathogenicity score of 0.3 and Variant C in Residue 12 has a pathogenicity score of 0.33. The rank loss determination logic 1700 determines that both variants have the same score within an acceptable error range and thus, the rank loss is zero. The error range can be pre-determined.
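The full cycle above (pairwise comparisons, zero loss on rank agreement or on scores within a pre-determined error range, summation S, count L of non-zero losses, and M = S/L) might be sketched as follows; the function name and the `tol` threshold are illustrative assumptions:

```python
import numpy as np
from itertools import combinations

def rank_loss_parameter(first_scores, second_scores, tol=0.05):
    """Averaged rank loss M = S / L over all n*(n-1)/2 pairwise comparisons.
    A comparison contributes zero when the first-model scores are within
    `tol` (same score within error) or when the two models agree on the
    order; otherwise the pairwise logistic loss of the score difference."""
    losses = []
    for i, j in combinations(range(len(first_scores)), 2):
        d = first_scores[i] - first_scores[j]
        if abs(d) <= tol:
            losses.append(0.0)                       # too close to call
            continue
        agree = (d > 0) == (second_scores[i] > second_scores[j])
        if agree:
            losses.append(0.0)
        else:
            a = abs(d)
            losses.append(a + np.log1p(np.exp(a)))   # pairwise logistic loss
    nonzero = [v for v in losses if v > 0]
    S = sum(nonzero)                                  # sum of rank losses
    L = len(nonzero)                                  # number of non-zero losses
    return S / L if L else 0.0
```

With the example above (first scores 0.1 and 0.8, second scores 2000 and 0.0003), the single comparison disagrees, so M equals the logistic loss of d = 0.7.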
As illustrated in the table of
In one implementation, the first model 102 is a deep neural network that is trained using backpropagation-based gradient update techniques. The goal of training deep neural networks is to optimize weight parameters in each layer, which gradually combines simpler features into complex features such that the most suitable hierarchical representations can be learned from training data. Example gradient descent techniques that can be used for training the first model include stochastic gradient descent, batch gradient descent, and mini-batch gradient descent. Some examples of gradient descent optimization algorithms that can be used to train the first model 102 are Momentum, Nesterov accelerated gradient, Adagrad, Adadelta, RMSprop, Adam, AdaMax, Nadam, and AMSGrad. In other implementations, the first model 102 can be trained by unsupervised learning, semi-supervised learning, self-learning, reinforcement learning, multitask learning, multimodal learning, transfer learning, knowledge distillation, and so on.
An example training process of the first model is described in accordance with
In one implementation, the training dataset 1910 for the first model includes one-hot encodings of common human variants 1912 and one-hot encodings for common non-human primate variants 1914. The training data may also include 3D structural information of benign and pathogenic protein samples 1916, for example, a “gapped spatial representation” of a vacancy-containing protein that excludes at least one gap amino acid in the protein as well as a spatial representation of an alternate amino acid.
In one implementation, the training dataset includes a set of 385,236 labeled benign variants comprising common human variants (>0.1% allele frequency) from the ExAC/gnomAD database and variants from six species of non-human primates. In another implementation, the training dataset includes 3D structural information of 7,000 proteins. Generating distance channels for a plurality of voxels across a plurality of amino acids, atoms, and attributes can involve more than 100 voxelizations per protein, resulting in about 800,000 voxelizations in a single training iteration (epoch). A training run of 20-40 epochs, with rotation of atomic coordinates in each epoch, can result in as many as 32 million voxelizations. In yet another implementation, the proteome (e.g., human proteome and non-human primate proteome) has ten million positions, and therefore the benign training dataset includes ten million benign protein samples. These benign protein samples have respective gapped spatial representations generated by using the respective reference amino acids as respective gap amino acids. In yet another implementation, there are nineteen combinatorially generated amino acid substitutions for each of the ten million positions, and therefore the pathogenic training dataset includes one hundred and ninety million pathogenic protein samples.
By comparing the predicted outputs with the corresponding ground truth labels 1920, each set of training data has a calculated loss. For example, the training data of common human variants has a respective loss via an objective loss function, and similarly for the training data of common non-human primate variants and the 3D structural information of benign and pathogenic protein samples. These losses are passed to the loss function reconfiguration logic 1930, which in turn reconfigures the loss function such that the training logic 1940 can use the chain rule to backpropagate error signals and compute gradients with respect to all weights throughout the neural network.
Meanwhile, the second model also generates respective pathogenicity scores and rankings 1960 for variants. These pathogenicity scores and rankings, along with the output generated from the first pathogenicity classifier 204, are passed to one or more re-ranking layers 1970 to generate re-ranked pathogenicity scores, which will be described in more detail in accordance with
In one implementation, the training process of the first model uses mini-batch gradient descent for backpropagation. Mini-batch gradient descent splits the training dataset into small batches that are used to calculate model error and update model weights. A mini-batch can include pre-determined numbers of samples for rank loss calculation, samples from common non-human primate variants, and samples from structural information of proteins, respectively. The losses calculated from the training dataset (e.g., common human variant loss, common non-human primate loss, structural information loss) and the rank loss can all be taken into consideration. With the use of mini-batches, the training process maintains computational efficiency, stable convergence, and a faster learning rate.
The variation of sample types can also be considered when using mini-batches. For example, when two rank loss samples are compared, they can both be from the same protein. On the other hand, when using non-human primate variants as the training dataset, the two samples can be from the same protein or from different proteins. The same applies to the 3D structural information of proteins. Accordingly, in one implementation, a mixed sample-type batch can be used. Consider a batch with a size of 6. It can include two samples for rank loss determination from the same protein, two samples of different proteins for non-human primate variants, and another two samples of different proteins for 3D structural information.
In another implementation, a single sample-type batch can be used, where all the selected samples are from the same protein but different protein residues. For a batch with a size of 6, it may include two respective rank loss samples from protein Residues 34 and 93, two samples from Residues 1 and 7 for non-human primate variants, and another two samples from Residues 41 and 92 for 3D structural information.
In yet another implementation, a single sample-type batch can be used, where all the samples come from different protein residues within the same protein. But instead of contributing to one part of the batch, all samples are shared between sample types. For a batch with a size of 6, each of the samples within the batch can be calculated for a rank loss, a loss for non-human primate variants, and a loss for structural information.
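A sketch of how a mixed sample-type batch of the kind described above might be assembled; the pool dictionaries (protein → sample identifiers) and the function name are hypothetical data structures for illustration:

```python
import random

def make_mixed_batch(rank_pool, primate_pool, struct_pool, seed=0):
    """Mixed sample-type batch of size 6: two rank-loss samples from the
    same protein, two primate-variant samples from different proteins,
    and two 3D-structure samples from different proteins."""
    rng = random.Random(seed)
    protein = rng.choice(sorted(rank_pool))
    rank = rng.sample(rank_pool[protein], 2)                # same protein
    primate = [rng.choice(primate_pool[p])
               for p in rng.sample(sorted(primate_pool), 2)]  # different proteins
    struct = [rng.choice(struct_pool[p])
              for p in rng.sample(sorted(struct_pool), 2)]    # different proteins
    return rank + primate + struct
```

A single sample-type batch, by contrast, would draw all six samples from different residues of one protein, either partitioned among the three sample types or shared by all of them.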
As described above, during the training process of the first model, the rank loss function is reconfigured, and the rank loss parameters along with the weight parameters within each layer of the first model are optimized. During the inference stage, the first model may no longer need the rankings generated from the second model. Instead, the first model is trained to rank like the second model; that is, it captures beyond-pairwise features in high-order epistasis and generates pathogenicity scores that are optimal and consistent across different protein domains and even across proteins.
Alternatively or additionally, in one implementation, the pathogenic scores and corresponding rankings generated from the first model and the second model are passed to one or more re-ranking layers, and the calculated rank loss is used in reconfiguring the loss function. The re-ranking layers can be learned. The parameters (weights, bias, etc.) of the re-ranking layers can be adjusted during backpropagation, along with the parameter optimization for each layer of the first model (e.g., first pathogenicity classifier 204).
The input 2010 to the re-ranking layer includes sets of scores generated and re-ranked by various models. In one implementation, for a set of training data, the first model generates a set of pathogenicity scores 2012 mapped to a set of score rankings. These scores are re-ranked based on the ranking of the second model and thus, creating a new set of re-ranked scores 2014. Consider the table of
As illustrated, Variant A in Residue 1 has a score of 0.1 and a ranking of 1 from the first model. However, it is ranked 8 by the second model. In one implementation, a given score of 0.9 is re-assigned to Variant A in Residue 1, where the given score originally belonged to Variant D in Residue 2, which had a score ranking of 8. As such, all of the variants are re-ranked, yielding a set of re-ranked scores 2080.
In one implementation, the original scores 2012 generated from the first model (Soriginal) and the re-ranked scores 2014 (Sreassigned) are input to at least one re-ranking layer 1200, which in turn generates updated re-ranked scores 2030 (Supdated). The re-ranking layer(s) 1200 can be 1D convolution layers with differentiable parameters (e.g., weights and bias) and, optionally, activation functions. The activation functions can include non-linear transformation functions like the rectified linear unit (ReLU), leaky ReLU, exponential linear unit (ELU), sigmoid, and hyperbolic tangent (tanh).
Consider a simplified example for generating updated re-ranked scores listed in the Table of
Supdated=Soriginal×w1+Sreassigned×w2
The rank loss determination logic 1700 calculates, based on Soriginal and Supdated, a rank loss parameter M, which is used by the loss function reconfiguration logic 1930.
In other implementations, a plurality of re-ranking layers is used to generate updated re-ranked scores. For example, the first re-ranking layer has differentiable parameters of w1, w2 and b1 (bias), the second re-ranking layer has differentiable parameters of w3, w4 and b2, and so on. Each of the parameters can be optimized during backpropagation.
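A single re-ranking step of this kind might be sketched as follows; the weights w1, w2 and bias b are shown as fixed illustrative values, whereas in the described system they are differentiable parameters adjusted during backpropagation:

```python
import numpy as np

def reranking_layer(s_original, s_reassigned, w1=0.6, w2=0.4, b=0.0):
    """One re-ranking step: a weighted combination of the original
    first-model scores and the scores reassigned from the second model's
    ranking. w1, w2, and b stand in for learned parameters."""
    return w1 * np.asarray(s_original) + w2 * np.asarray(s_reassigned) + b
```

Stacking several such layers, each with its own parameters, gives the multi-layer variants described above.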
In yet another implementation, more than two models can be involved in the pathogenicity score re-ranking and rank loss calculation. For example, a third model is used to generate a third set of pathogenicity scores mapped to corresponding rankings. The third model can be a generative model, e.g., a Transformer-based model. The model input can be epistasis expressed by the amino acid patterns spanning the MSA.
Accordingly, the score recalibration logic 108 may reassign pathogenic scores generated from the first model based on the score rankings generated from the third model, thus creating another set of re-ranked scores 2016. The original scores generated from the first model (Soriginal), the re-ranked scores based on the rankings from the second model (Sreassigned-2), and the re-ranked scores based on the rankings from the third model (Sreassigned-3) will be fed as input to re-ranking layer(s) to generate updated re-ranked scores (Supdated). An example is as follows:
Supdated=Soriginal×w1+Sreassigned-2×w2+Sreassigned-3×w3+b
In other implementations, a plurality of re-ranking layers is used to generate updated re-ranked scores. For example, the first re-ranking layer has differentiable parameters of w1, w2, w3 and b1 (bias), the second re-ranking layer has differentiable parameters of w4, w5, w6 and b2, and so on. Each of the parameters can be optimized during backpropagation.
Similarly, a fourth model can be used to generate a fourth set of pathogenicity scores mapped to corresponding rankings. The fourth model can be a discriminative model that classifies variants and predicts their pathogenicity. The model input can be masked representations of the evolutionary conservation summary statistics about at least one amino acid in an alternate protein sequence, where one or more variant amino acids substitute the original amino acids at certain locations of the reference target protein sequence. The model input can be masked representations of the 3D structural information of the amino acids in a reference target protein sequence, where the masked representations exclude 3D structural data of at least one amino acid in the reference target protein sequence. The model input can also be masked representations of the 3D structural information of the amino acids in alternate protein sequences, where the masked representations of the alternate protein sequences exclude at least one amino acid in the protein sequence.
The score recalibration logic 108 may reassign pathogenic scores generated from the first model based on the score rankings generated from the fourth model, thus creating another set of re-ranked scores 2018. This score set (Sreassigned-4), along with Soriginal, Sreassigned-2, and Sreassigned-3, will be fed as input to re-ranking layer(s) to generate updated re-ranked scores (Supdated). An example is as follows:
Supdated=Soriginal×w1+Sreassigned-2×w2+Sreassigned-3×w3+Sreassigned-4×w4+b
In other implementations, a plurality of re-ranking layers is used to generate updated re-ranked scores. For example, the first re-ranking layer has differentiable parameters of w1, w2, w3, w4 and b1 (bias), the second re-ranking layer has differentiable parameters of w5, w6, w7, w8 and b2, and so on. Each of the parameters can be optimized during backpropagation.
A person skilled in the art will appreciate that the numbers of models used in rank comparison and rank loss calculation, the model types (e.g., discriminative model, generative model), and the numbers and features of re-ranking layers (e.g., weights, bias, activation functions) can be different in various implementations without departing from the spirit and scope of the technology disclosed.
In other implementations, a plurality of re-ranking layers can be stacked, such that an output from one re-ranking layer can be an input to a next re-ranking layer.
This process can be repeated by introducing scores and rankings from more models. For example, the updated re-ranked output 2160 can be passed to the next re-ranking layer(s) 2170, along with scores and rankings generated from a fourth model 2180. The updated re-ranked scores 2190 can be used by the rank loss determination logic to determine rank losses and the corresponding rank loss parameter M.
In other implementations, more than one model can be trained based on the rank loss among different models.
Similarly, another training dataset 2220 can be used as input to train the fourth model including fourth pathogenicity classifier 2240. The training dataset 2220 for the fourth model can be different from the training dataset 2210, for example, including gapped spatial representations of benign and pathogenic protein samples 2222 and spatial representations of gap amino acids 2224. The fourth pathogenicity classifier 2240 generates a fourth set of pathogenicity scores mapped to corresponding score rankings for each training dataset.
The second model and/or third model can be involved, by using, e.g., MSA as input, to generate a second set of pathogenicity scores mapped to corresponding rankings and/or a third set of pathogenicity scores mapped to corresponding rankings, respectively. In one implementation, the second and/or third model may be generative models. For example, the second model may be a VAE or GAN model, while the third model may be a Transformer-based model. The respective coefficient and latent space configurations of the second and/or third model can be pre-trained to process and reconstruct respective multiple sequence alignments that have respective reference target protein sequences as query sequences. The second and/or third model can have a particular coefficient and latent space configuration corresponding to the reference target protein sequence.
Each of the models independently generates a set of pathogenicity scores mapped to corresponding rankings, which in turn is passed to the rank loss determination logic 2260 for rank loss and rank loss parameter determination. In one implementation, the rank loss determination logic 2260 determines the rank loss parameter based on a comparison of the first set of score rankings, the second set of score rankings, and the third set of score rankings. In another implementation, the rank loss determination logic 2260 determines the rank loss parameter based on a comparison of the first set of score rankings, the second set of score rankings, and the fourth set of score rankings. In yet another implementation, the rank loss determination logic 2260 determines the rank loss parameter based on a comparison of the first set of score rankings, the second set of score rankings, the third set of score rankings, and the fourth set of score rankings.
The loss function reconfiguration logic 2270 reconfigures the loss function based on the generated rank loss parameter(s). In some implementations as described above, the loss function reconfiguration logic 2270 also considers the loss calculated by comparing the predicted pathogenicity variants with their respective ground truth labels. As the input training datasets can be different between the first and fourth models, their respective ground truth labels 2292 and 2294, and loss calculations can also be different.
In one implementation, the loss function reconfiguration logic 2270 reconfigures the loss function of the first model based on at least one of the respective rank losses, common non-human primate variant losses, and losses of 3D structural information. Furthermore, the loss function reconfiguration logic 2270 reconfigures the loss function of the fourth model based on at least one of the respective rank losses and losses in spatial representation of gapped protein samples and gap amino acids. The training logic 2280 uses the re-configured loss functions to train the first and fourth models during backpropagation 2296 and 2298, respectively.
The disclosed system for inter-model pathogenicity score recalibration has at least the following advantages. By taking into consideration the pathogenicity scores and rankings generated from different models, reassigning the scores, and updating the rankings, the disclosed system improves the accuracy in predicting variant pathogenicity and in assessing the effects of variants on phenotypes. More importantly, such prediction accounts not only for single-position or pairwise mutations, but also for higher-order epistasis.
As described above in various implementations, some deep learning models can be site-independent or pairwise interaction models that focus on single-position mutations or pairwise mutations at proximate positions. These models use structural information of proteins and evolutionary profiles from humans and primates, and are trained primarily using common human variants and common non-human primate variants that are mapped to human variants. As a result, these models outperform other models that solely rely on sequence information, because the scale of the pathogenicity scores across protein domains and proteins is consistent and optimal. However, the predictions of these site-independent or pairwise interaction models do not consider high-order epistasis.
On the other hand, some deep learning models account for the effects of variants in a surrounding context and thus are strong tools in modeling epistasis in proteins. An example mentioned above is protein language models that identify the context that confers meaning to each position in a sequence. These models can have hundreds of millions of parameters in the model architecture and be trained on hundreds of millions of diverse protein sequences across evolution. See, Joshua Meier, Roshan Rao, Robert Verkuil, Jason Liu, Tom Sercu, Alexander Rives, Language models enable zero-shot prediction of the effects of mutations on protein function, BioRxiv, doi: https://doi.org/10.1101/2021.07.09.450648. These models solely use a large and diverse database in MSA as input, and thus the ranking of the pathogenicity scores within a protein domain is optimal, even though the scale of the scores may not precisely reflect the pathogenicity across domains or proteins.
The disclosed technology for inter-model pathogenicity score reassignment during training provides an approach for combining the strengths of various models. As such, the rankings of the pathogenicity scores within a protein domain generated from one model (e.g., a generative model or protein language model) accurately reflect the effects of the variants in the surrounding context (i.e., epistasis). Furthermore, the scale of the scores (e.g., from a discriminative model) across different protein domains and different proteins remains optimal and consistent. This avoids undesired situations where the same pathogenicity score indicates high pathogenicity in one protein domain but is entirely benign in another.
Additionally, the disclosed technology provides an approach for utilizing the prediction results from one model as input to train another model, such that the trained model takes into consideration the prediction results from other models to optimize its own model parameters via backpropagation. Rather than directly using scores generated from one model as input to the other model, the use of score rankings avoids score normalization, which may impact prediction accuracy. The calculation of rank loss and re-configuration of loss functions improves the optimization process, which leads to a trained model with high accuracy in predicting variants within local protein domains and reliability across protein domains and proteins.
In one particular implementation, the first model used in the inter-model re-ranking system is PrimateAI 3D, and the second model is a VAE-based non-linear latent variable model (See, DeepSequence described in Exhibit 7). The output of PrimateAI 3D for one sample, i.e., one protein residue (position), is a vector of 20 floating point numbers, in which each number corresponds to the pathogenicity of a particular alternative amino acid. The output has a pathogenicity score scale from zero to one (closer to 0: more benign; closer to 1: more pathogenic). The output of DeepSequence has the same format and meaning for pathogenicity scores.
For the sake of simplicity, in this particular implementation, the output consists only of a vector of length 4, corresponding to amino acids A, B, C, and D. PrimateAI 3D is trained in batches of samples, for example, with an assumed batch size of two. All samples within one batch come from the same protein.
An example batch is listed as follows. Sample 1 (Residue 7 of Protein 1) in the example batch has a PrimateAI 3D output vector (0.1, 0.8, 0.5, 0.15) and a DeepSequence output vector (0.2, 0.3, 0.11, <mask>). Sample 2 (Residue 12 of Protein 1) in the example batch has a PrimateAI 3D output vector (0.3, 0.5, 0.1, 0.9) and a DeepSequence output vector (<mask>, 0.33, <mask>, 0.1).
The rank loss determination logic performs the rank comparison. In particular, the rank loss determination logic compares <Sample 1, amino acid A> with <Sample 1, amino acid B>. For PrimateAI 3D, the predicted scores for <Sample 1, amino acid A> and <Sample 1, amino acid B> are 0.1 and 0.8, respectively. For DeepSequence, the predicted scores for the same samples are 0.2 and 0.3, respectively. The order of the two amino acids is the same in both PrimateAI 3D and DeepSequence, and thus the rank loss determination logic determines the rank loss as zero (i.e., loss(Comparison 1) = 0). The rank loss determination logic subsequently compares <Sample 1, amino acid A> with <Sample 1, amino acid C>, determines that the order of the two amino acids is not the same in PrimateAI 3D and DeepSequence, and thus determines the rank loss as a non-zero value (i.e., loss(Comparison 2) > 0). The rank loss determination logic calculates an absolute difference between the two PrimateAI 3D scores for <Sample 1, amino acid A> and <Sample 1, amino acid C>, and obtains a score difference of 0.4. The rank loss determination logic subsequently applies a logistic loss function to calculate a rank loss for this particular comparison as 0.9.
The comparison is repeated between samples within the batch until all possible combinations of amino acid pairs in the batch have been compared. This process generates (n×(n−1))/2 pairs of loss values, where n is the number of amino acids. The rank loss determination logic further sums up all the losses (S) from all comparisons, namely,
S=loss(Comparison 1)+ . . . +loss(Comparison n)
The sum of losses (S) is divided by the number (L) of all non-zero losses to generate a value of the rank loss parameter (M) for this batch, which in turn is used in backpropagation.
It is noted that whenever one of the two amino acids under comparison is deemed <mask>, the rank loss determination logic determines the loss of that comparison as zero. The rank loss determination logic can additionally ignore any comparisons that involve reference scores that are too close. For example, Residue 7, amino acid B has a DeepSequence score of 0.3, and Residue 12, amino acid B has a DeepSequence score of 0.33. The difference is so small that it can be interpreted as both amino acids having the same score within error. Therefore, the comparison is too close to call, and the rank loss is zero.
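The masking and too-close-to-call rules for a single comparison might be sketched as follows; `MASK`, `tol`, and the function name are illustrative stand-ins for the <mask> marker and the pre-determined error range:

```python
import math

MASK = None  # illustrative stand-in for a <mask> entry in the reference scores

def comparison_loss(s1_first, s2_first, s1_ref, s2_ref, tol=0.05):
    """Rank loss for one pairwise comparison. Returns zero when either
    reference score is masked, when the reference scores are within `tol`
    (too close to call), or when both models agree on the order; otherwise
    returns the pairwise logistic loss of the first-model score difference."""
    if s1_ref is MASK or s2_ref is MASK:
        return 0.0
    if abs(s1_ref - s2_ref) <= tol:
        return 0.0
    d = s1_first - s2_first
    if (d > 0) == (s1_ref > s2_ref):
        return 0.0                       # same order: no rank loss
    a = abs(d)
    return a + math.log1p(math.exp(a))   # disagreement: logistic rank loss
```

For example, the comparison of <Sample 1, amino acid A> with <Sample 1, amino acid B> above returns zero because both models agree on the order.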
The rank loss determination logic can take more rank losses into consideration. For example, given one additional set of reference scores on top of the existing DeepSequence scores, another set of samples is introduced into a batch, which leads to four sample types. Alternatively, the rank loss determination logic can compute a weighted average over the different rank losses for each sample. The reference scores can also be combined, for example by weighted averaging, before calculating the rank loss, which leads to three different sample types in a batch.
The rank loss described above can be one part of the losses used in backpropagation for PrimateAI 3D. The other two losses are losses calculated using non-human primate variants from the training dataset, and losses calculated using gapped spatial representations of protein samples. For the sake of simplicity, it is assumed here that the batch size is six. Each batch can be constructed to include two primate samples (from different proteins), two gapped protein samples (from different proteins), and two rank loss samples (from the same protein). In each iteration of the training process, the losses of all variants in a batch are summed up and backpropagated.
Alternatively, all samples within a batch can come from the same protein, with the samples differing between types. Consider again one batch with a size of six. Residues 1 and 7 of the protein can be for primate samples, Residues 41 and 92 of the protein can be for gapped protein samples, and Residues 34 and 93 for rank loss. Alternatively, all samples within one batch can come from the same protein and be shared among sample types. That is, with a batch size of six, all of the losses (i.e., the loss for primate samples, the loss for gapped protein samples, and the rank loss) can be calculated for all six samples.
In another particular implementation, the first model used in the inter-model re-ranking system is PrimateAI 3D, and the second model is a VAE-based non-linear latent variable model (see DeepSequence described in Exhibit 7). Here, the DeepSequence scores are not normalized, and the score scale can range from minus infinity to plus infinity, whereas PrimateAI 3D's score scale ranges from zero to one. Simply incorporating the absolute score values of DeepSequence into PrimateAI 3D is of limited value, given the difference in scales. Also, the scores from DeepSequence may not be comparable across proteins. For example, a score of 33 can mean highly pathogenic in Protein 1, but benign in Protein 2. Yet, the relative ranks of variants within the same protein can be valuable. For example, amino acid A in Residue 7 (score=2000) is more pathogenic than amino acid D in Residue 12 (score=0.00001).
In order to utilize the reliable ranking information rather than the unnormalized scores, re-ranking layers are used in this particular implementation, as illustrated in
Furthermore, PrimateAI 3D scores are scaled across proteins, whereas DeepSequence scores are not. Within the same protein, however, the ranks generated from DeepSequence may be more accurate than those from PrimateAI 3D. Hence, in this implementation, the original scores generated from PrimateAI 3D are kept but re-ranked using the ranks generated from DeepSequence. For example, the first variant (R1, A) has a ranking of 8 generated from DeepSequence (see Column “DS rank”). Based on the rank of 8, Column “PAI-3D rank” is consulted to look up the variant with a rank of 8 (R2, D). That variant has a score of 0.9 generated from PrimateAI 3D. Accordingly, the score of 0.9 becomes the new score in Column “PAI-3D score from DS ranks”.
The above re-ranking process is repeated for all variants. As a result, each variant has two scores: the original PrimateAI 3D score mapped to the PrimateAI 3D score ranking (see Column “PAI-3D score”), and the re-ranked PrimateAI 3D score based on the DeepSequence score ranking (see Column “PAI-3D score from DS ranks”). The two scores can be complementary, and their combination may yield better predictions.
Both columns can be given user-defined weights, such that they are combined as a weighted average. Column “New score (DS weight=0.5)” and Column “New score (DS weight=0.1)” are two examples of combined scores using two alternative weights. As such, each of the original PrimateAI 3D scores is transformed to two new scores.
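A sketch of this re-ranking and weighted combination follows, using toy data. The function name, the most-pathogenic-first rank convention, and the single-weight blending are illustrative assumptions, not the disclosed implementation:

```python
def rerank_scores(pai3d_scores, ds_scores, ds_weight=0.5):
    """Reassign PrimateAI 3D scores to variants using DeepSequence ranks,
    then blend the re-ranked scores with the originals by a user weight.

    Both inputs are lists of per-variant scores in the same variant order.
    Rank 0 denotes the highest (most pathogenic) score.
    """
    n = len(pai3d_scores)
    # pai_order[r] = index of the variant holding PrimateAI 3D rank r.
    pai_order = sorted(range(n), key=lambda i: -pai3d_scores[i])
    # ds_rank[v] = DeepSequence rank of variant v.
    ds_rank = {v: r for r, v in enumerate(
        sorted(range(n), key=lambda i: -ds_scores[i]))}
    # The variant with DeepSequence rank r receives the PrimateAI 3D score
    # of the variant holding rank r in the PrimateAI 3D ordering.
    reranked = [pai3d_scores[pai_order[ds_rank[v]]] for v in range(n)]
    # Weighted average of the re-ranked and original scores.
    return [ds_weight * rr + (1 - ds_weight) * orig
            for rr, orig in zip(reranked, pai3d_scores)]
```

Calling the sketch twice with, e.g., `ds_weight=0.5` and `ds_weight=0.1` produces the two combined-score columns described above.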
This operation can be interpreted as a 1D convolution with a kernel size of 1 and 2 filters, as illustrated in
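One way to see the 1D-convolution view: the original and re-ranked scores form a two-channel sequence over the variants, and each weighted combination is a kernel-size-1 filter over those two channels. A sketch under that interpretation (the function name and NumPy formulation are illustrative assumptions):

```python
import numpy as np

def score_conv1x1(original, reranked, ds_weights=(0.5, 0.1)):
    """Apply the two weighted combinations as a 1D convolution with
    kernel size 1 and 2 filters over a two-channel score sequence.

    The default weights follow the DS weight = 0.5 and 0.1 examples.
    """
    # Input shape: (channels=2, length=n_variants).
    x = np.stack([original, reranked])  # channel 0: PAI-3D score,
                                        # channel 1: re-ranked score
    # Each filter is a kernel-size-1 weight vector over the 2 channels:
    # output = (1 - w) * original + w * reranked, for each weight w.
    filters = np.array([[1 - w, w] for w in ds_weights])  # shape (2, 2)
    return filters @ x                  # shape (2, n_variants)
```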
When more than two models are involved, multiple two-score re-ranking layers can be stacked on top of each other. That is, a re-ranked output of one re-ranking layer becomes an input of another re-ranking layer.
The performance of the disclosed implementations of the inter-model re-ranking system is evaluated, along with other popular classification algorithms (Sift, Polyphen2, fathmm, CADD, M-CAP, and REVEL). The first model used in the inter-model re-ranking system is PrimateAI 3D, and the second model is a VAE-based non-linear latent variable model. As illustrated, “PrimateAI-3D” refers to the first model as the benchmark. “PrimateAI-3D (re-rank)” refers to an inter-model re-ranking system, where the system reassigns the pathogenicity scores generated from PrimateAI-3D (i.e., the first model) to the variants based on the pathogenicity score rankings generated from the VAE-based non-linear latent variable model (i.e., the second model). “PrimateAI-3D (avg re-rank)” refers to another inter-model re-ranking system, where the system reassigns the pathogenicity scores generated from PrimateAI-3D (i.e., the first model) to the variants based on the pathogenicity score rankings generated from the VAE-based non-linear latent variable model (i.e., the second model), and averages the reassigned scores with the original PrimateAI-3D scores.
Two different datasets are used for the evaluation. One dataset includes rare variants from the UK Biobank found in genes where the variants are significantly associated with certain quantitative phenotypes, e.g., LDL cholesterol level. The disclosed implementation of the inter-model re-ranking system is used to predict the quantitative phenotypes of human genes from the rare variants found in the genes, along with the other classification algorithms. The p-values are generated from Spearman rank correlation.
Another dataset includes rare variants from patients with developmental delay (DDD) or autism (ASD), together with rare variants from healthy controls. The disclosed implementation of the inter-model re-ranking system is used to distinguish rare variants found in the patients with DDD or ASD from those found in the controls. The p-values are generated from the Wilcoxon rank-sum test. As illustrated in
In one implementation, the score recalibration logic 108, including the pathogenicity classifier, is communicably linked to the storage subsystem 2610 and the user interface input devices 2638.
User interface input devices 2638 can include a keyboard; pointing devices such as a mouse, trackball, touchpad, or graphics tablet; a scanner; a touch screen incorporated into the display; audio input devices such as voice recognition systems and microphones; and other types of input devices. In general, use of the term “input device” is intended to include all possible types of devices and ways to input information into computer system 2600.
User interface output devices 2676 can include a display subsystem, a printer, a fax machine, or non-visual displays such as audio output devices. The display subsystem can include an LED display, a cathode ray tube (CRT), a flat-panel device such as a liquid crystal display (LCD), a projection device, or some other mechanism for creating a visible image. The display subsystem can also provide a non-visual display such as audio output devices. In general, use of the term “output device” is intended to include all possible types of devices and ways to output information from computer system 2600 to the user or to another machine or computer system.
Storage subsystem 2610 stores programming and data constructs that provide the functionality of some or all of the modules and methods described herein. These software modules are generally executed by processors 2678.
Processors 2678 can be graphics processing units (GPUs), field-programmable gate arrays (FPGAs), application-specific integrated circuits (ASICs), and/or coarse-grained reconfigurable architectures (CGRAs). Processors 2678 can be hosted by a deep learning cloud platform such as Google Cloud Platform™, Xilinx™, and Cirrascale™. Examples of processors 2678 include Google's Tensor Processing Unit (TPU)™, rackmount solutions like GX4 Rackmount Series™, GX36 Rackmount Series™, NVIDIA DGX-1™, Microsoft Stratix V FPGA™, Graphcore's Intelligent Processor Unit (IPU)™, Qualcomm's Zeroth Platform™ with Snapdragon processors™, NVIDIA's Volta™, NVIDIA's DRIVE PX™, NVIDIA's JETSON TX1/TX2 MODULE™, Intel's Nirvana™, Movidius VPU™, Fujitsu DPI™, ARM's DynamIQ™, IBM TrueNorth™, Lambda GPU Server with Tesla V100s™, and others.
Memory subsystem 2622 used in the storage subsystem 2610 can include a number of memories including a main random access memory (RAM) 2632 for storage of instructions and data during program execution and a read only memory (ROM) 2634 in which fixed instructions are stored. A file storage subsystem 2636 can provide persistent storage for program and data files, and can include a hard disk drive, a floppy disk drive along with associated removable media, a CD-ROM drive, an optical drive, or removable media cartridges. The modules implementing the functionality of certain implementations can be stored by file storage subsystem 2636 in the storage subsystem 2610, or in other machines accessible by the processor.
Bus subsystem 2655 provides a mechanism for letting the various components and subsystems of computer system 2600 communicate with each other as intended. Although bus subsystem 2655 is shown schematically as a single bus, alternative implementations of the bus subsystem can use multiple busses.
Computer system 2600 itself can be of varying types including a personal computer, a portable computer, a workstation, a computer terminal, a network computer, a television, a mainframe, a server farm, a widely-distributed set of loosely networked computers, or any other data processing system or user device. Due to the ever-changing nature of computers and networks, the description of computer system 2600 depicted in
The technology disclosed, in particular the clauses disclosed in this section, can be practiced as a system, method, or article of manufacture. One or more features of an implementation can be combined with the base implementation. Implementations that are not mutually exclusive are taught to be combinable. One or more features of an implementation can be combined with other implementations. This disclosure periodically reminds the user of these options. Omission from some implementations of recitations that repeat these options should not be taken as limiting the combinations taught in the preceding sections—these recitations are hereby incorporated forward by reference into each of the following implementations.
One or more implementations and clauses of the technology disclosed, or elements thereof can be implemented in the form of a computer product, including a non-transitory computer readable storage medium with computer usable program code for performing the method steps indicated. Furthermore, one or more implementations and clauses of the technology disclosed, or elements thereof can be implemented in the form of an apparatus including a memory and at least one processor that is coupled to the memory and operative to perform exemplary method steps. Yet further, in another aspect, one or more implementations and clauses of the technology disclosed or elements thereof can be implemented in the form of means for carrying out one or more of the method steps described herein; the means can include (i) hardware module(s), (ii) software module(s) executing on one or more hardware processors, or (iii) a combination of hardware and software modules; any of (i)-(iii) implement the specific techniques set forth herein, and the software modules are stored in a computer readable storage medium (or multiple such media).
The clauses described in this section can be combined as features. In the interest of conciseness, the combinations of features are not individually enumerated and are not repeated with each base set of features. The reader will understand how features identified in the clauses described in this section can readily be combined with sets of base features identified as implementations in other sections of this application. These clauses are not meant to be mutually exclusive, exhaustive, or restrictive; and the technology disclosed is not limited to these clauses but rather encompasses all possible combinations, modifications, and variations within the scope of the claimed technology and its equivalents.
Other implementations of the clauses described in this section can include a non-transitory computer readable storage medium storing instructions executable by a processor to perform any of the clauses described in this section. Yet another implementation of the clauses described in this section can include a system including memory and one or more processors operable to execute instructions, stored in the memory, to perform any of the clauses described in this section.
We disclose the following clauses:
a first model (e.g., PrimateAI, PrimateAI 3D, JigsawAI) configured to generate, based in part on evolutionary conservation summary statistics [evolutionary profiles—examples: PSFM and PSSM] of amino acids in a target protein sequence,
a second model (e.g., DeepSequence in Exhibit 7, Transformer-based models (e.g., the one shown in Exhibit 3 with self-attention)) configured to generate, based in part on epistasis expressed by amino acid patterns spanning the target protein sequence and a plurality of non-target protein sequences aligned with the target protein sequence in a multiple sequence alignment,
a reassignment logic configured to reassign pathogenicity scores from the first set of pathogenicity scores to the set of variants based on the first and second score-to-rank mappings (as illustrated by the example in slide 26 of Exhibit 6); and
an output logic configured to generate a ranking of the set of variants based on the reassigned pathogenicity scores (as illustrated by the example in slide 26 of Exhibit 6 by 100% DeepSequence).
a first model configured to generate, based in part on a target protein sequence,
a second model configured to generate, based in part on the target protein sequence and a plurality of non-target homologous protein sequences aligned with the target protein sequence in a multiple sequence alignment,
a reassignment logic configured to reassign pathogenicity scores from the first set of pathogenicity scores to the set of variants based on the first and second score-to-rank mappings; and
an output logic configured to generate a ranking of the set of variants based on the reassigned pathogenicity scores.
a first model configured to generate, based in part on position-specific evolutionary conservation statistics of amino acids in a target protein sequence,
a second model configured to generate, based in part on epistasis expressed by amino acid patterns spanning the target protein sequence and a plurality of non-target protein sequences aligned with the target protein sequence in a multiple sequence alignment,
a reassignment logic configured to reassign pathogenicity scores from the second set of pathogenicity scores to the set of variants based on the first and second score-to-rank mappings; and
an output logic configured to generate a ranking of the set of variants based on the reassigned pathogenicity scores.
a first model configured to generate, based in part on a target protein sequence,
a second model configured to generate, based in part on the target protein sequence and a plurality of non-target homologous protein sequences aligned with the target protein sequence in a multiple sequence alignment,
a second score-to-rank mapping that maps a second set of pathogenicity scores for the set of variants to a second set of score rankings;
a reassignment logic configured to reassign pathogenicity scores from the second set of pathogenicity scores to the set of variants based on the first and second score-to-rank mappings; and
an output logic configured to generate a ranking of the set of variants based on the reassigned pathogenicity scores.
a first model configured to generate, based in part on three-dimensional (3D) structural information about amino acids in a target protein sequence,
a second model configured to generate, based in part on the target protein sequence and a plurality of non-target homologous protein sequences aligned with the target protein sequence in a multiple sequence alignment,
a reassignment logic configured to reassign pathogenicity scores from the first set of pathogenicity scores to the set of variants based on the first and second score-to-rank mappings; and
a first model configured to generate, based in part on three-dimensional (3D) structural information about amino acids in a target protein sequence,
a second model configured to generate, based in part on the target protein sequence and a plurality of non-target homologous protein sequences aligned with the target protein sequence in a multiple sequence alignment,
a second score-to-rank mapping that maps a second set of pathogenicity scores for the set of variants to a second set of score rankings;
a reassignment logic configured to reassign pathogenicity scores from the second set of pathogenicity scores to the set of variants based on the first and second score-to-rank mappings; and
an output logic configured to generate a ranking of the set of variants based on the reassigned pathogenicity scores.
a first model configured to generate a first score-to-rank mapping that maps a first set of pathogenicity scores for a set of variants observed in a target protein sequence to a first set of score rankings;
a second model configured to generate a second score-to-rank mapping that maps a second set of pathogenicity scores for the set of variants to a second set of score rankings;
a reassignment logic configured to reassign pathogenicity scores from the second set of pathogenicity scores to the set of variants based on the first and second score-to-rank mappings; and
an output logic configured to generate a ranking of the set of variants based on the reassigned pathogenicity scores.
a first model configured to generate a first score-to-rank mapping that maps a first set of pathogenicity scores for a set of variants observed in a target protein sequence to a first set of score rankings, wherein pathogenicity scores in the first set of pathogenicity scores are scored on a first scale;
a second model configured to generate a second score-to-rank mapping that maps a second set of pathogenicity scores for the set of variants to a second set of score rankings, wherein pathogenicity scores in the second set of pathogenicity scores are scored on a second scale that is different from the first scale;
a rescaling logic configured to use the first and second score-to-rank mappings to rescale the pathogenicity scores in the second set of pathogenicity scores from the second scale to the first scale; and
an output logic configured to generate a ranking of the set of variants based on the rescaled pathogenicity scores.
a first model (e.g., PrimateAI, PrimateAI 3D) configured to generate, based in part on evolutionary conservation summary statistics of amino acids in a reference target protein sequence, a first set of pathogenicity scores for a set of variants that mutate the reference target protein sequence to a set of alternate protein sequences, wherein the first set of pathogenicity scores has a first set of score rankings;
a second model (e.g., DeepSequence in Exhibit 7, Transformer-based models (e.g., the one shown in Exhibit 3 with self-attention)) configured to generate, based in part on epistasis expressed by amino acid patterns spanning a multiple sequence alignment that aligns the reference target protein sequence to a plurality of non-target protein sequences, a second set of pathogenicity scores for the set of variants, wherein the second set of pathogenicity scores has a second set of score rankings;
a rank loss determination logic configured to determine a rank loss parameter based on a comparison of the first set of score rankings against the second set of score rankings;
a loss function reconfiguration logic configured to reconfigure a loss function based on the rank loss parameter; and
a training logic configured to use the reconfigured loss function to train the first model.
a third model (e.g., DeepSequence in Exhibit 7, Transformer-based models (e.g., the one shown in Exhibit 3 with self-attention)) configured to generate, based in part on the epistasis expressed by the amino acid patterns spanning the multiple sequence alignment, a third set of pathogenicity scores for the set of variants, wherein the third set of pathogenicity scores has a third set of score rankings;
the rank loss determination logic further configured to determine the rank loss parameter based on a comparison of the first set of score rankings, the second set of score rankings, and the third set of score rankings;
the loss function reconfiguration logic further configured to reconfigure the loss function based on the rank loss parameter; and
the training logic further configured to use the reconfigured loss function to train the first model.
a fourth model (e.g., JigsawAI) configured to generate, based in part on masked representations of the evolutionary conservation summary statistics, a fourth set of pathogenicity scores for the set of variants, wherein the masked representations mask evolutionary conservation summary statistic data about at least one amino acid in the alternate protein sequences, and wherein the fourth set of pathogenicity scores has a fourth set of score rankings;
the rank loss determination logic further configured to determine the rank loss parameter based on a comparison of the first set of score rankings, the second set of score rankings, and the fourth set of score rankings;
the loss function reconfiguration logic further configured to reconfigure the loss function based on the rank loss parameter; and
the training logic further configured to use the reconfigured loss function to train the first model.
the rank loss determination logic further configured to determine the rank loss parameter based on a comparison of the first set of score rankings, the second set of score rankings, the third set of score rankings, and the fourth set of score rankings;
the loss function reconfiguration logic further configured to reconfigure the loss function based on the rank loss parameter; and
the training logic further configured to use the reconfigured loss function to train the first model.
the loss function reconfiguration logic further configured to reconfigure, based on the rank loss parameter, a first loss function for the first model and a fourth loss function for the fourth model; and
the training logic further configured to use the reconfigured first function to train the first model, and to use the reconfigured fourth function to train the fourth model.
This application claims benefit of U.S. Provisional Patent Application No. 63/294,828, titled, “Inter-Model Prediction Score Recalibration,” by Tobias HAMP and Kai-How FARH, filed Dec. 29, 2021 (Attorney Docket No. ILLM 1067-1/IP-2301-PRV); U.S. Provisional Patent Application No. 63/294,813, titled, “Periodic Mask Pattern for Revelation Language Models,” by Tobias HAMP, Anastasia Susanna Dagmar DIETRICH, Yibing WU, Jeffrey Mark EDE, Kai-How FARH, filed Dec. 29, 2021 (Attorney Docket No. ILLM 1063-1/IP-2296-PRV); U.S. Provisional Patent Application No. 63/294,816, titled, “Classifying Millions of Variants of Uncertain Significance Using Primate Sequencing And Deep Learning,” by Hong GAO, Tobias HAMP, Joshua Goodwin Jon MCMASTER-SCHRAIBER, Laksshman SUNDARAM, Kai-How FARH, filed Dec. 29, 2021 (Attorney Docket No. ILLM 1064-1/IP-2297-PRV); U.S. Provisional Patent Application No. 63/294,820, titled, “Identifying Genes With Differential Selective Constraint Between Humans And Non-human Primates,” by Hong GAO, Joshua Goodwin Jon MCMASTER-SCHRAIBER, Kai-How FARH, filed Dec. 29, 2021 (Attorney Docket No. ILLM 1065-1/IP-2298-PRV); U.S. Provisional Patent Application No. 63/294,827, titled, “Deep Learning Network for Evolutionary Conservation,” by Sabrina RASHID and Kai-How FARH, filed Dec. 29, 2021 (Attorney Docket No. ILLM 1066-1/IP-2299-PRV); U.S. Provisional Patent Application No. 63/294,830, titled, “Species-differentiable Evolutionary Profiles,” by Tobias HAMP and Kai-How FARH, filed Dec. 29, 2021 (Attorney Docket No. ILLM 1068-1/IP-2302-PRV). The provisional applications are hereby incorporated by reference for all purposes.
Number | Date | Country
---|---|---
63294830 | Dec 2021 | US
63294828 | Dec 2021 | US
63294827 | Dec 2021 | US
63294820 | Dec 2021 | US
63294816 | Dec 2021 | US
63294813 | Dec 2021 | US