Aspects of this technology are described in an article entitled “Auto-regressive WaveNet Variational Autoencoders for Alignment-free Generative Protein Design and Fitness Prediction,” ICLR2022 Machine Learning for Drug Discovery (2022). The article was published online Apr. 29, 2022, and is incorporated herein by reference in its entirety.
The present disclosure is directed to a system, method, and computer-readable storage medium for generation of novel synthetic protein sequences and for function prediction through reconstruction of input protein sequences. In particular, the disclosure is directed to a deep generative model having a dilated convolutional neural network encoder and a decoder employing dilated causal convolutions.
Protein sequences from non-homologous families or within homologous families with high variability and diverse lengths present challenges in the construction of multiple-sequence alignments. See Ethan C Alley, Grigory Khimulya, Surojit Biswas, Mohammed AlQuraishi, and George M Church, “Unified rational protein engineering with sequence-based deep representation learning,” Nature Methods, 16(12):1315-1322, 2019; Surojit Biswas, Grigory Khimulya, Ethan C Alley, Kevin M Esvelt, and George M Church, “Low-n protein engineering with data-efficient deep learning,” Nature Methods, 18(4):389-396, 2021; and Jung-Eun Shin, Adam J Riesselman, Aaron W Kollasch, Conor McMahon, Elana Simon, Chris Sander, Aashish Manglik, Andrew C Kruse, and Debora S Marks, “Protein design and variant prediction using autoregressive generative models,” Nature Communications, 12(1):1-11, 2021, each incorporated herein by reference in their entirety.
Recently, deep generative models (DGMs) have been highly successful in novel protein design and could enable an unprecedented level of control in therapeutic and industrial applications. One DGM approach is variational autoencoders (VAEs), which can infer higher-order amino acid dependencies useful for predicting the fitness effects of mutations. Additionally, such a model infers a latent space distribution, which enables the model to learn biologically meaningful representations. Another DGM approach is autoregressive models, commonly implemented for language or audio tasks, which have been intensively explored for the generation of unaligned protein sequences.
Deep generative models are models for learning high-dimensional data distributions and generating novel data samples indistinguishable from the true data. Consequently, DGMs are promising for synthetic protein design. For example, autoregressive models (i.e., language or audio generative models) have no dependency on sequence alignments, allowing these models to learn and generate novel sequences with high variability and diverse lengths. However, one major limitation of autoregressive models is their inability to infer meaningful representations or conditional information (e.g., latent vectors).
In contrast, variational autoencoders (VAEs) can infer a latent space and generate novel data indistinguishable from a true data distribution. These models have been shown to effectively predict single-mutant effects, infer a homologous family's phylogeny through the latent space, and diversify synthetic AAV capsids. See Sam Sinai, Nina Jain, George M Church, and Eric D Kelsic, “Generative AAV capsid diversification by latent interpolation,” bioRxiv, 2021, incorporated herein by reference in its entirety. While these models can infer a biologically meaningful latent space, they struggle to implement powerful and expressive decoders (i.e., generators) because VAEs are prone to posterior collapse. See Shengjia Zhao, Jiaming Song, and Stefano Ermon, “InfoVAE: Balancing learning and inference in variational autoencoders,” in Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pp. 5885-5892, 2019; Xi Chen, Diederik P Kingma, Tim Salimans, Yan Duan, Prafulla Dhariwal, John Schulman, Ilya Sutskever, and Pieter Abbeel, “Variational lossy autoencoder,” arXiv preprint arXiv:1611.02731, 2016; Aaron van den Oord, Oriol Vinyals, et al., “Neural discrete representation learning,” Advances in Neural Information Processing Systems, 30, 2017; and Zichao Yang, Zhiting Hu, Ruslan Salakhutdinov, and Taylor Berg-Kirkpatrick, “Improved variational autoencoders for text modeling using dilated convolutions,” in International Conference on Machine Learning, PMLR, 2017, each incorporated herein by reference in their entirety. Therefore, VAEs struggle to incorporate autoregressive decoders for generating variable-length sequences and inferring alignment-free homologous protein datasets.
An aspect of the present disclosure is a method of generating protein sequences, the method including encoding, using a dilated convolutional encoder, a plurality of input protein sequences onto a latent space distribution; and decoding, using a decoder employing dilated causal convolutions, the latent space distribution to generate new protein sequences different from the input protein sequences.
A further aspect of the present disclosure is a method of generating protein sequences using a system including an encoder coupled to an autoregressive generator, and having been trained with a loss function that comprises reconstruction loss and a mutual information maximization term, the method including encoding, using the encoder, a plurality of input protein sequences onto a latent space distribution; and decoding, using the autoregressive generator, the latent space distribution to generate new protein sequences different from the input protein sequences.
A further aspect of the present disclosure is a method of training a system for generating protein sequences, the system including an encoder that encodes a plurality of input protein sequences onto a latent space distribution, and an autoregressive generator that decodes the latent space distribution to generate new protein sequences different from the input protein sequences, the method including training the system with a loss function that comprises reconstruction loss and a mutual information maximization term.
A further aspect of the present disclosure is a system for generating protein sequences, including an encoder configured to encode a plurality of input protein sequences onto a latent space distribution; and an autoregressive generator configured to decode the latent space distribution to generate new protein sequences different from the input protein sequences, in which the system is trained with a loss function that includes reconstruction loss and a mutual information maximization term.
A further aspect of the present disclosure is a non-transitory computer-readable medium storing a program that, when executed by processing circuitry, causes the processing circuitry to perform a method of generating protein sequences using a system including an encoder coupled to an autoregressive generator, and having been trained with a loss function that includes reconstruction loss and a mutual information maximization term, the method including encoding, using the encoder of a variational autoencoder, a plurality of input protein sequences onto a latent space distribution; and decoding, using the autoregressive generator, the latent space distribution to generate new protein sequences different from the input protein sequences.
A more complete appreciation of this disclosure and many of the attendant advantages thereof will be readily obtained as the same becomes better understood by reference to the following detailed description when considered in connection with the accompanying drawings.
A problem that occurs when combining autoregressive models and VAEs for protein design and fitness prediction is that VAEs are prone to posterior collapse when implemented with an expressive decoder. A disclosed solution is to use a VAE with a WaveNet-based decoder. WaveNet-based generators can provide inexpensive training time and computation cost relative to recurrent neural networks (RNNs) and avoid vanishing gradients, because WaveNet leverages dilated causal convolutions. In addition, to avoid posterior collapse, the disclosed model implements and adapts an Information Maximizing VAE (InfoVAE) loss objective, instead of a standard Evidence Lower Bound (ELBO) training objective, to a semi-supervised setting with an autoregressive reconstruction loss. Further, in one embodiment, the disclosed model is extended from an unsupervised to a semi-supervised learning paradigm for fitness prediction tasks, enabling effective prediction of fitness from sequences alone.
In one embodiment, the disclosed model's performance is benchmarked on FLIP and TAPE datasets for protein function prediction. To illustrate the model's performance for protein design, the model is trained on unaligned homologous sequence libraries of, for example, the SH3 domain and AroQ chorismate mutase enzymes. Then, the trained model is deployed to generate novel (variable-length) sequences that are computationally predicted to fold into native structures and possess natural function. The results demonstrate that combining a semi-supervised InfoVAE model with a WaveNet-based generator provides a robust framework for functional prediction and generative protein design, without requiring multiple sequence alignments.
As mentioned above, in order to successfully combine VAEs with autoregressive generators and overcome posterior collapse, a disclosed embodiment incorporates an Information Maximizing (InfoMax) loss objective instead of the common ELBO training objective. See Samuel R Bowman, Luke Vilnis, Oriol Vinyals, Andrew M Dai, Rafal Jozefowicz, and Samy Bengio, “Generating sentences from a continuous space,” arXiv preprint arXiv:1511.06349, 2015, incorporated herein by reference in its entirety. The InfoMax loss is similar to ELBO; however, prefactor weights are introduced to motivate better inference and regularization. Further, a mutual information maximization term is introduced to explicitly encourage high mutual information between the input vectors and latent space embeddings. A WaveNet-based autoregressive generator is implemented as the decoder, which avoids vanishing or exploding gradients by leveraging dilated causal convolutions. See Aaron van den Oord, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alex Graves, Nal Kalchbrenner, Andrew Senior, and Koray Kavukcuoglu, “Wavenet: A generative model for raw audio,” arXiv preprint arXiv:1609.03499, 2016, incorporated herein by reference in its entirety.
Previously, models have been developed that combine VAEs with dilated causal convolutions as the decoder component for text generation. However, this approach does not carry over to protein design and fitness prediction. Disclosed embodiments expand this modeling approach by incorporating an InfoMax loss objective for improving amortized inference and avoiding posterior collapse. In one embodiment, the convolutions are much faster than recurrent networks during training time, offer superior inference of long-range correlations, and are computationally lighter-weight than standard convolutional filters. The InfoVAE can infer biologically meaningful latent spaces while incorporating an expressive autoregressive generator. The InfoVAE training objective is extended to a semi-supervised learning paradigm for fitness landscape prediction.
The original WaveNet model includes causal convolutions. By using causal convolutions, the model cannot violate the ordering in which the data is modeled. However, one problem with causal convolutions is that they require many layers or large filters to increase the receptive field. The conditional WaveNet-based architecture 1610 uses dilated convolutions to increase the receptive field by orders of magnitude without greatly increasing the computational cost.
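The following non-limiting sketch, written in Python with PyTorch, illustrates how stacking dilated causal convolutions grows the receptive field exponentially with depth at modest computational cost; the module name, channel count, and layer count are illustrative assumptions and do not correspond to the specific architecture 1610.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalConv1d(nn.Module):
    """A 1-D convolution made causal by left-padding, so the output at position t
    depends only on inputs at positions <= t (the data ordering is not violated)."""
    def __init__(self, channels, kernel_size=2, dilation=1):
        super().__init__()
        self.left_pad = (kernel_size - 1) * dilation
        self.conv = nn.Conv1d(channels, channels, kernel_size, dilation=dilation)

    def forward(self, x):                                  # x: (batch, channels, length)
        return self.conv(F.pad(x, (self.left_pad, 0)))     # pad only on the left

# Doubling the dilation at each layer (1, 2, 4, ...) increases the receptive field to
# (kernel_size - 1) * sum(dilations) + 1 = 64 here, using only six layers.
stack = nn.Sequential(*[CausalConv1d(64, kernel_size=2, dilation=2 ** i) for i in range(6)])
x = torch.randn(1, 64, 128)                                # a toy embedded protein sequence
print(stack(x).shape)                                      # torch.Size([1, 64, 128])
```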
The WaveNet decoder shown in the drawings transforms the input protein sequence data 1612 using dilated causal convolutions within each WaveNet layer 1710.
Similarly, simple 1×1 convolution layers 1732, 1734 (i.e., instead of dilated causal convolutions) are used to transform the latent input vectors 1606. These transformed latent vectors are passed through two independent operations: one operation corresponds to the “signal” path 1742, and the second corresponds to the “gate” path 1744. After the protein input 1612 and latent data 1606 are passed through the signal path 1742, the output transformations are only summed together and are not followed by a nonlinear activation function, which is illustrated by “(+).” In the gate path 1744, the output representations for the protein and latent input data are likewise summed, but a nonlinear activation function, specifically the sigmoid function 1736, further transforms the output. After both the signal 1742 and gate 1744 paths are complete and ready to merge the protein and latent input data, the final results of the signal and gate paths are multiplied together, which is illustrated by “(*).” Then, this multiplied output is independently passed through two more 1×1 convolution operations 1752, 1754, where one result is added back to the “Residual” connection 1706, and the second output is added and passed through a “Skip” connection 1756. The output which went through the “Residual” path 1712 is given to the next WaveNet layer 1710 of the WaveNet module 1610 as the input protein sequence data 1612 “x,” while the latent input data 1606 for the next layer remains the same. The output of the “Skip” connection 1756 is accumulated (summed, “(+)”) with the corresponding output transformation of each WaveNet layer 1710 implemented in the WaveNet module 1610. After the accumulation over all WaveNet layers 1710 is complete, the result is passed through a “WaveNet top head” 1760, which passes the data through ReLU activation functions 1762 and 1×1 convolution layers 1764 until it reaches a Softmax operation 1766 that reconstructs the input sequence to provide an output protein sequence 1770.
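A non-limiting sketch of one conditional gated layer following the signal/gate, residual, and skip structure described above is shown below in Python with PyTorch; the class name, channel sizes, and the assumption that the latent vector 1606 is tiled along the sequence length are illustrative and are not the exact implementation of WaveNet layer 1710.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GatedConditionalLayer(nn.Module):
    """One gated layer: the protein input passes through dilated causal convolutions and the
    latent input through 1x1 convolutions; signal-path outputs are summed with no activation,
    gate-path outputs are summed and passed through a sigmoid, the two results are multiplied,
    and 1x1 convolutions produce the residual and skip outputs."""
    def __init__(self, channels, latent_dim, dilation, kernel_size=2):
        super().__init__()
        self.left_pad = (kernel_size - 1) * dilation
        self.signal_x = nn.Conv1d(channels, channels, kernel_size, dilation=dilation)
        self.gate_x = nn.Conv1d(channels, channels, kernel_size, dilation=dilation)
        self.signal_z = nn.Conv1d(latent_dim, channels, 1)   # 1x1 conv on the latent input
        self.gate_z = nn.Conv1d(latent_dim, channels, 1)
        self.residual_out = nn.Conv1d(channels, channels, 1)
        self.skip_out = nn.Conv1d(channels, channels, 1)

    def forward(self, x, z):       # x: (B, C, L) protein; z: (B, latent_dim, L) tiled latent
        xp = F.pad(x, (self.left_pad, 0))
        signal = self.signal_x(xp) + self.signal_z(z)           # summed, no activation "(+)"
        gate = torch.sigmoid(self.gate_x(xp) + self.gate_z(z))  # summed, then sigmoid
        merged = signal * gate                                  # "(*)" merge of the two paths
        residual = x + self.residual_out(merged)                # fed to the next layer as "x"
        skip = self.skip_out(merged)                            # accumulated across all layers
        return residual, skip

# The skip outputs of all layers are summed and passed through a "top head" of
# ReLU -> 1x1 conv -> ReLU -> 1x1 conv -> softmax over the amino-acid alphabet.
```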
Integrating InfoVAE with a WaveNet-Based Generator
Traditional variational autoencoders (VAEs) are prone to posterior collapse or poor amortized inference when implementing expressive decoders (e.g., autoregressive generators). Here, a VAE model is implemented to overcome posterior collapse and improve variational inference when using a WaveNet-based autoregressive decoder. In one embodiment, the unsupervised loss function is the following:
ℒ_US = ξ E_{z∼q_ϕ(z|x)}[log p_θ(x|z)] − (1−α) KL(q_ϕ(z|x) ∥ p(z)) − (α+λ−1) MMD(q_ϕ(z) ∥ p(z))   (1)
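A non-limiting sketch of Equation (1) is given below in Python with PyTorch, assuming a diagonal Gaussian posterior and a Gaussian-kernel Monte-Carlo estimator for the MMD term; the function names, kernel choice, and batch-level estimates are illustrative assumptions rather than the disclosed implementation.

```python
import torch

def rbf_mmd(z_q, z_p, sigma):
    """Maximum mean discrepancy between posterior samples z_q and prior samples z_p,
    estimated with a Gaussian (RBF) kernel of bandwidth sigma."""
    def kernel(a, b):
        d2 = (a.unsqueeze(1) - b.unsqueeze(0)).pow(2).sum(-1)   # pairwise squared distances
        return torch.exp(-d2 / (2 * sigma ** 2))
    return kernel(z_q, z_q).mean() + kernel(z_p, z_p).mean() - 2 * kernel(z_q, z_p).mean()

def info_vae_loss(recon_log_prob, mu, logvar, z, xi=1.0, alpha=0.95, lam=2.0):
    """Equation (1): L_US = xi*E[log p(x|z)] - (1-alpha)*KL(q(z|x)||p(z))
    - (alpha+lam-1)*MMD(q(z)||p(z)).  `recon_log_prob` is the mean autoregressive
    log-likelihood from the decoder; the function returns the negative objective (a loss)."""
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp(), dim=1).mean()
    z_prior = torch.randn_like(z)                        # samples from p(z) = N(0, I)
    mmd = rbf_mmd(z, z_prior, sigma=z.shape[1] ** 0.5)   # sigma = sqrt(dim(z)), as in the text
    objective = xi * recon_log_prob - (1 - alpha) * kl - (alpha + lam - 1) * mmd
    return -objective
```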
Extending Generative Model to a Semi-Supervised Paradigm
The unsupervised WaveNet VAE is extended to a semi-supervised learning paradigm for fitness landscape prediction. The main motivation of using a semi-supervised approach is based on the idea that latent representations z can be more informative for predicting y when also used for reconstructing x. In addition, semi-supervised learning is beneficial when labels are scarce, and unlabeled data is abundant, which is generally the case for protein design over large unlabeled sequence databases for which a small fraction of sequences are labeled with functional assays. The semi-supervised training objective is the following:
ℒ_SS = ℒ_US + γ E_{z∼q_ϕ(z|x)}[log p_ω(y|z)]   (2)
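A non-limiting sketch of Equation (2) is shown below, under the assumption that p_ω(y|z) is a unit-variance Gaussian regression head, so the log-likelihood term reduces (up to a constant) to a negative mean-squared error over the labeled subset of each minibatch; the names and masking scheme are illustrative.

```python
import torch
import torch.nn.functional as F

def semi_supervised_loss(unsupervised_loss, y_pred, y_true, labeled_mask, gamma=1.0):
    """Equation (2): L_SS = L_US + gamma * E[log p_w(y|z)].  With a unit-variance Gaussian
    likelihood, maximizing the log-likelihood term is equivalent (up to a constant) to
    minimizing the mean-squared error over the labeled examples only."""
    if labeled_mask.any():
        mse = F.mse_loss(y_pred[labeled_mask], y_true[labeled_mask])
    else:
        mse = torch.zeros((), device=y_pred.device)   # no labeled samples in this minibatch
    return unsupervised_loss + gamma * mse            # loss form: the MSE term is added
```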
Training and Hyperparameter Optimization
In one embodiment, during training, ξ, α, λ, γ, and σ are set to 1, 0.95, 2, 1, and √dim(z), respectively. Then, hyperparameter optimization is conducted over latent space dimensions dim(z)∈[1,20] for each fitness landscape prediction task. The optimal latent space dimension was chosen based on minimizing the negative log-likelihood and maximizing the Spearman ρ score on the validation set. In general, the prefactor loss weights can be optimized as well. The optimization algorithm used in this study was Adam with a learning rate of 1e−4. See Diederik P Kingma and Jimmy Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014, incorporated herein by reference in its entirety. For fitness prediction tasks, the number of epochs was set to 2000, and early stopping was implemented only if ρ reached a value of 0.99 on the validation set. For simplicity, the minibatch size is set to 256 samples across all fitness prediction tasks, but this too can be optimized.
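The following non-limiting skeleton illustrates the training settings stated above (Adam with a learning rate of 1e−4, minibatches of 256, up to 2000 epochs, and early stopping at a validation Spearman ρ of 0.99); the `model`, data loaders, and the `training_loss`/`predict_fitness` methods are hypothetical placeholders rather than the disclosed implementation.

```python
import torch
from scipy.stats import spearmanr

# Placeholders assumed to exist: `model` (encoder + WaveNet decoder + regression head),
# `train_loader` (minibatches of 256), and `val_loader`; `model.training_loss` is assumed
# to compute L_SS from Equation (2), and `model.predict_fitness` to return numpy arrays.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)   # Adam with learning rate 1e-4

for epoch in range(2000):                                   # up to 2000 epochs
    model.train()
    for batch in train_loader:
        optimizer.zero_grad()
        loss = model.training_loss(batch)
        loss.backward()
        optimizer.step()

    # Early stopping once the validation Spearman rho reaches 0.99
    model.eval()
    with torch.no_grad():
        y_pred, y_true = model.predict_fitness(val_loader)
    rho = spearmanr(y_pred, y_true).correlation
    if rho >= 0.99:
        break
```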
Results: Protein Design
To illustrate the advantage of combining VAEs with an autoregressive WaveNet decoder, the model was trained on unaligned homologous datasets, in one embodiment. To show that the model can handle unaligned homologous sequences, the latent embeddings and learned representations are compared between unaligned and aligned sequence datasets.
Description of the Homolog Family Datasets
The disclosed model was trained on, and generated novel sequences from, two homologous protein families: Src homology 3 (SH3) domains and AroQ chorismate mutase (CM) enzymes. The SH3 family consists of many paralogs, which are homologous sequences that diverge due to duplication events. Since the gene is duplicated in the genome, new selective pressure can act on the duplicated gene, and subsequently, paralogs can acquire new functions. The SH3 dataset size is 5611 sequences, consisting primarily of proteins found in the fungal kingdom. Of the 5611 sequences in the dataset, 4664 are labeled with functional assay measurements for osmosensing capabilities. The CM dataset differs from the SH3 dataset because all of its proteins are orthologs, which are homologous sequences that diverge due to speciation events instead of duplication events. This means most of the natural homologous CM sequences have a similar catalytic function. In addition, the CM dataset has two sets of proteins: one set corresponds to the natural homologs (1130 sequences), while the second set corresponds to synthetic designs (1618 sequences) produced using a direct coupling analysis (DCA) model that explicitly considers only pairwise epistasis. See William P Russ, Matteo Figliuzzi, Christian Stocker, Pierre Barrat-Charlaix, Michael Socolich, Peter Kast, Donald Hilvert, Remi Monasson, Simona Cocco, Martin Weigt, et al., “An evolution-based model for designing chorismate mutase enzymes,” Science, 369(6502):440-445, 2020, incorporated herein by reference in its entirety. All sequences have been functionally assayed for CM catalytic function.
Latent Space Interpretation
Generation of Novel Variable Length Sequences with Structure Prediction
To illustrate the practical advantage of using an autoregressive decoder, models trained on aligned and on unaligned sequence data are compared for both the SH3 and CM datasets. Novel sequences are generated by randomly sampling points within the latent spaces and decoding these through the WaveNet generator to produce novel protein sequences. For the SH3 system, 5611 novel sequences are sampled and generated, while for the CM system, 1130 novel sequences are generated. To check whether the generated sequences fold into a proper tertiary structure, AlphaFold2 was used to predict structures of four sequences for both the SH3 and CM tasks: the shortest generated sequence, the longest generated sequence, and two randomly selected sequences (Tables 2 and 3).
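A non-limiting sketch of this sampling procedure (drawing latent points from the prior and decoding autoregressively into variable-length sequences) is shown below; the decoder interface, token alphabet, start/stop handling, and maximum length are illustrative assumptions, and the AlphaFold2 structure-prediction step is not shown.

```python
import torch

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"      # 20 canonical residues (illustrative alphabet)
STOP = len(AMINO_ACIDS)                   # illustrative end-of-sequence token index

@torch.no_grad()
def sample_sequences(decoder, n_samples, latent_dim, max_len=120):
    """Draw z ~ N(0, I) and decode one amino acid at a time with an autoregressive decoder.
    `decoder(prefix_tokens, z)` is assumed to return next-token logits of shape (n, vocab)."""
    z = torch.randn(n_samples, latent_dim)                        # random latent-space points
    tokens = torch.full((n_samples, 1), STOP, dtype=torch.long)   # begin-of-sequence placeholder
    sequences = [""] * n_samples
    done = torch.zeros(n_samples, dtype=torch.bool)
    for _ in range(max_len):
        logits = decoder(tokens, z)                               # condition on prefix and latent
        next_tok = torch.multinomial(torch.softmax(logits, dim=-1), 1).squeeze(-1)
        for i, t in enumerate(next_tok.tolist()):
            if not done[i]:
                if t == STOP:
                    done[i] = True                                # variable-length: stop early
                else:
                    sequences[i] += AMINO_ACIDS[t]
        tokens = torch.cat([tokens, next_tok.unsqueeze(-1)], dim=1)
        if done.all():
            break
    return sequences
```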
With the SH3 task, the structure predictions of the generated sequences can be displayed in a display device 2100 along with the wild-type SH3 domain (PDB: 2VKN), as shown in the accompanying drawings.
Here, η and L_max denote the minimum Hamming distance and the length of the longest protein sequence within the natural homolog library. Interestingly, the longest SH3 domain among the generated pool is predicted to acquire an alpha helix, which was originally a hairpin loop on the WT structure (D). It is important to note that this sequence is 11 amino acids longer than the wild-type SH3 domain (Table 2), potentially allowing the generative model to extrapolate in terms of design.
See Christian Dallago, Jody Mou, Kadina E. Johnston, Bruce J. Wittmann, Nicholas Bhattacharya, Samuel Goldman, Ali Madani, and Kevin K. Yang, “Flip: Benchmark tasks in fitness landscape inference for proteins,” bioRxiv, 2022, incorporated herein by reference in its entirety.
With the CM task, the two randomly sampled sequences from the generated pool (A, B) are likewise predicted by AlphaFold2 to adopt tertiary structures in good agreement with the native CM fold.
Results: Fitness Prediction
One important goal of deep learning models for biology is to learn meaningful representations that can be leveraged on downstream tasks. For instance, a major endeavor in protein design is fitness landscape prediction and representation learning for semi- and self-supervised tasks. The deep generative model is extended to a semi-supervised paradigm in the hopes of learning biologically meaningful representations for fitness landscape prediction. A main goal is to learn a latent space z that is informative for both the generative and discriminative tasks. The intuition behind this construction is based on the idea that representations that can be used to reconstruct the training data and generate new data indistinguishable from the training data can also be more meaningful for discriminative tasks (e.g., fitness landscape prediction). To benchmark the model's learned representations, the model's ability to predict held-out test data is evaluated on four main protein systems from two popular community benchmarks: TAPE and FLIP. See Roshan Rao, Nicholas Bhattacharya, Neil Thomas, Yan Duan, Peter Chen, John Canny, Pieter Abbeel, and Yun Song, “Evaluating protein transfer learning with TAPE,” Advances in Neural Information Processing Systems, 32, 2019, incorporated herein by reference in its entirety. The four protein systems are the AAV capsid and GB1 landscapes from FLIP, and the GFP fluorescence and stability landscapes from TAPE.
Benchmark Model on Landscape Prediction Tasks
For the FLIP AAV capsid task, there were 7 different data split tasks, each relevant to protein engineering scenarios. The semi-supervised model outperforms, or performs competitively against, current baseline scores in 5 out of 6 dataset splits (Tables 4 and 5). However, the semi-supervised generative model underperforms when the training set contains only low-fitness sequences and the test set contains only high-fitness sequences. For the FLIP GB1 task, the model outperforms or performs competitively against the current baseline scores (Table 1). However, similar to the AAV capsid tasks, the model underperforms on the split in which the training and testing sets contain only low- and high-fitness sequences, respectively. For the TAPE tasks, the model performs competitively against state-of-the-art models on the GFP task (Table 6), including on the dark and bright GFP modes (Tables 8, 9). However, the model underperforms on the stability prediction task shown in Table 7. Overall, these results suggest that the generative model is not only capable of unsupervised learning and protein design, but is also capable of generalizing such that it can infer fitness landscapes and predict function from sequence alone, competitively with state-of-the-art methods.
See Tristan Bepler and Bonnie Berger, “Learning protein sequence embeddings using information from structure,” arXiv preprint arXiv:1902.08661, 2019; Amy X Lu, Haoran Zhang, Marzyeh Ghassemi, and Alan M Moses, “Self-supervised contrastive learning of protein representations by mutual information maximization,” BioRxiv, 2020; Amir Shanehsazzadeh, David Belanger, and David Dohan, “Is transfer learning necessary for protein landscape prediction?” arXiv preprint arXiv:2011.03443, 2020, each incorporated herein by reference in their entirety.
Additional 2D projections of the SH3 dataset are shown in the accompanying drawings.
Additional 2D projections of the CM latent space are provided in the accompanying drawings.
Thus, in some of the disclosed embodiments, a variational autoencoder (VAE) and an autoregressive generator (WaveNet) are combined for protein design, avoiding the need for multiple-sequence-aligned input data. To avoid posterior collapse when combining VAEs and WaveNet models, an Information Maximizing VAE (InfoVAE) is implemented, adding a mutual information term to the common ELBO training objective, improving amortized inference, and forcing the decoder to use the latent conditional information. The generative model is able to learn meaningful latent space representations from homologous protein families, which can be leveraged to design novel functional sequences with length variability. The generated sequence designs are predicted by AlphaFold2 to adopt tertiary structures in good agreement with the native fold. Additionally, when the model is trained on unaligned rather than aligned input sequences, the diversity of the generated sequences can improve, illustrating a potential advantage of using an autoregressive decoder. The model is extended to a semi-supervised learning paradigm and benchmarked on four different fitness landscape prediction tasks from FLIP and TAPE. The model can outperform many baseline scores for the AAV and GB1 tasks. In addition, the model can compete with state-of-the-art performance on the GFP task. These results and analyses suggest the deep generative model is capable of successful protein design of variable-length sequences, inference of meaningful biological representations, and effective prediction of fitness from sequence alone.
In some embodiments, the computer system 2700 may include a server CPU and a graphics card by NVIDIA, in which the GPUs have multiple CUDA cores.
The above disclosure also encompasses the embodiments listed below.
(1) A method of generating protein sequences. The method includes encoding, using a dilated convolutional encoder, a plurality of input protein sequences onto a latent space distribution; and decoding, using a decoder employing dilated causal convolutions, the latent space distribution to generate new protein sequences different from the input protein sequences.
(2) A method of generating protein sequences using a system including an encoder coupled to an autoregressive generator, and having been trained with a loss function that comprises reconstruction loss and a mutual information maximization term. The method includes encoding, using the encoder, a plurality of input protein sequences onto a latent space distribution; and decoding, using the autoregressive generator, the latent space distribution to generate new protein sequences different from the input protein sequences.
(3) The method of feature (2), in which the decoding step comprises decoding the latent space distribution using a dilated causal convolution autoregressive generator as the autoregressive generator.
(4) The method of feature (2), in which the decoding step comprises decoding the latent space distribution to generate the new protein sequences, which include sequences of different lengths.
(5) The method of feature (2), in which the encoding step comprises encoding the plurality of input protein sequences, which are unaligned.
(6) The method of feature (2), in which the encoding step comprises encoding the plurality of input protein sequences using a dilated convolutional neural network encoder.
(7) The method of feature (2), in which the encoding step comprises encoding the plurality of input protein sequences into a latent space embedding.
(8) The method of feature (2), in which the decoding step comprises predicting a next amino acid in a particular sequence, based on the particular sequence and a latent space embedding.
(9) The method of feature (2), in which the system was trained using the loss function, which further includes a semi-supervised loss.
(10) The method of any of features (2) or (3), in which the decoding step further comprises decoding the latent space distribution using the dilated causal convolution autoregressive generator, which incorporates residual and skip connections.
(11) A method of training a system for generating protein sequences, the system including an encoder that encodes a plurality of input protein sequences onto a latent space distribution, and an autoregressive generator that decodes the latent space distribution to generate new protein sequences different from the input protein sequences. The method includes training the system with a loss function that comprises reconstruction loss and a mutual information maximization term.
(12) The method of feature (11), in which the system further includes a semi-supervised learning module including a regression model with a set of training parameters that are learned by minimizing, for a subset of the latent space distribution, an error between outputs of the regression model and fitness values obtained from assay measurements; and the method further includes training the system with a modified loss function that further includes a term based on performance of the regression model.
(13) The method of feature (12), in which the term in the modified loss function is a mean-squared error term based on a ground truth and a predicted regression value of the regression model.
(14) The method of feature (12), in which the regression model is a neural network having weights as the training parameters, which are determined in the training step.
(15) The method of feature (11), in which the encoder is a dilated convolutional neural network encoder.
(16) The method of feature (11), in which the encoder learns a latent space embedding.
(17) The method of feature (11), in which the training step comprises training the system with the loss function, which further includes a semi-supervised loss.
(18) A system for generating protein sequences, including an encoder configured to encode a plurality of input protein sequences onto a latent space distribution; and an autoregressive generator configured to decode the latent space distribution to generate new protein sequences different from the input protein sequences, in which the system is trained with a loss function that includes reconstruction loss and a mutual information maximization term.
(19) The system of feature (18), in which the autoregressive generator is a dilated causal convolution autoregressive generator.
(20) The system of feature (18), in which the autoregressive generator is further configured to decode the latent space distribution to generate the new protein sequences, which include sequences of different lengths.
(21) The system of feature (18), in which the encoder is further configured to encode the plurality of input protein sequences, which are unaligned.
(22) The system of feature (18), in which the system was trained with the loss function, which further includes a semi-supervised loss.
(23) The system of feature (19), in which the dilated causal convolution autoregressive generator incorporates residual and skip connections.
(24) A non-transitory computer-readable medium storing a program that, when executed by processing circuitry, causes the processing circuitry to perform a method of generating protein sequences using a system including an encoder coupled to an autoregressive generator, and having been trained with a loss function that includes reconstruction loss and a mutual information maximization term. The method includes encoding, using the encoder, a plurality of input protein sequences onto a latent space distribution; and decoding, using the autoregressive generator, the latent space distribution to generate new protein sequences different from the input protein sequences.
This application claims the benefit of priority to Provisional Application Nos. 63/314,898, filed Feb. 28, 2022; and 63/390,663, filed Jul. 20, 2022, the entire contents of which are incorporated herein by reference.
This invention was made with government support under EB009412 awarded by the National Institutes of Health, and 1746045 awarded by the National Science Foundation. The government has certain rights in the invention. The inventors gratefully acknowledge support from the Machine Learning in the Chemical Sciences and Engineering program of The Camille and Henry Dreyfus Foundation.