A portion of the disclosure of this patent document contains material which is subject to copyright protection. The copyright owner has no objection to the facsimile reproduction by anyone of the patent document or the patent disclosure, as it appears in the Patent and Trademark Office patent file or records, but otherwise reserves all copyright rights whatsoever.
This application is related to co-pending and commonly owned U.S. Provisional Application Nos. 63/281,975, filed Nov. 22, 2021, and 63/321,916, filed Mar. 21, 2022, both of which are hereby expressly incorporated by reference herein in their entirety.
The present disclosure relates generally to protein sequencing and protein sequence design, and more specifically, to few-shot protein generation using knowledge learned from a protein family.
Proteins are composed of sequences of amino acids. A unique amino acid sequence may determine or confer a unique property of a protein, e.g., an antibody, a virus, and/or the like. Protein sequencing is the practical process of determining the amino acid sequence of all or part of a protein or peptide, which can be used to identify the protein or characterize its post-translational modifications. Conversely, designing an amino acid sequence may be used to generate a protein with certain desired properties. However, the number of possible amino acid sequences is extremely large, resulting in exponentially increased complexity in determining probable sequences for protein generation.
Therefore, there is a need for an efficient mechanism to design protein sequences.
In the figures and appendix, elements having the same designations have the same or similar functions.
As used herein, the term “network” may comprise any hardware or software-based framework that includes any artificial intelligence network or system, neural network or system and/or any training or learning models implemented thereon or therewith.
As used herein, the term “module” may comprise hardware or software-based framework that performs one or more functions. In some embodiments, the module may be implemented on one or more neural networks.
Protein engineering is the task of mutating proteins in order to achieve a desired function, and has numerous applications in medicine and sustainability. Designing such mutations can often be challenging, because inferring protein functional impact from protein structure is difficult, and the search space of possible sequence variants is combinatorially large. For example, a given mutation could cause a disproportionate effect due to being positioned within an active site or due to long-range interactions with other amino acids. In addition, introducing multiple mutations simultaneously can have complex non-linear effects, called epistasis.
Machine learning systems have been adopted for protein sequencing analysis and/or generation. For example, a machine learning system may be trained on a dataset of protein properties and the corresponding protein structure of amino acid sequences. The machine learning system can then be used to predict an amino acid sequence for protein generation, given one or more desired protein properties. The machine learning system used for protein generation is herein referred to as a “generative model.”
In another aspect, supervised training data on the functional impact of protein mutants is also limited. Acquiring supervised data means performing complicated and costly deep mutational scanning experiments. Generally, these experiments characterize the functional impact of point mutations, so experimental data on the impact of higher-order mutants can be even more scarce. Moreover, the number of variants that can be measured is limited by the assay throughput. For typical functional activity assays, this can restrict the number of feasibly measurable variants to hundreds or fewer.
In contrast, sequence data is plentiful for natural proteins. The number of known natural protein sequences has nearly tripled in the last five years, and continues to grow rapidly due to the falling cost of DNA sequencing. A central task in biological sequence analysis is to take a set of sequences representing a protein family, fit a generative model to those sequences, and then use the resulting model to search databases and classify new proteins. In this setting, families are usually represented by sets of sequences (e.g., in a multiple sequence alignment (MSA) query matrix), a protein structure, and/or the like. Thus the task is to find the parameters of a generative model that, given protein information for a family, describes the family and generalizes to unseen members. Existing sequence models used for this problem include position-specific scoring matrices (PSSMs) and profile Hidden Markov Models (pHMMs). For example, PSSMs model each column in the MSA as an independent distribution over amino acids. Profile HMMs model each amino acid as being generated conditioned on a hidden state corresponding to the column in the MSA, but this alignment is considered unobserved when calculating the probability of a new sequence. The PSSM and pHMM models are widely used because they can be inferred from relatively small sets of sequences (often only tens or hundreds), and parameter inference must be performed for each set of proteins of interest.
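As a non-limiting illustration of such a per-family baseline, the following sketch fits a PSSM by modeling each MSA column as an independent categorical distribution over amino acids estimated from column-wise counts with a pseudocount; the function names and toy alignment are illustrative only, not part of the disclosed embodiments.

```python
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY-"  # 20 amino acids plus a gap symbol

def fit_pssm(msa_rows, pseudocount=1.0):
    """Fit a position-specific scoring matrix: one independent
    categorical distribution over symbols per MSA column."""
    index = {a: i for i, a in enumerate(AMINO_ACIDS)}
    n_cols = len(msa_rows[0])
    counts = np.full((n_cols, len(AMINO_ACIDS)), pseudocount)
    for row in msa_rows:
        for j, symbol in enumerate(row):
            counts[j, index[symbol]] += 1.0
    return counts / counts.sum(axis=1, keepdims=True)

def log_likelihood(pssm, sequence):
    """Score an aligned sequence under the PSSM (columns treated as independent)."""
    index = {a: i for i, a in enumerate(AMINO_ACIDS)}
    return float(sum(np.log(pssm[j, index[s]]) for j, s in enumerate(sequence)))

# Toy usage: three aligned family members, then score a new candidate.
family = ["ACD-", "ACE-", "GCD-"]
pssm = fit_pssm(family)
print(log_likelihood(pssm, "ACD-"))
```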
In view of these learning limitations in protein engineering, embodiments described herein provide a system for building generative models of proteins based on sequence-to-sequence learning. Specifically, sequence modeling is formulated as a few-shot learning problem such that a single encoder-decoder model is trained to receive and encode information of a protein family. The information can take the form of an amino acid sequence, a set of amino acid sequences that belong to the same protein family, a multiple sequence alignment (MSA) matrix, a protein structure, and/or the like. The information of a protein family may be used as input to an encoder, which encodes the information into a protein representation that is then decoded into a probability distribution over sequences from that protein family. In other words, the encoder-decoder model outputs a sequence that may represent a new protein for the protein family, conditioned on the learned encoding of the input protein information of the protein family. In this way, the encoder-decoder model may be trained to handle different protein families, circumventing the need for fitting dedicated family models.
In one embodiment, the encoder-decoder model may be trained on sequencing information from tens of thousands of protein families. In some implementations, the protein sequencing information may be input in the form of sequences of tokens. In another implementation, the protein sequencing information may be learned in a form such as MSAs representing known protein families, and the model may then receive unseen families held out from training at the inference stage.
In this way, the generative encoder-decoder model learns to infer statistical sequence models of proteins that are substantially more accurate (lower perplexity) than PSSMs and pHMMs without requiring training on new protein families. Instead, the proposed generative model extrapolates directly from the multiple sequence alignment and learns how to infer evolutionary constraints from the training families.
In one embodiment, the generative model 105 may generate, in response to the input MSA query 103 (denoted by X), a probability distribution 125 over target protein sequences (y's), e.g., p(y|X). The target protein sequences are a set of possible protein sequences in the same protein family as the input sequences 102a-n. Thus, the generative model 105 is trained on a training set of (MSA, target protein) pairs (X_k, y_k), where k ∈ 1, . . . , K denotes the kth pair. For example, the target protein y_k is a member of the same family as the sequences in X_k, but X_k does not contain the target protein sequence y_k itself.
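As a non-limiting illustration, one way to construct such hold-one-out (X_k, y_k) training pairs from a single family may be sketched as follows; the helper names and toy sequences are illustrative, and the alignment of the context sequences into an MSA is omitted for brevity.

```python
import random

def make_training_pairs(family_sequences, context_size=8, n_pairs=4, seed=0):
    """Build (X_k, y_k) pairs from one family: y_k is a held-out member,
    X_k is a set of other members of the same family (an unaligned
    stand-in for the MSA query)."""
    rng = random.Random(seed)
    pairs = []
    for _ in range(n_pairs):
        target = rng.choice(family_sequences)
        others = [s for s in family_sequences if s != target]
        context = rng.sample(others, min(context_size, len(others)))
        pairs.append((context, target))
    return pairs

# Toy usage on five illustrative family members.
pairs = make_training_pairs(["ACDE", "ACDF", "GCDE", "ACDD", "GCDF"])
context, target = pairs[0]
```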
In one embodiment, the generative model 105 may be a transformer model that comprises an MSA-encoder 110 and a sequence-to-sequence decoder 120 neural network architecture. The detailed structures of the encoder 110 and the decoder 120 are described in relation to
It is noted that the example embodiment shown in
In one embodiment, the MSA encoder 110 accepts tokens x_{i,j} in the MSA query 103 as input and returns a vector representation 119 for each position in the MSA query, z_{i,j} ∈ R^d, where d is the dimension of the learned embedding. The MSA encoder 110 is parameterized as a stack of transformer layers comprising axial attention over the rows 201, axial attention over the columns of the MSA 202, and a feed forward layer 203.
Before being processed by the transformer stack, the input tokens x_{i,j} are preprocessed by an input embedding module 108, which embeds the input tokens into vectors in R^d and augments them with a random Fourier projection of the column index as a positional embedding, also in R^d. No positional embedding is used for the rows of the MSA (the sequence index), because the ordering of sequences in an MSA is arbitrary and the MSA encoder 110 is expected to be invariant to the specific ordering of sequences in the input.
Specifically, the input embedding module 108 forms the input embeddings by adding a learned embedding for each amino acid to the random Fourier feature embedding of the column index as follows. First, the amino acid token is embedded by learned embeddings:
x_{i,j}^{aa} = W_{x_{i,j}},  x_{i,j}^{aa} ∈ R^d,  W ∈ R^{K×d},
where W is a matrix of learnable amino acid embeddings of dimension d, K is the size of the vocabulary (22 in the case of 20 amino acids plus gap and start/end tokens), and x_{i,j} indicates the amino acid at position (i, j) of the MSA. Next, the column index is embedded by:
x_{i,j}^{pos} = W^{pos} cos(r·j + b),  x_{i,j}^{pos} ∈ R^d,  r ∈ R^d,  b ∈ R^d,  W^{pos} ∈ R^{d×d},
where W^{pos} is a learnable matrix, r is a random vector drawn from Normal(0, 1), and b is a random vector drawn from Uniform(0, 2π). The input embedding is then formed by:
z_{i,j} = x_{i,j}^{aa} + x_{i,j}^{pos}.
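As a non-limiting illustration of the input embedding described above, the following PyTorch sketch adds a learned amino-acid embedding to a fixed random Fourier feature embedding of the column index; the class and buffer names are illustrative, and layer normalization and dropout are omitted for brevity.

```python
import math
import torch
import torch.nn as nn

class MSAInputEmbedding(nn.Module):
    """Embeds MSA tokens: learned amino-acid embedding plus a fixed random
    Fourier feature embedding of the column index (no row positional
    embedding, so the ordering of sequences does not matter)."""

    def __init__(self, vocab_size=22, d_model=768):
        super().__init__()
        self.aa_embed = nn.Embedding(vocab_size, d_model)     # W in the text
        self.w_pos = nn.Linear(d_model, d_model, bias=False)  # W^pos
        # r ~ Normal(0, 1), b ~ Uniform(0, 2*pi), fixed (not trained)
        self.register_buffer("r", torch.randn(d_model))
        self.register_buffer("b", 2 * math.pi * torch.rand(d_model))

    def forward(self, tokens):
        # tokens: (batch, n_rows, n_cols) integer amino-acid indices
        n_cols = tokens.shape[-1]
        cols = torch.arange(n_cols, dtype=torch.float32, device=tokens.device)
        # random Fourier features of the column index: cos(r * j + b)
        pos = torch.cos(cols[:, None] * self.r[None, :] + self.b[None, :])
        pos = self.w_pos(pos)                  # (n_cols, d_model)
        return self.aa_embed(tokens) + pos     # broadcast over rows

# Example: a batch of 2 MSAs, each with 5 sequences of length 10.
emb = MSAInputEmbedding()
z = emb(torch.randint(0, 22, (2, 5, 10)))
print(z.shape)  # torch.Size([2, 5, 10, 768])
```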
The MSA encoder transformer layers comprise axial self-attention layers 201-202 along the rows and columns of the MSA matrix 103 followed by a fully connected feed forward layer 203. The row-attention layer 201 comprises a normalization layer 111 and an attention layer 112 such that the axial row attention is preceded by layer normalization and uses residual connections. Similarly, the column-attention layer 202 comprises a normalization layer 113 and an attention layer 114 such that the axial column attention is preceded by layer normalization and uses residual connections. The feed forward layer 203 comprises a normalization layer 115, a linear projection layer 116, a GeLu layer 117 and another linear projection layer 118.
Within-row and within-column attention can therefore be computed efficiently by calculating the multi-headed attention operation over the rows or columns batch-wise. For example, given a batch of intermediate MSA representations Z with dimensions B×N×M×d, where B is the number of MSA queries in the batch, N is the number of rows in each MSA query matrix, M is the number of columns in each MSA query matrix, and d is the size of the input embedding of the MSA query, per-row self-attention can be calculated by treating rows as part of the batch dimension (BN×M×d), and per-column self-attention can be calculated by treating columns as part of the batch dimension (BM×N×d).
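As a non-limiting illustration of this batch-wise axial attention, the following sketch folds the non-attended axis into the batch dimension; torch.nn.MultiheadAttention serves as an illustrative stand-in for the attention layers 112 and 114, and the preceding layer normalizations are omitted for brevity.

```python
import torch
import torch.nn as nn

def axial_attention(z, row_attn, col_attn):
    """z: (B, N, M, d) MSA representations.
    Row attention treats rows as extra batch entries ((B*N, M, d));
    column attention treats columns the same way ((B*M, N, d))."""
    B, N, M, d = z.shape

    # Attend within each row (across columns), with a residual connection.
    rows = z.reshape(B * N, M, d)
    rows, _ = row_attn(rows, rows, rows, need_weights=False)
    z = z + rows.reshape(B, N, M, d)

    # Attend within each column (across rows), with a residual connection.
    cols = z.permute(0, 2, 1, 3).reshape(B * M, N, d)
    cols, _ = col_attn(cols, cols, cols, need_weights=False)
    z = z + cols.reshape(B, M, N, d).permute(0, 2, 1, 3)
    return z

d = 64
row_attn = nn.MultiheadAttention(d, num_heads=4, batch_first=True)
col_attn = nn.MultiheadAttention(d, num_heads=4, batch_first=True)
out = axial_attention(torch.randn(2, 5, 10, d), row_attn, col_attn)
print(out.shape)  # torch.Size([2, 5, 10, 64])
```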
In one embodiment, before being passed into the decoder transformer layers, the target sequence is embedded following the same scheme used by the input embedding module 108 for the MSA along the column dimension. When decoding, the target sequence is padded to begin with a start token and end with a stop token.
The embedded and padded target sequence is then sent to a normalization layer 121, followed by the causal self-attention layer 122 and another normalization layer 124. Meanwhile, the MSA representations 119 are sent to a normalization layer 123. The outputs from normalization layers 123 and 124 are then sent to the MSA cross-attention layer 125, which computes cross-attention between the self-attended target sequence and the MSA representations. Specifically, in the MSA cross-attention layer 125, each position of the target sequence attends over the complete MSA representations for each attention head (i.e., L×N×M×H, where L is the length of the target sequence and H is the number of heads). The cross-attention can thus be computed efficiently by flattening the MSA representations along the row and column dimensions such that each MSA representation, z_{i,j}, is a single key in the cross-attention layer 125. The cross-attention outputs are then sent to the feed forward layer, which in turn applies a normalization layer 126, a linear projection 127, a GeLu operation 128, and another linear projection 129.
Decoding the target sequence is performed using the decoder representations by learning a transformation from a_k to the probability distribution over the (k+1)th token. This decoding process is formulated as a linear transformation of a_k into a vector of dimension equal to the number of tokens, followed by a softmax 131 to give the probability of each token, p(y_{k+1} | a_k) = p(y_{k+1} | y_{1,...,k}, X).
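As a non-limiting illustration, the cross-attention over flattened MSA representations and the final linear-plus-softmax decoding head may be sketched as follows; the class name is illustrative, and the causal self-attention and layer normalization steps that precede this stage are omitted.

```python
import torch
import torch.nn as nn

class DecoderHeadSketch(nn.Module):
    """Cross-attention from decoder states to flattened MSA representations,
    followed by a linear projection and softmax over the token vocabulary."""

    def __init__(self, d_model=768, n_heads=8, vocab_size=22):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.to_logits = nn.Linear(d_model, vocab_size)

    def forward(self, a, z):
        # a: (B, L, d) decoder self-attention states for the target sequence
        # z: (B, N, M, d) MSA representations from the encoder
        B, N, M, d = z.shape
        keys = z.reshape(B, N * M, d)          # each z_{i,j} becomes one key/value
        a, _ = self.cross_attn(a, keys, keys, need_weights=False)
        logits = self.to_logits(a)             # (B, L, vocab)
        return torch.softmax(logits, dim=-1)   # p(y_{k+1} | y_1..k, X) per position

head = DecoderHeadSketch(d_model=64, n_heads=4)
probs = head(torch.randn(2, 7, 64), torch.randn(2, 5, 10, 64))
print(probs.shape)  # torch.Size([2, 7, 22])
```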
The decoding process may be applied to yield different outputs. For example, the decoder may generate, token by token based on the probability of each token, a new protein sequence that differs from each of the plurality of amino acid sequences in the MSA query matrix but belongs to the same protein family. For another example, given a specific protein sequence, the decoder may determine a score indicating a likelihood that the given protein sequence belongs to the same protein family. In another example, given a number of sampled protein sequences, the decoder may determine, based on the decoded probabilities, a recommended protein sequence that has the highest likelihood of belonging to the protein family among the given protein sequences.
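As a non-limiting illustration, these three uses (token-by-token generation, scoring a given sequence, and recommending the most likely of several candidates) may be sketched as follows, assuming an illustrative model interface that maps (MSA tokens, target prefix) to per-position next-token probabilities; the helper names and token ids are illustrative.

```python
import torch

def score_sequence(model, msa_tokens, target_tokens):
    """Sum of log p(y_k | y_<k, X): a family-membership score for one sequence."""
    with torch.no_grad():
        probs = model(msa_tokens, target_tokens[:, :-1])       # (1, L-1, vocab)
    token_probs = probs.gather(-1, target_tokens[:, 1:, None]).squeeze(-1)
    return token_probs.log().sum().item()

def sample_sequence(model, msa_tokens, start_id, end_id, max_len=400):
    """Generate a new family member token by token from the decoder."""
    generated = torch.tensor([[start_id]])
    for _ in range(max_len):
        probs = model(msa_tokens, generated)[:, -1]            # next-token distribution
        next_token = torch.multinomial(probs, 1)
        generated = torch.cat([generated, next_token], dim=1)
        if next_token.item() == end_id:
            break
    return generated.squeeze(0).tolist()

def rank_candidates(model, msa_tokens, candidates):
    """Recommend the candidate with the highest family-membership score."""
    return max(candidates, key=lambda seq: score_sequence(model, msa_tokens, seq))
```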
In this way, the encoder-decoder model may be trained to generate a predicted protein sequence in a sequence-from-sequences manner. For example, the decoder may generate predicted sequences in a protein family conditioned on previously generated sequences that belong to the family.
At the decoder 120, the target sequence may be padded with a start token and an end token <end>, and causal self-attention may be applied. The self-attentions 137 (e.g., corresponding to token "A") may then be combined through cross-attention against the MSA representations 119 to produce the output probability over the next token (e.g., "F") in the target sequence.
It is noted that the example embodiments described in relation to
At step 502, a training input pair of first information representing a first protein belonging to a first protein family and a first target protein belonging to the first protein family is received. In one implementation, the first information representing the first protein may include an amino acid sequence, a first multiple sequence alignment (MSA) query matrix (e.g., 103 in
In one implementation, high-performing mutants may be sampled from the first protein family as the first target protein sequence for training the encoder and the decoder.
In one implementation, an input embedding of entries in the training data may be generated. For example, when the training input includes an MSA query matrix, a first embedding is generated for each amino acid token in the MSA query matrix, and a second embedding is generated based on a random Fourier feature embedding of the column index of the MSA query matrix. The input embedding is then formed by adding the first embedding and the second embedding.
In another example, the training pair may be obtained from full Pfam family alignments. In order to evaluate the performance of the model on unseen families, the Pfam sequences at the family level may be split into 10,593 training, 563 validation, and 2,654 test families.
At step 504, an encoder (e.g., 110 in
At step 508, a decoder (e.g., 120 in
At step 510, a loss function may be computed based on a log-likelihood of the predicted probability of the first target protein sequence conditioned on the first information representing the first protein (e.g., MSA query matrix 103 in
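As a non-limiting illustration, such a loss may be computed as a token-level negative log-likelihood; the sketch below assumes the decoder outputs per-position probability distributions and that padded positions are masked with an illustrative pad id.

```python
import torch
import torch.nn.functional as F

def sequence_nll(probs, target_tokens, pad_id=0):
    """Negative log-likelihood of the target protein sequence given the MSA.
    probs: (B, L, vocab) next-token distributions from the decoder.
    target_tokens: (B, L) shifted ground-truth token indices."""
    log_probs = torch.log(probs.clamp_min(1e-9))
    return F.nll_loss(log_probs.transpose(1, 2), target_tokens,
                      ignore_index=pad_id, reduction="mean")
```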
At step 512, the machine learning model comprising the encoder and the decoder may then be updated based on the computed loss function, e.g., via backpropagation. For example, the machine learning model is trained with the following specific hyperparameters: six encoder and decoder layers each with hidden dimension (d) of 768. The MSA encoder uses 12 attention heads and the decoder uses 8 attention heads.
In one implementation, the training may adopt the Adam variant of stochastic gradient descent, using a linear ramp-up, square-root decay learning rate schedule with a learning rate of 0.0001 and 4,000 warmup steps. A total minibatch size of 256 is spread over 16 GPUs using distributed training. Each GPU processes minibatches of size 1 with gradient accumulation over 16 steps to give the total effective minibatch size. In order to reduce GPU RAM consumption, sequences and MSAs are randomly sampled to a maximum length of 402 tokens during training. Furthermore, MSAs are randomly downsampled to contain between 1 and 50 sequences. During training, the loss is monitored on a validation set of MSAs, and training is stopped when the validation loss stops decreasing.
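As a non-limiting illustration, the learning rate schedule (linear ramp-up followed by square-root decay) and the gradient accumulation arrangement may be sketched as follows; the optimizer wiring is illustrative only.

```python
import torch

def inverse_sqrt_schedule(warmup_steps=4000):
    """Multiplicative LR factor: linear ramp-up, then square-root decay."""
    def factor(step):
        step = max(step, 1)
        if step < warmup_steps:
            return step / warmup_steps
        return (warmup_steps / step) ** 0.5
    return factor

params = [torch.nn.Parameter(torch.zeros(1))]   # stand-in for model parameters
optimizer = torch.optim.Adam(params, lr=1e-4)
scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, inverse_sqrt_schedule())

# Gradient accumulation sketch: 16 micro-batches of size 1 per optimizer step,
# over 16 GPUs, gives the effective minibatch size of 256. loss.backward() is
# called per micro-batch; optimizer.step(), scheduler.step(), and
# optimizer.zero_grad() are called once every 16 micro-batches.
```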
At step 514, an input of second information (e.g., a second MSA query matrix) representing a plurality of amino acid sequences corresponding to a second protein family that is different from the first protein family may be received.
At step 516, the updated machine learning model may generate a second target protein sequence in response to the input of the second information representing the second protein family. In this way, the trained machine learning model may be used to predict protein sequences in the second protein family without re-training on sequencing data corresponding to the second protein family.
Memory 620 may be used to store software executed by computing device 600 and/or one or more data structures used during operation of computing device 600. Memory 620 may include one or more types of machine readable media. Some common forms of machine readable media may include floppy disk, flexible disk, hard disk, magnetic tape, any other magnetic medium, CD-ROM, any other optical medium, punch cards, paper tape, any other physical medium with patterns of holes, RAM, PROM, EPROM, FLASH-EPROM, any other memory chip or cartridge, and/or any other medium from which a processor or computer is adapted to read.
Processor 610 and/or memory 620 may be arranged in any suitable physical arrangement. In some embodiments, processor 610 and/or memory 620 may be implemented on a same board, in a same package (e.g., system-in-package), on a same chip (e.g., system-on-chip), and/or the like. In some embodiments, processor 610 and/or memory 620 may include distributed, virtualized, and/or containerized computing resources. Consistent with such embodiments, processor 610 and/or memory 620 may be located in one or more data centers and/or cloud computing facilities.
In some examples, memory 620 may include non-transitory, tangible, machine readable media that includes executable code that when run by one or more processors (e.g., processor 610) may cause the one or more processors to perform the methods described in further detail herein. For example, as shown, memory 620 includes instructions for a protein generation module 630 that may be used to implement and/or emulate the systems and models, and/or to implement any of the methods described further herein. In some examples, the protein generation module 630 may receive an input 640, e.g., a set of protein sequences belonging to a protein family, represented by an MSA query 103, via a data interface 615. The protein generation module 630 may generate an output 650 (such as predicted probabilities of tokens in a target protein sequence) in response to the input 640.
The protein generation module 630 may comprise an encoder 631 (e.g., similar to encoder 110 in
The decoder 632 then decodes the target sequence from the representations by attending to the learned representations in a decoder transformer with cross attention to the encoded protein representation. For example, when the decoder 632 receives an MSA representation from the encoder 631, the decoder 632 may comprise causal self-attention layers and cross-attention layers that attend to the MSA representations.
Some examples of computing devices, such as computing device 600, may include non-transitory, tangible, machine readable media that include executable code that when run by one or more processors (e.g., processor 610) may cause the one or more processors to perform the processes of the methods described herein. Some common forms of machine readable media that may include the executable code are, for example, floppy disk, flexible disk, hard disk, magnetic tape, any other magnetic medium, CD-ROM, any other optical medium, punch cards, paper tape, any other physical medium with patterns of holes, RAM, PROM, EPROM, FLASH-EPROM, any other memory chip or cartridge, and/or any other medium from which a processor or computer is adapted to read.
As expected, it is observed that all models' ability to generalize to unseen family members improves as the number of observed family members increases as shown in
A major challenge in protein engineering is navigating the enormous search space of possible sequence variants, because the space of sequence variants increases exponentially with the number of sites. For example, if all 20 amino acids at 10 positions are considered, the number of unique sequences is 20^10, which is greater than ten trillion. At 65 sites, the space of possible sequences exceeds the number of atoms in the universe. However, the vast majority of these variants are not functional (<1% in typical mutagenesis experiments). Therefore, homing in on only the space of viable protein variants is critical for efficiently and feasibly searching sequence space. Perplexity represents the number of amino acids that would need to be guessed uniformly at random to find the correct amino acid; it is therefore the size of the reduced alphabet learned by the model. On this basis, the model can produce an enormous reduction in library size for protein engineering. Using the 10-site example, the pHMM perplexity of 5.3 yields a library size of 18 million, which the model reduces to only 42,000, more than an order of magnitude reduction in library size over the pHMM and about 8 orders of magnitude better than random search. This improvement is even more extreme when considering more sites for mutation.
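As a non-limiting illustration, the library-size arithmetic above can be reproduced as follows; the model perplexity of roughly 2.9 used in the last line is inferred here from the 42,000 figure and is illustrative only.

```python
def library_size(perplexity, n_sites):
    """Effective search-space size when each site is narrowed to
    roughly `perplexity` plausible amino acids."""
    return perplexity ** n_sites

print(f"{library_size(20.0, 10):.2e}")  # random search over 10 sites: ~1.0e13
print(f"{library_size(5.3, 10):.2e}")   # pHMM perplexity 5.3: ~1.8e7 (18 million)
print(f"{library_size(2.9, 10):.2e}")   # assumed model perplexity ~2.9: ~4.2e4 (about 42,000)
```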
Compared to the masked language models described, the generative model (denoted as “MSA2Prot”) offers exact sampling through the use of a decoder. This sidesteps computationally intensive Gibbs sampling.
In addition, MSA2Prot is able to generalize from a distribution of high-performing mutants, as shown in
Often, combing the literature for a given protein will yield a list of high- and low-performing mutants. However, given that experimental setups differ, there may not be consistent fitness measurements. MSA2Prot is an ideal candidate for this situation, given its ability to harness both high- and low-performing variants without explicit functional measurements. MSA2Prot also offers exact generation conditioned on multiple attributes. Given a protein sequence, the probability distribution over the next residue can be obtained by adding and re-normalizing the marginals of two MSAs, each representing a different attribute.
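As a non-limiting illustration, conditioning on two attributes by adding and re-normalizing the next-residue marginals from two MSAs may be sketched as follows, assuming the same illustrative model interface as above.

```python
import torch

def combined_next_residue(model, msa_a, msa_b, prefix):
    """Next-residue distribution conditioned on two attributes: add the
    per-MSA next-token marginals and re-normalize."""
    with torch.no_grad():
        p_a = model(msa_a, prefix)[:, -1]   # next-token distribution given MSA A
        p_b = model(msa_b, prefix)[:, -1]   # next-token distribution given MSA B
    mixed = p_a + p_b
    return mixed / mixed.sum(dim=-1, keepdim=True)
```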
MSA2Prot is further evaluated on the data set (Gonzalez et al., Fitness effects of single amino acid insertions and deletions in TEM-1 β-lactamase. Journal of Molecular Biology, 431(10):2320-2330, May 2019) of 262 deletions and 4422 insertions, and benchmarked against the model of Riesselman et al. (2018) and an HMM.
MSA2Prot's ability to adaptively sample high-fitness variants given a black-box oracle, approximated by a Random Forest regressor, is also evaluated. For example, the system may randomly sample 100 sequences from the training MSA to form an initial MSA. Sequences are then sampled from the model, and the MSA is updated if the regressor predicts that a sampled sequence has a higher fitness than the minimum-fitness sequence in the MSA. The results, shown in
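As a non-limiting illustration, such an adaptive sampling loop may be sketched as follows; sample_fn and encode_fn are illustrative stand-ins for drawing a sequence from MSA2Prot conditioned on the current pool and for featurizing sequences for the regressor, and replacing the lowest-fitness member is one possible way of updating the MSA.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def adaptive_sampling(sample_fn, encode_fn, train_X, train_y,
                      initial_pool, n_rounds=500):
    """Adaptive sampling sketch: a random forest regressor stands in for the
    black-box fitness oracle; a sampled sequence enters the conditioning MSA
    when it is predicted to beat the pool's minimum-fitness member."""
    oracle = RandomForestRegressor(n_estimators=100, random_state=0)
    oracle.fit(train_X, train_y)

    pool = list(initial_pool)                   # e.g., 100 sequences from the training MSA
    for _ in range(n_rounds):
        candidate = sample_fn(pool)             # draw one sequence conditioned on the pool
        pred = oracle.predict(encode_fn([candidate]))[0]
        pool_preds = oracle.predict(encode_fn(pool))
        worst = int(np.argmin(pool_preds))
        if pred > pool_preds[worst]:
            pool[worst] = candidate             # update the MSA with the better variant
    return pool
```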
This description and the accompanying drawings that illustrate inventive aspects, embodiments, implementations, or applications should not be taken as limiting. Various mechanical, compositional, structural, electrical, and operational changes may be made without departing from the spirit and scope of this description and the claims. In some instances, well-known circuits, structures, or techniques have not been shown or described in detail in order not to obscure the embodiments of this disclosure. Like numbers in two or more figures represent the same or similar elements.
In this description, specific details are set forth describing some embodiments consistent with the present disclosure. Numerous specific details are set forth in order to provide a thorough understanding of the embodiments. It will be apparent, however, to one skilled in the art that some embodiments may be practiced without some or all of these specific details. The specific embodiments disclosed herein are meant to be illustrative but not limiting. One skilled in the art may realize other elements that, although not specifically described here, are within the scope and the spirit of this disclosure. In addition, to avoid unnecessary repetition, one or more features shown and described in association with one embodiment may be incorporated into other embodiments unless specifically described otherwise or if the one or more features would make an embodiment non-functional.
Although illustrative embodiments have been shown and described, a wide range of modification, change and substitution is contemplated in the foregoing disclosure and in some instances, some features of the embodiments may be employed without a corresponding use of other features. One of ordinary skill in the art would recognize many variations, alternatives, and modifications. Thus, the scope of the invention should be limited only by the following claims, and it is appropriate that the claims be construed broadly and in a manner consistent with the scope of the embodiments disclosed herein.
Number | Date | Country
63/281,975 | Nov. 22, 2021 | US
63/321,916 | Mar. 21, 2022 | US