Proteins are biological molecules that are composed of one or more chains of amino acids. Proteins can have various functions within an organism. For example, some proteins can be involved in causing a reaction to take place within an organism. In other examples, proteins can transport molecules throughout the organism. In still other examples, proteins can be involved in the replication of genes. Additionally, some proteins can have therapeutic properties and can be used to treat various biological conditions. The structure and function of proteins are based on the arrangement of amino acids that comprise the proteins. The arrangement of amino acids for proteins can be represented by a sequence of letters, with each letter corresponding to an amino acid at a certain position of the protein. The arrangement of amino acids for proteins can also be represented by three-dimensional structures that not only indicate the amino acids at certain positions of the protein, but also indicate three-dimensional features of the proteins, such as an α-helix or a β-sheet.
The present disclosure is illustrated by way of example and not limitation in the figures of the accompanying drawings, in which like references indicate similar elements.
Proteins can have many beneficial uses within organisms. For example, proteins can be used to treat diseases and other biological conditions that can detrimentally impact the health of humans and other mammals. In various scenarios, proteins can participate in reactions that are beneficial to subjects and that can counteract one or more biological conditions being experienced by the subjects. In some examples, proteins can also bind to molecules within an organism that may be detrimental to the health of a subject. In various situations, the binding of proteins to potentially harmful molecules can result in activation of the immune system of a subject to neutralize the potential effects of the molecules. For these reasons, many individuals and organizations have sought to develop proteins that may have therapeutic benefits.
The development of proteins for use in treating biological conditions can be a time consuming and resource intensive process. Often, candidate proteins for development can be identified as potentially having desired biophysical properties, three-dimensional (3D) structures, and/or behavior within an organism. In order to determine whether the candidate proteins actually have the desired characteristics, the proteins can be physically synthesized and then tested to determine whether the actual characteristics of the synthesized proteins correspond to the desired characteristics. Due to the amount of resources needed to synthesize and test proteins for specified biophysical properties, 3D structures, and/or behaviors, the number of candidate proteins synthesized for therapeutic purposes is limited. In some situations, the number of proteins synthesized for therapeutic purposes can be limited by the loss of resources that takes place when candidate proteins are synthesized and do not have the desired characteristics.
The use of computer-implemented techniques to identify candidate proteins that have particular characteristics has increased. These conventional techniques, however, can be limited in their scope and accuracy. In various situations, conventional computer-implemented techniques to generate protein sequences can be limited by the amount of data available and/or the types of data available that may be needed by those conventional techniques to accurately generate protein sequences with specified characteristics. Additionally, the techniques utilized to produce models that can generate protein sequences with particular characteristics can be complex, and the know-how needed to produce models that are accurate and efficient can be difficult to implement. The length of the protein sequences produced by conventional models can also be limited because the accuracy of conventional techniques can decrease as the lengths of the proteins increase and because the computing resources used to generate large numbers of protein sequences, such as hundreds, thousands, up to millions of protein sequences, having a relatively large number of amino acids (e.g., 50-1000) can become prohibitive. Thus, the number of proteins generated by conventional computational techniques is limited.
Further, although proteins produced by one organism or type of organism can have functionality that may be beneficial to a number of organisms, in various scenarios, the same proteins can be rejected by the immune system of another organism or type of organism, negating the beneficial functionality of the proteins. The techniques and systems described herein can be used to generate amino acid sequences of target molecules based on amino acid sequences of template molecules. The template molecules can exhibit a functionality that can be beneficial for a number of different organisms besides the original host that produced the template molecules. The target molecules can also exhibit the functionality of the template molecules, while minimizing the possibility of rejection by an organism that is different from the original host.
For example, the portions of the amino acid sequence of a template protein that are attributed to a functionality of the template protein within a host organism can be preserved, while additional portions of the amino acid sequence of the template protein can be modified to minimize the possibility of rejection by another organism. To illustrate, a template antibody produced in a mouse can effectively bind to an antigen that is found in both mice and humans. The binding of the template antibody to the antigen can be attributed to one or more binding regions of the template antibody. The techniques and systems described herein can generate data corresponding to a number of amino acid sequences for target antibodies that include the binding regions of the template antibody and that also include additional regions that have been modified from the template antibody that correspond to amino acid sequences included in human antibodies. In this way, the techniques and systems described herein can produce an antibody with a human framework in conjunction with binding regions for a specific antigen, where the binding regions for the antigen may not be present in known human antibodies. Accordingly, biological conditions that may not have been responsive to known human antibodies can be treated using antibodies with amino acid sequences generated from the techniques and systems described herein.
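The grafting operation described above can be sketched as a simple string manipulation over aligned sequences. This is an illustrative sketch only: the function name, the toy sequences, and the region boundaries are hypothetical, and real humanization operates on sequences aligned to a common antibody numbering scheme.

```python
# Hypothetical sketch of "grafting" the binding regions of a template
# (e.g., mouse) antibody onto a human framework. Region boundaries and
# sequences here are illustrative, not real antibody data.

def graft_regions(template_seq, framework_seq, preserved_spans):
    """Return framework_seq with the template residues copied in over
    each (start, end) span in preserved_spans (end exclusive).

    Assumes both sequences are already aligned to a common numbering,
    so the same index refers to the same structural position.
    """
    target = list(framework_seq)
    for start, end in preserved_spans:
        target[start:end] = template_seq[start:end]
    return "".join(target)

template = "QVQLKESGPGLVAPSQSLSI"   # toy "mouse" template sequence
human_fw = "EVQLVESGGGLVQPGGSLRL"   # toy "human" framework, same length
# Preserve positions 4-8 and 12-16 as if they were binding regions.
target = graft_regions(template, human_fw, [(4, 8), (12, 16)])
```

The resulting target sequence keeps the template residues in the preserved spans while adopting the framework residues everywhere else.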
Machine learning techniques can be used to generate the target protein amino acid sequences from the template protein amino acid sequences. In illustrative examples, generative adversarial networks can be used to generate the target protein amino acid sequences. The generative adversarial networks can be trained using target protein amino acid sequences in relation to template protein amino acid sequences and position modification data. The position modification data can indicate, for individual positions of a template protein amino acid sequence, a likelihood that the amino acid can be modified to a different amino acid. In various implementations, the position modification data can correspond to a penalty applied by the generative adversarial network in response to modification of an individual amino acid. For example, a position of a template protein amino acid sequence having a relatively high penalty for being modified can be less likely to be modified by the generative adversarial network, while another position of the template protein amino acid sequence having a relatively low penalty for being modified can be more likely to be modified by the generative adversarial network. In various examples, transfer learning techniques can also be applied to produce target antibodies having one or more biophysical properties.
The position modification data can be based on the location of the amino acids in the template protein sequence. Amino acids located in regions of the template protein associated with a desired functionality can have a relatively high penalty for being modified, while amino acids located in other regions of a template protein can have relatively moderate or relatively low penalties for being modified. In situations where a target protein corresponds to a different organism than a host organism that produces the template protein, the positions of the template protein associated with relatively low penalties for being modified can be the most likely to be changed to correspond to a framework for the organism related to the target protein. Additionally, in scenarios where the target protein is derived from a germline gene that is different from the germline gene of the host that produces the template protein, the positions of the template protein associated with relatively low penalties for being modified can be the most likely to be changed to correspond to a protein produced from the target protein germline gene. As used herein, germline can correspond to amino acid sequences of proteins that are conserved when the cells that produce the proteins replicate. An amino acid sequence can be conserved from a parent cell to a progeny cell when the amino acid sequence of the progeny cell has at least a threshold amount of identity with respect to the corresponding amino acid sequence in the parent cell. In an illustrative example, a portion of an amino acid sequence of a human antibody that is part of a kappa light chain that is conserved from a parent cell to a progeny cell can be a germline portion of the antibody.
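One way to represent region-based position modification data is as a per-position penalty table. The sketch below is a minimal illustration under assumed values: the function name, span boundaries, and penalty magnitudes are hypothetical, not values specified by this disclosure.

```python
# Minimal sketch of position modification data as per-position
# penalties: positions inside functional regions get a high penalty
# for modification, all other positions get a low penalty.

def build_position_penalties(seq_len, functional_spans,
                             high=10.0, low=0.1):
    """Return a list of modification penalties, one per position.

    functional_spans: (start, end) pairs (end exclusive) marking
    regions attributed to the template protein's functionality.
    """
    penalties = [low] * seq_len
    for start, end in functional_spans:
        for pos in range(start, end):
            penalties[pos] = high
    return penalties

# Positions 10-20 correspond to a region whose function is preserved.
penalties = build_position_penalties(30, [(10, 20)])
```

A generator trained with such a table would be discouraged from altering positions 10 through 19 while remaining relatively free to change the rest.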
In illustrative examples, an antibody produced in mice can bind to an antigen that is found in both mice and humans. The binding of the antibody to the antigen can be based on amino acids located in the complementarity-determining regions (CDRs) of the antibody. In this scenario, the position modification data can indicate relatively high penalties for changing amino acids located in the CDRs of the template mouse antibody. The position modification data can also indicate lower penalties for the modification of amino acids located in the constant domains and other portions of the variable domains of the template mouse antibody. Thus, the generative adversarial networks described herein can generate target human antibody amino acid sequences that preserve most or all of the residues of the mouse antibody that participate in binding the antigen, while changing constant domains and/or other parts of the variable domains of the heavy chains and/or light chains of the mouse antibody to correspond to the heavy chains and light chains of human antibodies. The generative adversarial networks described herein can also be trained using human antibodies in order to determine the characteristics of human antibodies and identify changes to the template mouse antibody that can be made to produce a humanized target antibody for the antigen.
By implementing the techniques and systems described herein, target protein amino acid sequences can be generated based on one or more template protein amino acid sequences that can preserve at least some functionality of the template proteins, while utilizing a different supporting framework for the portions of the template proteins that are attributed to the functionality. The computational and machine learning techniques described herein can efficiently generate target protein amino acid sequences, while minimizing the probability that the target proteins will lose functionality of the template proteins. The techniques and systems described herein can also minimize the probability that the target proteins will be rejected by an organism that is different from the host organism that produced the template proteins. For example, the use of position modification data can decrease the amount of computing resources utilized in generating target protein sequences by constraining the number of changes that can be made by a computational model to the template protein sequences, while allowing for flexibility in portions of the template sequences that are less constrained to coincide with features of target proteins related to the new host organism. In various examples, the techniques and systems described herein can analyze thousands up to millions of amino acid sequences of proteins to accurately generate amino acid sequences of new proteins that both preserve the functionality of template proteins while also minimizing the probability of the new proteins being rejected by a new host organism.
In illustrative examples, an amount of sequence identity between the region 108 of the template protein 104 and a portion of a target protein 106 can indicate that at least a portion of the region 108 of the template protein 104 and the portion of the target protein 106 have the same amino acid at a number of positions. An amount of identity between at least a portion of the region 108 of the template protein 104 and a portion of the target protein 106 can be determined using a Basic Local Alignment Search Tool (BLAST).
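For already-aligned sequences, the amount of identity reduces to the fraction of positions at which the residues match. The sketch below illustrates that calculation only; it is not a substitute for BLAST, which additionally performs the alignment and local search steps.

```python
# Naive sketch of sequence identity for two sequences that are
# already aligned to the same length (BLAST handles the alignment
# itself; this only illustrates the identity fraction).

def percent_identity(seq_a, seq_b):
    """Fraction of positions with the same residue."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be pre-aligned to equal length")
    matches = sum(a == b for a, b in zip(seq_a, seq_b))
    return matches / len(seq_a)

identity = percent_identity("QVQLKE", "QVQLVE")  # 5 of 6 positions match
```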
Additional portions of the target protein 106 can have different amino acid sequences in relation to portions of the template protein 104. The regions of the target protein 106 that have different amino acid sequences in relation to portions of the template protein 104 can also have one or more different secondary structures in relation to the secondary structures of the template protein 104. The differences between the amino acid sequences of regions of the template protein 104 and the regions of the target protein 106 can also result in different tertiary structures for the template protein 104 and the target protein 106. In the illustrative example of
The machine learning architecture 102 can modify regions of the template protein 104 to produce the amino acid sequence of the target protein 106 such that portions of the amino acid sequence of the target protein 106 correspond to proteins produced by a different organism than the organism that produced the template protein 104. For example, the template protein 104 can be produced by one mammal and the target protein 106 can be produced by a different mammal. To illustrate, the template protein 104 can be produced by mice and the target protein 106 can correspond to proteins produced by humans. In additional examples, the template protein 104 can correspond to a protein produced in relation to a first germline gene and the target protein 106 can correspond to a protein produced in relation to a second germline gene. In situations where the template protein 104 and the target protein 106 are antibodies, the template protein 104 can have an amino acid sequence that corresponds to a first antibody isotype (e.g., immunoglobulin E (IgE)) and the target protein 106 can have an amino acid sequence that corresponds to a second antibody isotype (e.g., IgG).
The machine learning architecture 102 can include a generating component 118 and a challenging component 120. The generating component 118 can implement one or more models to generate amino acid sequences based on input provided to the generating component 118. In various implementations, the one or more models implemented by the generating component 118 can include one or more functions. The challenging component 120 can generate output indicating whether the amino acid sequences produced by the generating component 118 satisfy various characteristics. The output produced by the challenging component 120 can be provided to the generating component 118 and the one or more models implemented by the generating component 118 can be modified based on the feedback provided by the challenging component 120. The challenging component 120 can compare the amino acid sequences produced by the generating component 118 with amino acid sequences of a library of target proteins and generate an output indicating an amount of correspondence between the amino acid sequences produced by the generating component 118 and the amino acid sequences of target proteins provided to the challenging component 120.
In various implementations, the machine learning architecture 102 can implement one or more neural network technologies. For example, the machine learning architecture 102 can implement one or more recurrent neural networks. Additionally, the machine learning architecture 102 can implement one or more convolutional neural networks. In certain implementations, the machine learning architecture 102 can implement a combination of recurrent neural networks and convolutional neural networks. In examples, the machine learning architecture 102 can include a generative adversarial network (GAN). In these situations, the generating component 118 can include a generator and the challenging component 120 can include a discriminator. In additional implementations, the machine learning architecture 102 can include a conditional generative adversarial network (cGAN).
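The feedback loop between a generating component and a challenging component can be illustrated at a toy scale. The sketch below is a deliberately simplified stand-in, assuming made-up names and a trivial scoring rule: the "generator" is a per-residue probability table rather than a neural network, and the "discriminator" merely compares residue usage against a toy target library. It shows only the shape of the adversarial feedback loop, not an actual GAN implementation.

```python
import random

# Highly simplified sketch of the generating/challenging interplay.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def generate(probs, length, rng):
    """'Generating component': sample a sequence from a weight table."""
    return "".join(rng.choices(AMINO_ACIDS, weights=probs, k=length))

def discriminator_score(seq, target_counts):
    """'Challenging component': score how closely the sequence's
    residue usage matches the target library's usage (max 1.0)."""
    total = sum(target_counts.values())
    score = 0.0
    for aa in AMINO_ACIDS:
        gen_freq = seq.count(aa) / len(seq)
        tgt_freq = target_counts.get(aa, 0) / total
        score += min(gen_freq, tgt_freq)
    return score

rng = random.Random(0)
target_counts = {"A": 50, "G": 30, "K": 20}   # toy target library stats
probs = [1.0] * len(AMINO_ACIDS)              # uniform starting generator

for _ in range(200):                          # feedback loop
    seq = generate(probs, 40, rng)
    score = discriminator_score(seq, target_counts)
    # Nudge the generator toward residues the discriminator rewards.
    for i, aa in enumerate(AMINO_ACIDS):
        if aa in target_counts:
            probs[i] += 0.05 * score

final_score = discriminator_score(generate(probs, 200, rng), target_counts)
```

In a real GAN, both components are neural networks and the discriminator is trained jointly against real examples; here only the direction of the feedback is preserved.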
In the illustrative example of
Additionally, position modification data 128 can be provided to the generating component 118 to be used by the generating component 118 to produce the generated sequences 122. The position modification data 128 can indicate one or more criteria related to the modification of amino acids of the one or more template protein sequences 126. For example, the position modification data 128 can indicate one or more criteria corresponding to the modification of individual amino acids of the one or more template protein sequences 126. To illustrate, the position modification data 128 can indicate respective probabilities that amino acids at individual positions of a template protein sequence 126 can be modified. In additional implementations, the position modification data 128 can indicate a penalty associated with the modification of amino acids at individual positions of a template protein sequence 126. The position modification data 128 can include values or functions corresponding to the respective amino acids located at individual positions of a template protein sequence 126.
In illustrative examples, the position modification data 128 can include criteria that reduce the probability of amino acids being modified at positions of a template protein that correspond to functionality of the template protein that is to be preserved in a target protein. For example, a penalty associated with modifying an amino acid located in a region that is attributed to functionality of a template protein can be relatively high. Additionally, the position modification data 128 can include criteria for amino acids outside of one or more regions that are attributed to functionality of a template protein that indicate increased or neutral probabilities for modification of those amino acids. To illustrate, a penalty associated with modifying an amino acid located at a position outside of a region attributed to particular functionality of a protein can be relatively low or neutral. Further, the position modification data 128 can indicate probabilities of changing amino acids at positions of a template protein to different types of amino acids. In illustrative examples, an amino acid located at a position of a template protein can have a first penalty for being changed to a first type of amino acid and a second, different penalty for being changed to a second type of amino acid. That is, in various implementations, a hydrophobic amino acid of a template protein can have a first penalty for being changed to another hydrophobic amino acid and a second, different penalty for being changed to a positively charged amino acid.
In one or more examples, the position modification data 128 can be determined, at least in part, based on input obtained via a computing device. For example, a user interface can be generated that includes one or more user interface elements to capture at least a portion of the position modification data 128. In addition, a data file can be obtained over a communication interface that includes at least a portion of the position modification data 128. Further, the position modification data 128 can be computed by analyzing a number of amino acid sequences to determine numbers of occurrences of different amino acids at one or more positions of the proteins. Occurrences of amino acids at positions of proteins, including template proteins and target proteins, can be used to determine probabilities of modifications of amino acids that are indicated in the position modification data 128. In various examples, biophysical properties and/or structural properties of proteins can be analyzed in conjunction with the placement of amino acids at one or more positions of template proteins and target proteins to determine probabilities included in the position modification data 128 for modifying amino acids at one or more positions of template proteins to generate target proteins.
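Deriving modification probabilities from amino acid occurrences can be sketched as a per-position frequency count over sequences aligned to a common numbering. The conversion rule below (one minus the frequency of the most common residue) is an illustrative assumption, not a formula prescribed by this disclosure.

```python
from collections import Counter

# Sketch: derive per-position modification probabilities from observed
# amino acid usage across aligned sequences. Highly conserved positions
# get a low modification probability; variable positions a higher one.

def modification_probabilities(sequences):
    length = len(sequences[0])
    probs = []
    for pos in range(length):
        counts = Counter(seq[pos] for seq in sequences)
        most_common_freq = counts.most_common(1)[0][1] / len(sequences)
        # Fully conserved position -> probability 0 of modification.
        probs.append(1.0 - most_common_freq)
    return probs

library = ["QVKL", "QVRL", "QAKL", "QVEL"]  # toy aligned library
probs = modification_probabilities(library)
```

Here positions 0 and 3 are fully conserved across the toy library and so receive probability 0, while the variable position 2 receives the highest probability of modification.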
The generated sequence(s) 122 can be compared by the challenging component 120 against sequences of proteins included in target protein sequence data 130. The target protein sequence data 130 can be training data for the machine learning architecture 102. The target protein sequence data 130 can be encoded according to a schema. A schema applied to amino acid sequences included in the target protein sequence data 130 can be based on a classification of the amino acid sequences. For example, an antibody can be stored according to a first classification, a signaling protein can be stored according to a second classification, and a transport protein can be stored according to a third classification.
The target protein sequence data 130 can include sequences of proteins obtained from one or more data sources that store amino acid sequences of proteins. The one or more data sources can include one or more websites that are searched and information corresponding to amino acid sequences of target proteins can be extracted from the one or more websites. Additionally, the one or more data sources can include electronic versions of research documents from which amino acid sequences of target proteins can be extracted.
In illustrative examples, the target protein sequence data 130 can include amino acid sequences of proteins that are produced by an organism that is different from an organism that produces the template protein sequences 126. For example, the target protein sequence data 130 can include amino acid sequences of human proteins and the one or more template protein sequences 126 can correspond to one or more proteins produced by mice or chickens. In additional examples, the target protein sequence data 130 can include amino acid sequences of horse proteins and the one or more template protein sequences 126 can correspond to one or more proteins produced by humans. In various examples, the amino acid sequences included in the target protein sequence data 130 can have one or more characteristics and/or functions. To illustrate, the amino acid sequences included in the target protein sequence data 130 can correspond to human enzymes used in the metabolism of various foods consumed by humans. In further examples, the amino acid sequences included in the target protein sequence data 130 can correspond to human antibodies.
The template protein sequences 126, the position modification data 128, the target protein sequence data 130, or combinations thereof, can be stored in one or more data stores that are accessible to the machine learning architecture 102. The one or more data stores can be connected to the machine learning architecture 102 via a wireless network, a wired network, or combinations thereof. The template protein sequences 126, the position modification data 128, the target protein sequence data 130, or combinations thereof, can be obtained by the machine learning architecture 102 based on requests sent to the data stores to retrieve one or more portions of at least one of the template protein sequences 126, the position modification data 128, or the target protein sequence data 130.
The challenging component 120 can generate output indicating whether the amino acid sequences produced by the generating component 118 satisfy various characteristics. In one or more implementations, the challenging component 120 can be a discriminator. In additional situations, such as when the machine learning architecture 102 includes a Wasserstein GAN, the challenging component 120 can include a critic.
In illustrative examples, based on similarities and differences between the generated sequence(s) 122 and additional sequences provided to the challenging component 120, such as amino acid sequences included in the target protein sequence data 130, the challenging component 120 can generate the classification output 132 to indicate an amount of similarity or an amount of difference between the generated sequence(s) 122 and sequences provided to the challenging component 120 that are included in the target protein sequence data 130. Additionally, the classification output 132 can indicate an amount of similarity or an amount of difference between the generated sequence(s) 122 and the template protein sequences 126.
In one or more examples, the challenging component 120 can label the generated sequence(s) 122 as 0 and the encoded sequences obtained from the target protein sequence data 130 as 1. In these situations, the classification output 132 can include a first number from 0 to 1 with respect to one or more amino acid sequences included in the target protein sequence data 130. Additionally, the challenging component 120 can label the generated sequences 122 as 0 and the template protein sequences 126 as 1. Accordingly, the challenging component 120 can generate another number from 0 to 1 with respect to the template protein sequences 126.
In additional examples, the challenging component 120 can implement a distance function that produces an output that indicates an amount of distance between the generated sequence(s) 122 and the proteins included in the target protein sequence data 130. Further, the challenging component 120 can implement a distance function that produces an output that indicates an amount of distance between the generated sequence(s) 122 and the template protein sequence(s) 126. In implementations where the challenging component 120 implements a distance function, the classification output 132 can include a number from −∞ to ∞ indicating a distance between the generated sequence(s) 122 and one or more sequences included in the target protein sequence data 130. The challenging component 120 can also implement a distance function and generate a classification output 132 including an additional number from −∞ to ∞ indicating a distance between the generated sequence(s) 122 and the template protein sequences 126.
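The two forms of classification output can be contrasted in a small sketch: a discriminator squashes a raw score into the interval from 0 to 1, while a Wasserstein-style critic reports an unbounded distance-like score. The raw score values below are placeholders standing in for a network's output, not real model outputs.

```python
import math

def discriminator_output(raw_score):
    """Probability-like output in (0, 1) via the logistic function,
    as used when real sequences are labeled 1 and generated ones 0."""
    return 1.0 / (1.0 + math.exp(-raw_score))

def critic_output(raw_score):
    """Unbounded critic score (-inf to inf): sign and magnitude
    indicate distance rather than a probability."""
    return raw_score

d = discriminator_output(2.0)   # always lands strictly between 0 and 1
c = critic_output(-3.7)         # passes through unbounded
```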
The amino acid sequences included in the target protein sequence data 130 can be subject to data preprocessing 134 before being provided to the challenging component 120. For example, the target protein sequence data 130 can be arranged according to a classification system before being provided to the challenging component 120. The data preprocessing 134 can include pairing amino acids included in the target proteins of the target protein sequence data 130 with numerical values that can represent structure-based positions within the proteins. The numerical values can include a sequence of numbers having a starting point and an ending point. In an illustrative example, a T can be paired with the number 43 indicating that a threonine residue is located at a structure-based position 43 of a specified protein domain type. In illustrative examples, structure-based numbering can be applied to any general protein type, such as fibronectin type III (FNIII) proteins, avimers, antibodies, VHH domains, kinases, zinc fingers, T-cell receptors, and the like.
In various implementations, the classification system implemented by the data preprocessing 134 can include a numbering system that encodes structural position for amino acids located at respective positions of proteins. In this way, proteins having different numbers of amino acids can be aligned according to structural features. For example, the classification system can designate that portions of proteins having particular functions and/or characteristics can have a specified number of positions. In various situations, not all of the positions included in the classification system may be associated with an amino acid because the number of amino acids in a particular region of a protein may vary between proteins. In additional examples, the structure of a protein can be reflected in the classification system. To illustrate, positions of the classification system that are not associated with a respective amino acid can indicate various structural features of a protein, such as a turn or a loop. In an illustrative example, a classification system for antibodies can indicate that heavy chain regions, light chain regions, and hinge regions have a specified number of positions assigned to them and the amino acids of the antibodies can be assigned to the positions according to the classification system. In one or more implementations, the data preprocessing 134 can use Antibody Structural Numbering (ASN) to classify individual amino acids located at respective positions of an antibody.
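Aligning proteins of different lengths to a fixed structure-based numbering can be sketched as filling a fixed frame of positions, with unassigned positions marking gaps such as a shorter loop. The number of positions and the residue assignments below are illustrative assumptions, not an actual numbering scheme such as ASN.

```python
# Sketch of aligning sequences of different lengths to a fixed
# structure-based numbering. Unassigned positions become gaps.

NUM_POSITIONS = 10  # positions defined by the (toy) numbering scheme

def to_structural_positions(residue_map):
    """residue_map: {structural_position: amino_acid}. Returns a list
    of length NUM_POSITIONS with '-' marking gap positions."""
    return [residue_map.get(pos, "-") for pos in range(NUM_POSITIONS)]

# Two proteins with different loop lengths align to the same frame;
# the shorter loop leaves positions 3-5 empty.
long_loop = to_structural_positions(
    {0: "Q", 1: "V", 2: "T", 3: "K", 4: "G", 5: "S",
     6: "L", 7: "R", 8: "A", 9: "E"})
short_loop = to_structural_positions(
    {0: "Q", 1: "V", 2: "T", 6: "L", 7: "R", 8: "A", 9: "E"})
```

Because both proteins occupy the same fixed frame, position 7 refers to the same structural feature in each, even though the raw sequences differ in length.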
The data used to train the machine learning architecture 102 can impact the amino acid sequences produced by the generating component 118. For example, in situations where human antibodies are included in the target protein sequence data 130 provided to the challenging component 120, the amino acid sequences generated by the generating component 118 can correspond to human antibody amino acid sequences. In another example, in scenarios where the amino acid sequences included in the target protein sequence data 130 provided to the challenging component 120 correspond to proteins produced from a germline gene, the amino acid sequences produced by the generating component 118 can correspond to proteins produced from the germline gene. Further, when the amino acid sequences included in the target protein sequence data 130 provided to the challenging component 120 correspond to antibodies of a specified isotype, the amino acid sequences produced by the generating component 118 can correspond to antibodies of the specified isotype.
The output produced by the data preprocessing 134 can include encoded sequences 136. The encoded sequences 136 can include a matrix indicating amino acids associated with various positions of a protein. In examples, the encoded sequences 136 can include a matrix having columns corresponding to different amino acids and rows that correspond to structure-based positions of proteins. For each element in the matrix, a 0 can be used to indicate the absence of an amino acid at the corresponding position and a 1 can be used to indicate the presence of an amino acid at the corresponding position. The matrix can also include an additional column that represents a gap in an amino acid sequence where there is no amino acid at a particular position of the amino acid sequence. Thus, in situations where a position represents a gap in an amino acid sequence, a 1 can be placed in the gap column with respect to the row associated with the position where an amino acid is absent. The generated sequence(s) 122 can also be represented using a vector according to the same or a similar numbering scheme as that used for the encoded sequences 136. In some illustrative examples, the encoded sequences 136 and the generated sequence(s) 122 can be encoded using a method that may be referred to as a one-hot encoding method.
After the machine learning architecture 102 has undergone a training process, a trained model 138 can be generated that can produce sequences of proteins. The trained model 138 can include the generating component 118 after a training process has been performed using the target protein sequence data 130. In illustrative examples, the trained model 138 can include a number of weights and/or a number of parameters of a convolutional neural network. The training process for the machine learning architecture 102 can be complete after the function(s) implemented by the generating component 118 and the function(s) implemented by the challenging component 120 converge. The convergence of a function can be based on the movement of values of model parameters toward particular values as protein sequences are generated by the generating component 118 and feedback is obtained from the challenging component 120. In various implementations, the training of the machine learning architecture 102 can be complete when the protein sequences produced by the generating component 118 have particular characteristics. For example, the amino acid sequences generated by the generating component 118 can be analyzed by a software tool that determines at least one of biophysical properties of the amino acid sequences, structural features of the amino acid sequences, or adherence to amino acid sequences corresponding to one or more protein germlines. The machine learning architecture 102 can produce the trained model 138 in situations where the amino acid sequences produced by the generating component 118 are determined by the software tool to have one or more specified characteristics. In various examples, a software tool used to evaluate the amino acid sequences produced by the generating component 118 can determine that the trained model 138 produces amino acid sequences that have preserved functionality of a template protein.
Protein sequence input 140 can be provided to the trained model 138, and the trained model 138 can produce generated protein sequences 142. The protein sequence input 140 can include one or more template protein sequences, additional position constraint data, and an input vector that can include a random or pseudo-random series of numbers. In an illustrative example, the protein sequence input 140 can include one or more template protein sequences 126. The generated protein sequences 142 produced by the trained model 138 can be represented as a matrix structure that is the same as or similar to the matrix structure used to represent the encoded sequences 136 and/or the generated sequence(s) 122. In various implementations, the matrices produced by the trained model 138 that comprise the generated protein sequences 142 can be decoded to produce a string of amino acids that correspond to the sequence of a target protein. In illustrative examples, the protein sequence input 140 can include the amino acid sequence of the template protein 104 and position modification data indicating a relatively high probability that the amino acids located in the region 108 are to be preserved in order to preserve the functionality of the region 108. The trained model 138 can then use the protein sequence input 140 to generate a number of amino acid sequences of target proteins, such as an amino acid sequence of the target protein 106. In various examples, the trained model 138 can use the protein sequence input 140 to produce hundreds, up to thousands, and up to millions of protein sequences similar to the target protein 106 that correspond to the template protein 104.
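The decoding step described above, from a generated matrix back to a string of amino acids, can be sketched as taking the highest-scoring column in each row and dropping gap positions. This is a hedged illustration; the alphabet ordering and the soft scores are hypothetical.

```python
# Decoding sketch: map each row of per-position scores to its most
# probable residue (argmax over columns), skipping the gap column.
ALPHABET = "ACDEFGHIKLMNPQRSTVWY-"  # 20 amino acids plus a gap column

def decode(matrix):
    """Return the amino acid string implied by a matrix of position scores."""
    residues = []
    for row in matrix:
        best = max(range(len(row)), key=lambda i: row[i])
        symbol = ALPHABET[best]
        if symbol != "-":  # gap column: no amino acid at this position
            residues.append(symbol)
    return "".join(residues)

# Toy scores: rows 0 and 2 favor T and K, row 1 favors the gap column.
scores = [
    [0.9 if a == "T" else 0.005 for a in ALPHABET],
    [0.8 if a == "-" else 0.01 for a in ALPHABET],
    [0.7 if a == "K" else 0.015 for a in ALPHABET],
]
decoded = decode(scores)  # "TK"
```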
Although not shown in the illustrative example of
The generated protein sequences 142 produced by the trained model 138 can correspond to various types of proteins. For example, the generated protein sequences 142 can correspond to proteins that function as T-cell receptors. In additional examples, the generated protein sequences 142 can correspond to proteins that function as catalysts to cause biochemical reactions within an organism to take place. The generated protein sequences 142 can also correspond to one or more types of antibodies. To illustrate, the generated protein sequences 142 can correspond to one or more antibody subtypes, such as immunoglobulin A (IgA), immunoglobulin D (IgD), immunoglobulin E (IgE), immunoglobulin G (IgG), or immunoglobulin M (IgM). Further, the generated protein sequences 142 can correspond to additional proteins that bind antigens. In examples, the generated protein sequences 142 can correspond to affibodies, affilins, affimers, affitins, alphabodies, anticalins, avimers, monobodies, designed ankyrin repeat proteins (DARPins), nanoCLAMPs (clostridial antibody mimetic proteins), antibody fragments, or combinations thereof. In still other examples, the generated protein sequences 142 can correspond to amino acid sequences that participate in protein-to-protein interactions, such as proteins that have regions that bind to antigens or regions that bind to other molecules.
In some implementations, the generated protein sequences 142 can be subject to sequence filtering. The sequence filtering can parse the generated protein sequences 142 to identify one or more of the generated protein sequences 142 that correspond to one or more characteristics. For example, the generated protein sequences 142 can be analyzed to identify amino acid sequences that have specified amino acids at particular positions. One or more of the generated protein sequences 142 can also be filtered to identify amino acid sequences having one or more particular strings or regions of amino acids. In various implementations, the generated protein sequences 142 can be filtered to identify amino acid sequences that are associated with a set of biophysical properties based at least partly on similarities between at least one of the generated protein sequences 142 and amino acid sequences of additional proteins having the set of biophysical properties.
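The sequence filtering described above can be sketched as a simple predicate over generated sequences. The positional constraints and the required motif in this example are hypothetical stand-ins, not values from this disclosure.

```python
# Filtering sketch: keep sequences that have specified amino acids at
# particular positions and that contain a required region of amino acids.
def filter_sequences(sequences, required_positions, required_motif):
    """required_positions maps a 0-based position to the residue it must hold."""
    kept = []
    for seq in sequences:
        position_ok = all(
            pos < len(seq) and seq[pos] == residue
            for pos, residue in required_positions.items()
        )
        if position_ok and required_motif in seq:
            kept.append(seq)
    return kept

candidates = ["THKACDE", "THRACDE", "THKGGGG"]
kept = filter_sequences(candidates, {0: "T", 2: "K"}, "ACD")
# Only "THKACDE" satisfies both the positional constraints and the motif.
```

Filtering on biophysical-property similarity could follow the same shape, with the predicate replaced by a comparison against sequences of proteins known to have the desired properties.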
The machine learning architecture 102 can be implemented by one or more computing devices 144. The one or more computing devices 144 can include one or more server computing devices, one or more desktop computing devices, one or more laptop computing devices, one or more tablet computing devices, one or more mobile computing devices, or combinations thereof. In certain implementations, at least a portion of the one or more computing devices 144 can be implemented in a distributed computing environment. For example, at least a portion of the one or more computing devices 144 can be implemented in a cloud computing architecture. Additionally, although the illustrative example of
The first generative adversarial network 202 can be trained in a same or similar manner described with respect to the machine learning architecture 102 of
The first encoded target protein sequences 210 can be encoded according to a classification scheme. In addition, the first encoded target protein sequences 210 can include amino acid sequences of target proteins, where the target proteins include a scaffolding or foundational structure that can support one or more functional regions. For example, in situations where the first encoded target protein sequences 210 are human antibodies, the first encoded target protein sequences 210 can have constant regions of light chains and/or heavy chains that are representative of a particular type or class of antibody. To illustrate, the first encoded target protein sequences 210 can include antibodies that have constant regions of heavy chains that correspond to IgA antibodies.
The trained model 208 can generate amino acid sequences of proteins that have at least a portion of the functionality of the one or more template proteins in addition to the underlying structure or scaffold structure of the target proteins. In implementations, the trained model 208 can generate amino acid sequences of human antibodies that bind to an antigen with a CDR that corresponds to a CDR originally found in a mouse antibody. In additional examples, the trained model 208 can generate amino acid sequences of proteins produced from a first germline gene based on input of one or more amino acid sequences of proteins produced from a second, different germline gene.
In additional implementations, the trained model 208 can be generated without using at least one of the template protein sequences 212 or the position modification data 214. For example, the trained model 208 can be generated using the first encoded target protein sequences 210 and the first input data 216. In various implementations, the trained model 208 can be generated using training data for the first generative adversarial network 202 such that the first encoded target protein sequences 210 include amino acid sequences corresponding to one or more germline genes.
In various examples, the amino acid sequences generated by the trained model 208 can be refined further. To illustrate, the trained model 208 can be modified by being subjected to another training process using a different set of training data than the initial training process. For example, the data used for additional training of the trained model 208 can include a subset of the data used to initially produce the trained model 208. In additional examples, the data used for additional training of the trained model 208 can include a different set of data than the data used to initially produce the trained model 208. In illustrative examples, the trained model 208 can produce amino acid sequences of human antibodies with CDRs of a mouse antibody that binds to an antigen, and the trained model 208 can be further refined to generate amino acid sequences of human antibodies with CDRs originally found in the mouse antibody that have a higher probability of having at least a threshold level of expression in an environment having a specified pH range. Continuing with this example, the trained model 208 can be refined through additional training using a dataset of human antibodies that have a relatively high level of expression in the specified pH range. In the illustrative example of
Second input data 228 can be provided to the second generating component 220 and the second generating component 220 can produce one or more generated sequences 224. The second input data 228 can include a random or pseudo-random sequence of numbers that the second generating component 220 uses to produce the generated sequences 224. The second challenging component 222 can generate second classification output 226 indicating that the amino acid sequences produced by the second generating component 220 satisfy various characteristics or that the amino acid sequences produced by the second generating component 220 do not satisfy various characteristics. In illustrative examples, the second challenging component 222 can generate the classification output 226 based on similarities and differences between one or more generated sequences 224 and amino acid sequences provided to the second challenging component 222. The classification output 226 can indicate an amount of similarity or an amount of difference between the generated sequences 224 and comparison sequences provided to the second challenging component 222.
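A toy sketch of the similarity-based classification output described above follows. The scoring rule (fraction of matching positions against the closest comparison sequence) and the example sequences are hypothetical stand-ins, not the networks or data of this disclosure.

```python
# Classification output sketch: score a generated sequence by its
# similarity to the closest of the comparison sequences provided to
# the challenging component.
def classification_output(generated, comparisons):
    """Return the best per-position match fraction against any comparison sequence."""
    def similarity(a, b):
        return sum(x == y for x, y in zip(a, b)) / max(len(a), len(b))
    return max(similarity(generated, c) for c in comparisons)

score = classification_output("THKA", ["THKG", "AAAA"])  # 0.75
```

A higher score indicates a greater amount of similarity between the generated sequence and the comparison sequences; a lower score indicates a greater amount of difference.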
The amino acid sequences provided to the second challenging component 222 can be included in additional protein sequence data 230. The additional protein sequence data 230 can include amino acid sequences of proteins that have one or more specified characteristics. For example, the additional protein sequence data 230 can include amino acid sequences of proteins having a threshold level of expression in humans. In additional examples, the additional protein sequence data 230 can include amino acid sequences of proteins having one or more biophysical properties and/or one or more structural properties. To illustrate, the proteins included in the additional protein sequence data 230 can have negatively charged regions, hydrophobic regions, a relatively low probability of aggregation, a specified percentage of high molecular weight (HMW) species, a specified melting temperature, one or more combinations thereof, and the like. In various examples, the additional protein sequence data 230 can include a subset of the protein sequence data used to produce the trained model 208. By providing amino acid sequences to the second challenging component 222 that have one or more specified characteristics, the second generating component 220 can be trained to produce amino acid sequences that have at least a threshold probability of having one or more of the specified characteristics.
Additionally, in many situations where it is desired to produce amino acid sequences of proteins having specific characteristics, the number of sequences available to train a generative adversarial network is limited. In these situations, the accuracy, efficiency, and/or effectiveness of the generative adversarial network to produce amino acid sequences of proteins having the specified characteristics may be unsatisfactory. Thus, without a sufficient number of amino acid sequences available to train a generative adversarial network, the amino acid sequences produced by the generative adversarial network may not have the desired characteristics. By implementing the techniques and systems described with respect to
Before being provided to the second challenging component 222, the amino acid sequences included in the additional protein sequence data 230 can be subject to data preprocessing 232. For example, the additional protein sequence data 230 can be arranged according to a classification system before being provided to the second challenging component 222. The data preprocessing 232 can include pairing amino acids included in the amino acid sequences of proteins included in the additional protein sequence data 230 with numerical values that can represent structure-based positions within the proteins. The numerical values can include a sequence of numbers having a starting point and an ending point. The output produced by the data preprocessing 232 can include second encoded sequences 234. The second encoded sequences 234 can include a matrix indicating amino acids associated with various positions of a protein. In various examples, the second encoded sequences 234 can include a matrix having columns corresponding to different amino acids and rows that correspond to structure-based positions of proteins. For each element in the matrix, a 0 can be used to indicate the absence of an amino acid at the corresponding position and a 1 can be used to indicate the presence of an amino acid at the corresponding position. The matrix can also include an additional column that represents a gap in an amino acid sequence where there is no amino acid at a particular position of the amino acid sequence. Thus, in situations where a position represents a gap in an amino acid sequence, a 1 can be placed in the gap column with respect to the row associated with the position where an amino acid is absent. The generated sequence(s) 224 can also be represented using a vector according to a same or similar number scheme as used for the second encoded sequences 234. In some illustrative examples, the second encoded sequences 234 and the generated sequence(s) 224 can be encoded using a method that may be referred to as a one-hot encoding method.
In illustrative examples, the classification system used in the data preprocessing 232 can be the same as or similar to the classification system used in the preprocessing 134 described with respect to
The second challenging component 222 can generate output indicating whether the amino acid sequences produced by the second generating component 220 satisfy various characteristics. In various implementations, the second challenging component 222 can be a discriminator. In additional situations, such as when the second generative adversarial network 218 includes a Wasserstein GAN, the second challenging component 222 can include a critic.
In illustrative examples, based on similarities and differences between the generated sequence(s) 224 and amino acid sequences included in the additional protein sequence data 230, the second challenging component 222 can generate the classification output 226 to indicate an amount of similarity or an amount of difference between the generated sequence(s) 224 and the amino acid sequences included in the additional protein sequence data 230. In additional examples, the second challenging component 222 can implement a distance function that produces an output indicating an amount of distance between the generated sequence(s) 224 and the proteins included in the additional protein sequence data 230. In implementations where the second challenging component 222 implements a distance function, the classification output 226 can include a number from −∞ to ∞ indicating a distance between the generated sequence(s) 224 and one or more amino acid sequences included in the additional protein sequence data 230.
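The unbounded distance-style output of a critic can be sketched as the difference between the critic's mean score on real sequences and its mean score on generated ones, which is the quantity a Wasserstein critic is trained to estimate. The linear scoring function and the feature vectors here are toys, assumed purely for illustration.

```python
# Critic-distance sketch: an unbounded real-valued output, unlike a
# discriminator's probability in [0, 1].
def critic_score(features):
    """Toy critic: a fixed linear score over a per-sequence feature vector."""
    return sum(features) / len(features)

def wasserstein_estimate(real_batch, generated_batch):
    """Mean critic score on real sequences minus mean score on generated ones."""
    real_mean = sum(critic_score(f) for f in real_batch) / len(real_batch)
    fake_mean = sum(critic_score(f) for f in generated_batch) / len(generated_batch)
    return real_mean - fake_mean  # can take any value from -inf to inf

# Hypothetical feature vectors for two real and two generated sequences.
estimate = wasserstein_estimate([[1.0, 0.8], [0.9, 0.7]], [[0.2, 0.4], [0.1, 0.3]])
```

A large positive estimate indicates the critic can still easily separate generated sequences from real ones; values near zero suggest the distributions are close.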
After the second generative adversarial network 218 has undergone a training process, a modified trained model 236 can be generated that can produce sequences of proteins. The modified trained model 236 can represent the trained model 208 after being trained using the additional protein sequence data 230. In examples, the training process for the second generative adversarial network 218 can be complete after the function(s) implemented by the second generating component 220 and the second challenging component 222 converge. The convergence of a function can be based on the movement of values of model parameters toward particular values as protein sequences are generated by the second generating component 220 and feedback is obtained from the second challenging component 222. The training of the second generative adversarial network 218 can be complete when the protein sequences generated by the second generating component 220 have particular characteristics.
Additional sequence input 238 can be provided to the modified trained model 236, and the modified trained model 236 can produce generated sequences 240. The additional sequence input 238 can include a random or pseudo-random series of numbers and the generated sequences 240 can include amino acid sequences that can be sequences of proteins. In additional implementations, the generated sequences 240 can be evaluated to determine whether the generated sequences 240 have a specified set of characteristics. The evaluation of the generated sequences 240 can produce metrics that indicate characteristics of the generated sequences 240, such as biophysical properties of a protein, biophysical properties of a region of a protein, and/or the presence or absence of amino acids located at specified positions. Additionally, the metrics can indicate an amount of correspondence between the characteristics of the generated sequences 240 and a specified set of characteristics. In some examples, the metrics can indicate a number of positions of a generated sequence 240 that vary from a sequence produced by a germline gene of a protein. Further, an evaluation of the generated sequences 240 can determine the presence or absence of structural features of proteins that correspond to the generated sequences 240.
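One of the metrics mentioned above, the number of positions at which a generated sequence varies from a germline-encoded sequence, can be sketched directly. The sequences used here are hypothetical examples.

```python
# Germline divergence sketch: count aligned positions where a generated
# sequence differs from a sequence produced by a germline gene.
def germline_divergence(generated, germline):
    """Number of aligned positions at which the two sequences differ."""
    if len(generated) != len(germline):
        raise ValueError("sequences must be aligned to the same length")
    return sum(a != b for a, b in zip(generated, germline))

divergence = germline_divergence("THKACDE", "THRACDQ")  # differs at 2 positions
```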
While the illustrative example of
The computing system 302 can include one or more generative adversarial networks 304. The one or more generative adversarial networks 304 can include a conditional generative adversarial network. In various implementations, the one or more generative adversarial networks 304 can include a generating component and a challenging component. The generating component can generate amino acid sequences of proteins and the challenging component can classify the amino acid sequences produced by the generating component as being an amino acid sequence that is included in a set of training data or an amino acid sequence that is not included in the set of training data. The set of training data can include amino acid sequences of proteins that have been synthesized and characterized according to one or more analytical tests and/or one or more assays. The output of the challenging component can be based on comparisons between the amino acid sequences produced by the generating component and amino acid sequences included in the set of training data. In illustrative examples, the output of the challenging component can correspond to a probability that an amino acid sequence produced by the generating component is included in the set of training data. As the generating component produces amino acid sequences and as the challenging component produces feedback regarding the amino acid sequences produced by the generating component, the parameters and/or weightings of one or more models implemented by the challenging component and the parameters and/or weightings of one or more models implemented by the generating component can be refined until the one or more models related to the generating component and the one or more models related to the challenging component have been trained and satisfy one or more training criteria.
In implementations, the generating component can generate one or more false amino acid sequences of proteins that are not included in the set of training data to try and “trick” the challenging component into classifying the one or more false amino acid sequences of proteins as being included in the set of training data.
The one or more generative adversarial networks 302 can use amino acid sequences of one or more template proteins, such as a template protein 306, and generate one or more amino acid sequences of target proteins, such as a target protein 308. In the illustrative example of
The position modification data 326, 328, 330, 332, 334, 336 can correspond to constraints on the modification of the individual amino acids 314, 316, 318, 320, 322, 324 included in the first sequence of amino acids 310 of the template protein 306. In illustrative examples, the position modification data 326, 328, 330, 332, 334, 336 can indicate penalties that are to be applied by one or more generating components and/or one or more challenging components of the one or more generative adversarial networks 304 in response to modification of the respective individual amino acids 314, 316, 318, 320, 322, 324 in the first sequence of amino acids 310. For example, penalties included in the position modification data 326, 328, 330, 332, 334, 336 can be applied to at least one loss function of the one or more generative adversarial networks 304. In additional examples, the position modification data 326, 328, 330, 332, 334, 336 can include probabilities that individual amino acids 314, 316, 318, 320, 322, 324 in the first sequence of amino acids 310 can be modified. The position modification data 326, 328, 330, 332, 334, 336 can include numerical values related to probabilities and/or penalties corresponding to the modification of individual amino acids 314, 316, 318, 320, 322, 324 included in the first sequence of amino acids 310. To illustrate, the position modification data 326, 328, 330, 332, 334, 336 can include numerical values from 0 to 1, numerical values from −1 to 1, and/or values from 0 to 100. In additional implementations, the position modification data 326, 328, 330, 332, 334, 336 can include one or more functions, such as one or more linear functions or one or more non-linear functions, that include one or more variables that are related to the probabilities and/or penalties corresponding to modification of the individual amino acids 314, 316, 318, 320, 322, 324 included in the first sequence of amino acids 310.
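One way the per-position penalties described above could enter a loss term is as a weighted count of modified positions: modifying a position with a weight near 1 contributes heavily to the penalty, while a weight near 0 leaves the position nearly free to change. The weights and sequences in this sketch are illustrative assumptions.

```python
# Position-modification penalty sketch: each position of the template
# carries a weight in [0, 1]; a candidate pays that weight wherever it
# differs from the template. The sum could be added to a loss function.
def modification_penalty(template, candidate, position_weights):
    """Sum of weights at positions where the candidate differs from the template."""
    return sum(
        w for (a, b, w) in zip(template, candidate, position_weights) if a != b
    )

template = "THKACD"
candidate = "THRACE"
weights = [1.0, 1.0, 0.2, 0.9, 0.9, 0.1]  # near 1.0 means strongly preserve
penalty = modification_penalty(template, candidate, weights)  # 0.2 + 0.1
```

The candidate differs only at the two low-weight positions, so it incurs a small penalty; changing either of the first two positions would cost a full 1.0 each.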
In further examples, at least a portion of the position modification data 326, 328, 330, 332, 334, 336 can indicate that the amino acids located at one or more positions 314, 316, 318, 320, 322, 324 are not to be modified by the one or more generative adversarial networks 304. Also, although the illustrative example of
In various examples, the data corresponding to the first sequence of amino acids 310 of the template protein 306 can be provided to the computing system 302. The first sequence of amino acids 310 and the corresponding position modification data can be used by the one or more generative adversarial networks 304 to generate the second sequence of amino acids 312 that corresponds to the target protein 308. The target protein 308 can be related to, but different from, the template protein 306. For example, the one or more generative adversarial networks 304 can modify amino acids at one or more positions of the first sequence of amino acids 310 to produce the second sequence of amino acids 312. To illustrate, the second sequence of amino acids 312 includes amino acids 338 and 340 that correspond to amino acids 314, 316 of the first sequence of amino acids 310. That is, amino acid 314 and amino acid 338 are both threonine, and amino acid 316 and amino acid 340 are both histidine. In the illustrative example of
The target protein 308 can retain one or more characteristics of the template protein 306. The one or more characteristics of the template protein 306 can be maintained in the target protein 308 by maintaining the individual amino acids at various positions of the first sequence of amino acids 310 of the template protein 306 in the second sequence of amino acids 312 of the target protein 308. The one or more characteristics of the template protein 306 that are also present in the target protein 308 can be preserved by determining one or more positions of the first sequence of amino acids 310 that correspond to the one or more characteristics and minimizing the probability that the one or more generative adversarial networks 304 change the amino acids located at the one or more positions. Additionally, the characteristics of the amino acids in the target protein 308 that are used to replace the initial amino acids in the template protein 306 can be limited. For example, the position modification data for the first sequence of amino acids 310 can indicate that a hydrophobic amino acid is to be replaced by another hydrophobic amino acid. In this way, the target protein 308 can have one or more characteristics of the template protein 306 that are similar or the same. For example, the target protein 308 can have values of one or more biophysical properties that are within a threshold amount of the values of the one or more biophysical properties of the template protein 306. Additionally, the target protein 308 can have functionality that is similar to or the same as functionality of the template protein 306. To illustrate, the target protein 308 and the template protein 306 can both bind to a specified molecule or to a specified type of molecule.
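The hydrophobic-for-hydrophobic constraint mentioned above can be sketched as a class-restricted substitution rule. The grouping of hydrophobic residues used here is a common textbook partition, assumed only for illustration; other residue classes are left unconstrained in this toy.

```python
# Class-restricted replacement sketch: permit a substitution only if the
# replacement belongs to the same chemical class as the original residue.
HYDROPHOBIC = set("AVILMFWY")  # a common grouping; illustrative assumption

def replacement_allowed(original, replacement):
    """Toy rule: hydrophobic residues may only be replaced by hydrophobic ones."""
    if original in HYDROPHOBIC:
        return replacement in HYDROPHOBIC
    return True  # no class constraint modeled for other residues

ok = replacement_allowed("L", "V")   # hydrophobic -> hydrophobic: allowed
bad = replacement_allowed("L", "D")  # hydrophobic -> charged: disallowed
```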
In illustrative examples, the template protein 306 can include an antibody that binds to an antigen and the first sequence of amino acids 310 can be modified to the second sequence of amino acids 312 such that the target protein 308 can also bind to the antigen.
In various examples, the position modification data can indicate a penalty and/or a probability associated with changing an amino acid at one position of the template protein 306 to one or more different amino acids in the target protein 308. To illustrate, the position modification data can indicate a first penalty and/or a first probability of changing a threonine of amino acid 314 at position 114 to a serine and a second penalty and/or a second probability of changing the threonine of amino acid 314 at position 114 to a cysteine. The position modification data can, in various implementations, indicate a respective probability and/or a respective penalty for modifying an amino acid at a position of the template protein with respect to each of at least 5 other amino acids, at least 10 other amino acids, at least 15 other amino acids, or at least 20 other amino acids.
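Per-substitution position modification data of the kind described above can be sketched as a nested lookup: for a given position, each candidate replacement amino acid carries its own penalty. The position number echoes the threonine-at-114 example in the text; the penalty values and the default are hypothetical.

```python
# Per-substitution penalty sketch: position -> {replacement residue -> penalty}.
substitution_penalties = {
    114: {  # penalties for replacing the threonine at position 114
        "S": 0.1,  # serine: a conservative substitution, low penalty
        "C": 0.8,  # cysteine: a more disruptive substitution, high penalty
    },
}

def substitution_penalty(position, new_residue, default=1.0):
    """Look up the penalty for placing new_residue at position; default if unlisted."""
    return substitution_penalties.get(position, {}).get(new_residue, default)

low = substitution_penalty(114, "S")   # 0.1
high = substitution_penalty(114, "C")  # 0.8
```

Extending the inner mapping to all 20 amino acids would give the per-position tables of at least 5, 10, 15, or 20 alternatives described in the text.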
The one or more generative adversarial networks 304 can modify template proteins produced by one organism to generate target proteins that correspond to a different organism. For example, the template protein 306 can be produced by mice and the first sequence of amino acids 310 can be modified such that the second sequence of amino acids 312 corresponds to a human protein. In an additional example, the template protein 306 can be produced by humans and the first sequence of amino acids 310 can be modified such that the second sequence of amino acids 312 corresponds to an equine protein. Additionally, the one or more generative adversarial networks 304 can modify template proteins that are produced by one or more genes of a germline to generate proteins that correspond to different germline genes. In illustrative examples, modifications of one or more amino acids of a germline gene of an antibody within a species can have an effect on one or more characteristics of the antibody (e.g., expression level, yield, variable region stability) while maintaining an amount of binding capacity to a specified antigen. Further, in situations where the one or more generative adversarial networks 304 modify amino acid sequences of antibodies, the one or more generative adversarial networks 304 can modify template proteins that correspond to a first antibody isotype, such as an IgE isotype antibody, to generate target antibodies that correspond to a second antibody isotype, such as an IgG isotype antibody.
The template antibody 406 can include a first light chain 416. The first light chain 416 can include a variable region having a number of framework regions and a number of hypervariable regions. In various instances, the hypervariable regions can be referred to herein as complementarity determining regions (CDRs). In the illustrative example of
The template antibody 406 can also include a first heavy chain 432. The first heavy chain 432 can include a variable region having a number of framework regions and a number of hypervariable regions. The first heavy chain 432 can include a first framework region 434, a second framework region 436, a third framework region 438, and a fourth framework region 440. Further, the first heavy chain 432 can include a first CDR 442, a second CDR 444, and a third CDR 446. Although not shown in the illustrative example of
The antigen binding region of the first light chain 416 and the antigen binding region of the first heavy chain 432 can have a shape that corresponds to a shape and a chemical profile of the antigen 414. In various examples, at least a portion of the CDRs 426, 428, 430 of the first light chain 416 and at least a portion of the CDRs 442, 444, 446 of the first heavy chain 432 can include amino acids that interact with amino acids of an epitope region of the antigen 414. In this way, amino acids of at least a portion of the CDRs 426, 428, 430, 442, 444, 446 can interact with amino acids of the antigen 414 through at least one of electrostatic interactions, hydrogen bonds, van der Waals forces, or hydrophobic interactions.
Although not shown in the illustrative example of
The one or more generative adversarial networks 404 can generate the target antibody 410 using the amino acid sequences of the regions of the template antibody 406. The target antibody 410 can have one or more portions with amino acid sequences that are different from portions of the amino acid sequence of the template antibody 406. The portions of the amino acid sequence of the template antibody 406 that are changed in relation to the amino acid sequence of the target antibody 410 can be modified such that the target antibody 410 corresponds more closely to an antibody produced by a different species than an antibody produced by a species related to the template antibody 406. In one or more illustrative examples, the one or more generative adversarial networks 404 can modify amino acids included in the variable region of the first light chain 416 and/or amino acids included in the variable region of the first heavy chain 432 to produce the target antibody 410. In various illustrative examples, the one or more generative adversarial networks 404 can modify amino acids included in at least one of one or more of the CDRs 426, 428, 430 of the first light chain 416 or one or more of the CDRs 442, 444, 446 of the first heavy chain 432 to produce the target antibody 410.
The target antibody 410 can include a second light chain 448. The second light chain 448 can correspond to the first light chain 416. In various examples, at least one amino acid of the second light chain 448 can be different from at least one amino acid of the first light chain 416. The second light chain 448 can include a variable region having a number of framework regions and a number of hypervariable regions. The second light chain 448 can include a first framework region 450, a second framework region 452, a third framework region 454, and a fourth framework region 456. Additionally, the second light chain 448 can include a first CDR 458, a second CDR 460, and a third CDR 462. Although not shown in the illustrative example of
The target antibody 410 can also include a second heavy chain 464. The second heavy chain 464 can correspond to the first heavy chain 432. In one or more implementations, at least one amino acid of the second heavy chain 464 can be different from at least one amino acid of the first heavy chain 432. The second heavy chain 464 can include a variable region having a number of framework regions and a number of hypervariable regions. The second heavy chain 464 can include a first framework region 466, a second framework region 468, a third framework region 470, and a fourth framework region 472. Further, the second heavy chain 464 can include a first CDR 474, a second CDR 476, and a third CDR 478. Although not shown in the illustrative example of
Although the second light chain 448 can have a different amino acid sequence than the first light chain 416 and/or the second heavy chain 464 can have a different amino acid sequence than the first heavy chain 432, the antigen binding region of the second light chain 448 and the antigen binding region of the second heavy chain 464 can have a shape and a chemical profile that correspond to a shape and a chemical profile of the antigen 414. In various examples, at least a portion of the CDRs 458, 460, 462 of the second light chain 448 and at least a portion of the CDRs 474, 476, 478 of the second heavy chain 464 can include amino acids that interact with amino acids of an epitope region of the antigen 414. In this way, amino acids of at least a portion of the CDRs 458, 460, 462, 474, 476, 478 can interact with amino acids of the antigen 414 through at least one of electrostatic interactions, hydrogen bonds, van der Waals forces, or hydrophobic interactions.
In various implementations, for each antibody isotype, such as IgA, IgD, IgE, IgG, or IgM, the light chain constant regions can be comprised of a same or similar sequence of amino acids, and the respective heavy chain constant regions can be comprised of a same or similar sequence of amino acids.
The machine learning architecture 502 can include a generating component 504 and a challenging component 506. The generating component 504 can implement one or more models to generate amino acid sequences based on input provided to the generating component 504. In various implementations, the one or more models implemented by the generating component 504 can include one or more functions. The challenging component 506 can generate output indicating whether the amino acid sequences produced by the generating component 504 satisfy various characteristics. The output produced by the challenging component 506 can be provided to the generating component 504, and the one or more models implemented by the generating component 504 can be modified based on the feedback provided by the challenging component 506. The challenging component 506 can compare the amino acid sequences produced by the generating component 504 with amino acid sequences of a library of target proteins and generate an output indicating an amount of correspondence between the amino acid sequences produced by the generating component 504 and the amino acid sequences of the target proteins provided to the challenging component 506.
In various implementations, the machine learning architecture 502 can implement one or more neural network technologies. For example, the machine learning architecture 502 can implement one or more recurrent neural networks, one or more convolutional neural networks, or a combination of recurrent neural networks and convolutional neural networks. In various examples, the machine learning architecture 502 can include a generative adversarial network (GAN). In these situations, the generating component 504 can include a generator and the challenging component 506 can include a discriminator. In additional situations, such as when the machine learning architecture 502 includes a Wasserstein GAN, the challenging component 506 can include a critic. In further implementations, the machine learning architecture 502 can include a conditional generative adversarial network (cGAN).
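The generate, challenge, and update cycle between the generating component 504 and the challenging component 506 can be sketched in simplified form. The snippet below is not an actual adversarial network: the "generator" is a per-position probability table and the "challenger" scores samples against the consensus of a small invented library, so all sequences, scores, and update rules are illustrative assumptions.

```python
import random

random.seed(1)

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
LIBRARY = ["QVQL", "QVKL", "QVQM"]            # invented stand-in target library
CONSENSUS = [max(set(col), key=col.count)     # most common residue per position
             for col in zip(*LIBRARY)]

# "Generator": independent amino acid probabilities at each position.
probs = [{aa: 1 / 20 for aa in AMINO_ACIDS} for _ in CONSENSUS]

def generate():
    return "".join(random.choices(list(p), weights=list(p.values()))[0]
                   for p in probs)

def challenge(seq):
    # "Challenger": fraction of positions matching the library consensus.
    return sum(a == b for a, b in zip(seq, CONSENSUS)) / len(seq)

# Feedback loop: residues the challenger scored as library-like are reinforced.
for _ in range(2000):
    seq = generate()
    if challenge(seq) > 0:
        for pos, residue in enumerate(seq):
            if residue == CONSENSUS[pos]:
                probs[pos][residue] += 0.05
```

After the loop, samples from `generate()` tend to reproduce the consensus residues, mimicking how feedback from the challenging component 506 moves the generating component 504 toward the target library.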
The generated sequence(s) 510 can be analyzed by the challenging component 506 against sequences of proteins included in protein sequence data 512. The protein sequence data 512 can be training data for the machine learning architecture 502 and can be encoded according to a schema. The protein sequence data 512 can include sequences of proteins obtained from one or more data sources that store amino acid sequences of proteins. The one or more data sources can include one or more websites from which information corresponding to amino acid sequences of target proteins is extracted. Additionally, the one or more data sources can include electronic versions of research documents from which amino acid sequences of target proteins can be extracted. The protein sequence data 512 can be stored in one or more data stores that are accessible to the machine learning architecture 502. The one or more data stores can be connected to the machine learning architecture 502 via a wireless network, a wired network, or combinations thereof. The protein sequence data 512 can be obtained by the machine learning architecture 502 based on requests sent to the one or more data stores to retrieve one or more portions of the protein sequence data 512.
In one or more examples, the protein sequence data 512 can include amino acid sequences of fragments of proteins. For example, the protein sequence data 512 can include sequences of at least one of light chains of antibodies or heavy chains of antibodies. In addition, the protein sequence data 512 can include sequences of at least one of variable regions of antibody light chains, variable regions of antibody heavy chains, constant regions of antibody light chains, constant regions of antibody heavy chains, hinge regions of antibodies, or antigen binding sites of antibodies. In one or more illustrative examples, the protein sequence data 512 can include sequences of complementarity determining regions (CDRs) of antibodies, such as at least one of CDR1, CDR2, or CDR3. In one or more additional illustrative examples, the protein sequence data 512 can include sequences of fragments of T-cell receptors. To illustrate, the protein sequence data 512 can include sequences of antigen binding sites of T-cell receptors, such as one or more CDRs of T-cell receptors.
The amino acid sequences included in the protein sequence data 512 can be subject to data preprocessing 514 before being provided to the challenging component 506. For example, the protein sequence data 512 can be arranged according to a classification system before being provided to the challenging component 506. The data preprocessing 514 can include pairing amino acids included in the target proteins of the protein sequence data 512 with numerical values that can represent structure-based positions within the proteins. The numerical values can include a sequence of numbers having a starting point and an ending point. In an illustrative example, a T can be paired with the number 43, indicating that a threonine residue is located at structure-based position 43 of a specified protein domain type. In illustrative examples, structure-based numbering can be applied to any general protein type, such as fibronectin type III (FNIII) proteins, avimers, antibodies, VHH domains, kinases, zinc fingers, T-cell receptors, and the like.
In various implementations, the classification system implemented by the data preprocessing 514 can include a numbering system that encodes structural position for amino acids located at respective positions of proteins. In this way, proteins having different numbers of amino acids can be aligned according to structural features. For example, the classification system can designate that portions of proteins having particular functions and/or characteristics have a specified number of positions. In various situations, not all of the positions included in the classification system may be associated with an amino acid because the number of amino acids in a particular region of a protein may vary between proteins. In additional examples, the structure of a protein can be reflected in the classification system. To illustrate, positions of the classification system that are not associated with a respective amino acid can indicate various structural features of a protein, such as a turn or a loop. In an illustrative example, a classification system for antibodies can indicate that heavy chain regions, light chain regions, and hinge regions have a specified number of positions assigned to them, and the amino acids of the antibodies can be assigned to the positions according to the classification system. In one or more implementations, the data preprocessing 514 can use Antibody Structural Numbering (ASN) to classify individual amino acids located at respective positions of an antibody.
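As a hedged sketch of this alignment idea, the snippet below places variable-length sequences onto a fixed set of classification-system positions, leaving gaps where a protein has no residue at a position. The six-position scheme and the sequences are invented for illustration; real schemes such as ASN define the actual positions.

```python
SCHEME_POSITIONS = [1, 2, 3, 4, 5, 6]   # assumed fixed scheme positions

def align(sequence, occupied):
    """Place residues at the scheme positions listed in `occupied`;
    positions with no residue (e.g. a shortened loop) become gaps ("-")."""
    by_position = dict(zip(occupied, sequence))
    return "".join(by_position.get(p, "-") for p in SCHEME_POSITIONS)

# A 6-residue and a 4-residue protein aligned to the same six positions:
print(align("QVTLES", [1, 2, 3, 4, 5, 6]))   # QVTLES
print(align("QVES", [1, 2, 5, 6]))           # QV--ES
```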
The output produced by the data preprocessing 514 can include encoded sequences 516. The encoded sequences 516 can include a matrix indicating amino acids associated with various positions of a protein. In examples, the encoded sequences 516 can include a matrix having columns corresponding to different amino acids and rows that correspond to structure-based positions of proteins. For each element in the matrix, a 0 can be used to indicate the absence of an amino acid at the corresponding position and a 1 can be used to indicate the presence of an amino acid at the corresponding position. The matrix can also include an additional column that represents a gap in an amino acid sequence where there is no amino acid at a particular position of the amino acid sequence. Thus, in situations where a position represents a gap in an amino acid sequence, a 1 can be placed in the gap column with respect to the row associated with the position where an amino acid is absent. The generated sequence(s) 510 can also be represented using a vector according to a same or similar numbering scheme as that used for the encoded sequences 516. In some illustrative examples, the encoded sequences 516 and the generated sequence(s) 510 can be encoded using a one-hot encoding method.
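The one-hot matrix encoding with a gap column can be sketched as follows. The 21-symbol alphabet ordering (20 amino acids plus a gap column "-") and the example sequence are assumptions for illustration; the disclosure does not fix a column order.

```python
ALPHABET = "ACDEFGHIKLMNPQRSTVWY-"   # 20 amino acids plus a gap column

def one_hot_encode(sequence):
    """One row per structure-based position, one column per symbol;
    a single 1 in each row marks the amino acid (or gap) present."""
    matrix = []
    for symbol in sequence:
        row = [0] * len(ALPHABET)
        row[ALPHABET.index(symbol)] = 1
        matrix.append(row)
    return matrix

encoded = one_hot_encode("QV-T")         # "-" marks a position with no amino acid
print(encoded[2][ALPHABET.index("-")])   # 1: the gap column is set at row 2
```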
In one or more examples, the challenging component 506 can generate the classification output 518 to indicate an amount of similarity or an amount of difference between the generated sequence(s) 510 and the amino acid sequences included in the protein sequence data 512. In one or more examples, the challenging component 506 can label the generated sequence(s) 510 as 0 and the encoded sequences obtained from the protein sequence data 512 as 1. In these situations, the classification output 518 can include a number from 0 to 1 with respect to one or more amino acid sequences included in the protein sequence data 512.
In one or more additional examples, the challenging component 506 can implement a distance function that produces an output that indicates an amount of distance between the generated sequence(s) 510 and the protein sequences included in the protein sequence data 512. In implementations where the challenging component 506 implements a distance function, the classification output 518 can include a number from −∞ to ∞ indicating a distance between the generated sequence(s) 510 and one or more sequences included in the protein sequence data 512.
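The two output styles can be contrasted with a toy sketch: a discriminator-style score squashed into a 0-to-1 range and a critic-style unbounded score. A Hamming distance between sequences stands in for the learned functions here, which is an illustrative assumption, not the disclosed implementation.

```python
def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))

def discriminator_score(generated, real):
    # Squashed into (0, 1]; 1 means identical to the real sequence.
    return 1 / (1 + hamming(generated, real))

def critic_score(generated, real):
    # Signed, unbounded stand-in for a Wasserstein critic's output.
    return -float(hamming(generated, real))

print(discriminator_score("QVTL", "QVKL"))  # 0.5
print(critic_score("QVTL", "QVKL"))         # -1.0
```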
The data used to train the machine learning architecture 502 can impact the amino acid sequences produced by the generating component 504. For example, in situations where CDRs of antibodies are included in the protein sequence data 512 provided to the challenging component 506, the amino acid sequences generated by the generating component 504 can correspond to amino acid sequences of antibody CDRs. In another example, in scenarios where the amino acid sequences included in the protein sequence data 512 provided to the challenging component 506 correspond to CDRs of T-cell receptors, the amino acid sequences produced by the generating component 504 can correspond to sequences of CDRs of T-cell receptors.
After the machine learning architecture 502 has undergone a training process, a trained model 518 can be generated that can produce sequences of proteins. The trained model 518 can include the generating component 504 after a training process has been performed using the protein sequence data 512. In one or more illustrative examples, the trained model 518 can include a number of weights and/or a number of parameters of a convolutional neural network. The training process for the machine learning architecture 502 can be complete after the function(s) implemented by the generating component 504 and the function(s) implemented by the challenging component 506 converge. The convergence of a function can be based on the movement of values of model parameters toward particular values as protein sequences are generated by the generating component 504 and feedback is obtained from the challenging component 506. In various implementations, the training of the machine learning architecture 502 can be complete when the protein sequences produced by the generating component 504 have particular characteristics. For example, the amino acid sequences generated by the generating component 504 can be analyzed by a software tool that determines at least one of biophysical properties of the amino acid sequences, structural features of the amino acid sequences, or adherence to amino acid sequences corresponding to one or more protein germlines. The machine learning architecture 502 can produce the trained model 518 in situations where the amino acid sequences produced by the generating component 504 are determined by the software tool to have one or more specified characteristics. In one or more implementations, the trained model 518 can be included in a target protein system 520 that generates sequences of target proteins.
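As a minimal stand-in for the property-checking software tool mentioned above, the snippet below accepts a generated sequence only if a crude proxy for a biophysical property falls within an assumed range. The hydrophobic residue set and the 0.5 threshold are illustrative assumptions, not values from the disclosure.

```python
HYDROPHOBIC = set("AVILMFWYC")   # assumed hydrophobic residue set

def passes_checks(seq, max_hydrophobic_fraction=0.5):
    """Accept a sequence whose hydrophobic fraction is below the threshold."""
    fraction = sum(aa in HYDROPHOBIC for aa in seq) / len(seq)
    return fraction <= max_hydrophobic_fraction

print(passes_checks("QVTLES"))   # True  (2 of 6 residues hydrophobic)
print(passes_checks("AVILMF"))   # False (all residues hydrophobic)
```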
Protein sequence input 522 can be provided to the trained model 518, and the trained model 518 can produce protein fragment sequences 524. The protein sequence input 522 can include an input vector that can include a random or pseudo-random series of numbers. In one or more illustrative examples, the protein fragment sequences 524 produced by the trained model 518 can be represented as a matrix structure that is the same as or similar to the matrix structure used to represent the encoded sequences 516 and/or the generated sequence(s) 510. In various implementations, the matrices produced by the trained model 518 that comprise the protein fragment sequences 524 can be decoded to produce a string of amino acids that correspond to the sequence of a protein fragment. The protein fragment sequences 524 can include sequences of at least portions of fibronectin type III (FNIII) proteins, avimers, VHH domains, antibodies, kinases, zinc fingers, T-cell receptors, and the like. In one or more illustrative examples, the protein fragment sequences 524 can include sequences of fragments of antibodies. For example, the protein fragment sequences 524 can correspond to portions of one or more antibody subtypes, such as immunoglobulin A (IgA), immunoglobulin D (IgD), immunoglobulin E (IgE), immunoglobulin G (IgG), or immunoglobulin M (IgM). In one or more examples, the protein fragment sequences 524 can include sequences of at least one of one or more antibody light chain variable regions, one or more antibody heavy chain variable regions, one or more antibody light chain constant regions, one or more antibody heavy chain constant regions, or one or more antibody hinge regions. Further, the protein fragment sequences 524 can correspond to additional proteins that bind antigens.
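Decoding a generated matrix back into an amino acid string, the inverse of the one-hot encoding, can be sketched as below; the alphabet ordering and the example sequence are assumptions for illustration.

```python
ALPHABET = "ACDEFGHIKLMNPQRSTVWY-"   # 20 amino acids plus a gap symbol

def decode(matrix):
    """Map each one-hot row back to its symbol, dropping gap positions."""
    residues = (ALPHABET[row.index(1)] for row in matrix)
    return "".join(r for r in residues if r != "-")

# Build a small one-hot matrix for "QV-T" (the "-" row marks a gap).
matrix = [[1 if aa == symbol else 0 for aa in ALPHABET] for symbol in "QV-T"]
print(decode(matrix))   # QVT
```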
In still other examples, the protein fragment sequences 524 can correspond to amino acid sequences that participate in protein-to-protein interactions, such as proteins that have regions that bind to antigens or regions that bind to other molecules.
The target protein system 520 can combine one or more protein fragment sequences 524 with one or more template protein sequences 526 to produce one or more target protein sequences 528. The template protein sequences 526 can include amino acid sequences of portions of proteins that can be combined with the protein fragment sequences 524. For example, a protein fragment sequence 524 can include an amino acid sequence of a variable region of an antibody light chain and a template protein sequence 526 can include an amino acid sequence of a remainder of an antibody. To illustrate, the template protein sequence 526 can include an amino acid sequence that includes a constant region of an antibody light chain. In these scenarios, the target protein sequences 528 can include an amino acid sequence of an antibody light chain. In one or more additional examples, one or more protein fragment sequences 524 can include an amino acid sequence of a variable region of an antibody light chain and an amino acid sequence of a variable region of an antibody heavy chain, and one or more template protein sequences 526 can include amino acid sequences of a constant region of an antibody light chain, a first constant region of an antibody heavy chain, a hinge region of an antibody heavy chain, a second constant region of an antibody heavy chain, and a third constant region of an antibody heavy chain. In these instances, the target protein sequences 528 can include an amino acid sequence of an antibody light chain coupled with an antibody heavy chain.
The target protein system 520 can determine one or more locations of one or more missing amino acids in a template protein sequence 526 and determine one or more amino acids included in one or more protein fragment sequences 524 that can be used to supply the one or more missing amino acids. In various examples, the template protein sequences 526 can indicate locations of missing amino acids within individual template protein sequences 526. In one or more illustrative examples, the trained model 518 can produce protein fragment sequences 524 that correspond to amino acid sequences of antigen binding regions of one or more antibodies. In these scenarios, the target protein system 520 can determine that the template protein sequences 526 are missing at least a portion of the antigen binding regions of one or more antibodies. The target protein system 520 can then extract an amino acid sequence included in the protein fragment sequences 524 that corresponds to a missing amino acid sequence of an antigen binding region of a template protein sequence 526. The target protein system 520 can combine the amino acid sequence obtained from the protein fragment sequences 524 with a template protein sequence 526 to generate a target protein sequence 528 that includes the template protein sequence 526 with the antigen binding region supplied by one or more of the protein fragment sequences 524.
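Combining a generated fragment with a template that marks its missing region can be sketched as below. The "?" placeholder convention and the example sequences are illustrative assumptions; the disclosure does not specify how missing positions are represented.

```python
def fill_template(template, fragment):
    """Replace the run of '?' placeholders in `template` with `fragment`."""
    start = template.index("?")
    end = start + template.count("?")
    assert len(fragment) == end - start, "fragment must fit the gap"
    return template[:start] + fragment + template[end:]

template = "EVQL????WGQG"    # template missing a CDR-like region
fragment = "ARDY"            # e.g. supplied by the protein fragment sequences
print(fill_template(template, fragment))   # EVQLARDYWGQG
```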
In one or more implementations, the target protein sequences 528 can be subject to sequence filtering. The sequence filtering can parse the target protein sequences 528 to identify one or more of the target protein sequences 528 that correspond to one or more characteristics. For example, the target protein sequences 528 can be analyzed to identify amino acid sequences that have specified amino acids at particular positions. One or more of the target protein sequences 528 can also be filtered to identify amino acid sequences having one or more particular strings or regions of amino acids. In various implementations, the target protein sequences 528 can be filtered to identify amino acid sequences that are associated with a set of biophysical properties based at least partly on similarities between at least one of the target protein sequences 528 and amino acid sequences of additional proteins having the set of biophysical properties.
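The sequence filtering can be sketched as a simple predicate over candidate sequences; the required position, residue, and motif used here are invented examples, not criteria from the disclosure.

```python
def keep(seq, position=2, residue="T", motif="WGQG"):
    """Keep sequences with `residue` at `position` that also contain `motif`."""
    return len(seq) > position and seq[position] == residue and motif in seq

candidates = ["EVTLARDYWGQG", "EVQLARDYWGQG", "EVTLARDY"]
filtered = [s for s in candidates if keep(s)]
print(filtered)   # ['EVTLARDYWGQG']
```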
The machine learning architecture 502 can be implemented by one or more computing devices 530. The one or more computing devices 530 can include one or more server computing devices, one or more desktop computing devices, one or more laptop computing devices, one or more tablet computing devices, one or more mobile computing devices, or combinations thereof. In certain implementations, at least a portion of the one or more computing devices 530 can be implemented in a distributed computing environment. For example, at least a portion of the one or more computing devices 530 can be implemented in a cloud computing architecture.
At operation 604, the method 600 can include obtaining second data indicating additional amino acid sequences corresponding to additional proteins having one or more specified characteristics. The one or more specified characteristics can correspond to one or more biophysical properties. The one or more specified characteristics can also correspond to amino acid sequences that can be included in certain types of proteins. For example, the one or more specified characteristics can correspond to amino acid sequences included in human antibodies. To illustrate, the one or more specified characteristics can correspond to amino acid sequences included in framework regions of variable regions of human antibodies. Additionally, the one or more specified characteristics can correspond to amino acid sequences produced by one or more germline genes of human antibodies. The additional proteins can have similarities in relation to the template protein, but the functional region of the template protein may be absent from the additional proteins. For example, the additional proteins can correspond to antibodies, but the antibodies may not bind to the antigen that binds to the functional region of the template protein. In illustrative implementations, the template protein can be produced by a first mammal and the additional proteins can correspond to antibodies produced by a second mammal, such as a human. In these situations, the amino acid sequences included in the second data can include amino acid sequences of human antibodies. In various implementations, the second data can be used as training data for a generative adversarial network.
In addition, at operation 606, the method 600 can include determining position modification data indicating probabilities that amino acids located at positions of the template protein are modifiable. In one or more illustrative examples, the position modification data can indicate that first probabilities to modify amino acids located in a binding region are no greater than about 5% and that second probabilities to modify amino acids located in one or more portions of additional, non-binding regions of a protein are at least 40%. The position modification data can also include penalties for changing amino acids of the amino acid sequence of the template protein. In various examples, the position modification data can be based on a type of amino acid at a position of the amino acid sequence of the template protein. Additionally, the position modification data can be based on a type of amino acid that is replacing an amino acid located at a position of the template protein. For example, the position modification data can indicate a first penalty for modifying amino acids of the template protein having one or more hydrophobic regions and a second penalty that is different than the first penalty for modifying an amino acid of the template protein that is positively charged. Further, the position modification data can indicate a first penalty for modifying an amino acid of the template protein having one or more hydrophobic regions to another amino acid having one or more hydrophobic regions and a second penalty that is different from the first penalty for modifying the amino acid of the template protein having one or more hydrophobic regions to a positively charged amino acid.
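Applying position modification data can be sketched as sampling mutations with per-position probabilities, low inside the binding region and higher elsewhere, mirroring the roughly 5% and 40% figures above. The sequence, region bounds, and sampling scheme are illustrative assumptions, not the generative adversarial network's actual mechanism.

```python
import random

random.seed(0)

sequence = list("EVQLARDYWGQG")
binding_region = range(4, 8)          # assumed binding-region positions

# Low modification probability inside the binding region, higher elsewhere.
modify_prob = [0.05 if i in binding_region else 0.40
               for i in range(len(sequence))]

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
variant = [random.choice(AMINO_ACIDS) if random.random() < p else aa
           for aa, p in zip(sequence, modify_prob)]
print("".join(variant))
```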
Further, at operation 608, the method 600 can include generating amino acid sequences that are variants of the amino acid sequence of the template protein and that have at least a portion of the one or more specified characteristics. The amino acid sequences of the target proteins can be generated using one or more machine learning techniques. In various examples, the amino acid sequences of the variant proteins can be produced using a conditional generative adversarial network.
The amino acid sequences of the variant proteins can have a region that corresponds to the functional region of the template protein, but can have supporting scaffolds or underlying structures, such as one or more framework regions, that are different from those of the template protein. For example, the template protein can be an antibody that binds to an antigen, while the variant proteins can include antibodies that also bind to the antigen but that have one or more features different from features of the template protein and that would not otherwise have a binding region for the antigen without first being modified. In an illustrative example, the template protein can include a human antibody that includes a binding region that binds to an antigen, and the additional amino acid sequences can include human antibodies that have one or more biophysical properties that are different from the biophysical properties of the template protein and that do not bind to the antigen. After being trained using the additional amino acid sequences, the amino acid sequence of the template protein, and the position modification data, a generative adversarial network can produce amino acid sequences of variant antibodies that include the binding region of the template protein and that include at least a portion of the biophysical properties of the additional proteins.
In additional illustrative examples, the template protein can correspond to an antibody produced by a mouse that includes a binding region that binds to an antigen. Further, the additional amino acid sequences can correspond to human antibodies that do not bind to the antigen. After being trained using the additional amino acid sequences, the amino acid sequence of the template protein, and the position modification data, a generative adversarial network can produce amino acid sequences of variant antibodies that correspond to human antibodies instead of mouse antibodies and that include the binding region of the template antibody to bind to the antigen. In various examples, the generative adversarial network can modify framework regions of the variable regions of the template mouse antibody to correspond to framework regions of human antibodies. Additionally, the generative adversarial network can produce the variant amino acid sequences of the human antibody such that the amino acid sequence of the binding region of the mouse antibody is present in the variant amino acid sequences and such that the binding region is stable and forms a shape that binds to the antigen.
At operation 704, the method 700 includes obtaining second data indicating a plurality of amino acid sequences corresponding to human antibodies. In addition, at operation 706, the method 700 includes determining position modification data indicating probabilities that amino acids located at positions of the template antibody are modifiable. The position modification data can indicate that some positions of the template antibody have relatively high probabilities of being modified and that other positions of the template antibody can have relatively low probabilities of being modified. Positions of the template antibody having relatively high probabilities of being modified can include amino acids at positions that, if modified, are less likely to affect a functional region of the template antibody. Further, the positions of the template antibody having relatively low probabilities of being modified can include amino acids at positions that, if modified, are more likely to affect a functional region of the template antibody. In one or more illustrative examples, the position modification data can indicate that first probabilities to modify amino acids located in an antigen binding region are no greater than about 5% and that second probabilities to modify amino acids located in one or more portions of at least one of the one or more heavy chain framework regions or the one or more light chain framework regions of an antibody are at least 40%. In various examples, the position modification data can indicate penalties that are to be applied by a generative adversarial network to modification of amino acids at positions of the template protein when the generative adversarial network is generating amino acid sequences of target antibodies.
At 708, the method 700 includes generating, using a generative adversarial network, a model to produce amino acid sequences that correspond to human antibodies and that have at least a threshold amount of identity with respect to a binding region of the template antibody. Further, at 710, the method 700 includes generating, using the model, target amino acid sequences based on the position modification data and the template antibody amino acid sequence. In illustrative examples, the amino acid sequences produced by the generative adversarial network can have a scaffolding or underlying structure of human antibodies while having a region that corresponds to the functional region of the template antibody. For example, the amino acid sequences can have constant regions having at least a threshold amount of identity with human antibodies and additional regions, such as CDRs, having a second threshold amount of identity with the functional region of the template antibody.
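The threshold-identity comparison can be sketched as a percent-identity check between a candidate's binding region and the template's; the sequences and the 90% figure are invented examples, not values from the disclosure.

```python
def percent_identity(a, b):
    """Percent of matching positions between two aligned sequences."""
    return 100.0 * sum(x == y for x, y in zip(a, b)) / max(len(a), len(b))

template_binding = "ARDYSSGW"     # assumed template binding-region sequence
candidate_binding = "ARDYSSGF"    # candidate with one substitution

identity = percent_identity(template_binding, candidate_binding)
print(identity)                   # 87.5, below an assumed 90% threshold
```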
The instructions 824 transform the general, non-programmed machine 800 into a particular machine 800 programmed to carry out the described and illustrated functions in the manner described. In additional implementations, the machine 800 can operate as a standalone device or can be coupled (e.g., networked) to other machines. In a networked deployment, the machine 800 can operate in the capacity of a server machine or a client machine in a server-client network environment, or as a peer machine in a peer-to-peer (or distributed) network environment. The machine 800 can comprise, but is not limited to, a server computer, a client computer, a personal computer (PC), a tablet computer, a laptop computer, a netbook, a personal digital assistant (PDA), a mobile computing device, a wearable device (e.g., a smart watch), a web appliance, a network router, a network switch, a network bridge, or any machine capable of executing the instructions 824, sequentially or otherwise, that specify actions to be taken by the machine 800. Further, while only a single machine 800 is illustrated, the term “machine” shall also be taken to include a collection of machines 800 that individually or jointly execute the instructions 824 to perform any one or more of the methodologies discussed herein.
Examples of computing device 800 can include logic, one or more components, circuits (e.g., modules), or mechanisms. Circuits are tangible entities configured to perform certain operations. In an example, circuits can be arranged (e.g., internally or with respect to external entities such as other circuits) in a specified manner. In an example, one or more computer systems (e.g., a standalone, client or server computer system) or one or more hardware processors (processors) can be configured by software (e.g., instructions, an application portion, or an application) as a circuit that operates to perform operations as described herein. Software can reside (1) on a non-transitory machine readable medium or (2) in a transmission signal. In an example, the software, when executed by the underlying hardware of the circuit, causes the circuit to perform the operations.
A circuit can be implemented mechanically or electronically. For example, a circuit can comprise dedicated circuitry or logic that is specifically configured to perform one or more techniques such as discussed above, such as including a special-purpose processor, a field programmable gate array (FPGA) or an application-specific integrated circuit (ASIC). In an example, a circuit can comprise programmable logic (e.g., circuitry, as encompassed within a general-purpose processor or other programmable processor) that can be temporarily configured (e.g., by software) to perform the certain operations. It will be appreciated that the decision to implement a circuit mechanically (e.g., in dedicated and permanently configured circuitry), or in temporarily configured circuitry (e.g., configured by software) can be driven by cost and time considerations.
Accordingly, the term “circuit” is understood to encompass a tangible entity, be that an entity that is physically constructed, permanently configured (e.g., hardwired), or temporarily (e.g., transitorily) configured (e.g., programmed) to operate in a specified manner or to perform specified operations. In an example, given a plurality of temporarily configured circuits, each of the circuits need not be configured or instantiated at any one instance in time. For example, where the circuits comprise a general-purpose processor configured via software, the general-purpose processor can be configured as respective different circuits at different times. Software can accordingly configure a processor, for example, to constitute a particular circuit at one instance of time and to constitute a different circuit at a different instance of time.
In an example, circuits can provide information to, and receive information from, other circuits. In this example, the circuits can be regarded as being communicatively coupled to one or more other circuits. Where multiple of such circuits exist contemporaneously, communications can be achieved through signal transmission (e.g., over appropriate circuits and buses) that connect the circuits. In embodiments in which multiple circuits are configured or instantiated at different times, communications between such circuits can be achieved, for example, through the storage and retrieval of information in memory structures to which the multiple circuits have access. For example, one circuit can perform an operation and store the output of that operation in a memory device to which it is communicatively coupled. A further circuit can then, at a later time, access the memory device to retrieve and process the stored output. In various examples, circuits can be configured to initiate or receive communications with input or output devices and can operate on a resource (e.g., a collection of information).
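The storage-and-retrieval pattern described above can be sketched in software terms: one component performs an operation and writes its output to a shared store, and a second component, running at a later time, retrieves and processes that output. The store and function names below are illustrative only, with a simple in-memory dictionary standing in for the memory device.

```python
# A minimal sketch of time-shifted communication between two "circuits"
# through shared storage; the dict stands in for the memory device.
shared_memory = {}

def producer_circuit(value):
    # First circuit: performs an operation and stores the output in
    # the memory device to which it is communicatively coupled.
    result = value * 2
    shared_memory["result"] = result

def consumer_circuit():
    # Second circuit: at a later time, accesses the memory device to
    # retrieve and process the stored output.
    stored = shared_memory["result"]
    return stored + 1

producer_circuit(10)
# The consumer need not run at the same time as the producer; it only
# needs access to the shared store.
print(consumer_circuit())  # prints 21
```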
The various operations of method examples described herein can be performed, at least partially, by one or more processors that are temporarily configured (e.g., by software) or permanently configured to perform the relevant operations. Whether temporarily or permanently configured, such processors can constitute processor-implemented circuits that operate to perform one or more operations or functions. In an example, the circuits referred to herein can comprise processor-implemented circuits.
Similarly, the methods described herein can be at least partially processor-implemented. For example, at least some of the operations of a method can be performed by one or more processors or processor-implemented circuits. The performance of certain of the operations can be distributed among the one or more processors, not only residing within a single machine, but deployed across a number of machines. In an example, the processor or processors can be located in a single location (e.g., within a home environment, an office environment or as a server farm), while in other examples the processors can be distributed across a number of locations.
The one or more processors can also operate to support performance of the relevant operations in a “cloud computing” environment or as a “software as a service” (SaaS). For example, at least some of the operations can be performed by a group of computers (as examples of machines including processors), with these operations being accessible via a network (e.g., the Internet) and via one or more appropriate interfaces (e.g., Application Program Interfaces (APIs)).
Example embodiments (e.g., apparatus, systems, or methods) can be implemented in digital electronic circuitry, in computer hardware, in firmware, in software, or in any combination thereof. Example embodiments can be implemented using a computer program product (e.g., a computer program, tangibly embodied in an information carrier or in a machine readable medium, for execution by, or to control the operation of, data processing apparatus such as a programmable processor, a computer, or multiple computers).
A computer program can be written in any form of programming language, including compiled or interpreted languages, and it can be deployed in any form, including as a stand-alone program or as a software module, subroutine, or other unit suitable for use in a computing environment. A computer program can be deployed to be executed on one computer or on multiple computers at one site or distributed across multiple sites and interconnected by a communication network.
In an example, operations can be performed by one or more programmable processors executing a computer program to perform functions by operating on input data and generating output. Examples of method operations can also be performed by, and example apparatus can be implemented as, special purpose logic circuitry (e.g., a field programmable gate array (FPGA) or an application-specific integrated circuit (ASIC)).
The computing system can include clients and servers. A client and server are generally remote from each other and generally interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. In embodiments deploying a programmable computing system, it will be appreciated that both hardware and software architectures require consideration. Specifically, it will be appreciated that the choice of whether to implement certain functionality in permanently configured hardware (e.g., an ASIC), in temporarily configured hardware (e.g., a combination of software and a programmable processor), or a combination of permanently and temporarily configured hardware can be a design choice. Below are set out hardware (e.g., computing device 800) and software architectures that can be deployed in example embodiments.
Example computing device 800 can include a processor 802 (e.g., a central processing unit (CPU), a graphics processing unit (GPU), or both), a main memory 804 and a static memory 806, some or all of which can communicate with each other via a bus 808. The computing device 800 can further include a display unit 810, an alphanumeric input device 812 (e.g., a keyboard), and a user interface (UI) navigation device 814 (e.g., a mouse). In an example, the display unit 810, input device 812, and UI navigation device 814 can be a touch screen display. The computing device 800 can additionally include a storage device (e.g., drive unit) 816, a signal generation device 818 (e.g., a speaker), a network interface device 820, and one or more sensors 821, such as a global positioning system (GPS) sensor, compass, accelerometer, or other sensor.
The storage device 816 can include a machine readable medium 822 (also referred to herein as a computer-readable medium) on which is stored one or more sets of data structures or instructions 824 (e.g., software) embodying or utilized by any one or more of the methodologies or functions described herein. The instructions 824 can also reside, completely or at least partially, within the main memory 804, within static memory 806, or within the processor 802 during execution thereof by the computing device 800. In an example, one or any combination of the processor 802, the main memory 804, the static memory 806, or the storage device 816 can constitute machine readable media.
While the machine readable medium 822 is illustrated as a single medium, the term “machine readable medium” can include a single medium or multiple media (e.g., a centralized or distributed database, and/or associated caches and servers) that are configured to store the one or more instructions 824. The term “machine readable medium” can also be taken to include any tangible medium that is capable of storing, encoding, or carrying instructions for execution by the machine and that cause the machine to perform any one or more of the methodologies of the present disclosure or that is capable of storing, encoding or carrying data structures utilized by or associated with such instructions. The term “machine readable medium” can accordingly be taken to include, but not be limited to, solid-state memories, and optical and magnetic media. Specific examples of machine-readable media can include non-volatile memory, including, by way of example, semiconductor memory devices (e.g., Electrically Programmable Read-Only Memory (EPROM), Electrically Erasable Programmable Read-Only Memory (EEPROM)) and flash memory devices; magnetic disks such as internal hard disks and removable disks; magneto-optical disks; and CD-ROM and DVD-ROM disks.
The instructions 824 can further be transmitted or received over a communications network 826 using a transmission medium via the network interface device 820 utilizing any one of a number of transfer protocols (e.g., frame relay, IP, TCP, UDP, HTTP, etc.). Example communication networks can include a local area network (LAN), a wide area network (WAN), a packet data network (e.g., the Internet), mobile telephone networks (e.g., cellular networks), Plain Old Telephone Service (POTS) networks, wireless data networks (e.g., the IEEE 802.11 standards family known as Wi-Fi®, the IEEE 802.16 standards family known as WiMax®), and peer-to-peer (P2P) networks, among others. The term “transmission medium” shall be taken to include any intangible medium that is capable of storing, encoding or carrying instructions for execution by the machine, and includes digital or analog communications signals or other intangible medium to facilitate communication of such software.
Implementation 1. A method comprising: obtaining, by a computing system including one or more computing devices having one or more processors and memory, first data indicating a first amino acid sequence of a template protein, the template protein including a functional region that binds to an additional molecule or that chemically reacts with the additional molecule; obtaining, by the computing system, second data indicating second amino acid sequences corresponding to additional proteins having one or more specified characteristics; obtaining, by the computing system, position modification data indicating, for individual positions of the first amino acid sequence, a probability that an amino acid located at an individual position of the first amino acid sequence is modifiable; generating, by the computing system and using a generative adversarial network, a plurality of third amino acid sequences corresponding to the additional proteins, the plurality of third amino acid sequences being variants of the first amino acid sequence of the template protein, wherein the plurality of third amino acid sequences are generated based on the first data, the second data, and the position modification data.
Implementation 2. The method of implementation 1, wherein individual third amino acid sequences of the plurality of third amino acid sequences include one or more regions having at least a threshold amount of identity with respect to the functional region.
Implementation 3. The method of implementation 1 or 2, wherein the first amino acid sequence includes one or more first groups of amino acids that are produced with respect to a first germline gene and the plurality of third amino acid sequences include one or more second groups of amino acids that are produced with respect to a second germline gene that is different from the first germline gene.
Implementation 4. The method of implementation 3, wherein the one or more second groups of amino acids are included in at least a portion of the second amino acid sequences.
Implementation 5. The method of any one of implementations 1-4, wherein the one or more specified characteristics include values of one or more biophysical properties.
Implementation 6. The method of any one of implementations 1-5, wherein: the template protein is a first antibody; the additional proteins include second antibodies; and the one or more specified characteristics include one or more sequences of amino acids included in one or more framework regions of the second amino acid sequences.
Implementation 7. The method of any one of implementations 1-6, wherein the template protein is produced by a mammal that is not a human and the additional proteins correspond to proteins produced by a human.
Implementation 8. The method of any one of implementations 1-7, comprising: training, by the computing system, a first model using the generative adversarial network and based on the first data, the second data, and the position modification data; obtaining, by the computing system, third data indicating additional amino acid sequences of proteins having a set of biophysical properties; training, by the computing system and using the first model as a generating component of the generative adversarial network, a second model based on the third data; and generating, by the computing system and using the second model, a plurality of fourth amino acid sequences that correspond to proteins that are variants of the template protein and that have at least a threshold probability of having one or more biophysical properties of the set of biophysical properties.
Implementation 9. A method comprising: obtaining, by a computing system including one or more computing devices having one or more processors and memory, first data indicating a first amino acid sequence of an antibody produced by a mammal that is different from a human, the antibody having a binding region that binds to an antigen; obtaining, by the computing system, second data indicating a plurality of second amino acid sequences with individual second amino acid sequences of the plurality of second amino acid sequences corresponding to a human antibody; obtaining, by the computing system, position modification data indicating, for individual positions of the first amino acid sequence, a probability that an amino acid located at an individual position of the first amino acid sequence is modifiable; generating, by the computing system and using a generative adversarial network, a model to produce amino acid sequences having at least a first threshold amount of identity with respect to the binding region and at least a second threshold amount of identity with respect to one or more heavy chain framework regions and one or more light chain framework regions of the plurality of second amino acid sequences; and generating, by the computing system and using the model, a plurality of third amino acid sequences based on the position modification data and the first amino acid sequence.
Implementation 10. The method of implementation 9, wherein the position modification data indicates that first probabilities to modify amino acids located in the binding region are no greater than about 5% and that second probabilities to modify amino acids located in one or more portions of at least one of the one or more heavy chain framework regions or the one or more light chain framework regions of the antibody are at least 40%.
Implementation 11. The method of implementation 9 or 10, wherein the position modification data indicates penalties to apply to modification of amino acids of the antibody with respect to generating the plurality of third amino acid sequences.
Implementation 12. The method of implementation 11, wherein the position modification data indicates that an amino acid located at a first position of the first amino acid sequence of the antibody has a first penalty for being changed to a first type of amino acid and a second penalty for being changed to a second type of amino acid.
Implementation 13. The method of implementation 12, wherein the amino acid has one or more hydrophobic regions, the first type of amino acid corresponds to hydrophobic amino acids, and the second type of amino acid corresponds to positively charged amino acids.
Implementation 14. A system comprising: one or more hardware processors; one or more non-transitory computer-readable storage media storing instructions that, when executed by the one or more hardware processors, cause the one or more hardware processors to perform operations comprising: obtaining first data indicating a first amino acid sequence of a template protein, the template protein including a functional region that binds to an additional molecule or that chemically reacts with the additional molecule; obtaining second data indicating second amino acid sequences corresponding to additional proteins having one or more specified characteristics; obtaining position modification data indicating, for individual positions of the first amino acid sequence, a probability that an amino acid located at an individual position of the first amino acid sequence is modifiable; generating, using a generative adversarial network, a plurality of third amino acid sequences corresponding to the additional proteins, the plurality of third amino acid sequences being variants of the first amino acid sequence of the template protein, wherein the plurality of third amino acid sequences are generated based on the first data, the second data, and the position modification data.
Implementation 15. The system of implementation 14, wherein individual third amino acid sequences of the plurality of third amino acid sequences include one or more regions having at least a threshold amount of identity with respect to the functional region.
Implementation 16. The system of implementation 14 or 15, wherein the first amino acid sequence includes one or more first groups of amino acids that are produced with respect to a first germline gene and the plurality of third amino acid sequences include one or more second groups of amino acids that are produced with respect to a second germline gene that is different from the first germline gene.
Implementation 17. The system of implementation 16, wherein the one or more second groups of amino acids are included in at least a portion of the second amino acid sequences.
Implementation 18. The system of any one of implementations 14-17, wherein the one or more specified characteristics include values of one or more biophysical properties.
Implementation 19. The system of any one of implementations 14-18, wherein: the template protein is a first antibody; the additional proteins include second antibodies; and the one or more specified characteristics include one or more sequences of amino acids included in one or more framework regions of the second amino acid sequences.
Implementation 20. The system of any one of implementations 14-19, wherein the template protein is produced by a mammal that is not a human and the additional proteins correspond to proteins produced by a human.
Implementation 21. The system of any one of implementations 14-20, wherein the one or more non-transitory computer-readable storage media store additional instructions that, when executed by the one or more hardware processors, cause the one or more hardware processors to perform additional operations comprising: training a first model using the generative adversarial network and based on the first data, the second data, and the position modification data; obtaining third data indicating additional amino acid sequences of proteins having a set of biophysical properties; training, using the first model as a generating component of the generative adversarial network, a second model based on the third data; and generating, using the second model, a plurality of fourth amino acid sequences that correspond to proteins that are variants of the template protein and that have at least a threshold probability of having one or more biophysical properties of the set of biophysical properties.
Implementation 22. A system comprising: one or more hardware processors; one or more non-transitory computer-readable storage media storing instructions that, when executed by the one or more hardware processors, cause the one or more hardware processors to perform operations comprising: obtaining first data indicating a first amino acid sequence of an antibody produced by a mammal that is different from a human, the antibody having a binding region that binds to an antigen; obtaining second data indicating a plurality of second amino acid sequences with individual second amino acid sequences of the plurality of second amino acid sequences corresponding to a human antibody; obtaining position modification data indicating, for individual positions of the first amino acid sequence, a probability that an amino acid located at an individual position of the first amino acid sequence is modifiable; generating, using a generative adversarial network, a model to produce amino acid sequences having at least a first threshold amount of identity with respect to the binding region and at least a second threshold amount of identity with respect to one or more heavy chain framework regions and one or more light chain framework regions of the plurality of second amino acid sequences; and generating, using the model, a plurality of third amino acid sequences based on the position modification data and the first amino acid sequence.
Implementation 23. The system of implementation 22, wherein the position modification data indicates that first probabilities to modify amino acids located in the binding region are no greater than about 5% and that second probabilities to modify amino acids located in one or more portions of at least one of the one or more heavy chain framework regions or the one or more light chain framework regions of the antibody are at least 40%.
Implementation 24. The system of implementation 22 or 23, wherein the position modification data indicates penalties to apply to modification of amino acids of the antibody with respect to generating the plurality of third amino acid sequences.
Implementation 25. The system of implementation 24, wherein the position modification data indicates that an amino acid located at a first position of the first amino acid sequence of the antibody has a first penalty for being changed to a first type of amino acid and a second penalty for being changed to a second type of amino acid.
Implementation 26. The system of implementation 25, wherein the amino acid has one or more hydrophobic regions, the first type of amino acid corresponds to hydrophobic amino acids, and the second type of amino acid corresponds to positively charged amino acids.
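The role of the position modification data in the implementations above can be sketched in code. The generative adversarial network itself is beyond a short example; the sketch below only illustrates how per-position modification probabilities could gate which positions of a template sequence are eligible for substitution, with low probabilities (e.g., no greater than about 5%) preserving a binding region and higher probabilities (e.g., at least 40%) allowing framework positions to vary, as in Implementation 10. The function name, the toy template, and the uniform-random substitution are all illustrative assumptions, not the disclosed model.

```python
import random

# Standard 20-letter single-letter amino acid alphabet.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def generate_variant(template, position_mod_probs, rng=None):
    """Sample one variant of `template`.

    `position_mod_probs[i]` is the probability that the amino acid at
    position i is modifiable (the "position modification data" of
    Implementation 1). A trained generative adversarial network would
    learn which substitutions to propose; this sketch simply substitutes
    a uniformly random amino acid wherever a modification is sampled.
    """
    rng = rng or random.Random()
    variant = []
    for residue, prob in zip(template, position_mod_probs):
        if rng.random() < prob:
            variant.append(rng.choice(AMINO_ACIDS))
        else:
            variant.append(residue)
    return "".join(variant)

# Toy template: binding-region positions held near-fixed (probability
# <= 5%), framework positions freely modifiable (probability >= 40%).
template = "QVQLVQSGAE"
probs = [0.40, 0.40, 0.05, 0.05, 0.05, 0.05, 0.40, 0.40, 0.40, 0.40]

# A plurality of variant sequences, seeded for reproducibility.
variants = [generate_variant(template, probs, random.Random(i)) for i in range(5)]
```

In the disclosed implementations the substitution choice is made by the generator of the GAN rather than uniformly at random, and the position modification data may further carry per-amino-acid-type penalties (Implementations 11-13), which would replace the single probability per position with a vector of penalties over the twenty amino acid types.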
Filing Document | Filing Date | Country | Kind
---|---|---|---
PCT/US2020/064579 | 12/11/2020 | WO |

Number | Date | Country
---|---|---
62947430 | Dec 2019 | US