The present disclosure generally relates to computational design of engineered polypeptides and experimental selection of improved variants.
Computational design can be used to design new therapeutic proteins that mimic native proteins, or vaccines that display a desired epitope or epitopes from a pathogenic antigen. Computationally designed proteins may also be used to generate or select binding agents. For example, one can pan libraries of antibodies (e.g., phage display libraries) against a designed protein bait to select for clones that bind to that bait, or one can immunize experimental animals with a designed immunogen to generate novel antibodies.
Although other platforms exist, the leading modeling platform for computational design is Rosetta (Das and Baker, Annu Rev Biochem. 77:363-82 (2008)), which can be used to design proteins that match a desired structure. Correia et al., Structure 18:1116-26 (2010) discloses a general computational method to design epitope-scaffolds in which contiguous structural epitopes are transplanted into scaffold proteins for conformational stabilization and immune presentation. Ofek et al., PNAS USA 107:17880-87 (2010) discloses transplantation of an epitope from the HIV-1 gp41 protein into select acceptor scaffolds.
A problem with engineered peptide design is that each new epitope requires design of a new peptide scaffold. There is thus a need for new and improved devices and methods for computational design of proteins that mimic a target protein structure. To address this need, a set of general scaffolds is provided.
Generally, in some variations, an apparatus may include a non-transitory processor-readable medium that stores code representing instructions to be executed by a processor. The code may comprise code to cause the processor to provide engineered polypeptides generated based on a first plurality of blueprint records from a predetermined portion of a reference target structure of a reference target. Each blueprint record from the first plurality of blueprint records may include target residue positions and scaffold residue positions, each target residue position corresponding to one target residue from a plurality of target residues. Each engineered polypeptide from the engineered polypeptides may include target residue positions and scaffold residue positions, each target residue position corresponding to one target residue from the plurality of target residues. The medium may include code to substitute the residues at one or more of the target residue positions of each engineered polypeptide based on the residues of a predetermined portion of a second reference target structure of a second reference target to generate sets of first engineered polypeptides and second engineered polypeptides. The medium may include code to add further engineered polypeptides to each set by repeating the substituting step with further reference targets. The medium may include code to filter the sets by structure comparison between the members of the sets of engineered polypeptides.
The apparatus may generate a plurality of filtered sets of engineered polypeptides, the filtered sets comprising engineered polypeptides having scaffold residue positions configured to display portions of target structures of targets sharing structural similarity and/or sequence similarity to the reference target.
In some variations, the predetermined portions of the reference target structures of the reference targets are selected by structure comparison between the predetermined portions.
In some variations, the structure comparison is static structure comparison using de novo folding of each of the engineered polypeptides of each of the sets of engineered polypeptides.
In some variations, the structure comparison is by dynamic structure comparison using molecular dynamics (MD) simulations of each of the engineered polypeptides of each of the sets of engineered polypeptides.
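By way of non-limiting illustration, the substitute-and-filter workflow above can be sketched in Python as follows. The sketch represents each polypeptide as a plain sequence string, and the structure-comparison step is supplied by the caller (for example, an RMSD between de novo folded or MD-simulated models, per the variations above); the helper names and the 2.0 angstrom cutoff are assumptions of the sketch, not prescribed by the disclosure.

```python
from itertools import combinations
from typing import Callable, List, Sequence

def substitute_target_residues(sequence: str, positions: Sequence[int],
                               new_residues: str) -> str:
    """Replace the residues at the target-residue positions with the
    corresponding residues of a further reference target; scaffold
    positions are left untouched."""
    chars = list(sequence)
    for pos, aa in zip(positions, new_residues):
        chars[pos] = aa
    return "".join(chars)

def build_and_filter_sets(first_round: List[str],
                          target_positions: Sequence[int],
                          further_epitopes: List[str],
                          compare: Callable[[str, str], float],
                          cutoff: float = 2.0) -> List[List[str]]:
    """Grow one set per first-round engineered polypeptide, then keep only
    the sets whose members agree pairwise under the supplied structure
    comparison (static or dynamic)."""
    filtered = []
    for polypeptide in first_round:
        members = [polypeptide] + [
            substitute_target_residues(polypeptide, target_positions, epi)
            for epi in further_epitopes
        ]
        if all(compare(a, b) <= cutoff for a, b in combinations(members, 2)):
            filtered.append(members)
    return filtered
```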
The disclosure further provides a method of expressing in cells a nucleic acid library of polynucleotides encoding variant polypeptides generated by making amino acid substitutions at scaffold residue positions of one or more of the members of the filtered sets of engineered polypeptides.
The method may include performing computational sequence substitution. The method may include performing codon substitution. The method may include performing mutagenesis. The method may include selecting cells from the library.
In some variations, the selecting step includes selective proteolysis of the expressed polypeptides. Selected polynucleotides may be recovered from the selected cells and sequenced.
In some variations, the selecting step includes binding of the expressed polypeptides to the target's ligand(s).
In some variations, the selecting step includes Förster resonance energy transfer (FRET) between donor and acceptor fluorescent moieties on the expressed polypeptides where the FRET signal intensity indicates the degree of folding.
In some variations, results from the selecting step may be configured to be received as input to code that retrains or refines an existing machine learning model or trains new machine learning models for engineered polypeptide generation.
In some variations, an apparatus may include a non-transitory processor-readable medium that stores code representing instructions to be executed by a processor. The code may comprise code to cause the processor to train a machine learning model based on a first set of blueprint records, or representations thereof, and a first set of scores, each blueprint record from the first set of blueprint records associated with each score from the first set of scores. The medium may include code to execute, after the training, the machine learning model to generate a second set of blueprint records having at least one desired score. The second set of blueprint records may be configured to be received as input in computational protein modeling to generate engineered polypeptides based on the second set of blueprint records.
The medium may include code to cause the processor to receive a reference target structure. The medium may include code to cause the processor to generate the first set of blueprint records from a predetermined portion of the reference target structure, each blueprint record from the first set of blueprint records comprising target residue positions and scaffold residue positions, each target residue position from the set of target residue positions corresponding to one target residue from the set of target residues. In some variations, in at least one blueprint record, the target residue positions are nonconsecutive. In some variations, in at least one blueprint record, the target residue positions are in an order different from the order of the target residue positions in the reference target sequence.
The medium may include code to cause the processor to label the first set of blueprint records by performing computational protein modeling on each blueprint record to generate a polypeptide structure, calculating a score for the polypeptide structure, and associating the score with the blueprint record. In some variations, the computational protein modeling may be based on a de novo design without template matching to the reference target structure. In some variations, each score comprises an energy term and a structure-constraint matching term that may be determined using one or more structural constraints extracted from the representation of the reference target structure.
The medium may include code to cause the processor to determine whether to retrain the machine learning model by calculating a second set of scores for the second set of blueprint records. The medium may include further code to retrain, in response to the determining, the machine learning model based on (1) retraining blueprint records that include the second set of blueprint records and (2) retraining scores that include the second set of scores.
The medium may include code to cause the processor to concatenate, after the retraining of the machine learning model, the first set of blueprint records and the second set of blueprint records to generate the retraining blueprint records and the retraining scores, each blueprint record from the retraining blueprint records associated with a score from the retraining scores. In some variations, the at least one desired score may be a preset value. In some variations, the at least one desired score may be dynamically determined.
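As a non-limiting sketch of this train-generate-verify-retrain loop, the following Python code uses a random forest surrogate (one of the supervised models contemplated below); the generate_candidates and true_score callables stand in for the blueprint generator and the computational-protein-modeling scorer, and the tolerance and round count are assumptions of the sketch.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def train_and_refine(blueprints, scores, generate_candidates, true_score,
                     tolerance=0.5, rounds=3):
    """Train a surrogate on (blueprint representation, score) pairs, generate
    a second set of blueprint records, and retrain on the concatenated data
    whenever the surrogate's predictions drift from recomputed scores."""
    X = np.asarray(blueprints, dtype=float)   # first set of blueprint records
    y = np.asarray(scores, dtype=float)       # first set of scores
    model = RandomForestRegressor(n_estimators=200, random_state=0).fit(X, y)
    for _ in range(rounds):
        cand = np.asarray(generate_candidates(model), dtype=float)
        truth = np.array([true_score(c) for c in cand])  # second set of scores
        # Retrain only if predictions deviate too far from the ground truth.
        if np.mean(np.abs(model.predict(cand) - truth)) > tolerance:
            X = np.concatenate([X, cand])     # retraining blueprint records
            y = np.concatenate([y, truth])    # retraining scores
            model = RandomForestRegressor(n_estimators=200,
                                          random_state=0).fit(X, y)
    return model
```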
In some variations, the machine learning model may be a supervised machine learning model. The supervised machine learning model may include an ensemble of decision trees, a boosted decision tree algorithm, an extreme gradient boosting (XGBoost) model, or a random forest. In some variations, the supervised machine learning model may include a support vector machine (SVM), a feed-forward machine learning model, a recurrent neural network (RNN), a convolutional neural network (CNN), a graph neural network (GNN), or a transformer neural network.
In some variations, the machine learning model may include an inductive machine learning model. In some variations, the machine learning model may include a generative machine learning model.
The medium may include code to cause the processor to perform computational protein modeling on the second set of blueprint records to generate engineered polypeptides.
The medium may include code to cause the processor to filter the engineered polypeptides by static structure comparison to the representation of the reference target structure.
The medium may include code to cause the processor to filter the engineered polypeptides by dynamic structure comparison to the representation of the reference target structure using molecular dynamics (MD) simulations of the representation of the reference target structure and each of the engineered polypeptides. In some variations, MD simulations are performed in parallel using symmetric multiprocessing (SMP).
Non-limiting examples of various aspects and variations of the invention are described herein and illustrated in the accompanying drawings.
A problem with engineered peptide design is that each new epitope requires design of a new peptide scaffold. To solve this problem, a set of general scaffolds is provided. A general scaffold is defined here as a peptide scaffold sequence capable of stabilizing a family of different epitope sequences, where all epitope sequences in this family adopt the same structural motif. For example, the tryptophan zipper is a general scaffold for displaying beta turn motifs, because the peptide scaffold retains a beta sheet structure for a variety of turn lengths and sequence identities (Cochran et al. Tryptophan zippers: Stable, monomeric β-hairpins. PNAS USA 98:5578-83 (2001)). Given that much of the structural proteome comprises a relatively small number of frequently observed fold families, fold-specific general scaffolds that stabilize many or all of these families could be identified. General scaffolds provide the following benefits: (1) more efficient design of engineered polypeptides for commonly occurring epitope structural motifs; (2) a protein structure obtained for one engineered polypeptide that makes use of a general scaffold can be used as evidence that all engineered polypeptides employing the same general scaffold adopt the desired structure; and (3) differences in epitope steering for engineered polypeptides designed using the same general scaffold can be attributed solely to differences in the epitope sequences, because the scaffold sequence is constant.
Provided herein are methods of designing engineered polypeptides, and compositions comprising and methods of using said engineered polypeptides. For example, provided herein are methods of using engineered peptides in in vitro selection of antibodies. In some aspects, a user (or program) may select a target protein having a known structure and identify a portion of the target protein as input for design of an engineered polypeptide. The target protein may be an antigen (or putative antigen) from a pathogenic organism; a protein involved in cellular functions associated with disease; an enzyme; a signaling molecule; or any protein for which an engineered polypeptide recapitulating a portion of the protein is desired. The engineered polypeptide may be intended for antibody discovery, vaccination, diagnostics, use in a method of treatment, biomanufacturing, or other applications. The “target protein” may, in a variation, be more than one protein, such as a multimeric protein complex. For simplicity, the disclosure refers to a target protein, but the methods apply to multimeric structures as well. In a variation, the target protein is two or more distinct proteins or protein complexes. For example, the methods disclosed herein may be used to design engineered peptides that mimic common attributes of proteins from diverse species—e.g., to target a conserved epitope for antibody selection.
A computational record of the topology of the protein is derived, termed here a “reference target structure.” The reference target structure may be a conventional protein structure or a structural model, represented for example by 3D coordinates for all (or most) atoms in the protein or 3D coordinates for select atoms (e.g., coordinates of the Cα atoms of each protein residue). Optionally, the reference target structure may include dynamic terms derived either computationally (e.g., from molecular dynamics simulation) or experimentally (e.g., from spectroscopy, crystallography, or electron microscopy).
The predetermined portion of the target protein is converted into a blueprint having target-residue positions and scaffold-residue positions. Each position may be assigned either a fixed amino-acid residue identity or a variable identity (e.g., any amino acid, or an amino acid with desired physicochemical properties—polar/non-polar, hydrophobicity, size, etc.). In a variation, each amino acid from the predetermined portion of the target protein is mapped to one target-residue position, which is assigned the same amino-acid identity as found in the target protein. The target-residue positions may be continuous and/or ordered. An advantage, however, in some variations, is that the target-residue positions may be discontinuous (interrupted by scaffold-residue positions) and unordered (in a different order from the target protein). Unlike grafting approaches, in some variations, the order of residues is not constrained. Similarly, the disclosed methods can accommodate discontinuous portions of the target protein (e.g., discontinuous epitopes where different portions of the same protein or even different protein chains contribute to one epitope).
The scaffold-residue positions of the blueprint may be assigned to have any amino acid at that position (i.e., an X representing any amino acid). In variations, the scaffold-residue position is assigned by selection from a subset of possible natural or unnatural amino acids (e.g., small polar amino acid residue, large hydrophobic amino-acid residue, etc.). The blueprint may also accommodate optional target- and/or scaffold-residue positions. Similarly stated, the blueprint may tolerate insertion or deletion of residue positions. For example, a target- or scaffold-residue position may be assigned to be present or absent; or the position may be assigned to be 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, or more residues.
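By way of illustration only, a blueprint record of the kind described above might be held in memory as follows; the Python class and field names are hypothetical conveniences for this sketch and are not the Rosetta blueprint file format.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class BlueprintPosition:
    """One blueprint position: a fixed target residue carried over from the
    reference target, or a variable scaffold residue."""
    kind: str                      # "target" or "scaffold"
    identity: str = "X"            # fixed amino acid, or "X" for any
    allowed: Optional[str] = None  # e.g. "AGS" to restrict to small residues
    min_repeat: int = 1            # 0 makes the position optional (deletable)
    max_repeat: int = 1            # >1 permits insertions at this position

@dataclass
class Blueprint:
    """Ordered positions; target positions may be discontinuous (interrupted
    by scaffold positions) and need not follow the reference-sequence order."""
    positions: List[BlueprintPosition] = field(default_factory=list)

# Example: target residues D, K, E displayed on a variable scaffold, with
# zero to two optional scaffold residues permitted between K and E.
bp = Blueprint([
    BlueprintPosition("target", "D"),
    BlueprintPosition("scaffold"),
    BlueprintPosition("target", "K"),
    BlueprintPosition("scaffold", min_repeat=0, max_repeat=2),
    BlueprintPosition("target", "E"),
])
```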
A subset of the blueprints may then be used to perform computational modeling to generate corresponding polypeptide structures, using, e.g., energy term(s) and topological constraint(s) derived from the reference target structure, with a score calculated for each polypeptide structure. A machine learning (ML) model may be trained using the scores and the blueprints, or representations of the blueprints (e.g., vectors that represent the blueprints), and the ML model may be executed to generate further blueprints. An advantage of this method is that the ML model can explore a topological space covering vastly more blueprints than could be explored by iterative computational modeling of individual blueprints.
The disclosure further provides methods and related devices to convert output blueprints to sequences and/or structures of engineered polypeptides, to compare these engineered polypeptides to the target protein—using static comparison, dynamic comparison, or both—and to filter the polypeptides using these comparisons.
While the methods and apparatus are described herein as processing data from a set of blueprint records, a set of scores, a set of energy terms, a set of molecular dynamics energies, or a set of energy functions, in some instances an engineered polypeptide design device 101 described herein may be used to process any other suitable type of data.
The memory 102 of the engineered polypeptide design device 101 may include, for example, a memory buffer, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM), an embedded multi-time programmable (MTP) memory, an embedded multi-media card (eMMC), a universal flash storage (UFS) device, and/or the like. The memory 102 may store, for example, one or more software modules and/or code that includes instructions to cause the processor 104 of the engineered polypeptide design device 101 to perform one or more processes or functions (e.g., a data preparation module 105, a computational protein modeling module 106, a machine learning model 107, and/or a molecular dynamics simulation module 108). The memory 102 may store a set of files associated with (e.g., generated by executing) the machine learning model 107 including data generated by the machine learning model 107 during the operation of the engineered polypeptide design device 101. In some instances, the set of files associated with the machine learning model 107 may include temporary variables, return memory addresses, variables, a graph of the machine learning model 107 (e.g., a set of arithmetic operations or a representation of the set of arithmetic operations used by the machine learning model 107), the graph's metadata, assets (e.g., external files), electronic signatures (e.g., specifying a type of the machine learning model 107 being exported, and the input/output tensors), and/or the like, generated during the operation of the engineered polypeptide design device 101.
The communication interface 103 of the engineered polypeptide design device 101 can be a hardware component of the engineered polypeptide design device 101 operatively coupled to and used by the processor 104 and/or the memory 102. The communication interface 103 may include, for example, a network interface card (NIC), a Wi-Fi™ module, a Bluetooth® module, an optical communication module, and/or any other suitable wired and/or wireless communication interface. The communication interface 103 may be configured to connect the engineered polypeptide design device 101 to the network 150, as described in further detail herein. In some instances, the communication interface 103 may facilitate receiving or transmitting data via the network 150. More specifically, in some implementations, the communication interface 103 may facilitate receiving or transmitting data such as, for example, a set of blueprint records, a set of scores, a set of energy terms, a set of molecular dynamics energies, or a set of energy functions through the network 150 from or to the backend service platform 160. In some instances, data received via the communication interface 103 may be processed by the processor 104 or stored in the memory 102, as described in further detail herein.
The processor 104 may include, for example, a hardware-based integrated circuit (IC) or any other suitable processing device configured to run and/or execute a set of instructions or code. For example, the processor 104 may be a general-purpose processor, a central processing unit (CPU), a graphical processing unit (GPU), a tensor processing unit (TPU), an accelerated processing unit (APU), an application specific integrated circuit (ASIC), a field programmable gate array (FPGA), a programmable logic array (PLA), a complex programmable logic device (CPLD), a programmable logic controller (PLC), and/or the like. The processor 104 is operatively coupled to the memory 102 through a system bus (for example, an address bus, data bus, and/or control bus).
The processor 104 may include a data preparation module 105, a computational protein modeling module 106, and a machine learning model 107. The processor 104 may optionally include a molecular dynamics simulation module 108. Each of the data preparation module 105, the computational protein modeling module 106, the machine learning model 107, or the molecular dynamics simulation module 108 can be software stored in memory 102 and executed by the processor 104. For example, a code to cause the machine learning model 107 to generate a set of blueprint records can be stored in the memory 102 and executed by the processor 104. Similarly, each of the data preparation module 105, the computational protein modeling module 106, the machine learning model 107, or the molecular dynamics simulation module 108 can be a hardware-based device. For example, a process to cause the machine learning model 107 to generate the set of blueprint records may be implemented on an individual integrated circuit (IC) chip.
The data preparation module 105 can be configured to receive (e.g., from the memory 102 or the backend service platform 160) a set of data including a reference target structure for a reference target. The data preparation module 105 can be further configured to generate a set of blueprint records (e.g., a blueprint file encoded in a table of alphanumeric data) from a predetermined portion of the reference target structure. In some instances, each blueprint record from the set of blueprint records may include target residue positions and scaffold residue positions, each target residue position corresponding to one target residue from the set of target residues.
In some instances, the data preparation module 105 may be further configured to encode a blueprint of a reference target structure into a blueprint record. The data preparation module 105 may further convert the blueprint record into a representation of the blueprint record that is generally suitable for use in a machine learning model. In some instances, the representation may be a one-dimensional vector of numbers, a two-dimensional matrix of alphanumerical data, or a three-dimensional tensor of normalized numbers. More specifically, in some instances, the representation is a vector of an ordered list of the numbers of intervening scaffold residue positions. Such a representation may be used because the order of the target residues can be inferred from the target structure; therefore, the representation does not need to identify the amino acid identity of the target-residue positions. One example of such a representation is described further below.
In some instances, the data preparation module 105 may generate and/or process a set of blueprint records, a set of scores, a set of energy terms, a set of molecular dynamics energies, and/or a set of energy functions. The data preparation module 105 can be configured to extract information from the set of blueprint records, the set of scores, the set of energy terms, the set of molecular dynamics energies, or the set of energy functions.
In some instances, the data preparation module 105 may convert an encoding of the set of blueprint records to a common character encoding such as, for example, ASCII, UTF-8, UTF-16, Guobiao, Big5, Unicode, or any other suitable character encoding. In yet other instances, the data preparation module 105 may be further configured to extract features of the blueprint record and/or the representation of the blueprint record by, for example, identifying a portion of the blueprint record or the representation of the blueprint record significant for engineering polypeptides. In some instances, the data preparation module 105 may convert the units of the set of blueprint records, the set of scores, the set of energy terms, the set of molecular dynamics energies, or the set of energy functions from English units such as, for example, mile, foot, inch, and/or the like, to International System of Units (SI) units such as, for example, kilometer, meter, centimeter, and/or the like.
The computational protein modeling module 106 can be configured to generate, from a predetermined portion of the reference target structure, a set of initial candidates of blueprint records that may serve as starting templates for the computational optimization process described herein. In one example, the computational protein modeling module 106 can be a Rosetta remodeler. Variations of the method employ other modeling algorithms, including without limitation molecular dynamics simulations, ab initio fragment assembly, Monte Carlo fragment assembly, machine learning structure prediction such as AlphaFold or trRosetta, structural knowledgebase-backed protein folding, neural network protein folding, sequence-based recurrent or transformer network protein folding, generative adversarial network protein structure generation, Markov Chain Monte Carlo protein folding, and/or the like. The initial candidate structures generated using the Rosetta remodeler may be used as a training set for the machine learning model 107. The computational protein modeling module 106 can further computationally determine an energy term for each blueprint from the initial candidates of blueprint records. The data preparation module 105 can then be configured to generate a score from the energy term. In one example, the score can be a normalized value of the energy term. The normalized value can be a number from 0 to 1, a number from −1 to 1, a normalized value between 0 and 100, or any other numerical range. In some variations, the computational protein modeling module 106 may be based on a de novo design without template matching to the reference target structure, or based on weak distance restraints where, for example, the distances between target residues are constrained to be within 1 angstrom of the target-residue distances in the target structure. Weak distance restraints may include restraints that allow a variational noise distribution around the distance restraints (e.g., Gaussian noise with a specific mean and a specific variance around the distance restraints). In some variations, the computational protein modeling module 106 may be configured by smoothing or adding variational noise to any distance constraints and/or by defining an objective function of a computational protein model such that the computational protein model is penalized less harshly when distance constraints are not met. Moreover, in some instances the computational protein modeling module 106 may use smooth labeling of the energy term. An advantage of this method is that by smoothing the energy term label, the machine learning model 107 can more easily optimize over the topological space covered by the blueprints to be explored.
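A minimal sketch of one way such weak, smoothed distance restraints could be implemented is shown below; the flat-bottom width of 1 angstrom follows the example above, while the saturating Gaussian-shaped penalty and the noise scale are illustrative assumptions rather than a prescribed energy function.

```python
import numpy as np

def weak_distance_penalty(d, d0, tol=1.0, sigma=0.5):
    """Smoothed restraint: zero penalty while |d - d0| <= tol (e.g., within
    1 angstrom of the target-residue distance in the reference structure),
    then a penalty that saturates at 1.0 instead of growing without bound,
    so unmet distance constraints are penalized less harshly."""
    excess = np.maximum(np.abs(np.asarray(d) - np.asarray(d0)) - tol, 0.0)
    return 1.0 - np.exp(-(excess ** 2) / (2.0 * sigma ** 2))

def noisy_restraint_targets(d0, sigma=0.25, seed=0):
    """Variational noise around the restraints: each target distance is drawn
    from a Gaussian centered on the reference distance."""
    rng = np.random.default_rng(seed)
    return rng.normal(loc=np.asarray(d0, dtype=float), scale=sigma)
```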
The machine learning model 107 may be used to generate improved blueprint records compared to the set of initial candidates of blueprint records. The machine learning model 107 can be a supervised machine learning model configured to receive the set of initial candidates of blueprint records and a set of scores computed by the computational protein modeling module 106. Each score from the set of scores corresponds to a blueprint record from the set of initial candidates of blueprint records. The processor 104 can be configured to associate each corresponding score and blueprint record to generate a set of labeled training data.
In some instances, the machine learning model 107 may include an inductive machine learning model and/or a generative machine learning model. The machine learning model may include a boosted decision tree algorithm, an ensemble of decision trees, an extreme gradient boosting (XGBoost) model, a random forest, a support vector machine (SVM), a feed-forward machine learning model, a recurrent neural network (RNN), a convolutional neural network (CNN), a graph neural network (GNN), an adversarial network model, an instance-based training model, a transformer neural network, and/or the like. The machine learning model 107 can be configured to include a set of model parameters including a set of weights, a set of biases, and/or a set of activation functions that, once trained, may be executed in an inductive mode to generate a score from a blueprint record or may be executed in a generative mode to generate a blueprint record from a score.
In one example, the machine learning model 107 can be a deep learning model that includes an input layer, an output layer, and multiple hidden layers (e.g., 5 layers, 10 layers, 20 layers, 50 layers, 100 layers, 200 layers, etc.). The multiple hidden layers may include normalization layers, fully connected layers, activation layers, convolutional layers, recurrent layers, and/or any other layers that are suitable for representing a correlation between the set of blueprint records and the set of scores, each score representing an energy term.
In one example, the machine learning model 107 can be an XGBoost model that includes a set of hyper-parameters such as, for example, a number of boost rounds that defines the number of boosting rounds or trees in the XGBoost model, maximum depth that defines a maximum number of permitted nodes from a root of a tree of the XGBoost model to a leaf of the tree, and/or the like. The XGBoost model may include a set of trees, a set of nodes, a set of weights, a set of biases, and other parameters useful for describing the XGBoost model.
In some implementations, the machine learning model 107 (e.g., a deep learning model, an XGBoost model, and/or the like) can be configured to iteratively receive each blueprint record from the set of blueprint records and generate an output. Each blueprint record from the set of blueprint records is associated with one score from the set of scores. The output and the score can be compared using an objective function (also referred to as a ‘cost function’) to generate a first training loss value. The objective function may include, for example, a mean square error, a mean absolute error, a mean absolute percentage error, a logcosh, a categorical crossentropy, and/or the like. The set of model parameters can be modified over multiple iterations, and the objective function can be executed at each iteration, until the first training loss value converges to a first predetermined training threshold (e.g., 80%, 85%, 90%, 97%, etc.).
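For illustration, such a regressor could be trained with the xgboost Python package and a mean-square-error objective roughly as follows; the hyperparameter values and the synthetic placeholder data are assumptions of this sketch.

```python
import numpy as np
import xgboost as xgb
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 16))  # placeholder blueprint-record representations
y = X[:, :4].sum(axis=1) + rng.normal(scale=0.1, size=1000)  # placeholder scores

X_tr, X_val, y_tr, y_val = train_test_split(X, y, random_state=0)

model = xgb.XGBRegressor(
    n_estimators=500,              # number of boosting rounds (trees)
    max_depth=6,                   # maximum depth from a root to a leaf
    learning_rate=0.05,
    objective="reg:squarederror",  # mean-square-error objective
)
model.fit(X_tr, y_tr)
print("validation MSE:", mean_squared_error(y_val, model.predict(X_val)))
```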
In some implementations, the machine learning model 107 can be configured to iteratively receive each score from the set of scores and generate an output. Each blueprint record from the set of blueprint records is associated with one score from the set of scores. The output and the blueprint record can be compared using the objective function to generate a second training loss value. The set of model parameters can be modified over multiple iterations, and the objective function can be executed at each iteration, until the second training loss value converges to a second predetermined training threshold.
Once trained, the machine learning model 107 may be executed to generate a set of improved blueprint records. The set of improved blueprint records may be expected to have higher scores than the set of initial candidates of blueprint records. In some instances, the machine learning model 107 may be a generative machine learning model that is trained on a first set of blueprint records (e.g., generated using Rosetta remodeler) corresponding to a first set of scores (e.g., each score having an energy term corresponding to Rosetta energy of a blueprint record from the set of blueprint records) to represent a correlation of the design space of the first set of blueprint records with the first set of scores (e.g., corresponding to energy terms). Once trained, the machine learning model 107 generates a second set of blueprint records that have a second set of scores associated with them. In some implementations, the computational protein modeling module 106 can be used to verify the second set of blueprint records and the second set of scores by computing a set of energy terms for the second set of blueprint records. The set of energy terms may be used to generate a set of ground-truth scores for the second set of blueprint records. A subset of blueprint records can be selected from the second set of blueprint records such that each blueprint record from the subset of blueprint records has a ground-truth score above a threshold. In some instances, the threshold can be a number predetermined by, for example, a user of the engineered polypeptide design device 101. In some other instances, the threshold can be a number dynamically determined based on the set of ground-truth scores.
The molecular dynamics simulation module 108 can optionally be used to verify the outputs of the machine learning model 107, after the machine learning model 107 is executed to generate the second set of blueprint records. The engineered polypeptide design device 101 may filter out a subset of the second set of blueprint records by generating engineered polypeptides based on the second set of blueprint records and performing a dynamic structure comparison to the representation of the reference target structure using molecular dynamics (MD) simulations of the representation of the reference target structure and each of the structures of the engineered polypeptides. For example, the molecular dynamics simulation module 108 may select a few (e.g., fewer than 10 hits) of the engineered polypeptides (that are based on the second set of blueprint records). In some instances, the MD simulations can be performed under boundary conditions, restraints, and/or equilibration. In some instances, the MD simulations can be performed under solution conditions, including steps of model preparation, equilibration (e.g., temperatures of 100 K to 300 K), and applying force field parameters and/or solvent model parameters to the representation of the reference target structure and each of the structures of the engineered polypeptides. In some instances, the MD simulations can undergo restrained minimization (e.g., to relieve structural clashes), restrained heating (e.g., restrained heating for 100 picoseconds, gradually increasing to an ambient temperature), relaxed restraints (e.g., relaxing restraints over 100 picoseconds and gradually removing backbone restraints), and/or the like.
In some implementations, the machine learning model 107 is an inductive machine learning model. Once trained, such a machine learning model 107 may predict a score based on a blueprint record in a fraction of the time that a numerical method (e.g., a computational protein modeling module, a density functional theory based molecular dynamics energy simulator, and/or the like) would take to calculate a score for the blueprint. Therefore, the machine learning model 107 can be used to estimate a set of scores for a set of blueprint records quickly, to substantially improve the optimization speed (e.g., 50% faster, 2 times faster, 10 times faster, 100 times faster, 1000 times faster, 1,000,000 times faster, 1,000,000,000 times faster, and/or the like) of an optimization algorithm. In some implementations, the machine learning model 107 may generate a first set of scores for a first set of blueprint records. The processor 104 of the engineered polypeptide design device 101 may execute code representing a set of instructions to select top performers of the first set of blueprint records (e.g., having the top 10% or the top 2% of the first set of scores, and/or the like). The processor 104 may further execute code to verify the scores of the top performers among the first set of blueprint records. In some variations, the top performers among the first set of blueprint records can be generated as output if their corresponding verified scores have a value larger than any of the first set of scores. In some variations, the machine learning model 107 can be retrained based on a new data set including a second set of blueprint records and a second set of scores that include the blueprint records and scores of the top performers.
The network 150 can be a digital telecommunication network of servers and/or compute devices. The servers and/or compute devices on the network can be connected via one or more wired or wireless communication networks (not shown) to share resources such as, for example, data storage or computing power. The wired or wireless communication networks between servers and/or compute devices of the network may include one or more communication channels, for example, radio frequency (RF) communication channel(s), fiber optic communication channel(s), and/or the like. The network can be, for example, the Internet, an intranet, a local area network (LAN), a wide area network (WAN), a metropolitan area network (MAN), a worldwide interoperability for microwave access network (WiMAX®), a virtual network, any other suitable communication system, and/or a combination of such networks.
The backend service platform 160 may be a compute device (e.g., a server) operatively coupled to and/or within a digital communication network of servers and/or compute devices, such as for example, the Internet. In some variations, the backend service platform 160 may include and/or execute a cloud-based service such as, for example, a software as a service (SaaS), a platform as a service (PaaS), an infrastructure as a service (IaaS), and/or the like. In one example, the backend service platform 160 can provide data storage to store a large amount of data including protein structures, blueprint records, Rosetta energies, molecular dynamics energies, and/or the like. In another example, the backend service platform 160 can provide fast computing to execute a set of computational protein modeling, molecular dynamics simulations, training machine learning models, and/or the like.
In some variations, the procedures of the computational protein modeling module 106 described herein can be executed on a backend service platform 160 that provides cloud computing services. In such variations, the engineered polypeptide design device 101 may be configured to send, using the communication interface 103, a signal to the backend service platform 160 to generate a set of blueprint records. The backend service platform 160 can execute a computational protein modeling process that generates the set of blueprint records. The backend service platform 160 can then transmit the set of blueprint records, via the network 150, to the engineered polypeptide design device 101.
In some variations, the engineered polypeptide design device 101 can transmit a file that includes the machine learning model 107 to a user compute device (not shown), remote from the engineered polypeptide design device 101. The user compute device can be configured to generate a set of blueprint records that meet design criteria (e.g., having a desired score). In some variations, the user compute device receives, from the engineered polypeptide design device 101, a reference target structure. The user compute device may generate a first set of blueprint records from a predetermined portion of the reference target structure such that each blueprint record includes target residue positions and scaffold residue positions. Each target residue position corresponds to one target residue from the set of target residues. The user compute device can further train the machine learning model based on a first set of blueprint records, or representations thereof, and a first set of scores. The user compute device may execute, after the training, the machine learning model to generate a second set of blueprint records having at least one desired score (e.g., meeting a certain design criteria). The second set of blueprint records may be received as input in computational protein modeling to generate engineered peptides based on the second set of blueprint records.
In a generative operation mode, the machine learning model 202 is trained on a first set of blueprint records 201 and a first set of scores 203. Once trained, the machine learning model 202 generates a second set of blueprint records having a second set of scores that are statistically higher (e.g., having a higher mean value) than the first set of scores. In an inductive operation mode, the machine learning model 202 is also trained on the first set of blueprint records 201 and the first set of scores 203. Once trained, the machine learning model 202 generates a second set of scores for a second set of blueprint records. The second set of scores are a set of predicted scores based on the historical training data (e.g., the first set of blueprint records and the first set of scores) and are generated substantially faster (e.g., 50% faster, 2 times faster, 10 times faster, 100 times faster, 1000 times faster, 1,000,000 times faster, 1,000,000,000 times faster, and/or the like) than numerically calculated scores and/or energy terms that use computational protein modeling (similar to the computational protein modeling module 106 described above).
The method of engineered polypeptide design 300 optionally includes, at 305, determining whether to retrain the machine learning model by calculating a second set of scores (e.g., a ground-truth set of scores) using a numerical method such as, for example, a Rosetta remodeler, an ab initio molecular dynamics simulation, machine learning structure prediction such as AlphaFold or trRosetta, structural knowledgebase-backed protein folding, neural network protein folding, sequence-based recurrent or transformer network protein folding, generative adversarial network protein structure generation, Markov Chain Monte Carlo protein folding, and/or the like. The engineered polypeptide design device then compares the second set of scores with the set of predicted scores and, based on the deviation of the set of predicted scores from the second set of scores, determines whether to retrain the machine learning model. The method of engineered polypeptide design 300 optionally includes, at 305, retraining, in response to the determining, the machine learning model based on (1) retraining blueprint records that include the second set of blueprint records and (2) retraining scores that include the set of predicted scores. In some configurations, the engineered polypeptide design device may concatenate the first set of blueprint records and the second set of blueprint records to generate the retraining blueprint records. The engineered polypeptide design device may further concatenate the first set of scores and the second set of scores to generate the retraining scores. In some configurations, the retraining blueprint records include only the second set of blueprint records and the retraining scores include only the second set of scores.
The left-hand portion of the schematic illustrates converting the blueprint into a representation of the blueprint. The representation may be any representation suitable for use in a machine learning model (such as the machine learning model 107 described above).
An advantage of this variation of the representation of the blueprint record is that, other than the first and last elements, the vector is frame-shift invariant. That is, the machine learning model has available information regarding the relative positions of the target residues independent of the position of the target residue within the blueprint. This permits design of similar structures with variable structured/unstructured regions at the N- and C-termini.
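A minimal sketch of this representation follows, encoding a blueprint as the ordered counts of scaffold residues before, between, and after the target residues; the string encoding ('T' for a target-residue position, 'S' for a scaffold-residue position) is an assumption of the sketch.

```python
from typing import List

def blueprint_to_vector(blueprint: str) -> List[int]:
    """Encode a blueprint string as the ordered numbers of intervening
    scaffold residues. Target-residue identities are not encoded: their
    order is inferred from the reference target structure."""
    vector, run = [], 0
    for ch in blueprint:
        if ch == "T":
            vector.append(run)  # scaffold residues since the previous target
            run = 0
        else:
            run += 1
    vector.append(run)          # trailing scaffold residues (C-terminal)
    return vector

# Two leading scaffold residues, one between the first and second targets,
# none between the second and third, and three trailing:
assert blueprint_to_vector("SSTSTTSSS") == [2, 1, 0, 3]
# Shifting the motif toward the N-terminus changes only the first and last
# elements; the interior of the vector is frame-shift invariant.
assert blueprint_to_vector("TSTTSSSSS")[1:-1] == [1, 0]
```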
The engineered polypeptide design device may perform computational protein modeling (e.g., using the computational protein modeling module 106 described above) on the second set of blueprint records to generate engineered polypeptides.
In some implementations, the engineered polypeptide design device may then filter out a subset of the engineered polypeptides by a dynamic structure comparison to the representation of the reference target structure using molecular dynamics (MD) simulations of the representation of the reference target structure and each of the structures of the engineered polypeptides. For example, the engineered polypeptide design device may select a few (e.g., fewer than 10 hits) of the engineered polypeptides. In some instances, the MD simulations can determine dynamics of the representation of the reference target structure and each of the structures of the engineered polypeptides under solution conditions, including steps of model preparation, equilibration (e.g., temperatures of 100 K to 300 K), and unrestrained MD simulations. In some instances, the MD simulation can include applying force field parameters and solvent model parameters to the representation of the reference target structure and each of the structures of the engineered polypeptides. In some instances, the MD simulations can undergo restrained minimization for 1000 cycles (e.g., to relieve structural clashes), restrained heating (e.g., restrained heating for 100 picoseconds, gradually increasing to an ambient temperature), and relaxed restraints (e.g., relaxing restraints over 100 picoseconds and gradually removing backbone restraints).
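The disclosure does not mandate a particular MD engine; as one non-limiting example, the protocol above (restrained minimization, restrained heating toward ambient temperature, then releasing backbone restraints) could be approximated with OpenMM roughly as follows, where the input file name, force field choice, and step counts are assumptions of the sketch.

```python
from openmm import app, unit, CustomExternalForce, LangevinMiddleIntegrator

pdb = app.PDBFile("engineered_polypeptide.pdb")  # assumed input model
ff = app.ForceField("amber14-all.xml", "amber14/tip3p.xml")
modeller = app.Modeller(pdb.topology, pdb.positions)
modeller.addSolvent(ff, padding=1.0 * unit.nanometer)  # solution conditions
system = ff.createSystem(modeller.topology, nonbondedMethod=app.PME,
                         nonbondedCutoff=1.0 * unit.nanometer,
                         constraints=app.HBonds)

# Positional restraints on backbone atoms, released after heating.
restraint = CustomExternalForce("k*periodicdistance(x, y, z, x0, y0, z0)^2")
restraint.addGlobalParameter("k", 1000.0)  # kJ/mol/nm^2
for name in ("x0", "y0", "z0"):
    restraint.addPerParticleParameter(name)
for atom in modeller.topology.atoms():
    if atom.name in ("N", "CA", "C", "O"):
        pos = modeller.positions[atom.index].value_in_unit(unit.nanometer)
        restraint.addParticle(atom.index, pos)
system.addForce(restraint)

integrator = LangevinMiddleIntegrator(100 * unit.kelvin,
                                      1.0 / unit.picosecond,
                                      0.002 * unit.picoseconds)
sim = app.Simulation(modeller.topology, system, integrator)
sim.context.setPositions(modeller.positions)
sim.minimizeEnergy(maxIterations=1000)  # restrained minimization

# Restrained heating from 100 K toward ambient temperature (~100 ps total).
for temp in range(100, 301, 25):
    integrator.setTemperature(temp * unit.kelvin)
    sim.step(5000)  # 10 ps per increment at a 2 fs timestep

sim.context.setParameter("k", 0.0)  # remove backbone restraints
sim.step(50000)                     # 100 ps of unrestrained dynamics
```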
The method of engineered polypeptide design 1400 further may include, at step 1405, filtering the sets by structure comparison between the members of the sets of engineered polypeptides. In one variation, using machine-learning-based de novo folding algorithms such as AlphaFold (Senior et al. Improved protein structure prediction using potentials from deep learning. Nature 577:706-10 (2020)) or trRosetta (Yang et al. Improved protein structure prediction using predicted interresidue orientations. PNAS USA 117:1496-1503 (2020)), the MEM is computationally folded for a variety of different epitope sequences. RMSDs are calculated between all residues (epitope and scaffold) of the de novo folded MEM and its ideal Rosetta design. The final score is the average RMSD across all of the different epitope sequences tested. RMSD may be calculated using all atoms, just C-alpha atoms, or some intermediate variation. An average RMSD of 0 angstroms is a perfect score, with higher values being worse scores.
In another variation, using molecular dynamics simulations, the local stability of the MEM is tested for a variety of different epitope sequences. RMSDs are calculated between all residues (epitope and scaffold) of the simulated MEM and its ideal Rosetta design. The final score is the average RMSD across all of the different epitope sequences tested. RMSD may be calculated using all atoms, just C-alpha atoms, or some intermediate variation. An average RMSD of 0 angstroms is a perfect score, with higher values being worse scores.
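For either variation, the average-RMSD score can be computed as sketched below using Kabsch superposition over matched coordinate arrays (all atoms or only C-alpha atoms); the dictionary-of-coordinates interface is an assumption of the sketch.

```python
import numpy as np

def kabsch_rmsd(P: np.ndarray, Q: np.ndarray) -> float:
    """RMSD between two conformations of the same residues (N x 3 arrays)
    after optimal rigid-body superposition (Kabsch algorithm)."""
    P = P - P.mean(axis=0)
    Q = Q - Q.mean(axis=0)
    V, _, Wt = np.linalg.svd(P.T @ Q)
    d = np.sign(np.linalg.det(V @ Wt))
    R = V @ np.diag([1.0, 1.0, d]) @ Wt   # guard against reflections
    return float(np.sqrt(np.mean(np.sum((P @ R - Q) ** 2, axis=1))))

def scaffold_score(folded: dict, designed: dict) -> float:
    """Average RMSD between each folded (or simulated) model and its ideal
    Rosetta design, across all epitope sequences tested; 0 angstroms is a
    perfect score, and higher values are worse."""
    return float(np.mean([kabsch_rmsd(folded[epi], designed[epi])
                          for epi in folded]))
```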
The method of engineered polypeptide selection 1500 may include, at step 1504, selecting cells from the library. In some instances, the selecting step comprises selective proteolysis of the expressed polypeptides. In some instances, the selecting step comprises binding of target ligands by the expressed polypeptides. In some instances, the selecting step comprises a FRET donor and acceptor pair where FRET signal intensity indicates the degree of folding of the expressed polypeptides. Methods of selective proteolysis, ligand binding, and FRET are known in the art. The method 1500 may include, at step 1505, recovering selected polynucleotide sequences from the selected cells. For example, the cells may be labeled with a fluorescently labeled antibody, a fluorescently labeled target molecule, or an antibody/target molecule configured for detection by a secondary labeled molecule. For example, the antibody/target molecule may be biotinylated. Selective binding of the antibody/target molecule to a cell indicates that the variant polypeptide was stable to proteolysis and therefore properly folded. Other modes of selection may be performed, such as selection for binding to the antibody/target molecule without a proteolysis step. Selection may include fluorescence-activated cell sorting (FACS), magnetic-activated cell sorting (MACS), and capture via an antibody/target molecule conjugated or indirectly bound to a matrix, e.g., a resin or bead, optionally a magnetic bead.
In another variation, recovered polynucleotide sequences from step 1505 are used as input to code that retrains or refines an existing machine learning model, or trains a new machine learning model, for engineered peptide generation.
The method of engineered polypeptide selection 1500 may include, at step 1506, comparing one or more variant polypeptides to the reference target by experimental structure determination. For example, the structure of one or more variant polypeptides may be determined by X-ray crystallography, nuclear magnetic resonance (NMR) (e.g., 2D- or 3D-NMR), or cryo-electron microscopy. The determined structure may then be compared with the reference target structure of the reference target, either statically or dynamically (e.g., using an NMR ensemble or a molecular dynamics simulation). In some instances, instead of a full structure determination, structural insight into the polypeptide structure may be obtained by circular dichroism (CD) spectroscopy, differential scanning fluorimetry, 1D NMR, or any other experimental assessment of the polypeptide folded state. This structural insight may then be compared between different polypeptide variants to assess structural similarity.
In an illustrative method, the nucleic acid library comprising polynucleotides encoding variant polypeptides is constructed as a series of Overlap Extension Polymerase Chain Reaction (OE-PCR) fragments. The ends of each fragment may be engineered to be GC-rich to aid in fragment assembly. Next, the assembled diversity may be ligated into an expression plasmid or viral genome, optionally a mammalian expression plasmid.
Diversity libraries of variant polypeptides may be designed to label the polypeptides with one, two, or more tags. For example, the polynucleotide sequences may comprise one or more tags at the 5′ end, at the 3′ end, or in the middle of the coding sequence. For example, two or more tags may be used. Illustrative tags include histidine tags (e.g., hexahistidine) and affinity tags (e.g., FLAG). In an embodiment, the library encodes variant polypeptides fused to a C-terminal hexahistidine tag and an N-terminal FLAG tag. This dual tagging strategy may enable detection of variant polypeptides that are membrane-displayed by cells and/or that are resistant to proteolysis. Various cells, media, and transfection techniques are known in the art and may be used.
In an example, a nucleic acid library is screened for stable variants by limited proteolysis. Step 1: Cells transfected with the nucleic acid library are subjected to limited proteolysis with 20 nM chymotrypsin for 5 minutes; proteolysis is then quenched with a protease inhibitor and the cells are washed. In parallel, control cells are prepared without proteolysis. Step 2: The cells are sorted by two-color FACS using a labeled anti-His antibody and a labeled anti-FLAG antibody. Step 3: Next-generation sequencing (NGS) is performed on the chymotrypsin-treated and untreated FACS-sorted cells.
Analysis of the NGS data for the chymotrypsin-treated and untreated pools across multiple epitope sequences leads to identification of more stable general scaffolds. For the chemokine receptor (CCR) family, a scaffold variant library for epitope sequences corresponding to CCR4, CCR5, and CCR8 has been screened, and scaffold sequences with enhanced stability and/or sequences that are shared across the CCR4, CCR5, and CCR8 epitopes have been identified. In some cases, sequences are assessed for “sequence similarity” using a score derived from the number of exact matches between the sequences being compared or from amino acid similarities based on a BLOSUM matrix. Clusters of closely related sequences that contain the CCR4, CCR5, and CCR8 epitope sequences are identified.
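For illustration, both similarity measures could be computed with Biopython as sketched below; the two epitope-region sequences are hypothetical placeholders, not actual CCR4/CCR8 sequences, and the sequences are assumed to be pre-aligned and of equal length.

```python
from Bio.Align import substitution_matrices

BLOSUM62 = substitution_matrices.load("BLOSUM62")

def identity_score(a: str, b: str) -> float:
    """Fraction of exact matches between two aligned sequences."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

def blosum_score(a: str, b: str) -> float:
    """Mean per-position BLOSUM62 similarity between two aligned sequences."""
    return sum(BLOSUM62[x, y] for x, y in zip(a, b)) / len(a)

epitope_a = "DESIYSNYYLYESIP"  # hypothetical epitope-region sequence
epitope_b = "DYTMEPNVTMTDYYP"  # hypothetical epitope-region sequence
print(identity_score(epitope_a, epitope_b), blosum_score(epitope_a, epitope_b))
```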
The disclosure provides embodiments according to the following numbered clauses.
Clause 1. A method, comprising:
Clause 2. The method of clause 1, wherein the predetermined portions of the reference target structures of the reference targets are selected by structure comparison between the predetermined portions.
Clause 3. The method of clause 1 or clause 2, wherein the structure comparison is static structure comparison using de novo folding of each of the engineered polypeptides of each of the sets of engineered polypeptides.
Clause 4. The method of any one of clauses 1 to 3, wherein the structure comparison is by dynamic structure comparison using molecular dynamics (MD) simulations of each of the engineered polypeptides of each of the sets of engineered polypeptides.
Clause 5. The method of any one of clauses 1 to 4, wherein the method further comprises expressing in cells a nucleic acid library of polynucleotides encoding variant polypeptides generated by making amino acid substitutions at scaffold residue positions of one or more of the members of the filtered sets of engineered polypeptides.
Clause 6. The method of clause 5, wherein the making amino acid substitutions comprises performing computational sequence substitution.
Clause 7. The method of clause 5 or clause 6, wherein the making amino acid substitutions comprises performing codon substitution.
Clause 8. The method of any one of clauses 5 to 7, wherein the making amino acid substitutions comprises performing mutagenesis.
Clause 9. The method of any one of clauses 5 to 8, wherein the method further comprises selecting cells from the library.
Clause 10. The method of clause 9, wherein the selecting step comprises selective proteolysis of the expressed polypeptides.
Clause 11. The method of clause 9 or clause 10, wherein the selecting step comprises binding to target ligand(s) by the expressed polypeptides.
Clause 12. The method of any one of clauses 9 to 11, wherein the selecting step comprises Förster resonance energy transfer (FRET) between a donor and acceptor moiety, where FRET signal indicates folding of the expressed polypeptides.
Clause 13. The method of any one of clauses 5 to 12, wherein the method comprises recovering from the selected cells selected polynucleotide sequences.
Clause 14. The method of clause 13, wherein the selected polynucleotide sequences are input to code that retrains or refines an existing machine learning model, or trains a new machine learning model.
Clause 15. The method of any one of clauses 1 to 14, wherein the providing step comprises:
Clause 16. The method of any one of clauses 1 to 15, comprising:
Clause 17. The method of clause 16, wherein in at least one blueprint record, the target residue positions are nonconsecutive.
Clause 18. The method of clause 16 or clause 17, wherein in at least one blueprint record, target residue positions are in an order different from the order of the target residue positions in the reference target sequence.
Clause 19. The method of any one of clauses 16 to 18, comprising:
Clause 20. The method of clause 19, wherein the computational protein modeling is based on a de novo design without template matching to the reference target structure.
Clause 21. The method of clause 19 or clause 20, wherein each score from the first plurality of scores comprises an energy term and a structure-constraint matching term that is determined using one or more structural constraints extracted from the representation of the reference target structure.
Clause 22. The method of any one of clauses 19 to 21, comprising:
Clause 23. The method of clause 22, comprising:
Clause 24. The method of any one of clauses 15 to 23, wherein the at least one desired score is a preset value.
Clause 25. The method of any one of clauses 15 to 23, wherein the at least one desired score is dynamically determined.
Clause 26. A non-transitory processor-readable medium storing code representing instructions to be executed by a processor, the code comprising code to cause the processor to perform a method according to any one of clauses 1-25.
The foregoing description, for purposes of explanation, used specific nomenclature to provide a thorough understanding of the invention. However, it will be apparent to one skilled in the art that the specific details are not required in order to practice the invention. Thus, the foregoing descriptions of specific embodiments of the invention are presented for purposes of illustration and description. They are not intended to be exhaustive or to limit the invention to the precise forms disclosed; obviously, many modifications and variations are possible in view of the above teachings. The embodiments were chosen and described in order to explain the principles of the invention and its practical applications, thereby enabling others skilled in the art to utilize the invention and various embodiments with various modifications as are suited to the particular use contemplated. It is intended that the following claims and their equivalents define the scope of the invention.
This application claims priority to and the benefit of U.S. Patent Application No. 63/120,098, filed Dec. 1, 2020 and titled “Generalized Scaffolds for Polypeptide Display and Uses Thereof,” which is incorporated herein by reference in its entirety.