The instant application contains an electronic Sequence Listing that has been submitted electronically and is hereby incorporated by reference in its entirety. The sequence listing was created on Jun. 21, 2023, is named “22-0854-US_Sequence-Listing.xml” and is 21,575 bytes in size.
Current approaches to de novo design of proteins harboring a desired binding or catalytic motif require pre-specification of an overall fold or secondary structure composition, and hence considerable trial and error can be required to identify protein structures capable of scaffolding an arbitrary functional site. Methods are needed to start from a desired functional site and jointly fill in the missing sequence and structure needed to complete a protein having the desired functional site.
In one aspect, the disclosure provides polypeptides comprising an amino acid sequence at least 75%, 80%, 85%, 90%, 91%, 92%, 93%, 94%, 95%, 96%, 97%, 98%, 99%, or 100% identical to the amino acid sequence selected from SEQ ID NO:1-15, wherein any N-terminal methionine residues are optional and may be present or absent. In one embodiment, the polypeptide comprises an amino acid sequence at least 75%, 80%, 85%, 90%, 91%, 92%, 93%, 94%, 95%, 96%, 97%, 98%, 99%, or 100% identical to the amino acid sequence selected from SEQ ID NO:1-8, wherein the polypeptide binds to Co2+. In another embodiment, the polypeptide comprises an amino acid sequence at least 75%, 80%, 85%, 90%, 91%, 92%, 93%, 94%, 95%, 96%, 97%, 98%, 99%, or 100% identical to the amino acid sequence selected from SEQ ID NO:9-10, wherein the polypeptide binds to Ca2+. In a further embodiment, the polypeptide comprises an amino acid sequence at least 75%, 80%, 85%, 90%, 91%, 92%, 93%, 94%, 95%, 96%, 97%, 98%, 99%, or 100% identical to the amino acid sequence selected from SEQ ID NO:11-14, wherein the polypeptide binds to respiratory syncytial virus (RSV) F protein.
In one embodiment, the polypeptide comprises an amino acid sequence at least 75%, 80%, 85%, 90%, 91%, 92%, 93%, 94%, 95%, 96%, 97%, 98%, 99%, or 100% identical to the amino acid sequence of SEQ ID NO:15, wherein the polypeptide binds to Tropomyosin receptor kinase A (TrkA). In another embodiment, the polypeptide comprises an amino acid sequence at least 75%, 80%, 85%, 90%, 91%, 92%, 93%, 94%, 95%, 96%, 97%, 98%, 99%, or 100% identical to the amino acid sequence of SEQ ID NO:22, wherein the polypeptide binds to programmed cell death ligand 1 (PD-L1).
In one embodiment, the disclosure provides nucleic acids encoding the polypeptide of any embodiment disclosed herein. In another embodiment, the disclosure provides expression vectors comprising a nucleic acid of the disclosure operatively linked to a promoter. In a further embodiment, the disclosure provides host cells comprising the nucleic acid or the expression vector of any embodiment herein.
In other aspects, the disclosure provides methods for treating cancer or generating an immune response comprising administering to a subject in need thereof a polypeptide of the disclosure. In a further aspect, the disclosure provides calcium imaging probes and electron microscopy stains, comprising a polypeptide of the disclosure.
All references cited are herein incorporated by reference in their entirety. Within this application, unless otherwise stated, the techniques utilized may be found in any of several well-known references such as: Molecular Cloning: A Laboratory Manual (Sambrook, et al., 1989, Cold Spring Harbor Laboratory Press), Gene Expression Technology (Methods in Enzymology, Vol. 185, edited by D. Goeddel, 1991. Academic Press, San Diego, CA), “Guide to Protein Purification” in Methods in Enzymology (M. P. Deutscher, ed., (1990) Academic Press, Inc.); PCR Protocols: A Guide to Methods and Applications (Innis, et al. 1990. Academic Press, San Diego, CA), Culture of Animal Cells: A Manual of Basic Technique, 2nd Ed. (R. I. Freshney. 1987. Liss, Inc. New York, NY), Gene Transfer and Expression Protocols (pp. 109-128, ed. E. J. Murray, The Humana Press Inc., Clifton, N.J.), and the Ambion 1998 Catalog (Ambion, Austin, TX).
As used herein, the singular forms “a”, “an” and “the” include plural referents unless the context clearly dictates otherwise.
As used herein, the amino acid residues are abbreviated as follows: alanine (Ala; A), asparagine (Asn; N), aspartic acid (Asp; D), arginine (Arg; R), cysteine (Cys; C), glutamic acid (Glu; E), glutamine (Gln; Q), glycine (Gly; G), histidine (His; H), isoleucine (Ile; I), leucine (Leu; L), lysine (Lys; K), methionine (Met; M), phenylalanine (Phe; F), proline (Pro; P), serine (Ser; S), threonine (Thr; T), tryptophan (Trp; W), tyrosine (Tyr; Y), and valine (Val; V). All embodiments of any aspect of the disclosure can be used in combination, unless the context clearly dictates otherwise.
Unless the context clearly requires otherwise, throughout the description and the claims, the words ‘comprise’, ‘comprising’, and the like are to be construed in an inclusive sense as opposed to an exclusive or exhaustive sense; that is to say, in the sense of “including, but not limited to”. Words using the singular or plural number also include the plural and singular number, respectively. Additionally, the words “herein,” “above,” and “below” and words of similar import, when used in this application, shall refer to this application as a whole and not to any particular portions of the application.
In a first aspect, the disclosure provides polypeptides comprising an amino acid sequence at least 75%, 80%, 85%, 90%, 91%, 92%, 93%, 94%, 95%, 96%, 97%, 98%, 99%, or 100% identical to the amino acid sequence selected from SEQ ID NO:1-15 and 22, wherein any N-terminal methionine residues are optional and may be present or absent.
As described in the examples that follow, the inventors have designed a series of proteins with specific functional domains and binding activity, as described herein.
Amino acid sequences of the polypeptides are shown in Table 1.
Any N-terminal methionine residues may be present or may be absent (i.e., deleted in the polypeptide relative to the reference polypeptide). In other embodiments, the N- and/or C-terminal 1, 2, 3, 4, or 5 residues may be deleted in the polypeptide relative to the reference polypeptide.
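As an illustration of how percent identity to a reference sequence might be computed when an N-terminal methionine is optional, the following minimal Python sketch can be considered. It assumes the two sequences have already been aligned (equal length, gaps written as "-"). The function names, the toy sequences, and the choice to divide by the number of aligned reference residues are assumptions of this sketch, not definitions taken from the disclosure.

```python
def strip_n_terminal_met(seq: str) -> str:
    """Drop a single leading methionine, if present (hypothetical helper)."""
    return seq[1:] if seq.startswith("M") else seq


def percent_identity(aligned_query: str, aligned_ref: str) -> float:
    """Percent of reference residues that are matched by the query.

    Assumes both strings come from the same pairwise alignment, so they
    have equal length and use '-' for gaps. Columns where the reference
    has a gap are ignored (an assumption of this sketch).
    """
    if len(aligned_query) != len(aligned_ref):
        raise ValueError("aligned sequences must have equal length")
    ref_columns = [(q, r) for q, r in zip(aligned_query, aligned_ref) if r != "-"]
    if not ref_columns:
        raise ValueError("reference sequence is empty")
    matches = sum(1 for q, r in ref_columns if q == r)
    return 100.0 * matches / len(ref_columns)


# Toy sequences, not taken from the Sequence Listing.
reference = strip_n_terminal_met("MKVLAEELLK")  # leading Met removed
variant = strip_n_terminal_met("KVLSEELLK")     # no leading Met to remove
identity = percent_identity(variant, reference)  # 8 of 9 positions match
```

In practice the alignment itself would come from a standard pairwise alignment tool; only the counting step is sketched here.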
In one embodiment, the polypeptides comprise an amino acid sequence at least 75%, 80%, 85%, 90%, 91%, 92%, 93%, 94%, 95%, 96%, 97%, 98%, 99%, or 100% identical to the amino acid sequence selected from SEQ ID NO:1-8, wherein the polypeptide binds to Co2+. These polypeptides may be used, for example, for binding to metals such as iron or cobalt, or other heavy metals, as part of stains for electron microscopy.
In one embodiment, the polypeptides comprise an amino acid sequence at least 75%, 80%, 85%, 90%, 91%, 92%, 93%, 94%, 95%, 96%, 97%, 98%, 99%, or 100% identical to the amino acid sequence selected from SEQ ID NO:1-3, wherein the polypeptide binds to Co2+. These polypeptides are shown in the examples to have the strongest Co2+-binding activity. In one embodiment, the residues highlighted in boldface font in Table 1 are conserved (i.e., identical) in the polypeptides. These residues are particularly useful for binding to Co2+.
In another embodiment, the polypeptides comprise an amino acid sequence at least 75%, 80%, 85%, 90%, 91%, 92%, 93%, 94%, 95%, 96%, 97%, 98%, 99%, or 100% identical to the amino acid sequence selected from SEQ ID NO:9-10, wherein the polypeptide binds to Ca2+.
These polypeptides may be used, for example, in calcium imaging probes.
In a further embodiment, the polypeptides comprise an amino acid sequence at least 75%, 80%, 85%, 90%, 91%, 92%, 93%, 94%, 95%, 96%, 97%, 98%, 99%, or 100% identical to the amino acid sequence selected from SEQ ID NO:11-14, wherein the polypeptide binds to respiratory syncytial virus (RSV) F protein. These polypeptides may be used, for example, as epitope-based vaccines. In some embodiments, the polypeptides of this embodiment bind to preRSVF and RSVF-site V immunogens.
In one embodiment, the polypeptides comprise an amino acid sequence at least 75%, 80%, 85%, 90%, 91%, 92%, 93%, 94%, 95%, 96%, 97%, 98%, 99%, or 100% identical to the amino acid sequence of SEQ ID NO:15, wherein the polypeptide binds to Tropomyosin receptor kinase A (TrkA). These polypeptides may be used, for example, as cancer therapeutics.
In one embodiment, the polypeptides comprise an amino acid sequence at least 75%, 80%, 85%, 90%, 91%, 92%, 93%, 94%, 95%, 96%, 97%, 98%, 99%, or 100% identical to the amino acid sequence of SEQ ID NO:22, wherein the polypeptide binds to programmed cell death ligand 1 (PD-L1). These polypeptides may be used, for example, as cancer therapeutics.
In one embodiment, substitutions relative to the reference sequence are conservative amino acid substitutions. As used herein, a “conservative amino acid substitution” means a given amino acid can be replaced by a residue having similar physicochemical characteristics, e.g., substituting one aliphatic residue for another (such as Ile, Val, Leu, or Ala for one another), or substitution of one polar residue for another (such as between Lys and Arg; Glu and Asp; or Gln and Asn). Other such conservative substitutions, e.g., substitutions of entire regions having similar hydrophobicity characteristics, are known. Amino acids can be grouped according to similarities in the properties of their side chains (A. L. Lehninger, in Biochemistry, second ed., pp. 73-75, Worth Publishers, New York (1975)): (1) non-polar: Ala (A), Val (V), Leu (L), Ile (I), Pro (P), Phe (F), Trp (W), Met (M); (2) uncharged polar: Gly (G), Ser (S), Thr (T), Cys (C), Tyr (Y), Asn (N), Gln (Q); (3) acidic: Asp (D), Glu (E); (4) basic: Lys (K), Arg (R), His (H). Alternatively, naturally occurring residues can be divided into groups based on common side-chain properties: (1) hydrophobic: Norleucine, Met, Ala, Val, Leu, Ile; (2) neutral hydrophilic: Cys, Ser, Thr, Asn, Gln; (3) acidic: Asp, Glu; (4) basic: His, Lys, Arg; (5) residues that influence chain orientation: Gly, Pro; (6) aromatic: Trp, Tyr, Phe. Particular conservative substitutions include, but are not limited to, Ala into Gly or into Ser; Arg into Lys; Asn into Gln or into His; Asp into Glu; Cys into Ser; Gln into Asn; Glu into Asp; Gly into Ala or into Pro; His into Asn or into Gln; Ile into Leu or into Val; Leu into Ile or into Val; Lys into Arg, into Gln or into Glu; Met into Leu, into Tyr or into Ile; Phe into Met, into Leu or into Tyr; Ser into Thr; Thr into Ser; Trp into Tyr; Tyr into Trp; and/or Phe into Val, into Ile or into Leu.
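By way of a hedged illustration, the first side-chain grouping above (non-polar, uncharged polar, acidic, basic) can be encoded as a lookup table, and a substitution can be treated as conservative when both residues fall in the same group. The names `LEHNINGER_GROUPS`, `side_chain_group`, and `is_conservative` are hypothetical. This group-based test is only one possible reading: several of the particular substitutions listed above (for example, Ala into Gly) cross these group boundaries and would not be flagged by it.

```python
# Side-chain groups as listed in the text (one-letter codes).
LEHNINGER_GROUPS = {
    "non-polar": set("AVLIPFWM"),
    "uncharged polar": set("GSTCYNQ"),
    "acidic": set("DE"),
    "basic": set("KRH"),
}


def side_chain_group(residue: str) -> str:
    """Return the name of the group containing a one-letter residue code."""
    for name, members in LEHNINGER_GROUPS.items():
        if residue in members:
            return name
    raise ValueError(f"unknown residue: {residue!r}")


def is_conservative(original: str, replacement: str) -> bool:
    """True if the two residues differ but share a side-chain group."""
    return (
        original != replacement
        and side_chain_group(original) == side_chain_group(replacement)
    )


lys_to_arg = is_conservative("K", "R")  # both basic
asp_to_lys = is_conservative("D", "K")  # acidic vs. basic
```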
In another embodiment, the disclosure provides fusion proteins, comprising the polypeptide of any embodiment herein fused to a further functional domain. The polypeptides and peptide domains of the invention may include additional residues at the N-terminus, C-terminus, or both that are not present in the polypeptides or peptide domains of the disclosure; these additional residues are not included in determining the percent identity of the polypeptides or peptide domains of the disclosure relative to the reference polypeptide. Such residues may be any residues suitable for an intended use, including but not limited to detection tags (i.e.: fluorescent proteins, antibody epitope tags, etc.), adaptors, ligands suitable for purposes of purification (His tags, etc.), a protein antigen, a diagnostic molecule, compound or protein; a therapeutic compound or protein, or other peptide domains that add functionality to the polypeptides, etc.
The disclosure provides nucleic acids encoding the polypeptide of any embodiment or combination of embodiments of the disclosure. The nucleic acid sequence may comprise single stranded or double stranded RNA (such as an mRNA) or DNA in genomic or cDNA form, or DNA-RNA hybrids, each of which may include chemically or biochemically modified, non-natural, or derivatized nucleotide bases. Such nucleic acid sequences may comprise additional sequences useful for promoting expression and/or purification of the encoded polypeptide, including but not limited to polyA sequences, modified Kozak sequences, and sequences encoding epitope tags, export signals, and secretory signals, nuclear localization signals, and plasma membrane localization signals. It will be apparent to those of skill in the art, based on the teachings herein, what nucleic acid sequences will encode the polypeptides of the disclosure.
In a further aspect, the disclosure provides expression vectors comprising the nucleic acid of any aspect of the disclosure operatively linked to a suitable control sequence. “Expression vector” includes vectors that operatively link a nucleic acid coding region or gene to any control sequences capable of effecting expression of the gene product. “Control sequences” operably linked to the nucleic acid sequences of the disclosure are nucleic acid sequences capable of effecting the expression of the nucleic acid molecules. The control sequences need not be contiguous with the nucleic acid sequences, so long as they function to direct the expression thereof. Thus, for example, intervening untranslated yet transcribed sequences can be present between a promoter sequence and the nucleic acid sequences and the promoter sequence can still be considered “operably linked” to the coding sequence. Other such control sequences include, but are not limited to, polyadenylation signals, termination signals, and ribosome binding sites. Such expression vectors can be of any type, including but not limited to plasmid and viral-based expression vectors. The control sequence used to drive expression of the disclosed nucleic acid sequences in a mammalian system may be constitutive (driven by any of a variety of promoters, including but not limited to, CMV, SV40, RSV, actin, EF) or inducible (driven by any of a number of inducible promoters including, but not limited to, tetracycline, ecdysone, steroid-responsive). The expression vector must be replicable in the host organisms either as an episome or by integration into host chromosomal DNA. In various embodiments, the expression vector may comprise a plasmid, viral-based vector, or any other suitable expression vector.
In another aspect, the disclosure provides host cells that comprise the polypeptides, nucleic acids, or expression vectors (i.e., episomal or chromosomally integrated) of any embodiment disclosed herein, wherein the host cells can be either prokaryotic or eukaryotic. The cells can be transiently or stably engineered to incorporate the nucleic acids or expression vector of the disclosure, using techniques including but not limited to bacterial transformations, calcium phosphate co-precipitation, electroporation, or liposome mediated-, DEAE dextran mediated-, polycationic mediated-, or viral mediated transfection.
In another aspect, the disclosure provides methods for treating cancer, comprising administering to a subject in need thereof an amount effective to treat the cancer of a polypeptide comprising an amino acid sequence at least 75%, 80%, 85%, 90%, 91%, 92%, 93%, 94%, 95%, 96%, 97%, 98%, 99%, or 100% identical to the amino acid sequence of SEQ ID NO:15 or 22, wherein the polypeptide binds to TrkA or PD-L1, respectively. As disclosed above, these polypeptides can be used to treat cancer, including but not limited to cancer characterized by tumors that express PD-L1 and/or TrkA.
In a further aspect, the disclosure provides methods for generating an immune response in a subject, comprising administering to a subject in need thereof an amount effective to generate an immune response in the subject of a polypeptide comprising an amino acid sequence at least 75%, 80%, 85%, 90%, 91%, 92%, 93%, 94%, 95%, 96%, 97%, 98%, 99%, or 100% identical to the amino acid sequence selected from SEQ ID NO:11-14, wherein the polypeptide binds to RSV F protein. As disclosed above, these polypeptides can be used to generate an immune response in a subject in need thereof, including but not limited to a subject at risk of exposure to RSV, or a subject having an RSV infection. In some embodiments, the administering elicits an immune response in the subject, such that the subject is protected against infection by RSV. In some embodiments, the methods limit development of an RSV infection. As used herein, “limiting development” includes, but is not limited to, accomplishing one or more of the following: (a) generating an immune response (antibody and/or cell-based) to RSV in the subject; (b) generating neutralizing antibodies against RSV in the subject; (c) limiting build-up of RSV titer in the subject after exposure to RSV; and/or (d) limiting or preventing development of RSV symptoms after infection.
In each of these aspects, an “effective amount” refers to an amount of the polypeptide that is effective for treating cancer or generating an immune response.
The polypeptides are typically formulated as a pharmaceutical composition (in combination with a pharmaceutically acceptable carrier), and can be administered via any suitable route, including orally, parenterally, by inhalation spray, rectally, or topically in dosage unit formulations containing conventional pharmaceutically acceptable carriers, adjuvants, and vehicles. The term parenteral as used herein includes subcutaneous, intravenous, intra-arterial, intramuscular, intrasternal, intratendinous, intraspinal, intracranial, intrathoracic, and intraperitoneal administration, as well as infusion techniques. Polypeptide compositions may also be administered via microspheres, liposomes, immune-stimulating complexes (ISCOMs), or other microparticulate delivery systems or sustained release formulations introduced into suitable tissues (such as blood). Dosage regimens can be adjusted to provide the optimum desired response (e.g., a therapeutic or prophylactic response). A suitable dosage range may, for instance, be 0.1 μg/kg-100 mg/kg body weight of the polypeptide or nanoparticle thereof. The composition can be delivered in a single bolus, or may be administered more than once (e.g., 2, 3, 4, 5, or more times) as determined by attending medical personnel.
The subject may be any subject having cancer or at risk of RSV infection. In one embodiment, the subject is a mammalian subject. In another embodiment, the subject is a human subject.
In one aspect, the disclosure provides calcium imaging probes, comprising a polypeptide comprising an amino acid sequence at least 75%, 80%, 85%, 90%, 91%, 92%, 93%, 94%, 95%, 96%, 97%, 98%, 99%, or 100% identical to the amino acid sequence selected from SEQ ID NO:9-10, wherein the polypeptide binds to Ca2+. As disclosed above, these polypeptides may be used, for example, for calcium detection.
In one aspect, the disclosure provides electron microscopy stains, comprising a polypeptide comprising an amino acid sequence at least 75%, 80%, 85%, 90%, 91%, 92%, 93%, 94%, 95%, 96%, 97%, 98%, 99%, or 100% identical to the amino acid sequence selected from SEQ ID NO:1-8, wherein the polypeptide binds to Co2+. As disclosed above, these polypeptides may be used, for example, to bind to heavy metals such as Co2+, in staining a sample for electron microscopy.
In another aspect, the disclosure provides methods for protein design comprising: accessing amino acid sequences and structures of reference proteins;
training a computational model, using the amino acid sequences, the structures, and fragments thereof, to perform functions comprising:
As described in the examples, the methods of the disclosure (referred to as “inpainting” or “missing information recovery” approaches) start from a desired functional site and jointly fill in the missing sequence and structure needed to complete the protein (i.e.: functional site scaffolding). As used herein, inpainting means predicting the sequence and structure of a missing region of a protein. In some examples, the functional site can be defined by a few side chains around e.g. a metal site. In other examples, a few amino acids on either side of the key residues are also considered to be part of the functional site. For example, in the di-iron case, although there are only 6 residues that contact the metals, about 30 residues can be considered part of the functional site.
In one embodiment, the structures of the reference proteins characterize bonds between amino acids of the reference proteins. In another embodiment, training the computational model comprises minimizing an overall difference between the reference proteins and proteins that are inferred by the computational model when the computational model is provided the fragments of the amino acid sequences and/or structures of the reference proteins. In a further embodiment, the structures of the reference proteins include one or more of secondary structure, tertiary structure, or quaternary structure of the reference proteins.
In one embodiment, the reference proteins comprise a particular reference protein, wherein the fragments of the reference proteins comprise a fragment of the particular reference protein, and wherein training the computational model comprises:
generating an inferred fragment that together with the fragment of the particular protein defines an inferred protein, wherein the inferred fragment includes only amino acid sequence information; and
comparing the inferred protein to the particular reference protein.
In another embodiment, the reference proteins comprise a particular reference protein, wherein the fragments of the reference proteins comprise a fragment of the particular reference protein, and wherein training the computational model comprises:
generating an inferred fragment that together with the fragment of the particular protein defines an inferred protein, wherein the inferred fragment includes amino acid sequence information and structural information; and
comparing the inferred protein to the particular reference protein.
In a further embodiment, the reference proteins comprise a particular reference protein, wherein the fragments of the reference proteins comprise a fragment of the particular reference protein, and wherein training the computational model comprises:
generating an inferred fragment that together with the fragment of the particular protein defines an inferred protein, wherein the inferred fragment includes only structural information; and
comparing the inferred protein to the particular reference protein.
In a still further embodiment, the reference proteins comprise a first reference protein, a second reference protein, and a third reference protein, wherein the fragments of the reference proteins comprise a first fragment of the first reference protein, a second fragment of the second reference protein, and a third fragment of the third reference protein, and wherein training the computational model comprises:
(a) randomly selecting a task from a group that includes a first training task, a second training task, and a third training task;
(b) performing the task; and repeating steps (a) and (b) one or more times, wherein:
In one embodiment, the fragments of the reference proteins respectively include at least 15% of the amino acid sequences of the reference proteins. In another embodiment, the fragments of the reference proteins respectively include at least 15% of the structures of the reference proteins. In a further embodiment, the method further comprises randomly selecting amino acid sequences that make up the fragments of the reference proteins.
In one embodiment, the method further comprises randomly selecting the structures that make up the fragments of the reference proteins. In another embodiment, the fragments of the reference proteins each include at least 10 amino acids. In a further embodiment, the fragments of the reference proteins each include no more than 35 amino acids. In a still further embodiment, the data does not include amino acid sequence or structure other than for the functional site.
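The training embodiments above can be sketched, under stated assumptions, as a routine that randomly selects a contiguous fragment to keep visible and randomly selects which kind of information the model must infer outside that fragment. The task names, the helper names, and the use of `None` for hidden entries are all hypothetical. The 10 and 35 residue bounds come from the embodiments above, the 15% minimum is not enforced here, and the mapping of each task to hidden sequence and/or hidden structure is one interpretation of the text rather than a confirmed implementation.

```python
import random

# Hypothetical task labels: what the inferred fragment must contain.
TASKS = ("sequence_only", "structure_only", "sequence_and_structure")


def sample_fragment(protein_len, min_len=10, max_len=35, rng=random):
    """Return [start, end) of a random contiguous fragment kept visible."""
    frag_len = rng.randint(min(min_len, protein_len), min(max_len, protein_len))
    start = rng.randint(0, protein_len - frag_len)
    return start, start + frag_len


def make_training_example(sequence, coords, rng=random):
    """Hide information outside a random fragment according to a random task.

    Returns (task, (start, end), visible_sequence, visible_coords), where
    hidden entries are replaced by None.
    """
    if len(sequence) != len(coords):
        raise ValueError("sequence and coords must have the same length")
    task = rng.choice(TASKS)
    start, end = sample_fragment(len(sequence), rng=rng)
    in_fragment = [start <= i < end for i in range(len(sequence))]
    hide_seq = task in ("sequence_only", "sequence_and_structure")
    hide_xyz = task in ("structure_only", "sequence_and_structure")
    visible_seq = [
        aa if (keep or not hide_seq) else None
        for aa, keep in zip(sequence, in_fragment)
    ]
    visible_xyz = [
        xyz if (keep or not hide_xyz) else None
        for xyz, keep in zip(coords, in_fragment)
    ]
    return task, (start, end), visible_seq, visible_xyz


# Toy 60-residue protein with placeholder coordinates (not real data).
seq = "ACDEFGHIKLMNPQRSTVWY" * 3
xyz = [(float(i), 0.0, 0.0) for i in range(len(seq))]
task, (start, end), vis_seq, vis_xyz = make_training_example(
    seq, xyz, random.Random(0)
)
```

A training loop would then compare the model's completion of the hidden entries against the full reference protein.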
In one embodiment, the disclosure provides protein design methods, comprising receiving data comprising an amino acid sequence and/or a structure of a functional site or a portion thereof; and generating an inferred amino acid sequence and/or an inferred structure that scaffolds the functional site and maintains the amino acid sequence and the structure of the functional site. In one embodiment, the structure of the functional site characterizes bonds between amino acids of the functional site. In a further embodiment, the inferred structure characterizes bonds between amino acids of the inferred amino acid sequence.
In one embodiment, the inferred amino acid sequence is a first inferred amino acid sequence, and the inferred structure is a first inferred structure, the method further comprising:
generating a second inferred amino acid sequence and/or a second inferred structure that scaffolds the functional site and maintains the amino acid sequence and the structure of the functional site; and
determining whether (a) the first inferred amino acid sequence and/or the first inferred structure or (b) the second inferred amino acid sequence and/or the second inferred structure better conforms to one or more functional criteria.
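One way such a comparison between two inferred designs might be organized is sketched below. The representation of a design as a dictionary of metrics, the metric names `site_deviation` and `confidence`, and the rule of summing criterion scores are all assumptions made for illustration; the disclosure does not prescribe a particular scoring scheme.

```python
from typing import Callable, Dict, Sequence

Design = Dict[str, float]              # hypothetical: metric name -> value
Criterion = Callable[[Design], float]  # returns a score; higher is better


def conformance(design: Design, criteria: Sequence[Criterion]) -> float:
    """Sum the scores of all functional criteria for one design."""
    return sum(criterion(design) for criterion in criteria)


def better_design(
    first: Design, second: Design, criteria: Sequence[Criterion]
) -> Design:
    """Return whichever design scores higher; ties go to the first."""
    if conformance(first, criteria) >= conformance(second, criteria):
        return first
    return second


# Hypothetical criteria: low deviation of the functional site, high confidence.
criteria = [
    lambda d: -d["site_deviation"],
    lambda d: d["confidence"],
]
first = {"site_deviation": 1.5, "confidence": 0.75}
second = {"site_deviation": 0.5, "confidence": 0.5}
chosen = better_design(first, second, criteria)
```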
In one embodiment, the data does not include amino acid sequence or structure other than for the functional site.
The disclosure also provides polypeptides designed by the protein design method of any embodiment or combination of embodiments herein.
The disclosure also provides non-transitory computer readable media storing instructions that, when executed by a computing device, cause the computing device to perform the method of any embodiment or combination of embodiments of the protein design methods herein. The non-transitory computer readable medium can be any type of memory, such as volatile memory like random access memory (RAM), dynamic random access memory (DRAM), static random access memory (SRAM), or non-volatile memory like read-only memory (ROM), flash memory, magnetic or optical disks, or compact-disc read-only memory (CD-ROM), among other devices used to store data or programs on a temporary or permanent basis. Additionally, the non-transitory computer readable medium can store instructions. The instructions are executable by the one or more processors to cause the computing device to perform any of the functions or methods described herein.
In another embodiment, the disclosure provides computing devices comprising:
one or more processors; and
a computer readable medium storing instructions that, when executed by the one or more processors, cause the computing device to perform the method of any embodiment or combination of embodiments of the protein design methods herein.
The computing device includes one or more processors, a non-transitory computer readable medium, a communication interface, and a user interface. Components of the computing device are linked together by a system bus, network, or other connection mechanism. The one or more processors can be any type of processor(s), such as a microprocessor, a field programmable gate array, a digital signal processor, a multicore processor, etc., coupled to the non-transitory computer readable medium.
The communication interface can include hardware to enable communication within the computing device and/or between the computing device and one or more other devices. The hardware can include any type of input and/or output interfaces, a universal serial bus (USB), PCI Express, transmitters, receivers, and antennas, for example. The communication interface can be configured to facilitate communication with one or more other devices, in accordance with one or more wired or wireless communication protocols. For example, the communication interface can be configured to facilitate wireless data communication for the computing device according to one or more wireless communication standards, such as one or more Institute of Electrical and Electronics Engineers (IEEE) 802.11 standards, ZigBee standards, Bluetooth standards, etc. As another example, the communication interface can be configured to facilitate wired data communication with one or more other devices. The communication interface can also include analog-to-digital converters (ADCs) or digital-to-analog converters (DACs) that the computing device can use to control various components of the computing device or external devices.
The user interface can include any type of display component configured to display data. As one example, the user interface can include a touchscreen display. As another example, the user interface can include a flat-panel display, such as a liquid-crystal display (LCD) or a light-emitting diode (LED) display. The user interface can include one or more pieces of hardware used to provide data and control signals to the computing device. For instance, the user interface can include a mouse or a pointing device, a keyboard or a keypad, a microphone, a touchpad, or a touchscreen, among other possible types of user input devices. Generally, the user interface can enable an operator to interact with a graphical user interface (GUI) provided by the computing device (e.g., displayed by the user interface).
The description of embodiments of the disclosure is not intended to be exhaustive or to limit the disclosure to the precise form disclosed. While the specific embodiments of, and examples for, the disclosure are described herein for illustrative purposes, various equivalent modifications are possible within the scope of the disclosure, as those skilled in the relevant art will recognize.
Abstract
Current approaches to de novo design of proteins harboring a desired binding or catalytic motif require pre-specification of an overall fold or secondary structure composition, and hence considerable trial and error can be required to identify protein structures capable of scaffolding an arbitrary functional site. Here we describe two complementary approaches to the general functional site scaffolding problem that employ the RosettaFold™ and AlphaFold™ neural networks which map input sequences to predicted structures. In the “inpainting” or “missing information recovery” approach, we start from the desired functional site and jointly fill in the missing sequence and structure needed to complete the protein in a single forward pass through a modified RosettaFold™ network. Inpainting offers greater computational efficiency and accuracy in some regimes. We illustrate the application of the new methods to the design of candidate immunogens presenting epitopes recognized by neutralizing antibodies, receptor traps for escape-resistant viral inhibition, metalloproteins and enzymes, and protein binding proteins. AlphaFold™ structure predictions suggest the designed sequences fold to the designed structures, and neutralizing antibody, metal ion, and protein target binding assays on designs expressed in bacteria confirm the designed functions.
The biochemical functions of proteins are often carried out by a subset of residues which constitute a functional site—for example, an enzyme active site or a protein or small molecule binding site—and hence the design of proteins with new functions can be divided into two steps. The first step is to identify functional site geometries and amino acid identities which produce the desired activity—this can be done using quantum chemistry calculations in the enzyme case (to identify ideal theozymes for catalyzing a desired reaction) (1-3) or fragment docking calculations in the protein binder case (4, 5); alternatively functional sites can be extracted from a native protein having the desired activity (6, 7). In this paper, we focus on the second step: given a functional site description from any source, design an amino acid sequence which folds up to a three dimensional structure containing the site. Current methods have the limitations that assumptions must be made about the secondary structure of the scaffold, and that the amino acid sequence must be generated in a subsequent sequence step, so there is no guarantee that the generated backbones are in fact designable (encodable by some amino acid sequence). An ideal method for functional de novo protein design would 1) embed the functional site with minimal distortion in a designable scaffold protein; 2) be applicable to arbitrary site geometries, searching over all possible scaffold topologies and secondary structure compositions for those optimal for harboring the specified site, and 3) jointly generate backbone structure and amino acid sequence.
In the training of AlphaFold™ and recent versions of RosettaFold™ (see methods) for structure prediction a small fraction (15%) of tokens in the MSA are masked or corrupted, and the network learns to recover this missing sequence information in addition to predicting structure. We reasoned that this ability to recover sequence information along with structural information could provide a second solution to the functional site scaffolding problem: given a functional site description, a forward pass through the network could potentially be used to complete, or “inpaint”, both protein sequence and structure in a missing/masked region of protein (
To test whether improving the ability of RosettaFold™ to predict missing sequence might help it to reason simultaneously over both sequence and structure (i.e. to inpaint), we began from a RosettaFold™ model trained for structure prediction (15) and carried out further training on fixed-backbone sequence design in addition to the standard fixed-sequence structure prediction task (Materials and Methods). Despite not being explicitly trained on inpainting (predicting the sequence and structure of a missing region of protein), this model, denoted RFimplicit, was indeed able to recover small, contiguous regions missing both sequence and structure. Encouraged by this initial result, we trained a model explicitly on inpainting segments with missing sequence and structure given the surrounding protein context, in addition to sequence design and structure prediction tasks (
To evaluate in silico the quality of designs generated by our methods, we use the AlphaFold™ (AF) protein structure prediction network (18) which has high accuracy on de novo designed proteins (19). Although RosettaFold™ and AF share architectural similarities and were both trained on structures from the Protein Data Bank, the two models were trained independently and hence AF predictions can be regarded as a partially orthogonal in silico test of whether RF designed sequences fold into the intended structures, analogous to traditional ab initio folding benchmarks (13, 20).
In the following sections, we highlight the power of the inpainting methods by designing proteins containing a wide range of functional motifs (
We next explored the scaffolding of functional sites involved in metal-binding and catalysis. We designed scaffolds around a di-iron binding site, which is important in biological systems for iron storage (25) and also potentially harnessable for catalysis (26, 27). The motif, composed of four roughly parallel helical segments from E. coli bacterioferritin (cytochrome b1), was recapitulated with sub-angstrom RMSDs with inpainting (
We next pursued scaffolding the calcium-binding EF-hand motif (30) composed of a 12-residue loop flanked by helices. Inpainting readily generated scaffolds recapitulating either 1 or 2 EF-hand motifs to within 1.0 Å RMSD of the native motif (
We next explored the design of protein-binding proteins by scaffolding short segments from known binding proteins. To design binders to the cancer checkpoint protein PD-L1, we scaffolded 2 discontiguous segments of the interfacial beta-sheet from a high-affinity mutant of PD-1 (
The cell surface receptor TrkA dimerizes upon binding to its ligand, nerve growth factor (NGF) (38). Previously, a de novo three-helix bundle was designed to bind TrkA and found to antagonize downstream signaling (4). With the goal of making bivalent TrkA binders that could stimulate signaling, we aligned two copies of the previous TrkA binder three-helix bundle to the signaling-competent TrkA structure (
Training RosettaFold™ to jointly model sequence and structure (RFjoint)
Standard RosettaFold™ (15) (RF) has been trained on structure prediction (sequence inputs, structure outputs) using homolog templates (structure input). In the newer versions, we mask a portion of the input MSA and apply a loss to predictions of the masked amino acids (sequence output) to encourage the network to extract more meaning from the MSA (18, 65). RFjoint was fine-tuned from a pre-trained RosettaFold™ model (RF-Nov05-2021, see Supplementary Text, “RosettaFold™ variants” section for details on the architecture of this model). The training regime for this model, which was initially trained solely on structure prediction, is below:
Training set: 25% of examples came from the PDB (published before February 17th, 2020), which is the same training set used in the original RosettaFold™ model (15). The other 75% of examples included a distillation set of AlphaFold™ predicted structures (66). This distillation set was clustered at 30% sequence identity cutoff, and sequences sharing greater than 30% similarity to any protein in the PDB were excluded. Only proteins greater than 200 residues in length, with mean AlphaFold™ pLDDT>85 were included in training, and only residues with per-residue pLDDT>70 were included from these models. The AdamW optimizer was used throughout training, with default PyTorch parameters. The epoch size was 25600 training examples, with a batch size of 64. The learning rate for the initial round of training (200 epochs) was 0.001, with a linear warm-up for the first 1000 optimization steps. The learning rate was then decayed by a factor of 0.95 after every 10000 optimization steps. A crop size of 256 residues was used, with cropping following the same strategy as described previously (15). The number of MSA seed sequences was 128, and the number of extra MSA sequences was 1024. For the second stage of training (100 epochs), the learning rate was set to 0.0005 (no warm-up), with learning rate decay by a factor of 0.95 every 10000 optimization steps. A larger crop size (350 residues), and more MSA sequences (256 seed sequences, 2048 extra sequences) were used in this second phase of training.
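For illustration only, the learning-rate schedule described above (linear warm-up over the first 1000 optimization steps, then multiplication by 0.95 after every 10000 steps) might be sketched as follows. The function name, the exact shape of the warm-up, and the choice to count decay blocks from the first step are assumptions; the text does not specify them.

```python
def learning_rate(step, base_lr=1e-3, warmup_steps=1000,
                  decay=0.95, decay_every=10_000):
    """Hypothetical schedule: linear warm-up, then stepwise decay."""
    # Linear warm-up: the scale rises from 1/warmup_steps to 1.0 (assumed form).
    if warmup_steps > 0:
        warmup = min(1.0, (step + 1) / warmup_steps)
    else:
        warmup = 1.0
    # Multiply by `decay` once per completed block of `decay_every` steps.
    return base_lr * warmup * decay ** (step // decay_every)


# 25600 examples per epoch at a batch size of 64 gives 400 steps per epoch.
STEPS_PER_EPOCH = 25600 // 64
```

Under the same assumptions, the second training stage would correspond to calling `learning_rate(step, base_lr=5e-4, warmup_steps=0)`.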
Starting with this pre-trained RosettaFold™, we fine-tuned this model for inpainting, for an additional 27 epochs on three tasks (
The loss function formulation for RFjoint is as follows.
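$$\mathcal{L}_{\text{total}} = 1.0\,\mathcal{L}_{\text{dist}} + 3.0\,\mathcal{L}_{\text{aa}} + 1.0\,\mathcal{L}_{\text{tors}} + 5.0\,\mathcal{L}_{\text{FAPE}} + 0.1\,\mathcal{L}_{\text{lddt}}$$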
where dist is a cross entropy loss over the distogram and anglegram predictions as described in (15), aa is a cross entropy loss over any masked positions in the input MSA, tors is a cross entropy loss on binned backbone dihedral angle predictions, FAPE is a backbone-level frame aligned point error, as described in (18), with a ReLU cutoff of 20, and lddt is the lDDT loss as calculated in (15). Note that structure-related losses are applied over the entire predicted protein, and the sequence cross entropy loss is only applied at masked (Tasks 1 and 2) and/or corrupted (Task 3) regions. For the fixed-backbone sequence design task (
Joint Sequence-Structure Inpainting with a Jointly Trained RosettaFold™
To apply RFjoint to protein design, we input a sequence and structure, masking certain residues in the sequence by replacing them with mask tokens and masking corresponding residues in the structure by setting their template embeddings to zero (15). We then predict the structure and sequence logits for the entire protein. The output structure, including both the originally masked and the originally unmasked regions, is used as the design model, and the most probable predicted amino acid at each masked position (argmax) is taken to complete the sequence. Note that in the RF-Nov05-2021 version of RosettaFold™ used to train RFjoint, as in AlphaFold™, latent representations of the output structure are ‘recycled’ back through the network to refine the final structure. During inpainting, we utilize this ‘recycling’ to refine our inpainted sequence and structure, typically recycling information 5-15 times (similar to the number of times used for structure prediction with RosettaFold™, which is typically 10). A single design of 100 amino acids in length, using 10 iterations of inpainting, takes 5.3 seconds on a GeForce RTX 2080 GPU. We refer to this prediction, with recycling, as a ‘forward pass’ through the network. The iterative inpainting method described above is approximately deterministic. To sample ensembles of outputs with small variations in sequence and structure using RFjoint, we vary the exact boundaries of masked regions, the length of the region that replaces a masked region, or specific input coordinates (for example, in
For each experimentally tested design case shown in this paper, we generated between 4000 and 30,000 designs, and filtered these based on the AF pLDDT and the motif RMSD of AF predictions (see supplementary text for exact cutoffs). Broadly, these included ‘confident/accurate’ AF pLDDT (>80) and sub-angstrom (<1 Å) RMSD. Orthogonal filters were determined on a per-problem basis (fully outlined in the supplementary text), but broadly comprised features such as radius of gyration, Rosetta™ per-residue spatial aggregation propensity (SAP) score (67), net charge (#Arg+#Lys−#Asp−#Glu) and structural diversity. The cutoffs were typically chosen to give an experimentally tractable final number of designs. In some cases, in preparation of the final set of proteins to be ordered, and after design filtering, we performed a final visual inspection to look qualitatively at aspects such as poor core packing, presence of cavities, buried polar groups, or surface hydrophobics, which typically reduced the set of proteins by around 0-50%.
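A minimal sketch of the broad filters described above is given here for orientation. The net-charge formula follows the text (#Arg+#Lys−#Asp−#Glu); the function names, the example records, and the use of the broad cutoffs (pLDDT>80, motif RMSD<1 Å) rather than the per-problem cutoffs in the supplementary text are assumptions.

```python
def net_charge(sequence):
    """Net charge as defined in the text: #Arg + #Lys - #Asp - #Glu."""
    seq = sequence.upper()
    return seq.count("R") + seq.count("K") - seq.count("D") - seq.count("E")


def passes_core_filters(plddt, motif_rmsd, plddt_min=80.0, rmsd_max=1.0):
    """Broad cutoffs from the text: AF pLDDT > 80 and motif RMSD < 1 Angstrom."""
    return plddt > plddt_min and motif_rmsd < rmsd_max


# Hypothetical records: (name, sequence, AF pLDDT, motif RMSD in Angstrom).
designs = [
    ("design_a", "MKRELDEAKR", 91.2, 0.64),
    ("design_b", "MDEEDLLKAG", 74.5, 0.58),
    ("design_c", "MKKRALLGSW", 88.0, 1.37),
]
kept = [name for name, _seq, plddt, rmsd in designs
        if passes_core_filters(plddt, rmsd)]
```

Only `design_a` passes both cutoffs in this made-up example; orthogonal filters such as SAP score or radius of gyration would be applied in the same way once computed.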
For designs that were only validated in silico, that are represented in the figures, we filtered designs predominantly on AlphaFold™ pLDDT and RMSD, as well as radius of gyration. The AlphaFold™ metrics are presented in Table 3.
The “model 4” weights were used for all AF predictions for filtering. The pLDDT was taken as the average of the residue-wise confidence values output by the network. Using AF to filter our designs has the risk of designing “adversarial examples”, or sequence-structure pairs that score well by AF that do not fold or function in reality, due to the presence of artifactual minima in the loss landscape of the structure-prediction model (68, 69). However, because we design using RosettaFold™, which is trained independently of AF (although both use the PDB as training data), any final designs must be well-predicted by two partially orthogonal networks, which is expected to provide some (although not total (70)) robustness to adversarial examples. This is supported by our finding that a high fraction of our designs are solubly expressed. Additionally, if we redesign the sequence of our highest-pLDDT designs by Rosetta™, pLDDT continues to be high, indicating that the original hallucination had a designable backbone (and isn't purely an artifact of RF or AF's loss landscape). Finally, we find that the AF pLDDT of our RF-generated designs correlates well with physics-based metrics such as Rosetta™ energy and ab initio folding.
To score protein binder designs, we used a modified AlphaFold™ prediction script that took as input the design model of the target-binder complex (from inpainting) and the concatenated binder-target sequence (with a residue number gap to denote different chains). AF was asked to predict the complex structure from single-sequence, given the target protein structure as template information and its structural representation (atom coordinates) of the binder-target complex initialized to the target-binder complex design model. The confidence in AF2's prediction of the interface was assessed by the inter-chain predicted aligned error (inter-PAE), or the average value of interchain positions in the predicted aligned error matrix. We found that inter-PAE<10 Å corresponded to predicted complexes that were docked roughly correctly, while predictions with inter-PAE above this threshold usually had binder and target far apart in space. In addition to inter-PAE, we also filtered on: binder pLDDT (average residue-wise confidence over the binder from complex prediction); AF-Rosetta™ ddG (Rosetta™ ddG calculated on the AF model after minimizing interface side chains); target-aligned binder RMSD (RMSD of the binder, after aligning AF and RF models on the target).
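The inter-PAE metric described above (the average of the inter-chain entries of the predicted aligned error matrix) could be computed along the following lines. This is a sketch under stated assumptions: the binder is taken to occupy the first `binder_len` positions of the concatenated binder-target sequence, and both off-diagonal blocks are averaged together; neither detail is confirmed by the text.

```python
def inter_pae(pae, binder_len):
    """Mean PAE over inter-chain residue pairs.

    `pae` is a square list of lists for the concatenated binder-target
    sequence; the first `binder_len` residues are assumed to be the binder.
    """
    n = len(pae)
    values = []
    for i in range(n):
        for j in range(n):
            # A pair is inter-chain when exactly one index is in the binder.
            if (i < binder_len) != (j < binder_len):
                values.append(pae[i][j])
    return sum(values) / len(values)


# Toy 3-residue complex: two binder residues followed by one target residue.
example_pae = [
    [0.0, 1.0, 8.0],
    [1.0, 0.0, 6.0],
    [9.0, 5.0, 0.0],
]
score = inter_pae(example_pae, binder_len=2)
docked = score < 10.0  # threshold reported in the text
```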
All designs tested in E. coli were cloned, expressed and purified using standard methods. Briefly, Golden Gate assembly with BsaI-HF (New England Biolabs) was used to insert designs into a modified pET29b+ vector containing C-terminal SNAC (71) and 6×His tags (or, in the case of EFhand_inp_1, into a modified pET29b+ vector with a C-terminal TEV cleavage site and a 6×His tag). Plasmids were transformed into BL21 bacteria. For small-scale expression tests, bacteria were cultured overnight at 37° C. in 2 ml cultures of lysogeny broth (LB) supplemented with 50 μg/mL of kanamycin. Cells were then grown in 2 ml cultures of Terrific Broth (TB) for one hour, before induction with 1 mM of IPTG for 4 hours. Cells were then lysed with B-PER supplemented with 1 mM PMSF, 0.1 mg/mL Lysozyme, 25 U/ml Benzonase, before lysate clarification by centrifugation. Lysate was incubated with 75 μl Ni-NTA resin, before washing thrice with wash buffer (25 mM Tris, 300 mM NaCl, 20 mM Imidazole, pH 7.8) and elution in 25 mM Tris, 300 mM NaCl, 250 mM Imidazole. Expression was assessed by SDS-PAGE. For larger scale cultures, cultures were grown overnight at 37° C. in autoinduction medium (72), before sonication-based lysis in wash buffer supplemented with 1 mM PMSF, 0.1 mg/mL Lysozyme, 0.01 mg/ml DNase I. After centrifugal lysate clarification, lysates were incubated with an appropriate volume of Ni-NTA resin and subsequently washed thrice with wash buffer. For purification of di-iron binding proteins, the His-tag was cleaved off by cleavage of the SNAC-tag. Briefly, after binding to the Ni-NTA resin, the protein was washed in SNAC cleavage buffer (100 mM CHES, 100 mM Acetone oxime, 100 mM NaCl, 500 mM GuHCl, pH 8.6) before addition of 2 mM NiCl2. After overnight cleavage, proteins were further purified by size exclusion chromatography on a Superose 75 column in 20 mM Hepes, 100 mM KCl, pH 7.8, and monomeric fractions pooled.
Analysis of cobalt binding to inpainted di-iron binders was performed essentially as described previously (28). Proteins (200 μM in 20 mM Hepes, 100 mM KCl, pH 7.8) were incubated overnight with or without an 8× molar excess (1600 μM) of CoCl2. Absorbance spectra were collected in a Jasco V-750 spectrophotometer. Mean background absorbance (measured between 700 and 800 nm) was subtracted from all spectra. Successful designs showed absorbance peaks characteristic of cobalt coordinated in a tetra/penta-coordinate state.
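The background correction described above (subtracting the mean absorbance measured between 700 and 800 nm from each spectrum) can be sketched as follows. The function name, the inclusive window endpoints, and the example values are assumptions made for illustration.

```python
def subtract_background(wavelengths_nm, absorbance, lo=700.0, hi=800.0):
    """Subtract the mean absorbance in [lo, hi] nm from the whole spectrum."""
    window = [a for w, a in zip(wavelengths_nm, absorbance) if lo <= w <= hi]
    if not window:
        raise ValueError("spectrum has no points in the background window")
    background = sum(window) / len(window)
    return [a - background for a in absorbance]


# Made-up spectrum: wavelengths in nm and raw absorbance values.
wavelengths = [500.0, 550.0, 600.0, 700.0, 750.0, 800.0]
raw = [0.30, 0.42, 0.35, 0.11, 0.10, 0.09]
corrected = subtract_background(wavelengths, raw)
```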
Transformed yeast were cultured in TRP(−), URA(−) media for two days followed by expression culture. Samples containing ~8.5e7 cells were incubated in TBS (pH 8.0) containing 1 mM Ca2+ and washed twice with TBS only. Yeast cells were resuspended in TBS containing 50 μM Tb3+ for 3 hours and then washed twice in TBS+1 mM Ca2+. Washed samples were moved to black-bottom 96-well plates for plate-reader measurement of fluorescence spectra. Fluorescence signals were collected using a flash plate reader in time-resolved fluorescence mode (TRF, delay time: 100 μs, integration time: 1000 μs, gain: 130).
Designs harboring the EF-hand motif were purified by His-purification as described above. After size exclusion chromatography in 20 mM Hepes, 150 mM KCl, pH 7.8, the His tag was cleaved by TEV-cleavage, with the addition of 40 μM Super-TEV protease, 1 mM DTT and 0.5 mM EDTA (overnight at room temperature). To ensure the EF-hands were not bound to any residual calcium in buffers, after passing through a NiNTA-column after TEV-cleavage, proteins were run on a size exclusion column equilibrated in 20 mM Hepes, 150 mM KCl, pH 7.8 buffer, which had been Chelex treated overnight to remove any residual calcium. Proteins were incubated (or not) with terbium (40 μM terbium in 5 μM protein) for 3 hours, before analysis of terbium fluorescence on a NEO2 plate reader. Samples were excited at 250 nm (to excite the tryptophan residue near the EF-hand motif), and fluorescence was measured between 450 and 650 nm, 100-1000 s after excitation.
All circular dichroism (CD) analyses except those for RSV-F site V immunogens were performed on a JASCO J-1500 CD Spectrophotometer. Di-iron binding proteins were analyzed at 6.7 μM in 20 mM Hepes, 10 mM KCl, pH 7.8, with or without an 8× molar excess of CoCl2. Analysis of the EF-hand inpaint was performed at 20 μM in Chelex 100-treated 20 mM Hepes, 150 mM KF, pH 7.6, in the presence or absence of 200 μM CaCl2. Analysis of the PD-L1 binder was performed at 5 μM in 20 mM Hepes, 10 mM KCl, pH 7.8. Thermal melt analyses were performed between 25° C. and 95° C., measuring CD at 222 nm. All reported measurements were measured within the linear range of the instrument.
For RSV-F designs, CD spectra were measured using a Chirascan™ V100 spectrometer in a 1-mm path-length cuvette. The protein samples were diluted to 30 μM in PBS. Wavelengths between 195 nm and 250 nm were recorded. Thermal melt analyses were performed between 20° C. and 95° C. with an increment of 2° C./min, measuring CD at 222 nm. All spectra were corrected for buffer absorption.
As an initial screen for protein binding, linear DNA fragments were synthesized as “e-blocks” (Integrated DNA Technologies), pooled, and transformed into the yeast strain EBY100 (by electroporation if >100 designs, by the lithium acetate method otherwise) along with a pETCON3 backbone linearized at NdeI and XhoI (for Aga2p and c-Myc fusion) (4, 5). The transformed pool was inoculated into CTUG medium (yeast nitrogen base 6.7 g/L (Difco)+complete amino acids−trp−ura+2% glucose) and incubated 12-16 hours at 30° C. with shaking, then diluted 200 μL+2 mL into SGCAA (yeast nitrogen base 6.7 g/L+complete amino acids 5 g/L (Bacto)+90 mM Na2HPO4+2% galactose+0.1% glucose) and incubated 12-16 hours to induce binder expression and display. For flow sorting, around 10^7 cells were harvested, washed 3× in TBSF (50 mM Tris-HCl pH 8.0, 150 mM NaCl, 1% bovine serum albumin), incubated in TBSF with biotinylated binding target for 30 minutes at room temperature, washed 1× in TBSF, incubated for 30 minutes at room temperature in 0.1 mg/mL FITC anti-c-Myc (ICL Lab) and 70 mg/mL streptavidin R-phycoerythrin (PE) conjugate (Invitrogen), and washed 3× in TBSF. The binding target and FITC/PE were added in the same incubation when labeling with avidity. Cells were sorted on a Sony SH800 flow sorter and 10^3-10^6 FITC+/PE+ cells were collected. The cells were either cultured in liquid CTUG for another round of sorting, or plated onto CTUG agar and individual colonies Sanger-sequenced to identify the designs. For trRosetta™-hallucinated PD-L1 binders and Mdm2 binders, clonal yeast cultures expressing a single design were analyzed in binding assays to confirm the results of sorting as well as to assess the binding affinity of designs. In this case, yeast culture and binding were performed identically as above except that an Attune NxT™ (Invitrogen) flow cytometer was used to analyze the cells. For all other problems, hits identified by yeast display were followed up by E. coli expression and purification.
SPR measurements were performed on a Biacore™ 8K (GE Healthcare) in 10 mM HEPES pH 7.4, 150 mM NaCl, 3 mM EDTA, 0.005% v/v Surfactant P20 (GE Healthcare). Ligands were immobilized on a CM5 chip (GE Healthcare) via amine coupling. The preRSVF and RSVF-site V immunogens were immobilized at approximately 300-500 response units (RU). The site V specific RSV90 Fab was injected as analyte in two-fold serial dilutions. The flow rate was 30 μl/min for a contact time of 120 s followed by 400 s dissociation time. After each injection, the surface was regenerated using 0.1 M glycine at pH 3.0. KD values were obtained by fitting the maximum response versus log10 Fab concentration to a sigmoid function using Python and scipy's non-linear curve_fit function.
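The KD fit described above could look roughly like the sketch below, which uses `scipy.optimize.curve_fit` as named in the text. The specific sigmoid (a one-site binding curve with unit Hill slope, written in log10 concentration), the starting guess, and the synthetic data are assumptions; the text states only that a sigmoid function was fitted.

```python
import numpy as np
from scipy.optimize import curve_fit


def sigmoid(log_conc, r_max, log_kd):
    """One-site binding curve written as a sigmoid in log10(concentration)."""
    return r_max / (1.0 + 10.0 ** (log_kd - log_conc))


def fit_kd(conc, response):
    """Fit maximum responses versus log10(concentration); returns (KD, Rmax)."""
    log_conc = np.log10(np.asarray(conc, dtype=float))
    response = np.asarray(response, dtype=float)
    p0 = [response.max(), np.median(log_conc)]  # crude starting guess
    (r_max, log_kd), _cov = curve_fit(sigmoid, log_conc, response, p0=p0)
    return 10.0 ** log_kd, r_max


# Synthetic, noise-free two-fold dilution series (made-up values, in nM).
conc_nM = 1000.0 / 2.0 ** np.arange(10)
resp = sigmoid(np.log10(conc_nM), r_max=100.0, log_kd=np.log10(50.0))
kd_nM, r_max_fit = fit_kd(conc_nM, resp)
```

With real sensorgrams, `resp` would hold the maximum response observed at each Fab concentration, and the fitted KD carries the same units as the concentrations supplied.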
BLI binding experiments were performed on an Octet Red96™ (ForteBio), with streptavidin coated tips (Sartorius Item no. 18-5019) and BLI buffer (10 fold dilution of 10× HBS−EP+buffer [Cytiva Item no. BR100669] supplemented with 0.1% w/v bovine serum albumin). Tips were pre-incubated in BLI buffer for at least 30 minutes before use. To collect binding data, the tips were incubated in BLI buffer for 100 s, loaded with biotinylated TrkA (30 nM in BLI buffer; a kind gift from Chris Garcia's lab) for 300 s, equilibrated in BLI buffer to obtain a baseline for 150 s, dipped into BLI buffer with the designed proteins for 900 s (association phase) and finally returned to BLI buffer for 900 s (dissociation phase). Reported responses are the change in wavelength between the beginning and end of the association phase.
The protein structure prediction performance of each RosettaFold™ variant was evaluated based on CASP14 targets and 60 recently published de novo designs (not included in the RosettaFold™ training set).
Inpainting models were fine-tuned starting from one of the pre-trained RF versions above. RFjoint was based on RF-Nov05-2021, and RFimplicit was based on RF-perceiver (see dedicated sections for precise training details).
Because RosettaFold™ only predicts backbone coordinates, we added sidechains to inpainting outputs using Rosetta™ and refined the full-atom structure by relaxing once in torsion space with predicted pairwise restraints and once in Cartesian space with only pairwise distance restraints and Cα coordinate restraints. The output of the final relax step is the model used for downstream analysis and further design.
The input motif we sought to scaffold was extracted from bacterial cytochrome b-1 (PDB accession 1BCF), and comprised four approximately parallel helices (residues A18-25, A47-54, A92-99 and A123-130, harboring motif residues GluA18, GluA51, HisA54, GluA94, GluA127, HisA130). Eight potential looping orders were inpainted (
While confidently predicted by AlphaFold™ to scaffold the motif, we noticed that some designs had a higher-than-ideal number of surface hydrophobic residues (as assessed by SAP units (67)). Given the ability of RFjoint to design sequence-given-backbone, for some designs, we used RFjoint to modestly redesign the sequence to reduce the SAP score. Specifically, we redesigned hydrophobic surface residues to reduce the predicted aggregation propensity (given either the AlphaFold™ or RFjoint model as backbone-input).
The following filters were used for filtering the inpainted designs:
The 55 inpainted EF-hand designs tested experimentally contain 51 designs from RFjoint, and 4 designs from RFimplicit.
For RFjoint designs, we began with 18,000 inpainted designs: 9,000 using native 1PRW as an input template and 9,000 from a version of 1PRW where the backbone is identical but the sequence contains a K30W mutation. In all designs, we combinatorially sampled template inputs that contained
We chose to inpaint the second set of 9,000 off of the K30W mutant because the downstream functional assay (tryptophan-enhanced fluorescence) requires tryptophan to be near the ion binding site, and we reasoned that final designs might be higher quality if the model was conditioned on a TRP residue in its input, rather than retrospectively making a TRP mutation on an unconditioned design. The AF2 pLDDT distributions for these two sets of 9,000 designs were nearly identical (mean 77 vs 76), and their motif RMSD distributions were also similar. Given this, we reasoned that a K30W mutation likely would have minimal effect on a design's AF2 prediction metrics (especially given it is a surface mutation in the design). Thus for any designs which passed filters (see below) but were not conditioned on the K30W mutation, we manually added the mutation without further calculation.
We filtered this initial set of 18,000 by AF2 pLDDT>80, and the individual EFhand domain RMSDs both being <1 Å. This yielded 1496 sequences, all of which now had the K30W mutation discussed above. We next created two mutants for each of these sequences to add a second TRP near the binding sites: T26W and F65W (numbering with reference to 1PRW). We then used AF2 to predict the structure of all mutants to ensure the addition of a second TRP was not deleterious for a design's AF2 metrics. Using a filter of AF2 pLDDT>83.7, RMSD of both domains individually <1.0 Å, and SAP score<36, we filtered this set of 2992 designs to the final set of 51 for testing.
For 4 RFimplicit designs, we started from two hallucinated designs which initially scaffolded the EFhand motif(s) from 1PRW (Table 3), denoted here as EFhand_hal_A and EFhand_hal_B.
We inpainted 300 designs seeding off of EFhand_hal_A by combinatorially sampling template inputs that contained
We inpainted 300 designs seeding off of EFhand_hal_B by combinatorially sampling template inputs that contained
Designs were filtered using AF2 pLDDT>80 and backbone RMSD between the AF2 prediction and the native 1PRW EF-hand on at least one of the motifs. We arbitrarily chose 1 design that passed these filters from each of the two sets of 300 designs. For both proteins, two mutants were created. For both mutants, the K30W (numbering with respect to 1PRW) mutation as seen above was made. Then the T26W mutation was made for one mutant, and the F65W mutation was made for the other. This process yielded 4 tested designs, one of which showed terbium binding activity in the yeast display terbium binding assay (
We used hallucination and inpainting to scaffold a 2-segment beta-sheet motif from the high-affinity consensus (HAC) PD-1 interface toward PD-L1 (5IUS chain A residues 63-82, 119-140) (36). Given the immunoglobulin-like topology of PD-1, these 2 segments do not have nearby N- and C-termini and therefore cannot easily be linked by a short hairpin; therefore, it is non-trivial to scaffold them into any fold other than their native immunoglobulin.
We generated 2 sets of inpainted designs: “free” inpaintings where only the binding motif was used as input, so RFjoint would have to generate the entire scaffold from scratch; and “guided” inpaintings where the binding motif, as well as guiding structural information input by hand, were provided. All designs were modeled in the presence of the target PD-L1, analogous to “two-chain” hallucination (Materials and Methods).
For free inpainting, we manually chose a looping order for the design to be inpainted with, starting at the N-terminus with motif segment A119-140 from 5IUS, then allowing 22-29 inpainted residues, then segment A63-82 from 5IUS, and finally 28-39 inpainted C-terminal residues. Additionally, we allowed RFjoint to redesign residues 67, 69, 71, 73, 75, and 77 in the input motif (i.e. mask and re-predict amino-acid identity, taking the most probable amino acid at each position, without masking structure) in case they changed from core to surface, or vice versa, after inpainting. We generated 314 designs using this approach. The successful binder pdl1_inp_1 is a refined (see below) version of a parent design from this set.
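To make the free-inpainting layout concrete, the sketch below enumerates the possible combinations of inpainted lengths for the looping order described above, taking the first motif segment to be 5IUS chain A residues 119-140 as in the motif definition. The constant and function names are hypothetical, and the text does not state how the 314 generated designs were distributed over these combinations.

```python
from itertools import product

MOTIF_1 = (119, 140)            # 5IUS chain A, N-terminal segment of the design
MOTIF_2 = (63, 82)              # 5IUS chain A, second segment of the design
LINKER_LENGTHS = range(22, 30)  # 22-29 inpainted residues between the motifs
C_TERM_LENGTHS = range(28, 40)  # 28-39 inpainted C-terminal residues


def segment_length(segment):
    """Number of residues in an inclusive (start, end) residue range."""
    start, end = segment
    return end - start + 1


def enumerate_layouts():
    """Yield (linker_len, c_term_len, total_len) for every length combination."""
    fixed = segment_length(MOTIF_1) + segment_length(MOTIF_2)
    for linker, c_term in product(LINKER_LENGTHS, C_TERM_LENGTHS):
        yield linker, c_term, fixed + linker + c_term


layouts = list(enumerate_layouts())
totals = [total for _, _, total in layouts]
```

Under these assumptions there are 96 length combinations, giving designs of 92 to 110 residues.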
For guided inpainting, we tried to bias RFjoint to explore a topology of a beta-sheet buttressed by 2 helices that was observed in high-scoring hallucinations. To do this we manually placed 5 “guiding” residues in an input structure and asked inpainting to generate a design containing the interface motif which generally goes through the backbone atoms of the guide residues. 4 of the guide residues correspond to the rough location of N and C termini of two helices that might buttress the sheet. The 5th guide residue is placed in the middle of one of the buttressing helices, at an elevated distance above the interfacial beta-sheet so as to induce a bend in the helix to pack against the sheet without clashes. To obtain a diversity of designs, we sampled input coordinates for each guide residue from a uniform random sphere of radius 2 Å around its original manually chosen position, and also combinatorially sampled the lengths of the regions to be inpainted. Specifically, we combinatorially sampled the following template inputs, with each masked region being uniformly sampled from allowed window lengths:
Given these inputs, RFjoint was able to generate a diverse family of PD-1 mimetics with this fold. We generated 1000 parent designs using this approach, although no descendants of these parent designs ended up having binding activity.
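The guide-residue sampling used for guided inpainting (drawing each guide position from within 2 Å of its manually chosen location) might be sketched as follows. Interpreting "uniform random sphere" as uniform sampling over the volume of the ball, rather than over its surface, is an assumption, as are the function name and the example coordinates.

```python
import math
import random


def sample_in_sphere(center, radius=2.0, rng=random):
    """Draw a point uniformly from the ball of `radius` around `center`."""
    while True:
        # Rejection sampling: draw from the bounding cube, keep points in the ball.
        offset = [rng.uniform(-radius, radius) for _ in range(3)]
        if sum(d * d for d in offset) <= radius * radius:
            return tuple(c + d for c, d in zip(center, offset))


rng = random.Random(0)
guide_position = (12.0, -3.5, 7.25)  # made-up coordinates, in Angstrom
samples = [sample_in_sphere(guide_position, 2.0, rng) for _ in range(100)]
max_offset = max(math.dist(s, guide_position) for s in samples)
```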
After initial design runs, designs with pLDDT>80 and inter-chain PAE<10 were refined using RFjoint to (1) “resample” the protein by randomly re-inpainting a fraction of the residues, (2) redesign only the sequence (keeping structure) of hydrophobic surface/boundary residues or (3) change the order in which elements of the protein appear in primary sequence while keeping the overall fold of the protein (“relooping”). Combinations of (1), (2) and (3) were used for exploring near the topology proposed by inpainting initially, as well as optimizing a design for low net charge and low SAP score. We generated a total of 2,025 refinements off of the initial “free inpainting” set and 415 refinements off of the initial “guided inpainting” set, using the AF2 predictions of designs as input backbones for refinement over a maximum of 3 rounds of filtering (pLDDT>80, inter-chain PAE<10) and refinement. The final designs for experimental characterization were redundancy-reduced by mmseqs2 at 90% identity cutoff, and then filtered by Rosetta™ DDG<−30, SAP score<40, net charge<−4, AF2 inter-PAE<10, and AF2 pLDDT>80. This final filtering yielded the pool of 31 tested sequences, one of which bound PD-L1 (
We began the design process by aligning the structure of the TrkA minibinder bound to a single domain of TrkA (PDBID: 7N3T) to the complex of TrkA with its native ligand, nerve growth factor (PDBID: 2IFG). Having obtained the relative positions of the two minibinders in a signaling-competent TrkA arrangement, we defined the functional motif as residues 5-18 on each of the minibinder chains. We carried out 600 steps of gradient descent with the usual motif and hallucination losses, forcing the native identity on motif residues 5, 6, 9, 10, 12, 13, 14, 16, 17 and 18 from both minibinder chains. To avoid clashes with TrkA, we applied a repulsive loss against the coordinates from the appropriately aligned TrkA structure (σ=3.5 Å, weight=5). Because many of the residues in either of the two motif segments were further from each other than the 20 Å distogram horizon, we also found it necessary to apply a coordinate RMSD loss (weight=1), which has no such distance maximum, to encourage the two motifs to have the correct orientation to each other. The resulting 380 designs were filtered (cce loss<1.0, coordinate RMSD loss<1.5 Å and entropy loss<2.0) down to 9 seed designs. After manual inspection for designs with well-packed secondary structure elements and minimal loops, we chose to diversify one design, an elongated three-helix bundle.
To diversify the seed designs, we used inpainting to change the length and position of the two loop regions connecting each helix. First, we made 20 “jittered” structures by adding Gaussian noise ~N(0,1) to “guide points” two residues inside each loop region. (Since inpainting is deterministic, this approach allowed us to sample different inpainting solutions for loops of the same length.) For each jittered structure, we inpainted the loops while varying their lengths between −3 and +7 residues of the original length, generating 1280 designs. After filtering for well folded designs (AF pLDDT>80) that interact with TrkA (inter-PAE<10 Å for at least one binding site), one design remained. This design and derivative mutants were assayed for TrkA binding by biolayer interferometry.
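The jittering step described above can be illustrated with the short sketch below. It assumes that independent N(0,1) noise is added to each Cartesian coordinate of a guide point, and that loop lengths are never allowed to drop below one residue; the function names and coordinates are hypothetical, and the text does not specify which combinations of loop lengths produced the 1280 designs.

```python
import random


def jitter(point, rng, sigma=1.0):
    """Add independent Gaussian noise N(0, sigma^2) to each coordinate."""
    return tuple(x + rng.gauss(0.0, sigma) for x in point)


def loop_length_options(original_length, shorter=3, longer=7, minimum=1):
    """Loop lengths from original-3 to original+7, never below `minimum`."""
    lo = max(minimum, original_length - shorter)
    return list(range(lo, original_length + longer + 1))


rng = random.Random(0)
guide_points = [(4.0, 1.5, -2.0), (-6.2, 3.3, 8.1)]  # made-up coordinates
# Twenty jittered copies of the guide points, one per jittered structure.
jittered_sets = [[jitter(p, rng) for p in guide_points] for _ in range(20)]
```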
This application claims priority to U.S. Provisional Patent Application Ser. No. 63/366,982 filed Jun. 24, 2022, incorporated by reference herein in its entirety.
This invention was made with government support under Grant Nos. FA8750-17-C-0219 and HR0011-21-2-0012, awarded by the Defense Advanced Research Projects Agency and Grant No. R01CA240339, awarded by the National Cancer Institute and Grant No. HHSN272201700059C, awarded by the National Institute of Allergy & Infectious Diseases and Grant No. 5U19AG065156, awarded by the National Institute on Aging. The government has certain rights in the invention.
Number | Date | Country
---|---|---
63366982 | Jun 2022 | US