A computer readable form of the Sequence Listing is filed with this application by electronic submission and is incorporated into this application by reference in its entirety. The Sequence Listing is contained in the file created on Jul. 4, 2022 having the file name “21-0753-WO_SeqList.hml” and is 1,981 kb in size.
Protein interactions play critical roles in biology, and general approaches to disrupt or modulate these with designed proteins would have a huge impact. While empirical laboratory selection approaches starting from very large antibody, DARPin, or other protein scaffold libraries can generate binders to protein targets, it is difficult at the outset to target a specific region on a target protein surface, and to sample the full space of possible binding modes. Computational methods can target specific target surface locations and provide a more principled and potentially much faster approach to binder generation than random library selection methods, as well as insight into the fundamental properties of protein interfaces (which must be understood for design to be successful). Most current methods for designing proteins to bind to a target surface utilize information derived from native complex structures on specific sidechain interactions or protein backbone placements optimal for binding. For many target proteins, there are no obvious pockets or clefts on the protein surface into which a small number of privileged sidechains can be placed, and guidance by a small number of hotspot residues limits the approach to a small fraction of possible interaction modes.
In one aspect, the disclosure provides polypeptides comprising an amino acid sequence at least 35%, 40%, 45%, 50%, 55%, 60%, 65%, 70%, 75%, 80%, 85%, 90%, 91%, 92%, 93%, 94%, 95%, 96%, 97%, 98%, 99%, or 100% identical to the amino acid sequence selected from the group consisting of SEQ ID NO:1-1559 and 1561-1570, not including any functional domains added fused to the polypeptides (whether N-terminal, C-terminal, or internal), and wherein the 1, 2, 3, 4, or 5 N-terminal and/or C-terminal amino acid residues may be present or absent when considering the percent identity. In another embodiment, substitutions relative to the reference polypeptide are selected from the residues listed as “best” or “tolerable” at each position immediately below the reference polypeptide listed in Tables 13A-13HHH. In a further embodiment, substitutions relative to the reference polypeptide are selected from the residues listed as “best” at each position immediately below the reference polypeptide listed in Tables 13A-13HHH. In one embodiment, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, or all interface residues are as defined in the reference polypeptide listed in Tables 13A-13HHH. In another embodiment, protein core residues listed in Tables 13A-13HHH are substituted relative to the reference polypeptide only with conservative amino acid substitutions. In a further embodiment, insertion of amino acid residues relative to the reference polypeptide occurs at a residue indicated in the “loop/insertion” column of Tables 13A-13HHH. In another embodiment, 1, 2, 3, 4, or 5 N-terminal and/or C-terminal amino acid residues are not included when determining the percent identity relative to the reference polypeptide. In one embodiment, all residues are included when determining the percent identity relative to the reference polypeptide.
In another embodiment, the disclosure provides fusion proteins comprising the polypeptide of any embodiment disclosed herein fused to a functional polypeptide. In a further embodiment, the disclosure provides fusion proteins comprising two or more copies of the polypeptide of any embodiment disclosed herein. In one embodiment, the two or more copies of the polypeptide are identical; in another embodiment, the two or more copies of the polypeptide are not all identical.
In further embodiments, the disclosure provides scaffolds comprising 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, or more copies of the polypeptide or fusion protein of any embodiment disclosed herein; nucleic acids encoding the polypeptide or fusion protein of any embodiment disclosed herein; expression vectors comprising the nucleic acid of any embodiment disclosed herein operatively linked to a suitable control sequence; host cells comprising the polypeptide, fusion protein, scaffold, nucleic acid, and/or expression vector of any embodiment disclosed herein; pharmaceutical compositions comprising: (a) the polypeptide, fusion protein, scaffold, nucleic acid, expression vector, and/or host cell of any embodiment disclosed herein; and (b) a pharmaceutically acceptable carrier; and uses of or methods for using the polypeptide, fusion protein, scaffold, nucleic acid, expression vector, host cell, and/or pharmaceutical composition of any embodiment disclosed herein for any suitable use as disclosed herein.
All references cited are herein incorporated by reference in their entirety. Within this application, unless otherwise stated, the techniques utilized may be found in any of several well-known references such as: Molecular Cloning: A Laboratory Manual (Sambrook, et al., 1989, Cold Spring Harbor Laboratory Press), Gene Expression Technology (Methods in Enzymology, Vol. 185, edited by D. Goeddel, 1991. Academic Press, San Diego, CA), “Guide to Protein Purification” in Methods in Enzymology (M. P. Deutscher, ed., (1990) Academic Press, Inc.); PCR Protocols: A Guide to Methods and Applications (Innis, et al. 1990. Academic Press, San Diego, CA), Culture of Animal Cells: A Manual of Basic Technique, 2nd Ed. (R. I. Freshney. 1987. Liss, Inc. New York, NY), Gene Transfer and Expression Protocols (pp. 109-128, ed. E. J. Murray, The Humana Press Inc., Clifton, N.J.), and the Ambion 1998 Catalog (Ambion, Austin, TX).
As used herein, the singular forms “a”, “an” and “the” include plural referents unless the context clearly dictates otherwise.
As used herein, the amino acid residues are abbreviated as follows: alanine (Ala; A), asparagine (Asn; N), aspartic acid (Asp; D), arginine (Arg; R), cysteine (Cys; C), glutamic acid (Glu; E), glutamine (Gln; Q), glycine (Gly; G), histidine (His; H), isoleucine (Ile; I), leucine (Leu; L), lysine (Lys; K), methionine (Met; M), phenylalanine (Phe; F), proline (Pro; P), serine (Ser; S), threonine (Thr; T), tryptophan (Trp; W), tyrosine (Tyr; Y), and valine (Val; V).
In all embodiments of polypeptides disclosed herein, any N-terminal methionine residues are optional (i.e.: the N-terminal methionine residue may be present or may be deleted).
All embodiments of any aspect of the disclosure can be used in combination, unless the context clearly dictates otherwise.
Unless the context clearly requires otherwise, throughout the description and the claims, the words ‘comprise’, ‘comprising’, and the like are to be construed in an inclusive sense as opposed to an exclusive or exhaustive sense; that is to say, in the sense of “including, but not limited to”. Words using the singular or plural number also include the plural and singular number, respectively. Additionally, the words “wherein,” “above,” and “below” and words of similar import, when used in this application, shall refer to this application as a whole and not to any particular portions of the application.
In a first aspect, the disclosure provides polypeptides comprising an amino acid sequence at least 35%, 40%, 45%, 50%, 55%, 60%, 65%, 70%, 75%, 80%, 85%, 90%, 91%, 92%, 93%, 94%, 95%, 96%, 97%, 98%, 99%, or 100% identical to the amino acid sequence of any one of SEQ ID NO: 1-1559, not including any functional domains added fused to the polypeptides (whether N-terminal, C-terminal, or internal), and wherein the 1, 2, 3, 4, or 5 N-terminal and/or C-terminal amino acid residues may be present or absent when considering the percent identity.
The reference polypeptide sequences are provided in Tables 1-12. The polypeptides of the disclosure bind specifically to a defined protein target, including binding proteins for a diverse set of different protein targets, as detailed herein. Biophysical characterization demonstrates that exemplary binders tested are hyperstable and bind their targets with nanomolar to picomolar affinities.
A number of the polypeptides were subjected to site-saturation mutagenesis (SSM), as described in the examples, to demonstrate the ability to make significant changes to the primary amino acid sequence while maintaining activity of the polypeptide. In one embodiment, substitutions relative to the reference polypeptide are selected from the residues listed as “best” or “tolerable” at each position immediately below the reference polypeptide listed in Tables 13A-13HHH.
In this embodiment, for example, where the reference polypeptide is Cd3_mb1 (SEQ ID NO: 759), residue 1 may be N, K, or V; residue 2 may be E, R, K, Q, H, S, W, V, T, M, A, I, Y, F, or C; and so forth.
In another embodiment, the substitutions relative to the reference polypeptide are selected from the residues listed as “best” at each position immediately below the reference polypeptide in Tables 13A-13HHH. In this embodiment, for example, where the reference polypeptide is Cd3_mb1 (SEQ ID NO: 759), residue 1 may be N or K; residue 2 may be E, R, or K; and so forth, as detailed in Tables 13A-13HHH.
In another embodiment, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, or all interface residues are as defined in the reference polypeptide in Tables 13A-13HHH (i.e.: position is denoted with an “X” in the “at interface” column). In this embodiment, for example, where the reference polypeptide is Cd3_mb1 (SEQ ID NO:759), residues 2, 4, 6, 26, 28, 30, 34, 36, 38, 40, 57, 59, and 61 are interface residues, as detailed in Table 13A.
In a further embodiment, protein core residues (core residue positions denoted with an “X” in the “protein core” column in Tables 13A-13HHH) are substituted relative to the reference polypeptide only with conservative amino acid substitutions. In this embodiment, for example, where the reference polypeptide is Cd3_mb1 (SEQ ID NO:759), residues 3, 5, 7, 12, 16, 20, 22, 25, 37, 39, 42, 47, 50, 54, 56, and 58 are core residues, as detailed in Table 13A.
In one embodiment, insertion of amino acid residues relative to the reference polypeptide occurs at a residue indicated in the column “loop/insertion” (i.e.: residues denoted with an “X” in the “loop/insertion” column of Tables 13A-13HHH). In this embodiment, for example, where the reference polypeptide is Cd3_mb1 (SEQ ID NO:759), residues 8, 9, 20-23, 31-33, 41-43, and 55-56 are loop/insertion residues, as detailed in Table 13A.
The polypeptides may incorporate any insertion relative to the reference polypeptide (i.e.: additional amino acids inserted into the sequence). In one embodiment, the insertions are made at loop regions in the polypeptides (as noted in the column “loop/insertion”). The insertion may be a single amino acid, a large functional domain, or any other amino acid insertion as suitable for an intended purpose.
Tables 13A-13HHH provide details on interface, core, and loop residues and “best” and “tolerable” amino acid substitutions relative to specific binding proteins shown in Tables 1-12.
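As a non-limiting illustration of how the “best” and “tolerable” listings might be applied, the following minimal Python sketch checks a candidate sequence against per-position allowed residues. Only the two Cd3_mb1 positions quoted above are encoded; the dictionary layout and the function name `disallowed_positions` are hypothetical and are not part of Tables 13A-13HHH.

```python
# Allowed residues for positions 1 and 2 of Cd3_mb1, as quoted in the text.
# The dict-of-sets layout is an assumption made for illustration only.
BEST = {1: set("NK"), 2: set("ERK")}
BEST_OR_TOLERABLE = {1: set("NKV"), 2: set("ERKQHSWVTMAIYFC")}


def disallowed_positions(variant, allowed):
    """Return the 1-based tabulated positions whose residue is not in the allowed set."""
    return [
        pos
        for pos, residues in sorted(allowed.items())
        if pos <= len(variant) and variant[pos - 1] not in residues
    ]


# A variant starting "VQ" is acceptable under "best or tolerable"
# but falls outside the "best" listing at both positions.
print(disallowed_positions("VQ", BEST_OR_TOLERABLE))
print(disallowed_positions("VQ", BEST))
```

Positions absent from the dictionary are simply not checked in this sketch; a complete check would encode every row of the relevant table.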
In one embodiment, amino acid substitutions relative to the reference polypeptide are conservative amino acid substitutions. As used herein, “conservative amino acid substitution” means a given amino acid can be replaced by a residue having similar physicochemical characteristics, e.g., substituting one aliphatic residue for another (such as Ile, Val, Leu, or Ala for one another), or substitution of one polar residue for another (such as between Lys and Arg, Glu and Asp, or Gln and Asn). Other such conservative substitutions, e.g., substitutions of entire regions having similar hydrophobicity characteristics, are known. Polypeptides comprising conservative amino acid substitutions can be tested in any one of the assays described herein to confirm that a desired activity, e.g., antigen-binding activity and specificity of a native or reference polypeptide, is retained. Amino acids can be grouped according to similarities in the properties of their side chains (A. L. Lehninger, in Biochemistry, second ed., pp. 73-75, Worth Publishers, New York (1975)): (1) non-polar: Ala (A), Val (V), Leu (L), Ile (I), Pro (P), Phe (F), Trp (W), Met (M); (2) uncharged polar: Gly (G), Ser (S), Thr (T), Cys (C), Tyr (Y), Asn (N), Gln (Q); (3) acidic: Asp (D), Glu (E); (4) basic: Lys (K), Arg (R), His (H). Alternatively, naturally occurring residues can be divided into groups based on common side-chain properties: (1) hydrophobic: Norleucine, Met, Ala, Val, Leu, Ile; (2) neutral hydrophilic: Cys, Ser, Thr, Asn, Gln; (3) acidic: Asp, Glu; (4) basic: His, Lys, Arg; (5) residues that influence chain orientation: Gly, Pro; (6) aromatic: Trp, Tyr, Phe. Non-conservative substitutions will entail exchanging a member of one of these classes for another class.
Particular conservative substitutions include, for example: Ala into Gly or into Ser; Arg into Lys; Asn into Gln or into His; Asp into Glu; Cys into Ser; Gln into Asn; Glu into Asp; Gly into Ala or into Pro; His into Asn or into Gln; Ile into Leu or into Val; Leu into Ile or into Val; Lys into Arg, into Gln or into Glu; Met into Leu, into Tyr or into Ile; Phe into Met, into Leu or into Tyr; Ser into Thr; Thr into Ser; Trp into Tyr; Tyr into Trp; and/or Phe into Val, into Ile or into Leu.
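By way of illustration only, the four-class side-chain grouping recited above can be encoded as a simple lookup. The sketch below treats a substitution as conservative when both residues fall in the same class; this is one possible reading, and it does not capture the particular pairwise substitutions listed above (for example, Ala into Ser crosses classes in this grouping). The names `SIDE_CHAIN_CLASSES` and `is_conservative` are hypothetical.

```python
# Four-class grouping by side-chain properties, as recited in the text.
SIDE_CHAIN_CLASSES = {
    "non-polar": set("AVLIPFWM"),
    "uncharged polar": set("GSTCYNQ"),
    "acidic": set("DE"),
    "basic": set("KRH"),
}


def side_chain_class(residue):
    """Return the class name for a one-letter amino acid code."""
    for name, members in SIDE_CHAIN_CLASSES.items():
        if residue in members:
            return name
    raise ValueError(f"unknown residue: {residue!r}")


def is_conservative(original, replacement):
    """Assumption: 'conservative' means both residues share a class."""
    return side_chain_class(original) == side_chain_class(replacement)


print(is_conservative("I", "V"))  # both non-polar
print(is_conservative("D", "K"))  # acidic versus basic
```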
In some embodiments, the percent identity of the polypeptide to the reference polypeptide does not include any functional domains added fused to the polypeptides (whether N-terminal, C-terminal, or internal), and the 1, 2, 3, 4, or 5 N-terminal and/or C-terminal amino acid residues may be present or absent when considering the percent identity.
In another embodiment, 1, 2, 3, 4, or 5 N-terminal and/or C-terminal amino acid residues are not included when determining the percent identity of the polypeptide relative to the reference polypeptide. In another embodiment, all residues are included when determining the percent identity relative to the reference polypeptide.
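The effect of excluding terminal residues from the percent identity determination can be sketched as follows. This minimal Python example assumes the two sequences have already been aligned to equal length (with “-” marking gaps) and that identity is taken over the retained alignment columns; the disclosure does not prescribe a particular alignment method or denominator, and the function name and parameters are hypothetical.

```python
def percent_identity(aligned_query, aligned_reference, trim_n=0, trim_c=0):
    """Percent identity over pre-aligned sequences, optionally ignoring
    trim_n N-terminal and trim_c C-terminal alignment columns."""
    if len(aligned_query) != len(aligned_reference):
        raise ValueError("sequences must be pre-aligned to equal length")
    end = len(aligned_reference) - trim_c
    query = aligned_query[trim_n:end]
    reference = aligned_reference[trim_n:end]
    if not reference:
        raise ValueError("nothing left to compare after trimming")
    # A gap never counts as a match, even if both sequences have one.
    matches = sum(1 for q, r in zip(query, reference) if q == r and q != "-")
    return 100.0 * matches / len(reference)


# One C-terminal mismatch: 80% with all residues, 100% when it is excluded.
print(percent_identity("MKVLA", "MKVLG"))
print(percent_identity("MKVLA", "MKVLG", trim_c=1))
```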
In another embodiment, the disclosure provides fusion proteins comprising the polypeptide of any embodiment disclosed herein fused to a functional polypeptide. Any suitable functional polypeptide may be used, including but not limited to a therapeutic polypeptide, diagnostic polypeptide, targeting polypeptide, scaffold polypeptide, or polypeptide that confers stability on the fusion protein. Such fusion proteins may be used, for example, to target the functional polypeptide to the target of the polypeptides of the disclosure in or on cells. In one embodiment, the fusion protein comprises two or more copies of the polypeptide of any embodiment of the target binding polypeptides of the disclosure. In one such embodiment, the two or more copies of the polypeptide are identical. In other embodiments, the two or more copies of the polypeptide are not all identical. In all these embodiments, the fusion protein components may be directly adjacent in the fusion protein, or may be separated by an amino acid linker of any suitable length and amino acid composition.
In another embodiment, the disclosure provides scaffolds comprising 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, or more copies of the polypeptide or fusion protein of any embodiment disclosed herein. Any suitable scaffold can be used, including but not limited to designed polypeptide scaffolds, virus-like particles, beads, etc.
In another aspect, the disclosure provides nucleic acids encoding the polypeptide or fusion protein of any embodiment or combination of embodiments of the disclosure. The nucleic acid sequence may comprise single stranded or double stranded RNA (such as an mRNA) or DNA in genomic or cDNA form, or DNA-RNA hybrids, each of which may include chemically or biochemically modified, non-natural, or derivatized nucleotide bases. Such nucleic acid sequences may comprise additional sequences useful for promoting expression and/or purification of the encoded polypeptide, including but not limited to polyA sequences, modified Kozak sequences, and sequences encoding epitope tags, export signals, secretory signals, nuclear localization signals, and plasma membrane localization signals. It will be apparent to those of skill in the art, based on the teachings herein, what nucleic acid sequences will encode the polypeptides of the disclosure.
In a further aspect, the disclosure provides expression vectors comprising the nucleic acid of any aspect of the disclosure operatively linked to a suitable control sequence. “Expression vector” includes vectors that operatively link a nucleic acid coding region or gene to any control sequences capable of effecting expression of the gene product. “Control sequences” operably linked to the nucleic acid sequences of the disclosure are nucleic acid sequences capable of effecting the expression of the nucleic acid molecules. The control sequences need not be contiguous with the nucleic acid sequences, so long as they function to direct the expression thereof. Thus, for example, intervening untranslated yet transcribed sequences can be present between a promoter sequence and the nucleic acid sequences and the promoter sequence can still be considered “operably linked” to the coding sequence. Other such control sequences include, but are not limited to, polyadenylation signals, termination signals, and ribosome binding sites. Such expression vectors can be of any type, including but not limited to plasmid and viral-based expression vectors. The control sequence used to drive expression of the disclosed nucleic acid sequences in a mammalian system may be constitutive (driven by any of a variety of promoters, including but not limited to, CMV, SV40, RSV, actin, EF) or inducible (driven by any of a number of inducible promoters including, but not limited to, tetracycline, ecdysone, steroid-responsive). The expression vector must be replicable in the host organisms either as an episome or by integration into host chromosomal DNA. In various embodiments, the expression vector may comprise a plasmid, viral-based vector, or any other suitable expression vector.
In another aspect, the disclosure provides host cells that comprise the nucleic acids, expression vectors (i.e.: episomal or chromosomally integrated), non-naturally occurring polypeptides, fusion protein, or compositions disclosed herein, wherein the host cells can be either prokaryotic or eukaryotic. The cells can be transiently or stably engineered to incorporate the nucleic acids or expression vector of the disclosure, using techniques including but not limited to bacterial transformations, calcium phosphate co-precipitation, electroporation, or liposome mediated-, DEAE dextran mediated-, polycationic mediated-, or viral mediated transfection.
In another aspect, the present disclosure provides pharmaceutical compositions, comprising one or more polypeptides, fusion proteins, compositions, nucleic acids, expression vectors, and/or host cells of the disclosure and a pharmaceutically acceptable carrier. The pharmaceutical compositions of the disclosure can be used, for example, in the methods of the disclosure described below. The pharmaceutical composition may comprise in addition to the polypeptide of the disclosure (a) a lyoprotectant; (b) a surfactant; (c) a bulking agent; (d) a tonicity adjusting agent; (e) a stabilizer; (f) a preservative and/or (g) a buffer.
In another aspect, the disclosure provides uses and methods for use of the polypeptides, fusion proteins, scaffolds, nucleic acids, expression vectors, host cells, and/or pharmaceutical compositions of the disclosure for any suitable use as disclosed herein. In one non-limiting embodiment, the polypeptides, fusion proteins, scaffolds, nucleic acids, expression vectors, host cells, and/or pharmaceutical compositions are used as a targeting moiety, to direct a “payload” to the target to which the polypeptide binds. In one embodiment, the payload may be a functional domain as described herein, and the polypeptide may be provided as a fusion protein with a polypeptide functional domain. The payload may include, but is not limited to, a detectable moiety (fluorescent protein, luminescent compound or protein, radioactive isotope, etc.), a therapeutic functional domain, and/or a diagnostic functional domain.
In another non-limiting embodiment, the polypeptides, fusion proteins, scaffolds, nucleic acids, expression vectors, host cells, and/or pharmaceutical compositions are used as a therapeutic moiety. In non-limiting embodiments, the methods may comprise treating a tumor or infection, such as a viral infection. The protein targets fall into two classes: (1) human cell surface or extracellular proteins involved in signaling, for which binders have utility as therapeutics for treating a tumor (Tropomyosin receptor kinase A (TrkA)15, Fibroblast growth factor receptor 2 (FGFR2)16, Epidermal growth factor receptor (EGFR)17, Platelet-derived growth factor receptor (PDGFR)18, Insulin receptor (InsulinR)19, Insulin-like growth factor 1 receptor (IGF1R)20, Angiopoietin-1 receptor (Tie2)21, Interleukin-7 receptor alpha (IL-7Rα)22, CD3 delta chain (CD3δ)23, Transforming growth factor beta (TGF-β)24); and (2) pathogen surface proteins for which binding proteins have therapeutic utility in treating infections (Influenza A H3 hemagglutinin (H3)25 (H3_mb series of proteins disclosed herein), VirB8-like protein from Rickettsia typhi (VirB8)26, and the SARS-CoV-2 coronavirus spike protein (LCB series of proteins)).
As used herein, “treat” or “treating” means accomplishing one or more of the following: (a) reducing the severity of the disorder; (b) limiting or preventing development of symptoms characteristic of the disorder(s) being treated; (c) inhibiting worsening of symptoms characteristic of the disorder(s) being treated; (d) limiting or preventing recurrence of the disorder(s) in patients that have previously had the disorder(s); and (e) limiting or preventing recurrence of symptoms in patients that were previously symptomatic for the disorder(s).
The subject may be any subject that has a relevant disorder. In one embodiment, the subject is a mammal, including but not limited to humans, dogs, cats, horses, cattle, etc. In one embodiment, the subject is a human subject.
In another aspect, the disclosure provides methods for designing protein-binding proteins from target structural information alone, comprising any steps or combination of steps as described in the examples that follow.
The design of proteins that bind to a specific site on the surface of a target protein using no information other than the three-dimensional structure of the target remains an outstanding challenge. We describe a general solution to this problem, which starts with a broad exploration of the very large space of possible binding modes and interactions, and then intensifies the search in the most promising regions. We demonstrate its very broad applicability by de novo design of binding proteins to 12 diverse protein targets with very different shapes and surface properties. Biophysical characterization shows that the binders, which are all smaller than 65 amino acids, are hyperstable and bind their targets with nanomolar to picomolar affinities. Crystal structures of four of the binder-target complexes were solved, and all four are very close to the corresponding computational design models (data not shown). Experimental data on nearly half a million computational designs and hundreds of thousands of point mutants provide detailed feedback on the strengths and limitations of the method and of our current understanding of protein-protein interactions, and should guide improvement of both. Our approach now enables targeted design of binders to sites of interest on a wide variety of proteins for therapeutic and diagnostic applications.
We sought to develop a general approach to design of high affinity binders to arbitrary protein targets that addresses two major challenges. First, in the general case, there are no clear sidechain interactions or secondary structure packing arrangements that can mediate strong interactions with the target; instead there are a very large number of individually very weak possible interactions. Second, the number of ways of choosing from these numerous weak interactions to incorporate into a single binding protein is combinatorially large, and any given protein backbone is unlikely to be able to simultaneously present sidechains that can encompass any preselected subset of these interactions. To motivate our approach, consider the simple analogy of a very difficult climbing wall with only a few good footholds or handholds distant from each other. Previous “hotspot” based approaches correspond to focusing on routes involving these footholds/handholds, but this greatly limits the possibilities and there may be no way to connect them into a successful route. An alternative is to, first, identify all possible handholds and footholds, no matter how poor; second, have thousands of climbers select subsets of these and try to climb the wall; third, identify those routes that were most promising; and fourth, have a second group of climbers explore them in detail. Following this analogy, we devised a multi-step approach to overcome the above two challenges by 1) enumerating a large and comprehensive set of disembodied sidechain interactions with the target surface, 2) identifying from large in silico libraries of protein backbones those that can host many of these sidechains without clashing with the target, 3) identifying recurrent backbone motifs in these structures, and 4) generating and placing against the target a second round of scaffolds containing these interacting motifs.
We began by docking disembodied amino acids against the target protein, and storing the backbone coordinates and target binding energies of the typically billions of amino acids making favorable hydrogen bonding or non-polar interactions in a 6-dimensional spatial hash table for rapid lookup.
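The idea of a 6-dimensional spatial hash can be illustrated with the following minimal Python sketch, in which a backbone placement (three Cartesian coordinates plus three orientation angles) is reduced to an integer bin key and the residues stored in that bin are retrieved in order of energy. The uniform Euler-angle binning, the bin widths, and the names `bin_key` and `RotamerHash` are assumptions made for illustration; the actual hashing scheme is not specified here, and uniform angle bins do not sample rotation space evenly.

```python
import math
from collections import defaultdict


def bin_key(position, euler_deg, cart_width=1.0, ang_width=15.0):
    """Map a 6-D rigid-body placement to an integer bin key (hypothetical scheme)."""
    cart = tuple(math.floor(v / cart_width) for v in position)
    ang = tuple(math.floor((a % 360.0) / ang_width) for a in euler_deg)
    return cart + ang


class RotamerHash:
    """Toy hash table from 6-D bins to (energy, residue) entries."""

    def __init__(self, cart_width=1.0, ang_width=15.0):
        self.cart_width = cart_width
        self.ang_width = ang_width
        self.table = defaultdict(list)

    def add(self, position, euler_deg, residue, energy):
        key = bin_key(position, euler_deg, self.cart_width, self.ang_width)
        self.table[key].append((energy, residue))

    def lookup(self, position, euler_deg):
        """Return entries in the matching bin, lowest (best) energy first."""
        key = bin_key(position, euler_deg, self.cart_width, self.ang_width)
        return sorted(self.table.get(key, []))


rif = RotamerHash()
rif.add((0.2, 0.3, 0.4), (10.0, 20.0, 30.0), "LEU", -1.5)
rif.add((0.6, 0.1, 0.9), (5.0, 25.0, 44.0), "PHE", -2.0)
print(rif.lookup((0.5, 0.5, 0.5), (1.0, 16.0, 31.0)))
```

A lookup is a single dictionary access regardless of how many placements are stored, which is the property that makes this kind of table attractive for screening many scaffold positions; the price is the loss of resolution within each bin.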
For docking against the rotamer interaction field (RIF), it is desirable to have a very large set of protein scaffold options, as the chance that any one scaffold can house many interactions is small. The structure models of these scaffolds must be quite accurate so that the positioning is correct. Using fragment assembly {Koga, 2012 #14}, piecewise fragment assembly {Linsky, 2021 #16}, and helical extension {Maguire, 2021 #15}, we designed a large set of miniproteins ranging in length from 50 to 65 amino acids containing larger hydrophobic cores than previous miniprotein scaffold libraries {Chevalier, 2017 #1}, which makes the protein more stable and more tolerant to introduction of the designed binding surfaces. 84,690 scaffolds spanning 5 different topologies with structural metrics predictive of folding were encoded in large oligonucleotide arrays and 34,507 were found to be stable using a high-throughput proteolysis based protein stability assay {Rocklin, 2017 #8}. We experimented with several approaches for docking these stable scaffolds against the target structure rotamer interaction field, balancing overall shape complementarity with maximizing specific rotamer interactions. The most robust results were obtained using direct low resolution shape matching {Schneidman-Duhovny, 2005 #9} followed by grid based refinement of the rigid body orientation in the RIF (RIFDock™). This resulted in better Rosetta™ binding energies (ddGs) and packing after sequence design (contact molecular surface, see below) than shape matching alone with PatchDock™.
Because of the loss in resolution in the hashing used to build the RIF, and the necessarily approximate accounting for interactions between sidechains (see methods), we found that evaluation of the RIF solutions is considerably enhanced by full combinatorial optimization using the Rosetta™ forcefield, allowing the target sidechains to repack and the scaffold backbone to relax. Full combinatorial sequence optimization is quite CPU intensive, however, and to enable rapid screening through millions of alternative backbone placements, we developed a rapid pre-screening method using Rosetta™ to identify promising RIF docks. We found that including only hydrophobic amino acids, using a smaller set of rotamers than in standard Rosetta™ design calculations, and a more rapidly computable energy function sped design more than 10-fold while retaining a strong correlation with results after full sequence design (next paragraph); this pre-screen (referred to as the “Predictor” below) substantially improved the binding energies and shape complementarity of the final designs as far more RIF solutions could be processed.
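The pre-screening logic described above follows a general two-stage pattern: rank every candidate with a cheap score, then spend the expensive calculation only on a shortlist. The following Python sketch shows that pattern in isolation. The scoring callables stand in for the rapid and full calculations and are purely hypothetical; nothing here reproduces the actual Rosetta™ protocol.

```python
import heapq


def two_stage_screen(docks, fast_score, full_score, keep):
    """Shortlist the `keep` docks with the lowest fast score, then rank
    only that shortlist by the expensive full score (lower is better)."""
    shortlist = heapq.nsmallest(keep, docks, key=fast_score)
    return sorted(shortlist, key=full_score)


# Toy stand-ins: integers play the role of docks, and the lambdas play
# the role of a rapid approximate energy and a slower, more accurate one.
docks = list(range(1, 11))
ranked = two_stage_screen(
    docks,
    fast_score=lambda d: abs(d - 5),
    full_score=lambda d: -d,
    keep=3,
)
print(ranked)
```

The approach is only useful when the cheap score correlates well with the expensive one, as reported above for the Predictor; otherwise good candidates are discarded before they are ever evaluated in full.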
We observed that application of standard Rosetta™ design to the set of filtered docks in some cases resulted in models with buried unsatisfied polar groups and other suboptimal properties. To overcome these limitations, we developed a combinatorial sequence design protocol that maximizes shape and chemical complementarity with the target while avoiding buried polar atoms. Sequence compatibility with the scaffold monomer structure was increased using a structure based sequence profile {Brunette, 2020 #17}; the cross-interface interactions were upweighted during the Monte Carlo-based sequence design stage to maximize the contacts between the binder and the target (see Methods for the ProteinProteinInterfaceUpweighter); and rotamers containing buried unsatisfiable polar atoms were eliminated prior to packing, with buried unsatisfied polar atoms penalized by a pair-wise decomposable pseudo-energy term {Coventry, 2021 #18}. This protocol yielded amino acid sequences more strongly predicted to fold to the designed structure and to bind the target than standard Rosetta™ interface design.
In the course of developing the overall binder design pipeline, we noticed upon inspection that even designs with favorable Rosetta™ binding free energies, large changes in Solvent Accessible Surface Area (SASA) upon binding, and high shape complementarity (SC) often lacked dense packing and interactions involving several secondary structural elements. We developed a quantitative measure of packing quality in much closer accord with visual assessment—the contact molecular surface (see methods)—which balances interface complementarity and size in a manner that explicitly penalizes poor packing. We used this metric to help select designs at both the rapid Predictor stage and after full sequence optimization (see Methods).
The space sampled by the search over structure and sequence space is enormous: tens of thousands of possible protein backbones × nearly one billion possible disembodied sidechain interactions with the target × 10^16 interface sequences per scaffold placement. Sampling of spaces of this size is necessarily incomplete, and many of the designs at this stage contained buried unsatisfied polar atoms (only rotamers that cannot make hydrogen bonds in any context are excluded at the packing stage) and cavities. To generate improved designs, we intensified the search around the best of the designed interfaces. We developed a resampling protocol which first extracts all the secondary structural motifs making good contacts with the target protein from the first “broad search” designs, clusters these motifs based on their backbone coordinates and rigid body placements, and then selects the binding motif in each cluster with the best per-position weighted Rosetta™ binding energy; around 2,000 motifs were selected for each target. These motifs, which are privileged because they contain a much greater density of favorable side chain interactions with the target than the rest of the designs, were then used to guide another round of docking and design. Scaffolds from the library were superimposed on the privileged motifs, the favorable-interacting motif residues transferred to the scaffold, and the remainder of the scaffold sequence optimized to make further interactions with the target, allowing backbone flexibility to increase shape complementarity with the target.
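The cluster-then-select step of the resampling protocol can be sketched as follows. This minimal Python example uses a greedy leader clustering on coordinate RMSD computed in the frame of the target (no superposition, so that rigid body placement is part of the comparison) and keeps the lowest-energy motif of each cluster. The clustering algorithm, the radius, the plain energy value used in place of a per-position weighted binding energy, and all names are assumptions, not the protocol actually used.

```python
import math


def rmsd(coords_a, coords_b):
    """RMSD between two equal-length coordinate lists, without superposition."""
    squared = sum(
        (p - q) ** 2
        for atom_a, atom_b in zip(coords_a, coords_b)
        for p, q in zip(atom_a, atom_b)
    )
    return math.sqrt(squared / len(coords_a))


def best_motif_per_cluster(motifs, radius):
    """motifs: list of (name, coords, energy). Visiting motifs from lowest
    to highest energy makes each cluster leader its lowest-energy member."""
    leaders = []
    for motif in sorted(motifs, key=lambda m: m[2]):
        if all(rmsd(motif[1], leader[1]) > radius for leader in leaders):
            leaders.append(motif)
    return [leader[0] for leader in leaders]


motifs = [
    ("a", [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)], -3.0),
    ("b", [(0.1, 0.0, 0.0), (1.1, 0.0, 0.0)], -5.0),
    ("c", [(10.0, 0.0, 0.0), (11.0, 0.0, 0.0)], -1.0),
]
print(best_motif_per_cluster(motifs, radius=0.5))
```

In this toy input, motifs "a" and "b" occupy nearly the same placement, so only the lower-energy "b" is retained alongside the distant motif "c".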
Previous protein binder design approaches have been tested on only one or two targets, which limits assessment of their generality. To robustly test our new binder design pipeline, we selected thirteen native proteins of considerable current interest spanning a wide range of shapes and biological functions. These proteins fall into two classes: first, human cell surface or extracellular proteins involved in signaling, for which binders could have utility as probes of biological mechanism and potentially as therapeutics (Tropomyosin receptor kinase A (TrkA) {Wiesmann, 1999 #30}, Fibroblast growth factor receptor 2 (FGFR2) {Plotnikov, 2000 #31}, Epidermal growth factor receptor (EGFR) {Garrett, 2002 #21}, Platelet-derived growth factor receptor (PDGFR) {Shim, 2010 #22}, Insulin receptor (InsulinR) {Croll, 2016 #23}, Insulin-like growth factor 1 receptor (IGF1R) {Xu, 2018 #24}, Angiopoietin-1 receptor (Tie2) {Barton, 2006 #25}, Interleukin-7 receptor alpha (IL-7Rα) {McElroy, 2009 #26}, CD3 delta chain (CD3δ) {Arnett, 2004 #27}, Transforming growth factor beta (TGF-β) {Radaev, 2010 #28}); and second, pathogen surface proteins for which binding proteins could have therapeutic utility (Influenza A H3 hemagglutinin (H3) {Ekiert, 2012 #20}, VirB8-like protein from Rickettsia typhi (VirB8) {Gillespie, 2015 #29}, and the SARS-CoV-2 coronavirus spike protein) (
Using the above protocol, we designed 15,000-100,000 binders for each of thirteen target sites on the twelve native proteins (see Methods; we chose two sites on the EGF receptor). Synthetic oligonucleotides (230 bp) encoding the 50-65 residue designs were cloned into a yeast surface expression vector, the designs were displayed on the surface of yeast, and those that bound their target were enriched by several rounds of fluorescence-activated cell sorting using fluorescently labelled target proteins. The starting and enriched populations were deep sequenced, and the fraction of each design after each sort was determined by comparing the frequency of the design in the parent and child pools. From multiple sorts at different target protein concentrations, we determined, as a proxy for binding Kd values, the midpoint concentration (SC50) in the binding transitions for each design in the library (Table 14 and Methods).
(a) SSM sorts used to estimate the number of binders.
To assess whether the top enriched designs for each target fold and bind as in the corresponding computational design models, and to investigate the sequence dependence of folding and binding, we generated high resolution footprints of the binding surface by sorting site saturation mutagenesis libraries (SSMs) in which every residue was substituted with each of the 20 amino acids one at a time. For the majority, but not all, enriched designs, substitutions at the binding interface and in the protein core were less tolerated than substitutions at non-interface surface positions (
We expressed the highest affinity combinatorially-optimized binders for each target in E. coli for more detailed structural and functional characterization. All of the designs were in the soluble fraction, and could be readily purified by nickel-NTA chromatography. All had circular dichroism spectra consistent with the design model, and most (9 out of 13) were stable at 95° C. (
The receptor tyrosine kinases TrkA, FGFR2, PDGFR, EGFR, InsulinR, IGF1R and Tie2 are key regulators of cellular processes and are involved in the development and progression of many types of cancer {Lemmon, 2010 #32}. We designed binders targeting the native ligand binding sites for PDGFR, EGFR (on both domain I and domain III; these binders are referred to as EGFRn_mb and EGFRc_mb, respectively), InsulinR, IGF1R and Tie2, and targeting surface regions proximal to the native ligand binding sites for TrkA and FGFR2 (
Hemagglutinin (HA) is the main target for influenza A virus vaccine and drug development, and it can be genetically classified into two main subgroups, group 1 and group 2 {Webster, 1992 #83; Nobusawa, 1991 #84}. The HA stem region is an attractive therapeutic epitope, as it is highly conserved across all the influenza A subtypes and targeting this region can block the low pH-induced conformational rearrangements associated with membrane fusion, which is vital to virus infection {Bullough, 1994 #85; Ekiert, 2009 #82}. Protein {Fleishman, 2011 #3; Chevalier, 2017 #1}, peptide {Kadam, 2017 #33} and small molecule inhibitors {van Dongen, 2019 #34} have been designed to bind to the stem region of group 1 HA to neutralize influenza A viruses, but none of them is able to recognize group 2 HA. Neutralizing antibodies targeting the stem region of group 2 HA have been identified through screening of large B-cell libraries after vaccination or infection, and some of them showed broad specificity and neutralized both group 1 and group 2 influenza A viruses {Corti, 2011 #35; Joyce, 2016 #86}. However, rational design of group 2 HA stem region binders remains a longstanding challenge, let alone the de novo design of pan-specific HA stem region binders that can bind both group 1 HA and group 2 HA. The challenge is mainly due to three differences between group 1 HA and group 2 HA: the group 2 HA stem region contains more polar residues and is more hydrophilic; in group 2 HA, Trp21 adopts a configuration roughly perpendicular to the surface of the targeting groove, which makes the targeting groove much shallower and less hydrophobic; and group 2 HA is glycosylated at Asn38, and the carbohydrate side chains cover the hydrophobic groove and protect the HA stem region from binding by antibodies or designed binders.
We applied our new method to design binders to H3 HA (A/Hong Kong/1/1968), the main pandemic subtype of group 2 influenza virus, and obtained a binder with an affinity of 320 nM to the wild type H3 (
With the outbreak of the SARS-CoV-2 coronavirus pandemic, we applied our method to design miniproteins targeting the receptor binding domain of the SARS-CoV-2 Spike protein near the ACE2 binding site to block receptor engagement. Due to the pressing need for coronavirus therapeutics, we recently described the results of these efforts {Cao, 2020 #2} ahead of those described in this manuscript. As in the case of FGFR2, IL-7Rα and VirB8, the method yielded picomolar binders, which are among the most potent compounds known to inhibit the virus in cell culture (IC50 0.15 ng/ml), and subsequent animal experiments have shown that they provide potent protection against the virus in vivo {Case, 2021 #43}. The modular nature of the miniprotein binders enables their rapid integration into designed diagnostic biosensors, which we have demonstrated for both influenza and SARS-CoV-2 binders {Quijano-Rubio, 2021 #69}.
The designed binding proteins are all very small proteins (<65 amino acids), and many are 3-helix bundles. To evaluate their target specificity, we tested the highest affinity binder to each target for binding to all other targets. There was very little cross reactivity (
Structural validation
There is a clear reorientation of the oligosaccharide at Asn38 compared with the unbound HK68/H3 structure, which has been seen in HK68/H3 structures bound with stem region neutralizing antibodies {Corti, 2011 #35; Joyce, 2016 #86}, and this explains the different binding affinities of the H3 binder with the wild type H3 HA (A/Hong Kong/1/1968) and the deglycosylated H3 variant (N38D) in the BLI assay (
Our de novo binder design method and the large data set (810,000 binder designs and 240,000 single mutants) generated here provide a starting point for investigating the fundamental physical chemistry of protein-protein interactions, and for developing and assessing computational models of protein-protein interactions. Across all targets, there was a strong correlation between success rate and the hydrophobicity of the targeted region (
Our success in designing nM affinity binders for 14 target sites demonstrates that binding proteins can be designed de novo using only information on the structure of the target protein, without need for prior information on binding hotspots or fragments from structures of complexes with binding partners. The success also suggests that our design pipeline provides a quite general solution to the de novo protein interface design problem that goes far beyond previously described methods. This work is a major step forward towards the longer range goal of direct computational design of high affinity binders starting from structural information alone. We expect the binders created here, and new ones created with the method moving forward, will find wide utility as signaling pathway antagonists as monomeric proteins and as tunable agonists when rigidly scaffolded in multimeric formats, and in diagnostics and therapeutics for pathogenic disease. More generally, the ability to rapidly and robustly design high affinity binders to arbitrary protein targets could transform the many areas of biotechnology and medicine that rely on affinity reagents.
The crystal structures of HA (PDB: 4FNK) {Ekiert, 2012 #20}, EGFR (PDB: 1MOX, 4UV7) {Garrett, 2002 #21; Lim, 2016 #36}, PDGFR (PDB: 3MJG) {Shim, 2010 #22}, IR (PDB: 4ZXB) {Croll, 2016 #23}, IGF1R (PDB: 5U8R) {Xu, 2018 #24}, Tie2 (PDB: 2GY7) {Barton, 2006 #25}, IL-7Rα (PDB: 3DI3) {McElroy, 2009 #26}, CD3 (PDB: 1XIW) {Arnett, 2004 #27}, TGF-β (PDB: 3KFD) {Radaev, 2010 #28} and VirB8 (PDB: 4O3V) {Gillespie, 2015 #29} were refined in the Rosetta™ energy field constrained by experimental diffraction data. The crystal structures of TrkA (PDB: 1WWW) {Wiesmann, 1999 #30} and FGFR2 (PDB: 1EV2) {Plotnikov, 2000 #31} were refined with the Rosetta™ FastRelax protocol with coordinate constraints. The targeting chain or the selected targeting region was extracted and used as the starting point for docking and design. To run PatchDock™ {Schneidman-Duhovny, 2005 #9}, the scaffolds were first mutated to poly-valine and default parameters were used to generate the raw docks. Rifdock™ was used to generate the rotamer interaction field by docking billions of individual disembodied amino acids to the selected targeting regions {Dou, 2018 #7}. In detail, hydrophobic sidechain R-groups are docked against the target using a branch-and-bound search to quickly identify favorable interactions with the target, and polar sidechain R-groups are enumeratively sampled around every target hydrogen-bond donor or acceptor. To identify backbone placements from which these interactions can be made, side chain rotamer conformations are grown backwards for all R-group placements, and their backbone coordinates are stored in a 6-dimensional spatial hash table for rapid lookup. For the hierarchical searching protocol, the miniprotein scaffold library (50-65 residues in length) was docked into the field of the inverse rotamers using a branch-and-bound searching algorithm from low resolution spatial grids to high resolution spatial grids.
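The idea of storing backbone placements in a 6-dimensional spatial hash for rapid lookup can be illustrated with a minimal sketch. This is not the Rifdock™ implementation: the pose representation (x, y, z plus three Euler angles), the bin widths, and the names `pose_key` and `PoseHash` are all assumptions made for illustration, and naive Euler-angle binning is a much cruder discretization of orientation than a production docking code would use.

```python
from collections import defaultdict
from math import floor


def pose_key(pose, xyz_step=1.0, angle_step=15.0):
    """Quantize a 6-D pose into an integer grid cell.

    pose: (x, y, z, a, b, c) with translations in angstroms and three
    Euler angles in degrees (hypothetical representation).
    """
    x, y, z, a, b, c = pose
    return (
        floor(x / xyz_step),
        floor(y / xyz_step),
        floor(z / xyz_step),
        floor((a % 360.0) / angle_step),
        floor((b % 360.0) / angle_step),
        floor((c % 360.0) / angle_step),
    )


class PoseHash:
    """Maps quantized backbone placements to the rotamers grown from them."""

    def __init__(self, xyz_step=1.0, angle_step=15.0):
        self.xyz_step = xyz_step
        self.angle_step = angle_step
        self.cells = defaultdict(list)

    def add(self, pose, rotamer_id):
        key = pose_key(pose, self.xyz_step, self.angle_step)
        self.cells[key].append(rotamer_id)

    def lookup(self, pose):
        # dict.get avoids creating empty cells on a miss
        key = pose_key(pose, self.xyz_step, self.angle_step)
        return self.cells.get(key, [])


table = PoseHash()
table.add((1.2, 3.4, 5.6, 10.0, 20.0, 30.0), "LEU_rot_17")
nearby = table.lookup((1.9, 3.1, 5.0, 14.0, 29.0, 44.0))    # same cell
far_away = table.lookup((2.1, 3.1, 5.0, 14.0, 29.0, 44.0))  # x falls in the next cell
```

A scaffold residue whose backbone frame falls in an occupied cell can then retrieve the stored interacting rotamers in constant time, which is the property the lookup step relies on.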
For the PatchDock™+Rifdock™ protocols, the PatchDock™ outputs were used as seeds for the initial positioning of the scaffolds and the docks were further refined in the finest resolution rotamer interaction field. These docked conformations were further optimized to generate shape and chemically complementary interfaces using the Rosetta™ FastDesign protocol, alternating between side-chain rotamer optimization and gradient-descent-based energy minimization. Several improvements were added to the sequence design protocol to generate better sequences for both folding and binding. These include a better repulsive energy ramping strategy {Maguire, 2021 #15}, upweighting cross-interface energies, a pseudo-energy term penalizing buried unsatisfied polar atoms {Coventry, 2021 #18} and a sequence profile constraint based on native protein fragments {Brunette, 2020 #17}. Computational metrics of the final design models, including ddG, shape complementarity, interface buried solvent accessible surface area and contact molecular surface, were calculated using Rosetta™ for design selection. All the scripts and flag files to run the programs are in the Supplementary file.
The binding energy and interface metrics for all the continuous secondary structure motifs (helix, strand and loop) were calculated for the designs generated in the broad search stage. The motifs with good interactions with the target (based on binding energy and other interface metrics, such as SASA and contact molecular surface) were extracted and aligned using the target structure as the reference. All the motifs were then clustered using an energy-based, TMalign™-like clustering algorithm. Briefly, all the motifs were sorted based on the interaction energy with the target, and the lowest-energy motif in the unclustered pool was selected as the center of the first cluster. A similarity score between this motif and every motif remaining in the unclustered pool was calculated based on the TMalign™ algorithm {Zhang, 2005 #37} without any further superimposition. Those motifs within a threshold similarity score (default 0.7) of the current cluster center were removed from the unclustered pool and added to the new cluster. The lowest-energy motif remaining in the unclustered pool was selected as the center of the next cluster, and the second step was repeated. This process continued for subsequent clusters until no motifs remained in the unclustered pool. The best motif from each cluster was then selected based on the per-position weighted Rosetta™ binding energy, using the average energy across all the aligned motifs at each position as the weight. Around 2,000 best motifs were selected and the scaffold library was superimposed onto these motifs using the MotifGraft mover {Silva, 2016 #38}. Interface sequences were further optimized and computational metrics were computed for the final optimized designs as described in the broad search stage. CPU-time requirements to produce 100,000 designed binders to be tested experimentally were typically around 100,000 CPU-hours (usually at least 10× as many binders were computationally designed as were ordered).
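The greedy, energy-ordered clustering loop described above can be sketched as follows. The function name `greedy_cluster`, the tuple layout of the toy motifs, and the one-dimensional stand-in similarity are assumptions for illustration; in the actual protocol the similarity is a TMalign™-style score computed without re-superimposition, and the representative of each cluster is chosen afterwards by the per-position weighted binding energy, which this sketch does not implement.

```python
def greedy_cluster(motifs, energy, similarity, threshold=0.7):
    """Greedy clustering: the lowest-energy unclustered motif seeds each cluster.

    motifs     : list of arbitrary motif objects
    energy     : function motif -> interaction energy (lower is better)
    similarity : function (center, motif) -> score, higher means more similar
    threshold  : motifs scoring at or above this join the current cluster
    """
    pool = sorted(motifs, key=energy)  # lowest energy first
    clusters = []
    while pool:
        center, rest = pool[0], pool[1:]
        members, remaining = [center], []
        for motif in rest:
            if similarity(center, motif) >= threshold:
                members.append(motif)
            else:
                remaining.append(motif)  # stays in the unclustered pool
        clusters.append(members)
        pool = remaining
    return clusters


# Toy motifs: (name, energy, 1-D position standing in for a rigid-body placement).
toy = [("a", -10.0, 0.0), ("b", -8.0, 0.2), ("c", -9.0, 5.0), ("d", -3.0, 5.1)]


def toy_similarity(m1, m2):
    # Hypothetical similarity in (0, 1]; identical placements score 1.0.
    return 1.0 / (1.0 + abs(m1[2] - m2[2]))


clusters = greedy_cluster(toy, energy=lambda m: m[1], similarity=toy_similarity)
cluster_names = [[m[0] for m in members] for members in clusters]
```

Because the pool stays sorted by energy, the first element of every cluster is its lowest-energy member, matching the description of how cluster centers are chosen.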
A severe speed mismatch exists between the docking methods (RifDock™ and Focused search) and the subsequent full sequence design step. While the docking methods can typically produce outputs every 1 to 3 seconds, the full sequence design can take upwards of 4 minutes. To remedy this situation, a step was designed to take about 20 seconds that would be more predictive than metrics evaluated on raw docks, but faster than the full sequence design.
A stripped down version of the Rosetta™ beta_nov16 score function was used to design only with hydrophobic amino acids. Specifically, fa_elec, lk_ball[iso,bridge,bridge_uncpl], and the _intra_ terms were disabled, as these proved to be the slowest energy methods by profiling. All that remained were Lennard-Jones, implicit solvation, and backbone-dependent one-body energies (fa_dun, p_aa_pp, rama_prepro). Additionally, flags were used to limit the number of rotamers built at each position (See Supplementary Information).
After the rapid design step, the designs were minimized twice: once with a low-repulsive score function and again with a normal-repulsive score function. Metrics of interest were then evaluated, including Rosetta™ ddG, Contact Molecular Surface, and Contact Molecular Surface to critical hydrophobic residues.
Using the fact that these predicted metrics correlate with the values after full sequence design, a Maximum Likelihood Estimator (functional form similar to logistic regression) was used to give each predicted design a likelihood that it should be selected to move forward. A subset of the docks to be evaluated is subjected to the full sequence design, and the final metric values are calculated. With a “goal threshold” for each filter, each fully-designed output can be marked as “pass” or “fail” for each metric independently. Then, by binning the fully-designed outputs by their values from the rapid trajectory and plotting the fraction of designs that pass the “goal threshold”, the probability that each predicted design passes each filter can be calculated (sigmoids are fitted to smooth the distribution). From here, the probabilities of passing each filter may be multiplied together to arrive at the final probability of passing all filters. This final probability can then be used to rank the designs and pick the best designs to move forward to full sequence optimization. The rapid design protocol here is used merely to rank the designs, not to optimize them; the raw, non-rapid-designed docks are the structures carried forward.
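The ranking logic described above can be sketched in a few lines. The sketch assumes that a sigmoid has already been fitted for each filter; the metric names, the sigmoid parameters (midpoint, slope), and the function names are hypothetical, and the fitting step itself is omitted.

```python
from math import exp


def binned_pass_fraction(rapid, final, goal, edges, higher_is_better=True):
    """Fraction of fully designed outputs passing `goal`, per bin of the rapid metric."""
    fractions = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = [f for r, f in zip(rapid, final) if lo <= r < hi]
        if not in_bin:
            fractions.append(None)  # no data in this bin
            continue
        passed = sum((f >= goal) if higher_is_better else (f <= goal) for f in in_bin)
        fractions.append(passed / len(in_bin))
    return fractions


def sigmoid(x, midpoint, slope):
    """Smoothed pass probability as a function of the rapid-design metric."""
    return 1.0 / (1.0 + exp(-slope * (x - midpoint)))


def pass_probability(rapid_metrics, fits):
    """Product of the per-filter pass probabilities (filters treated independently)."""
    p = 1.0
    for name, (midpoint, slope) in fits.items():
        p *= sigmoid(rapid_metrics[name], midpoint, slope)
    return p


def rank_docks(docks, fits, keep):
    """Return the `keep` docks with the highest probability of passing all filters."""
    ranked = sorted(docks, key=lambda d: pass_probability(d["metrics"], fits), reverse=True)
    return ranked[:keep]


# Hypothetical fits: a negative slope means lower values are better (e.g. ddG).
fits = {"ddg": (-30.0, -0.3), "contact_molecular_surface": (400.0, 0.02)}
docks = [
    {"name": "dock_A", "metrics": {"ddg": -40.0, "contact_molecular_surface": 500.0}},
    {"name": "dock_B", "metrics": {"ddg": -20.0, "contact_molecular_surface": 300.0}},
]
best = rank_docks(docks, fits, keep=1)
```

Only the ranking is used downstream: the selected raw docks, not the rapidly designed structures, would be passed on to full sequence optimization.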
Solvent-accessible surface area (SASA) is a measure of the exposure of amino acids to the solvent, and it is typically calculated by methods that roll a spherical probe approximating a water molecule (radius 1.4 Å) in silico around a full-atom protein model. Delta-SASA upon protein-protein binding has been widely used to analyze native protein interactions. Unlike the crystal structures of native protein complexes, design models of de novo interactions are usually imperfectly packed and contain many holes or cavities. If the holes or cavities in the interface are smaller than the rolling probe, SASA cannot capture them, and the real contacts are usually overestimated by the delta-SASA metric. The contact molecular surface was developed to capture these flaws of de novo designed interactions. First, the molecular surfaces of the binder and the target were calculated by the triangulation algorithm in the Rosetta™ shape complementarity filter. For each triangle, the distance to the closest triangle on the other side was calculated and used to downweight the area of the triangle by the equation $A' = A \cdot \exp(-0.5 \cdot d^{2})$, where $A$ is the area of the triangle and $d$ is that distance. Then all the down-weighted areas were summed to give the contact molecular surface. In this way, the contact area between the target and the binder is penalized by the cavities and holes in the interface. The contact molecular surface was implemented as the ContactMolecularSurface filter in the Rosetta™ macromolecular modelling suite.
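The distance weighting can be illustrated with a short sketch. It takes the triangle areas and the distances to the closest triangle on the partner surface as given inputs; the surface triangulation is not reproduced, and the function name and input layout are assumptions for illustration rather than the Rosetta™ filter's interface.

```python
from math import exp


def contact_molecular_surface(triangles):
    """Sum of triangle areas, each down-weighted by exp(-0.5 * d**2).

    triangles: iterable of (area, distance) pairs, where distance is the
    distance from the triangle to the closest triangle on the partner surface.
    """
    return sum(area * exp(-0.5 * d * d) for area, d in triangles)


# Two interfaces with the same raw surface area (4.0 units):
tight = [(2.0, 0.0), (2.0, 0.0)]      # both triangles in direct contact
with_gap = [(2.0, 0.0), (2.0, 2.0)]   # one triangle faces a cavity

cms_tight = contact_molecular_surface(tight)
cms_gap = contact_molecular_surface(with_gap)
```

In this toy case the raw areas are identical, but the interface with the cavity scores lower because the triangle at distance 2 contributes only exp(-2) of its area.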
Upweight Protein Interface Interactions
Rosetta™ sequence design starts by generating an interaction graph, calculating the energies between all designable rotamer pairs {Leaver-Fay, 2011 #39}. The best rotamer combinations are then searched for using a Monte Carlo Simulated Annealing protocol that optimizes the total energy of the protein (monomer or complex). To obtain more contacts between the binder and the target protein, the energies of all the cross-interface rotamer pairs can be upweighted by a defined factor. In this way, the Monte Carlo protocol is biased to find solutions with better cross-interface interactions. The upweight protein interface interaction protocol was implemented as the ProteinProteinInterfaceUpweighter task operation in the Rosetta™ macromolecular modelling suite.
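The effect of the upweighting can be shown on a toy pairwise energy table. This is a conceptual sketch only and does not use the Rosetta™ API: the dictionary layout, the residue-level (rather than rotamer-level) pair energies, and the function name are assumptions for illustration.

```python
def upweight_interface(pair_energies, chain_of, factor):
    """Scale pair energies that cross the binder/target interface.

    pair_energies: {(i, j): energy} for interacting residue pairs
    chain_of     : {residue index: chain label}
    factor       : multiplier applied only to pairs on different chains
    """
    scaled = {}
    for (i, j), energy in pair_energies.items():
        crosses_interface = chain_of[i] != chain_of[j]
        scaled[(i, j)] = energy * factor if crosses_interface else energy
    return scaled


chain_of = {1: "A", 2: "A", 50: "B"}              # binder = chain A, target = chain B
pair_energies = {(1, 2): -1.0, (2, 50): -2.0}
scaled = upweight_interface(pair_energies, chain_of, factor=2.0)
```

With the cross-interface pair now worth twice as much, an optimizer minimizing the summed energy would favor sequence choices that improve contacts with the target over purely intramolecular ones.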
All protein sequences were padded to 65 aa by adding a (GGGS)n (SEQ ID NO: 1574) linker at the C-terminus of the designs, to avoid the biased amplification of short DNA fragments during PCR reactions. The protein sequences were reverse translated and optimized using DNAworks2.0 {Hoover, 2002 #1} with the S. cerevisiae codon frequency table. Oligo pools encoding the de novo designs and the point mutant library were ordered from Agilent Technologies. Combinatorial libraries were ordered as IDT (Integrated DNA Technologies) ultramers with the final DNA diversity ranging from 1e6 to 1e7.
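The padding step can be sketched as below. How a final, incomplete GGGS repeat was handled is not stated, so truncating the last repeat to reach exactly 65 residues is an assumption of this sketch, as is the function name.

```python
def pad_to_length(seq, target_len=65, unit="GGGS"):
    """Append repeats of `unit` to the C-terminus until `seq` has `target_len` residues.

    Assumption: the last repeat is truncated if needed to hit the exact length.
    """
    if len(seq) > target_len:
        raise ValueError("sequence is already longer than the target length")
    missing = target_len - len(seq)
    repeats = missing // len(unit) + 1
    return seq + (unit * repeats)[:missing]


design = "A" * 60              # placeholder 60-residue design
padded = pad_to_length(design)  # ends in GGGSG under the truncation assumption
```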
All libraries were amplified using Kapa HiFi Polymerase (Kapa Biosystems) with a qPCR machine (BioRAD CFX96). In detail, the libraries were first amplified in a 25 μl reaction, and the PCR reaction was terminated when it reached half maximum yield to avoid over-amplification. The PCR product was loaded onto a DNA agarose gel. The band with the expected size was cut out and DNA fragments were extracted using QIAquick™ kits (Qiagen, Inc.). Then, the DNA product was re-amplified as before to generate enough DNA for yeast transformation. The final PCR product was cleaned up with a QIAquick™ Clean up kit (Qiagen, Inc.). For the yeast transformation, 2-3 μg of linearized modified pETcon™ vector (pETcon3) and 6 μg of insert were transformed into the EBY100 yeast strain using the protocol described in {Benatuil, 2010 #12}.
DNA libraries for deep sequencing were prepared using the same PCR protocol, except the first step started from yeast plasmid prepared from 5×10^7 to 1×10^8 cells by Zymoprep™ (Zymo Research). Illumina adapters and 6-bp pool-specific barcodes were added in the second qPCR step. Gel extraction was used to get the final DNA product for sequencing. All different sorting pools were sequenced using Illumina NextSeq™ sequencing.
Influenza A hemagglutinin (HA) ectodomain was expressed using a baculovirus expression system as described previously {Stevens, 2004 #62; Ekiert, 2012 #63}. Briefly, each HA was fused to a gp67 signal peptide at the N-terminus and to a BirA biotinylation site, thrombin cleavage site, trimerization domain and His-tag at the C-terminus. Expressed HAs were purified by metal affinity chromatography on Ni-NTA resin. For binding studies, each HA was biotinylated with BirA and purified by gel filtration using an S200 16/90 column on an ÄKTA protein purification system (GE Healthcare). The biotinylation reactions contained 100 mM Tris (pH 8.5), 10 mM magnesium acetate, 10 mM ATP, 50 μM biotin and <50 mM NaCl, and were incubated at 37° C. for 1 hr.
For TrkA, the DNA encoding human TrkA ECD (residues 36-382) was cloned into pAcBAP, a derivative of pAcGP67-A modified to include a C-terminal biotin acceptor peptide (BAP) tag (GLNDIFEAQKIEWHE) (SEQ ID NO:1571) followed by a 6×HIS tag for affinity purification. It was then transfected into Trichoplusia ni (High Five) cells (Invitrogen) using the BaculoGold™ baculovirus expression system (BD Biosciences) for secretion and purified from the clarified supernatant via Ni-NTA followed by size exclusion chromatography with a Superdex™-200 column in sterile phosphate-buffered saline (PBS) (Cat. 20012-027; Gibco). The ectodomains of FGFR2 (residues 147-366, Uniprot ID P21802), EGFR (residues 25-525, Uniprot ID P00552), PDGFR (residues 33-314, Uniprot ID P09619), InsulinR (residues 28-953, Uniprot ID P06213), IGF1R (residues 31-930, Uniprot ID P08069), Tie2 (residues 23-445, Uniprot ID Q02763) and IL-7Rα (residues 37-231, Uniprot ID P16871) were expressed in mammalian cells with an IgK signal peptide (METDTLLLWVLLLWVPGSTG) (SEQ ID NO:1572) at the N-terminus and a C-terminal tag (GSENLYFQGSHHHI-HHHGSGLNDIFEAQKIEWHE) (SEQ ID NO:1573) which contains a TEV cleavage site, a 6-His-tag and an AviTag. VirB8 was expressed in E. coli with a C-terminal AviTag as previously described {Gillespie, 2015 #19}. The proteins were purified by Ni-NTA, and polished with size exclusion chromatography. Then, the AviTag-proteins were biotinylated with the BirA biotin-protein ligase bulk reaction kit (Avidity) following the manufacturer's protocol and the excess biotin was removed through size exclusion chromatography. Biotinylated CD3 protein was bought from Abcam (Cat #ab205994). TGF-β was bought from Acro Biosystems (Cat #TG1-H8217). IGF1 was bought from Sigma (Cat #407251-100 ug). Insulin was bought from Abcam (Cat #ab123768). The caged Ang1-Fc protein was prepared as described previously {Divine, 2021 #77}, and was kindly provided by George Ueda.
S. cerevisiae EBY100 strain cultures were grown in C-Trp-Ura media supplemented with 2% (w/v) glucose. For induction of expression, yeast cells were centrifuged at 6,000×g for 1 min and resuspended in SGCAA media supplemented with 0.2% (w/v) glucose at a cell density of 1×10^7 cells per ml and induced at 30° C. for 16-24 h. Cells were washed with PBSF (PBS with 1% (w/v) BSA) and labelled with biotinylated targets using two labeling methods, with-avidity and without-avidity labeling. For the with-avidity method, the cells were incubated with biotinylated target, together with anti-c-Myc fluorescein isothiocyanate (FITC, Miltenyi Biotech) and streptavidin-phycoerythrin (SAPE, ThermoFisher). SAPE in the with-avidity method was used at ¼ the concentration of the biotinylated targets. For the without-avidity method, the cells were first incubated with biotinylated targets, washed, and then secondarily labelled with SAPE and FITC. All the original libraries of de novo designs were sorted using the with-avidity method for the first few rounds of screening to fish out weak binder candidates, followed by several without-avidity sorts with different concentrations of targets. For SSM libraries, two rounds of without-avidity sorts were applied and in the third round of screening, the libraries were titrated with a series of decreasing concentrations of targets to enrich mutants with beneficial mutations. The combinatorial libraries were sorted to convergence by decreasing the target concentration with each subsequent sort and collecting only the top 0.1% of the binding population. The final sorting pools of the combinatorial libraries were plated on C-Trp-Ura plates and the sequences of individual clones were determined by Sanger sequencing. The competition sort was done following the without-avidity protocols with a very minor modification.
Briefly, the biotinylated target proteins (H1, H3, TrkA, InsulinR, IGF1R, PDGFR and Tie2) were first incubated with an excess of competitors (FI6v3, FI6v3, NGF, insulin, IGF1, PDGF and caged Ang1-Fc, respectively) for 10 mins, and the mixture was used for labeling the cells. The non-specificity reagent was prepared using the protocol described in {Xu, 2013 #13}. For the non-specificity sort, the cells were first washed with PBSF and incubated with the non-specificity reagent at a concentration of 100 μg/ml for 30 mins. The cells were then washed and secondarily labelled with SAPE and FITC for cell sorting. The cells were then labeled with RBD using the above mentioned protocol.
Genes encoding the designed protein sequences were synthesized and cloned into modified pET-29b(+) E. coli plasmid expression vectors (GenScript™, N-terminal 8 His-tag followed by a TEV cleavage site). For all designed proteins, the sequence of the N-terminal tag is MSHHHHHHHHSENLYFQSGGG (SEQ ID NO: 1560) (unless otherwise noted), which is followed immediately by the sequence of the designed protein. For proteins expressed with the maltose binding protein (MBP) tag, the corresponding genes were subcloned into a modified pET-29b(+) E. coli plasmid, which has an N-terminal 6 His-tag and an MBP tag. Plasmids were transformed into chemically competent E. coli Lemo21 cells (NEB). For the designs for TrkA, FGFR2, EGFR, IR, IGF1R, Tie2, IL-7Rα, TGF-β and the MBP tagged miniproteins, protein expression was performed using the Studier autoinduction media supplemented with antibiotic, and cultures were grown overnight. For designs for HA, PDGFR and CD3δ, the E. coli cells were grown in LB media at 37° C. until the cell density reached 0.6 OD600. Then, IPTG was added to the final concentration of 500 mM and the cells were grown overnight at 22° C. for expression. The cells were harvested by spinning at 4,000×g for 10 min and then resuspended in lysis buffer (300 mM NaCl, 30 mM Tris-HCl, pH 8.0, with 0.25% CHAPS for cell assay samples) with DNase and protease inhibitor tablets. The cells were lysed with a sonicator for 4 minutes total (2 minutes on time, 10 sec on-10 sec off) with an amplitude of 80%. Then the soluble fraction was clarified by centrifugation at 20,000×g for 30 min. The soluble fraction was purified by Immobilized Metal Affinity Chromatography (Qiagen) followed by FPLC size-exclusion chromatography (Superdex™ 75 10/300 GL, GE Healthcare). All protein samples were characterized by SDS-PAGE, with purity higher than 95%.
Protein concentrations were determined by absorbance at 280 nm measured using a NanoDrop™ spectrophotometer (Thermo Scientific) using predicted extinction coefficients.
Far-ultraviolet CD measurements were carried out with a JASCO-1500 equipped with a temperature-controlled multi-cell holder. Wavelength scans were measured from 260 to 190 nm at 25° C. and 95° C., and again at 25° C. after fast refolding (˜5 min). Temperature melts monitored the dichroism signal at 222 nm in steps of 2° C./minute with 30 s of equilibration time. Wavelength scans and temperature melts were performed using 0.3 mg/ml protein in PBS buffer (20 mM NaPO4, 150 mM NaCl, pH 7.4) with a 1 mm path-length cuvette. Melting temperatures were determined by fitting the data with a sigmoid curve equation. 9 out of the 13 designs retained more than half of the mean residue ellipticity values, which indicated that the Tm values are greater than 95° C. Tm values of the other designs were determined as the inflection point of the fitted function.
Biolayer interferometry binding data were collected on an Octet RED96™ (ForteBio) and processed using the instrument's integrated software. For minibinder binding assays, biotinylated targets were loaded onto streptavidin-coated biosensors (SA ForteBio) at 50 nM in binding buffer (10 mM HEPES (pH 7.4), 150 mM NaCl, 3 mM EDTA, 0.05% surfactant P20, 1% BSA) for 360 s. Analyte proteins were diluted from concentrated stocks into the binding buffer. After baseline measurement in the binding buffer alone, the binding kinetics were monitored by dipping the biosensors in wells containing the target protein at the indicated concentration (association step) and then dipping the sensors back into baseline/buffer (dissociation). The binding affinities of the Tie2 and IGF1R minibinders were low, and MBP tagged proteins were used for the binding assay to amplify the binding signal. The binding assays for the Insulin receptor (IR) designs were conducted with Amine Reactive Second-Generation (AR2G ForteBio) Biosensors with the recommended protocol. In brief, the miniproteins were immobilized onto the AR2G tips and InsulinR was used as the analyte at the indicated concentrations.
For the cross-reactivity assay, each target protein was loaded onto SA tips at a concentration of 50 nM for 325 s. The tips were dipped into the miniprotein wells for 300 s (association) and then dipped into the blank buffer wells for 600 s (dissociation). The maximum raw biolayer interferometry binding signal was used as the indicator of binding strength. The maximum signal among all the miniprotein binders for a specific target was used to normalize the data for heatmap plotting.
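The per-target normalization used for the heatmap can be sketched as follows. The nested-dictionary layout, the function name, and the example signal values are hypothetical and are not measured data.

```python
def normalize_by_target(max_signal):
    """Scale each target's signals by the largest signal seen for that target.

    max_signal: {target: {binder: maximum raw BLI signal}}
    Returns the same structure with values between 0 and 1 per target.
    """
    normalized = {}
    for target, row in max_signal.items():
        top = max(row.values())
        normalized[target] = {
            binder: (signal / top if top > 0 else 0.0)
            for binder, signal in row.items()
        }
    return normalized


# Made-up signals for illustration only.
raw = {
    "TrkA": {"TrkA_mb": 2.0, "FGFR2_mb": 0.1},
    "FGFR2": {"TrkA_mb": 0.05, "FGFR2_mb": 1.0},
}
heatmap = normalize_by_target(raw)
```

After normalization the strongest binder for each target has the value 1, so off-diagonal entries can be read directly as a fraction of the cognate signal.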
Apparent SC50 Estimation from FACS and NGS
The Pear™ program {Zhang, 2014 #61} was used to assemble the fastq files from the Next Generation Sequencing runs. Translated, assembled reads were matched against the ordered designs to determine the number of counts for each design in each pool.
The critical assumption in the fitting is that the yeast cells displaying a particular design follow a modified version of the standard KD equation relating fraction bound to concentration:
The next assumption is that all designs have the same expression level on yeast surface and that 100% of yeast cells express well enough to be collected in the “expression” gate.
These two assumptions, while probably false, allow fitting the data with only one free parameter per design and no global free parameters. The correct version of EQ-KD for this experiment likely has a different shape and slope from a perfect sigmoid; the net effect of correcting this would be that all SC50 values are scaled by a constant factor (which would not affect the relative comparisons made here). Analysis of the data shows that different designs have different expression levels on yeast (one can examine fraction_collected_i for strong binders at concentrations where binding should be saturated). The net result is that experimentally, EQ-KD is multiplied by a constant between 0 and 1 for each design. This constant seems to range from 0.2 to 0.7. As such, when fitting the data, fraction_collected_i values above 0.2 are considered saturating. However, because the 0.2 mark may represent 90% collection for poorly-expressing designs and 30% collection for strongly-expressing designs, the resulting SC50 fits may vary by up to 5-fold. The alternative is to try to estimate an expression level; however, this becomes increasingly difficult with weaker binders that never saturate the experiment.
Apparent SC50 Estimation from FACS and NGS: Point Estimates
The following equation may be used to determine the fraction_collectedi for a single design in a single sort:
This point-estimate method is best suited for asking the question: which designs have SC50<SC50,0? It works by determining the expected fraction_collectedi for a given sorting concentration and SC50,0. The sorting concentration and SC50,0 should be selected such that EQ-KD results in an expected fraction_collectedi less than 0.2 in order to circumvent the expression issues mentioned above. Any design with fraction_collectedi greater than this cutoff can then be said to have an SC50 less than SC50,0. Designs with low numbers of counts are suspect; see the Doubly-Transformed Yeast Cells section. For this analysis, any designs with fewer than max_possible_passenger_cells cells were eliminated.
This method may be applied to avidity sorts; however, the resulting SC50 would be the SC50 during avidity experiments. It is unclear to the authors what the precise mathematical effect of avidity is, and as such we do not compare avidity SC50s with non-avidity SC50s.
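The point-estimate decision can be sketched as below. This is a hedged illustration: it assumes the simple isotherm concentration / (concentration + SC50) for EQ-KD, takes the observed fraction to be collected cells divided by sorted cells, and applies the passenger-cell floor to the collected cell count. The function name `passes_point_estimate` and the example numbers are hypothetical.

```python
def passes_point_estimate(cells_collected, cells_sorted, sort_conc, sc50_0,
                          max_possible_passenger_cells, cap=0.2):
    """Return True if the data support SC50 < sc50_0, False if they do not,
    and None if the design has too few cells to be evaluated."""
    if cells_sorted <= 0 or cells_collected < max_possible_passenger_cells:
        return None  # counts could come from a doubly-transformed cell
    expected = sort_conc / (sort_conc + sc50_0)  # assumed form of EQ-KD
    if expected >= cap:
        raise ValueError("pick sort_conc and sc50_0 so the expected "
                         "fraction stays below the saturation cap")
    observed = cells_collected / cells_sorted
    return observed > expected


# Hypothetical sort at 1 nM, asking whether SC50 < 10 nM (expected ~0.09).
strong = passes_point_estimate(300, 1000, 1.0, 10.0, 5)
weak = passes_point_estimate(20, 1000, 1.0, 10.0, 5)
sparse = passes_point_estimate(3, 1000, 1.0, 10.0, 5)
```

Returning `None` for sparse designs keeps "not evaluated" distinct from "evaluated and failed", which matters when counting false negatives.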
Apparent SC50 Estimation from FACS and NGS: Doubly-Transformed Yeast Cells
Doubly-transformed yeast cells represent a major source of error in these experiments. While rare, a yeast cell that contains two plasmids, one of a strong binder and one of a non-binder, will carry the non-binder plasmid through the sorting process. The net result is that the non-binder will end up with counts that track the strong binder, although at a greatly reduced absolute number. (Rare is a relative term here. While the odds of any two specific plasmids being in one cell are low, in the entire pool of yeast, doubly transformed cells seem to be quite common.)
We chose to address this issue by making the following assumption: non-binders that take advantage of a doubly-transformed yeast cell do so from precisely one double-transformation event. In other words, we assume that the same non-binding plasmid did not get doubly transformed into two separate strong-binding yeast. This assumption allows us to estimate the largest number of cells we would expect to see from a doubly-transformed plasmid:
With this number in hand, one can set a floor for the number of cells that one would expect to see. Any design with fewer than this number of cells cannot be considered for calculations because it is unclear whether or not its cells came from a doubly-transformed yeast cell. On the whole, this method reduces false-positive binders, but also removes true-positive binders that did not transform well. It is wise to simply drop designs that did not transform well from the downstream calculations.
Apparent SC50 Estimation from FACS and NGS: Full Estimate
Estimation of an upper and lower bound on the SC50 from the data may be performed by looking at an arbitrary number of sorting experiments. Taking a P(SC50=SC50,0|data) and performing Bayesian analysis, one arrives at a confidence interval for the actual SC50 value. This analysis may be performed at every sort and the resulting distributions combined to produce a robust estimate.
Each sort may be modeled as a binomial distribution where: p=fraction_collected from EQ-KD using concentration=sorting_concentration and SC50=SC50,0; n=cells_sortedi; and x=cells_collectedi. By performing this analysis at a range of SC50,0 values and examining the probability this could happen by the binomial distribution, one arrives at P(SC50=SC50,0|data). Specifically for this analysis, the cdf of the binomial was used with the null hypothesis that SC50=SC50,0.
Care should be taken with the valid range of p. As stated previously, it is wise to cap the expected value of p at 0.2 to account for expression levels and to floor the value such that n*p does not fall below max_possible_passenger_cells. In our implementation, if x falls into a range that has been clipped, a probability of 1 is returned.
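A minimal sketch of the per-sort binomial calculation and the scan over candidate SC50,0 values is given below. Several choices here are assumptions rather than details stated in the text: the simple isotherm used for EQ-KD, the two-sided tail built from the binomial cdf, the reading of the clipping rule (return 1 when x lies in the capped or floored region), and the multiplication of per-sort probabilities to combine sorts. All function names are hypothetical.

```python
import math


def binom_cdf(x, n, p):
    """P(X <= x) for X ~ Binomial(n, p), with each term computed in log space."""
    if x < 0:
        return 0.0
    if x >= n or p <= 0.0:
        return 1.0
    if p >= 1.0:
        return 0.0
    log_p, log_q = math.log(p), math.log1p(-p)
    total = 0.0
    for k in range(x + 1):
        log_pmf = (math.lgamma(n + 1) - math.lgamma(k + 1)
                   - math.lgamma(n - k + 1) + k * log_p + (n - k) * log_q)
        total += math.exp(log_pmf)
    return min(total, 1.0)


def sort_probability(x, n, sort_conc, sc50_0, max_passenger_cells, cap=0.2):
    """Two-sided binomial tail probability of x collected cells out of n sorted,
    under the null hypothesis SC50 = sc50_0."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = sort_conc / (sort_conc + sc50_0)  # assumed form of EQ-KD
    if p > cap:
        p = cap  # expression-limited plateau
        if x >= cap * n:
            return 1.0  # observation lies in the clipped (saturated) range
    floor = max_passenger_cells / n
    if p < floor:
        p = floor  # passenger-cell floor
        if x <= max_passenger_cells:
            return 1.0  # observation lies in the clipped (floored) range
    lower = binom_cdf(x, n, p)
    upper = 1.0 - binom_cdf(x - 1, n, p)
    return min(1.0, 2.0 * min(lower, upper))


def scan_sc50(sorts, sc50_grid, max_passenger_cells):
    """sorts is a list of (cells_collected, cells_sorted, sort_conc) tuples.
    Returns [(sc50_0, combined probability)] across the grid."""
    profile = []
    for sc50_0 in sc50_grid:
        prob = 1.0
        for x, n, conc in sorts:
            prob *= sort_probability(x, n, conc, sc50_0, max_passenger_cells)
        profile.append((sc50_0, prob))
    return profile


# Hypothetical single sort: 90 of 1000 cells collected at 1 nM.
profile = scan_sc50([(90, 1000, 1.0)], [1.0, 10.0, 100.0, 1000.0], 5)
best_sc50 = max(profile, key=lambda item: item[1])[0]
```

The range of SC50,0 values whose combined probability stays above a chosen threshold then gives the upper and lower bounds on SC50.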
In order to remove artifacts from design and to discover the best orientation for each SSM mutation, all binders were relaxed using the Rosetta™ beta_nov16 score function before calculations began (30 replicates using 5 repeats of cartesian FastRelax™ taking the best scoring model). Relaxation of point mutants then used the standard cartesian FastRelax procedure and allowed all residues within 10 Å of the mutation to relax. The backbone coordinates of those residues on the binder were allowed to relax while the target was held constant. The best of 3 (as evaluated by Rosetta™ energy) was chosen as the representative model.
In order to validate that the designed binder was folded into the correct shape and was using its designed interface to bind to the target, the entropy of the interface, monomer core, and monomer surface was examined. For each position on the binder, the sequence entropy (Shannon entropy) was calculated using the observed frequencies of each amino acid in the Next Generation Sequencing data. The specific pool that was chosen for this analysis was the pool with concentration closest to 10-fold lower than the calculated SC50 of the parent.
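The per-position Shannon entropy can be computed from amino-acid counts as in the sketch below. The logarithm base is not stated in the text; base 2 (bits) is assumed here, and the function name `shannon_entropy` and the example counts are hypothetical.

```python
import math


def shannon_entropy(counts):
    """Shannon entropy in bits for one position.

    counts maps amino acid -> observed NGS count at that position.
    """
    total = sum(counts.values())
    if total <= 0:
        raise ValueError("position has no counts")
    entropy = 0.0
    for count in counts.values():
        if count > 0:  # zero-count residues contribute nothing
            freq = count / total
            entropy -= freq * math.log2(freq)
    return entropy


# A position that tolerates four residues equally versus a fixed position.
tolerant = shannon_entropy({"A": 25, "S": 25, "T": 25, "G": 25})
fixed = shannon_entropy({"W": 100, "A": 0})
```

A low entropy marks a position where few substitutions survive sorting, as expected for buried core and interface residues.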
After the per-position sequence entropy was calculated, the average per-position entropy of the SASA-hidden positions contacting the target (interface core), the SASA-hidden positions not contacting the target (monomer core), and the fully exposed positions not contacting the target (monomer surface) were calculated. A simple subtraction was performed according to EQ-ENTROPY:
Finally, the probability that the score could have come from totally random data was computed by performing the above calculation on the actual data, and then performing the same calculation 100 times, but randomly mismatching the observed counts among all SSM point mutations. In this way, the experimental noise is kept constant among the 100 decoy datasets. The final step to arrive at a p-value was to calculate the mean and standard deviation of the 100 decoy intermediate_entropy_scores and to find the p-value with the Normal CDF function of the binder's intermediate_entropy_score.
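The decoy procedure can be sketched as follows. The sketch assumes that a larger score is better, so the p-value is the upper tail of a normal distribution fitted to the decoy scores; the tail direction is not stated in the text. The function names and example values are hypothetical, and the scoring function itself (EQ-ENTROPY) is not reproduced here.

```python
import math
import random
import statistics


def shuffle_counts(counts_by_mutation, rng):
    """Randomly reassign the observed counts among the SSM point mutations."""
    mutations = list(counts_by_mutation)
    counts = list(counts_by_mutation.values())
    rng.shuffle(counts)
    return dict(zip(mutations, counts))


def decoy_p_value(real_score, decoy_scores):
    """Upper-tail probability of real_score under a normal distribution
    with the mean and sample standard deviation of the decoy scores."""
    mean = statistics.mean(decoy_scores)
    std = statistics.stdev(decoy_scores)
    if std == 0:
        raise ValueError("decoy scores have zero variance")
    z = (real_score - mean) / std
    return 0.5 * math.erfc(z / math.sqrt(2.0))


# Shuffling keeps the same set of counts, so experimental noise is preserved.
rng = random.Random(0)
decoy = shuffle_counts({"A10G": 120, "A10W": 3, "L22K": 45}, rng)
# Hypothetical decoy scores centered on zero.
p_typical = decoy_p_value(0.0, [-1.0, 1.0, -1.0, 1.0])
p_extreme = decoy_p_value(10.0, [-1.0, 1.0, -1.0, 1.0])
```

Using `math.erfc` for the tail avoids the loss of precision that comes from subtracting a cdf value close to 1 from 1.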
In order to further assess the accuracy of the design model, the effect on binding predicted by Rosetta™ was compared with the experimental data. The effect from Rosetta™ can be broken into two components: monomer stabilization/destabilization and interface stabilization/destabilization. The effect on the monomer energy will affect the fraction of the proteins that are folded in solution. A lower fraction of folded proteins will then worsen the apparent affinity because only the folded proteins are able to bind. The effect on the monomer stability was estimated by taking the difference in Rosetta™ energy between the native relaxed dock and the mutant relaxed dock and looking only at the change in Rosetta™ score of the docked protein (excluding energies arising from cross-interface edges). The effect on the target energy was calculated the same way and was considered to directly affect the binding energy. The binding energy was calculated by taking the difference in Rosetta™ score between the docked and undocked conformations (but with no repacking or minimization in the unbound form).
The effect on the P(fold_monomer) was estimated by first determining the predicted ΔGfold of the native protein.
Using EQ-ddg_monomer and EQ-PFOLD, the predicted ΔGfold for the native design was estimated by performing a least-squares fit of all mutations that did not occur in residues at the interface. A rudimentary confidence interval was created by allowing all ΔGfold values that resulted in a root mean squared error of within 0.25 kcal/mol of the best ΔGfold value. Typical confidence intervals spanned 3 kcal/mol.
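The grid-scan fit and the RMSE-based confidence interval can be sketched as below. EQ-ddg_monomer and EQ-PFOLD are not reproduced in this text, so the sketch substitutes an assumed two-state folding model, P(fold) = 1 / (1 + exp(ΔGfold / RT)), in which a reduced folded fraction costs -RT ln of the ratio of folded fractions in apparent binding energy. The model, the value of RT, all function names, and the synthetic data are assumptions.

```python
import math

RT = 0.593  # kcal/mol near 25 C (assumed temperature)


def p_fold(dg_fold):
    """Folded fraction in the assumed two-state model; negative is stable."""
    return 1.0 / (1.0 + math.exp(dg_fold / RT))


def binding_penalty(dg_fold, ddg_monomer):
    """Apparent binding-energy loss caused by a change in folded fraction."""
    return -RT * math.log(p_fold(dg_fold + ddg_monomer) / p_fold(dg_fold))


def rmse(predicted, observed):
    squared = sum((p - o) ** 2 for p, o in zip(predicted, observed))
    return math.sqrt(squared / len(observed))


def fit_dg_fold(grid, ddg_monomers, observed, tolerance=0.25):
    """Scan candidate dG_fold values; return the best value plus the lowest
    and highest values whose RMSE is within `tolerance` of the best RMSE."""
    scored = []
    for dg in grid:
        predicted = [binding_penalty(dg, ddg) for ddg in ddg_monomers]
        scored.append((rmse(predicted, observed), dg))
    best_rmse, best_dg = min(scored)
    accepted = [dg for err, dg in scored if err <= best_rmse + tolerance]
    return best_dg, min(accepted), max(accepted)


# Synthetic non-interface mutations generated from a known dG_fold of -3.
ddg_monomers = [0.5, 1.0, 2.0, 3.0, 4.0, 5.0]
observed = [binding_penalty(-3.0, ddg) for ddg in ddg_monomers]
grid = [-8.0 + 0.25 * i for i in range(33)]
best, low, high = fit_dg_fold(grid, ddg_monomers, observed)
```

Because mildly destabilizing mutations barely change the folded fraction of a stable protein, many ΔGfold values fit almost equally well, which is consistent with confidence intervals spanning several kcal/mol.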
With the ΔGfold in hand, the predicted effect on the binding energy could be computed according to EQ-DDG_SUM. The values of ΔGfold inside the confidence range for ΔGfold that produced the largest and smallest ΔddGRosetta were used to produce a confidence interval for ΔddGRosetta.
The per-position accuracy was assessed by determining whether the confidence interval for ΔddGRosetta was compatible with the confidence interval for the SC50 from the experimental data. A buffer of 1 kcal/mol was allowed.
With the per-position accuracies in hand, the overall percentage of mutations that Rosetta™ was able to explain in the monomer_core and interface_core was assessed. This produced an overall Rosetta™ accuracy score.
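The interval comparison and the overall accuracy score can be sketched as below. The conversion of an SC50 ratio into an energy, RT ln(SC50 of the mutant / SC50 of the parent), is an assumption not stated in the text, as are the value of RT, the function names, and the example intervals.

```python
import math

RT = 0.593  # kcal/mol near 25 C (assumed temperature)


def sc50_to_ddg(sc50_mutant, sc50_parent):
    """Assumed conversion of an SC50 ratio into a binding-energy change."""
    return RT * math.log(sc50_mutant / sc50_parent)


def intervals_compatible(predicted, experimental, buffer=1.0):
    """True if the two intervals overlap or are separated by a gap of at
    most `buffer` kcal/mol."""
    pred_lo, pred_hi = sorted(predicted)
    exp_lo, exp_hi = sorted(experimental)
    return pred_lo - buffer <= exp_hi and exp_lo - buffer <= pred_hi


def accuracy_score(pairs, buffer=1.0):
    """Fraction of (predicted, experimental) interval pairs that agree."""
    if not pairs:
        raise ValueError("no mutations to score")
    agree = sum(intervals_compatible(p, e, buffer) for p, e in pairs)
    return agree / len(pairs)


# Hypothetical intervals in kcal/mol for three mutations.
pairs = [
    ((0.0, 1.0), (1.5, 3.0)),   # gap of 0.5: compatible
    ((0.0, 1.0), (2.5, 3.0)),   # gap of 1.5: not compatible
    ((2.0, 4.0), (3.0, 3.5)),   # overlapping: compatible
]
score = accuracy_score(pairs)
```

In this example two of the three mutations are explained, giving a score of 2/3.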
In the same way as the Entropy score, 100 decoys with randomly shuffled SC50 values were subjected to the same procedure. The mean and standard deviation of the decoys were determined and the p-value for the Rosetta™ score was determined using the Normal CDF function.
This application claims priority to U.S. Provisional Patent Application Ser. No. 63/221,327 filed Jul. 13, 2021, incorporated by reference herein in its entirety.
This invention was made with government support under Grant No. FA8750-17-C-0219, awarded by the Defense Advanced Research Projects Agency and Grant No. HDTRA1-16-C-0029, awarded by the Defense Threat Reduction Agency and Grant Nos. HHSN272201700059C and R01 AG063845, awarded by the National Institutes of Health. The government has certain rights in the invention.
Filing Document | Filing Date | Country | Kind |
---|---|---|---|
PCT/US2022/073590 | 7/11/2022 | WO |
Number | Date | Country |
---|---|---|
63221327 | Jul 2021 | US |