This application contains a Sequence Listing that has been submitted electronically as an XML file named 52271-0006WO1-SL_ST26.xml. The XML file, created on Nov. 16, 2022, is 71,488 bytes in size. The material in the XML file is hereby incorporated by reference in its entirety.
Described herein are methods and systems useful, for example, for degron identification, and also, for example, for predicting, identifying, classifying, and selecting neosubstrates of E3 ligases.
Protein biosynthesis and degradation is a dynamic process which sustains normal cell homeostasis. The ubiquitin-proteasome system is a master regulator of protein homeostasis, by which proteins are initially targeted for poly-ubiquitination by E3 ligases and then degraded into short peptides by the proteasome. Nature evolved diverse peptidic motifs, termed degrons, to signal substrates for degradation. A need exists for the development of methods that efficiently and accurately assess the structural basis of E3 ligase degron recognition and identify proteins capable of being targeted for degradation by the E3 ligase machinery.
The E3 ubiquitin ligase complex ubiquitinates many other proteins and can be manipulated with small molecules to trigger targeted degradation of specific substrate proteins of interest, including proteins that are not naturally targeted for degradation. Binding of substrate proteins with the E3 ubiquitin ligase complex is permitted if certain features, known as degrons, are present on the substrate proteins.
In some cases, binding of small molecules (e.g., molecular glues) to E3 ligase substrate receptors such as cereblon (CRBN) modulates the substrate selectivity of the complex, e.g., by changing the molecular surface of the E3 ligase substrate receptor protein, effectively hijacking the innate in vivo protein degradation system in order to degrade specific target proteins, e.g., for therapeutic effect (sometimes referred to as targeted protein degradation).
Molecular glues stabilize protein-protein interactions (e.g., between an E3 ligase substrate receptor protein and a neosubstrate), and, in cases where they lead to degradation of the neosubstrate, they are known as molecular glue degraders. Molecular glue degraders are a recently discovered therapeutic modality, with several clinically approved drugs (e.g. indisulam and lenalidomide), whose targets would have been otherwise considered undruggable. Molecular glue degraders have the potential to become the only modality capable of downregulating the large fraction of the proteome (>75%) considered undruggable using other approaches.
This raises the challenge of identifying neosubstrates and/or neosurfaces, in effect matching targets to particular E3 ligases, given a known or a yet unknown molecular glue. Thus, a critical need exists to identify neodegrons complementary to putative neosurfaces.
A need exists for alternative methods for the identification of target proteins (e.g., neosubstrates) capable of being targeted by E3 ligase machinery. Thus, described herein are, among other things, methods for the identification of target proteins capable of being targeted by E3 ligase machinery based on protein surface features.
Thus, described herein are, among other things, methods for the identification of substrate proteins capable of being targeted by E3 ligase machinery based on the protein molecular surface (quinary) representation of protein structure. The methods are useful, for example, in matching E3 ligases (e.g., an E3 ligase substrate receptor protein such as CRBN) to degrons (e.g., in target proteins), in the presence or absence of a molecular glue.
While degrons have been identified and described based on their primary and secondary structures (see, e.g., WO2022/153220), the use of surface features (the quinary protein structure) to identify degrons has not been performed in the art. The methods described herein provide, for the first time, the identification of degrons based on their surface features. The methods described herein are useful, for example, to identify degrons independently of their underlying primary sequence and secondary structure, based on how similar their molecular surface is to known degrons (degron mimicry) and/or their complementarity to an E3 ligase substrate receptor protein surface or E3 ligase substrate receptor protein neosurface (e.g., induced by a molecular glue) (E3 complementarity).
The ability to identify degrons in this manner allows for the identification of degrons in completely unrelated proteins with no underlying structural similarity.
Thus, provided herein are methods for generating a degron similarity score for one or more protein(s), comprising: a) providing a first set of molecular surface features from a first set of one or more protein(s) comprising one or more known degron(s) of an E3 ligase substrate receptor and/or one or more predicted degron(s) of the E3 ligase substrate receptor; b) providing a second set of molecular surface features from a second set of one or more protein(s); and c) calculating a similarity score for the protein(s) of the second set by comparing the first and second sets of molecular surface features.
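As a rough illustration of steps a) through c), the following Python sketch scores each protein of the second set by the best cosine similarity between its surface-patch descriptors and the descriptors of known degron patches. The fixed-length descriptor vectors, the cosine metric, the toy values, and the names `cosine` and `degron_similarity_score` are assumptions made for illustration only; the similarity score of the methods described herein may instead be calculated in other ways, e.g., using a geometric deep learning model.

```python
import math
from typing import Dict, List, Sequence

Vector = Sequence[float]

def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity of two descriptor vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def degron_similarity_score(degron_patches: List[Vector],
                            candidate_patches: List[Vector]) -> float:
    """Best similarity between any known-degron patch and any candidate patch."""
    return max(
        (cosine(d, c) for d in degron_patches for c in candidate_patches),
        default=0.0,
    )

# Hypothetical descriptors: the first set holds patches from known degrons,
# the second set holds patches from the proteins to be scored.
first_set: List[Vector] = [[1.0, 0.0, 0.5]]
second_set: Dict[str, List[Vector]] = {
    "protein_A": [[1.0, 0.0, 0.5], [0.0, 1.0, 0.0]],
    "protein_B": [[0.0, 1.0, 0.0]],
}

scores = {
    name: degron_similarity_score(first_set, patches)
    for name, patches in second_set.items()
}
```

In this toy input, `protein_A` carries a patch identical to the known degron patch and therefore receives the highest possible score, whereas `protein_B` carries only an orthogonal descriptor.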
Also provided herein are methods for identifying a predicted neosubstrate of an E3 ligase, comprising: a) calculating a degron similarity score for one or more protein(s), according to any of the methods described herein; and b) based on the similarity score, identifying one or more of the protein(s) of the second set as a predicted neosubstrate(s) of the E3 ligase.
Also provided herein are methods for identifying a putative neosubstrate of an E3 ligase, comprising: a) identifying a predicted neosubstrate using any of the methods described herein; b) for one or more of the predicted neosubstrate(s), testing or having tested the predicted neosubstrate in an E3 ligase substrate detection assay without a binding modulator of the E3 ligase to determine if the putative neosubstrate is a substrate of the E3 ligase; and c) if, based on said testing or having tested, the predicted neosubstrate is not determined to be a substrate of the E3 ligase, identifying the predicted neosubstrate as a putative neosubstrate of the E3 ligase.
Also provided herein are methods for classifying protein(s) as substrate(s) and/or putative neosubstrate(s) of an E3 ligase, comprising: a) calculating a degron similarity score for one or more protein(s) according to any of the methods described herein; b) based on the similarity score, identifying the protein(s) of the second set as a predicted neosubstrate of the E3 ligase or not; c) for one or more of the predicted neosubstrate(s), testing or having tested the predicted neosubstrate in an E3 ligase substrate detection assay without a binding modulator of the E3 ligase to determine if the predicted neosubstrate is a substrate of the E3 ligase; and d) i) if, based on said testing or having tested, the predicted neosubstrate is determined to be a substrate of the E3 ligase, classifying the predicted neosubstrate as a substrate of the E3 ligase; else ii) if, based on said testing or having tested, the predicted neosubstrate is not determined to be a substrate of the E3 ligase, classifying the predicted neosubstrate as a putative neosubstrate of the E3 ligase, thereby classifying protein(s) as substrate(s) and/or putative neosubstrate(s) of an E3 ligase.
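The decision logic of steps b) through d) can be sketched as follows. This is a minimal illustration only: the numeric `threshold`, the `Label` enumeration, and the function name `classify` are hypothetical, and the assay outcome is passed in as a Boolean because the testing itself is performed experimentally (in the absence of a binding modulator).

```python
from enum import Enum

class Label(Enum):
    NOT_PREDICTED = "not a predicted neosubstrate"
    SUBSTRATE = "substrate"
    PUTATIVE_NEOSUBSTRATE = "putative neosubstrate"

def classify(score: float, threshold: float,
             substrate_without_modulator: bool) -> Label:
    """Apply steps b) to d) to one protein of the second set.

    score: degron similarity score from step a).
    threshold: assumed cut-off for calling a predicted neosubstrate.
    substrate_without_modulator: assay result obtained without a binding modulator.
    """
    if score < threshold:
        return Label.NOT_PREDICTED          # step b): not a predicted neosubstrate
    if substrate_without_modulator:
        return Label.SUBSTRATE              # step d) i)
    return Label.PUTATIVE_NEOSUBSTRATE      # step d) ii)
```

For example, a protein scoring above the assumed threshold that shows no interaction, binding, or degradation without a modulator would be labeled a putative neosubstrate.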
Also provided herein are methods for selecting putative neosubstrate(s) of an E3 ligase from a set of potential neosubstrates, comprising: a) calculating a degron similarity score for one or more protein(s) according to any of the methods described herein; b) based on the similarity score, identifying a subset of the potential neosubstrates as predicted neosubstrate(s); c) for one or more of the predicted neosubstrate(s), testing or having tested the predicted neosubstrate in an E3 ligase substrate detection assay without a binding modulator of the E3 ligase to determine if the predicted neosubstrate is a substrate of the E3 ligase; and d) if, based on said testing or having tested, the predicted neosubstrate is determined not to be a substrate of the E3 ligase, identifying the predicted neosubstrate as a putative neosubstrate of the E3 ligase and selecting it from the set of potential neosubstrates, thereby selecting putative neosubstrate(s) of an E3 ligase from a set of potential neosubstrates.
In some embodiments, the E3 ligase substrate detection assay is selected from the group consisting of a proximity assay, a binding assay, and a degradation assay.
In some embodiments: (i) the E3 ligase substrate detection assay is a proximity assay and the predicted neosubstrate is determined to be a substrate of the E3 ligase if an interaction between the putative neosubstrate and E3 ligase is detected; (ii) the E3 ligase substrate detection assay is a binding assay and the predicted neosubstrate is determined to be a substrate of the E3 ligase if binding of the neosubstrate and E3 ligase is detected; or (iii) the E3 ligase substrate detection assay is a degradation assay and the predicted neosubstrate is determined to be a substrate of the E3 ligase if degradation of the predicted neosubstrate is detected.
In some embodiments, the method comprises: testing or having tested a putative neosubstrate identified, classified, or selected by any one of the methods described herein in an E3 ligase substrate detection assay with a binding modulator of the E3 ligase, and, if, based on said testing or having tested, the putative neosubstrate is determined to be a substrate of the E3 ligase in the presence of the E3 ligase binding modulator, identifying the putative neosubstrate as a neosubstrate of the E3 ligase. In some embodiments: (i) the E3 ligase substrate detection assay is a proximity assay and the putative neosubstrate is determined to be a substrate of the E3 ligase in the presence of the E3 ligase binding modulator if an interaction between the putative neosubstrate and E3 ligase is detected; (ii) the E3 ligase substrate detection assay is a binding assay and the putative neosubstrate is determined to be a substrate of the E3 ligase in the presence of the E3 ligase binding modulator if binding of the neosubstrate and E3 ligase is detected; or (iii) the E3 ligase substrate detection assay is a degradation assay and the putative neosubstrate is determined to be a substrate of the E3 ligase in the presence of the E3 ligase binding modulator if degradation of the predicted neosubstrate is detected.
In some embodiments, the one or more degron(s) are selected from the group consisting of N-degrons, C-degrons, phosphodegrons, oxygen-dependent degrons, G-loop degrons, and combinations thereof. In some embodiments, the degron(s) are N-degrons, C-degrons, phosphodegrons, oxygen-dependent degrons, or G-loop degrons. In some embodiments, the G-loop degron(s): (i) comprise or consist of the amino acid sequence X1-X2-X3-X4-G-X6, wherein: each of X1, X2, X3, X4, and X6 is independently selected from any one of the naturally occurring amino acids; and G (i.e., X5) is glycine; (ii) comprise or consist of the amino acid sequence X1-X2-X3-X4-G-X6-X7, wherein: each of X1, X2, X3, X4, X6, and X7 is independently selected from any one of the naturally occurring amino acids; and G (i.e., X5) is glycine; (iii) comprise or consist of the amino acid sequence X1-X2-X3-X4-G-X6-X7-X8, wherein: each of X1, X2, X3, X4, X6, X7, and X8 is independently selected from any one of the naturally occurring amino acids; and G (i.e., X5) is glycine; (iv) comprise or consist of the amino acid sequence X1-X2-X3-X4-G-X6, wherein X1 is selected from the group consisting of asparagine, aspartic acid, and cysteine; X2 is selected from the group consisting of isoleucine, lysine, and asparagine; X3 is selected from the group consisting of threonine, lysine, and glutamine; X4 is selected from the group consisting of asparagine, serine, and cysteine; X5 is glycine; and X6 is selected from the group consisting of glutamic acid and glutamine; (v) comprise or consist of the amino acid sequence X1-X2-X3-X4-G-X6, wherein X1 is asparagine; X2 is isoleucine; X3 is threonine; X4 is asparagine; X5 is glycine; and X6 is glutamic acid; (vi) comprise or consist of the amino acid sequence X1-X2-X3-X4-G-X6, wherein X1 is aspartic acid; X2 is lysine; X3 is lysine; X4 is serine; X5 is glycine; and X6 is glutamic acid; and/or (vii) comprise or consist of the amino acid sequence X1-X2-X3-X4-G-X6, wherein X1 is cysteine; X2 is asparagine; X3 is glutamine; X4 is cysteine; X5 is glycine; and X6 is glutamine.
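The G-loop sequence definitions above can be expressed compactly as patterns over standard one-letter amino acid codes. The sketch below encodes item (iv) as a regular expression and lists the three fully specified hexapeptides of items (v) through (vii). The function name `find_g_loop_candidates` and the toy input are hypothetical. A sequence scan of this kind only illustrates the motif definitions; the surface-based methods described herein are intended to identify degrons independently of primary sequence.

```python
import re
from typing import List, Tuple

# Item (iv): X1-X2-X3-X4-G-X6 with the residue choices listed in the text.
# The lookahead lets overlapping matches be reported as well.
G_LOOP_IV = re.compile(r"(?=([NDC][IKN][TKQ][NSC]G[EQ]))")

# Items (v) to (vii): the three fully specified hexapeptides.
G_LOOP_EXACT = ("NITNGE", "DKKSGE", "CNQCGQ")

def find_g_loop_candidates(sequence: str) -> List[Tuple[int, str]]:
    """Return (0-based start, hexapeptide) for every match to item (iv)."""
    return [(m.start(), m.group(1))
            for m in G_LOOP_IV.finditer(sequence.upper())]

# Toy sequence, for illustration only.
hits = find_g_loop_candidates("AANITNGEAA")
```

Each hexapeptide of items (v) through (vii) is itself an instance of the item (iv) pattern, so a single scan covers all four items.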
In some embodiments, the degron(s): (i) comprise or consist of the amino acid motif D-Z-G-X-Z, D-Z-G-X-X-Z, D-Z-G-X-X-X-Z, or D-Z-G-X-X-X-X-Z, wherein D is aspartic acid, each X is independently any naturally occurring amino acid, and Z is selected from the group consisting of pS (phosphorylated serine), aspartic acid, and glutamic acid; (ii) comprise or consist of the amino acid motif X1-X2-X3-X4-X5-X6, wherein X1 is selected from the group consisting of aspartic acid, asparagine, and serine; X2 is any one of the naturally occurring amino acids; X3 is selected from the group consisting of aspartic acid, glutamic acid, and serine; X4 is selected from the group consisting of threonine, asparagine, and serine; X5 is glycine; and X6 is glutamic acid; (iii) comprise or consist of the amino acid motif X1-X2-X3-X4-X5-X6-X7-X8-X9, wherein X1 is leucine; X2 is any one of the naturally occurring amino acids; X3 is any one of the naturally occurring amino acids; X4 is glutamine; X5 is aspartic acid; X6 is any one of the naturally occurring amino acids; X7 is aspartic acid; X8 is leucine; and X9 is glycine; (iv) comprise or consist of the amino acid motif ETGE (SEQ ID NO: 1); (v) comprise or consist of the amino acid motif DLG; (vi) comprise or consist of the amino acid motif X1-X2-X3-X4-X5-X6-X7-X8, wherein X1 is phenylalanine; X2 is any one of the naturally occurring amino acids; X3 is any one of the naturally occurring amino acids; X4 is any one of the naturally occurring amino acids; X5 is tryptophan; X6 is any one of the naturally occurring amino acids; X7 is any one of the naturally occurring amino acids; and X8 is selected from the group consisting of valine, isoleucine, and leucine, wherein, in some cases, the degron comprising or consisting of this motif forms an α-helix; and/or (vii) comprise or consist of the amino acid motif X1-X2-X3-X4-X5-X6, wherein X1 is leucine; X2 is any naturally occurring amino acid; X3 is any naturally occurring amino acid; X4 is leucine; X5 is alanine; and X6 is proline or hydroxylated proline (e.g., 4(R)-L-hydroxyproline).
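The motifs of items (i) through (vii) above can likewise be written as patterns over one-letter amino acid codes, as in the following sketch. Several choices here are assumptions made for illustration: a lowercase `s` is used as a hypothetical notation for phosphoserine (pS), hydroxylated proline is not represented (only unmodified proline is matched in item (vii)), and the dictionary `MOTIFS` and the function `matching_motifs` are not part of the source. The test strings are toy inputs.

```python
import re
from typing import List

AA = "[ACDEFGHIKLMNPQRSTVWY]"   # any of the 20 standard amino acids
Z = "[DEs]"                     # D, E, or pS (lowercase s is an assumed notation)

# Keys refer to the item numerals used in the text above.
MOTIFS = {
    "i":   re.compile(f"D{Z}G{AA}{{1,4}}{Z}"),        # D-Z-G-X(1 to 4)-Z
    "ii":  re.compile(f"[DNS]{AA}[DES][TNS]GE"),
    "iii": re.compile(f"L{AA}{{2}}QD{AA}DLG"),
    "iv":  re.compile("ETGE"),
    "v":   re.compile("DLG"),
    "vi":  re.compile(f"F{AA}{{3}}W{AA}{{2}}[VIL]"),
    "vii": re.compile(f"L{AA}{{2}}LAP"),              # unmodified proline only
}

def matching_motifs(sequence: str) -> List[str]:
    """Return the item numerals whose motif occurs anywhere in the sequence."""
    return [name for name, pattern in MOTIFS.items() if pattern.search(sequence)]
```

Because the shorter motifs are contained in some of the longer ones (e.g., DLG within item (iii)), a single stretch of sequence can match more than one item.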
In some embodiments, the E3 ligase comprises an E3 ligase substrate receptor protein selected from the group consisting of CRBN (SEQ ID NO: 3), CRBN isoform 2 (SEQ ID NO: 2), VHL (SEQ ID NO: 9), BIRC1 (SEQ ID NO: 10), BIRC2 (SEQ ID NO: 11), BIRC3 (SEQ ID NO: 12), BIRC4 (SEQ ID NO: 13), BIRC5 (SEQ ID NO: 14), BIRC6 (SEQ ID NO: 15), BIRC7 (SEQ ID NO: 16), BIRC8 (SEQ ID NO: 17), KEAP1 (SEQ ID NO: 18), DCAF15 (SEQ ID NO: 19), RNF4 (SEQ ID NO: 20), RNF4 isoform 2 (SEQ ID NO: 21), RNF114 (SEQ ID NO: 22), RNF114 isoform 2 (SEQ ID NO: 23), DCAF16 (SEQ ID NO: 24), AHR (SEQ ID NO: 25), MDM2 (SEQ ID NO: 26), UBR2 (SEQ ID NO: 27), SPOP (SEQ ID NO: 28), KLHL3 (SEQ ID NO: 29), KLHL12 (SEQ ID NO: 30), KLHL20 (SEQ ID NO: 31), KLHDC2 (SEQ ID NO: 32), SPSB1 (SEQ ID NO: 33), SPSB2 (SEQ ID NO: 34), SPSB4 (SEQ ID NO: 35), SOCS2 (SEQ ID NO: 36), SOCS6 (SEQ ID NO: 37), FBXO4 (SEQ ID NO: 38), FBXO31 (SEQ ID NO: 39), BTRC (SEQ ID NO: 40), FBW7 (SEQ ID NO: 41), CDC20 (SEQ ID NO: 42), ITCH (SEQ ID NO: 43), PML (SEQ ID NO: 44), TRIM21 (SEQ ID NO: 45), TRIM24 (SEQ ID NO: 46), TRIM33 (SEQ ID NO: 47), GID4 (SEQ ID NO: 48), and DCAF11 (SEQ ID NO: 49).
In some embodiments, the E3 ligase binding modulator is a compound shown in Table 1 or Table 2, or a pharmaceutically acceptable salt thereof, or a stereoisomer thereof.
In some embodiments, the second set of one or more protein(s) or set of potential neosubstrates comprises or consists of one or more of the proteins in Table 3.
In some embodiments: (i) the E3 ligase comprises the E3 ligase substrate receptor CRBN and the degron(s) are G-loop degron(s); (ii) the E3 ligase comprises the E3 ligase substrate receptor BTRC and the degron(s) comprise or consist of the amino acid motif D-Z-G-X-Z, D-Z-G-X-X-Z, D-Z-G-X-X-X-Z, or D-Z-G-X-X-X-X-Z, wherein D is aspartic acid, each X is independently any naturally occurring amino acid, and Z is selected from the group consisting of pS (phosphorylated serine), aspartic acid, and glutamic acid; (iii) the E3 ligase comprises the E3 ligase substrate receptor KEAP1 and the degron(s) comprise or consist of the amino acid motif X1-X2-X3-X4-X5-X6, wherein X1 is selected from the group consisting of aspartic acid, asparagine, and serine; X2 is any one of the naturally occurring amino acids; X3 is selected from the group consisting of aspartic acid, glutamic acid, and serine; X4 is selected from the group consisting of threonine, asparagine, and serine; X5 is glycine; and X6 is glutamic acid; (iv) the E3 ligase comprises the E3 ligase substrate receptor KEAP1 and the degron(s) comprise or consist of the amino acid motif X1-X2-X3-X4-X5-X6-X7-X8-X9, wherein X1 is leucine; X2 is any one of the naturally occurring amino acids; X3 is any one of the naturally occurring amino acids; X4 is glutamine; X5 is aspartic acid; X6 is any one of the naturally occurring amino acids; X7 is aspartic acid; X8 is leucine; and X9 is glycine; (v) the E3 ligase comprises the E3 ligase substrate receptor KEAP1 and the degron(s) comprise or consist of the amino acid motif ETGE (SEQ ID NO: 1) and/or DLG; (vi) the E3 ligase comprises the E3 ligase substrate receptor MDM2 and the degron(s) comprise or consist of the amino acid motif X1-X2-X3-X4-X5-X6-X7-X8, wherein X1 is phenylalanine; X2 is any one of the naturally occurring amino acids; X3 is any one of the naturally occurring amino acids; X4 is any one of the naturally occurring amino acids; X5 is tryptophan; X6 is any one of the naturally occurring amino acids; X7 is any one of the naturally occurring amino acids; and X8 is selected from the group consisting of valine, isoleucine, and leucine; (vii) the E3 ligase comprises the E3 ligase substrate receptor MDM2 and the degron(s) comprise or consist of the amino acid motif X1-X2-X3-X4-X5-X6-X7-X8, wherein X1 is phenylalanine; X2 is any one of the naturally occurring amino acids; X3 is any one of the naturally occurring amino acids; X4 is any one of the naturally occurring amino acids; X5 is tryptophan; X6 is any one of the naturally occurring amino acids; X7 is any one of the naturally occurring amino acids; and X8 is selected from the group consisting of valine, isoleucine, and leucine, and wherein the degron(s) form an α-helix; or (viii) the E3 ligase comprises the E3 ligase substrate receptor VHL and the degron(s) comprise or consist of the amino acid motif X1-X2-X3-X4-X5-X6, wherein X1 is leucine; X2 is any naturally occurring amino acid; X3 is any naturally occurring amino acid; X4 is leucine; X5 is alanine; and X6 is proline or hydroxylated proline (e.g., 4(R)-L-hydroxyproline).
In some embodiments, the molecular surface features comprise geometric and/or chemical features. In some embodiments, the geometric features are selected from the group consisting of shape index, distance-dependent curvature, geodesic polar coordinates, radial (angular) coordinates, and combinations thereof. In some embodiments, the chemical features are selected from the group consisting of hydropathy index, continuum electrostatics, location of free electrons, location of free proton donors, and combinations thereof. In some embodiments, the similarity score is calculated using a geometric deep learning model. In some embodiments, the geometric deep learning model is a neural network. In some embodiments, the neural network is trained on complementarity of E3 ligase surface(s) to known degron surface(s). In some embodiments, the neural network is trained on similarity to known and/or predicted degron surface(s).
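As one concrete example of a geometric surface feature, the shape index at a surface point can be computed from the two principal curvatures. The sketch below uses one common form, S = (2/π)·arctan((κ1 + κ2)/(κ1 − κ2)) with κ1 ≥ κ2, which maps local shape onto the interval [−1, 1]. Sign conventions differ between references and depend on the orientation of the surface normal, so this particular form, the function name `shape_index`, and the handling of planar points are assumptions rather than the definition used in the methods described herein.

```python
import math
from typing import Optional

def shape_index(k_a: float, k_b: float) -> Optional[float]:
    """Shape index in [-1, 1] from two principal curvatures.

    Returns None for a planar point (both curvatures zero), where the
    shape index is undefined.
    """
    k1, k2 = max(k_a, k_b), min(k_a, k_b)   # enforce k1 >= k2
    if k1 == 0.0 and k2 == 0.0:
        return None
    # atan2 handles the umbilic case k1 == k2 without dividing by zero.
    return (2.0 / math.pi) * math.atan2(k1 + k2, k1 - k2)
```

Under this convention, equal curvatures of the same sign give ±1 (cap-like or cup-like points), a curvature pair of equal magnitude and opposite sign gives 0 (a saddle), and a single non-zero curvature gives ±0.5 (a ridge or rut).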
In some embodiments, the second set of proteins comprises proteins that are not in the first set of proteins. In some embodiments, the second set of proteins does not include any proteins from the first set of proteins.
In some embodiments, the first set of molecular surface features consists of molecular surface features from one or more protein(s) comprising one or more known degron(s) of an E3 ligase substrate receptor. In some embodiments, the first set of molecular surface features consists of molecular surface features from one or more protein(s) comprising one or more predicted degron(s) of the E3 ligase substrate receptor. In some embodiments, the first set of molecular surface features consists of molecular surface features from one or more protein(s) comprising one or more known degron(s) of an E3 ligase substrate receptor and molecular surface feature(s) of one or more protein(s) comprising one or more predicted degron(s) of the E3 ligase substrate receptor.
In some embodiments, the known degron(s) of an E3 ligase substrate receptor are derived from a crystal structure.
Also provided herein are methods for generating a degron complementarity score for one or more protein(s), comprising: a) providing a first set of molecular surface features from a first set of one or more protein(s) comprising one or more E3 ligase substrate receptor proteins; b) providing a second set of molecular surface features from a second set of one or more protein(s); and c) calculating a complementarity score for the protein(s) of the second set by comparing the first and second sets of molecular surface features.
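A toy illustration of step c) is given below. It rests on a simplifying heuristic that is an assumption of this sketch, not a statement of the method: two patches are treated as complementary when shape and electrostatics are opposite in sign and hydropathy is similar. All feature names, the scaling of each feature to [−1, 1], and the function `complementarity_score` are hypothetical; in practice the complementarity score may be calculated differently, e.g., using a geometric deep learning model trained on complementarity of E3 ligase surfaces to known degron surfaces.

```python
from typing import Dict

# Assumed feature groups; every feature is assumed to be scaled to [-1, 1].
FLIPPED = ("shape_index", "electrostatics")   # expected to be opposite in sign
SHARED = ("hydropathy",)                      # expected to be similar

def complementarity_score(receptor: Dict[str, float],
                          candidate: Dict[str, float]) -> float:
    """Score in [0, 1]; 1.0 means perfectly complementary under the heuristic."""
    # Mismatch is 0 when flipped features are exactly opposite ...
    mismatches = [abs(receptor[k] + candidate[k]) / 2.0 for k in FLIPPED]
    # ... and when shared features are exactly equal.
    mismatches += [abs(receptor[k] - candidate[k]) / 2.0 for k in SHARED]
    return 1.0 - sum(mismatches) / len(mismatches)

# Hypothetical patch descriptors for an E3 ligase substrate receptor surface
# and for a candidate protein surface.
receptor_patch = {"shape_index": -0.8, "electrostatics": 0.5, "hydropathy": 0.2}
candidate_patch = {"shape_index": 0.8, "electrostatics": -0.5, "hydropathy": 0.2}

score = complementarity_score(receptor_patch, candidate_patch)
```

Under this heuristic, a candidate patch that merely duplicates the receptor patch scores lower than one that mirrors it, which is the distinction between degron mimicry (similarity) and E3 complementarity.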
Also provided herein are methods for identifying a predicted neosubstrate of an E3 ligase, comprising: a) calculating a degron complementarity score for one or more protein(s) according to any one of the methods described herein; and b) based on the complementarity score, identifying one or more of the protein(s) of the second set as a predicted neosubstrate(s) of the E3 ligase.
Also provided herein are methods for identifying a putative neosubstrate of an E3 ligase, comprising: a) identifying a predicted neosubstrate according to any one of the methods described herein; b) for one or more of the predicted neosubstrate(s), testing or having tested the predicted neosubstrate in an E3 ligase substrate detection assay without a binding modulator of the E3 ligase to determine if the putative neosubstrate is a substrate of the E3 ligase; and c) if, based on said testing or having tested, the predicted neosubstrate is not determined to be a substrate of the E3 ligase, identifying the predicted neosubstrate as a putative neosubstrate of the E3 ligase.
Also provided herein are methods for classifying protein(s) as substrate(s) and/or putative neosubstrate(s) of an E3 ligase, comprising: a) calculating a degron complementarity score for one or more protein(s) according to any one of the methods described herein; b) based on the complementarity score, identifying the protein(s) of the second set as a predicted neosubstrate of the E3 ligase or not; c) for one or more of the predicted neosubstrate(s), testing or having tested the predicted neosubstrate in an E3 ligase substrate detection assay without a binding modulator of the E3 ligase to determine if the predicted neosubstrate is a substrate of the E3 ligase; and d) i) if, based on said testing or having tested, the predicted neosubstrate is determined to be a substrate of the E3 ligase, classifying the predicted neosubstrate as a substrate of the E3 ligase; else ii) if, based on said testing or having tested, the predicted neosubstrate is not determined to be a substrate of the E3 ligase, classifying the predicted neosubstrate as a putative neosubstrate of the E3 ligase, thereby classifying protein(s) as substrate(s) and/or putative neosubstrate(s) of an E3 ligase.
Also provided herein are methods for selecting putative neosubstrate(s) of an E3 ligase from a set of potential neosubstrates, comprising: a) calculating a degron complementarity score for one or more protein(s) according to any one of the methods described herein; b) based on the complementarity score, identifying a subset of the potential neosubstrates as predicted neosubstrate(s); c) for one or more of the predicted neosubstrate(s), testing or having tested the predicted neosubstrate in an E3 ligase substrate detection assay without a binding modulator of the E3 ligase to determine if the predicted neosubstrate is a substrate of the E3 ligase; and d) if, based on said testing or having tested, the predicted neosubstrate is determined not to be a substrate of the E3 ligase, identifying the predicted neosubstrate as a putative neosubstrate of the E3 ligase and selecting it from the set of potential neosubstrates, thereby selecting putative neosubstrate(s) of an E3 ligase from a set of potential neosubstrates.
In some embodiments, the E3 ligase substrate detection assay is selected from the group consisting of a proximity assay, a binding assay, and a degradation assay. In some embodiments: (i) the E3 ligase substrate detection assay is a proximity assay and the predicted neosubstrate is determined to be a substrate of the E3 ligase if an interaction between the putative neosubstrate and E3 ligase is detected; (ii) the E3 ligase substrate detection assay is a binding assay and the predicted neosubstrate is determined to be a substrate of the E3 ligase if binding of the neosubstrate and E3 ligase is detected; or (iii) the E3 ligase substrate detection assay is a degradation assay and the predicted neosubstrate is determined to be a substrate of the E3 ligase if degradation of the predicted neosubstrate is detected.
Also provided herein are methods of identifying a neosubstrate of an E3 ligase, comprising: testing or having tested a putative neosubstrate identified, classified, or selected by any one of the methods described herein in an E3 ligase substrate detection assay with a binding modulator of the E3 ligase, and, if, based on said testing or having tested, the putative neosubstrate is determined to be a substrate of the E3 ligase in the presence of the E3 ligase binding modulator, identifying the putative neosubstrate as a neosubstrate of the E3 ligase.
In some embodiments: (i) the E3 ligase substrate detection assay is a proximity assay and the putative neosubstrate is determined to be a substrate of the E3 ligase in the presence of the E3 ligase binding modulator if an interaction between the putative neosubstrate and E3 ligase is detected; (ii) the E3 ligase substrate detection assay is a binding assay and the putative neosubstrate is determined to be a substrate of the E3 ligase in the presence of the E3 ligase binding modulator if binding of the neosubstrate and E3 ligase is detected; or (iii) the E3 ligase substrate detection assay is a degradation assay and the putative neosubstrate is determined to be a substrate of the E3 ligase in the presence of the E3 ligase binding modulator if degradation of the predicted neosubstrate is detected.
In some embodiments, the one or more degron(s) are selected from the group consisting of N-degrons, C-degrons, phosphodegrons, oxygen-dependent degrons, G-loop degrons, and combinations thereof. In some embodiments, the degron(s) are N-degrons, C-degrons, phosphodegrons, oxygen-dependent degrons, or G-loop degrons. In some embodiments, the G-loop degron(s): (i) comprise or consist of the amino acid sequence X1-X2-X3-X4-G-X6, wherein: each of X1, X2, X3, X4, and X6 is independently selected from any one of the naturally occurring amino acids; and G (i.e., X5) is glycine; (ii) comprise or consist of the amino acid sequence X1-X2-X3-X4-G-X6-X7, wherein: each of X1, X2, X3, X4, X6, and X7 is independently selected from any one of the naturally occurring amino acids; and G (i.e., X5) is glycine; (iii) comprise or consist of the amino acid sequence X1-X2-X3-X4-G-X6-X7-X8, wherein: each of X1, X2, X3, X4, X6, X7, and X8 is independently selected from any one of the naturally occurring amino acids; and G (i.e., X5) is glycine; (iv) comprise or consist of the amino acid sequence X1-X2-X3-X4-G-X6, wherein X1 is selected from the group consisting of asparagine, aspartic acid, and cysteine; X2 is selected from the group consisting of isoleucine, lysine, and asparagine; X3 is selected from the group consisting of threonine, lysine, and glutamine; X4 is selected from the group consisting of asparagine, serine, and cysteine; X5 is glycine; and X6 is selected from the group consisting of glutamic acid and glutamine; (v) comprise or consist of the amino acid sequence X1-X2-X3-X4-G-X6, wherein X1 is asparagine; X2 is isoleucine; X3 is threonine; X4 is asparagine; X5 is glycine; and X6 is glutamic acid; (vi) comprise or consist of the amino acid sequence X1-X2-X3-X4-G-X6, wherein X1 is aspartic acid; X2 is lysine; X3 is lysine; X4 is serine; X5 is glycine; and X6 is glutamic acid; and/or (vii) comprise or consist of the amino acid sequence X1-X2-X3-X4-G-X6, wherein X1 is cysteine; X2 is asparagine; X3 is glutamine; X4 is cysteine; X5 is glycine; and X6 is glutamine.
In some embodiments, the degron(s): (i) comprise or consist of the amino acid motif D-Z-G-X-Z, D-Z-G-X-X-Z, D-Z-G-X-X-X-Z, or D-Z-G-X-X-X-X-Z, wherein D is aspartic acid, each X is independently any naturally occurring amino acid, and Z is selected from the group consisting of pS (phosphorylated serine), aspartic acid, and glutamic acid; (ii) comprise or consist of the amino acid motif X1-X2-X3-X4-X5-X6, wherein X1 is selected from the group consisting of aspartic acid, asparagine, and serine; X2 is any one of the naturally occurring amino acids; X3 is selected from the group consisting of aspartic acid, glutamic acid, and serine; X4 is selected from the group consisting of threonine, asparagine, and serine; X5 is glycine; and X6 is glutamic acid; (iii) comprise or consist of the amino acid motif X1-X2-X3-X4-X5-X6-X7-X8-X9, wherein X1 is leucine; X2 is any one of the naturally occurring amino acids; X3 is any one of the naturally occurring amino acids; X4 is glutamine; X5 is aspartic acid; X6 is any one of the naturally occurring amino acids; X7 is aspartic acid; X8 is leucine; and X9 is glycine; (iv) comprise or consist of the amino acid motif ETGE (SEQ ID NO: 1); (v) comprise or consist of the amino acid motif DLG; (vi) comprise or consist of the amino acid motif X1-X2-X3-X4-X5-X6-X7-X8, wherein X1 is phenylalanine; X2 is any one of the naturally occurring amino acids; X3 is any one of the naturally occurring amino acids; X4 is any one of the naturally occurring amino acids; X5 is tryptophan; X6 is any one of the naturally occurring amino acids; X7 is any one of the naturally occurring amino acids; and X8 is selected from the group consisting of valine, isoleucine, and leucine, wherein, in some cases, the degron comprising or consisting of this motif forms an α-helix; and/or (vii) comprise or consist of the amino acid motif X1-X2-X3-X4-X5-X6, wherein X1 is leucine; X2 is any naturally occurring amino acid; X3 is any naturally occurring amino acid; X4 is leucine; X5 is alanine; and X6 is proline or hydroxylated proline (e.g., 4(R)-L-hydroxyproline).
In some embodiments, the E3 ligase comprises an E3 ligase substrate receptor protein selected from the group consisting of CRBN (SEQ ID NO: 3), CRBN isoform 2 (SEQ ID NO: 2), VHL (SEQ ID NO: 9), BIRC1 (SEQ ID NO: 10), BIRC2 (SEQ ID NO: 11), BIRC3 (SEQ ID NO: 12), BIRC4 (SEQ ID NO: 13), BIRC5 (SEQ ID NO: 14), BIRC6 (SEQ ID NO: 15), BIRC7 (SEQ ID NO: 16), BIRC8 (SEQ ID NO: 17), KEAP1 (SEQ ID NO: 18), DCAF15 (SEQ ID NO: 19), RNF4 (SEQ ID NO: 20), RNF4 isoform 2 (SEQ ID NO: 21), RNF114 (SEQ ID NO: 22), RNF114 isoform 2 (SEQ ID NO: 23), DCAF16 (SEQ ID NO: 24), AHR (SEQ ID NO: 25), MDM2 (SEQ ID NO: 26), UBR2 (SEQ ID NO: 27), SPOP (SEQ ID NO: 28), KLHL3 (SEQ ID NO: 29), KLHL12 (SEQ ID NO: 30), KLHL20 (SEQ ID NO: 31), KLHDC2 (SEQ ID NO: 32), SPSB1 (SEQ ID NO: 33), SPSB2 (SEQ ID NO: 34), SPSB4 (SEQ ID NO: 35), SOCS2 (SEQ ID NO: 36), SOCS6 (SEQ ID NO: 37), FBXO4 (SEQ ID NO: 38), FBXO31 (SEQ ID NO: 39), BTRC (SEQ ID NO: 40), FBW7 (SEQ ID NO: 41), CDC20 (SEQ ID NO: 42), ITCH (SEQ ID NO: 43), PML (SEQ ID NO: 44), TRIM21 (SEQ ID NO: 45), TRIM24 (SEQ ID NO: 46), TRIM33 (SEQ ID NO: 47), GID4 (SEQ ID NO: 48), and DCAF11 (SEQ ID NO: 49).
In some embodiments, the E3 ligase binding modulator is a compound shown in Table 1 or Table 2, or a pharmaceutically acceptable salt thereof, or a stereoisomer thereof.
In some embodiments, the second set of one or more protein(s) or set of potential neosubstrates comprises or consists of one or more of the proteins in Table 3.
In some embodiments: (i) the E3 ligase comprises the E3 ligase substrate receptor CRBN and the degron(s) are G-loop degron(s); (ii) the E3 ligase comprises the E3 ligase substrate receptor BTRC and the degron(s) comprise or consists of the amino acid motif D-Z-G-X-Z, D-Z-G-X-X-Z, D-Z-G-X-X-X-Z, or D-Z-G-X-X-X-X-Z, wherein D is aspartic acid, each X is independently any naturally occurring amino acid, and Z is selected from the group consisting of pS (phosphorylated serine), aspartic acid, and glutamic acid; (iii) the E3 ligase comprises the E3 ligase substrate receptor KEAP1 and the degron(s) comprise or consists of the amino acid motif X1-X2-X3-X4-X5-X6, wherein X1 is selected from the group consisting of aspartic acid, asparagine, and serine; X2 is any one of the naturally occurring amino acids; X3 is selected from the group consisting of aspartic acid, glutamic acid, and serine; X4 is selected from the group consisting of threonine, asparagine, and serine; X5 is glycine; and X6 is glutamic acid;
In some embodiments, the molecular surface features comprise geometric and/or chemical features. In some embodiments, the geometric features are selected from the group consisting of shape index, distance-dependent curvature, geodesic polar coordinates, radial (angular) coordinates, and combinations thereof. In some embodiments, the chemical features are selected from the group consisting of hydropathy index, continuum electrostatics, location of free electrons, location of free proton donors, and combinations thereof. In some embodiments, the complementarity score is calculated using a geometric deep learning model. In some embodiments, the geometric deep learning model is a neural network. In some embodiments, the neural network is trained on complementarity of E3 ligase surface(s) to known degron surface(s). In some embodiments, the neural network is trained on similarity to known and/or predicted degron surface(s).
In some embodiments, the second set of proteins comprises proteins that are not in the first set of proteins. In some embodiments, the second set of proteins does not include any proteins from the first set of proteins.
Also provided herein are methods for generating a degron score for one or more protein(s), comprising: a) providing a set of molecular surface features from a set of one or more protein(s); and b) calculating a degron score for the protein(s) by comparing the molecular surface features to a reference set of molecular surface(s).
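As a rough illustration of the calculating step, the sketch below scores a surface-feature vector by its best match against a reference set. This is not the scoring method of this disclosure, which in some embodiments uses a geometric deep learning model; plain cosine similarity is swapped in here purely for illustration. The function names and the toy feature vectors are hypothetical.

```python
import math
from typing import Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length feature vectors."""
    if len(u) != len(v):
        raise ValueError("feature vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # assumption: an all-zero vector matches nothing
    return dot / (norm_u * norm_v)


def degron_score(features: Sequence[float],
                 reference_set: Sequence[Sequence[float]]) -> float:
    """Hypothetical score: best similarity to any reference surface."""
    if not reference_set:
        raise ValueError("reference set must not be empty")
    return max(cosine_similarity(features, ref) for ref in reference_set)


# Toy (made-up) feature vectors standing in for surface descriptors.
reference = [[1.0, 0.0, 0.5], [0.2, 0.9, 0.1]]
query_similar = [2.0, 0.0, 1.0]    # parallel to the first reference vector
query_dissimilar = [0.0, 0.0, -1.0]
```

Taking the maximum over the reference set is one possible design choice; an average or a learned aggregation would be equally plausible under the text above.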
Also provided herein are methods for identifying a predicted neosubstrate of an E3 ligase, comprising: a) calculating a degron score for one or more protein(s) according to any one of the methods described herein; and b) based on the degron score, identifying one or more of the protein(s) of the second set as a predicted neosubstrate(s) of the E3 ligase.
Also provided herein are methods for identifying a putative neosubstrate of an E3 ligase, comprising: a) identifying a predicted neosubstrate according to any one of the methods described herein; b) for one or more of the predicted neosubstrate(s), testing or having tested the predicted neosubstrate in an E3 ligase substrate detection assay without a binding modulator of the E3 ligase to determine if the predicted neosubstrate is a substrate of the E3 ligase; and c) if, based on said testing or having tested, the predicted neosubstrate is not determined to be a substrate of the E3 ligase, identifying the predicted neosubstrate as a putative neosubstrate of the E3 ligase.
Also described herein are methods for classifying protein(s) as substrate(s) and/or putative neosubstrate(s) of an E3 ligase, comprising: a) calculating a degron score for one or more protein(s) according to any one of the methods described herein; b) based on the degron score, identifying the protein(s) of the second set as a predicted neosubstrate of the E3 ligase or not; c) for one or more of the predicted neosubstrate(s), testing or having tested the predicted neosubstrate in an E3 ligase substrate detection assay without a binding modulator of the E3 ligase to determine if the predicted neosubstrate is a substrate of the E3 ligase; and d) i) if, based on said testing or having tested, the predicted neosubstrate is determined to be a substrate of the E3 ligase, classifying the predicted neosubstrate as a substrate of the E3 ligase; else ii) if, based on said testing or having tested, the predicted neosubstrate is not determined to be a substrate of the E3 ligase, classifying the predicted neosubstrate as a putative neosubstrate of the E3 ligase, thereby classifying protein(s) as substrate(s) and/or putative neosubstrate(s) of an E3 ligase.
Also provided herein are methods for selecting putative neosubstrate(s) of an E3 ligase from a set of potential neosubstrates, comprising: a) calculating a degron score for one or more protein(s) according to any one of the methods described herein; b) based on the degron score, identifying a subset of the potential neosubstrates as predicted neosubstrate(s); c) for one or more of the predicted neosubstrate(s), testing or having tested the predicted neosubstrate in an E3 ligase substrate detection assay without a binding modulator of the E3 ligase to determine if the predicted neosubstrate is a substrate of the E3 ligase; and d) if, based on said testing or having tested, the predicted neosubstrate is determined not to be a substrate of the E3 ligase, identifying the predicted neosubstrate as a putative neosubstrate of the E3 ligase and selecting it from the set of potential neosubstrates, thereby selecting putative neosubstrate(s) of an E3 ligase from a set of potential neosubstrates.
In some embodiments, the E3 ligase substrate detection assay is selected from the group consisting of a proximity assay, a binding assay, and a degradation assay. In some embodiments: (i) the E3 ligase substrate detection assay is a proximity assay and the predicted neosubstrate is determined to be a substrate of the E3 ligase if an interaction between the predicted neosubstrate and the E3 ligase is detected; (ii) the E3 ligase substrate detection assay is a binding assay and the predicted neosubstrate is determined to be a substrate of the E3 ligase if binding of the predicted neosubstrate and the E3 ligase is detected; or (iii) the E3 ligase substrate detection assay is a degradation assay and the predicted neosubstrate is determined to be a substrate of the E3 ligase if degradation of the predicted neosubstrate is detected.
Also provided herein are methods of identifying a neosubstrate of an E3 ligase, comprising: testing or having tested a putative neosubstrate identified, classified, or selected by any one of the methods described herein in an E3 ligase substrate detection assay with a binding modulator of the E3 ligase, and, if, based on said testing or having tested, the putative neosubstrate is determined to be a substrate of the E3 ligase in the presence of the E3 ligase binding modulator, identifying the putative neosubstrate as a neosubstrate of the E3 ligase.
In some embodiments: (i) the E3 ligase substrate detection assay is a proximity assay and the putative neosubstrate is determined to be a substrate of the E3 ligase in the presence of the E3 ligase binding modulator if an interaction between the putative neosubstrate and the E3 ligase is detected; (ii) the E3 ligase substrate detection assay is a binding assay and the putative neosubstrate is determined to be a substrate of the E3 ligase in the presence of the E3 ligase binding modulator if binding of the putative neosubstrate and the E3 ligase is detected; or (iii) the E3 ligase substrate detection assay is a degradation assay and the putative neosubstrate is determined to be a substrate of the E3 ligase in the presence of the E3 ligase binding modulator if degradation of the putative neosubstrate is detected.
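The identification, classification, and selection methods above share one decision sequence, which can be sketched as follows. This is a minimal sketch, not a definitive implementation: the numeric `threshold` is an assumption (the text says only "based on the degron score"), and the `Classification` enum and `classify` function are hypothetical names. Assay outcomes are passed in as booleans because the assays themselves are wet-lab steps.

```python
from enum import Enum
from typing import Optional


class Classification(Enum):
    NOT_PREDICTED = "not a predicted neosubstrate"
    SUBSTRATE = "substrate"
    PUTATIVE_NEOSUBSTRATE = "putative neosubstrate"
    NEOSUBSTRATE = "neosubstrate"


def classify(degron_score: float,
             threshold: float,
             detected_without_modulator: bool,
             detected_with_modulator: Optional[bool] = None) -> Classification:
    """Apply the decision sequence to one protein.

    detected_* report whether the substrate detection assay (proximity,
    binding, or degradation) gave a positive readout without / with the
    E3 ligase binding modulator. None means the assay was not run.
    """
    # Score-based prediction (threshold is an assumed cutoff).
    if degron_score < threshold:
        return Classification.NOT_PREDICTED
    # Positive without modulator: an ordinary substrate.
    if detected_without_modulator:
        return Classification.SUBSTRATE
    # Negative without modulator but positive with it: a neosubstrate.
    if detected_with_modulator:
        return Classification.NEOSUBSTRATE
    # Negative without modulator, modulator assay negative or not run.
    return Classification.PUTATIVE_NEOSUBSTRATE
```

For example, `classify(0.9, 0.5, False, True)` returns `Classification.NEOSUBSTRATE`, while the same protein with no modulator assay yet run stays a putative neosubstrate.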
In some embodiments, the one or more degron(s) are selected from the group consisting of N-degrons, C-degrons, phosphodegrons, oxygen-dependent degrons, G-loop degrons, and combinations thereof. In some embodiments, the degron(s) are N-degrons, C-degrons, phosphodegrons, oxygen-dependent degrons, or G-loop degrons. In some embodiments, the G-loop degron(s): (i) comprise or consist of the amino acid sequence X1-X2-X3-X4-G-X6, wherein: each of X1, X2, X3, X4, and X6 is independently selected from any one of the naturally occurring amino acids; and G (i.e., X5) is glycine; (ii) comprise or consist of the amino acid sequence X1-X2-X3-X4-G-X6-X7, wherein: each of X1, X2, X3, X4, X6, and X7 is independently selected from any one of the naturally occurring amino acids; and G (i.e., X5) is glycine; (iii) comprise or consist of the amino acid sequence X1-X2-X3-X4-G-X6-X7-X8, wherein: each of X1, X2, X3, X4, X6, X7, and X8 is independently selected from any one of the naturally occurring amino acids; and G (i.e., X5) is glycine; (iv) comprise or consist of the amino acid sequence X1-X2-X3-X4-G-X6, wherein X1 is selected from the group consisting of asparagine, aspartic acid, and cysteine; X2 is selected from the group consisting of isoleucine, lysine, and asparagine; X3 is selected from the group consisting of threonine, lysine, and glutamine; X4 is selected from the group consisting of asparagine, serine, and cysteine; X5 is glycine; and X6 is selected from the group consisting of glutamic acid and glutamine; (v) comprise or consist of the amino acid sequence X1-X2-X3-X4-G-X6, wherein X1 is asparagine; X2 is isoleucine; X3 is threonine; X4 is asparagine; X5 is glycine; and X6 is glutamic acid; (vi) comprise or consist of the amino acid sequence X1-X2-X3-X4-G-X6, wherein X1 is aspartic acid; X2 is lysine; X3 is lysine; X4 is serine; X5 is glycine; and X6 is glutamic acid; and/or (vii) comprise or consist of the amino acid sequence X1-X2-X3-X4-G-X6, wherein X1 is cysteine; X2 is asparagine; X3 is glutamine; X4 is cysteine; X5 is glycine; and X6 is glutamine.
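The G-loop sequence definitions above translate directly into regular expressions. The following sketch assumes one-letter amino acid codes restricted to the 20 standard residues; the dictionary keys, the `find_motifs` helper, and the toy sequence are hypothetical. Note that sequence matching alone says nothing about whether a segment actually adopts a G-loop conformation.

```python
import re

AA = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids, one-letter codes

GLOOP_PATTERNS = {
    # X1-X2-X3-X4-G-X6 with every X unrestricted
    "generic_6mer": f"[{AA}]{{4}}G[{AA}]",
    # X1 in {N,D,C}; X2 in {I,K,N}; X3 in {T,K,Q}; X4 in {N,S,C}; G; X6 in {E,Q}
    "restricted_6mer": "[NDC][IKN][TKQ][NSC]G[EQ]",
}


def find_motifs(sequence: str, pattern: str) -> list[tuple[int, str]]:
    """Return (1-based start position, matched segment) for every hit.

    A lookahead is used so that overlapping hits are all reported.
    """
    regex = re.compile(f"(?=({pattern}))")
    return [(m.start() + 1, m.group(1)) for m in regex.finditer(sequence)]


# Toy sequence (made up) embedding NITNGE and DKKSGE.
toy = "MANITNGEAADKKSGEK"
hits = find_motifs(toy, GLOOP_PATTERNS["restricted_6mer"])
# hits == [(3, "NITNGE"), (11, "DKKSGE")]
```

The 7-mer and 8-mer variants would follow by appending one or two more `[{AA}]` classes to the generic pattern.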
In some embodiments, the degron(s): (i) comprise or consist of the amino acid motif D-Z-G-X-Z, D-Z-G-X-X-Z, D-Z-G-X-X-X-Z, or D-Z-G-X-X-X-X-Z, wherein D is aspartic acid, each X is independently any naturally occurring amino acid, and Z is selected from the group consisting of pS (phosphorylated serine), aspartic acid, and glutamic acid; (ii) comprise or consist of the amino acid motif X1-X2-X3-X4-X5-X6, wherein X1 is selected from the group consisting of aspartic acid, asparagine, and serine; X2 is any one of the naturally occurring amino acids; X3 is selected from the group consisting of aspartic acid, glutamic acid, and serine; X4 is selected from the group consisting of threonine, asparagine, and serine; X5 is glycine; and X6 is glutamic acid; (iii) comprise or consist of the amino acid motif X1-X2-X3-X4-X5-X6-X7-X8-X9, wherein X1 is leucine; X2 is any one of the naturally occurring amino acids; X3 is any one of the naturally occurring amino acids; X4 is glutamine; X5 is aspartic acid; X6 is any one of the naturally occurring amino acids; X7 is aspartic acid; X8 is leucine; and X9 is glycine; (iv) comprise or consist of the amino acid motif ETGE (SEQ ID NO: 1); (v) comprise or consist of the amino acid motif DLG; (vi) comprise or consist of the amino acid motif X1-X2-X3-X4-X5-X6-X7-X8, wherein X1 is phenylalanine; X2 is any one of the naturally occurring amino acids; X3 is any one of the naturally occurring amino acids; X4 is any one of the naturally occurring amino acids; X5 is tryptophan; X6 is any one of the naturally occurring amino acids; X7 is any one of the naturally occurring amino acids; and X8 is selected from the group consisting of valine, isoleucine, and leucine, wherein, in some cases, the degron comprising or consisting of this motif forms an α-helix; and/or (vii) comprise or consist of the amino acid motif X1-X2-X3-X4-X5-X6, wherein X1 is leucine; X2 is any naturally occurring amino acid; X3 is any naturally occurring amino acid; X4 is leucine; X5 is alanine; and X6 is proline or hydroxylated proline (e.g., 4(R)-L-hydroxyproline).
In some embodiments, the E3 ligase comprises an E3 ligase substrate receptor protein selected from the group consisting of CRBN (SEQ ID NO: 3), CRBN isoform 2 (SEQ ID NO: 2), VHL (SEQ ID NO: 9), BIRC1 (SEQ ID NO: 10), BIRC2 (SEQ ID NO: 11), BIRC3 (SEQ ID NO: 12), BIRC4 (SEQ ID NO: 13), BIRC5 (SEQ ID NO: 14), BIRC6 (SEQ ID NO: 15), BIRC7 (SEQ ID NO: 16), BIRC8 (SEQ ID NO: 17), KEAP1 (SEQ ID NO: 18), DCAF15 (SEQ ID NO: 19), RNF4 (SEQ ID NO: 20), RNF4 isoform 2 (SEQ ID NO: 21), RNF114 (SEQ ID NO: 22), RNF114 isoform 2 (SEQ ID NO: 23), DCAF16 (SEQ ID NO: 24), AHR (SEQ ID NO: 25), MDM2 (SEQ ID NO: 26), UBR2 (SEQ ID NO: 27), SPOP (SEQ ID NO: 28), KLHL3 (SEQ ID NO: 29), KLHL12 (SEQ ID NO: 30), KLHL20 (SEQ ID NO: 31), KLHDC2 (SEQ ID NO: 32), SPSB1 (SEQ ID NO: 33), SPSB2 (SEQ ID NO: 34), SPSB4 (SEQ ID NO: 35), SOCS2 (SEQ ID NO: 36), SOCS6 (SEQ ID NO: 37), FBXO4 (SEQ ID NO: 38), FBXO31 (SEQ ID NO: 39), BTRC (SEQ ID NO: 40), FBW7 (SEQ ID NO: 41), CDC20 (SEQ ID NO: 42), ITCH (SEQ ID NO: 43), PML (SEQ ID NO: 44), TRIM21 (SEQ ID NO: 45), TRIM24 (SEQ ID NO: 46), TRIM33 (SEQ ID NO: 47), GID4 (SEQ ID NO: 48), and DCAF11 (SEQ ID NO: 49).
In some embodiments, the E3 ligase binding modulator is a compound shown in Table 1 or Table 2, or a pharmaceutically acceptable salt thereof, or a stereoisomer thereof.
In some embodiments, the second set of one or more protein(s) or set of potential neosubstrates comprises or consists of one or more of the proteins in Table 3.
In some embodiments: (i) the E3 ligase comprises the E3 ligase substrate receptor CRBN and the degron(s) are G-loop degron(s); (ii) the E3 ligase comprises the E3 ligase substrate receptor BTRC and the degron(s) comprise or consist of the amino acid motif D-Z-G-X-Z, D-Z-G-X-X-Z, D-Z-G-X-X-X-Z, or D-Z-G-X-X-X-X-Z, wherein D is aspartic acid, each X is independently any naturally occurring amino acid, and Z is selected from the group consisting of pS (phosphorylated serine), aspartic acid, and glutamic acid; (iii) the E3 ligase comprises the E3 ligase substrate receptor KEAP1 and the degron(s) comprise or consist of the amino acid motif X1-X2-X3-X4-X5-X6, wherein X1 is selected from the group consisting of aspartic acid, asparagine, and serine; X2 is any one of the naturally occurring amino acids; X3 is selected from the group consisting of aspartic acid, glutamic acid, and serine; X4 is selected from the group consisting of threonine, asparagine, and serine; X5 is glycine; and X6 is glutamic acid; (iv) the E3 ligase comprises the E3 ligase substrate receptor KEAP1 and the degron(s) comprise or consist of the amino acid motif X1-X2-X3-X4-X5-X6-X7-X8-X9, wherein X1 is leucine; X2 is any one of the naturally occurring amino acids; X3 is any one of the naturally occurring amino acids; X4 is glutamine; X5 is aspartic acid; X6 is any one of the naturally occurring amino acids; X7 is aspartic acid; X8 is leucine; and X9 is glycine; (v) the E3 ligase comprises the E3 ligase substrate receptor KEAP1 and the degron(s) comprise or consist of the amino acid motif ETGE (SEQ ID NO: 1) and/or DLG; (vi) the E3 ligase comprises the E3 ligase substrate receptor MDM2 and the degron(s) comprise or consist of the amino acid motif X1-X2-X3-X4-X5-X6-X7-X8, wherein X1 is phenylalanine; X2 is any one of the naturally occurring amino acids; X3 is any one of the naturally occurring amino acids; X4 is any one of the naturally occurring amino acids; X5 is tryptophan; X6 is any one of the naturally occurring amino acids; X7 is any one of the naturally occurring amino acids; and X8 is selected from the group consisting of valine, isoleucine, and leucine; (vii) the E3 ligase comprises the E3 ligase substrate receptor MDM2 and the degron(s) comprising or consisting of the amino acid motif X1-X2-X3-X4-X5-X6-X7-X8, wherein X1 is phenylalanine; X2 is any one of the naturally occurring amino acids; X3 is any one of the naturally occurring amino acids; X4 is any one of the naturally occurring amino acids; X5 is tryptophan; X6 is any one of the naturally occurring amino acids; X7 is any one of the naturally occurring amino acids; and X8 is selected from the group consisting of valine, isoleucine, and leucine, form an α-helix; or (viii) the E3 ligase comprises the E3 ligase substrate receptor VHL and the degron(s) comprise or consist of the amino acid motif X1-X2-X3-X4-X5-X6, wherein X1 is leucine; X2 is any naturally occurring amino acid; X3 is any naturally occurring amino acid; X4 is leucine; X5 is alanine; and X6 is proline or hydroxylated proline (e.g., 4(R)-L-hydroxyproline).
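The receptor-specific motifs listed above can be collected into a lookup table of regular expressions. The sketch below is illustrative only. Two encoding choices are assumptions that are not part of this disclosure: lowercase `s` stands for phosphoserine (pS) and lowercase `p` for hydroxyproline, since neither has a standard one-letter code. The `MOTIFS` table, the `matching_receptors` helper, and the test peptides are hypothetical, and the α-helix requirement of item (vii) is structural and cannot be checked by sequence matching.

```python
import re

AA = "ACDEFGHIKLMNPQRSTVWY"
X = f"[{AA}]"  # any one of the 20 standard amino acids

# Assumed encoding: 's' = phosphoserine, 'p' = hydroxyproline.
MOTIFS = {
    "BTRC": [f"D[sDE]G{X}{{1,4}}[sDE]"],       # D-Z-G-(X)1-4-Z
    "KEAP1": [
        f"[DNS]{X}[DES][TNS]GE",               # six-residue motif
        f"L{X}{X}QD{X}DLG",                    # nine-residue motif
        "ETGE",
        "DLG",
    ],
    "MDM2": [f"F{X}{{3}}W{X}{{2}}[VIL]"],      # F-X-X-X-W-X-X-[V/I/L]
    "VHL": [f"L{X}{{2}}LA[Pp]"],               # L-X-X-L-A-P/Hyp
}


def matching_receptors(peptide: str) -> list[str]:
    """Names of receptors with at least one motif found in the peptide."""
    return sorted(
        receptor
        for receptor, patterns in MOTIFS.items()
        if any(re.search(pattern, peptide) for pattern in patterns)
    )


# Toy peptides chosen to satisfy exactly one receptor's motif each.
examples = {
    "DsGIHs": matching_receptors("DsGIHs"),
    "DEETGE": matching_receptors("DEETGE"),
    "FSDLWKLL": matching_receptors("FSDLWKLL"),
    "LEMLAp": matching_receptors("LEMLAp"),
}
```

A sequence hit of this kind would only nominate a candidate; the surface-based scoring described herein is what distinguishes this disclosure from pure motif search.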
In some embodiments, the molecular surface features comprise geometric and/or chemical features. In some embodiments, the geometric features are selected from the group consisting of shape index, distance-dependent curvature, geodesic polar coordinates, radial (angular) coordinates, and combinations thereof. In some embodiments, the chemical features are selected from the group consisting of hydropathy index, continuum electrostatics, location of free electrons, location of free proton donors, and combinations thereof. In some embodiments, the degron score is calculated using a geometric deep learning model. In some embodiments, the geometric deep learning model is a neural network. In some embodiments, the neural network is trained on complementarity of E3 ligase surface(s) to known degron surface(s). In some embodiments, the neural network is trained on similarity to known and/or predicted degron surface(s).
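Of the geometric features listed, the shape index has a compact closed form in terms of the two principal curvatures at a surface point. The sketch below uses one common convention, s = (2/π)·arctan((κ1 + κ2)/(κ1 − κ2)) with κ1 ≥ κ2; sign conventions vary with the orientation of the surface normal, and this disclosure does not specify which is used. The function name is hypothetical.

```python
import math


def shape_index(k1: float, k2: float) -> float:
    """Shape index in [-1, 1] from two principal curvatures.

    atan2 is used instead of a plain division so that umbilic points
    (k1 == k2) do not divide by zero. For a perfectly flat point
    (k1 == k2 == 0) the shape index is formally undefined; this
    sketch returns 0.0 there because math.atan2(0.0, 0.0) == 0.0.
    """
    k_max, k_min = max(k1, k2), min(k1, k2)
    return (2.0 / math.pi) * math.atan2(k_max + k_min, k_max - k_min)


cap = shape_index(1.0, 1.0)       # equal positive curvatures
saddle = shape_index(1.0, -1.0)   # symmetric saddle
ridge = shape_index(1.0, 0.0)     # curved in one direction only
cup = shape_index(-1.0, -1.0)     # equal negative curvatures
```

Under this convention the extremes ±1 correspond to spherical caps and cups, 0 to a symmetric saddle, and ±0.5 to ridge- or rut-like cylindrical regions.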
In some embodiments, the second set of proteins comprises proteins that are not in the first set of proteins. In some embodiments, the second set of proteins does not include any proteins from the first set of proteins.
In some embodiments of any of the methods described herein, the E3 ligase is CRBN.
Throughout this application, various embodiments may be presented in a range format. It should be understood that the description in range format is merely for convenience and brevity and should not be construed as an inflexible limitation on the scope of the disclosure.
Accordingly, the description of a range should be considered to have specifically disclosed all the possible subranges as well as individual numerical values within that range. For example, description of a range such as from 1 to 6 should be considered to have specifically disclosed subranges such as from 1 to 3, from 1 to 4, from 1 to 5, from 2 to 4, from 2 to 6, from 3 to 6 etc., as well as individual numbers within that range, for example, 1, 2, 3, 4, 5, and 6. This applies regardless of the breadth of the range.
As used in the specification and claims, the singular forms “a”, “an” and “the” include plural references unless the context clearly dictates otherwise. For example, the term “a sample” includes a plurality of samples, including mixtures thereof.
The terms “determining,” “measuring,” “evaluating,” “assessing,” “assaying,” and “analyzing” are often used interchangeably herein to refer to forms of measurement. The terms include determining if an element is present or not (for example, detection). These terms can include quantitative, qualitative or quantitative and qualitative determinations. Assessing can be relative or absolute. “Detecting the presence of” can include determining the amount of something present in addition to determining whether it is present or absent depending on the context.
As used herein, the term “about” a number refers to that number plus or minus 10% of that number. The term “about” a range refers to that range minus 10% of its lowest value and plus 10% of its greatest value.
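The definition of "about" above is simple arithmetic, sketched here for concreteness. The function names are hypothetical, and the use of absolute values for negative inputs is an assumption, since the definition does not address them.

```python
def about_number(x: float) -> tuple[float, float]:
    """Interval covered by "about x": x minus/plus 10% of x."""
    delta = 0.10 * abs(x)
    return (x - delta, x + delta)


def about_range(low: float, high: float) -> tuple[float, float]:
    """Interval covered by "about low to high".

    The lower bound is reduced by 10% of the lowest value and the
    upper bound is increased by 10% of the greatest value.
    """
    if low > high:
        raise ValueError("low must not exceed high")
    return (low - 0.10 * abs(low), high + 0.10 * abs(high))


num_lo, num_hi = about_number(50.0)         # roughly 45 to 55
rng_lo, rng_hi = about_range(10.0, 20.0)    # roughly 9 to 22
```

Note that the widening of a range is asymmetric in absolute terms: the upper bound moves by more than the lower bound whenever the two endpoints differ.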
Unless otherwise defined, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. Methods and materials are described herein for use in the present invention; other, suitable methods and materials known in the art can also be used. The materials, methods, and examples are illustrative only and not intended to be limiting. All publications, patent applications, patents, sequences, database entries, and other references mentioned herein are incorporated by reference in their entirety. In case of conflict, the present specification, including definitions, will control.
Other features and advantages of the invention will be apparent from the following detailed description and figures, and from the claims.
Described herein are methods and compounds useful, for example, for predicting, identifying, classifying, and selecting neosubstrates of E3 ligases using, for example, molecular surface features of protein(s). The molecular surface is a higher-level representation of a protein than its three-dimensional structure or its sequence, and the methods described herein provide an improvement, for example, over methods utilizing lower-level representation(s) of protein structure.
E3 ligases recognize protein substrates and, when complexed with E2 conjugating enzymes loaded with ubiquitin, mediate ubiquitination of the substrate protein. E3 ligases and their substrate receptor proteins are known and described in the art, for example, in Ishida et al., “E3 Ligase Ligands for PROTACs: How They Were Found and How to Discover New Ones,” SLAS Discovery 26(4):484-502 (2021).
Cereblon (CRBN), for example, forms an E3 ubiquitin ligase complex with damaged DNA binding protein 1 (DDB1), Cullin-4A (CUL4A), and regulator of cullins 1 (ROC1).
In some cases, the E3 ligase substrate receptor protein is selected from the group consisting of CRBN (e.g., UniProtKB Q96SW2), VHL (e.g., UniProtKB P40337), BIRC1 (e.g., UniProtKB Q13075), BIRC2 (e.g., UniProtKB Q13490), BIRC3 (e.g., UniProtKB Q13489), BIRC4 (e.g., UniProtKB P98170), BIRC5 (e.g., UniProtKB O15392), BIRC6 (e.g., UniProtKB Q9NR09), BIRC7 (e.g., UniProtKB Q96CA5), BIRC8 (e.g., UniProtKB Q96P09), KEAP1 (e.g., UniProtKB Q14145), DCAF15 (e.g., UniProtKB Q66K64), RNF4 (e.g., UniProtKB P78317), RNF4 isoform 2 (e.g., UniProtKB P78317-2), RNF114 (e.g., UniProtKB Q9Y508), RNF114 isoform 2 (e.g., UniProtKB Q9Y508-2), DCAF16 (e.g., UniProtKB Q9NXF7), AHR (e.g., UniProtKB P35869), MDM2 (e.g., UniProtKB Q00987), UBR2 (e.g., UniProtKB Q8IWV8), SPOP (e.g., UniProtKB O43791), KLHL3 (e.g., UniProtKB Q9UH77), KLHL12 (e.g., UniProtKB Q53G59), KLHL20 (e.g., UniProtKB Q9Y2M5), KLHDC2 (e.g., UniProtKB Q9Y2U9), SPSB1 (e.g., UniProtKB Q96BD6), SPSB2 (e.g., UniProtKB Q99619), SPSB4 (e.g., UniProtKB Q96A44), SOCS2 (e.g., UniProtKB O14508), SOCS6 (e.g., UniProtKB O14544), FBXO4 (e.g., UniProtKB Q9UKT5), FBXO31 (e.g., UniProtKB Q5XUX0), BTRC (e.g., UniProtKB Q9Y297), FBW7 (e.g., UniProtKB Q969H0), CDC20 (e.g., UniProtKB Q12834), ITCH (e.g., UniProtKB Q96J02), PML (e.g., UniProtKB P29590), TRIM21 (e.g., UniProtKB P19474), TRIM24 (e.g., UniProtKB O15164), TRIM33 (e.g., UniProtKB Q9UPN9), GID4 (e.g., UniProtKB Q8IVV7), and DCAF11 (e.g., UniProtKB Q8TEB1).
In some cases, the E3 ligase is an E3 ligase selected from the group consisting of CRBN (SEQ ID NO: 3), CRBN isoform 2 (SEQ ID NO: 2), VHL (SEQ ID NO: 9), BIRC1 (SEQ ID NO: 10), BIRC2 (SEQ ID NO: 11), BIRC3 (SEQ ID NO: 12), BIRC4 (SEQ ID NO: 13), BIRC5 (SEQ ID NO: 14), BIRC6 (SEQ ID NO: 15), BIRC7 (SEQ ID NO: 16), BIRC8 (SEQ ID NO: 17), KEAP1 (SEQ ID NO: 18), DCAF15 (SEQ ID NO: 19), RNF4 (SEQ ID NO: 20), RNF4 isoform 2 (SEQ ID NO: 21), RNF114 (SEQ ID NO: 22), RNF114 isoform 2 (SEQ ID NO: 23), DCAF16 (SEQ ID NO: 24), AHR (SEQ ID NO: 25), MDM2 (SEQ ID NO: 26), UBR2 (SEQ ID NO: 27), SPOP (SEQ ID NO: 28), KLHL3 (SEQ ID NO: 29), KLHL12 (SEQ ID NO: 30), KLHL20 (SEQ ID NO: 31), KLHDC2 (SEQ ID NO: 32), SPSB1 (SEQ ID NO: 33), SPSB2 (SEQ ID NO: 34), SPSB4 (SEQ ID NO: 35), SOCS2 (SEQ ID NO: 36), SOCS6 (SEQ ID NO: 37), FBXO4 (SEQ ID NO: 38), FBXO31 (SEQ ID NO: 39), BTRC (SEQ ID NO: 40), FBW7 (SEQ ID NO: 41), CDC20 (SEQ ID NO: 42), ITCH (SEQ ID NO: 43), PML (SEQ ID NO: 44), TRIM21 (SEQ ID NO: 45), TRIM24 (SEQ ID NO: 46), TRIM33 (SEQ ID NO: 47), GID4 (SEQ ID NO: 48), and DCAF11 (SEQ ID NO: 49).
In some cases, the E3 ligase is at least 80%, e.g., at least 90%, at least 95%, or at least 99%, identical to an E3 ligase selected from the group consisting of CRBN (SEQ ID NO: 3), CRBN isoform 2 (SEQ ID NO: 2), VHL (SEQ ID NO: 9), BIRC1 (SEQ ID NO: 10), BIRC2 (SEQ ID NO: 11), BIRC3 (SEQ ID NO: 12), BIRC4 (SEQ ID NO: 13), BIRC5 (SEQ ID NO: 14), BIRC6 (SEQ ID NO: 15), BIRC7 (SEQ ID NO: 16), BIRC8 (SEQ ID NO: 17), KEAP1 (SEQ ID NO: 18), DCAF15 (SEQ ID NO: 19), RNF4 (SEQ ID NO: 20), RNF4 isoform 2 (SEQ ID NO: 21), RNF114 (SEQ ID NO: 22), RNF114 isoform 2 (SEQ ID NO: 23), DCAF16 (SEQ ID NO: 24), AHR (SEQ ID NO: 25), MDM2 (SEQ ID NO: 26), UBR2 (SEQ ID NO: 27), SPOP (SEQ ID NO: 28), KLHL3 (SEQ ID NO: 29), KLHL12 (SEQ ID NO: 30), KLHL20 (SEQ ID NO: 31), KLHDC2 (SEQ ID NO: 32), SPSB1 (SEQ ID NO: 33), SPSB2 (SEQ ID NO: 34), SPSB4 (SEQ ID NO: 35), SOCS2 (SEQ ID NO: 36), SOCS6 (SEQ ID NO: 37), FBXO4 (SEQ ID NO: 38), FBXO31 (SEQ ID NO: 39), BTRC (SEQ ID NO: 40), FBW7 (SEQ ID NO: 41), CDC20 (SEQ ID NO: 42), ITCH (SEQ ID NO: 43), PML (SEQ ID NO: 44), TRIM21 (SEQ ID NO: 45), TRIM24 (SEQ ID NO: 46), TRIM33 (SEQ ID NO: 47), GID4 (SEQ ID NO: 48), and DCAF11 (SEQ ID NO: 49).
In some cases, the E3 ligase is an enzymatically active portion of an E3 ligase selected from the group consisting of CRBN (SEQ ID NO: 3), CRBN isoform 2 (SEQ ID NO: 2), VHL (SEQ ID NO: 9), BIRC1 (SEQ ID NO: 10), BIRC2 (SEQ ID NO: 11), BIRC3 (SEQ ID NO: 12), BIRC4 (SEQ ID NO: 13), BIRC5 (SEQ ID NO: 14), BIRC6 (SEQ ID NO: 15), BIRC7 (SEQ ID NO: 16), BIRC8 (SEQ ID NO: 17), KEAP1 (SEQ ID NO: 18), DCAF15 (SEQ ID NO: 19), RNF4 (SEQ ID NO: 20), RNF4 isoform 2 (SEQ ID NO: 21), RNF114 (SEQ ID NO: 22), RNF114 isoform 2 (SEQ ID NO: 23), DCAF16 (SEQ ID NO: 24), AHR (SEQ ID NO: 25), MDM2 (SEQ ID NO: 26), UBR2 (SEQ ID NO: 27), SPOP (SEQ ID NO: 28), KLHL3 (SEQ ID NO: 29), KLHL12 (SEQ ID NO: 30), KLHL20 (SEQ ID NO: 31), KLHDC2 (SEQ ID NO: 32), SPSB1 (SEQ ID NO: 33), SPSB2 (SEQ ID NO: 34), SPSB4 (SEQ ID NO: 35), SOCS2 (SEQ ID NO: 36), SOCS6 (SEQ ID NO: 37), FBXO4 (SEQ ID NO: 38), FBXO31 (SEQ ID NO: 39), BTRC (SEQ ID NO: 40), FBW7 (SEQ ID NO: 41), CDC20 (SEQ ID NO: 42), ITCH (SEQ ID NO: 43), PML (SEQ ID NO: 44), TRIM21 (SEQ ID NO: 45), TRIM24 (SEQ ID NO: 46), TRIM33 (SEQ ID NO: 47), GID4 (SEQ ID NO: 48), and DCAF11 (SEQ ID NO: 49).
The cereblon protein, encoded by the gene CRBN, is the substrate recognition component of a DCX (DDB1-CUL4-X-box) E3 protein ligase complex that mediates the ubiquitination and subsequent proteasomal degradation of target proteins.
The hydrophobic tri-tryptophan cage is the canonical thalidomide-binding domain at the C-terminal end of CRBN. The glutarimide moiety of immunomodulatory imide drugs (IMiDs) such as thalidomide binds in this highly conserved hydrophobic pocket, with the phthalimide ring exposed on the surface of the CRBN protein. See Chopra et al., “Protein Degradation for Drug Discovery,” Drug Discovery Today: Technologies 31:5-13 (2019).
The human cereblon gene, CRBN (NCBI Gene ID 51185; UniProt ID Q96SW2), encodes the following transcripts and isoforms, of which NM_016302.4 (SEQ ID NO: 3, transcript 1) is the canonical transcript:
Isoform 1 of human CRBN (SEQ ID NO: 3) has the following features:
Known mutants of human CRBN isoform 1 (SEQ ID NO: 3) have the following features:
Isoform 1 of human CRBN (SEQ ID NO: 3) comprises a Lon N-terminal domain at positions 81-317, the canonical binding domain CULT (cereblon domain of unknown activity, binding cellular ligands and thalidomide) at positions 318-426, and the canonical thalidomide binding region at positions 378-386 (Chamberlain et al. Nat. Struct. Mol. Biol. 21:803-9 (2014)). The CULT domain binds thalidomide and related drugs, such as pomalidomide and lenalidomide. Drug binding leads to a change in substrate specificity of the human DCX (DDB1-CUL4-X-box) E3 protein ligase complex, while no such change is observed in rodents (Chamberlain et al. Nat. Struct. Mol. Biol. 21:803-9 (2014)).
In some cases, the cereblon protein is human cereblon protein. In some cases, the cereblon protein comprises or consists of SEQ ID NO: 2, SEQ ID NO: 3, SEQ ID NO: 4, SEQ ID NO: 5, SEQ ID NO: 6, SEQ ID NO: 7, or SEQ ID NO: 8. In some cases, the cereblon protein is at least 80% identical to SEQ ID NO: 2, SEQ ID NO: 3, SEQ ID NO: 4, SEQ ID NO: 5, SEQ ID NO: 6, SEQ ID NO: 7, or SEQ ID NO: 8, e.g., at least 90%, at least 95%, or at least 99% identical to SEQ ID NO: 2, SEQ ID NO: 3, SEQ ID NO: 4, SEQ ID NO: 5, SEQ ID NO: 6, SEQ ID NO: 7, or SEQ ID NO: 8.
In some cases, the cereblon protein is human cereblon protein without the leading methionine (M). In some cases, the cereblon protein comprises or consists of SEQ ID NO: 2, SEQ ID NO: 3, SEQ ID NO: 4, SEQ ID NO: 5, SEQ ID NO: 6, SEQ ID NO: 7, or SEQ ID NO: 8 without the leading methionine (M). In some cases, the cereblon protein is at least 80% identical to SEQ ID NO: 2, SEQ ID NO: 3, SEQ ID NO: 4, SEQ ID NO: 5, SEQ ID NO: 6, SEQ ID NO: 7, or SEQ ID NO: 8 without the leading methionine (M), e.g., at least 90%, at least 95%, or at least 99% identical to SEQ ID NO: 2, SEQ ID NO: 3, SEQ ID NO: 4, SEQ ID NO: 5, SEQ ID NO: 6, SEQ ID NO: 7, or SEQ ID NO: 8 without the leading methionine (M).
In some cases, the cereblon protein is a mutant that is unable to bind compounds, e.g., an E3 ligase binding modulator, e.g., a cereblon binding modulator described herein, at a canonical binding site.
In some cases, the cereblon protein, e.g., a cereblon protein described herein, comprises point mutations at the positions corresponding to Y384 and/or W386 of SEQ ID NO: 3. In some cases, the cereblon protein, e.g., a cereblon protein described herein, comprises point mutations at the positions corresponding to Y384 and W386 of SEQ ID NO: 3. In some cases, the mutations are Y384A and/or W386A.
In some cases, the cereblon protein comprises or consists of SEQ ID NO: 3 with point mutations at Y384 and/or W386. In some cases, the cereblon protein comprises or consists of SEQ ID NO: 3 with point mutations at both Y384 and W386. In some cases, the mutations are Y384A and/or W386A.
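Point mutations written in the form Y384A (wild-type residue, 1-based position, replacement residue) can be applied to a sequence programmatically. The sketch below is a minimal illustration; the function name is hypothetical and the toy sequence is made up. It is not SEQ ID NO: 3, so the positions used in the example are 3 and 5 rather than 384 and 386.

```python
import re


def apply_point_mutations(sequence: str, mutations: list[str]) -> str:
    """Apply substitutions such as "Y384A" to a one-letter sequence.

    Positions are 1-based. The stated wild-type residue is checked
    against the sequence so that numbering errors fail loudly.
    """
    residues = list(sequence)
    for mutation in mutations:
        match = re.fullmatch(r"([A-Z])(\d+)([A-Z])", mutation)
        if match is None:
            raise ValueError(f"unrecognized mutation format: {mutation!r}")
        wild_type, position, replacement = (
            match.group(1), int(match.group(2)), match.group(3))
        if not 1 <= position <= len(residues):
            raise ValueError(f"position {position} is outside the sequence")
        if residues[position - 1] != wild_type:
            raise ValueError(
                f"expected {wild_type} at position {position}, "
                f"found {residues[position - 1]}")
        residues[position - 1] = replacement
    return "".join(residues)


# Toy sequence with Y at position 3 and W at position 5.
mutant = apply_point_mutations("MKYAW", ["Y3A", "W5A"])
```

With the actual SEQ ID NO: 3 sequence loaded from the sequence listing, the same call with `["Y384A", "W386A"]` would produce the double mutant described above, and the wild-type check would confirm the numbering.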
The methods described herein are useful, for example, for identifying neosubstrates of E3 ligases. In some cases, the methods are used to validate and/or identify targets that selectively interact with, e.g., cereblon within the E3 ubiquitin ligase complex, in the presence of a compound, e.g., an E3 ligase binding modulator such as a molecular glue, e.g., a cereblon binding modulator such as a CRBN molecular glue.
E3 ligase binding modulators, e.g., cereblon binding modulators, are described, for example, in WO2021/069705, WO2021/053555, WO2022/152821, WO2022/219407, and WO2022/219412, which are hereby incorporated by reference in their entirety.
In some cases, the E3 ligase binding modulator, e.g., cereblon binding modulator, is a compound shown in Table 1 or Table 2, or a pharmaceutically acceptable salt thereof, or a stereoisomer thereof.
In some cases, the E3 ligase binding modulator is a molecular glue.
A molecular glue is a small molecule that stabilizes the interaction of two or more biomolecules (e.g., proteins) at a protein-protein interaction (PPI) interface, e.g., by chemically inducing or strengthening surface interactions between the proteins. In some cases, the molecular glue stabilizes the interaction of an E3 ligase substrate receptor protein and one or more target protein(s).
In some cases, the molecular glue functions as a molecular glue drug by modulating (e.g., increasing or promoting) one or more of: the stability of protein-protein interaction(s), degradation of protein(s), sequestration of protein(s) (e.g., into specific regions of a cell), phosphorylation of protein(s), de-phosphorylation of protein(s), and stabilization of protein(s).
In some cases, the modulation is directly of the target protein (the “glued” target). In some cases, the modulation is indirect (e.g., of a target downstream of the “glued” target).
Thalidomide and immunomodulatory imide drugs (IMiDs), such as lenalidomide and pomalidomide, are examples of molecular glue drugs that induce degradation of normally unrecognized target proteins (sometimes referred to as “neosubstrates”) by generating an interaction between an E3 ligase substrate receptor (e.g., cereblon) and a target protein (e.g., IKZF1/3).
Molecular glue drugs, such as these, that induce the degradation of protein(s) are sometimes referred to as molecular glue degraders. Molecular glue degraders are believed to create neosubstrate recognition interfaces on the surface of the E3 ligase substrate receptor protein that engage in induced protein-protein interactions with neosubstrates.
The compositions and methods described herein are useful, for example, in identification and/or prediction of degrons on the surface of a protein, e.g., on the surface of a neosubstrate, potential neosubstrate, predicted neosubstrate and/or putative neosubstrate of an E3 ligase target protein and/or E3 ligase binding modulator target protein.
In the context of molecular glue degraders, for example, in some cases the target protein is the protein that interfaces (e.g., binds) with the E3 ligase substrate receptor. In some cases, the target protein comprises a degron.
Degrons are structural features on the surface of a protein that mediate recruitment of and degradation by an E3 ligase complex, e.g., an E3 ligase complex described herein. Degrons are described, for example, in Lucas and Ciulli, “Recognition of Substrate Dependent Degrons by E3 Ubiquitin Ligases and Modulation by Small-Molecule Mimicry Strategies,” Current Opinion in Structural Biology 44:101-10 (2017). For CRBN, for example, a β-hairpin loop containing a glycine at a key position (G-loop) has been identified as a degron based on the interactions of CK1α, GSPT1, and Zn-fingers with CRBN in their X-ray structures. See, e.g., Matyskiela et al., “A Novel Cereblon Modulator Recruits GSPT1 to the CRL4(CRBN) Ubiquitin Ligase,” Nature 535(7611):252-7 (2016); Petzold et al., “Structural basis of lenalidomide-induced CK1α degradation by the CRL4CRBN ubiquitin ligase,” Nature 532(7597):127-130 (2016); Furihata et al., “Structural bases of IMiD selectivity that emerges by 5-hydroxythalidomide,” Nat Commun. 11(1):4578 (2020); Sievers et al., “Defining the human C2H2 zinc finger degrome targeted by thalidomide analogs through CRBN,” Science 362(6414):eaat0572 (2018); and Wang et al., “Acute pharmacological degradation of Helios destabilizes regulatory T cells,” Nat. Chem. Biol. 17(6):711-17 (2021).
Degrons have been described and/or identified based on their primary, secondary, or tertiary protein structures. In some cases, a degron is described and/or identified in terms of its quaternary structure (e.g., in complex). In some cases, a degron is described and/or identified in the context of a crystal structure (e.g., a PDB structure). For CRBN, for example, there are six known degrons in nine crystal structures (PDB ids: 6UML, 6H0G, 6H0F, 5FQD, 5HXB, 6XK9, 7LPS, 7BQU, and 7BQV).
In some cases, the degron is a small molecule dependent degron (i.e., is a structural feature on the surface of the protein that mediates recruitment of and degradation by an E3 ligase in the presence of an E3 ligase binding modulator, e.g., an E3 ligase binding modulator described herein). In some cases, the degron is a small molecule independent degron (i.e., is a structural feature on the surface of the protein that mediates recruitment of and degradation by an E3 ligase in the absence of an E3 ligase binding modulator, e.g., an E3 ligase binding modulator described herein).
Degrons may be present on the surface of the protein target as it is expressed or added to the protein target via a linker (e.g., a proteolysis targeting chimera (PROTAC)). See, e.g., Paiva and Crews, “Targeted Protein Degradation: Elements of PROTAC Design,” Curr Opin Chem Biol 50:111-19 (2019).
Degrons include, e.g., N-degrons and C-degrons, which are known and described in the art. See, e.g., Lucas and Ciulli 2017; see also, e.g., Timms and Koren, “Tying up Loose Ends: the N-degron and C-degron Pathways of Protein Degradation,” Biochem Soc Trans 48(4):1557-67 (2020).
Degrons also include, e.g., phosphodegrons and oxygen-dependent degrons (ODDs), which are also known and described in the art. See, e.g., Lucas and Ciulli 2017. In some cases, the degron comprises or consists of the amino acid motif D-Z-G-X-Z, D-Z-G-X-X-Z, D-Z-G-X-X-X-Z, or D-Z-G-X-X-X-X-Z, wherein D is aspartic acid, each X is independently any naturally occurring amino acid, and Z is selected from the group consisting of pS (phosphorylated serine), aspartic acid, and glutamic acid.
In some cases, the degron comprises or consists of the amino acid motif X1-X2-X3-X4-X5-X6, wherein X1 is selected from the group consisting of aspartic acid, asparagine, and serine; X2 is any one of the naturally occurring amino acids; X3 is selected from the group consisting of aspartic acid, glutamic acid, and serine; X4 is selected from the group consisting of threonine, asparagine, and serine; X5 is glycine; and X6 is glutamic acid.
In some cases, the degron comprises or consists of the amino acid motif X1-X2-X3-X4-X5-X6-X7-X8-X9, wherein X1 is leucine; X2 is any one of the naturally occurring amino acids; X3 is any one of the naturally occurring amino acids; X4 is glutamine; X5 is aspartic acid; X6 is any one of the naturally occurring amino acids; X7 is aspartic acid; X8 is leucine; and X9 is glycine.
In some cases, the degron comprises or consists of the amino acid motif ETGE (SEQ ID NO: 1). In some cases, the degron comprises or consists of the amino acid motif DLG.
In some cases, the degron comprises or consists of the amino acid motif X1-X2-X3-X4-X5-X6-X7-X8, wherein X1 is phenylalanine; X2 is any one of the naturally occurring amino acids; X3 is any one of the naturally occurring amino acids; X4 is any one of the naturally occurring amino acids; X5 is tryptophan; X6 is any one of the naturally occurring amino acids; X7 is any one of the naturally occurring amino acids; and X8 is selected from the group consisting of valine, isoleucine, and leucine. In some cases, the degron comprising or consisting of the amino acid motif X1-X2-X3-X4-X5-X6-X7-X8, wherein X1 is phenylalanine; X2 is any one of the naturally occurring amino acids; X3 is any one of the naturally occurring amino acids; X4 is any one of the naturally occurring amino acids; X5 is tryptophan; X6 is any one of the naturally occurring amino acids; X7 is any one of the naturally occurring amino acids; and X8 is selected from the group consisting of valine, isoleucine, and leucine, forms an α-helix.
In some cases, the degron comprises or consists of the amino acid motif X1-X2-X3-X4-X5-X6, wherein X1 is leucine; X2 is any naturally occurring amino acid; X3 is any naturally occurring amino acid; X4 is leucine; X5 is alanine; and X6 is proline or hydroxylated proline (e.g., 4(R)-L-hydroxyproline).
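For illustration, the linear motifs described above can be expressed as regular expressions over one-letter amino acid codes. The following is a minimal sketch rather than a definitive implementation: the dictionary keys and the function name `find_motifs` are hypothetical, phosphoserine (pS) is approximated by an unmodified S because phosphorylation state is not encoded in a plain sequence, and hydroxylated proline is likewise represented by P.

```python
import re

# Character class for "any naturally occurring amino acid" (20 standard residues).
X = "[ACDEFGHIKLMNPQRSTVWY]"

# Hypothetical regex translations of the motifs described in the text.
# Z = pS, D, or E; "S" stands in for a serine that would need to be phosphorylated.
DEGRON_PATTERNS = {
    "D-Z-G-X(1-4)-Z": re.compile(rf"D[SDE]G{X}{{1,4}}[SDE]"),
    "[DNS]-X-[DES]-[TNS]-G-E": re.compile(rf"[DNS]{X}[DES][TNS]GE"),
    "L-X-X-Q-D-X-D-L-G": re.compile(rf"L{X}{X}QD{X}DLG"),
    "F-X-X-X-W-X-X-[VIL]": re.compile(rf"F{X}{X}{X}W{X}{X}[VIL]"),
    "L-X-X-L-A-P": re.compile(rf"L{X}{X}LAP"),
}


def find_motifs(sequence):
    """Return (motif name, 0-based start, matched text) tuples, ordered by start."""
    hits = []
    for name, pattern in DEGRON_PATTERNS.items():
        for match in pattern.finditer(sequence):
            hits.append((name, match.start(), match.group(0)))
    return sorted(hits, key=lambda hit: hit[1])
```

A sequence match of this kind only flags a candidate; whether the motif is surface exposed and adopts the required conformation (e.g., an α-helix) would still need to be assessed from structure.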
Degrons also include, e.g., G-loop degrons. Thus, in some cases, the E3 ligase binding target is a protein comprising an E3 ligase-accessible loop, e.g., a cereblon-accessible loop, e.g., a G-loop.
In some cases, the G-loop degron comprises or consists of the amino acid sequence X1-X2-X3-X4-G-X6, wherein: each of X1, X2, X3, X4, and X6 is independently selected from any one of the naturally occurring amino acids; and G (i.e., X5) is glycine.
In some cases, the G-loop degron comprises or consists of the amino acid sequence X1-X2-X3-X4-G-X6-X7, wherein: each of X1, X2, X3, X4, X6, and X7 is independently selected from any one of the naturally occurring amino acids; and G (i.e., X5) is glycine.
In some cases, the G-loop degron comprises or consists of the amino acid sequence X1-X2-X3-X4-G-X6-X7-X8, wherein: each of X1, X2, X3, X4, X6, X7, and X8 is independently selected from any one of the naturally occurring amino acids; and G (i.e., X5) is glycine.
In some cases, a distance from X1 to X4 is less than about 7 angstroms. In some cases, X1 and X4 are the same. In some cases, X1 is aspartic acid or asparagine and X4 is serine or threonine.
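The distance criterion above can be checked from atomic coordinates, as in the following minimal sketch. The text does not specify which atoms define the X1 to X4 distance; the sketch assumes one representative coordinate per residue (for example, the Cα atom), and the function name `x1_x4_within_cutoff` is hypothetical.

```python
import math


def x1_x4_within_cutoff(loop_coords, cutoff=7.0):
    """Return True if the X1-to-X4 distance is below `cutoff` (angstroms).

    `loop_coords` is a list of (x, y, z) tuples, one per loop residue in
    order X1, X2, ...; using one point per residue (e.g., C-alpha) is an
    assumption made for this illustration.
    """
    if len(loop_coords) < 4:
        raise ValueError("need coordinates for at least X1 through X4")
    return math.dist(loop_coords[0], loop_coords[3]) < cutoff
```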
In some cases, the G-loop degron comprises or consists of the amino acid sequence X1-X2-X3-X4-G-X6, wherein X1 is selected from the group consisting of asparagine, aspartic acid, and cysteine; X2 is selected from the group consisting of isoleucine, lysine, and asparagine; X3 is selected from the group consisting of threonine, lysine, and glutamine; X4 is selected from the group consisting of asparagine, serine, and cysteine; X5 is glycine; and X6 is selected from the group consisting of glutamic acid and glutamine.
In some cases, the G-loop degron comprises or consists of the amino acid sequence X1-X2-X3-X4-G-X6, wherein X1 is asparagine; X2 is isoleucine; X3 is threonine; X4 is asparagine; X5 is glycine; and X6 is glutamic acid.
In some cases, the G-loop degron comprises or consists of the amino acid sequence X1-X2-X3-X4-G-X6, wherein X1 is aspartic acid; X2 is lysine; X3 is lysine; X4 is serine; X5 is glycine; and X6 is glutamic acid.
In some cases, the G-loop degron comprises or consists of the amino acid sequence X1-X2-X3-X4-G-X6, wherein X1 is cysteine; X2 is asparagine; X3 is glutamine; X4 is cysteine; X5 is glycine; and X6 is glutamine.
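As an illustration, candidate G-loop windows of the form X1-X2-X3-X4-G-X6 can be enumerated from a sequence with a regular expression. This is a minimal sketch under stated assumptions: the names `GENERAL_G_LOOP`, `RESTRICTED_G_LOOP`, and `g_loop_candidates` are hypothetical, and a sequence match alone does not establish that the residues form a β-hairpin loop on the protein surface.

```python
import re

AA = "ACDEFGHIKLMNPQRSTVWY"  # 20 standard amino acids, one-letter codes

# Any four residues, then glycine at X5, then any residue at X6.
# A lookahead with a capture group is used so overlapping windows are reported.
GENERAL_G_LOOP = re.compile(rf"(?=([{AA}]{{4}}G[{AA}]))")

# Restricted variant using the residue choices listed in the text:
# X1 in {N, D, C}; X2 in {I, K, N}; X3 in {T, K, Q}; X4 in {N, S, C}; X6 in {E, Q}.
RESTRICTED_G_LOOP = re.compile(r"(?=([NDC][IKN][TKQ][NSC]G[EQ]))")


def g_loop_candidates(sequence, pattern=GENERAL_G_LOOP):
    """Return (0-based start, six-residue window) for each candidate window."""
    return [(match.start(), match.group(1)) for match in pattern.finditer(sequence)]
```

In practice, candidates found this way would be filtered further using structural information, for example the X1 to X4 distance criterion described above.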
In some cases, the degron comprises or consists of an amino acid sequence of about 2 to about 15 amino acids in length. In some cases, the degron comprises or consists of an amino acid sequence of about 6 to about 12 amino acids in length. In some cases, the degron comprises or consists of at least about 6 amino acids. In some cases, the degron comprises or consists of at least about 7 amino acids. In some cases, the degron comprises or consists of at least about 8 amino acids. In some cases, the degron comprises or consists of at least about 9 amino acids. In some cases, the degron comprises or consists of at least about 10 amino acids. In some cases, the G-loop degron is 6, 7, or 8 amino acids long.
In some cases, the target protein is a protein listed in the table below or a variant, derivative, ortholog, or homolog thereof.
The molecular surface is a higher-level representation of a protein than its sequence or three-dimensional structure. It models a protein as a continuous shape with geometric and chemical features. See Richards et al., Ann. Rev. Biophysics Bioeng. 6:151-76 (2003).
The molecular surface is useful for the methods described herein, for example, for identifying proteins with similar and/or complementary surface features and for predicting molecular interactions between an E3 ligase and a target protein and/or binding modulator. Thus, in some cases, the methods described herein comprise providing molecular surface feature(s) of one or more protein(s). Molecular surface features that are useful for the methods described herein include, for example, geometric features and/or chemical features.
In some cases, the molecular surface features are extracted from a crystal structure. In some cases, the crystal structure is ligand bound (i.e., holo). In some cases, the crystal structure is unbound (i.e., apo). In some cases, the molecular surface features are extracted from a computer modeled structure. In some cases, the computer modeled structure is ligand bound. In some cases, the computer modeled structure is unbound.
In some cases, the molecular surface features are obtained from a database, for example, the Protein Data Bank (PDB, rcsb.org) or the AlphaFold Protein Structure Database (alphafold.ebi.ac.uk).
PDB is a database for the three-dimensional structural data of large biological molecules, such as proteins and nucleic acids (Nucleic Acids Res. 2019 Jan. 8; 47(D1):D520-D528. doi: 10.1093/nar/gky949). The data, which are submitted by biologists and biochemists from around the world, are freely accessible on the Internet via the websites of its member organizations (e.g., PDBe (pdbe.org), PDBj (pdbj.org), RCSB (rcsb.org/pdb), and BMRB (bmrb.wisc.edu)). The PDB is overseen by an organization called the Worldwide Protein Data Bank (wwPDB).
In some embodiments, providing molecular surface feature(s) comprises determining a three-dimensional structure experimentally, e.g., using X-ray crystallography, nuclear magnetic resonance (NMR) spectroscopy, cryo-electron microscopy (cryoEM), small-angle X-ray scattering (SAXS), small-angle neutron scattering (SANS), or combinations thereof.
In some embodiments, providing molecular surface feature(s) comprises modeling of the three-dimensional structural context, e.g., if the three-dimensional structure of the identified protein is not known.
In some cases, modeling of the three-dimensional structural context is carried out using computer modeling. In some cases, the computer modeling is carried out using an artificial intelligence program, e.g., according to the methods described in Jumper et al., “Highly Accurate Protein Structure Prediction with AlphaFold,” Nature 596:583-89 (2021) or Evans et al., “Protein Complex Prediction with AlphaFold-Multimer,” bioRxiv doi.org/10.1101/2021.10.04.463034 (2021).
The molecular surface feature(s) can be provided together or separately. In some cases, the structure of one or more of the proteins is a ligand bound (i.e. holo) structure. In some cases, the structure of one or more of the proteins is unbound (i.e. apo).
In some cases, the molecular surface feature(s) are based on the three-dimensional structure of a region of a protein, e.g., the interface region of the protein that participates in (or is hypothesized to participate in) a PPI.
In some cases, for example, where the three-dimensional structures are unbound, starting structure(s) are built by superimposing the three-dimensional structures onto a reference structure.
In some cases, the molecular surface feature(s) are provided as parameters in digital format, e.g., in a MaSIF data file, for use in the methods described herein. Thus, in some cases, the methods described herein comprise providing data defining the molecular surface feature(s) of two or more proteins (or fragments thereof).
In some cases, the molecular surface feature(s) are geometric feature(s) and/or chemical feature(s).
In some cases, the surface feature(s) are geometric feature(s). In some cases, the geometric feature(s) are selected from the group consisting of a shape index (Koenderink et al., “Surface Shape and Curvature Scales,” Image Vis. Comput. 10:557-64 (1992), which is hereby incorporated by reference in its entirety), distance-dependent curvature (Yin et al., “Fast Screening of Protein Surfaces using Geometric Invariant Fingerprints” Proc. Natl. Acad. Sci. USA 106:16622-26 (2009), which is hereby incorporated by reference in its entirety), geodesic polar coordinate(s), radial (angular) coordinate(s), and combinations thereof. In other cases, the geometric features are learned directly from the underlying tertiary structure of the protein and its atomic arrangements.
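As an illustration of one of the geometric features named above, the shape index can be computed from the two principal curvatures at a surface point. The following is a minimal sketch using one common convention, s = (2/π)·arctan((κmax + κmin)/(κmax − κmin)), which maps local shape to the range −1 to 1. The sign convention (which of cup or cap maps to +1) depends on the orientation of the surface normals and is an assumption here, as is the function name `shape_index`.

```python
import math


def shape_index(k1, k2):
    """Shape index in [-1, 1] from two principal curvatures.

    Returns None for a planar point (both curvatures zero), where the
    shape index is undefined.
    """
    k_max, k_min = max(k1, k2), min(k1, k2)
    if k_max == k_min:
        if k_max == 0:
            return None  # planar: undefined
        # Umbilic point: the arctan argument diverges, giving +1 or -1.
        return math.copysign(1.0, k_max)
    return (2.0 / math.pi) * math.atan((k_max + k_min) / (k_max - k_min))
```

Under this convention a symmetric saddle (curvatures of equal magnitude and opposite sign) has a shape index of 0, and a cylinder-like ridge or rut has a magnitude of 0.5.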
In some cases, the surface feature(s) are chemical feature(s). In some cases, the chemical feature(s) are selected from the group consisting of hydropathy index (Kyte et al., “A Simple Method for Displaying the Hydropathic Character of a Protein” J. Mol. Biol. 157:105-32 (1982)), continuum electrostatics (Jurrus et al. “Improvements to the APBS Biomolecular Solvation Software Suite,” Protein Sci. 27:112-28 (2018), which is hereby incorporated by reference in its entirety), location of free electrons (Kortemme et al., “An Orientation-Dependent Hydrogen Bonding Potential Improves Prediction of Specificity and Structure for Proteins and Protein-Protein Complexes,” J. Mol. Biol. 326:1239-59 (2003), which is hereby incorporated by reference in its entirety), location of free proton donors (Kortemme et al., “An Orientation-Dependent Hydrogen Bonding Potential Improves Prediction of Specificity and Structure for Proteins and Protein-Protein Complexes,” J. Mol. Biol. 326:1239-59 (2003), which is hereby incorporated by reference in its entirety), and combinations thereof. In other cases, the chemical features are learned directly from the underlying tertiary structure of the protein and its atomic arrangements.
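As an illustration of the hydropathy index, the following minimal sketch computes a sliding-window average of Kyte-Doolittle hydropathy values along a sequence. It operates on sequence only; how a hydropathy value is assigned to each point of a molecular surface (for example, from the nearest residue) is not shown and would be an additional assumption. The function name `hydropathy_profile` and the default window length are hypothetical.

```python
# Kyte-Doolittle hydropathy values (Kyte and Doolittle, 1982).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydropathy_profile(sequence, window=9):
    """Return the mean hydropathy of each full window along the sequence."""
    if window < 1:
        raise ValueError("window must be at least 1")
    scores = [KYTE_DOOLITTLE[residue] for residue in sequence]
    return [
        sum(scores[start:start + window]) / window
        for start in range(len(scores) - window + 1)
    ]
```

Positive values indicate hydrophobic stretches and negative values indicate hydrophilic stretches; sequences shorter than the window yield an empty profile.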
Provided herein are compositions and methods for identification, classification, and/or selection of substrates and/or neosubstrates of E3 ligase(s), e.g., E3 ligase(s) described herein.
In some cases, the methods described herein comprise providing a set of molecular surface features, e.g., as described herein, of one or more protein(s). In some cases, the set of molecular surface features describes a protein surface. In some cases, the set of molecular surface features describes a space complementary to a protein surface.
In some cases, the methods described herein comprise providing a set of molecular surface features (e.g., molecular surface features described herein) of E3 ligase substrate receptor protein(s). In some cases, the molecular surface features are those of the E3 ligase substrate receptor protein in an unbound state (e.g., an E3 ligase “surface”). In some cases, the molecular surface features are those of the E3 ligase substrate receptor protein in a bound state (e.g., an E3 ligase “neosurface”).
In some cases, the methods described herein comprise providing a first set of molecular surface features, e.g., molecular surface features described herein, derived from a set of proteins having degron(s) of an E3 ligase (e.g., an E3 ligase substrate receptor protein) and/or predicted to have degron(s) of the E3 ligase (e.g., the E3 ligase substrate receptor protein), e.g., degron(s) described herein.
In some cases, the E3 ligase substrate receptor protein is Cereblon (CRBN; e.g., human CRBN), or a variant, derivative, ortholog, or homolog thereof, e.g., an enzymatically active variant, derivative, ortholog, or homolog thereof, e.g., as described herein, and the degron is a G-loop degron, e.g., as described herein.
In some cases, the E3 ligase substrate receptor protein is BTRC (e.g., human BTRC, e.g., SEQ ID NO: 40), or a variant, derivative, ortholog, or homolog thereof, e.g., an enzymatically active variant, derivative, ortholog, or homolog thereof, and the degron comprises or consists of the amino acid motif D-Z-G-X-Z, D-Z-G-X-X-Z, D-Z-G-X-X-X-Z, or D-Z-G-X-X-X-X-Z, wherein D is aspartic acid, each X is independently any naturally occurring amino acid, and Z is selected from the group consisting of pS (phosphorylated serine), aspartic acid, and glutamic acid.
In some cases, the E3 ligase substrate receptor protein is KEAP1 (e.g., human KEAP1, e.g., SEQ ID NO: 18), or a variant, derivative, ortholog, or homolog thereof, e.g., an enzymatically active variant, derivative, ortholog, or homolog thereof, and the degron comprises or consists of the amino acid motif X1-X2-X3-X4-X5-X6, wherein X1 is selected from the group consisting of aspartic acid, asparagine, and serine; X2 is any one of the naturally occurring amino acids; X3 is selected from the group consisting of aspartic acid, glutamic acid, and serine; X4 is selected from the group consisting of threonine, asparagine, and serine; X5 is glycine; and X6 is glutamic acid.
In some cases, the E3 ligase substrate receptor protein is KEAP1 (e.g., human KEAP1, e.g., SEQ ID NO: 18), or a variant, derivative, ortholog, or homolog thereof, e.g., an enzymatically active variant, derivative, ortholog, or homolog thereof, and the degron comprises or consists of the amino acid motif X1-X2-X3-X4-X5-X6-X7-X8-X9, wherein X1 is leucine; X2 is any one of the naturally occurring amino acids; X3 is any one of the naturally occurring amino acids; X4 is glutamine; X5 is aspartic acid; X6 is any one of the naturally occurring amino acids; X7 is aspartic acid; X8 is leucine; and X9 is glycine.
In some cases, the E3 ligase substrate receptor protein is KEAP1 (e.g., human KEAP1, e.g., SEQ ID NO: 18), or a variant, derivative, ortholog, or homolog thereof, e.g., an enzymatically active variant, derivative, ortholog, or homolog thereof, and the degron comprises or consists of the amino acid motif ETGE (SEQ ID NO: 1) and/or DLG.
In some cases, the E3 ligase substrate receptor protein is MDM2 (e.g., human MDM2, e.g., SEQ ID NO: 26), or a variant, derivative, ortholog, or homolog thereof, e.g., an enzymatically active variant, derivative, ortholog, or homolog thereof, and the degron comprises or consists of the amino acid motif X1-X2-X3-X4-X5-X6-X7-X8, wherein X1 is phenylalanine; X2 is any one of the naturally occurring amino acids; X3 is any one of the naturally occurring amino acids; X4 is any one of the naturally occurring amino acids; X5 is tryptophan; X6 is any one of the naturally occurring amino acids; X7 is any one of the naturally occurring amino acids; and X8 is selected from the group consisting of valine, isoleucine, and leucine.
In some cases, the E3 ligase substrate receptor protein is MDM2 (e.g., human MDM2, e.g., SEQ ID NO: 26), or a variant, derivative, ortholog, or homolog thereof, e.g., an enzymatically active variant, derivative, ortholog, or homolog thereof, and the degron comprising or consisting of the amino acid motif X1-X2-X3-X4-X5-X6-X7-X8, wherein X1 is phenylalanine; X2 is any one of the naturally occurring amino acids; X3 is any one of the naturally occurring amino acids; X4 is any one of the naturally occurring amino acids; X5 is tryptophan; X6 is any one of the naturally occurring amino acids; X7 is any one of the naturally occurring amino acids; and X8 is selected from the group consisting of valine, isoleucine, and leucine, forms an α-helix.
In some cases, the E3 ligase substrate receptor protein is VHL (e.g., human VHL, e.g., SEQ ID NO: 9), or a variant, derivative, ortholog, or homolog thereof, e.g., an enzymatically active variant, derivative, ortholog, or homolog thereof, and the degron comprises or consists of the amino acid motif X1-X2-X3-X4-X5-X6, wherein X1 is leucine; X2 is any naturally occurring amino acid; X3 is any naturally occurring amino acid; X4 is leucine; X5 is alanine; and X6 is proline or hydroxylated proline (e.g., 4(R)-L-hydroxyproline).
In some cases, the methods described herein include providing a second set of molecular surface features derived from a second set of one or more proteins. In some cases, the one or more proteins comprise or consist of human proteins. In some cases, the one or more proteins are selected from the proteins in Table 3. In some cases, the first and second sets of proteins are mutually exclusive. In some cases, the first and second sets of proteins overlap by one or more proteins.
In some cases, the methods described herein include calculating a similarity and/or complementarity score for protein(s) of the second set. In some cases, calculating the similarity score includes comparing first and second sets of molecular surface features, e.g., the molecular surface features described herein.
In some cases, providing a first set of molecular surface features, providing a second set of molecular surface features, calculating a similarity score, and/or calculating a complementarity score is carried out using a pipeline that exploits geometric deep learning to process the molecular surface data, which lies in a non-Euclidean domain.
In some cases, the methods described herein comprise identifying predicted neosubstrate(s) of E3 ligase(s) based on a similarity and/or complementarity score, e.g., as described herein, using a geometric deep learning model trained on a set of protein-protein interactions to produce embeddings (e.g., an interaction fingerprint) that are similar for surface patches that are similar or complementary.
In some cases, the methods described herein comprise identifying predicted neosubstrate(s) of E3 ligase(s) based on a similarity and/or complementarity score, e.g., as described herein, using interaction fingerprints produced by a geometric deep learning model trained on a set of degron and/or putative degron molecular surface feature(s).
In some cases, the methods described herein comprise identifying predicted degron(s) of neosubstrate(s) of E3 ligase(s) based on similarity to a set of degrons that comprises predicted degrons identified based on interaction fingerprints produced by a geometric deep learning model trained on a set of molecular surface features complementary to the E3 ligase (e.g., an interaction fingerprint).
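By way of illustration only, once interaction fingerprints are available as fixed-length numeric vectors, candidate proteins can be ranked by how close their fingerprints lie to those of known degrons. The following is a minimal sketch, not the scoring method of the present disclosure: it assumes that similarity is measured by Euclidean distance between fingerprint vectors (smaller is more similar), and the names `best_match_distance` and `rank_candidates` are hypothetical. How the fingerprints themselves are produced by a geometric deep learning model is not shown.

```python
import math


def best_match_distance(candidate_fp, reference_fps):
    """Smallest Euclidean distance from a candidate fingerprint to any reference."""
    return min(math.dist(candidate_fp, reference) for reference in reference_fps)


def rank_candidates(candidates, reference_fps):
    """Rank candidates by distance to the nearest reference fingerprint.

    `candidates` maps a protein or patch name to its fingerprint vector;
    `reference_fps` is a non-empty list of fingerprints for known degrons
    (the first set).
    """
    scored = [
        (name, best_match_distance(fingerprint, reference_fps))
        for name, fingerprint in candidates.items()
    ]
    return sorted(scored, key=lambda item: item[1])
```

A distance threshold or a top-N cutoff applied to the ranked list could then be used to select predicted neosubstrates for testing in an assay.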
In some cases, the methods described herein comprise testing or having tested protein(s), e.g., predicted neosubstrate(s) in an E3 ligase substrate detection assay. In some cases, the assay is carried out in the absence of a binding modulator of the E3 ligase. In some cases, the assay is carried out in the presence of a binding modulator of the E3 ligase.
E3 ligase substrate detection assays are described, for example, in Liu et al., “Assays and Technologies for Developing Proteolysis Targeting Chimera Degraders,” Future Medicinal Chemistry 12(12):1155-79 (2020).
E3 ligase substrate detection assays include, for example, binding/ternary binding affinity and ternary complex formation assays used to profile, for example, ternary complex formation, population, stability, binding affinities, cooperativity, or kinetics, such as a fluorescence polarization (FP) assay, an amplified luminescent proximity homogeneous assay (ALPHA), a time-resolved fluorescence energy transfer assay (TR-FRET), isothermal titration calorimetry (ITC), surface plasmon resonance (SPR), bio-layer interferometry (BLI), nano-bioluminescence resonance energy transfer (nano-BRET), size exclusion chromatography (SEC), crystallography, co-immunoprecipitation (Co-IP), mass spectrometry (MS), and protein-fragment complementation (e.g., NanoBiT®). See, e.g., Liu et al., 2020.
E3 ligase substrate detection assays include, for example, protein ubiquitination assays. See, e.g., Liu et al., 2020.
E3 ligase substrate detection assays include, for example, target degradation assays such as immunoassays, reporter assays, mass spectrometry (MS), protein degradation-based phenotypic screening such as amplified luminescent proximity homogeneous assay (ALPHA), bio-layer interferometry (BLI), cellular thermal shift assay (CETSA), co-immunoprecipitation (Co-IP), cryogenic electron microscopy (Cryo-EM), differential scanning fluorimetry (DSF), fluorescence polarization (FP), isothermal titration calorimetry (ITC), microscale thermophoresis (MST), NanoLuc binary technology (Nano-BiT), nano-bioluminescence resonance energy transfer (BRET), surface plasmon resonance (SPR), time-resolved fluorescence energy transfer (TR-FRET), tandem ubiquitin-binding entities-amplified luminescent proximity homogeneous and enzyme-linked immunosorbent assay (TUBE-ALPHALISA), and tandem ubiquitin-binding entities-dissociation-enhanced lanthanide fluorescent immunoassay (TUBE-DELFIA). See, e.g., Liu et al., 2020.
In some cases, the E3 ligase substrate detection assay is a proximity assay. In some cases, the E3 ligase substrate detection assay is a binding assay. In some cases, the E3 ligase substrate detection assay is a degradation assay.
In some cases, the proximity assay is a homogeneous time resolved fluorescence (HTRF) assay. In some cases, the proximity assay is a quantitative proteomics assay. In some cases, the proximity assay is a biotinylation assay, e.g., a promiscuous biotinylation assay.
In some cases, the degradation assay is a High efficiency Binary Technology (HiBiT) assay.
In some cases, the degradation assay is a quantitative proteomics assay.
In some cases, the E3 ligase substrate detection assay is a yeast-2-hybrid system. See, e.g., Kohalmi et al., “Identification and Characterization of Protein Interactions Using the Yeast-2-Hybrid System,” In: Gelvin S. B., Schilperoort R. A. (eds) Plant Molecular Biology Manual. Springer, Dordrecht (1998). In some cases, the E3 ligase substrate detection assay is a yeast-3-hybrid system. See, e.g., Glass et al., “The Yeast Three-Hybrid System for Protein Interactions,” Methods Mol. Biol 1794:195-205 (2018).
In some cases, the E3 ligase substrate detection assay is a genomic construct based method, e.g., as described in Sievers et al., “Defining the Human C2H2 Zinc Finger Degrome Targeted by Thalidomide Analogs through CRBN,” Science 362(6414):eaat0572 (2018).
In some cases, the E3 ligase substrate detection assay is an indirect screen, e.g., to detect changes in gene and/or protein expression.
The polypeptide and nucleic acid sequences described herein are described using their IUPAC ambiguity codes (Table 4), unless otherwise noted.
In some cases, the polypeptide or nucleic acid sequences described herein have at least 80%, e.g., at least 85%, 90%, 95%, 98%, or 100% identity to a polypeptide or nucleic acid sequence provided herein, e.g., have up to 1%, 2%, 5%, 10%, 15%, or 20% of the residues of the sequence provided herein replaced, e.g., with conservative mutations, e.g., including or in addition to the mutations described herein.
To determine the percent identity of two nucleic acid sequences, the sequences are aligned for optimal comparison purposes (e.g., gaps can be introduced in one or both of a first and a second amino acid or nucleic acid sequence for optimal alignment and non-homologous sequences can be disregarded for comparison purposes). The length of a reference sequence aligned for comparison purposes is at least 80% of the length of the reference sequence, and in some embodiments is at least 90% or 100%. The nucleotides at corresponding amino acid positions or nucleotide positions are then compared. When a position in the first sequence is occupied by the same nucleotide as the corresponding position in the second sequence, then the molecules are identical at that position (as used herein nucleic acid “identity” is equivalent to nucleic acid “homology”). The percent identity between the two sequences is a function of the number of identical positions shared by the sequences, taking into account the number of gaps, and the length of each gap, which need to be introduced for optimal alignment of the two sequences.
Percent identity between a subject polypeptide or nucleic acid sequence (i.e., a query) and a second polypeptide or nucleic acid sequence (i.e., a target) is determined in various ways that are within the skill in the art, for instance, using publicly available computer software such as Smith-Waterman alignment (Smith, T. F. and M. S. Waterman (1981) J Mol Biol 147:195-7); “BestFit” (Smith and Waterman, Advances in Applied Mathematics, 482-489 (1981)) as incorporated into GeneMatcher Plus™, Schwartz and Dayhoff (1979) Atlas of Protein Sequence and Structure, Dayhoff, M. O., Ed, pp 353-358; BLAST program (Basic Local Alignment Search Tool; Altschul, S. F., W. Gish, et al. (1990) J Mol Biol 215: 403-10), BLAST-2, BLAST-P, BLAST-N, BLAST-X, WU-BLAST-2, ALIGN, ALIGN-2, CLUSTAL, or Megalign (DNASTAR) software. In addition, those skilled in the art can determine appropriate parameters for measuring alignment, including any algorithms needed to achieve maximal alignment over the length of the sequences being compared. In general, for target proteins or nucleic acids, the length of comparison can be any length, up to and including full length of the target (e.g., 5%, 10%, 20%, 30%, 40%, 50%, 60%, 70%, 80%, 90%, 95%, or 100%). For the purposes of the present disclosure, percent identity is relative to the full length of the query sequence.
For purposes of the present disclosure, the comparison of sequences and determination of percent identity between two sequences can be accomplished using a BLOSUM62 scoring matrix with a gap penalty of 12, a gap extension penalty of 4, and a frameshift gap penalty of 5.
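The percent identity calculation described above can be illustrated with a minimal sketch. It assumes the two sequences have already been aligned by one of the tools listed above, using whatever scoring matrix and gap penalties are chosen, and that gaps are written as "-". The function names and the gap symbol are illustrative assumptions, not part of the disclosure. Identity is reported relative to the full (ungapped) length of the query sequence.

```python
def percent_identity(aligned_query: str, aligned_target: str, gap: str = "-") -> float:
    """Percent identity relative to the full, ungapped length of the query.

    Both inputs are rows of one pairwise alignment, so they must have the
    same length. The alignment itself is assumed to come from an external tool.
    """
    if len(aligned_query) != len(aligned_target):
        raise ValueError("aligned sequences must have equal length")

    # Full query length: every query residue counts, aligned to a gap or not.
    query_length = sum(1 for q in aligned_query if q != gap)
    if query_length == 0:
        raise ValueError("query contains no residues")

    # A position is identical when the query residue equals the target residue.
    matches = sum(
        1
        for q, t in zip(aligned_query, aligned_target)
        if q != gap and q.upper() == t.upper()
    )
    return 100.0 * matches / query_length


def meets_identity_threshold(aligned_query: str, aligned_target: str,
                             threshold: float = 80.0) -> bool:
    """True when the query is at least `threshold` percent identical to the target."""
    return percent_identity(aligned_query, aligned_target) >= threshold
```

For example, a ten-residue query with one mismatch scores 90%, and a five-residue query with one residue aligned to a gap in the target scores 80%.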
Conservative substitutions typically include substitutions within the following groups: glycine, alanine; valine, isoleucine, leucine; aspartic acid, glutamic acid, asparagine, glutamine; serine, threonine; lysine, arginine; and phenylalanine, tyrosine.
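As an illustration only, the substitution groups listed above can be encoded as sets of one-letter amino acid codes and used to test whether a given replacement is conservative. The names `CONSERVATIVE_GROUPS` and `is_conservative` are hypothetical and chosen for this sketch.

```python
# One-letter codes for the groups listed above.
CONSERVATIVE_GROUPS = [
    frozenset("GA"),    # glycine, alanine
    frozenset("VIL"),   # valine, isoleucine, leucine
    frozenset("DENQ"),  # aspartic acid, glutamic acid, asparagine, glutamine
    frozenset("ST"),    # serine, threonine
    frozenset("KR"),    # lysine, arginine
    frozenset("FY"),    # phenylalanine, tyrosine
]


def is_conservative(original: str, replacement: str) -> bool:
    """True when both residues differ and fall within the same group."""
    a, b = original.upper(), replacement.upper()
    if a == b:
        return False  # identical residues are not a substitution
    return any(a in group and b in group for group in CONSERVATIVE_GROUPS)
```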
The following examples are included for illustrative purposes only and are not intended to limit the scope of the invention.
A high-level representation of protein structure, the molecular surface, displays patterns of chemical and geometric features that fingerprint a protein's modes of interactions with other biomolecules. Proteins performing similar interactions may share common fingerprints, independent of their evolutionary history. Fingerprints may be difficult to grasp by visual analysis but could be learned from large-scale datasets. MaSIF (Molecular Surface Interaction Fingerprinting) (P. Gainza et al., Deciphering interaction fingerprints from protein molecular surfaces using geometric deep learning. Nat Methods 17, 184-192 (2020)) is a conceptual framework based on a geometric deep learning (GDL) method (M. M. Bronstein, J. Bruna, Y. LeCun, A. Szlam, P. Vandergheynst, Geometric Deep Learning: Going beyond Euclidean data. IEEE Signal Processing Magazine 34, 18-42 (2017)) to capture fingerprints that drive specific biomolecular interactions.
MaSIF exploits GDL to learn interaction fingerprints in protein molecular surfaces. First, MaSIF decomposes a surface into overlapping radial patches with a fixed geodesic radius (
Within the MaSIF framework, MaSIF-search was developed (
To test this concept, a database with >100K pairs of interacting protein surface patches with high shape complementarity, as well as a set of randomly chosen surface patches to be used as non-interacting patches, was developed. Trios of protein surface patches, labeled binder, target, and random, were fed into the MaSIF-search network (
Performance on the test set shows that the descriptor Euclidean distances for interacting surface patches are much lower than those for non-interacting patches, resulting in a ROC AUC of 0.99 (
Next, MaSIF-search was used to predict the structure of known protein-protein complexes. Ideally, one would be able to predict whether two proteins interact simply by comparing their respective fingerprints, avoiding a time-consuming, systematic exploration of the 3D docking space. It was found that fingerprint descriptors can provide an initial and fast evaluation of candidate binding partners. However, better performance can be achieved by including a subsequent stage in which candidate patches (referred to as decoys), selected by the Euclidean fingerprint distance of their center points to the target patch, are rescored using fingerprints of neighboring points within the patch. Specifically, the MaSIF-search workflow entails two stages (
To benchmark MaSIF-search, a scenario was simulated where the binding site of a target protein is known, and one attempts to recapitulate the true binder of a protein among many other binders. Specifically, MaSIF-search was benchmarked on 100 bound protein complexes randomly selected from the testing set (disjoint from the training set). For each complex, the center of the interface in the target protein was selected, and then an attempt was made to recover the bound complex within the 100 binder proteins comprising the test set (
Even though MaSIF was trained only on co-crystallized protein complexes, the method was also tested in a benchmark set of 40 proteins crystallized in the unbound (apo) state. Since unbound docking is significantly more challenging, the success criteria were changed to finding the correct complex within the top-1000, top-100, and top-10, for all methods (
In order to utilize molecular surface features for the identification of degron fingerprints, a first-in-kind method was developed for identifying putative degrons based on the similarity of molecular surface features (patches).
Unlike previous approaches using molecular surface representations (see, e.g., Yin et al., “Fast Screening of Protein Surfaces Using Geometric Invariant Fingerprints,” PNAS 106(39):16622-26 (2009)), the machine learning approach does not rely on ‘handcrafted’ descriptors, i.e., manually optimized vectors that describe protein surface features. Such approaches are limited in their usefulness and application, as it is difficult to determine a priori the right set of features for a given prediction task. See, e.g., Gainza et al., “Deciphering Interaction Fingerprints from Protein Molecular Surfaces Using Geometric Deep Learning,” Nature Methods 17:184-92 (2020).
Furthermore, one of the challenges of performing machine learning on CRBN degrons is how little data is available. There are only 9 publicly available structures of 6 known degrons (IKZF1, IKZF2, SALL4, CK1a, GSPT1, ZNF692), which poses a substantial challenge for learning with any deep learning tool. Where the number of data points for training is limited, the usefulness of a machine learning algorithm trained on those data points, in order to identify similar data points, will be limited.
Here, a database of all protein surface patches recognized by E3 ligases was constructed using a modification of the MaSIF framework. The method was originally trained to minimize the Euclidean distance between the fingerprint descriptors of a binder and target, and to maximize the distance between the descriptors of target and random (i.e., trained on complementarity rather than similarity), to identify complementary surfaces (i.e., predicted protein-protein interactions). To avoid and overcome the difficulties noted above in training an algorithm to search for degrons based on similarity, the MaSIF model was not re-trained.
Rather, the algorithm was modified to perform matching of surface patches recognized by E3 ligases (that is, MaSIF was modified to search for similarity rather than complementarity), as depicted in
During the matching stage the different patches were clustered in an unsupervised fashion, providing cluster/families of proteins that display similar surface fingerprints and that can potentially engage (the same) E3 ligases, as shown in
The structurally characterized proteome was searched for similar surface patches. A target list of potential E3 substrates was assembled based on the presence of similar surface patch(es).
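The similarity search described above can be sketched as a nearest-neighbor lookup over fingerprint descriptors. This is a minimal illustration, assuming each surface patch has already been reduced to a fixed-length descriptor vector and that similarity is measured by Euclidean distance between descriptors. The function names, the distance cutoff, and the brute-force search are assumptions made for this sketch and are not taken from the disclosure.

```python
import numpy as np


def find_similar_patches(query, database, max_distance):
    """Return (index, distance) pairs for database descriptors within
    max_distance of the query descriptor, closest first."""
    query = np.asarray(query, dtype=float)
    database = np.asarray(database, dtype=float)

    # Euclidean distance from the query descriptor to every database descriptor.
    distances = np.linalg.norm(database - query, axis=1)

    hits = np.flatnonzero(distances <= max_distance)
    ordered = hits[np.argsort(distances[hits])]
    return [(int(i), float(distances[i])) for i in ordered]


def candidate_proteins(query, database, protein_ids, max_distance):
    """Assemble a target list: proteins owning at least one similar patch,
    ordered by their closest patch."""
    targets = []
    for index, _ in find_similar_patches(query, database, max_distance):
        if protein_ids[index] not in targets:
            targets.append(protein_ids[index])
    return targets
```

In practice, a proteome-scale descriptor database would likely need an indexed search structure rather than the exhaustive comparison used here.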
As a final embodiment of the fingerprint matching, structural complexes between E3 ligases and predicted substrates were docked in three-dimensional space. These docked complexes were used for the search of chemical compounds to facilitate the formation of ternary complexes.
A first-in-kind machine-learning-based approach is presented to learn features of degrons directly from the molecular surface of degron-containing proteins. Unlike the method described in Example 2, this method is trained on degron data.
As noted in Example 2, one of the challenges of performing machine learning on CRBN degrons is how little data is available. The surface-based approach described in Example 1, however, was found to be remarkably capable of learning from a small number of examples, if the training examples are increased using data augmentation, as described herein.
In this method, a protein surface, with per-vertex features (shape index, distance-dependent curvature, APBS electrostatics, hydrophobicity, and free electrons/proton donors), as well as a system of geodesic polar coordinates (angular and radial) for each decomposed patch from the surface, was used as input. The output was the same protein surface, but with each vertex assigned a single value, which is the predicted score for that surface vertex as a degron. This score was a regression value from 0 to 1.
To augment the training data set, the 6 known degrons in 9 crystal structures (PDB ids: 6UML, 6H0G, 6H0F, 5FQD, 5HXB, 6XK9, 7LPS, 7BQU, 7BQV) were used as input to identify similar surfaces, as described in Example 2, and added to the training set. For each of the input structures (either known or augmented), the structure was placed in complex with CRBN, forming a complex between the input structure and CRBN. Then, a surface was computed for both the input structure and for CRBN. The points in the surface of the input structure that belong to the buried surface area of the interface with CRBN were labeled as the degron. Points outside this buried surface area of the interface were labeled as non-degron.
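The labeling step described above can be sketched as follows. The text defines degron points as those belonging to the buried surface area of the interface with CRBN. As a simplifying assumption for this sketch, a surface vertex of the input structure is treated as buried when it lies within a distance cutoff of any CRBN surface vertex. The cutoff value and the function name are hypothetical and are not taken from the disclosure.

```python
import numpy as np


def label_degron_vertices(substrate_vertices, crbn_vertices, cutoff=3.0):
    """Label each substrate surface vertex 1 (degron) or 0 (non-degron).

    A vertex is labeled 1 when its nearest CRBN surface vertex is within
    `cutoff` (same units as the coordinates). This is a proxy for buried
    surface area, assumed for illustration.
    """
    substrate = np.asarray(substrate_vertices, dtype=float)  # shape (n, 3)
    crbn = np.asarray(crbn_vertices, dtype=float)            # shape (m, 3)

    # Pairwise distances between every substrate vertex and every CRBN vertex.
    diff = substrate[:, None, :] - crbn[None, :, :]
    distances = np.sqrt((diff ** 2).sum(axis=2))             # shape (n, m)

    nearest = distances.min(axis=1)
    return (nearest <= cutoff).astype(int)
```

The resulting per-vertex labels have the same shape as the network output described above, one value per surface vertex.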
The neural network was then trained using these labeled input structure examples (known or augmented). The input during training was a protein surface, with per-vertex features (shape index, distance-dependent curvature, APBS electrostatics, hydrophobicity, and free electrons/proton donors), as well as a system of geodesic polar coordinates (angular and radial) for each decomposed patch from the surface. In the forward pass, the surface passed through three layers of geodesic convolution, and the output layer was a sigmoid activation function (details of the architecture are shown in
The neural network was validated in multiple ways. First, multiple examples from the training set were separated into a testing set to validate the learning. In addition, several proteins identified from a yeast-3-hybrid assay (
Overall, fAIceit-degron is transformative for several reasons. First, it is capable of learning from a very small number of examples. Second, it can learn from the surface which is the best representation of structural degrons, as it is the shape of the protein that is recognized by CRBN. Finally, fAIceit-degron is generalizable to other applications and degron types.
A database of CRBN degrons was constructed using this method, although, as noted above, it can be generalized to other applications and degron types as well.
A first-in-kind method was developed for identifying putative neosubstrates through proteome-wide searches of surface complementarity to E3 ligase substrate receptors. This method allows, for the first time, efficient scanning of vast databases of proteins for neosubstrates complementary to a neosurface (e.g., of a molecular glue bound E3 ligase substrate receptor such as CRBN). The method performs up to 4000× faster than traditional docking tools.
Structural complexes between E3 ligases and predicted substrates were docked in three-dimensional space and these docked complexes were used for the search of chemical compounds to facilitate the formation of ternary complexes, as follows.
Surface fingerprints for a set of potential neosubstrates were prepared for binding to an E3 ligase substrate receptor based on complementarity using a modification of the MaSIF framework described in Example 1. Briefly, all structures available for a given gene (PDB and AlphaFold2) were processed by computing chemical features and output with extracted chains and surface features. Then MaSIF input was generated and geodesic and radial (angular) coordinates were computed for each patch. Geometric features for each patch were computed and the chemical features which were previously read as input were assigned to each vertex in the patch. MaSIF was then used to compute the interface propensity for each patch in the protein, and a fingerprint describing each patch. The fingerprint was used to compare to E3 ligase surfaces (and, in this case, neosurfaces).
Neosurface features of E3 ligase substrate receptors (including CRBN) were generated for a set of binary complexes of E3 ligase substrate receptors and small molecules, in this example, CRBN in complex with a series of molecular glues. MaSIF was modified to receive the neosurface (protein+small molecule) and generate fingerprints and angular/geodesic coordinates as for the potential neosubstrates.
Some of the neosurface fingerprints were extracted from crystal structures (in this case PDB entries) of CRBN bound to a particular molecular glue (PDB ids: 6UML, 6H0G, 6H0F, 5HXB, 6XK9, 7LPS, 7BQU, 7BQV). Some of the neosurface fingerprints were generated by docking molecular glues to CRBN in silico.
MaSIF, as originally implemented, is unable to generate molecular surface fingerprints for these small molecules or binary complexes. To overcome this deficiency, new code was developed to process this type of biomolecule to compute the features of the entire neosurface, making no distinction between protein and small molecule, and assigning all small molecules the hydrophobicity of tyrosine. Neosurfaces were then processed by computing chemical features, as for neosubstrates, and MaSIF input was generated as described above and fingerprints were generated and compared to neosubstrate surfaces.
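The hydrophobicity assignment described above can be sketched as follows. The sketch assumes that each surface vertex has already been mapped to the residue name of its nearest atom, and that hydrophobicity is taken from the Kyte-Doolittle hydropathy scale. The text does not state which scale was used, so the scale is an assumption, as are the function name and the ligand residue name "LIG". Vertices belonging to a small molecule receive the tyrosine value, so protein and ligand are handled uniformly on the neosurface.

```python
# Kyte-Doolittle hydropathy values (assumed scale for this sketch).
KYTE_DOOLITTLE = {
    "ALA": 1.8, "ARG": -4.5, "ASN": -3.5, "ASP": -3.5, "CYS": 2.5,
    "GLN": -3.5, "GLU": -3.5, "GLY": -0.4, "HIS": -3.2, "ILE": 4.5,
    "LEU": 3.8, "LYS": -3.9, "MET": 1.9, "PHE": 2.8, "PRO": -1.6,
    "SER": -0.8, "THR": -0.7, "TRP": -0.9, "TYR": -1.3, "VAL": 4.2,
}

# All small-molecule vertices are given the hydrophobicity of tyrosine.
LIGAND_HYDROPHOBICITY = KYTE_DOOLITTLE["TYR"]


def vertex_hydrophobicity(vertex_residue_names, ligand_names):
    """Return one hydrophobicity value per neosurface vertex.

    vertex_residue_names: residue name of the atom nearest each vertex.
    ligand_names: residue names that identify small molecules.
    """
    values = []
    for name in vertex_residue_names:
        if name in ligand_names:
            values.append(LIGAND_HYDROPHOBICITY)
        else:
            # Unknown residue names raise KeyError rather than being guessed.
            values.append(KYTE_DOOLITTLE[name])
    return values
```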
The fAIceit-complementarity method allows, for the first time, proteome-wide searches of surface complementarity, e.g., to E3 ligase substrate receptor proteins such as CRBN, and the scanning of vast databases of proteins for neosubstrates complementary to a neosurface.
The fingerprints describing the E3 ligase neosurfaces were matched to the neosubstrate surfaces and, for those under a threshold Euclidean distance, a plurality of alignments was generated, scored, and filtered to identify potential degrons.
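The text does not specify how the alignments were generated or scored. As a stand-in for illustration only, the sketch below superimposes matched surface points from a candidate neosubstrate patch onto the corresponding neosurface points using the Kabsch algorithm, scores each alignment by root-mean-square deviation (RMSD), and keeps alignments under a cutoff. The point correspondences, the RMSD cutoff, and all function names are assumptions.

```python
import numpy as np


def kabsch_align(mobile, reference):
    """Rigid transform (R, t) minimizing RMSD between paired 3D points,
    so that R @ p + t maps each mobile point onto its reference point."""
    mobile = np.asarray(mobile, dtype=float)
    reference = np.asarray(reference, dtype=float)
    mobile_center = mobile.mean(axis=0)
    reference_center = reference.mean(axis=0)

    # Covariance of the centered point sets, then SVD.
    covariance = (mobile - mobile_center).T @ (reference - reference_center)
    u, _, vt = np.linalg.svd(covariance)

    # Correct for reflection so the result is a proper rotation.
    sign = 1.0 if np.linalg.det(vt.T @ u.T) >= 0 else -1.0
    rotation = vt.T @ np.diag([1.0, 1.0, sign]) @ u.T
    translation = reference_center - rotation @ mobile_center
    return rotation, translation


def alignment_rmsd(mobile, reference):
    """RMSD between the reference and the optimally superimposed mobile points."""
    mobile = np.asarray(mobile, dtype=float)
    reference = np.asarray(reference, dtype=float)
    rotation, translation = kabsch_align(mobile, reference)
    moved = (rotation @ mobile.T).T + translation
    return float(np.sqrt(((moved - reference) ** 2).sum(axis=1).mean()))


def filter_alignments(candidates, reference, max_rmsd):
    """Score candidate point sets and keep (index, rmsd) under max_rmsd, best first."""
    scored = [(i, alignment_rmsd(c, reference)) for i, c in enumerate(candidates)]
    return sorted([s for s in scored if s[1] <= max_rmsd], key=lambda s: s[1])
```

A fuller implementation would also score the fingerprints of neighboring points within each patch, as in the second stage of the MaSIF-search workflow described in Example 1.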
Global docking using MaSIF_search with apo-CRBN (i.e., CRBN without a small molecule bound) or holo-CRBN (i.e., CRBN with a small molecule bound) was carried out against the structurally characterized proteome to identify potential targets for an E3 ligase complex. An example of a protein surface is depicted in
Global docking using MaSIF_search of holo-CRBN was carried out against the structurally characterized proteome. To generate a holo-CRBN for use in this method, a small molecule E3 ligase binding modulator was parameterized and included in the E3 ligase structures. Predicted complexes of potential targets docked to holo-E3 ligase were identified.
Testing distinct ligand descriptors based on geometry, chemistry and different structural representations was carried out. Generic training/test sets for small molecule-protein interactions were created and/or identified (e.g., PDBbind database) and processed for compatibility with MaSIF.
Training of MaSIF-ligand for the identification of complementary ligands in drug-receptors was carried out. Structural descriptors and learning approaches for capturing the interactions of the small molecules with the proteins' surface patches were identified. The performance of MaSIF-ligand was evaluated by its ability to identify the correct ligands or ligand fragments for their respective pockets.
A generative pipeline of ligands for E3-substrate-compound ternary complexes was created, stemming only from the surface signature of a given target. Approaches like variational autoencoders can be used. MaSIF-ligand was explicitly tested with E3 ligase ternary pairs to score existing ligands and to generate ligands.
Predicted E3 ligase target ligands were identified.
Putative neosubstrates of CRBN were identified using the methods described in Examples 2-4.
Yeast three-hybrid experiments were carried out to identify molecular glue-induced interactions between CRBN and cDNA library-derived targets, as depicted in
As shown in
As shown in
Putative neosubstrates of CRBN were identified using the methods described in Example 3. The CRBN neosurface was used to find novel substrates (e.g., as depicted in
sapiens OX = 9606 GN = VHL PE = 1 SV = 2
sapiens OX = 9606 GN = NAIP PE = 1 SV = 3
sapiens OX = 9606 GN = BIRC2 PE = 1 SV = 2
sapiens OX = 9606 GN = BIRC3 PE = 1 SV = 2
sapiens OX = 9606 GN = BIRC5 PE = 1 SV = 3
sapiens OX = 9606 GN = BIRC6 PE = 1 SV = 2
sapiens OX = 9606 GN = BIRC7 PE = 1 SV = 2
sapiens OX = 9606 GN = BIRC8 PE = 1 SV = 2
sapiens OX = 9606 GN = RNF4
sapiens OX = 9606 GN = SPSB1 PE = 1 SV = 1
sapiens OX = 9606 GN = SPSB2 PE = 1 SV = 1
sapiens OX = 9606 GN = SPSB4 PE = 1 SV = 1
sapiens OX = 9606 GN = ITCH PE = 1 SV = 2
sapiens OX = 9606 GN = TRIM24 PE = 1 SV = 3
sapiens OX = 9606 GN = GID4 PE = 1 SV = 1
It is to be understood that while the invention has been described in conjunction with the detailed description thereof, the foregoing description is intended to illustrate and not limit the scope of the invention, which is defined by the scope of the appended claims. Other aspects, advantages, and modifications are within the scope of the following claims.
This application claims the benefit of U.S. Provisional Application Ser. No. 63/280,508, filed on Nov. 17, 2021, and U.S. Provisional Application Ser. No. 63/419,550, filed on Oct. 26, 2022. The entire contents of the foregoing are incorporated herein by reference.
Filing Document | Filing Date | Country | Kind |
---|---|---|---|
PCT/US2022/050242 | 11/17/2022 | WO |
Number | Date | Country | |
---|---|---|---|
63280508 | Nov 2021 | US | |
63419550 | Oct 2022 | US |