The Sequence Listing written in file 048536-593001WO Sequence Listing_ST25.txt, created Jul. 12, 2018, 7,148 bytes, machine format IBM-PC, MS Windows operating system, is hereby incorporated by reference.
Many natural proteins contain precisely oriented cofactors that enable their functions, yet the de novo design of proteins that bind cofactors with atomic-scale precision has remained a significant challenge. De novo protein design critically tests our understanding of protein folding and function, and can provide new frameworks that combine man-made materials with protein scaffolds. Highly accurate design of porphyrin-binding proteins, validated by high-resolution structure determination, has presented a major unsolved challenge. Disclosed herein, inter alia, are solutions to these and other problems in the art.
In an aspect is provided a computer-implemented method, including: a) identifying a set of ligand binding amino acid residues within a protein for binding to a ligand, wherein each ligand binding amino acid residue within the protein is associated with a set of ligand binding amino acid residue atomic coordinates and each atom of the ligand is associated with a set of ligand atomic coordinates; b) identifying a set of core amino acid residues within the protein that do not bind to the ligand, each core amino acid residue within the protein is associated with a set of core amino acid residue atomic coordinates; and c) optimizing the set of ligand binding amino acid residues; the set of ligand binding amino acid residue atomic coordinates; the set of core amino acid residues; and the set of core amino acid residue atomic coordinates; wherein the optimization is performed using at least an energy minimization calculation, and wherein the optimization is performed to energetically stabilize the protein.
In an aspect is provided a system, including: at least one data processor; and at least one memory storing instructions which, when executed by the at least one data processor, result in operations including: a) identifying a set of ligand binding amino acid residues within a protein for binding to a ligand, wherein each ligand binding amino acid residue within the protein is associated with a set of ligand binding amino acid residue atomic coordinates and each atom of the ligand is associated with a set of ligand atomic coordinates; b) identifying a set of core amino acid residues within the protein that do not bind to the ligand, each core amino acid residue within the protein is associated with a set of core amino acid residue atomic coordinates; and c) optimizing the set of ligand binding amino acid residues; the set of ligand binding amino acid residue atomic coordinates; the set of core amino acid residues; and the set of core amino acid residue atomic coordinates; wherein the optimization is performed using at least an energy minimization calculation, and wherein the optimization is performed to energetically stabilize the protein.
In another aspect is provided a non-transitory computer-readable storage medium including program code, which when executed by at least one data processor, causes operations including: a) identifying a set of ligand binding amino acid residues within a protein for binding to a ligand, wherein each ligand binding amino acid residue within the protein is associated with a set of ligand binding amino acid residue atomic coordinates and each atom of the ligand is associated with a set of ligand atomic coordinates; b) identifying a set of core amino acid residues within the protein that do not bind to the ligand, each core amino acid residue within the protein is associated with a set of core amino acid residue atomic coordinates; and c) optimizing the set of ligand binding amino acid residues; the set of ligand binding amino acid residue atomic coordinates; the set of core amino acid residues; and the set of core amino acid residue atomic coordinates; wherein the optimization is performed using at least an energy minimization calculation, and wherein the optimization is performed to energetically stabilize the protein.
In an aspect is provided a protein sequence obtainable based on the energy minimization calculation using the method, the system, or the non-transitory computer-readable medium as described herein.
In an aspect is provided a protein, or conservatively modified variant thereof, having the sequence:
Protein catalysis requires atomic-level orchestration of side chains, substrates, and cofactors, yet the ability to design a small-molecule-binding protein entirely from first principles with a precisely predetermined structure has not been demonstrated. Herein we describe a novel protein, PS1, which binds a highly electron-deficient, non-natural porphyrin at temperatures up to 100° C. The high-resolution structure of holo-PS1 is in sub-Å agreement with the design. The structure of apo-PS1 retains the remote core packing of the holo form, predisposing a flexible binding region for the desired ligand-binding geometry. Our results illustrate the unification of core packing and binding site definition as a central principle of ligand-binding protein design.
“Analog,” or “analogue” is used in accordance with its plain ordinary meaning within Chemistry and Biology and refers to a chemical compound that is structurally similar to another compound (i.e., a so-called “reference” compound) but differs in composition, e.g., in the replacement of one atom by an atom of a different element, or in the presence of a particular functional group, or the replacement of one functional group by another functional group, or the absolute stereochemistry of one or more chiral centers of the reference compound. Accordingly, an analog is a compound that is similar or comparable in function and appearance but not in structure or origin to a reference compound.
The terms “a” or “an,” as used herein, mean one or more. In addition, the phrase “substituted with a[n],” as used herein, means the specified group may be substituted with one or more of any or all of the named substituents. For example, where a group, such as an alkyl or heteroaryl group, is “substituted with an unsubstituted C1-C20 alkyl, or unsubstituted 2 to 20 membered heteroalkyl,” the group may contain one or more unsubstituted C1-C20 alkyls, and/or one or more unsubstituted 2 to 20 membered heteroalkyls.
A “detectable agent” or “detectable moiety” is a composition detectable by appropriate means such as spectroscopic, photochemical, biochemical, immunochemical, chemical, magnetic resonance imaging, or other physical means. For example, useful detectable agents include 18F, 32P, 33P, 45Ti, 47Sc, 52Fe, 59Fe, 62Cu, 64Cu, 67Cu, 67Ga, 68Ga, 77As, 86Y, 90Y, 89Sr, 89Zr, 94Tc, 94Tc, 99mTc, 99Mo, 105Pd, 105Rh, 111Ag, 111In, 123I, 124I, 125I, 131I, 142Pr, 143Pr, 149Pm, 153Sm, 154-158Gd, 161Tb, 166Dy, 166Ho, 169Er, 175Lu, 177Lu, 186Re, 188Re, 189Re, 194Ir, 198Au, 199Au, 211At, 211Pb, 212Bi, 212Pb, 213Bi, 223Ra, 225Ac, Cr, V, Mn, Fe, Co, Ni, Cu, La, Ce, Pr, Nd, Pm, Sm, Eu, Gd, Tb, Dy, Ho, Er, Tm, Yb, Lu, 32P, fluorophore (e.g. fluorescent dyes or chromophores), phosphor (e.g., phosphorescent dyes or chromophores), lumophore (luminescent dyes or chromophores), electron-dense reagents, enzymes (e.g., as commonly used in an ELISA), biotin, digoxigenin, paramagnetic molecules, paramagnetic nanoparticles, ultrasmall superparamagnetic iron oxide (“USPIO”) nanoparticles, USPIO nanoparticle aggregates, superparamagnetic iron oxide (“SPIO”) nanoparticles, SPIO nanoparticle aggregates, monocrystalline iron oxide nanoparticles, monocrystalline iron oxide, nanoparticle contrast agents, liposomes or other delivery vehicles containing Gadolinium chelate (“Gd-chelate”) molecules, Gadolinium, radioisotopes, radionuclides (e.g. carbon-11, nitrogen-13, oxygen-15, fluorine-18, rubidium-82), fluorodeoxyglucose (e.g. fluorine-18 labeled), any gamma ray emitting radionuclides, positron-emitting radionuclide, radiolabeled glucose, radiolabeled water, radiolabeled ammonia, biocolloids, microbubbles (e.g. including microbubble shells including albumin, galactose, lipid, and/or polymers; microbubble gas core including air, heavy gas(es), perfluorocarbon, nitrogen, octafluoropropane, perflexane lipid microsphere, perflutren, etc.), iodinated contrast agents (e.g. iohexol, iodixanol, ioversol, iopamidol, ioxilan, iopromide, diatrizoate, metrizoate, ioxaglate), barium sulfate, thorium dioxide, gold, gold nanoparticles, gold nanoparticle aggregates, two-photon fluorophores, hyperpolarizable chromophores, or haptens and proteins or other entities which can be made detectable, e.g., by incorporating a radiolabel into a peptide or antibody specifically reactive with a target peptide. A detectable moiety is a monovalent detectable agent or a detectable agent capable of forming a bond with another composition.
Radioactive substances (e.g., radioisotopes) that may be used as imaging and/or labeling agents in accordance with the embodiments of the disclosure include, but are not limited to, 18F, 32P, 33P, 45Ti, 47Sc, 52Fe, 59Fe, 62Cu, 64Cu, 67Cu, 67Ga, 68Ga, 77As, 86Y, 90Y, 89Sr, 89Zr, 94Tc, 94Tc, 99mTc, 99Mo, 105Pd, 105Rh, 111Ag, 111In, 123I, 124I, 125I, 131I, 142Pr, 143Pr, 149Pm, 153Sm, 154-158Gd, 161Tb, 166Dy, 166Ho, 169Er, 175Lu, 177Lu, 186Re, 188Re, 189Re, 194Ir, 198Au, 199Au, 211At, 211Pb, 212Bi, 212Pb, 223Ra, and 225Ac. Paramagnetic ions that may be used as additional imaging agents in accordance with the embodiments of the disclosure include, but are not limited to, ions of transition and lanthanide metals (e.g. metals having atomic numbers of 21-29, 42, 43, 44, or 57-71). These metals include ions of Cr, V, Mn, Fe, Co, Ni, Cu, La, Ce, Pr, Nd, Pm, Sm, Eu, Gd, Tb, Dy, Ho, Er, Tm, Yb and Lu.
An amino acid residue in a protein “corresponds” to a given residue when it occupies the same essential structural position within the protein as the given residue.
The term “isolated” when applied to a nucleic acid or protein denotes that the nucleic acid or protein is essentially free of other cellular components with which it is associated in the natural state. It can be, for example, in a homogeneous state and may be in either a dry or aqueous solution. Purity and homogeneity are typically determined using analytical chemistry techniques such as polyacrylamide gel electrophoresis or high performance liquid chromatography. A protein that is the predominant species present in a preparation is substantially purified.
The term “amino acid” refers to naturally occurring and synthetic amino acids, as well as amino acid analogs and amino acid mimetics that function in a manner similar to the naturally occurring amino acids. Naturally occurring amino acids are those encoded by the genetic code, as well as those amino acids that are later modified, e.g., hydroxyproline, γ-carboxyglutamate, and O-phosphoserine. Amino acid analogs refer to compounds that have the same basic chemical structure as a naturally occurring amino acid, i.e., an α carbon that is bound to a hydrogen, a carboxyl group, an amino group, and an R group, e.g., homoserine, norleucine, methionine sulfoxide, methionine methyl sulfonium. Such analogs have modified R groups (e.g., norleucine) or modified peptide backbones, but retain the same basic chemical structure as a naturally occurring amino acid. Amino acid mimetics refer to chemical compounds that have a structure that is different from the general chemical structure of an amino acid, but that function in a manner similar to a naturally occurring amino acid. The terms “non-naturally occurring amino acid” and “unnatural amino acid” refer to amino acid analogs, synthetic amino acids, and amino acid mimetics, which are not found in nature.
Amino acids may be referred to herein by either their commonly known three letter symbols or by the one-letter symbols recommended by the IUPAC-IUB Biochemical Nomenclature Commission. Nucleotides, likewise, may be referred to by their commonly accepted single-letter codes.
The terms “polypeptide,” “peptide” and “protein” are used interchangeably herein to refer to a polymer of amino acid residues, wherein the polymer may in embodiments be conjugated to a moiety that does not consist of amino acids. The terms apply to amino acid polymers in which one or more amino acid residue is an artificial chemical mimetic of a corresponding naturally occurring amino acid, as well as to naturally occurring amino acid polymers and non-naturally occurring amino acid polymers. A “fusion protein” refers to a chimeric protein encoding two or more separate protein sequences that are recombinantly expressed as a single moiety. In embodiments, the protein includes at least 30 amino acid residues. A protein may be characterized as having a protein backbone. A “protein backbone” is used herein in accordance with its ordinary meaning and refers to the polymer of amino acid residues that create a continuous chain. For example, a protein backbone may refer to the series of amino acid residues covalently linked together, e.g.,
wherein each R independently represents optionally different amino acid side chains. In embodiments, the protein backbone includes core amino acid residues and ligand binding amino acid residues. In embodiments, the protein backbone includes core amino acid residues. In embodiments, the protein backbone includes ligand binding amino acid residues.
As may be used herein, the terms “nucleic acid,” “nucleic acid molecule,” “nucleic acid oligomer,” “oligonucleotide,” “nucleic acid sequence,” “nucleic acid fragment” and “polynucleotide” are used interchangeably and are intended to include, but are not limited to, a polymeric form of nucleotides covalently linked together that may have various lengths, either deoxyribonucleotides or ribonucleotides, or analogs, derivatives or modifications thereof. Different polynucleotides may have different three-dimensional structures, and may perform various functions, known or unknown. Non-limiting examples of polynucleotides include a gene, a gene fragment, an exon, an intron, intergenic DNA (including, without limitation, heterochromatic DNA), messenger RNA (mRNA), transfer RNA, ribosomal RNA, a ribozyme, cDNA, a recombinant polynucleotide, a branched polynucleotide, a plasmid, a vector, isolated DNA of a sequence, isolated RNA of a sequence, a nucleic acid probe, and a primer. Polynucleotides useful in the methods of the disclosure may comprise natural nucleic acid sequences and variants thereof, artificial nucleic acid sequences, or a combination of such sequences.
A polynucleotide is typically composed of a specific sequence of four nucleotide bases: adenine (A); cytosine (C); guanine (G); and thymine (T) (uracil (U) for thymine (T) when the polynucleotide is RNA). Thus, the term “polynucleotide sequence” is the alphabetical representation of a polynucleotide molecule; alternatively, the term may be applied to the polynucleotide molecule itself. This alphabetical representation can be input into databases in a computer having a central processing unit and used for bioinformatics applications such as functional genomics and homology searching. Polynucleotides may optionally include one or more non-standard nucleotide(s), nucleotide analog(s) and/or modified nucleotides.
“Conservatively modified variants” applies to both amino acid and nucleic acid sequences. With respect to particular nucleic acid sequences, “conservatively modified variants” refers to those nucleic acids that encode identical or essentially identical amino acid sequences. Because of the degeneracy of the genetic code, a number of nucleic acid sequences will encode any given protein. For instance, the codons GCA, GCC, GCG and GCU all encode the amino acid alanine. Thus, at every position where an alanine is specified by a codon, the codon can be altered to any of the corresponding codons described without altering the encoded polypeptide. Such nucleic acid variations are “silent variations,” which are one species of conservatively modified variations. Every nucleic acid sequence herein which encodes a polypeptide also describes every possible silent variation of the nucleic acid. One of skill will recognize that each codon in a nucleic acid (except AUG, which is ordinarily the only codon for methionine, and UGG, which is ordinarily the only codon for tryptophan) can be modified to yield a functionally identical molecule. Accordingly, each silent variation of a nucleic acid which encodes a polypeptide is implicit in each described sequence.
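As an illustration of silent variation, the following minimal sketch enumerates the RNA sequences that encode a short peptide. The codon table is deliberately partial (alanine, methionine, and tryptophan only), and the function name `silent_variants` and the example peptide are assumptions made for illustration; they are not part of the disclosure.

```python
from itertools import product

# Partial standard genetic code (RNA codons); only the residues used below.
CODONS = {
    "A": ["GCA", "GCC", "GCG", "GCU"],  # alanine: four synonymous codons
    "M": ["AUG"],                        # methionine: ordinarily a single codon
    "W": ["UGG"],                        # tryptophan: ordinarily a single codon
}

def silent_variants(peptide):
    """Return every RNA sequence that encodes the given peptide."""
    choices = [CODONS[residue] for residue in peptide]
    return ["".join(codons) for codons in product(*choices)]

# Hypothetical tripeptide Met-Ala-Trp: only the alanine codon can vary.
variants = silent_variants("MAW")
```

Because methionine and tryptophan each have one codon in this table, the number of variants equals the number of alanine codons.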
As to amino acid sequences, one of skill will recognize that individual substitutions, deletions or additions to a nucleic acid, peptide, polypeptide, or protein sequence which alters, adds or deletes a single amino acid or a small percentage of amino acids in the encoded sequence is a “conservatively modified variant” where the alteration results in the substitution of an amino acid with a chemically similar amino acid. Conservative substitution tables providing functionally similar amino acids are well known in the art. Such conservatively modified variants are in addition to and do not exclude polymorphic variants, interspecies homologs, and alleles of the disclosure.
The following eight groups each contain amino acids that are conservative substitutions for one another: 1) Alanine (A), Glycine (G); 2) Aspartic acid (D), Glutamic acid (E); 3) Asparagine (N), Glutamine (Q); 4) Arginine (R), Lysine (K); 5) Isoleucine (I), Leucine (L), Methionine (M), Valine (V); 6) Phenylalanine (F), Tyrosine (Y), Tryptophan (W); 7) Serine (S), Threonine (T); and 8) Cysteine (C), Methionine (M) (see, e.g., Creighton, Proteins (1984)).
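The eight groups above can be encoded directly as sets. The sketch below is one possible encoding, not a prescribed implementation; the function name `is_conservative` is hypothetical, and treating an unchanged residue as trivially conservative is an assumption made here for convenience.

```python
# The eight groups listed above, by one-letter code.
CONSERVATIVE_GROUPS = [
    set("AG"), set("DE"), set("NQ"), set("RK"),
    set("ILMV"), set("FYW"), set("ST"), set("CM"),
]

def is_conservative(a, b):
    """True if residues a and b are identical or share at least one group."""
    if a == b:
        return True  # assumption: an unchanged residue counts as conservative
    return any(a in group and b in group for group in CONSERVATIVE_GROUPS)

# Methionine belongs to two groups, so it pairs with both Leu and Cys,
# although Leu and Cys are not conservative substitutions for one another.
met_leu = is_conservative("M", "L")
met_cys = is_conservative("M", "C")
leu_cys = is_conservative("L", "C")
```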
“Percentage of sequence identity” is determined by comparing two optimally aligned sequences over a comparison window, wherein the portion of the polynucleotide or polypeptide sequence in the comparison window may comprise additions or deletions (i.e., gaps) as compared to the reference sequence (which does not comprise additions or deletions) for optimal alignment of the two sequences. The percentage is calculated by determining the number of positions at which the identical nucleic acid base or amino acid residue occurs in both sequences to yield the number of matched positions, dividing the number of matched positions by the total number of positions in the window of comparison and multiplying the result by 100 to yield the percentage of sequence identity. The terms “identical” or percent “identity,” in the context of two or more nucleic acids or polypeptide sequences, refer to two or more sequences or subsequences that are the same or have a specified percentage of amino acid residues or nucleotides that are the same (i.e., about 60% identity, preferably 65%, 70%, 75%, 80%, 85%, 90%, 91%, 92%, 93%, 94%, 95%, 96%, 97%, 98%, 99%, or higher identity over a specified region, when compared and aligned for maximum correspondence over a comparison window or designated region) as measured using the BLAST or BLAST 2.0 sequence comparison algorithms with default parameters described below, or by manual alignment and visual inspection (see, e.g., NCBI web site http://www.ncbi.nlm.nih.gov/BLAST/ or the like). Such sequences are then said to be “substantially identical.” This definition also refers to, or may be applied to, the complement of a test sequence. The definition also includes sequences that have deletions and/or additions, as well as those that have substitutions. As described below, the preferred algorithms can account for gaps and the like. Preferably, identity exists over a region that is at least about 25 amino acids or nucleotides in length, or more preferably over a region that is 50-100 amino acids or nucleotides in length.
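The percentage calculation described above (matched positions divided by the total positions in the comparison window, multiplied by 100) can be sketched as follows. The sketch assumes the two sequences have already been aligned, for example by BLAST, and that gaps are written as `-`; the function name and the example sequences are hypothetical.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percent identity of two pre-aligned, equal-length sequences."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    if not aligned_a:
        raise ValueError("comparison window is empty")
    # A position matches only if both sequences carry the same non-gap symbol.
    matches = sum(
        1 for x, y in zip(aligned_a, aligned_b) if x == y and x != gap
    )
    # Denominator: total number of positions in the comparison window.
    return 100.0 * matches / len(aligned_a)

# Ten-position window with one gap and one substitution: 8 matches of 10.
identity = percent_identity("ACDEFGHIKL", "ACDE-GHIRL")
```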
An amino acid or nucleotide base “position” is denoted by a number that sequentially identifies each amino acid (or nucleotide base) in the reference sequence based on its position relative to the N-terminus (or 5′-end). Due to deletions, insertions, truncations, fusions, and the like that must be taken into account when determining an optimal alignment, in general the amino acid residue number in a test sequence determined by simply counting from the N-terminus will not necessarily be the same as the number of its corresponding position in the reference sequence. For example, in a case where a variant has a deletion relative to an aligned reference sequence, there will be no amino acid in the variant that corresponds to a position in the reference sequence at the site of deletion. Where there is an insertion in an aligned reference sequence, that insertion will not correspond to a numbered amino acid position in the reference sequence. In the case of truncations or fusions there can be stretches of amino acids in either the reference or aligned sequence that do not correspond to any amino acid in the corresponding sequence.
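The numbering behavior described above can be made concrete with a small sketch that walks a pairwise alignment and records, for each residue of the test sequence, the reference position it corresponds to. The alignment strings, the gap symbol `-`, and the function name `map_positions` are assumptions for illustration only.

```python
def map_positions(aligned_ref, aligned_test, gap="-"):
    """Map 1-based test-sequence positions to reference positions (or None)."""
    mapping = {}
    ref_pos = test_pos = 0
    for r, t in zip(aligned_ref, aligned_test):
        if r != gap:
            ref_pos += 1
        if t != gap:
            test_pos += 1
            # An insertion in the test sequence has no reference position.
            mapping[test_pos] = ref_pos if r != gap else None
    return mapping

# Hypothetical reference "MKTAYL" aligned to a variant that lacks T
# (a deletion) and gains G (an insertion).
mapping = map_positions("MKTAY-L", "MK-AYGL")
```

In this example the third residue of the variant corresponds to reference position 4, no variant residue corresponds to reference position 3, and the inserted G has no reference number.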
The terms “numbered with reference to” or “corresponding to,” when used in the context of the numbering of a given amino acid or polynucleotide sequence, refers to the numbering of the residues of a specified reference sequence when the given amino acid or polynucleotide sequence is compared to the reference sequence.
The term “amino acid side chain” refers to the functional substituent contained on amino acids. For example, an amino acid side chain may be the side chain of a naturally occurring amino acid. Naturally occurring amino acids are those encoded by the genetic code (e.g., alanine, arginine, asparagine, aspartic acid, cysteine, glutamine, glutamic acid, glycine, histidine, isoleucine, leucine, lysine, methionine, phenylalanine, proline, serine, threonine, tryptophan, tyrosine, or valine), as well as those amino acids that are later modified, e.g., hydroxyproline, γ-carboxyglutamate, and O-phosphoserine. In embodiments, the amino acid side chain may be a non-natural amino acid side chain. In embodiments, the amino acid side chain is
wherein the symbol “” corresponds to the attachment of a chemical moiety (e.g., side chain) to the remainder of a molecule or chemical formula (e.g., the amino acid core, or
The term “non-natural amino acid side chain” refers to the functional substituent of compounds that have the same basic chemical structure as a naturally occurring amino acid, i.e., an α carbon that is bound to a hydrogen, a carboxyl group, an amino group, and an R group, e.g., homoserine, norleucine, methionine sulfoxide, methionine methyl sulfonium, allylalanine, 2-aminoisobutyric acid. Non-natural amino acids are non-proteinogenic amino acids that either occur naturally or are chemically synthesized. Such analogs have modified R groups (e.g., norleucine) or modified peptide backbones, but retain the same basic chemical structure as a naturally occurring amino acid. Non-limiting examples include exo-cis-3-aminobicyclo[2.2.1]hept-5-ene-2-carboxylic acid hydrochloride, cis-2-aminocycloheptanecarboxylic acid hydrochloride, cis-6-amino-3-cyclohexene-1-carboxylic acid hydrochloride, cis-2-amino-2-methylcyclohexanecarboxylic acid hydrochloride, cis-2-amino-2-methylcyclopentanecarboxylic acid hydrochloride, 2-(Boc-aminomethyl)benzoic acid, 2-(Boc-amino)octanedioic acid, Boc-4,5-dehydro-Leu-OH (dicyclohexylammonium), Boc-4-(Fmoc-amino)-L-phenylalanine, Boc-β-Homopyr-OH, Boc-(2-indanyl)-Gly-OH, 4-Boc-3-morpholineacetic acid, 4-Boc-3-morpholineacetic acid, Boc-pentafluoro-D-phenylalanine, Boc-pentafluoro-L-phenylalanine, Boc-Phe(2-Br)-OH, Boc-Phe(4-Br)-OH, Boc-D-Phe(4-Br)-OH, Boc-D-Phe(3-Cl)-OH, Boc-Phe(4-NH2)-OH, Boc-Phe(3-NO2)-OH, Boc-Phe(3,5-F2)-OH, 2-(4-Boc-piperazino)-2-(3,4-dimethoxyphenyl)acetic acid purum, 2-(4-Boc-piperazino)-2-(2-fluorophenyl)acetic acid purum, 2-(4-Boc-piperazino)-2-(3-fluorophenyl)acetic acid purum, 2-(4-Boc-piperazino)-2-(4-fluorophenyl)acetic acid purum, 2-(4-Boc-piperazino)-2-(4-methoxyphenyl)acetic acid purum, 2-(4-Boc-piperazino)-2-phenylacetic acid purum, 2-(4-Boc-piperazino)-2-(3-pyridyl)acetic acid purum, 2-(4-Boc-piperazino)-2-[4-(trifluoromethyl)phenyl]acetic acid purum, Boc-β-(2-quinolyl)-Ala-OH, N-Boc-1,2,3,6-tetrahydro-2-pyridinecarboxylic acid, Boc-β-(4-thiazolyl)-Ala-OH, Boc-β-(2-thienyl)-D-Ala-OH, Fmoc-N-(4-Boc-aminobutyl)-Gly-OH, Fmoc-N-(2-Boc-aminoethyl)-Gly-OH, Fmoc-N-(2,4-dimethoxybenzyl)-Gly-OH, Fmoc-(2-indanyl)-Gly-OH, Fmoc-pentafluoro-L-phenylalanine, Fmoc-Pen(Trt)-OH, Fmoc-Phe(2-Br)-OH, Fmoc-Phe(4-Br)-OH, Fmoc-Phe(3,5-F2)-OH, Fmoc-β-(4-thiazolyl)-Ala-OH, Fmoc-β-(2-thienyl)-Ala-OH, and 4-(hydroxymethyl)-D-phenylalanine.
The term “expression” includes any step involved in the production of the polypeptide including, but not limited to, transcription, post-transcriptional modification, translation, post-translational modification, and secretion. Expression can be detected using conventional techniques for detecting protein (e.g., ELISA, Western blotting, flow cytometry, immunofluorescence, immunohistochemistry, etc.).
“Control” or “control experiment” is used in accordance with its plain ordinary meaning and refers to an experiment in which the subjects or reagents of the experiment are treated as in a parallel experiment except for omission of a procedure, reagent, or variable of the experiment. In some instances, the control is used as a standard of comparison in evaluating experimental effects. In some embodiments, a control is the measurement of the activity of a protein in the absence of a compound as described herein (including embodiments and examples).
As used herein, the term “about” means a range of values including the specified value, which a person of ordinary skill in the art would consider reasonably similar to the specified value. In embodiments, about means within a standard deviation using measurements generally acceptable in the art. In embodiments, about means a range extending to +/−10% of the specified value. In embodiments, about means the specified value.
The terms “bind” and “bound” as used herein are used in accordance with their plain and ordinary meaning and refer to the association between atoms or molecules. The association can be direct or indirect. For example, atoms or molecules may be bound directly, e.g., by covalent bond or linker (e.g. a first linker or second linker), or indirectly, e.g., by non-covalent bond (e.g. electrostatic interactions (e.g. ionic bond, hydrogen bond, halogen bond), van der Waals interactions (e.g. dipole-dipole, dipole-induced dipole, London dispersion), ring stacking (pi or hydrophobic effects), hydrophobic interactions and the like).
The term “set of ligand binding amino acid residues” as used herein refers to at least two ligand binding amino acid residues. “Ligand binding amino acid residues” refer to amino acid residues which are capable of binding (e.g., have a measurable dissociation constant of binding, have a dissociation constant of binding less than 1 μM) to a ligand. In embodiments, the ligand binding amino acid residues refer to amino acid residues which bind to a ligand. Each ligand binding amino acid residue is associated with a set of ligand binding amino acid residue atomic coordinates (e.g., Cartesian coordinates, internal coordinates, polar coordinates, or spherical coordinates) which defines the ligand binding amino acid residue in space (e.g., Euclidean space). In embodiments, ligand binding amino acid residues refer to amino acid residues within about 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 15, 20, 25, or 30 Å from the ligand. In embodiments, ligand binding amino acid residues refer to amino acid residues within about 5 Å from the ligand. In determining the set of ligand binding amino acid residues, factors such as the proximity of the amino acid to the ligand or the interactions between the amino acid and the ligand may influence the designation as a “ligand binding amino acid residue.”
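The distance-based embodiment (residues within about 5 Å of the ligand) can be sketched as a simple cutoff test on Cartesian coordinates. This is a minimal sketch under stated assumptions: distance is measured between any residue atom and any ligand atom, which is only one possible reading of “within about 5 Å from the ligand,” and the residue labels and coordinates are made-up toy values, not taken from any structure in the disclosure.

```python
import math

def ligand_binding_residues(residue_atoms, ligand_atoms, cutoff=5.0):
    """Return IDs of residues with any atom within `cutoff` Å of any ligand atom."""
    selected = []
    for res_id, atoms in residue_atoms.items():
        if any(math.dist(a, l) <= cutoff for a in atoms for l in ligand_atoms):
            selected.append(res_id)
    return selected

# Toy Cartesian coordinates in Å (hypothetical values).
ligand = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0)]
residues = {
    "HIS7": [(3.0, 0.0, 0.0), (4.0, 1.0, 0.0)],     # 1.5 Å from a ligand atom
    "LEU21": [(12.0, 0.0, 0.0), (13.0, 1.0, 0.0)],  # more than 10 Å away
    "PHE40": [(0.0, 4.5, 0.0)],                     # 4.5 Å from a ligand atom
}
binding = ligand_binding_residues(residues, ligand)
```

Residues that fail the cutoff are not automatically core residues; the core designation has its own criteria, such as solvent inaccessibility.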
The term “dissociation constant” is used in accordance with its plain ordinary meaning and refers to the ligand concentration at which half of the proteins are occupied (i.e. bound to a ligand) at equilibrium. Typically, the dissociation constant has molar units (M). The smaller the dissociation constant, the more tightly bound the ligand is, or the higher the affinity between ligand and protein. For example, a ligand with a nanomolar (nM) dissociation constant binds more tightly to a particular protein than a ligand with a micromolar (μM) dissociation constant.
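For the simplest case of a single, independent binding site, the fraction of protein occupied at equilibrium is [L] / (Kd + [L]), which equals one half when the free ligand concentration [L] equals Kd, consistent with the definition above. The sketch below assumes this single-site model and uses the free (unbound) ligand concentration; the function name and concentrations are illustrative only.

```python
def fraction_bound(free_ligand_molar, kd_molar):
    """Fractional occupancy for single-site binding: [L] / (Kd + [L])."""
    if free_ligand_molar < 0 or kd_molar <= 0:
        raise ValueError("concentration must be >= 0 and Kd must be > 0")
    return free_ligand_molar / (kd_molar + free_ligand_molar)

ligand_conc = 100e-9                        # 100 nM free ligand (illustrative)
tight = fraction_bound(ligand_conc, 1e-9)   # nanomolar Kd: mostly occupied
weak = fraction_bound(ligand_conc, 1e-6)    # micromolar Kd: mostly unoccupied
```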
The terms “ligand” and “cofactor” are synonymous, and used in accordance with their plain ordinary meaning in chemistry and biochemistry and refer to an agent (e.g., compound, metal, ion, biomolecule, agonist, antagonist) which is capable of binding to a protein (e.g., a protein described herein). In embodiments, a ligand refers to an agent (e.g., compound, metal, ion, biomolecule) which binds (e.g., covalently or non-covalently) to a protein. Typically, upon binding the ligand has an effect on the protein (e.g., structural change of the protein, modulation of signaling pathways). A ligand is associated with a set of ligand atomic coordinates (e.g., Cartesian coordinates, internal coordinates, polar coordinates, spherical coordinates) which define the ligand in space (e.g., Euclidean space). The ligand may be endogenous or exogenous. Non-limiting examples of ligands include a catalyst, detectable agent, therapeutic agent, biological agent, cytotoxic agent, magnetic resonance imaging (MRI) agent, positron emission tomography (PET) agent, radiological imaging agent, diagnostic agent, theranostic (e.g., a combined therapeutic and diagnostic agent), photodynamic therapy (PDT) agent, porphyrin, porphycene, rubyrin, rosarin, hexaphyrin, sapphyrin, chlorophyll, chlorin, phthalocyanine, porphyrazine, corrole, N-confused porphyrin, bacteriochlorophyll, pheophytin, texaphyrin, or related macrocyclic-based component that is capable of binding a metal ion. In embodiments, the ligand is a peptide (e.g., 2 to 30 amino acid residues), a protein (e.g., greater than 30 amino acid residues), a small molecule (e.g., a compound with a molecular weight of less than 2000 Daltons), or a small molecule-metal-ion complex (e.g., a metalloporphyrin). In embodiments, the ligand is endogenous. In embodiments, the ligand is exogenous. In embodiments, the ligand is flavin. In embodiments, the ligand is heme.
The term “set of core amino acid residues” as used herein refers to at least two core amino acid residues. Core amino acid residues refer to amino acid residues which are incapable of binding to a ligand (e.g., do not have a measurable dissociation constant of binding, do not have a dissociation constant of binding less than 1 μM). In embodiments, core amino acids are amino acids which do not bind a ligand. Each core amino acid residue is associated with a set of core amino acid residue atomic coordinates (e.g., Cartesian coordinates, internal coordinates, polar coordinates, spherical coordinates) which defines the core amino acid residue in space (e.g., Euclidean space). Core amino acids are at least 75% inaccessible to a 1.8 Å spherical probe. A typical set of core amino acid residues contains at least 6 amino acid residues. In embodiments, the set of core amino acid residues includes amino acid residues which are solvent inaccessible as measured by the accessible surface area. Additional information regarding the accessible surface area assessment may be found in Lins et al. (Lins, L., Thomas, A., & Brasseur, R. (2003) Protein Science: A Publication of the Protein Society, 12(7), 1406-141), which is incorporated herein in its entirety for all purposes. In embodiments, the core amino acid atomic coordinates are greater than 5 Å from any ligand atomic coordinate. In embodiments, the set of core amino acid residues is hydrophobic. In embodiments, the core amino acids include the sequence:
The terms “optimizing” and “optimization” are used in accordance with their ordinary meaning in mathematics and computer science and refer to identifying a favorable outcome subject to certain criteria (e.g., constraints) from a set of available possibilities. Optimizing may employ iterative or heuristic algorithms, such as a simplex algorithm, memetic algorithm, differential evolution algorithm, evolutionary algorithm, genetic algorithm, tabu algorithm, particle swarm algorithm, simulated annealing algorithm, Monte Carlo sampling algorithm, dead-end elimination algorithm, branch and bound algorithm, or a pruning algorithm. For example, optimizing typically includes evaluating an energy function (e.g., force field model) and finding the minimum (e.g., global minimum or local minimum). Optimizing may include repeated evaluations of the energy function and may include fixing an atomic coordinate (e.g., fixing an atomic coordinate of at least one ligand binding amino acid residue atomic coordinate), introducing additional amino acid residues into the set of amino acid residues (e.g., the set of ligand binding amino acid residues), restricting the introduction of additional amino acid residues into the set of amino acid residues (e.g., the set of ligand binding amino acid residues), or a geometric transformation (e.g., translation or rotation) of an amino acid residue atomic coordinate (e.g., the atomic coordinate of the ligand binding amino acid residue atomic coordinates). The output of an optimization process may provide a set of ligand binding amino acid residues and a corresponding set of ligand binding amino acid residue atomic coordinates, and a set of core amino acid residues and a corresponding set of core amino acid residue atomic coordinates, which correspond to an energetically stabilized protein. In embodiments the outcome of the optimization is the global minimum (e.g., the most energetically stabilized protein).
In embodiments the outcome of the optimization is a local minimum (e.g., a minimum energy given the domain). In embodiments the optimization is complete when the derivative of the energy with respect to the position of the atoms, ∂E/∂r, is zero and the Hessian matrix has positive eigenvalues. In embodiments, optimizing includes a plurality of minimization calculations. In embodiments the optimization is a finite number of iterations.
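The completion criterion stated above (a vanishing energy gradient together with a Hessian whose eigenvalues are all positive) can be illustrated with a minimal sketch. The two-variable energy functions, the finite-difference step sizes, and the tolerance below are assumptions chosen for illustration only; a real protein energy surface has many more degrees of freedom.

```python
import math

def gradient_2d(f, p, h=1e-5):
    """Central-difference gradient of f at the point p = (x, y)."""
    x, y = p
    return [(f((x + h, y)) - f((x - h, y))) / (2 * h),
            (f((x, y + h)) - f((x, y - h))) / (2 * h)]

def hessian_eigenvalues_2d(f, p, h=1e-4):
    """Eigenvalues of the finite-difference 2x2 Hessian of f at p."""
    x, y = p
    f0 = f((x, y))
    fxx = (f((x + h, y)) - 2 * f0 + f((x - h, y))) / h ** 2
    fyy = (f((x, y + h)) - 2 * f0 + f((x, y - h))) / h ** 2
    fxy = (f((x + h, y + h)) - f((x + h, y - h))
           - f((x - h, y + h)) + f((x - h, y - h))) / (4 * h ** 2)
    mean = (fxx + fyy) / 2
    spread = math.sqrt(((fxx - fyy) / 2) ** 2 + fxy ** 2)
    return mean - spread, mean + spread

def is_minimum(f, p, tol=1e-4):
    """True when the gradient is (numerically) zero and both eigenvalues are positive."""
    grad_norm = math.hypot(*gradient_2d(f, p))
    return grad_norm < tol and min(hessian_eigenvalues_2d(f, p)) > 0

# Hypothetical toy surfaces: a bowl with its minimum at (1, -0.5), and a saddle at (0, 0).
bowl = lambda p: (p[0] - 1.0) ** 2 + 2.0 * (p[1] + 0.5) ** 2
saddle = lambda p: p[0] ** 2 - p[1] ** 2
```

The saddle case shows why the Hessian condition is needed: its gradient vanishes at the origin, yet one eigenvalue is negative, so the point is not a minimum.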
An energy minimization calculation refers to the process of evaluating the energy as a function of the atomic coordinates, V(r). The energy function may include intra- and intermolecular energy terms within the system (e.g., protein) which may be written as Vtotal(r)=Vbonds(r)+Vangles(r)+Vdihedral(r)+Vimproper(r)+Vnonbonding(r)+Velectrostatics(r); where Vtotal(r) corresponds to the total energy as a function of the atomic positions; Vbonds(r) corresponds to the energy contribution from bonded atoms, Vangles(r) corresponds to the energy contribution from angles; Vdihedral(r) corresponds to the energy contribution from dihedral torsions; Vimproper(r) corresponds to the energy contribution from out-of-plane torsions; Vnonbonding(r) corresponds to the energy contribution from nonbonding interactions; and Velectrostatics(r) corresponds to the energy contribution from electrostatic interactions. Additional energy function terms may also be included in the total energy function, Vtotal(r), for example additional functions from molecular mechanics, functions from structural bioinformatics (log-odds scores), amino acid sidechain packing functions (e.g., functions and algorithms which vary the identity and rotamer of an amino acid side chain), protein radius of gyration functions, or a penalty function.
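As a rough illustration of how such a total energy may be assembled from separate terms, the sketch below sums a harmonic bond term, a Lennard-Jones nonbonding term, and a Coulomb electrostatic term. The functional forms, the force constants, and the Lennard-Jones parameters are assumptions for illustration and are not taken from this disclosure; the angle, dihedral, and improper terms are omitted for brevity and would be added to the sum in the same way.

```python
import math

COULOMB = 332.06  # Coulomb constant in kcal·Å/(mol·e²), vacuum dielectric assumed

def bond_energy(coords, bonds):
    """Harmonic bonds; each bond is (i, j, force_constant, equilibrium_length)."""
    return sum(k * (math.dist(coords[i], coords[j]) - r0) ** 2
               for i, j, k, r0 in bonds)

def nonbonding_energy(coords, pairs, epsilon=0.1, sigma=3.4):
    """12-6 Lennard-Jones energy over the listed atom pairs (hypothetical parameters)."""
    total = 0.0
    for i, j in pairs:
        sr6 = (sigma / math.dist(coords[i], coords[j])) ** 6
        total += 4.0 * epsilon * (sr6 * sr6 - sr6)
    return total

def electrostatic_energy(coords, charges, pairs):
    """Coulomb energy over the listed atom pairs, charges in units of e."""
    return sum(COULOMB * charges[i] * charges[j] / math.dist(coords[i], coords[j])
               for i, j in pairs)

def total_energy(coords, bonds, charges, pairs):
    """V_total(r) as a plain sum of the individual energy terms."""
    return (bond_energy(coords, bonds)
            + nonbonding_energy(coords, pairs)
            + electrostatic_energy(coords, charges, pairs))
```

Keeping each term in its own function makes it straightforward to add further terms (for example, a penalty function) without touching the others.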
The term “biomolecule” as used herein refers to a molecule present in living organisms (e.g., proteins, carbohydrates, lipids, nucleic acids, and metabolites) and may be endogenous or exogenous in origin.
The term “energetically stabilized protein” is used in accordance with its ordinary meaning in the art, and is understood to refer to a protein which is structurally and thermodynamically stable relative to the protein that has not been energetically stabilized. For example, whether a protein is energetically stabilized may be determined from the difference in the Gibbs free energy between the folded and unfolded states of the protein, also referred to herein as ΔGfolding. An energetically stabilized protein may be characterized by a well-dispersed NMR spectrum and/or the presence of a significantly folded core. In embodiments, the energetically stabilized protein is an enzyme. In embodiments, the energetically stabilized protein is an apo protein (e.g., a protein that is not bound to a ligand). In embodiments, the energetically stabilized protein is a holo protein (e.g., a protein that is bound to a ligand). In embodiments, the energetically stabilized protein is an apo protein which is capable of becoming a holo protein upon ligand binding. In embodiments, an energetically stabilized protein refers to a protein which is capable of performing a function (e.g., modulating a signal pathway). In embodiments, the energetically stabilized protein resists side-reactions such as aggregation and proteolysis. In embodiments, the energetically stabilized protein has a ΔGfolding of about −5 to about −40 kcal/mol in standard physiological conditions (e.g., temperature range of 20-40 degrees Celsius, atmospheric pressure of 1 atm, pH of 6-8, glucose concentration of 1-20 mM, atmospheric oxygen concentration).
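To give a sense of scale for the ΔGfolding values recited above, the following sketch converts a folding free energy into the fraction of molecules that are folded at equilibrium. It assumes a simple two-state (folded/unfolded) model, which is an assumption introduced here for illustration and not a requirement of this disclosure; the function name and default temperature are likewise hypothetical.

```python
import math

R_KCAL = 1.987e-3  # gas constant in kcal/(mol·K)

def fraction_folded(dg_folding, temp_k=298.15):
    """Two-state fraction folded for a folding free energy given in kcal/mol.

    Uses K = [folded]/[unfolded] = exp(-dG / (R T)); a more negative dG
    therefore shifts the population toward the folded state.
    """
    k_fold = math.exp(-dg_folding / (R_KCAL * temp_k))
    return k_fold / (1.0 + k_fold)
```

Under this model a ΔGfolding of zero corresponds to an equal population of folded and unfolded molecules, and the folded fraction approaches one as ΔGfolding becomes more negative.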
The term “exogenous” refers to a molecule or substance (e.g., a compound, ligand, or protein) that originates from outside a given cell or organism. Conversely, the term “endogenous” refers to a molecule or substance that is native to, or originates within, a given cell or organism.
A “therapeutic agent” as used herein refers to an agent (e.g., compound or composition) that when administered to a subject in sufficient amounts will have a therapeutic effect, such as an intended prophylactic effect, preventing or delaying the onset (or reoccurrence) of an injury, disease, pathology or condition, or reducing the likelihood of the onset (or reoccurrence) of an injury, disease, pathology, or condition, or their symptoms or the intended therapeutic effect, e.g., treatment or amelioration of an injury, disease, pathology or condition, or their symptoms including any objective or subjective parameter of treatment such as abatement; remission; diminishing of symptoms or making the injury, pathology or condition more tolerable to the patient; slowing in the rate of degeneration or decline; making the final point of degeneration less debilitating; or improving a patient's physical or mental well-being.
The term “small molecule” or the like as used herein refers, unless indicated otherwise, to a molecule having a molecular weight of less than about 700 Dalton, e.g., less than about 700, 650, 600, 550, 500, 450, 400, 350, 300, 250, 200, 100, or 50 Dalton.
In this disclosure, “comprises,” “comprising,” “containing” and “having” and the like can have the meaning ascribed to them in U.S. patent law and can mean “includes,” “including,” and the like. “Consisting essentially of” or “consists essentially” likewise has the meaning ascribed in U.S. patent law and the term is open-ended, allowing for the presence of more than that which is recited so long as basic or novel characteristics of that which is recited are not changed by the presence of more than that which is recited, but excludes prior art embodiments.
In an aspect is provided a computer-implemented method, including: a) identifying a set of ligand binding amino acid residues within a protein for binding to a ligand, wherein each ligand binding amino acid residue within the protein is associated with a set of ligand binding amino acid residue atomic coordinates and each atom of the ligand is associated with a set of ligand atomic coordinates; b) identifying a set of core amino acid residues within the protein that do not bind to the ligand, each core amino acid residue within the protein is associated with a set of core amino acid residue atomic coordinates; and c) optimizing the set of ligand binding amino acid residues; the set of ligand binding amino acid residue atomic coordinates; the set of core amino acid residues; and the set of core amino acid residue atomic coordinates; wherein the optimization is performed using at least an energy minimization calculation, and wherein the optimization is performed to energetically stabilize the protein. In embodiments, the optimization is performed to improve, relative to a control, the protein-ligand interactions (e.g., decrease the dissociation constant of binding 1-fold, 2-fold, 3-fold, 4-fold or 5-fold). In embodiments, the optimization modulates, relative to a control, the non-covalent interactions between the protein and the ligand.
In embodiments, step c) includes simultaneously optimizing the set of ligand binding amino acid residues; the set of ligand binding amino acid residue atomic coordinates; the set of core amino acid residues; and the set of core amino acid residue atomic coordinates. In embodiments, step c) includes concurrently (e.g., performing an optimization iteration on all sets prior to continuing the optimization) optimizing the set of ligand binding amino acid residues; the set of ligand binding amino acid residue atomic coordinates; the set of core amino acid residues; and the set of core amino acid residue atomic coordinates.
In embodiments, the optimizing is joint optimizing (e.g., optimizing the set of ligand binding amino acid residues, the set of core amino acid residues, and optionally the ligand simultaneously). In embodiments, step c) includes optimizing the set of ligand binding amino acid residues; the set of ligand binding amino acid residue atomic coordinates; the set of core amino acid residues; and the set of core amino acid residue atomic coordinates. In embodiments, step c) includes optimizing the set of ligand binding amino acid residues and the set of core amino acid residues. In embodiments, step c) includes optimizing the set of ligand binding amino acid residues and the set of ligand binding amino acid residue atomic coordinates. In embodiments, step c) includes optimizing the set of core amino acid residues and the set of core amino acid residue atomic coordinates. In embodiments, step c) includes optimizing the set of ligand binding amino acid residue atomic coordinates and the set of core amino acid residue atomic coordinates.
In embodiments, step c) includes optimizing the protein backbone. Optimizing the protein backbone may refer to repeated evaluations of the energy function and may include fixing an atomic coordinate (e.g., fixing an atomic coordinate of at least one ligand binding amino acid residue atomic coordinate, but not the side chain of the residue), introducing additional amino acid residues into the set of amino acid residues (e.g., the set of ligand binding amino acid residues), restricting the introduction of additional amino acid residues into the set of amino acid residues (e.g., the set of ligand binding amino acid residues), or a geometric transformation (e.g., translation or rotation) of an amino acid residue atomic coordinate, but not the side chain of the residue (e.g., the atomic coordinate of the ligand binding amino acid residue atomic coordinates). In embodiments, step c) includes simultaneously optimizing the protein backbone and the set of ligand binding amino acid residues. In embodiments, step c) includes simultaneously optimizing the protein backbone and the ligand. In embodiments, step c) includes simultaneously optimizing the protein backbone and the set of core amino acid residues. In embodiments, step c) includes optimizing the protein backbone using known conformational sampling techniques in the art (e.g., rigid-body shifts of helices, backrub algorithms, or crankshaft algorithms). In embodiments, step c) is performed using a protein modeling software suite (e.g., Rosetta). In embodiments, step c) includes an ensemble (e.g., a finite set of proteins, which includes amino acid residue atomic coordinates) of backbones for conformational sampling calculations.
In embodiments, step c) includes fixing (e.g., not geometrically displacing) an atomic coordinate of at least one ligand binding amino acid residue atomic coordinate; fixing an atomic coordinate of at least one ligand atomic coordinate; prohibiting introduction of an additional amino acid residue into the set of ligand binding amino acid residues; or prohibiting the deletion of an amino acid residue from the set of ligand binding amino acid residues.
In embodiments, step c) includes fixing an atomic coordinate of at least one ligand binding amino acid residue atomic coordinate. In embodiments, step c) includes fixing all atomic coordinates of at least one ligand binding amino acid residue atomic coordinate. In embodiments, step c) includes fixing an atomic coordinate of at least one ligand atomic coordinate. In embodiments, step c) includes fixing all atomic coordinates of the ligand atomic coordinate. In embodiments, step c) includes prohibiting introduction of an additional amino acid residue into the set of ligand binding amino acid residues. In embodiments, step c) includes prohibiting the deletion of an amino acid residue from the set of ligand binding amino acid residues. In embodiments, step c) includes prohibiting introduction of an additional amino acid residue into the set of core amino acid residues. In embodiments, step c) includes prohibiting the deletion of an amino acid residue from the set of core amino acid residues. In embodiments, the method includes distance and angle constraints (i.e. specifying the distance of a ligand to an amino acid (e.g., a ligand binding amino acid residue) coordinate).
In embodiments, the optimizing includes fixing (e.g., not geometrically displacing) at least one atomic coordinate of the ligand atomic coordinates. In embodiments, the optimizing does not include fixing at least one atomic coordinate of at least one core amino acid residue atomic coordinates. In embodiments, the optimizing does not include fixing at least one atomic coordinate of the core amino acid residue atomic coordinates. In embodiments, the optimizing does not include fixing any atomic coordinates of the core amino acid residue atomic coordinates. In embodiments, the optimizing includes fixing an angle formed by three atoms (e.g., angles formed between atoms of the ligand and the ligand binding amino acid residues) or fixing the distance between atoms (e.g., at least one atomic coordinate of the ligand and at least one atomic coordinate of the ligand binding amino acid residue).
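One common way to hold a distance or an angle near a chosen value during minimization is to add a harmonic penalty to the energy, which acts as a soft counterpart to strictly fixing the geometry. The sketch below is illustrative only: the atom roles (a ligand metal atom, a coordinating donor atom of a ligand binding residue, and a neighboring atom of that residue), the target values, and the force constants are all hypothetical.

```python
import math

def angle_deg(a, b, c):
    """Angle a-b-c in degrees, with atom b at the vertex."""
    v1 = [p - q for p, q in zip(a, b)]
    v2 = [p - q for p, q in zip(c, b)]
    cos_t = sum(x * y for x, y in zip(v1, v2)) / (math.hypot(*v1) * math.hypot(*v2))
    return math.degrees(math.acos(max(-1.0, min(1.0, cos_t))))

def restraint_penalty(metal, donor, neighbor,
                      d0=2.1, theta0=125.0, k_d=100.0, k_theta=0.05):
    """Harmonic penalty on the metal-donor distance (Å) and the
    metal-donor-neighbor angle (degrees); zero at the target geometry."""
    d = math.dist(metal, donor)
    theta = angle_deg(metal, donor, neighbor)
    return k_d * (d - d0) ** 2 + k_theta * (theta - theta0) ** 2
```

Adding such a term to the total energy lets the minimizer move the atoms involved while still penalizing departures from the intended binding geometry.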
In embodiments, the optimizing includes an iterative or heuristic algorithm. In embodiments, the optimizing includes an iterative algorithm. In embodiments, the optimizing includes a heuristic algorithm. In embodiments, the optimizing includes a simplex algorithm, memetic algorithm, differential evolution algorithm, evolutionary algorithm, genetic algorithm, tabu algorithm, particle swarm algorithm, or simulated annealing algorithm. In embodiments, the optimizing includes a Monte Carlo sampling algorithm, dead-end elimination algorithm, branch and bound algorithm, or a pruning algorithm. In embodiments, the optimizing includes knobs-into-holes side chain packing. In embodiments, the optimization may begin with an idealized, parameterized backbone. In embodiments, optimization may relax the backbone structure of the protein, for example, by using gradient descent algorithms, while optimizing the protein sequence via rotamer sampling and minimization.
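Of the heuristic algorithms listed above, simulated annealing with Monte Carlo sampling is simple to sketch. The example below searches over residue identities of a short sequence using a toy scoring table; the scores, the neighbor penalty, the cooling schedule, and all function names are hypothetical and stand in for a real energy function and rotamer library.

```python
import math
import random

# Hypothetical per-residue scores; lower is better in this toy model.
TOY_SCORE = {"A": 0.5, "F": -1.0, "G": 1.0, "I": -0.8, "L": -0.9, "V": -0.6}

def toy_energy(seq):
    """Sum of per-residue scores plus a small penalty for identical neighbors."""
    energy = sum(TOY_SCORE[r] for r in seq)
    energy += sum(0.7 for a, b in zip(seq, seq[1:]) if a == b)
    return energy

def anneal(seq, steps=2000, t_start=2.0, t_end=0.05, seed=0):
    """Metropolis Monte Carlo over residue identities with geometric cooling."""
    rng = random.Random(seed)
    alphabet = sorted(TOY_SCORE)
    current = list(seq)
    e_cur = toy_energy(current)
    best, e_best = list(current), e_cur
    for i in range(steps):
        temp = t_start * (t_end / t_start) ** (i / (steps - 1))
        pos = rng.randrange(len(current))
        old = current[pos]
        current[pos] = rng.choice(alphabet)      # propose a point mutation
        e_new = toy_energy(current)
        # Accept downhill moves always, uphill moves with Boltzmann probability.
        if e_new <= e_cur or rng.random() < math.exp(-(e_new - e_cur) / temp):
            e_cur = e_new
            if e_cur < e_best:
                best, e_best = list(current), e_cur
        else:
            current[pos] = old                   # reject: restore the old residue
    return "".join(best), e_best
```

Accepting some uphill moves at high temperature lets the search escape local minima, while the falling temperature makes the search increasingly greedy toward the end.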
In embodiments, the optimizing includes introducing an additional ligand binding amino acid residue into the set of ligand binding amino acid residues, deleting a ligand binding amino acid residue from the set of ligand binding amino acid residues, or a geometric transformation of at least one atomic coordinate of the ligand binding amino acid residue atomic coordinates.
In embodiments, the optimizing includes introducing an additional ligand binding amino acid residue into the set of ligand binding amino acid residues (e.g., redesignating an amino acid residue previously designated as a core amino acid residue as a ligand binding amino acid residue). In embodiments, the optimizing includes replacing a ligand binding amino acid residue within the set of ligand binding amino acid residues. In embodiments, the optimizing includes deleting a ligand binding amino acid residue from the set of ligand binding amino acid residues (e.g., redesignating an amino acid residue previously designated as a ligand binding amino acid residue as a core amino acid residue). In embodiments, the optimizing includes a geometric transformation of at least one atomic coordinate of the ligand binding amino acid residue atomic coordinates. In embodiments, the optimizing includes a geometric transformation of the atomic coordinates of at least one of the ligand binding amino acid residue atomic coordinates. In embodiments, the optimizing includes a geometric transformation of the atomic coordinates of the ligand binding amino acid residue atomic coordinates.
In embodiments, the geometric transformation includes a translation (i.e., a geometric transformation that moves a coordinate by the same distance in a given direction) or a rotation of at least one atomic coordinate of the ligand binding amino acid residue atomic coordinates. In embodiments, the geometric transformation includes a translation (e.g., displacing the x coordinate) of at least one atomic coordinate of the ligand binding amino acid residue atomic coordinates. In embodiments, the geometric transformation includes a translation of at least two atomic coordinates of the ligand binding amino acid residue atomic coordinates. In embodiments, the geometric transformation includes a translation of all atomic coordinates (e.g., x, y, and z coordinates in Cartesian space) of the ligand binding amino acid residue atomic coordinates. In embodiments, the geometric transformation includes a rotation of at least one atomic coordinate of the ligand binding amino acid residue atomic coordinates. In embodiments, the geometric transformation includes a rotation of at least two atomic coordinates of the ligand binding amino acid residue atomic coordinates. In embodiments, the geometric transformation includes a rotation of at least three atomic coordinates of the ligand binding amino acid residue atomic coordinates. In embodiments, the geometric transformation includes a rotation of all atomic coordinates of the ligand binding amino acid residue atomic coordinates.
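The translation and rotation described above can be written out for Cartesian coordinates as follows. This is a minimal sketch: the function names are hypothetical, and the rotation is restricted to the z axis for brevity, whereas a general rotation would use an arbitrary axis or a full rotation matrix.

```python
import math

def translate(coords, shift):
    """Move every atom by the same displacement vector (dx, dy, dz)."""
    return [tuple(c + s for c, s in zip(atom, shift)) for atom in coords]

def rotate_z(coords, angle_deg, origin=(0.0, 0.0, 0.0)):
    """Rotate every atom about an axis parallel to z passing through `origin`."""
    t = math.radians(angle_deg)
    cos_t, sin_t = math.cos(t), math.sin(t)
    rotated = []
    for x, y, z in coords:
        dx, dy = x - origin[0], y - origin[1]
        rotated.append((origin[0] + cos_t * dx - sin_t * dy,
                        origin[1] + sin_t * dx + cos_t * dy,
                        z))
    return rotated
```

Both operations are rigid-body transformations, so interatomic distances within the transformed group of atoms are unchanged.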
In embodiments, the optimizing includes a geometric transformation of at least one atomic coordinate of the core amino acid residue atomic coordinates. In embodiments, the geometric transformation includes a translation or a rotation of at least one atomic coordinate of the core amino acid residue atomic coordinates. In embodiments, the geometric transformation includes a translation of at least one atomic coordinate of the core amino acid residue atomic coordinates. In embodiments, the geometric transformation includes a translation of at least two atomic coordinates of the core amino acid residue atomic coordinates. In embodiments, the geometric transformation includes a translation of all atomic coordinates of the core amino acid residue atomic coordinates. In embodiments, the geometric transformation includes a rotation of at least one atomic coordinate of the core amino acid residue atomic coordinates. In embodiments, the geometric transformation includes a rotation of at least two atomic coordinates of the core amino acid residue atomic coordinates. In embodiments, the geometric transformation includes a rotation of at least three atomic coordinates of the core amino acid residue atomic coordinates. In embodiments, the geometric transformation includes a rotation of all atomic coordinates of the core amino acid residue atomic coordinates.
In embodiments, the optimizing includes 1a) calculating the force on each atom in the protein (e.g., the set of ligand binding amino acid residues; the set of core amino acid residues; and the ligand); 2a) evaluating the calculation to determine if it is the minimum or below an acceptable threshold; 3a) if the force is less than a threshold, the optimization is finished, otherwise perform a geometric transformation (e.g., translation) of at least one atomic coordinate on the atoms in the protein; and 4a) repeat.
In embodiments, the geometric transformation of at least one atomic coordinate includes no greater than a 6 Å displacement of any atomic coordinate. In embodiments, the geometric transformation of at least one atomic coordinate includes no greater than a 3 Å displacement of any atomic coordinate. In embodiments, the displacement is no greater than 0.1, 0.2, 0.3, 0.4, 0.5, 1, 2, 3, 4, 5, 6, 7, 8, 9, or 10 Å displacement of any atomic coordinate. In embodiments, the displacement is no greater than 0.01, 0.02, 0.03, 0.04, 0.05, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, or 1.0 Å displacement of any atomic coordinate.
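The loop in steps 1a) to 4a), combined with a cap on the displacement of any atom per iteration, can be sketched as a steepest-descent minimizer. The toy energy (a single harmonic bond between two atoms), the step size, the force threshold, and the displacement cap below are assumptions for illustration; forces are obtained by numerical differentiation so that any energy function of the coordinates could be substituted.

```python
import math

def toy_energy(coords, k=1.0, r0=1.5):
    """Hypothetical energy: one harmonic bond between atoms 0 and 1."""
    return k * (math.dist(coords[0], coords[1]) - r0) ** 2

def forces(coords, energy_fn, h=1e-5):
    """Force on each atom: negative central-difference gradient of the energy."""
    result = []
    for i in range(len(coords)):
        force = []
        for axis in range(3):
            plus = [list(a) for a in coords]
            minus = [list(a) for a in coords]
            plus[i][axis] += h
            minus[i][axis] -= h
            force.append(-(energy_fn(plus) - energy_fn(minus)) / (2 * h))
        result.append(force)
    return result

def minimize(coords, energy_fn, force_tol=1e-4, step=0.05,
             max_disp=0.5, max_iter=1000):
    """Steepest descent; returns the final coordinates and iterations used."""
    coords = [list(a) for a in coords]
    for iteration in range(max_iter):
        f = forces(coords, energy_fn)                    # 1a) force on each atom
        largest = max(math.hypot(*fi) for fi in f)       # 2a) evaluate
        if largest < force_tol:                          # 3a) finished?
            return coords, iteration
        for atom, fi in zip(coords, f):                  # otherwise translate atoms
            move = [step * c for c in fi]
            length = math.hypot(*move)
            if length > max_disp:                        # cap the displacement (Å)
                move = [c * max_disp / length for c in move]
            for axis in range(3):
                atom[axis] += move[axis]
    return coords, max_iter                              # 4a) repeat until max_iter
```

For example, calling `minimize([[0, 0, 0], [3, 0, 0]], toy_energy)` moves the two atoms toward each other until their separation is close to the assumed equilibrium length.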
In embodiments, the set of ligand binding amino acids includes at least 50 amino acid residues. In embodiments, the set of ligand binding amino acids includes at least 40 amino acid residues. In embodiments, the set of ligand binding amino acids includes at least 30 amino acid residues. In embodiments, the set of ligand binding amino acids includes at least 20 amino acid residues. In embodiments, the set of ligand binding amino acids includes at least 12 amino acid residues. In embodiments, the set of ligand binding amino acids includes at least 10 amino acid residues. In embodiments, the set of ligand binding amino acids includes at least 8 amino acid residues. In embodiments, the set of ligand binding amino acids includes at least 6 amino acid residues. In embodiments, the set of ligand binding amino acids includes at least 5 amino acid residues. In embodiments, the set of ligand binding amino acids includes at least 4 amino acid residues. In embodiments, the set of ligand binding amino acids includes at least 3 amino acid residues. In embodiments, the set of ligand binding amino acids includes at least 2 amino acid residues. In embodiments the ligand binding amino acids are apolar. In embodiments the ligand binding amino acids are hydrophilic.
In embodiments, the set of ligand binding amino acids includes 50 amino acid residues. In embodiments, the set of ligand binding amino acids includes 40 amino acid residues. In embodiments, the set of ligand binding amino acids includes 30 amino acid residues. In embodiments, the set of ligand binding amino acids includes 20 amino acid residues. In embodiments, the set of ligand binding amino acids includes 12 amino acid residues. In embodiments, the set of ligand binding amino acids includes 10 amino acid residues. In embodiments, the set of ligand binding amino acids includes 8 amino acid residues. In embodiments, the set of ligand binding amino acids includes 6 amino acid residues. In embodiments, the set of ligand binding amino acids includes 5 amino acid residues. In embodiments, the set of ligand binding amino acids includes 4 amino acid residues. In embodiments, the set of ligand binding amino acids includes 3 amino acid residues. In embodiments, the set of ligand binding amino acids includes 2 amino acid residues. In embodiments the ligand binding amino acids are polar. In embodiments the ligand binding amino acids are hydrophilic.
In embodiments, the energy minimization calculation includes a molecular mechanics function, a structural bioinformatics function, an amino acid sidechain packing function, a protein radius of gyration function, or a combination thereof. In embodiments, the energy minimization calculation includes a penalty function.
In embodiments, the core amino acids are at least 75% inaccessible to a 1.8 Å spherical probe. In embodiments, the core amino acids are at least 75% inaccessible to a 1.0 Å spherical probe. In embodiments, the core amino acids are at least 75% inaccessible to a 1.2 Å spherical probe. In embodiments, the core amino acids are at least 75% inaccessible to a 1.4 Å spherical probe. In embodiments, the core amino acids are at least 75% inaccessible to a 1.6 Å spherical probe. In embodiments, the core amino acids are at least 75% inaccessible to a 2.0 Å spherical probe. In embodiments, the core amino acids are at least 80% inaccessible to a 1.8 Å spherical probe. In embodiments, the core amino acids are at least 90% inaccessible to a 1.8 Å spherical probe. In embodiments, the core amino acids are at least 95% inaccessible to a 1.8 Å spherical probe. In embodiments, the set of core amino acids includes at least 50 amino acid residues. In embodiments, the set of core amino acids includes at least 40 amino acid residues. In embodiments, the set of core amino acids includes at least 30 amino acid residues. In embodiments, the set of core amino acids includes at least 20 amino acid residues. In embodiments, the set of core amino acids includes at least 12 amino acid residues. In embodiments, the set of core amino acids includes at least 10 amino acid residues. In embodiments, the set of core amino acids includes at least 8 amino acid residues. In embodiments, the set of core amino acids includes at least 6 amino acid residues. In embodiments the core amino acids are apolar. In embodiments the core amino acids are hydrophobic.
In embodiments, the set of core amino acids includes 6 amino acids. In embodiments, the set of core amino acids includes 8 amino acids. In embodiments, the set of core amino acids includes 10 amino acids. In embodiments, the set of core amino acids includes 20 amino acids. In embodiments, the set of core amino acids includes 30 amino acids. In embodiments, the set of core amino acids includes 40 amino acids. In embodiments, the set of core amino acids includes 35, 36, 37, 38, 39, or 40 amino acids. In embodiments, the set of core amino acids includes 37 amino acids. In embodiments, the core amino acids include the sequence: LGLVAFLIFGLVLILIHLFAAGWVFFAILLLLALILA (SEQ ID NO:5). In embodiments, the core amino acids include the sequence: LGIILLLAIGLILLAFHLFFAGWLFIAILLFSGIILA (SEQ ID NO:6).
In embodiments, the protein is 99% identical to SEQ ID NO:5. In embodiments, the protein is 98% identical to SEQ ID NO:5. In embodiments, the protein is 95% identical to SEQ ID NO:5. In embodiments, the protein is 90% identical to SEQ ID NO:5. In embodiments, the protein is 85% identical to SEQ ID NO:5. In embodiments, the protein is 80% identical to SEQ ID NO:5. In embodiments, the protein is 60% identical to SEQ ID NO:5. In embodiments, the protein is about 99% identical to SEQ ID NO:5. In embodiments, the protein is about 98% identical to SEQ ID NO:5. In embodiments, the protein is about 95% identical to SEQ ID NO:5. In embodiments, the protein is about 90% identical to SEQ ID NO:5. In embodiments, the protein is about 85% identical to SEQ ID NO:5. In embodiments, the protein is about 80% identical to SEQ ID NO:5. In embodiments, the protein is about 60% identical to SEQ ID NO:5.
In embodiments, the protein is 99% identical to SEQ ID NO:6. In embodiments, the protein is 98% identical to SEQ ID NO:6. In embodiments, the protein is 95% identical to SEQ ID NO:6. In embodiments, the protein is 90% identical to SEQ ID NO:6. In embodiments, the protein is 85% identical to SEQ ID NO:6. In embodiments, the protein is 80% identical to SEQ ID NO:6. In embodiments, the protein is 60% identical to SEQ ID NO:6. In embodiments, the protein is about 99% identical to SEQ ID NO:6. In embodiments, the protein is about 98% identical to SEQ ID NO:6. In embodiments, the protein is about 95% identical to SEQ ID NO:6. In embodiments, the protein is about 90% identical to SEQ ID NO:6. In embodiments, the protein is about 85% identical to SEQ ID NO:6. In embodiments, the protein is about 80% identical to SEQ ID NO:6. In embodiments, the protein is about 60% identical to SEQ ID NO:6.
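A percent identity figure depends on how the two sequences are aligned and on which length is used as the denominator. The sketch below assumes the simplest case, an ungapped position-by-position comparison of two sequences of equal length; this is an illustrative assumption, and alignment-based tools would be needed for sequences of different lengths. The function name is hypothetical; the two sequences are those recited above as SEQ ID NO:5 and SEQ ID NO:6.

```python
def percent_identity(seq_a, seq_b):
    """Percent of positions with identical residues (ungapped, equal length)."""
    if not seq_a or len(seq_a) != len(seq_b):
        raise ValueError("this sketch only compares non-empty sequences of equal length")
    matches = sum(a == b for a, b in zip(seq_a, seq_b))
    return 100.0 * matches / len(seq_a)

SEQ_ID_NO_5 = "LGLVAFLIFGLVLILIHLFAAGWVFFAILLLLALILA"
SEQ_ID_NO_6 = "LGIILLLAIGLILLAFHLFFAGWLFIAILLFSGIILA"

# Identity between the two recited core sequences under the assumptions above.
identity_5_vs_6 = percent_identity(SEQ_ID_NO_5, SEQ_ID_NO_6)
```

A candidate protein sequence of the same length could be compared against either reference in the same way.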
In embodiments, the set of core amino acids includes at least 50% of the total number of amino acid residues in the protein.
In embodiments, the ligand is a porphyrin, porphycene, rubyrin, rosarin, hexaphyrin, sapphyrin, chlorophyll, chlorin, phthalocyanine, porphyrazine, corrole, N-confused porphyrin, bacteriochlorophyll, pheophytin, texaphyrin, or related macrocyclic-based component that is capable of binding a metal ion. In embodiments, the ligand is a detectable agent. In embodiments, the ligand is a therapeutic agent, biological agent, cytotoxic agent, magnetic resonance imaging (MRI) agent, positron emission tomography (PET) agent, radiological imaging agent, diagnostic agent, theranostic, or a photodynamic therapy (PDT) agent. In embodiments, the ligand is a therapeutic agent. In embodiments, the ligand is a biological agent. In embodiments, the ligand is a cytotoxic agent (e.g., an anticancer agent). In embodiments, the ligand is a magnetic resonance imaging (MRI) agent. In embodiments, the ligand is a positron emission tomography (PET) agent. In embodiments, the ligand is a radiological imaging agent. In embodiments, the ligand is a diagnostic agent. In embodiments, the ligand is a theranostic agent. In embodiments, the ligand is a photodynamic therapy (PDT) agent. In embodiments, the ligand is a small molecule.
In embodiments, the ligand is a catalyst. In embodiments, the catalyst catalyzes an abiological or bio-orthogonal reaction. In embodiments, the ligand is a molecule that exists within a living system (e.g., within an organism or a cell). In embodiments, the ligand is (CF3)4PZn. In embodiments, the ligand is (CF3)4PFe. In embodiments, the ligand atomic coordinates are optimized using known methods in the art (e.g., density functional theory using the B3-LYP functional).
In embodiments, the method further includes synthesizing the protein (e.g., utilizing the expression vectors such as the plasmid method described in the Example, such as cloning into the IPTG-inducible pET-11a plasmid). In embodiments, the method further includes expressing the protein.
At 1402, a set of ligand binding amino acid residues within a protein for binding to a ligand can be identified. These ligand binding amino acid residues can form the backbone of a protein. Each ligand binding amino acid residue within the protein can be associated with a set of ligand binding amino acid residue atomic coordinates, which can define the ligand binding amino acid residue in space. Furthermore, each atom of the ligand can be associated with a set of ligand atomic coordinates, which can define the ligand in space. As noted herein, these coordinates can be Cartesian coordinates, internal coordinates, polar coordinates, spherical coordinates, and/or the like.
At 1404, a set of core amino acid residues within the protein that do not bind to the ligand can be identified. The backbone of the protein can further include core amino acid residues. Each core amino acid residue within the protein can be associated with a set of core amino acid residue atomic coordinates, which define the core amino acid residue in space.
At 1406, the set of ligand binding amino acid residues, the set of ligand binding amino acid residue atomic coordinates, the set of core amino acid residues, and the set of core amino acid residue atomic coordinates can be optimized. For example, the optimization can be performed using an energy minimization calculation including, for example, a molecular mechanics function, a structural bioinformatics function, an amino acid sidechain packing function, a protein radius of gyration function, and/or the like. Optimizing the set of ligand binding amino acid residues, the set of ligand binding amino acid residue atomic coordinates, the set of core amino acid residues, and the set of core amino acid residue atomic coordinates can generate an energetically stabilized protein.
As shown in
The memory 1520 is a computer readable medium, such as volatile or non-volatile memory, that stores information within the computing system 1500. The memory 1520 can store data structures representing configuration object databases, for example. The storage device 1530 is capable of providing persistent storage for the computing system 1500. The storage device 1530 can be a floppy disk device, a hard disk device, an optical disk device, a tape device, or other suitable persistent storage means. The input/output device 1540 provides input/output operations for the computing system 1500. In some example embodiments, the input/output device 1540 includes a keyboard and/or pointing device. In various implementations, the input/output device 1540 includes a display unit for displaying graphical user interfaces.
According to some example embodiments, the input/output device 1540 can provide input/output operations for a network device. For example, the input/output device 1540 can include Ethernet ports or other networking ports to communicate with one or more wired and/or wireless networks (e.g., a local area network (LAN), a wide area network (WAN), the Internet).
In some example embodiments, the computing system 1500 can be used to execute various interactive computer software applications that can be used for organization, analysis and/or storage of data in various formats. Alternatively, the computing system 1500 can be used to execute any type of software application. These applications can be used to perform various functionalities, e.g., planning functionalities (e.g., generating, managing, editing of spreadsheet documents, word processing documents, and/or any other objects, etc.), computing functionalities, communications functionalities, etc. The applications can include various add-in functionalities (e.g., SAP Integrated Business Planning as an add-in for a spreadsheet and/or other type of program) or can be standalone computing products and/or functionalities. Upon activation within the applications, the functionalities can be used to generate the user interface provided via the input/output device 1540. The user interface can be generated and presented to a user by the computing system 1500 (e.g., on a computer screen monitor, etc.).
In an aspect is provided a system, including: at least one data processor; and at least one memory storing instructions which, when executed by the at least one data processor, result in operations including: a) identifying a set of ligand binding amino acid residues within a protein for binding to a ligand, wherein each ligand binding amino acid residue within the protein is associated with a set of ligand binding amino acid residue atomic coordinates and each atom of the ligand is associated with a set of ligand atomic coordinates; b) identifying a set of core amino acid residues within the protein that do not bind to the ligand, each core amino acid residue within the protein is associated with a set of core amino acid residue atomic coordinates; and c) optimizing the set of ligand binding amino acid residues; the set of ligand binding amino acid residue atomic coordinates; the set of core amino acid residues; and the set of core amino acid residue atomic coordinates; wherein the optimization is performed using at least an energy minimization calculation, and wherein the optimization is performed to energetically stabilize the protein.
In another aspect is provided a non-transitory computer-readable storage medium including program code, which when executed by at least one data processor, causes operations including: a) identifying a set of ligand binding amino acid residues within a protein for binding to a ligand, wherein each ligand binding amino acid residue within the protein is associated with a set of ligand binding amino acid residue atomic coordinates and each atom of the ligand is associated with a set of ligand atomic coordinates; b) identifying a set of core amino acid residues within the protein that do not bind to the ligand, each core amino acid residue within the protein is associated with a set of core amino acid residue atomic coordinates; and c) optimizing the set of ligand binding amino acid residues; the set of ligand binding amino acid residue atomic coordinates; the set of core amino acid residues; and the set of core amino acid residue atomic coordinates; wherein the optimization is performed using at least an energy minimization calculation, and wherein the optimization is performed to energetically stabilize the protein.
In an aspect is provided a protein sequence obtainable based on the energy minimization calculation using the method, the system, or the non-transitory computer-readable medium as described herein, including embodiments. In embodiments, the protein sequence is:
In embodiments, the protein sequence is SEQ ID NO:1. In embodiments, the protein sequence is SEQ ID NO:2. In embodiments, the protein sequence is SEQ ID NO:3. In embodiments, the protein sequence is SEQ ID NO:4. In embodiments, the protein sequence is SEQ ID NO:5. In embodiments, the protein sequence is SEQ ID NO:6. In embodiments, the protein sequence is SEQ ID NO:7.
In an aspect is provided a protein, or conservatively modified variant thereof, having the sequence:
In embodiments, the protein sequence is SEQ ID NO:1. In embodiments, the protein sequence is SEQ ID NO:2. In embodiments, the protein sequence is SEQ ID NO:3.
In embodiments, the protein is 99% identical to SEQ ID NO:1. In embodiments, the protein is 98% identical to SEQ ID NO:1. In embodiments, the protein is 95% identical to SEQ ID NO:1. In embodiments, the protein is 90% identical to SEQ ID NO:1. In embodiments, the protein is 85% identical to SEQ ID NO:1. In embodiments, the protein is 80% identical to SEQ ID NO:1. In embodiments, the protein is 60% identical to SEQ ID NO:1. In embodiments, the protein is about 99% identical to SEQ ID NO:1. In embodiments, the protein is about 98% identical to SEQ ID NO:1. In embodiments, the protein is about 95% identical to SEQ ID NO:1. In embodiments, the protein is about 90% identical to SEQ ID NO:1. In embodiments, the protein is about 85% identical to SEQ ID NO:1. In embodiments, the protein is about 80% identical to SEQ ID NO:1. In embodiments, the protein is about 60% identical to SEQ ID NO:1.
In embodiments, the protein is 99% identical to SEQ ID NO:2. In embodiments, the protein is 98% identical to SEQ ID NO:2. In embodiments, the protein is 95% identical to SEQ ID NO:2. In embodiments, the protein is 90% identical to SEQ ID NO:2. In embodiments, the protein is 85% identical to SEQ ID NO:2. In embodiments, the protein is 80% identical to SEQ ID NO:2. In embodiments, the protein is 60% identical to SEQ ID NO:2. In embodiments, the protein is about 99% identical to SEQ ID NO:2. In embodiments, the protein is about 98% identical to SEQ ID NO:2. In embodiments, the protein is about 95% identical to SEQ ID NO:2. In embodiments, the protein is about 90% identical to SEQ ID NO:2. In embodiments, the protein is about 85% identical to SEQ ID NO:2. In embodiments, the protein is about 80% identical to SEQ ID NO:2. In embodiments, the protein is about 60% identical to SEQ ID NO:2.
In embodiments, the protein is 99% identical to SEQ ID NO:3. In embodiments, the protein is 98% identical to SEQ ID NO:3. In embodiments, the protein is 95% identical to SEQ ID NO:3. In embodiments, the protein is 90% identical to SEQ ID NO:3. In embodiments, the protein is 85% identical to SEQ ID NO:3. In embodiments, the protein is 80% identical to SEQ ID NO:3. In embodiments, the protein is 60% identical to SEQ ID NO:3. In embodiments, the protein is about 99% identical to SEQ ID NO:3. In embodiments, the protein is about 98% identical to SEQ ID NO:3. In embodiments, the protein is about 95% identical to SEQ ID NO:3. In embodiments, the protein is about 90% identical to SEQ ID NO:3. In embodiments, the protein is about 85% identical to SEQ ID NO:3. In embodiments, the protein is about 80% identical to SEQ ID NO:3. In embodiments, the protein is about 60% identical to SEQ ID NO:3.
In embodiments, the protein is further bound to a ligand. In embodiments, the ligand is bound to the protein via a dative covalent bond. In embodiments, the ligand is a porphyrin, porphycene, rubyrin, rosarin, hexaphyrin, sapphyrin, chlorophyll, chlorin, phthalocyanine, porphyrazine, corrole, N-confused porphyrin, bacteriochlorophyll, pheophytin, texaphyrin, or related macrocyclic-based component, which is capable of binding a metal ion. In embodiments, the ligand is a detectable agent. In embodiments, the ligand is a therapeutic agent, biological agent, cytotoxic agent, magnetic resonance imaging (MRI) agent, positron emission tomography (PET) agent, radiological imaging agent, diagnostic agent, theranostic, or a photodynamic therapy (PDT) agent. In embodiments, the ligand is a catalyst. In embodiments, the catalyst catalyzes an abiological or bio-orthogonal reaction. In embodiments, the ligand is a molecule that exists within a living system.
In embodiments, the protein is 99% identical to SEQ ID NO:8. In embodiments, the protein is 98% identical to SEQ ID NO:8. In embodiments, the protein is 95% identical to SEQ ID NO:8. In embodiments, the protein is 90% identical to SEQ ID NO:8. In embodiments, the protein is 85% identical to SEQ ID NO:8. In embodiments, the protein is 80% identical to SEQ ID NO:8. In embodiments, the protein is 60% identical to SEQ ID NO:8. In embodiments, the protein is about 99% identical to SEQ ID NO:8. In embodiments, the protein is about 98% identical to SEQ ID NO:8. In embodiments, the protein is about 95% identical to SEQ ID NO:8. In embodiments, the protein is about 90% identical to SEQ ID NO:8. In embodiments, the protein is about 85% identical to SEQ ID NO:8. In embodiments, the protein is about 80% identical to SEQ ID NO:8. In embodiments, the protein is about 60% identical to SEQ ID NO:8.
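The percent identity values recited above can be made concrete with a short sketch. It assumes the two sequences have already been aligned to equal length, with gaps written as "-", and it counts identity over ungapped columns only. That denominator is one of several conventions in use and is an assumption here, and `percent_identity` is a hypothetical helper, not a function named in this disclosure.

```python
def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percent identity over aligned columns in which neither sequence has a gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("sequences must be pre-aligned to equal length")
    # Keep only columns where both sequences carry a residue.
    columns = [(a, b) for a, b in zip(aligned_a, aligned_b) if a != "-" and b != "-"]
    if not columns:
        return 0.0
    matches = sum(a == b for a, b in columns)
    return 100.0 * matches / len(columns)
```

For example, a ten-residue segment that differs from its reference at one position would score 90% under this convention.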
Informal Sequence Listing:
A computer-implemented method, comprising: (a) identifying a set of ligand binding amino acid residues within a protein for binding to a ligand, wherein each ligand binding amino acid residue within said protein is associated with a set of ligand binding amino acid residue atomic coordinates and each atom of said ligand is associated with a set of ligand atomic coordinates; (b) identifying a set of core amino acid residues within said protein that do not bind to said ligand, each core amino acid residue within said protein is associated with a set of core amino acid residue atomic coordinates; and (c) optimizing: said set of ligand binding amino acid residues; said set of ligand binding amino acid residue atomic coordinates; said set of core amino acid residues; and said set of core amino acid residue atomic coordinates; wherein the optimization is performed using at least an energy minimization calculation, and wherein the optimization is performed to energetically stabilize said protein.
The method of embodiment 1, wherein step c) comprises simultaneously optimizing: said set of ligand binding amino acid residues; said set of ligand binding amino acid residue atomic coordinates; said set of core amino acid residues; and said set of core amino acid residue atomic coordinates.
The method of embodiment 1, wherein the energy minimization calculation comprises a molecular mechanics function, a structural bioinformatics function, an amino acid sidechain packing function, a protein radius of gyration function, or a combination thereof.
The method of embodiment 1, wherein the core amino acids are at least 75% inaccessible to a 1.8 Å spherical probe.
The method of embodiment 1, wherein said set of core amino acids comprises at least six amino acid residues.
The method of any one of embodiments 1 to 5, wherein the optimizing comprises fixing an atomic coordinate of at least one ligand binding amino acid residue atomic coordinate; fixing an atomic coordinate of at least one ligand atomic coordinate; prohibiting introduction of an additional amino acid residue into the set of ligand binding amino acid residues; or prohibiting the deletion of an amino acid residue from the set of ligand binding amino acid residues.
The method of any one of embodiments 1 to 5, wherein the optimizing comprises fixing at least one atomic coordinate of the ligand atomic coordinates.
The method of any one of embodiments 1 to 7, wherein the energy minimization calculation comprises a penalty function.
The method of any one of embodiments 1 to 8, wherein the optimizing does not comprise fixing at least one atomic coordinate of at least one core amino acid residue atomic coordinates.
The method of any one of embodiments 1 to 8, wherein the optimizing comprises introducing an additional ligand binding amino acid residue into the set of ligand binding amino acid residues, deleting a ligand binding amino acid residue from the set of ligand binding amino acid residues, or a geometric transformation of at least one atomic coordinate of the ligand binding amino acid residue atomic coordinates.
The method of embodiment 10, wherein the geometric transformation comprises a translation or a rotation of at least one atomic coordinate of the ligand binding amino acid residue atomic coordinates.
The method of any one of embodiments 1 to 11, wherein the optimizing comprises a geometric transformation of at least one atomic coordinate of the core amino acid residue atomic coordinates.
The method of any one of embodiments 10 to 12, wherein the geometric transformation of at least one atomic coordinate comprises no greater than a 6 Å displacement of any atomic coordinate.
The method of any one of embodiments 10 to 12, wherein the geometric transformation of at least one atomic coordinate comprises no greater than a 3 Å displacement of any atomic coordinate.
The method of any one of embodiments 1 to 14, wherein the optimizing comprises an iterative or heuristic algorithm.
The method of any one of embodiments 1 to 14, wherein the optimizing comprises a simplex algorithm, memetic algorithm, differential evolution algorithm, evolutionary algorithm, genetic algorithm, tabu algorithm, particle swarm algorithm, or simulated annealing algorithm.
The method of any one of embodiments 1 to 14, wherein the optimizing comprises a Monte Carlo sampling algorithm, dead-end elimination algorithm, branch and bound algorithm, or a pruning algorithm.
The method of any one of embodiments 1 to 17, wherein the ligand is a porphyrin, porphycene, rubyrin, rosarin, hexaphyrin, sapphyrin, chlorophyll, chlorin, phthalocyanine, porphyrazine, corrole, N-confused porphyrin, bacteriochlorophyll, pheophytin, texaphyrin, or related macrocyclic-based component, that is capable of binding a metal ion.
The method of any one of embodiments 1 to 17, wherein the ligand is a detectable agent.
The method of any one of embodiments 1 to 17, wherein the ligand is a therapeutic agent, biological agent, cytotoxic agent, magnetic resonance imaging (MRI) agent, positron emission tomography (PET) agent, radiological imaging agent, diagnostic agent, theranostic, or a photodynamic therapy (PDT) agent.
The method of any one of embodiments 1 to 17, wherein the ligand is a catalyst.
The method of any one of embodiments 1 to 17, wherein the catalyst catalyzes an abiological or bio-orthogonal reaction.
The method of any one of embodiments 1 to 17, wherein the ligand is a molecule that exists within a living system.
A system, comprising: at least one data processor; and at least one memory storing instructions which, when executed by the at least one data processor, result in operations comprising: (a) identifying a set of ligand binding amino acid residues within a protein for binding to a ligand, wherein each ligand binding amino acid residue within said protein is associated with a set of ligand binding amino acid residue atomic coordinates and each atom of said ligand is associated with a set of ligand atomic coordinates; (b) identifying a set of core amino acid residues within said protein that do not bind to said ligand, each core amino acid residue within said protein is associated with a set of core amino acid residue atomic coordinates; and (c) optimizing: said set of ligand binding amino acid residues; said set of ligand binding amino acid residue atomic coordinates; said set of core amino acid residues; and said set of core amino acid residue atomic coordinates; wherein the optimization is performed using at least an energy minimization calculation, and wherein the optimization is performed to energetically stabilize said protein.
The system of embodiment 24, wherein the energy minimization calculation comprises functions from molecular mechanics, functions from structural bioinformatics, amino acid sidechain packing functions, protein radius of gyration functions, or a combination thereof.
The system of embodiment 24, wherein the core amino acids are at least 75% inaccessible to a 1.8 Å spherical probe.
The system of embodiment 24, wherein said set of core amino acids comprises at least six amino acid residues.
The system of any one of embodiments 24 to 27, wherein the optimizing comprises fixing an atomic coordinate of at least one ligand binding amino acid residue atomic coordinate; fixing an atomic coordinate of at least one ligand atomic coordinate; prohibiting introduction of an additional amino acid residue into the set of ligand binding amino acid residues; or prohibiting the deletion of an amino acid residue from the set of ligand binding amino acid residues.
The system of any one of embodiments 24 to 28, wherein the optimizing comprises fixing at least one atomic coordinate of the ligand atomic coordinates.
The system of any one of embodiments 24 to 29, wherein the energy minimization calculation comprises a penalty function.
The system of any one of embodiments 24 to 30, wherein the optimizing does not comprise fixing at least one atomic coordinate of at least one core amino acid residue atomic coordinates.
The system of any one of embodiments 24 to 31, wherein the optimizing comprises introducing an additional ligand binding amino acid residue into the set of ligand binding amino acid residues, deleting a ligand binding amino acid residue from the set of ligand binding amino acid residues, or a geometric transformation of at least one atomic coordinate of the ligand binding amino acid residue atomic coordinates.
The system of embodiment 32, wherein the geometric transformation comprises a translation or a rotation of at least one atomic coordinate of the ligand binding amino acid residue atomic coordinates.
The system of any one of embodiments 24 to 33, wherein the optimizing comprises a geometric transformation of at least one atomic coordinate of the core amino acid residue atomic coordinates.
The system of any one of embodiments 24 to 34, wherein the geometric transformation of at least one atomic coordinate comprises no greater than a 6 Å displacement of any atomic coordinate.
The system of any one of embodiments 24 to 34, wherein the geometric transformation of at least one atomic coordinate comprises no greater than a 3 Å displacement of any atomic coordinate.
The system of any one of embodiments 24 to 36, wherein the optimizing comprises an iterative or heuristic algorithm.
The system of any one of embodiments 24 to 36, wherein the optimizing comprises a simplex algorithm, memetic algorithm, differential evolution algorithm, evolutionary algorithm, genetic algorithm, tabu algorithm, particle swarm algorithm, or simulated annealing algorithm.
The system of any one of embodiments 24 to 36, wherein the optimizing comprises a Monte Carlo sampling algorithm, dead-end elimination algorithm, branch and bound algorithm, or a pruning algorithm.
A non-transitory computer-readable storage medium including program code, which when executed by at least one data processor, causes operations comprising: (a) identifying a set of ligand binding amino acid residues within a protein for binding to a ligand, wherein each ligand binding amino acid residue within said protein is associated with a set of ligand binding amino acid residue atomic coordinates and each atom of said ligand is associated with a set of ligand atomic coordinates; (b) identifying a set of core amino acid residues within said protein that do not bind to said ligand, each core amino acid residue within said protein is associated with a set of core amino acid residue atomic coordinates; and (c) optimizing: said set of ligand binding amino acid residues; said set of ligand binding amino acid residue atomic coordinates; said set of core amino acid residues; and said set of core amino acid residue atomic coordinates; wherein the optimization is performed using at least an energy minimization calculation, and wherein the optimization is performed to energetically stabilize said protein.
A protein sequence obtainable based on the energy minimization calculation using the method of any of embodiments 1 to 23, the system of any of embodiments 24 to 39, or the non-transitory computer-readable medium of embodiment 40.
A protein, or conservatively modified variant thereof, having the sequence SEQ ID NO:1.
The protein of embodiment 42, wherein the protein is 90% identical to SEQ ID NO:1.
The protein of embodiment 42, bound to a ligand.
The protein of embodiment 42, wherein the ligand is bound to the protein via a dative covalent bond.
The protein of embodiment 44, wherein the ligand is a porphyrin, porphycene, rubyrin, rosarin, hexaphyrin, sapphyrin, chlorophyll, chlorin, phthalocyanine, porphyrazine, corrole, N-confused porphyrin, bacteriochlorophyll, pheophytin, texaphyrin, or related macrocyclic-based component, that is capable of binding a metal ion.
The protein of embodiment 44, wherein the ligand is a detectable agent.
The protein of embodiment 44, wherein the ligand is a therapeutic agent, biological agent, cytotoxic agent, magnetic resonance imaging (MRI) agent, positron emission tomography (PET) agent, radiological imaging agent, diagnostic agent, theranostic, or a photodynamic therapy (PDT) agent.
The protein of embodiment 44, wherein the ligand is a catalyst.
The protein of embodiment 44, wherein the catalyst catalyzes an abiological or bio-orthogonal reaction.
The protein of embodiment 44, wherein the ligand is a molecule that exists within a living system.
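Several of the embodiments above constrain the optimization by fixing selected atomic coordinates and by limiting any geometric transformation to a maximum displacement (for example, 3 Å or 6 Å). One way such constraints could be enforced is sketched below. The helper names are hypothetical, and pulling an over-long move back onto the allowed sphere (clamping) is an assumed choice made for this sketch. Rejecting the move outright would be an equally valid reading of the embodiments.

```python
import math


def clamp_displacement(start, trial, max_disp=3.0):
    """Return `trial`, pulled back toward `start` if it lies more than max_disp away."""
    distance = math.dist(start, trial)
    if distance <= max_disp:
        return tuple(trial)
    scale = max_disp / distance
    return tuple(s + (t - s) * scale for s, t in zip(start, trial))


def apply_move(start, coords, index, delta, fixed=frozenset(), max_disp=3.0):
    """Translate atom `index` by `delta`, honoring fixed atoms and the displacement cap.

    `start` holds the reference coordinates against which displacement is measured;
    `coords` holds the current coordinates. A new list is returned.
    """
    if index in fixed:
        return list(coords)  # fixed coordinates are never changed
    moved = tuple(c + d for c, d in zip(coords[index], delta))
    updated = list(coords)
    updated[index] = clamp_displacement(start[index], moved, max_disp)
    return updated
```

Measuring displacement against the starting coordinates, rather than the previous step, keeps many small moves from accumulating past the cap.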
While the de novo design of proteins has seen many successes1-12, no small molecule ligand- or organic cofactor-binding protein has been designed entirely from first principles to achieve i) a unique structure and ii) a predetermined binding-site geometry with sub-Å accuracy. Such achievements are prerequisites for the design of proteins that control and enable complex reaction trajectories, where the relative placements of cofactors, substrates, and protein side chains must be established within the length scale of a chemical bond. Here, we design a small molecule-binding protein based on the concept that the entire protein contributes to establishing the binding geometry of a ligand13-16. Mutational studies of natural ligand-binding proteins have highlighted the counter-intuitive importance of distant amino acids (10-20 Å from the binding site) on binding affinity, which work in concert with first-shell amino acids surrounding the bound ligand13-16. We implement this concept for the first time in de novo protein design. Hence, what are traditionally considered separate sectors (the hydrophobic core and the ligand-binding site) we treat as an inseparable unit. We utilize flexible backbone sequence design of a parametrically defined protein template to simultaneously pack the protein interior both proximal to and remote from the ligand-binding site. Thus, tight interdigitation of core side chains quite removed from the binding site structurally restrains the first- and second-shell packing around the ligand. We apply this principle to the decades-old problem of structural non-uniqueness in de novo-designed heme-binding proteins17. We designed a novel protein, PS1, which binds a highly electron-deficient, non-natural porphyrin at temperatures up to 100° C. The high-resolution structure of holo-PS1 is in sub-Å agreement with the design.
The structure of apo-PS1 retains the remote core packing of the holo, predisposing a flexible binding region for the desired ligand-binding geometry. Our results illustrate the unification of core packing and binding site definition as a fundamental principle of ligand-binding protein design.
Recent successes in the field of de novo design of coiled coils3,7 and metalloproteins4,8-10 are encouraging, but so far have not translated to more complex cofactors. In fact, attempts at computational design of novel small molecule ligand-binding proteins have been limited in number and generally focused on changing only the binding site of natural proteins, leaving the core of the protein intact18,19. For example, the binding site of a natural scaffold was computationally redesigned to bind a hydrophobic organic ligand but required multiple rounds of mutagenesis and experimental selection using yeast display18. At the other extreme, de novo heme-binding helical bundle proteins have been designed entirely from first principles17,20, but these “maquettes” have evaded structural determination, largely due to aggregation or their dynamical properties17,21,22. With the exception of short, covalently linked peptide-heme complexes23, the only structure of a de novo heme-binding protein was solved for an apo-protein, which showed a hydrophobically collapsed binding site with no space for binding heme21,24. The lack of precise, predictive three-dimensional models of heme-binding maquettes, coupled with the failure to determine high-resolution structures, has limited their utility, although maquettes have elucidated electrostatic roles for tuning redox potentials of donors/acceptors in electron-transfer reactions20. An iterative trial-and-error approach has been shown to incrementally improve NMR spectra of maquette proteins25, and may ultimately lead to the determination of three-dimensional structures; however, a robust computational method is needed to deliver precisely predetermined structures with sub-Å accuracy.
Our own work has focused on the development of computational design of cofactor-binding proteins26-28 with atomic-level accuracy. We used a step-wise strategy in which we first employed a mathematical parameterization of an antiparallel coiled coil to construct a rigid binding site, then, in a separate calculation, introduced side chain packing constrained by this rigid backbone26-28. This approach resulted in de novo porphyrin-binding proteins with the desired tertiary structure and ligand-binding stoichiometry, but not of sufficient conformational uniqueness to yield a high-resolution structure.
A body of work with natural proteins13-16 has shown that side chain packing quite distant from the binding site can propagate to significantly affect ligand binding, catalysis, and allosteric regulation. Thus, the entire hydrophobic core—even residues 20 Å away from the binding site—should be considered as an essential extension of the primary and secondary shell interactions with the ligand. We noted that, unlike natural proteins (
Protein Design.
The design of PS1 (Porphyrin-binding Sequence 1) began with the previously parameterized backbone from the de novo designed protein SCRPZ-228, a protein that bound an extended porphinato(metal)-polypyridyl(metal) cofactor (
We used the parameterized backbone of SCRPZ-2 as a starting point for design of a protein that binds a much smaller abiological porphyrin (CF3)4PZn ([5,10,15,20-tetrakis(trifluoromethyl)porphinato]Zn2+) (
Biophysical Characterization of PS1.
PS1 is monomeric (
Time-resolved transient absorption spectroscopy showed that protein/(CF3)4PZn interactions are preserved even at near-boiling temperatures where the protein retains its native structure (
We also examined another high-scoring sequence from the design process (named PS2), with a hydrophobic core distinct from that of PS1, which was expressed, purified, and tested for binding to (CF3)4PZn. Electronic absorption spectra of holo-PS2 show narrow absorption bands similar to those evinced by holo-PS1 (
Structural Characterization of Holo-PS1.
An exceptionally well-resolved NMR structural ensemble of holo-PS1 (
The location and orientation of the porphyrin within PS1 was determined by an exceptional number of porphyrin-protein NOEs (26 porphyrin-protein NOEs were used in the structural refinement,
Ab Initio Folding Predictions and NMR Structure of Apo-PS1.
Ab initio folding38 simulations of the apo-PS1 sequence predict a bipartite structure with a conformationally unique folded core, which closely resembles the core of holo-PS1, and a more flexible cofactor-binding region (
The NMR structure of apo-PS1 was also solved (
Dynamics and Structural Comparisons of Apo- vs. Holo-PS1.
Solvent hydrogen-deuterium exchange (HDX) experiments and molecular dynamics simulations of apo-PS1 also show a gradient in conformational stability between the apolar core and the binding site of apo-PS1 (
In both the apo- and the holo-structures, the interior side chains stack into four layers, beginning at the edge of the porphyrin-binding site and extending to the end of the bundle (
The vast improvement in conformational specificity between PS1 and earlier designs illuminates the importance of considering hydrophobic core packing and the construction of ligand-binding sites as a joint optimization problem during computational design. Our previous studies indicate that the use of rigid backbones optimized for ligand-protein interactions alone is insufficient for conformational uniqueness without explicitly considering and designing a backbone that can also accommodate a well-defined apolar core. Similarly, attempts to radically change the specificity of natural proteins by varying their binding sites, while treating the surrounding protein matrix as a rigid unit of fixed sequence, have required subsequent experimental optimization via extensive rounds of random mutagenesis and selection18,19,39. The reliance on experimental methods such as directed evolution and genetic selections, while currently useful in many practical applications19, speaks to our incomplete understanding of protein structure and function, and the need to test and refine this knowledge through design. It is noteworthy that the first sequence designed and tested via our approach succeeded without need for experimental screening. Furthermore, another high-scoring protein design also bound the cofactor, suggesting a possible generality of the method within the helical bundle protein family. These studies bring chemists closer to the ultimate goal of the computational design of fully functional proteins with properties unprecedented in nature.
PS1 Design Process.
Full methods and scripts regarding the design of PS1 can be found in Example 2. Briefly, the entire core of the D2-symmetrical parameterized backbone of SCRPZ-2 was redesigned to bind (CF3)4PZn via a customized Rosetta script for flexible backbone sequence design. The flexible backbone design protocol was as follows: distance and angle constraints between His and Zn were loaded, the model was repacked without mutations, the backbone was relaxed via Rosetta Backrub, three trials of a Monte Carlo flexible backbone design sub-protocol (see Example 2) were performed, and models with native protein-like packing (i.e., a Rosetta PackStat score ≥ 0.58) were output. 170 designs were output from 500 runs through the protocol (
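The final filtering step of the protocol, and the yield it produced, can be sketched in plain Python. This is not the Rosetta script referred to in the text and it calls no Rosetta API. The function names and the `(name, score)` input format are hypothetical, and the acceptance rule shown is the standard Metropolis criterion, which is assumed here as a generic example of a Monte Carlo acceptance test rather than taken from the disclosure.

```python
import math

PACKSTAT_CUTOFF = 0.58  # packing threshold quoted in the text


def filter_designs(scored, cutoff=PACKSTAT_CUTOFF):
    """Keep (name, packstat) pairs whose score meets the cutoff, best score first."""
    kept = [(name, score) for name, score in scored if score >= cutoff]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


def pass_fraction(n_output, n_runs):
    """Fraction of protocol runs that produced an output model."""
    return n_output / n_runs


def metropolis_accept(delta_e, kT, u):
    """Standard Metropolis test: accept downhill moves, uphill with probability exp(-dE/kT).

    `u` is a uniform random number in [0, 1) supplied by the caller.
    """
    if delta_e <= 0:
        return True
    return u < math.exp(-delta_e / kT)
```

With the numbers given in the text, `pass_fraction(170, 500)` evaluates to 0.34, that is, about one run in three passed the packing filter.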
Protein Expression, Purification, and Biophysical Characterization.
Details regarding protein expression, purification, and biophysical characterization can be found in the supplement. Briefly, genes for the proteins were ordered from GenScript and cloned into a pET-11a plasmid; the expressed proteins were purified via a Ni column, followed by His-tag cleavage by TEV protease. The protein sequence of expressed, purified PS1 after His-tag cleavage is: SEFEKLRQTGDELVQAFQRLREIFDKGDDDSLEQVLEEIEELIQKHRQLFDNRQEAADTEAAKQGDQWVQLFQRFREAIDKGDKDSLEQLLEELEQALQKIRELAEKKN (SEQ ID NO:2). The sequence for PS2 can be found in Example 2.
Porphyrin Binding to Apo-Protein.
A 2-fold excess of the cofactor (CF3)4PZn was added from a 4 mM DMSO stock solution to a 50 mM NaPi, 100 mM NaCl, pH 7.5 buffer containing apo-protein (final DMSO concentrations were kept <1%). The buffer solution of apo-PS1 protein was heated for 5 minutes at 50° C., (CF3)4PZn was then added from the DMSO stock solution, and the resultant mixture was vortexed for 5 seconds and placed back in the heat block at 50° C. for 15 minutes, with vortexing every 3 minutes. The protein/cofactor solution was then spun at 14000×g in an Amicon Ultra-0.5 mL centrifuge filter for 10 min, three times, replenishing the buffer to 0.5 mL after each 10 min spin. Finally, the protein solution was spun for 4 min at 12000×g in an Amicon ultrafree-MC GV filter (UFC30GV0S). The holo-PS1 sample was then used for spectroscopic experiments immediately afterward, and diluted to an appropriate concentration if necessary. Binding of (CF3)4PFe was carried out in the same fashion, with the exception that the porphyrin was first dissolved in a stock of DMSO/CHCl3.
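The requirement that DMSO stay below 1% limits how concentrated the apo-protein can be when a 2-fold excess is delivered from a 4 mM stock. The worked example below is a sketch of that arithmetic; the 10 μM and 25 μM protein concentrations and the 500 μL buffer volume are illustrative assumptions, not values from the text.

```python
def stock_volume_ul(protein_um, buffer_ul, excess=2.0, stock_mm=4.0):
    """Volume of cofactor stock (uL) that delivers the requested molar excess."""
    protein_pmol = protein_um * buffer_ul           # uM * uL = pmol
    cofactor_nmol = protein_pmol * excess / 1000.0  # pmol -> nmol
    return cofactor_nmol / stock_mm                 # a 4 mM stock holds 4 nmol per uL


def dmso_fraction(protein_um, buffer_ul, excess=2.0, stock_mm=4.0):
    """Final DMSO volume fraction after adding the stock to the buffer."""
    added = stock_volume_ul(protein_um, buffer_ul, excess, stock_mm)
    return added / (buffer_ul + added)


# Hypothetical samples: 500 uL of apo-protein at 10 uM and at 25 uM.
for conc in (10.0, 25.0):
    print(conc, stock_volume_ul(conc, 500.0), round(100 * dmso_fraction(conc, 500.0), 2))
```

Under these assumptions the 1% limit is reached at roughly 20 μM protein, so more concentrated samples would need a more concentrated stock or a smaller excess.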
Nuclear Magnetic Resonance Spectroscopy.
NMR spectra were recorded at 298 K on a 900 MHz Bruker Avance II spectrometer equipped with a cryogenic probe for the holo-protein or on a Bruker 600 MHz spectrometer equipped with a cryogenic probe for the apo-protein.
Sequence-specific backbone (1HN, 15N, 13Cα, 13CO) and 13Cβ resonance assignments were obtained by using 3D HNCACB/CBCA(CO)NH and 3D HNCO/CO(CA)NH along with the program AUTOASSIGN41. 1Hα and 1Hβ assignments were extended by a 3D HAHB(CO)NH experiment, and more peripheral side chain chemical shifts were assigned with aliphatic 3D CCH-TOCSY (mixing time: 75 ms) and simultaneous 3D 15N/13Caliphatic/13Caromatic-resolved [1H,1H]-NOESY (mixing time: 120 ms). Overall assignments were obtained for 98.1% and 95.9% of the backbone (excluding the N-terminal NH3+) and 13CO, and for 97% and 94.6% of the side chain chemical shifts (excluding Lys NH3+, Arg NH2, OH, side chain 13CO and aromatic 13Cγ) for the holo- and apo-proteins, respectively. All spectra were processed and analyzed with the programs NMRPIPE and XEASY, respectively42,43. 1H-1H upper distance limit constraints for structure calculations were extracted from NOESY. In addition, backbone dihedral angle constraints were derived from chemical shifts using the program TALOS for residues located in well-defined secondary structure elements44. 2D constant-time [13C,1H]-HSQC spectra were recorded as described for 5% fractionally 13C-labeled samples to obtain stereo-specific assignments for isopropyl groups of Val and Leu45. The 1DNH residual dipolar couplings (RDCs) were measured with 2D 1H-15N IPAP-HSQC in samples aligned using Pf1 phage (ASLA biotech). The program CYANA was used to assign long-range NOEs and calculate the structure46,47. Backbone 1DNH RDCs were used as orientational constraints for the later stages of refinement with XPLOR-NIH48. The final set of structures was further refined by restrained molecular dynamics in explicit water48. NMR structure quality was assessed with the Protein Structure Validation Software Suite (PSVS)49 (Table S4).
Hydrogen-Deuterium Exchange Measurements.
For the measurements of H/D exchange rates, a series of 2D 15N HSQC spectra were obtained on a 900 MHz Bruker Avance II spectrometer. The first spectra were recorded 9 minutes after the dilution of 100 μl of a high-concentration sample in H2O (2 mM for apo and 1.2 mM for holo) into 200 μl of D2O buffer. 15-min HSQC spectra were recorded successively in the first 12 hours, one 15-min spectrum every hour in the second 12 hours, one 15-min spectrum every two hours in the third 12 hours, and so on. The last points were 2730.6 and 4903.5 min for apo and holo, respectively. For the H/D exchange rate analysis, the peak height of each isolated peak was extracted by nmrDraw and fitted to a one-phase exponential decay.
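The peak-height fit can be illustrated with a minimal Python sketch. It assumes the decay has the form height = A·exp(−k·t) with a zero plateau and strictly positive, noise-free heights, so that a linear least-squares fit on the logarithm suffices; the text does not state which fitting software or plateau treatment was used, and real data would normally call for a nonlinear fit. All numbers in the example are synthetic.

```python
import math


def fit_exponential_decay(times_min, heights):
    """Fit heights = A * exp(-k * t) by least squares on ln(height).

    Returns (A, k) with k in 1/min. Assumes a zero plateau and positive heights.
    """
    n = len(times_min)
    logs = [math.log(h) for h in heights]
    mean_t = sum(times_min) / n
    mean_y = sum(logs) / n
    sxx = sum((t - mean_t) ** 2 for t in times_min)
    sxy = sum((t - mean_t) * (y - mean_y) for t, y in zip(times_min, logs))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_t
    return math.exp(intercept), -slope


# Synthetic peak heights sampled every 15 min, starting 9 min after dilution.
true_a, true_k = 1000.0, 0.002
times = [9 + 15 * i for i in range(20)]
heights = [true_a * math.exp(-true_k * t) for t in times]

amplitude, rate = fit_exponential_decay(times, heights)
half_life_min = math.log(2) / rate
print(amplitude, rate, half_life_min)
```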
Coordinates and data files have been deposited to the Protein Data Bank with accession codes 5TGW (apo-PS1) and 5TGY (holo-PS1) and to the BMRB (chemical shifts) with codes 30185 (apo-PS1) and 30186 (holo-PS1).
References cited in Example 1. 1. Roy, S. et al. A protein designed by binary patterning of polar and nonpolar amino acids displays native-like properties. J. Am. Chem. Soc. 119, 5302-5306 (1997). 2. Kuhlman, B. et al. Design of a novel globular protein fold with atomic-level accuracy. Science 302, 1364-1368 (2003). 3. Nanda, V. & Koder, R. L. Designing artificial enzymes by intuition and computation. Nat. Chem. 2, 15-24 (2010). 4. Peacock, A. F. A. Incorporating metals into de novo proteins. Curr. Opin. Chem. Biol. 17, 934-939 (2013). 5. Huang, P.-S. et al. High thermodynamic stability of parametrically designed helical bundles. Science 346, 481 (2014). 6. Thomson, A. R. et al. Computational design of water-soluble α-helical barrels. Science 346, 485 (2014). 7. Woolfson, D. N. et al. De novo protein design: How do we expand into the universe of possible protein structures? Curr. Opin. Struct. Biol. 33, 16-26 (2015). 8. Mocny, C. S. & Pecoraro, V. L. De novo protein design as a methodology for synthetic bioinorganic chemistry. Acc. Chem. Res. 48, 2388-2396 (2015). 9. Ulas, G., Lemmin, T., Wu, Y., Gassner, G. T. & DeGrado, W. F. Designed metalloprotein stabilizes a semiquinone radical. Nat. Chem. 8, 354-359 (2016). 10. Olson, T. L. et al. Design of dinuclear manganese cofactors for bacterial reaction centers. Biochim. Biophys. Acta: Bioenergetics 1857, 539-547 (2016). 11. Huang, P.-S., Boyken, S. E. & Baker, D. The coming of age of de novo protein design. Nature 537, 320-327 (2016). 12. Brunette, T. J. et al. Exploring the repeat protein universe through computational protein design. Nature 528, 580-584 (2015). 13. Bollen, Y. J. M., Westphal, A. H., Lindhoud, S., van Berkel, W. J. H. & van Mierlo, C. P. M. Distant residues mediate picomolar binding affinity of a protein cofactor. Nat. Comm. 3, 1010 (2012). 14. Sela-Culang, I., Kunik, V. & Ofran, Y. The structural basis of antibody-antigen recognition. Front. Immunol. 4 (2013). 15. 
van den Bedem, H., Bhabha, G., Yang, K., Wright, P. E. & Fraser, J. S. Automated identification of functional dynamic contact networks from X-ray crystallography. Nat. Methods 10, 896-902 (2013). 16. Koulechova, D. A., Tripp, K. W., Horner, G. & Marqusee, S. When the scaffold cannot be ignored: The role of the hydrophobic core in ligand binding and specificity. J. Mol. Biol. 427, 3316-3326 (2015). 17. Reedy, C. J. & Gibney, B. R. Heme protein assemblies. Chem. Rev. 104, 617-650 (2004). 18. Tinberg, C. E. et al. Computational design of ligand-binding proteins with high affinity and selectivity. Nature 501, 212-216 (2013). 19. Prier, C. K. & Arnold, F. H. Chemomimetic biocatalysis: Exploiting the synthetic potential of cofactor-dependent enzymes to create new catalysts. J. Am. Chem. Soc. 137, 13992-14006 (2015). 20. Farid, T. A. et al. Elementary tetrahelical protein design for diverse oxidoreductase functions. Nat. Chem. Biol. 9, 826-833 (2013). 21. Skalicky, J. J. et al. Solution structure of a designed four-α-helix bundle maquette scaffold. J. Am. Chem. Soc. 121, 4941-4951 (1999). 22. Huang, S. S., Koder, R. L., Lewis, M., Wand, A. J. & Dutton, P. L. The HP-1 maquette: From an apoprotein structure to a structured hemoprotein designed to promote redox-coupled proton exchange. Proc. Natl. Acad. Sci. USA 101, 5536-5541 (2004). 23. Lombardi, A., Nastri, F. & Pavone, V. Peptide-based heme-protein models. Chem. Rev. 101, 3165-3190 (2001). 24. Huang, S. S., Gibney, B. R., Stayrook, S. E., Leslie Dutton, P. & Lewis, M. X-ray structure of a maquette scaffold. J. Mol. Biol. 326, 1219-1225 (2003). 25. Gibney, B. R., Rabanal, F., Skalicky, J. J., Wand, A. J. & Dutton, P. L. Iterative protein redesign. J. Am. Chem. Soc. 121, 4952-4960 (1999). 26. Bender, G. M. et al. De novo design of a single-chain diphenylporphyrin metalloprotein. J. Am. Chem. Soc. 129, 10732-10740 (2007). 27. Fry, H. C., Lehmann, A., Saven, J. G., DeGrado, W. F. & Therien, M. J. 
Computational design and elaboration of a de novo heterotetrameric alpha-helical protein that selectively binds an emissive abiological (porphinato)zinc chromophore. J. Am. Chem. Soc. 132, 3997-4005 (2010). 28. Fry, H. C. et al. Computational de novo design and characterization of a protein that selectively binds a highly hyperpolarizable abiological chromophore. J. Am. Chem. Soc. 135, 13914-13926 (2013). 29. Solomon, L. A., Kodali, G., Moser, C. C. & Dutton, P. L. Engineering the assembly of heme cofactors in man-made proteins. J. Am. Chem. Soc. 136, 3192-3199 (2014). 30. Ghirlanda, G. et al. De novo design of a D2-symmetrical protein that reproduces the diheme four-helix bundle in cytochrome bc1. J. Am. Chem. Soc. 126, 8141-8147 (2004). 31. North, B., Summa, C. M., Ghirlanda, G. & DeGrado, W. F. Dn-symmetrical tertiary templates for the design of tubular proteins. J. Mol. Biol. 311, 1081-1090 (2001). 32. Lahr, S. J. et al. Analysis and design of turns in α-helical hairpins. J. Mol. Biol. 346, 1441-1454 (2005). 33. Goll, J. G., Moore, K. T., Ghosh, A. & Therien, M. J. Synthesis, structure, electronic spectroscopy, photophysics, electrochemistry, and x-ray photoelectron spectroscopy of highly-electron-deficient [5,10,15,20-tetrakis(perfluoroalkyl)porphinato]zinc(II) complexes and their free base derivatives. J. Am. Chem. Soc. 118, 8344-8354 (1996). 34. Lubitz, W., Lendzian, F. & Bittl, R. Radicals, radical pairs and triplet states in photosynthesis. Acc. Chem. Res. 35, 313-320 (2002). 35. Kaufmann, K. W., Lemmon, G. H., DeLuca, S. L., Sheehan, J. H. & Meiler, J. Practically useful: What the Rosetta protein modeling suite can do for you. Biochemistry 49, 2987-2998 (2010). 36. Moore, K. T., Horvath, I. T. & Therien, M. J. Mechanistic studies of (porphinato)iron-catalyzed isobutane oxidation. Comparative studies of three classes of electron-deficient porphyrin catalysts. Inorg. Chem. 39, 3125-3139 (2000). 37. Gentemann, S. et al. 
Variations and temperature dependence of the excited state properties of conformationally and electronically perturbed zinc and free base porphyrins. J. Phys. Chem. B 101, 1247-1254 (1997). 38. Bradley, P., Misura, K. M. S. & Baker, D. Toward high-resolution de novo structure prediction for small proteins. Science 309, 1868 (2005). 39. Tinberg, C. E. & Khare, S. D. in Computational Design of Ligand Binding Proteins (ed Barry L. Stoddard) 155-171 (Springer New York, 2016). 40. Choma, C. T. et al. Design of a heme-binding four-helix bundle. J. Am. Chem. Soc. 116, 856-865 (1994). 41. Zimmerman, D. E. et al. Automated analysis of protein NMR assignments using methods from artificial intelligence. J. Mol. Biol. 269, 592-610 (1997). 42. Delaglio, F. et al. NMRPipe: a multidimensional spectral processing system based on UNIX pipes. J. Biomol. NMR 6, 277-293 (1995). 43. Bartels, C., Xia, T.-h., Billeter, M., Guntert, P. & Wüthrich, K. The program XEASY for computer-supported NMR spectral analysis of biological macromolecules. J. Biomol. NMR 6, 1-10 (1995). 44. Cornilescu, G., Delaglio, F. & Bax, A. Protein backbone angle restraints from searching a database for chemical shift and sequence homology. J. Biomol. NMR 13, 289-302 (1999). 45. Neri, D., Szyperski, T., Otting, G., Senn, H. & Wuethrich, K. Stereospecific nuclear magnetic resonance assignments of the methyl groups of valine and leucine in the DNA-binding domain of the 434 repressor by biosynthetically directed fractional carbon-13 labeling. Biochemistry 28, 7510-7516 (1989). 46. Guntert, P., Mumenthaler, C. & Wüthrich, K. Torsion angle dynamics for NMR structure calculation with the new program DYANA. J. Mol. Biol. 273, 283-298 (1997). 47. Herrmann, T., Guntert, P. & Wüthrich, K. Protein NMR structure determination with automated NOE assignment using the new software CANDID and the torsion angle dynamics algorithm DYANA. J. Mol. Biol. 319, 209-227 (2002). 48. Schwieters, C. D., Kuszewski, J. J., Tjandra, N. 
& Marius Clore, G. The Xplor-NIH NMR molecular structure determination package. J. Magn. Reson. 160, 65-73 (2003). 49. Bagaria, A., Jaravine, V., Huang, Y. J., Montelione, G. T. & Guntert, P. Protein structure validation by generalized linear model root-mean-square deviation prediction. Protein Sci. 21, 229-238 (2012).
PS1 Design Process.
The design of PS1 began with a D2-symmetrical parameterized backbone of a 4-helix bundle (Tables S1 and S2)1. We have previously used this backbone parameterization to create a D2-symmetrical diheme-binding tetrameric 4-helix bundle, PATET, which was composed of 4 copies of a 25 residue helix containing the requisite metal-coordinating His and second shell H-bonding Thr residues placed at d and b positions in a heptad repeat, respectively2. This tetramer bound two hemes with a bis-His ligation in a D2-symmetrical bundle. Asymmetry of the sequence was later introduced in a single chain diporphyrin-binding design, PASC (
Flexible Backbone Sequence Design.
We wrote a RosettaScript for flexible backbone sequence design, implemented in Rosetta 3.5, that proceeds through a cycle of backbone/sidechain relaxations and fixed backbone design, with a filtering step based on core packing (RosettaScript provided below). Details of the process are provided in the subsections herein.
Amino Acids Allowed to Vary in the Design.
Because (CF3)4PZn could potentially act as a photo-oxidant, we disallowed any potentially oxidizable amino acids in the sequence (e.g., Tyr, Cys, Met, Trp, His) other than the single His and Trp residues described below. The initial residue identities of the bundle were chosen from a previous computationally designed 4-helix bundle, SCRPZ-2 (ref. 5), with a few changes; for example, positions occupied by surface-exposed Tyr residues in the SCRPZ-2 sequence were constrained to be polar or charged during the computational sequence design in Rosetta. The entire core (40 residues in total) of SCRPZ-2 was allowed to vary during the design process, except for His46 and Thr9, which provide keystone interactions dedicated to Zn coordination of the porphyrin used in previous designs (see
Selection of Residue 68 as Trp.
A motivation for this work is to position aromatic side chains in a precise position relative to a photo-excitable cofactor to initiate proton-coupled electron transfer. We asked whether a Trp residue could be held in precise juxtaposition relative to the (CF3)4PZn cofactor, as a prelude to future studies in which “proton wires” are introduced to facilitate proton transfer concomitant with electron transfer from Trp to the photoexcited state of (CF3)4PZn. A Trp residue in the protein interior also serves as an absorption handle, as well as a fluorescent indicator of hydrophobic packing.
To select the sequence position of the single Trp residue, we used the Rosetta Backrub program6,7 to create an ensemble of backbones that were relaxed around the (CF3)4PZn cofactor, after the cofactor was docked in the porphyrin binding region of the SCRPZ-2 model, with an orientation described by CF3 groups pointing down the long axis of the bundle. No sequence design was performed to generate this backbone ensemble. Next, we performed fixed backbone sequence design on each member of the backbone ensemble, allowing Trp at all core residues, to determine a probable location of Trp within the protein interior, based on the frequency of occurrence within the designed sequences. Based on this information, we constrained residue 68 to be Trp during the flexible backbone design process below.
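The frequency-based choice of the Trp position can be sketched as follows. This Python example only illustrates the counting step; the toy sequences and core positions are hypothetical, and the actual design outputs are not reproduced here.

```python
from collections import Counter


def trp_frequency(designs, core_positions):
    """Fraction of designed sequences with Trp at each core position (1-based)."""
    counts = Counter()
    for seq in designs:
        for pos in core_positions:
            if seq[pos - 1] == "W":
                counts[pos] += 1
    return {pos: counts[pos] / len(designs) for pos in core_positions}


# Hypothetical output of fixed-backbone design on four ensemble members.
designs = ["ALWLA", "ALWLA", "AWLLA", "ALLLA"]
core = [2, 3, 4]

freq = trp_frequency(designs, core)
best_position = max(freq, key=freq.get)
print(freq, best_position)
```

In the actual protocol, the analogous tally over the designed sequences for the backbone ensemble pointed to residue 68.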
Flexible Backbone Design Protocol.
Flexible backbone design utilized angle and distance constraints between the Zn and His to restrict the design space to those consistent with the DFT-optimized imidazole-Zn distance of 2.0 Å. We used an energy term (hack_aro=1) that models quadrupolar interactions between aromatic side chains in every stage of the flexible backbone design protocol. We also employed an energy term (rg=2) that penalizes bundles with a large radius of gyration (rg). We noticed a propensity within Rosetta to output bundles that received good packing scores (via Packstat or Rosetta Holes) but displayed helices separated by large distances (large rg). The packing algorithms could not differentiate between interior or exterior when the helix-helix interfaces were very wide, and often inappropriately gave good packing scores when the designed bundle was qualitatively poorly packed. The inclusion of the rg term, as well as employing Rosetta Backrub, ameliorated this issue.
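To make the quantity penalized by the rg term concrete, the sketch below computes an unweighted radius of gyration from a list of atomic coordinates. It is an illustration only: the choice of atoms (for example, Cα atoms) and the absence of mass weighting are assumptions, and the exact definition used inside Rosetta is not given in the text.

```python
import math


def radius_of_gyration(coords):
    """Root-mean-square distance of points (x, y, z) from their centroid."""
    n = len(coords)
    centroid = [sum(c[i] for c in coords) / n for i in range(3)]
    mean_sq = sum(
        sum((c[i] - centroid[i]) ** 2 for i in range(3)) for c in coords
    ) / n
    return math.sqrt(mean_sq)


# Toy "bundle": four points around an axis, then the same points pushed apart.
compact = [(1.0, 0.0, 0.0), (-1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, -1.0, 0.0)]
splayed = [(2.0 * x, 2.0 * y, 2.0 * z) for x, y, z in compact]

print(radius_of_gyration(compact), radius_of_gyration(splayed))
```

Doubling the helix separation in this toy case doubles the radius of gyration, which is the kind of expansion such a penalty term discourages.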
The flexible backbone design protocol was as follows: Distance and angle constraints between His and Zn were loaded, the model was repacked without mutations, the backbone was relaxed via Rosetta Backrub, three trials of a Monte Carlo flexible backbone design sub-protocol (see below) were performed, and models with native protein-like packing (i.e., a Rosetta PackStat score≥0.58) were output. The PackStat score was calculated 3 times per trial to account for its stochastic behavior. 170 designs were output from 500 runs through the protocol (Fig. S1). We analyzed these 170 models for packing, rg, energy, and rotamer state probability within Matlab to select PS1 for expression.
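The packing filter, with its repeated evaluation of a stochastic score, can be sketched in Python as below. The callable `score_model` is a hypothetical stand-in for a PackStat evaluation and is not a Rosetta API; the sketch uses the ≥0.58 cutoff stated above and passes a model if any of three evaluations meets it.

```python
PACKSTAT_CUTOFF = 0.58


def passes_packing_filter(score_model, model, repeats=3, cutoff=PACKSTAT_CUTOFF):
    """Return True if any of `repeats` stochastic packing scores meets the cutoff."""
    return any(score_model(model) >= cutoff for _ in range(repeats))


def scripted_scorer(values):
    """Hypothetical scorer that replays fixed values, to mimic a stochastic score."""
    scores = iter(values)
    return lambda model: next(scores)


accepted = passes_packing_filter(scripted_scorer([0.50, 0.60, 0.40]), model=None)
rejected = passes_packing_filter(scripted_scorer([0.50, 0.57, 0.55]), model=None)

# Overall pass rate reported in the text: 170 designs from 500 runs.
pass_rate = 170 / 500
print(accepted, rejected, pass_rate)
```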
Flexible Backbone Design Sub-Protocol.
The flexible backbone design sub-protocol consists of 3 Monte Carlo trials of (i) fixed backbone design with soft weights (decreased vdW interactions, i.e., soft_rep_design weights within Rosetta), (ii) sidechain minimization via MinMover, (iii) fixed backbone design with Score13 weights, where the electrostatic term (fa_pair) is replaced by hack_elec (hack_elec=0.55), and the addition of extra rotamer sampling around χ1 (ex1, level 3, i.e., sampled between 2 std of the mean chi angle value for each rotamer) and χ2 (ex2, level 3) sidechain dihedrals, (iv) backbone minimization via MinMover, (v) repetition of step iii (due to propensities of Rosetta to design a particular sequence to a particular backbone). At the end of step (v), the model is filtered for native structure-like packing via PackStat (If 1 of 3 trials of PackStat score is >0.58, the model passes the filter.). In all energy functions for flexible backbone design, hack_aro is set to 1 and rg is set to 2. The final, designed sequence (PS1) selected for protein expression was the following 108 amino acids:
Ab initio folding.
Rosetta ab initio folding8 was performed on the PS1 sequence in Rosetta 3.5. Cα RMSD of the folded core was scored against residues 14-23, 32-42, 69-79, and 87-97 of the design model. Cα RMSD of the binding region was scored against residues 5-13, 43-50, 61-68, and 98-105 of the design model.
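The region-restricted RMSD can be illustrated with a short Python sketch. It assumes that the folded model and the design model are already superimposed (no optimal superposition is performed here) and that Cα coordinates are supplied as dictionaries keyed by residue number; both are assumptions made for illustration, and the coordinates in the example are synthetic.

```python
import math

# Residue ranges (inclusive) given in the text.
CORE_RANGES = [(14, 23), (32, 42), (69, 79), (87, 97)]
BINDING_RANGES = [(5, 13), (43, 50), (61, 68), (98, 105)]


def residues_in(ranges):
    """Expand inclusive (start, end) ranges into a flat list of residue numbers."""
    return [r for start, end in ranges for r in range(start, end + 1)]


def ca_rmsd(model_ca, reference_ca, ranges):
    """RMSD over the selected residues' C-alpha atoms, without superposition."""
    selected = residues_in(ranges)
    total = sum(math.dist(model_ca[r], reference_ca[r]) ** 2 for r in selected)
    return math.sqrt(total / len(selected))


# Synthetic check: a reference trace and a copy shifted by 1 Å along x.
reference = {r: (float(r), 0.0, 0.0) for r in range(1, 110)}
shifted = {r: (x + 1.0, y, z) for r, (x, y, z) in reference.items()}

print(len(residues_in(CORE_RANGES)), len(residues_in(BINDING_RANGES)))
print(ca_rmsd(shifted, reference, CORE_RANGES))
```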
Porphyrin Binding Titration to Determine KD.
(CF3)4PZn (2 μM) was solubilized in a 1 mL solution of 50 mM NaPi, 100 mM NaCl, pH 7.5 buffer by inclusion of 1% w/v octyl-β-D-glucopyranoside. Aliquots of 2 μL of a 102 μM stock of apo-PS1 (0.2 μM per aliquot) were titrated into the 1 mL solution containing the porphyrin, and an electronic absorption spectrum was measured after each addition until >2.5 equivalents of protein had been added. Absorbance changes at 423 nm, due to His-Zn coordination-induced spectral shifts of the porphyrin, were fit to a single-site, protein-ligand binding model.
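A single-site binding model of the kind named above is commonly written as the exact quadratic solution for the 1:1 complex, which the following Python sketch evaluates. The sketch neglects the small dilution from each 2 μL addition, and the KD value and the amplitude `delta_max` in the example are illustrative assumptions rather than fitted results from the text.

```python
import math


def bound_complex(protein_total, ligand_total, kd):
    """Concentration of 1:1 complex from the quadratic solution (same units throughout)."""
    b = protein_total + ligand_total + kd
    return (b - math.sqrt(b * b - 4.0 * protein_total * ligand_total)) / 2.0


def absorbance_change(protein_total, ligand_total, kd, delta_max):
    """Absorbance change, scaled so delta_max corresponds to fully bound ligand."""
    return delta_max * bound_complex(protein_total, ligand_total, kd) / ligand_total


# Titration of 2 uM porphyrin with 0.2 uM protein aliquots, hypothetical KD = 0.1 uM.
ligand_um = 2.0
for n_aliquots in (5, 10, 15, 25):
    protein_um = 0.2 * n_aliquots
    print(protein_um, round(absorbance_change(protein_um, ligand_um, 0.1, 1.0), 3))
```

Fitting would vary `kd` and `delta_max` to minimize the difference between these predicted values and the measured absorbance changes at 423 nm.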
Analytical Ultracentrifugation (AUC).
The oligomeric states of apo- and holo-PS1 were determined by analytical equilibrium sedimentation performed at 25° C. using a Beckman XL-I analytical ultracentrifuge. Ultracentrifugation was conducted at speeds of 25K, 30K, 35K, 40K and 45K r.p.m., and the radial gradient profiles were obtained by absorbance at 280 nm. A 200 μM solution of the apo- and a 100 μM solution of the holo-protein were prepared in 50 mM NaPi pH 7.5, 100 mM NaCl (apo) and 20 mM NaPi pH 7.5, 125 mM NaCl (holo). Data were globally fit to a single-species model of equilibrium sedimentation by a nonlinear least-squares method using IGOR Pro (Wavemetrics).
Size Exclusion Chromatography.
Gel filtration profiles were obtained using a Superdex 75 5/150 column on an FPLC system (GE Healthcare AKTA). To evaluate the oligomeric state, 20 μL of 100 μM apo-PS1 or 37 μM holo-PS1 was injected onto the column and eluted with a 50 mM phosphate, 150 mM NaCl, pH 7.0 buffer mobile phase at a flow rate of 0.4 mL/min. The approximate molecular weight (MWapp) was calculated from a standard curve obtained with the GE LMW standard protein kit. From this curve, MWapp of the apo is 19.5 kD and that of holo is 17.9 kD. These 13 kD proteins elute at higher MWapp due to their large negative surface charge (q=−12). For apo-PS1, a small dimer peak elutes at MWapp of 44.1 kD, and a smaller tetramer (or pentamer) peak at 103.2 kD.
Circular dichroism (CD).
CD spectra were collected on a Jasco J-810 CD spectrometer in a 0.1 cm path length quartz cuvette, using temperature/wavelength mode. Spectra were collected from 20 to 95° C. at 5° C. intervals with a heating rate of 1° C./minute, over a wavelength range from 215 to 250 nm. Apo- and holo-PS1 were prepared at 10 μM and 6.6 μM, respectively, in 50 mM NaPi pH 7.5, 100 mM NaCl buffer. Temperature melts of apo-PS1 were also performed at varying concentrations of guanidine HCl denaturant (0, 1, 2, 3, 4, 5, 5.85, and 7 M).
Steady-State Electronic Absorption and Emission Spectroscopy.
Electronic absorption spectra were collected using a Shimadzu UV-1700 UV-Vis spectrophotometer or a Cary 5000 spectrophotometer. Steady-state emission spectra were obtained on an FLS920P spectrophotometer (Edinburgh Instruments Ltd., Livingston, UK) in 1 cm quartz optical cells. The steady-state emission spectra were corrected using the correction factor generated by the manufacturer.
Pump-Probe Transient Absorption Spectroscopy.
Ultrafast transient absorption spectra were obtained using standard pump-probe methods9 with a time resolution of approximately 200 fs. Elevated temperature experiments were performed in a custom-made temperature block of anodized aluminum, the temperature of which was controlled by heating rods and monitored by a pair of thermocouples wired to a PID through a solid-state relay. Following pump-probe transient absorption experiments, electronic absorption spectra verified that the samples were robust.
Cofactor (e.g., Ligand) Geometry Optimization.
The geometry of (CF3)4PZn was optimized via density functional theory using the B3LYP functional and 6-31G* basis set implemented in Gaussian 03. The starting geometry was obtained from the crystal structure of the related meso-heptafluoropropyl(porphinato)Zn(II), with the fluoropropyl groups truncated to fluoromethyl10. Meso-heptafluoropropyl(porphinato)Zn(II) co-crystallized with an axially ligating pyridine; imidazole was computationally substituted for pyridine for the geometry optimization of (CF3)4PZn.
Visualization of Protein Structures and Image Rendering.
Protein models were visualized and rendered in the PyMOL visualization program11.
Protein Expression and Purification.
The gene coding for the protein sequence of PS1 was ordered from GenScript and cloned into the IPTG-inducible pET-11a plasmid (cloning site NdeI-BamHI). The construct also coded for an N-terminal 6×His-tag followed by a TEV protease cleavage sequence, followed finally by the designed sequence. The cloned gene sequence is:
The expressed protein sequence was ultimately:
where the “/” defines the cleavage site of TEV protease. The plasmids were transformed into E. coli BL21(DE3) cells, which were grown in LB/ampicillin media (or, for NMR samples, M9 minimal media with isotope-labeled ammonia and glucose from Cambridge Isotopes) until OD600 = 0.6. The cells were then induced with IPTG and allowed to grow for 4 more hours. Cells were then centrifuged and frozen. The frozen cell pellets were lysed in a French press in the Duke University Biology Department. The expressed, His-tagged PS1 protein was purified via a Ni NTA column (Invitrogen) and confirmed by gel electrophoresis. The buffer was exchanged to the Sigma-recommended TEV protease buffer (5 mM DTT, 50 mM Tris, 0.5 mM EDTA, pH 8.0), and the PS1/TEV solution (His-tagged TEV protease was ordered from Sigma) was allowed to rock for 1 day at room temperature. The resulting His-tag-free PS1 protein was collected from the flow-through of a Ni NTA column and concentrated in a stock of 50 mM NaPi, 100 mM NaCl, pH 7.5 buffer, with an approximate yield of 40 mg/L. PS2 was expressed and purified in the same manner.
Design of PS2.
To explore the need for a second-shell hydrogen bond to the Trp indole of W68, we designed a second sequence, PS2. Computational evaluation of positions where a second-shell polar residue could be introduced showed that a Ser at position 94 could form the desired hydrogen bond. This residue is Leu in PS1, so introducing a small Ser at this position would lead to a local defect in the packing if this change were made directly into PS1. Thus, the entire core was redesigned using the original procedure, but this time requiring Ser and Trp at positions 94 and 68, respectively. The core of PS2 shares only 55 percent identity with PS1, as shown in the aligned sequences below. (The solvent-exposed amino acids are identical between PS1 and PS2, as per the design method, which only explicitly considers the protein core.)
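Core identity between two designs can be computed by comparing only the designated core positions of the aligned sequences, as in the Python sketch below. The toy sequences and positions are hypothetical. If the comparison is made over the 40 core positions mentioned in the flexible backbone design description, which is an assumption about how the figure was computed, 55 percent would correspond to 22 identical positions.

```python
def percent_identity(seq_a, seq_b, positions):
    """Percent of the given 1-based positions at which two aligned sequences match."""
    matches = sum(seq_a[p - 1] == seq_b[p - 1] for p in positions)
    return 100.0 * matches / len(positions)


# Hypothetical aligned fragments and core positions, for illustration only.
toy_identity = percent_identity("ALWLA", "AIWVA", [1, 2, 3, 4])

# Arithmetic behind the reported value, assuming a 40-residue core.
identical_core_positions = 22
core_identity = 100.0 * identical_core_positions / 40
print(toy_identity, core_identity)
```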
PS2 was expressed with the same His-tag as PS1, and cleaved and purified using the same methods. Binding of (CF3)4PZn to PS2 was carried out using the same method as for PS1. We found that PS2 bound (CF3)4PZn in a homogeneous environment, indicated by the narrow electronic absorption bands of the porphyrin in PS2, nearly indistinguishable from that in PS1 (
Cofactor Synthesis.
The cofactor [5,10,15,20-tetrakis(trifluoromethyl)porphinato]zinc(II), abbreviated as (CF3)4PZn in the main text, was synthesized as previously reported, and was confirmed by NMR and electronic absorption spectra. The same applies to (CF3)4PFe.
Clustering of Apo-PS1 NMR Models.
We implemented a greedy clustering algorithm in Matlab to form clusters within the family of structures of apo-PS1 (Extended Data
Molecular Dynamics Simulations.
The lowest-energy NMR structure of apo-PS1, which is the centroid of the closed conformation, was used as the starting conformation for the molecular dynamics simulation. The structure was solvated in a water box with 17 Å padding and neutralized by the addition of 12 Na+ counter ions. The AMBER 14SB force field was used for the parameterization of the protein. TIP3P water parameterization was used to describe the water molecules12.
The molecular dynamics simulation was carried out using ACEMD13. The system was minimized for 2000 steps, followed by equilibration in the NPT ensemble for 10 ns at 1 atm using a time-step of 2 fs. We also used rigid bonds, a cutoff of 9 Å, and PME for long-range electrostatics. Following this relaxation phase, the protein was allowed to move freely and was simulated in the NVT ensemble in ACEMD with a Langevin thermostat. To achieve a time-step of 4 fs, we used damping at 0.1 ps⁻¹ and a hydrogen mass repartitioning scheme. The simulation was carried out to 1 μs at 298 K.
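For orientation, the run lengths above translate into integration step counts as in the following Python sketch. It assumes a production time-step of 4 fs, the value typically enabled by hydrogen mass repartitioning; the function name is illustrative and is not part of ACEMD.

```python
def n_steps(duration_ns, timestep_fs):
    """Number of integration steps for a run of the given length (1 ns = 1e6 fs)."""
    return round(duration_ns * 1e6 / timestep_fs)


equilibration_steps = n_steps(10, 2)   # 10 ns NPT equilibration at 2 fs
production_steps = n_steps(1000, 4)    # 1 us NVT production at an assumed 4 fs
print(equilibration_steps, production_steps)
```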
SOCKET Server for Assessment of Knobs-into-Holes Packing.
PDB files of the PS1 design model, holo-PS1 centroid, and apo-PS1 open/closed centroids were individually uploaded to and analyzed by the SOCKET server14 for knobs-into-holes side chain packing (see Section 4). A helical residue was defined as a knob if its side chain was within 8 Å of 4 other side chains from residues on an adjacent helix (a hole). Output from the SOCKET server for each of these PDB files is displayed below showing the residues of each knob and hole. Note that the residue number of the PS1 design model is off register by 1 amino acid from the structural sequences, due to the presence of the N-terminal Ser residue from TEV cleavage of the expressed proteins.
The computational method described here is capable of producing proteins that noncovalently bind ligands in vivo. We have observed loading of endogenous heme in a PS1 variant, where 7 terminal residues near the ligand-binding site were deleted to allow incorporation of a heme ligand with its bulky, charged propionate functional groups (
aResidues are numbered according to the expressed 109-residue PS1 protein.
aResidues 5-26, 29-52, 58-81, 84-106.
bexcluding the N-terminal NH3+
cexcluding Lys NH3+, Arg NH2, OH, side chain 13CO and aromatic 13Cγ
Input files and command lines for design calculations.
Command lines and flags for generating the backbone ensemble via Rosetta backrub
Flags
Command Line
Command lines, RosettaScript, and flags for the flexible backbone sequence design protocol.
RosettaScript
Contents of Constraint File (my_atomic.cst):
Flags
Command Line Input
Contents of the Residue File (resfile.txt):
Contents of (CF3)4PZn parameters file (PZNF.params):
1. North, B., Summa, C. M., Ghirlanda, G. & DeGrado, W. F. Dn-symmetrical tertiary templates for the design of tubular proteins. J. Mol. Biol. 311, 1081-1090 (2001). 2. Ghirlanda, G. et al. De novo design of a D2-symmetrical protein that reproduces the diheme four-helix bundle in cytochrome bc1. J. Am. Chem. Soc. 126, 8141-8147 (2004). 3. Lahr, S. J. et al. Analysis and design of turns in α-helical hairpins. J. Mol. Biol. 346, 1441-1454 (2005). 4. Bender, G. M. et al. De novo design of a single-chain diphenylporphyrin metalloprotein. J. Am. Chem. Soc. 129, 10732-10740 (2007). 5. Fry, H. C. et al. Computational de novo design and characterization of a protein that selectively binds a highly hyperpolarizable abiological chromophore. J. Am. Chem. Soc. 135, 13914-13926 (2013). 6. Davis, I. W., Arendall Iii, W. B., Richardson, D. C. & Richardson, J. S. The backrub motion: How protein backbone shrugs when a sidechain dances. Structure 14, 265-274 (2006). 7. Friedland, G. D., Lakomek, N.-A., Griesinger, C., Meiler, J. & Kortemme, T. A correspondence between solution-state dynamics of an individual protein and the sequence and conformational diversity of its family. PLoS Comput Biol 5, e1000393 (2009). 8. Bradley, P., Misura, K. M. S. & Baker, D. Toward high-resolution de novo structure prediction for small proteins. Science 309, 1868 (2005). 9. Polizzi, N. F. et al. Photoinduced Electron Transfer Elicits a Change in the Static Dielectric Constant of a de Novo Designed Protein. J. Am. Chem. Soc. 138, 2130-2133 (2016). 10. Goll, J. G., Moore, K. T., Ghosh, A. & Therien, M. J. Synthesis, structure, electronic spectroscopy, photophysics, electrochemistry, and x-ray photoelectron spectroscopy of highly-electron-deficient [5,10,15,20-tetrakis(perfluoroalkyl)porphinato]zinc(II) complexes and their free base derivatives. J. Am. Chem. Soc. 118, 8344-8354 (1996). 11. Schrodinger, LLC. The PyMOL Molecular Graphics System, Version 1.8. (2015). 12. Jorgensen, W. 
L., Chandrasekhar, J., Madura, J. D., Impey, R. W. & Klein, M. L. Comparison of simple potential functions for simulating liquid water. J. Chem. Phys. 79, 926-935 (1983). 13. Harvey, M. J., Giupponi, G. & Fabritiis, G. D. ACEMD: Accelerating Biomolecular Dynamics in the Microsecond Time Scale. J. Chem. Theory Comput. 5, 1632-1639 (2009). 14. Walshaw, J. & Woolfson, D. N. SOCKET: a program for identifying and analysing coiled-coil motifs within protein structures. J. Mol. Biol. 307, 1427-1450 (2001). 15. Hayes, D., Laue, T. & Philo, J. Program Sednterp: sedimentation interpretation program. Durham, N.H.: University of New Hampshire (1995). 16. Moore, K. T., Fletcher, J. T. & Therien, M. J. Syntheses, NMR and EPR Spectroscopy, Electrochemical Properties, and Structural Studies of [5,10,15,20-Tetrakis(perfluoroalkyl)porphinato]iron(II) and -iron(III) Complexes. J. Am. Chem. Soc. 121, 5196-5209 (1999). 17. Grigoryan, G. & DeGrado, W. F. Probing designability via a generalized model of helical bundle geometry. J. Mol. Biol. 405, 1079-1100 (2011).
This application claims the benefit of U.S. Provisional Application No. 62/537,774, filed on Jul. 27, 2017, which is incorporated herein by reference in its entirety and for all purposes.
This invention was made with Government support under grant numbers GM-54616 and GM-071628 awarded by The National Institutes of Health, and grant numbers CHE-1413333, CHE-1413295 and DMR-1120901 awarded by The National Science Foundation. The Government has certain rights in the invention.
Filing Document | Filing Date | Country | Kind |
---|---|---|---|
PCT/US2018/044195 | 7/27/2018 | WO | 00 |
Number | Date | Country | |
---|---|---|---|
62537774 | Jul 2017 | US |