The present invention relates in general to the field of methods of identifying peptides and peptide-like molecules with desirable properties (including affinity, specificity, selectivity, in vitro and in vivo availability and viability, and others), and more particularly, to novel, short peptides or peptide-like molecules which have a high probability of binding to and/or otherwise modulating the function of polypeptides, proteins, or DNA, and to methods for designing and validating the function of such peptides or peptide-like molecules.
The instant application contains a Sequence Listing which has been submitted in ASCII format via EFS-Web and is hereby incorporated by reference in its entirety. Said ASCII copy, created on May 11, 2012, is named LYNN1000.txt and is 107,967 bytes in size.
Without limiting the scope of the invention, its background is described in connection with methods for identifying and selecting peptides or peptide-like molecules capable of binding to and/or otherwise modulating the function of cells and their components.
Future identification and understanding of disease etiology, progression, therapy choice, and therapy response will require the ability to sensitively monitor vast panels (10s-100s) of protein biomarkers in parallel. Significant progress in proteome characterization has been made possible through technological advancements in protein production methods, mass spectrometry, and microfluidics and other protein separation methods. High affinity probe molecules able to specifically recognize biomarker proteins present at low concentrations in complex biological samples are critically important and desirable.
Traditionally, antibodies have satisfied the demand for high affinity ligands. However, their production and purification is time intensive and laborious. As information continues to accrue at a rapid rate, concerning both the composition of an organism's proteome and sequence information of new target proteins, rapid methods to generate affinity reagents are needed urgently in order to quickly identify, validate, and assay these new biomarker proteins. Likewise, methods offering the possibility of rapidly generating multiple high-affinity reagents against large numbers of proteins in the proteome are also desired.
Short high-affinity peptides (~8-15 amino acids) are attractive as ligand alternatives to antibodies. They can be made in large quantities, and purified much more cheaply and easily. Therefore, they are preferable to biologically produced molecules. They have been used extensively in high throughput protein identification techniques that involve random generation of large combinatorial peptide libraries. However, the drawback to these methods is the time and cost required to generate and to screen these vast libraries, given the enormous diversity of ligand chemistry, and the fact that no rational design is involved. Consequently, there is a substantial need for automated methods that can rapidly select and optimize high-affinity peptide ligands with highly selective binding properties.
Drug discovery is a protracted and expensive process, from hit-to-lead discovery to preclinical and clinical testing and FDA approval. The estimated time and cost for a single drug to make it to market, including developmental costs of all failed drugs, is approximately 12-15 years and $600-800 million. Protein-protein and protein-DNA interactions are fundamental to all biological functions. These biomolecular interactions play a key role in signal transduction and in regulating several cellular processes such as cell growth, differentiation, migration, adhesion, and cell death. It is understood that dysregulation of certain key protein interactions is central to the pathology of most diseases (e.g. p53-Mdm2, Bcl-XL-Bax, αvβ3 Integrin-MMP2 in cancer). Thus, modulating protein-protein and protein-DNA interactions has the promise of obtaining many novel therapeutic agents. In fact, the identification of small molecule, peptide, and peptidomimetic ligands that bind to proteins with high affinity and mediate their activity by affecting their complementary protein interactions has been the central motif of drug discovery approaches. The short high-affinity peptide ligands described hereinabove can also be used as selective markers (perhaps even against a particular protein disease modification), as a detection and diagnostic agent for a given protein or protein variant for disease diagnosis, forensics, etc.
Many approaches have been taken to identify drugs that disrupt key protein-mediated processes. In general, once a drug target has been identified, the protein is over-expressed, isolated, and purified, and used for high-throughput screening via in vitro assays; i.e., vast unfocused libraries of compounds (including peptides) are screened against the target to find a lead molecule that, for example, inhibits or modulates the function of the target protein. Several lead compounds are chosen and lead optimization then proceeds to complete the preclinical program, from which downselected candidates are carried into clinical trials. The most inefficient step in this process is the lead optimization stage, where fewer than 1 in 5000 candidates is found suitable for clinical investigation.
So-called fragment based design (FBD) screening strategies are also being used, where interactions of small drug-like fragments with the target protein are evaluated in high throughput, and selected hits are combined into a single, potent pharmacophore. Although the FBD approach samples a more diverse ligand space of smaller molecules than does screening synthetic compound libraries, and has produced successful hit-to-lead candidates, the methodology is still based on generating hits from an unfocused ligand space. A focused library of privileged structures, which would result from the present invention, could enhance the value and potency of this method for discovering new drugs.
Alternatively, structure based design is carried out. NMR or X-ray crystallography has been used to predict the molecular structure of the target protein. Both these analytical techniques also have been applied to high-throughput screening, vis-à-vis saturation-transfer NMR and high-throughput X-ray crystallography, to select candidates that perturb the protein active site. Despite advanced robotic techniques and modernization, this process is cumbersome. Finally, in silico methods are available for de novo prediction of the tertiary structures of proteins, when crystal structures are not available. Computational programs can carry out subsequent in silico docking studies that can predict compounds that would bind to a given protein structure. Owing to the complex nature of proteins and their flexible solution based structures, these methods are complex, and perform better when used for structure-based analog design from inhibitors identified initially by high-throughput screening.
In short, there are many approaches where the chemical and structural information of proteins has been described in detail but the principles and information cannot be easily applied to how to predict ligands to specific proteins.
Among the most successful approaches to finding ligands for modulating protein-protein interactions is by developing mimics of the interface amino acid residues for one of the binding partners. Several important cell signaling events are mediated by short unstructured peptide interactions with proteins (Pawson and Scott 1997). Because of the potent and specific effects of peptides on physiological responses of human and other species, peptides are of great interest as drug candidates. Since peptides typically interact with other peptides and proteins to produce their biological effects, there is a great deal of interest in being able to predict the interactions between a peptide and another protein. As a result, there is great interest in developing methods to predict sequences of peptides that will interact with a polypeptide protein target and produce a desired response.
There are several arguments to be made that lend merit to peptide-based drug design. Peptides are considered prototypical among so-called ‘privileged structures’ (which include oligonucleotides and bioactive natural compounds). Peptide molecules offer an incredible diversity not commonly found in small molecule libraries. Compared to conventional biomolecular drug compounds (e.g., antibodies, scFv), peptides have the benefits of being robust and easy to synthesize and purify in large quantities, and they are non-immunogenic (Dutton 2007). Theoretically, short peptide sequences of 10-20 amino acids can be selected to interact with DNA and virtually to any binding cleft on a protein, making them excellent leads for antagonist development. These clefts may be located at the active site where the peptide binding may disrupt or inhibit protein interaction. Alternatively, the peptide may modulate protein-protein interactions through allosteric binding effects (Allegretti, Bertini et al. 2008). Multiple DNA-binding proteins are known to be key players in transcriptional and epigenetic signaling, and short, high-affinity peptide based molecules are key leads for binding and modulating critical gene functions. Since peptide binding affinities to proteins can span a considerable range (μM to sub-nM), the response to binding may vary considerably, from perturbation to complete inhibition, which offers latitude in their choice for specific applications (including as affinity molecules for high-throughput biomarker validation and assays).
Undesirable biological properties of peptide drugs include their limited bioavailability and stability owing to rapid proteolytic degradation and elimination from circulation. However, several strategies are available to circumvent these limitations, including peptidomimetic development (Witt, Gillespie et al. 2001; Adessi and Soto 2002; Sillerud and Larson 2005; Vagner, Qu et al. 2008), development of pro-drugs, and the use of advanced drug delivery systems (micelles, vesicles, nanoparticles, etc.).
Peptides as drug leads have been applied successfully to search for antagonists of cell-surface receptors. Numerous publications have demonstrated the ability of peptides to specifically inhibit or activate protein or enzyme actions (Norman, Smith et al. 1999; Tao, Wendler et al. 2000; Christensen, Gottlin et al. 2001; Pini, Giuliani et al. 2005; Allegretti, Bertini et al. 2008). Several peptide agents that bind strongly to disease related proteins have already been identified; e.g., bombesin, octreotide, cRGD peptides (Sillerud and Larson 2005). Peptides now represent one of the fastest growing classes of new drugs. Although they currently account for only 2% of drugs on the market, they reportedly comprise roughly 50% of drugs in the pipelines of major drug manufacturers. The market for peptide drugs is now growing at a compound annual rate of 7.5% and is estimated to be worth in excess of $13 billion by 2010.
There is infinite diversity in protein structure and function. Thus, the process of finding selective, high-affinity peptide (or any) ligands is not trivial. Tested approaches to producing peptide ligands to protein targets include phage display and combinatorial peptide library screening. Combinatorial peptide library synthesis is tedious and expensive to set up, the represented ligand sample space is random and not comprehensive, and the binding ligand needs to be isolated from beads for structure determination by Edman degradation. Iterative affinity refinement is not possible. This tedious screening approach has become less common.
Phage display has demonstrated the greatest success in peptide ligand discovery for proteins (Dias, Fasan et al. 2006). However, this method has certain key limitations, critical among which is that the random libraries represent only fractional diversity of the total ligand space; e.g., an 8-mer peptide can have 20^8 ≈ 26 billion permutations, only ~10% of which are expressed and screened. Furthermore, affinities are often only modest (μM range), and iterative affinity refinement by systematically modifying the peptide sequence is not feasible.
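By way of illustration only, the size of the sequence space referred to above can be checked with a minimal Python sketch. The function name `sequence_space` and the choice of a 20-letter alphabet (the standard proteinogenic amino acids) are assumptions made for this example and are not part of the disclosed method.

```python
# Minimal sketch: size of the linear peptide sequence space.
# Assumption: only the 20 standard amino acids are considered.
AMINO_ACIDS = 20


def sequence_space(length: int, alphabet: int = AMINO_ACIDS) -> int:
    """Number of distinct linear peptides of the given length."""
    return alphabet ** length


if __name__ == "__main__":
    octamers = sequence_space(8)
    print(f"8-mer sequence space: {octamers:,}")            # 25,600,000,000
    # Roughly one tenth of that space, per the estimate in the text.
    print(f"~10% of the 8-mer space: {octamers // 10:,}")   # 2,560,000,000
    for n in (10, 12, 15):
        print(f"{n}-mer sequence space: {sequence_space(n):.3e}")
```

The exponential growth with peptide length is the reason exhaustive random screening becomes impractical even for short peptides.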
The present invention makes it possible to predict and test peptide molecules that will bind specifically and with high affinity to a desired protein (or DNA) without significant or substantial knowledge of the structure or function of protein being targeted other than its primary linear amino acid sequence. In this respect it is distinct from the processes used in general for drug molecule discovery, which often requires knowledge of the target beyond the linear sequence. It will be understood by persons skilled in the art that though the primary focus is on identifying peptide molecules for possible uses in therapy, there are many additional uses for the peptide with specific binding properties identified by this invention. Notably, such molecules can be used in diagnostics as molecular probes as well as in drug delivery and drug discovery (e.g., in vitro assays to screen against vast drug libraries). Furthermore, the present invention can be used for the identification of selective, high-affinity peptide binding agents, which can be used in molecular imaging, pathogen or rare cell capture, protein purification through affinity chromatography, forensics, etc.
The present invention relates to methods and computer systems (HALO/Gen; High Affinity Ligand Optimization/Generation) for identifying and selecting peptides or peptide-like molecules capable of binding to, detecting, and/or otherwise modulating the function of protein targets having known amino acid sequences. Furthermore, the invention relates to compositions comprising the peptides identified and to methods to optimize their physiochemical properties and biological activity.
The present invention includes a method for discovery of high-affinity peptide ligands which comprises the steps of (i) obtaining an amino acid sequence of at least a portion of a target protein; (ii) identifying one or more sequence homologous proteins (with 20%, 30%, 40%, 50%, 60%, 70%, 80%, 90% and 100% homology) of the target protein; (iii) identifying complementary proteins, which are proteins that bind to/interact with the homologous proteins; (iv) extracting from databases peptides that have a ligand binding probability for the target and homologous proteins; (v) generating a library of candidate peptide ligands and their knowledge based sequence permutations; (vi) determining if any of the candidate peptide ligands binds to the target protein; and (vii) iteratively optimizing affinity and specificity for the target protein. The method may further comprise one or more “sequence modulation” steps, wherein sub-sequences of the candidate peptide ligand sequence, each one corresponding to an n-residue long contiguous sequence, are selected to yield a super-library of n-mer parent peptides (the “n-mer superlibrary”); and, one or more residues in each peptide in the n-mer superlibrary are substituted with other amino acids, to yield a “substituted n-mer superlibrary.”
Peptides of 8-12 amino acids may be selected for the n-mer superlibrary. The peptides used in the sequence modulation step are often octamers (i.e., n=8). It will be understood by the skilled artisan that the absolute length is not limiting and is dictated by a number of factors, e.g., specificity, sensitivity, ease of manufacture and purification, length constraints to enter cells, etc. Generally, two residues are substituted during sequence modulation, preferably located in central positions in the peptide. In another embodiment, the central four or six residues are permuted. However, it is possible to keep the central positions in the peptide unchanged (e.g., positions 3, 4, 5, and/or 6 in an octamer), whereas pairs or other combinations of residues in peripheral positions are substituted and permuted. For example, residues 1, 2, 7, and/or 8 could be substituted. In another embodiment of the present invention, peptides could be scanned positionally such that one residue in each peptide in the n-mer superlibrary is substituted sequentially by a scanning amino acid, such as alanine.
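The sequence modulation steps described above may be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the HALO/Gen implementation: the function names, the 0-based position indexing, and the example sequence are all hypothetical.

```python
from itertools import product

STANDARD_AA = "ACDEFGHIKLMNPQRSTVWY"  # assumption: 20 standard residues only


def nmer_superlibrary(candidates, n=8):
    """Distinct contiguous n-residue windows from all candidates, first-seen order."""
    seen = {}
    for seq in candidates:
        for i in range(len(seq) - n + 1):
            seen.setdefault(seq[i:i + n], None)
    return list(seen)


def alanine_scan(peptide, scanning_residue="A"):
    """One variant per position, replacing that position with the scanning residue."""
    return [peptide[:i] + scanning_residue + peptide[i + 1:]
            for i in range(len(peptide))
            if peptide[i] != scanning_residue]


def substitute_positions(peptide, positions, alphabet=STANDARD_AA):
    """All variants in which the given 0-based positions take every residue combination."""
    variants = []
    for combo in product(alphabet, repeat=len(positions)):
        residues = list(peptide)
        for pos, aa in zip(positions, combo):
            residues[pos] = aa
        variant = "".join(residues)
        if variant != peptide:  # skip the unmodified parent
            variants.append(variant)
    return variants


if __name__ == "__main__":
    parents = nmer_superlibrary(["ACDEFGHIKLMN"], n=8)  # hypothetical candidate
    print(parents)
    print(alanine_scan(parents[0]))
    # Two central residues of an octamer (1-based positions 4 and 5).
    print(len(substitute_positions(parents[0], (3, 4))))
```

Substituting two positions exhaustively yields 20 x 20 - 1 = 399 variants per parent octamer, which is why the number of substituted positions is kept small or the substitutions are restricted to knowledge-based choices.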
The amino acid substitutions may be randomly selected, or they may be knowledge-based. A possible type of knowledge-based amino acid substitution is based on the use of a substitution matrix. A substitution matrix belonging to the BLOSUM (BLOcks of Amino Acid SUbstitution Matrix) family of substitution matrices may be used. An artisan will appreciate that there are numerous substitution matrices available, based on amino acid properties such as mutational frequency, physicochemical similarities, tendency to occupy a certain location on a protein tridimensional structure, tendency to covary, or tendency to adopt a certain secondary structure.
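A knowledge-based substitution of this kind may be sketched as below. The score table is a small toy table written for this example only; it is not a complete BLOSUM or PAM matrix, pairs absent from the table are assumed to score 0, and the function names and the `min_score` cutoff are hypothetical. A real run would load a full substitution matrix.

```python
# Toy similarity scores for illustration only; NOT a complete substitution matrix.
TOY_SCORES = {
    ("I", "V"): 3, ("F", "Y"): 3, ("K", "R"): 2, ("D", "E"): 2,
    ("I", "L"): 2, ("L", "M"): 2, ("S", "T"): 1, ("L", "V"): 1,
}
STANDARD_AA = "ACDEFGHIKLMNPQRSTVWY"


def score(a, b):
    """Symmetric lookup; pairs missing from the toy table score 0 (assumption)."""
    return TOY_SCORES.get((a, b), TOY_SCORES.get((b, a), 0))


def conservative_substitutes(residue, min_score=1):
    """Residues scoring at least min_score against `residue`, best first."""
    hits = [aa for aa in STANDARD_AA
            if aa != residue and score(residue, aa) >= min_score]
    return sorted(hits, key=lambda aa: (-score(residue, aa), aa))


def substituted_variants(peptide, position, min_score=1):
    """Variants of `peptide` with the 0-based `position` conservatively substituted."""
    return [peptide[:position] + aa + peptide[position + 1:]
            for aa in conservative_substitutes(peptide[position], min_score)]


if __name__ == "__main__":
    print(conservative_substitutes("L"))
    print(substituted_variants("ACDLFGHI", 3))
```

Restricting substitutions to residues above a score cutoff keeps the substituted n-mer superlibrary far smaller than an exhaustive permutation of the same positions.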
The present invention may further comprise the step of “screening” peptides selected from the library of candidate peptide ligands, the n-mer superlibrary, and the substituted n-mer superlibrary for their ability to bind to the target protein using laboratory methods. This screening step and the sequence modulation step may be repeated iteratively to identify the amino acids that contribute to optimal binding (defined as greatest affinity and specificity or selective binding or modulating ability) of peptides selected from the library of candidate peptide ligands, the n-mer superlibrary, and the substituted n-mer superlibrary to the target protein. A collection of peptides selected from the library of candidate peptide ligands, the n-mer superlibrary, and the substituted n-mer superlibrary may be screened using high-throughput peptide microarrays. In one embodiment of the present invention, the collection of peptides to be screened may be synthesized on the microarray chip substrate using a digitally controlled light source. In another embodiment, peptides may be synthesized and spotted on a cellulose or other matrix. The final result of this process is selecting peptide ligand sequences having highest binding affinity and specificity for, or most desirable biological activity on the protein-protein interactions of, the query protein.
The skilled artisan will appreciate that a variety of screening methods may be used. Some non-limiting examples of the methods that may be used include radiometric and spectrophotometric methods, fluorescence based methods, e.g., FRET (Fluorescence Resonance Energy Transfer), FET, ligand displacement assays, NMR methods, microarray-based analysis, pep-spot analysis, combinatorial libraries, and phage display, although other techniques could also be used.
The screening of peptides selected from the initial iteration or a subsequent iteration of a library of candidate peptide ligands, its respective n-mer superlibrary, and its substituted n-mer superlibrary may include the steps of: (i) determining the binding affinity of ligands; (ii) determining in vitro the inhibitory ability of ligands; (iii) performing cell-based activity assays that are selected based on the cellular target(s), protein target(s), cellular activation, cellular proliferation, secretion or expression of proteins, genes, or the transcription of RNA; (iv) performing animal-model based pharmacodynamics/pharmacokinetics assays for ADME-Tox or other pharmacological properties; and/or (v) performing preclinical/clinical testing in patient cohorts.
The target protein or fragment thereof may be involved in peptide-protein, protein-protein, or protein-DNA interactions, and the effect of the peptide ligand sequences may be to bind to and otherwise modulate, activate, or inhibit the function of the target protein or homologous proteins. The target protein may also be a post-translationally modified protein or a non-natural protein, such as a recombinant protein, a synthetic protein, a hydrolysis product, or a peptidomimetic. The target protein may be selected from the group consisting of cell membrane receptors, nuclear membrane receptors, cellular and nuclear proteins, mitochondrial proteins, circulating peptide and non-peptide receptors, membrane and circulating transporters, enzymes, chaperonins and chaperonin-like proteins, antibodies, and surface and intracellular proteins of infectious agents.
In the present invention, the identification of one or more proteins homologous to the target protein is performed by querying one or more suitable databases with one or more suitable homology search tools, using the target protein primary sequence or a fragment thereof as input. The concept of homology is used to denote similarities between primary sequences, without implying that the proteins share a common ancestry. The sequence of the target protein can be retrieved from a public or private protein database. However, it is also possible to obtain the primary sequence of the target protein by protein sequencing, or by translating (i) the nucleic acid sequence of a newly sequenced biological target of interest, or (ii) a nucleic acid sequence deposited in a public or private database.
The significance of the detected sequence similarity (homology) may be judged by an alignment score below a threshold of statistical significance. Numerous tools may be used to identify sequence homology. One or more tools may be used to detect sequence homology. The homology search tools may use sequence-to-sequence comparison (pairwise sequence alignment), using an algorithm such as Smith-Waterman, Needleman-Wunsch, BLAST, PSI-BLAST, PHI-BLAST, WU-BLAST2, BLAT, or FASTA.
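For illustration, a pairwise local alignment of the Smith-Waterman type may be sketched as follows. This is a simplified, score-only version under stated assumptions: it uses a fixed match/mismatch score and a linear gap penalty (hypothetical parameter values) rather than a substitution matrix with affine gaps, and it does not compute any measure of statistical significance.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score between sequences a and b.

    Simplified scoring (assumption): identical residues score `match`,
    different residues score `mismatch`, each gap position costs `gap`.
    """
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A local alignment may restart anywhere, hence the floor at 0.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


if __name__ == "__main__":
    # Hypothetical sequences sharing the local segment "CDEF".
    print(smith_waterman_score("ACDEFG", "XXCDEFYY"))
```

In this sketch a higher raw score indicates greater similarity. Heuristic tools such as BLAST additionally report an expectation value for each hit, for which lower values indicate greater significance.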
The homology search tools may also use sequence to profile comparison (sequence-profile method). A number of methods may be used, including Position Specific Scoring Matrix (PSSM)-based methods such as PSI-BLAST, Hidden Markov Model (HMM)-based methods such as HMMER or SAM, or profile to profile comparison (profile-profile) methods such as FFAS, ORFeus, COMPASS, COACH, or HHpred.
Homology search tools may include protein domain comparison tools such as RPS-BLAST, HMMER or IprScan. These protein domain comparison tools may be applied to search for homology between the target protein primary sequence or a fragment thereof and protein domains stored in individual protein domain databases, such as Pfam, SMART, PROSITE, ProDom, PRINTS, UniProt, TIGRFAMs, PIR-SuperFamily, or SUPERFAMILY. Furthermore, these domain comparison tools may be applied to search for homology between the target protein primary sequence or a fragment thereof and protein domains stored in a protein domain meta-database, such as InterPro or CDD.
The homology search tools may include protein domain architecture comparison tools, such as CDART. The identification of peptides that have a ligand binding probability may also be performed using functional-domain-prediction bioinformatic methods such as ab-initio domain predictions, secondary structure predictions, disorder/low complexity predictions, linker predictions, gene fusion methods, protein functional domain co-occurrence methods, genetic context methods (e.g., gene neighborhoods, gene clusters, operons), phylogenomic profiles, and metabolic reconstructions.
The present invention may comprise the further step of identifying “complement proteins.” These “complement proteins” are proteins known to interact with the target protein (or to DNA), with proteins homologous to the target protein, and with fragments of the target protein or its homologues. The target protein, proteins homologous to the target protein, and/or fragments of the target protein or its homologues would be used to query one or more protein-protein interaction databases, such as DIP and IntAct, such that proteins known to interact with the query sequences would be identified. Complement proteins may include enzymes that degrade the target protein, proteins degraded by a target protein that is an enzyme, regulatory proteins, structural proteins, et cetera. The target protein might be also its own complement protein when the protein's quaternary structure is multimeric.
The identification of peptides with a probability of being ligands binding to the target protein or DNA may be performed using textmining. Candidate peptide ligands likely to bind to the target protein, candidate peptide ligands likely to bind to homologous proteins, and candidate peptide ligands present in the sequence of complement proteins that are likely to bind to the target protein, DNA, or to homologous proteins may be identified using textmining. Textmining may include mining and curating ligand-binding data collected from scientific literature, protein-ligand databases, publications, literature reports, documents, computerized records, abstracts, scientific journals, and other public and non-public sources (all these different types of publications are collectively described as “data sources”). Textmining may be manual, computer-assisted, or any combination thereof.
Candidate peptide ligands may be identified by text searching publicly accessible databases using customized search terms, for references including peptide ligands and interacting sequences or domains to the target protein and/or to homologous proteins. Textmining may be performed by querying “data sources” using one or more sequences or fragments of the target protein, homologs, or complement proteins. Furthermore, the “data sources” may be queried with one or more protein names, synonyms, commonly accepted acronyms, database accession numbers, domain names, E.C. numbers, or any other suitable non-sequence identifier. A publicly available database such as BioMint, UniProt, or Entrez may be employed as a source to obtain protein name synonyms and other related information about proteins.
The textmining step may be a single step, or a group of textmining steps. The textmining space may be expanded by querying the “data sources” with permutations and combinations of search terms including sequences or fragments of the target protein name, and those of homologs, or complement proteins; protein names; synonyms; commonly accepted acronyms; database accession numbers; domain names; E.C. numbers; or other suitable identifiers. The search terms may include non-identifier keywords; for example sequence, ligand, peptide, epitope, paratope, motif, etc.
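The expansion of the textmining space by combining identifiers with non-identifier keywords may be sketched as follows. The query syntax (a quoted identifier joined to a keyword with AND), the function name, and the placeholder identifiers are assumptions made for this example; the actual syntax depends on the data source being queried.

```python
from itertools import product


def expand_queries(identifiers, keywords):
    """Pair every protein identifier with every non-identifier keyword.

    The '"identifier" AND keyword' form is a hypothetical query syntax.
    """
    return [f'"{ident}" AND {kw}' for ident, kw in product(identifiers, keywords)]


if __name__ == "__main__":
    # Placeholder identifiers: a name, a synonym, and an accession-style string.
    identifiers = ["PROTEIN_X", "PX1", "ACC00000"]
    keywords = ["sequence", "ligand", "peptide", "epitope", "paratope", "motif"]
    queries = expand_queries(identifiers, keywords)
    print(len(queries))
    for query in queries[:3]:
        print(query)
```

With three identifiers and six keywords the sketch produces eighteen queries; adding homolog and complement protein identifiers enlarges the set multiplicatively.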
Once candidate peptides are identified, the candidate peptide ligands may be ranked according to the degree of sequence homology between the primary sequence of the homologous protein used for the textmining query and the primary sequence of the target protein, wherein ligands corresponding to proteins with higher homology scores would be ranked highest.
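This ranking step may be sketched as a simple descending sort. The tuple layout, the use of percent identity as the homology score, and the example records are assumptions for illustration only.

```python
def rank_candidates(candidates):
    """Sort (peptide, source_protein, homology_score) tuples, highest score first.

    Assumption: homology_score is the percent identity between the source
    protein and the target protein; ties keep their input order.
    """
    return sorted(candidates, key=lambda record: record[2], reverse=True)


if __name__ == "__main__":
    # Hypothetical candidates gathered from the textmining step.
    found = [
        ("ACDEFGHI", "homolog_B", 45.0),
        ("KLMNPQRS", "target", 100.0),
        ("TVWYACDE", "homolog_A", 80.0),
    ]
    for peptide, source, identity in rank_candidates(found):
        print(f"{peptide}  {source}  {identity:.0f}%")
```

Ligands reported for the target protein itself (100% identity) appear first, followed by ligands reported for progressively more distant homologs.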
The library of candidate peptide ligand sequences may be culled of redundant or “entrained” sequences using one or more Multiple Sequence Alignment (MSA) algorithms, such that when two sequences are identical or almost identical, only the longer of the two sequences is selected. For example, MSA could be performed using ClustalW2, DCA, Dialign, POA, T-Coffee, MAFFT, or MUSCLE. The inventors also created two databases called LiBase and PepLib comprising other ligand binding databases, the textmining-derived data, and experimental data from array work done by the inventors. These databases are particularly useful since they comprise experimental data that will become increasingly important and more accurate as their size increases with new ligand binding results from studies conducted by the inventors.
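The culling of entrained sequences may be illustrated with the following sketch. Note the swapped-in technique: instead of a multiple sequence alignment, the sketch uses exact substring containment, so it removes only duplicates and sequences wholly contained in a longer candidate. Handling of “almost identical” sequences is omitted, and the function name is hypothetical.

```python
def cull_entrained(sequences):
    """Drop duplicates and sequences fully contained in a longer kept sequence.

    Simplification (assumption): exact substring containment stands in for
    the MSA-based comparison; near-identical sequences are not merged.
    """
    kept = []
    # Visit longest sequences first so the longer of two is always retained.
    for seq in sorted(set(sequences), key=lambda s: (-len(s), s)):
        if not any(seq in longer for longer in kept):
            kept.append(seq)
    return kept


if __name__ == "__main__":
    library = ["ACDEFGHIK", "CDEFGH", "ACDEFGHIK", "WYWY"]  # hypothetical
    print(cull_entrained(library))  # ['ACDEFGHIK', 'WYWY']
```

Sorting by decreasing length before filtering ensures that, of two overlapping candidates, the longer one is retained, consistent with the selection rule stated above.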
The present invention also includes a computer program embodied on a computer readable medium capable of performing the methods disclosed. Another embodiment of the present invention is a computer system, comprising programming to perform the method disclosed. The present invention also includes a computer storage media, comprising programming to perform the methods disclosed. Furthermore, the present invention includes the compositions produced by applying the methods disclosed.
In one embodiment, the instant invention provides a method for discovery of one or more high-affinity peptide ligands comprising: (i) obtaining an amino acid sequence of at least a portion of a target protein, (ii) identifying one or more homologous proteins of the target protein, (iii) identifying complementary proteins, wherein the complementary proteins are proteins that bind to the target or homologous proteins, (iv) extracting from literature or from one or more databases, peptides that have a ligand binding probability for the target protein or homologous proteins, (v) generating a library of one or more candidate peptide ligands from the extracted literature; and (vi) determining if any of the one or more candidate peptide ligands binds to the target protein. The target protein as described above is selected from the group consisting of cell membrane receptors, nuclear membrane receptors, cytoplasmic, nuclear, and mitochondrial proteins, secreted proteins, circulating peptide and non-peptide receptors, membrane and circulating transporters, enzymes, chaperonins and chaperonin-like proteins, antibodies, and surface and intracellular proteins of infectious agents.
The method as described hereinabove further comprises the steps of: performing a “sequence modulation,” wherein subsequences of the one or more candidate peptide ligand sequences, each one corresponding to an n-residue long contiguous sequence, are selected to yield a super-library of n-mer parent peptides (the “n-mer superlibrary”), and substituting one or more residues in each peptide in the n-mer superlibrary with other amino acids, to yield a “substituted n-mer superlibrary.” In one aspect of the method of the present invention, one residue in each peptide in the n-mer superlibrary is sequentially substituted by a scanning amino acid (e.g., alanine). The amino acid substitutions are knowledge-based and are selected using a substitution matrix, wherein the substitution matrix is a member of the BLOSUM (BLOcks of Amino Acid SUbstitution Matrix) family of substitution matrices, PAM, or any custom generated matrix from data analysis. In another aspect of the method disclosed herein, n is 8. In yet another aspect, two, four, six, or all n residues are substituted. In another aspect, the two substituted residues occupy central positions in the peptide. In another aspect, one or more central amino acids of the peptide are kept unchanged while pairs of outlying amino acids in the sequence are permuted. In related aspects, the amino acid residues in positions 1 and 8 or in positions 2 and 7 are substituted and the substitutions are random variations of the peptide chemistry.
The method of the present invention further comprises the step of “screening” peptides selected from the library of candidate peptide ligands, the n-mer superlibrary, and the substituted n-mer superlibrary for binding ability to the target protein using one or more experimental methods. In one aspect of the method, the steps of “sequence modulation” and “screening” are repeated iteratively to identify the amino acids which contribute to optimal binding of peptides selected from the library of candidate peptide ligands, the n-mer superlibrary, and the substituted n-mer superlibrary to the target protein, thereby selecting candidate peptide ligand sequences having optimized binding activity (evaluated using an established assay method selected from the group consisting of FRET (Fluorescence Resonance Energy Transfer), spectrophotometric assay, fluorescence assay, NMR, MS, electrical/electronic (e.g., FET, SPR) assays, in silico docking, microarray-based analysis, pep-spot analysis, combinatorial libraries, phage display, cell based assays, and combinations or modifications thereof). In another aspect, the optimized binding activity comprises affinity (Kd), specificity, selectivity, in vitro and in vivo availability, viability, and combinations and modifications thereof. The affinity may be modulated by dimerization, oligomerization, multivalent display, or any combinations thereof to facilitate multiplex binding. In yet another aspect, a collection of peptides selected from the library of candidate peptide ligands, the n-mer superlibrary, and the substituted n-mer superlibrary is screened using high-throughput peptide microarrays selected from the group consisting of chip based assays, liquid arrays comprising flow cytometry bead assays, multiplex labeled beads on a substrate, or any combinations thereof.
The collection of peptides is synthesized on the microarray chip substrate using a digitally controlled light source, in situ laser printing, PEPspot and other pre-synthesized peptide spotting technologies or any combinations thereof. In a specific aspect, about 4,000 to 2,000,000 peptides are arrayed on the chip.
In one aspect of the method of the present invention, the screening of peptides selected from an initial iteration or a subsequent iteration of a library of candidate peptide ligands, its respective n-mer superlibrary, and its substituted n-mer superlibrary includes: a) determining the binding affinity of ligands in vivo or in vitro by linking to a contrast agent or a detection agent, wherein the contrast agent or the detection agent is selected from the group consisting of a radioactive tracer, a GFP, a fluorophore, a quantum dot, a nanoscale structure or a nanowire, b) determining in vitro modulatory (e.g., agonist, inverse agonist, antagonist) ability of ligands towards one or more agonists, inverse agonists, or antagonists, and c) performing cell-based activity assays, in vivo pharmacological assays, drug sensitivity assays, clinical/preclinical testing, or any combinations thereof.
In another aspect, the target protein or fragment thereof is involved in, but not limited to, peptide-protein or protein-protein or protein-small molecule interactions. In yet another aspect, the peptide ligand sequences bind to and otherwise modulate, activate, or inhibit the function of the target protein, homologous proteins, or DNA. In another aspect, the peptide ligand sequences can be polyvalently linked or displayed for protein, DNA, and cell capture. In another aspect, peptide ligands can be conjugated to a payload such as a drug, protein or other molecules such as, but not limited to, liposomes, dendrimers, or combinations thereof for drug delivery. In yet another aspect, the step of identifying one or more homologous proteins is performed using the target protein primary sequence or a fragment thereof to query one or more suitable databases with one or more suitable homology search tools, such that proteins homologous to the target protein are identified, wherein the query protein amino acid sequences are obtained by translating nucleic acid sequences. In one aspect of the present invention, the homologous sequences have significant sequence similarity to the target sequence as judged by an alignment score below a threshold of statistical significance. In another aspect, the one or more of the homology search tools includes sequence to sequence comparison (pairwise sequence alignment) done by an algorithm (selected from the group consisting of Smith-Waterman, Needleman-Wunsch, BLAST, PSI-BLAST, PHI-BLAST, WU-BLAST2, BLAT, and FASTA).
In yet another aspect of the method, the one or more of the homology search tools is based on a sequence to profile comparison (sequence-profile) method. In a specific aspect, the profile method is a Position Specific Scoring Matrix (PSSM)-based method (PSI-BLAST) or a Hidden Markov Model (HMM) based method (HMMER or SAM). In a related aspect, the one or more of the homology search tools is based on a profile to profile comparison (profile-profile) method, wherein the profile-profile method is selected from the group consisting of FFAS, ORFeus, COMPASS, COACH, and HHpred. In another aspect, the one or more of the homology search tools includes protein domain comparison tools, wherein the domain comparison tools include CDD, CDART, RPS-BLAST, HMMER, or IprScan. In yet another aspect the domain comparison tools are applied to search for homology between the target protein primary sequence or a fragment thereof and protein domains stored in an individual protein domain database (selected from the group consisting of Pfam, SMART, PROSITE, ProDom, PRINTS, UniProt, TIGRFAMs, PIR-SuperFamily, and SUPERFAMILY) or a protein domain meta-database (InterPro, CDD or CDART). In another related aspect, one or more of the homology search tools includes protein domain architecture comparison tools, more specifically CDART.
In one aspect of the method disclosed herein, the identification of peptides that have a protein or DNA binding probability includes functional domain prediction bioinformatic methods selected from the group consisting of ab-initio domain predictors, secondary structure predictors, disorder predictors, linker predictors, gene fusion methods, domain co-occurrence methods, genetic context methods (gene neighborhoods, gene clusters and operons), phylogenomic profiles, and metabolic reconstruction. The method of the present invention further comprises the step of identifying “complement proteins” wherein sequences selected from the group consisting of the target protein, the DNA, the homologous proteins, and fragments thereof are queried against one or more protein-protein interaction databases, such that proteins known to interact with the query sequences are identified. The protein-protein interaction databases that are queried include DIP, IntAct, and DOMINO, and all other open-source computational or manually curated databases.
In another aspect of the method disclosed herein, the identification of peptides with a ligand binding probability is performed using textmining, such that candidate peptide ligands likely to bind to the target protein, candidate peptide ligands likely to bind to the DNA, candidate peptide ligands likely to bind to homologous proteins, and candidate peptide ligands present in the sequence of complement proteins that are likely to bind to the target protein or to homologous proteins are identified. The step of textmining, as disclosed herein, includes mining and curating ligand-binding data collected from scientific literature and protein-ligand databases, publications, literature reports, documents, computerized records, abstracts, scientific journals, and other public and non-public sources (“data sources”). The textmining can be performed manually or can be computer-assisted, in which case it is performed using one or more text similarity search engines, wherein the search engine comprises eTBLAST, Contaro, the search engine and database HALO, or any combinations or modifications thereof. It will be understood herein that the textmining may be both manual and computer-assisted.
In yet another aspect of the method, the candidate peptide ligands are identified by text searching publicly accessible databases using customized search terms, for references including peptide ligands and interacting sequences, epitopes/paratopes, motifs, regions, and/or domains to the target protein and/or to homologous proteins, wherein input text used to search a database for relevant protein-ligand interactions in the protein synopsis is selected from the group consisting of, but not limited to, PubMed, UniProt, ChemAbstracts, PDB, or InterPro. In another aspect, the textmining is performed by querying “data sources” with one or more protein names, synonyms, commonly accepted acronyms, database accession numbers, domain names, E.C. numbers, or other suitable non-sequence identifiers of the target protein, homologs, or complement proteins. In a specific aspect, the BioMint, UniProt, PDB databases or combinations thereof are employed to obtain protein synonyms. In yet another aspect, the textmining is performed by querying “data sources” with permutations and combinations of search terms including sequences or fragments of the target protein, homologs, or complement proteins; protein names; synonyms; commonly accepted acronyms; database accession numbers; domain names; E.C. numbers; or other suitable identifiers, wherein the search terms include non-identifier keywords.
In another aspect of the method, the candidate peptide ligands are ranked according to the degree of sequence homology between the primary sequence of the homologous protein used for the textmining query and the primary sequence of the target protein, wherein ligands corresponding to proteins with higher homology scores are ranked highest. In yet another aspect, the evaluation of ligand binding probabilities is performed by querying a peptide database, wherein the peptide database is PepLib, LiBase or other databases. In one aspect of the method, the library of candidate peptide ligand sequences is culled of redundant or “entrained” sequences using a Multiple Sequence Alignment (MSA) algorithm (selected from the group consisting of ClustalW, DCA, Dialign, POA, T-Coffee, MAFFT, and MUSCLE), such that when two sequences are identical or almost identical, only the longer of the two sequences is selected. In a specific aspect, the MSA algorithm is ClustalW2.
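By way of illustration only, the culling and ranking described above may be sketched as follows. This is a minimal sketch under stated assumptions: simple substring containment stands in for the MSA-based detection of identical or almost identical sequences, and the peptide sequences, homology scores, and function names are hypothetical.

```python
from typing import NamedTuple


class Candidate(NamedTuple):
    sequence: str          # candidate peptide ligand, one-letter codes (made up)
    homology_score: float  # score of the homolog it was mined from (made up)


def cull_entrained(candidates: list[Candidate]) -> list[Candidate]:
    """Drop any sequence identical to, or fully contained in, a longer kept one.

    Assumption: containment is a crude stand-in for an MSA comparison.
    """
    kept: list[Candidate] = []
    # Visit the longest sequences first so the longest representative survives.
    for cand in sorted(candidates, key=lambda c: len(c.sequence), reverse=True):
        if not any(cand.sequence in k.sequence for k in kept):
            kept.append(cand)
    return kept


def rank_by_homology(candidates: list[Candidate]) -> list[Candidate]:
    """Place ligands mined from higher-scoring homologs first."""
    return sorted(candidates, key=lambda c: c.homology_score, reverse=True)


library = [
    Candidate("LKELLLAG", 310.0),
    Candidate("KELLLA", 120.0),      # contained in the entry above, so culled
    Candidate("RHKILHRLLQE", 540.0),
    Candidate("RHKILHRLLQE", 95.0),  # exact duplicate, so culled
]
ranked = rank_by_homology(cull_entrained(library))
```

In this sketch an exact duplicate keeps whichever copy appears first in the input; a fuller implementation might instead retain the highest homology score among duplicates.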
The target protein, homologous protein, complementary protein or peptide ligand, disclosed hereinabove, binds a nucleic acid. The present invention further includes a computer program embodied on a computer readable medium, a computer system comprising programming, a computer storage medium comprising programming, or any combinations thereof to perform the method described hereinabove. The present invention also discloses a composition comprising one or more novel peptides capable of binding to the target protein, DNA, or both, identified by the method provided herein. Finally, the invention presented herein provides an optimized binding agent or binding agent composition from multiple peptides discovered by the method of the present invention or as otherwise identified, wherein the agent targets several different protein domains and comprises several short peptides that are linked.
For a more complete understanding of the features and advantages of the present invention, reference is now made to the detailed description of the invention along with the accompanying figures and in which:
While the making and using of various embodiments of the present invention are discussed in detail below, it should be appreciated that the present invention provides many applicable inventive concepts that can be embodied in a wide variety of specific contexts. The specific embodiments discussed herein are merely illustrative of specific ways to make and use the invention and do not delimit the scope of the invention.
To facilitate the understanding of this invention, a number of terms are defined below. Terms defined herein have meanings as commonly understood by a person of ordinary skill in the areas relevant to the present invention. Terms such as “a”, “an” and “the” are not intended to refer to only a singular entity, but include the general class of which a specific example may be used for illustration. The terminology herein is used to describe specific embodiments of the invention, but their usage does not delimit the invention, except as outlined in the claims.
Protein targets that are relevant to the present invention include cell membrane receptors, nuclear membrane receptors, intracellular, nuclear, and mitochondrial proteins including nuclear transcription factors, secreted proteins, circulating peptide and non-peptide receptors, membrane and circulating transporters, enzymes, chaperonins and chaperonin-like proteins; antibodies, surface and intracellular proteins of infectious agents, and more generally, any protein involved in peptide-protein, protein-protein, and/or protein-DNA interactions. The protein targets may be naturally present in humans or may be a natural component of other species. The protein target may not necessarily be naturally occurring. Some protein targets may have been generated by artificial means. The peptides resulting from this invention are selected to selectively bind to and/or detect, or otherwise modulate, activate and/or inhibit the function of the target protein.
The invention makes use of the fact that there is an enormous amount of scientific and technical information, published journal articles, technical papers, and computer databases regarding the interaction of peptides and peptide like molecules with proteins. This information is generally of limited use in predicting peptide or protein interactions. The invention teaches a newly discovered process for extracting and correlating scientific information about proteins and peptides to discover new peptides with desired binding characteristics. It is noteworthy that there exist catalogs or databases that collect and classify information on ligand molecules that are known to interact with proteins. Examples of such collection of information are the PepBank database (pepbank.mgh.harvard.edu/search/basic) and the antimicrobial peptide database (aps.unmc.edu/AP/main.php). The present invention is distinct from these types of catalogs. These catalogs record information from public sources, publications and report, but do not teach or describe methods or processes to identify or predict and discover new peptide ligands with a specific protein target in mind.
The present method is particularly applicable when the structure of the target is unknown, as the only requirement to perform the invention is knowledge of the linear amino acid sequence of the target. An artisan would appreciate that the method is also applicable to targets whose three-dimensional structure is known, by using the primary sequence or part of the primary sequence of the target protein as input. In some instances, the process may be applied knowing only partial sequence information about the target protein rather than the amino acid sequence in its entirety. The invention includes the discovery of peptides or peptide-like molecules which have a high probability of binding to and/or otherwise modulating the function of polypeptides or proteins, and to methods for designing and validating and improving the functionality (binding characteristics) of such peptides or peptide-like molecules. The discovery of high-affinity peptide ligands disclosed here uses a novel procedure that narrowly selects ligand candidates to a target or query protein from an initial input of only the amino acid sequence of the target or query protein.
The process of the invention begins by obtaining the amino acid sequence of the target protein. Often the target will have been previously studied to the extent that the primary amino acid sequence is known. Many resources are available from which the amino acid sequence information on the protein can be obtained. There may be occasions when the target protein is new and has not previously been sequenced. In this instance, the unknown protein must be sequenced. It may not be necessary to determine the sequence in its entirety; knowledge of only portions of the sequence may be adequate. There are standard laboratory methods for determination of the amino acid sequence of a protein. Once the amino acid sequence is obtained, the invention requires a specific form of search to be carried out. Specifically, we search for proteins that have amino acid sequence homology with the target protein. For the purposes of this invention, we define a homologous protein as a protein that possesses an amino acid sequence that matches or partially matches the amino acid sequence found in the target or query protein (for example 20%, 30%, 40%, 50%, 60%, 70%, 80%, 90% or 100% homology).
One method for finding proteins with sequences that match or are similar to the target is to use a BLAST (Basic Local Alignment Search Tool) or similar protocol. A BLAST search (blast.ncbi.nlm.nih.gov/Blast.cgi) is carried out using the protein sequence of the target protein. The non-redundant protein database can be selected for the BLAST query because it comprises proteins from all other databases, including SwissProt, PDB, PIR, and PRF. Open source protein databases to be queried may include the Protein Databank (PDB) of protein structures, RefSeq, and UniProt (European Bioinformatics Institute consortium).
The threshold E (expectation) value for the query can be varied but generally is set to 0.1 or lower (e.g., 0.01, 0.001, 0.0001). The E-value sets the stringency of the threshold or cutoff: the higher the threshold, the greater the possibility that an apparent homology is not real but ‘accidental’. Gap values can be changed to increase or decrease stringency, and the matrix used may also be changed (e.g., PAM or BLOSUM). These settings are all part of a BLAST query and are available to one trained in the art.
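The effect of the E-value threshold may be illustrated with a short sketch that filters hits from BLAST+ tabular output (the `-outfmt 6` format, whose default columns end in `evalue` and `bitscore`). The hit lines, identifiers, and the function name `filter_hits` are hypothetical and serve only as an example; this is not the program used by the inventors.

```python
import csv
import io

# Default column order of BLAST+ tabular output (-outfmt 6).
FIELDS = ["qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
          "qstart", "qend", "sstart", "send", "evalue", "bitscore"]


def filter_hits(tabular_text: str, max_evalue: float = 0.001) -> list[dict]:
    """Keep hits whose E-value is at or below the threshold, best first."""
    reader = csv.DictReader(io.StringIO(tabular_text),
                            fieldnames=FIELDS, delimiter="\t")
    hits = [row for row in reader if float(row["evalue"]) <= max_evalue]
    return sorted(hits, key=lambda row: float(row["evalue"]))


# Made-up hit lines for illustration only; identifiers are placeholders.
example = (
    "target\thomolog_A\t98.5\t435\t6\t0\t1\t435\t1\t435\t0.0\t890\n"
    "target\thomolog_B\t57.1\t230\t95\t2\t200\t429\t180\t409\t3e-85\t262\n"
    "target\tunrelated_C\t28.0\t50\t36\t0\t10\t59\t5\t54\t0.8\t30.1\n"
)
significant = filter_hits(example, max_evalue=0.001)
```

Raising `max_evalue` toward 0.1 would admit more remote, and more possibly accidental, matches into the dataset of homologous proteins.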
In one embodiment of the present invention, the BLAST search for each protein provides a dataset of homologous proteins ranked by their total gap scores and E-value. Position-specific iterative BLAST (PSI-BLAST) may be used to provide a large hit set containing more remotely related proteins. Another way of determining homologous proteins to the target protein is through using InterPro (http://www.ebi.ac.uk/interpro/). InterPro combines a number of member databases (including Pfam, ProSite, PRINTS, and ProDom) that use different methodologies and a varying degree of biological information on well-characterized proteins to derive protein signatures such as families, domains, repeats, and sites. InterPro permits searching the database by protein sequence (e.g., FASTA) using InterProScan or by protein name (keyword search). The results from an InterPro search list proteins sharing domains, repeats, and functional sites broken down by ‘parent/child’ and ‘contains/found-in’ relationships to the query. By this approach additional datasets of homologous proteins may be derived. It will be understood by the skilled artisan that other databases for domain searching may be used (e.g., CDD and CDART databases).
Persons of ordinary skill will recognize there are multiple alternative search methods to achieve the same goal of identifying proteins that contain amino acid sequences that match, exactly or in part, sequences possessed by the target protein. Additional methodologies may become available in the future as computerized record keeping, data storage and retrieval, and web based search engines continue to evolve.
Many protein-protein interactions are mediated by small protein modules binding to short linear peptides in their protein complement. The interaction map of both eukaryotic and prokaryotic proteins is being documented using various studies, and various public source databases are available from which to obtain this information. These include but are not limited to DIP (dip.doe-mbi.ucla.edu) and IntAct (www.ebi.ac.uk/intact/). These databases are used to collate information about other proteins that are known to interact with the query or homologous proteins. Textmining can be used to infer peptide sequences present at these protein-protein interaction sites that could serve as potential leads for optimization.
After identifying homologous proteins and complement proteins (i.e., proteins that interact with the target protein and with proteins homologous to the target), the invention involves curating information contained in publications, literature reports, documents and computerized records for the purposes of identifying peptide ligands that bind to the homologous proteins. The invention teaches that ligands that are able to bind to the homologous protein might also be expected to have a binding affinity for the target protein. This is because the target protein and the homologous protein have amino acid sequences in common. Then, if the ligand binds to the homologous protein at a point where the amino acid sequence is similar to the sequence found in the target protein, that same ligand may be able to bind to the target protein (e.g., the ligand will have some degree of affinity for the target protein, which can, optionally, be optimized). In this way, we may reasonably expect to indirectly identify a narrow range of candidate molecules with the desired binding properties to the target.
In one embodiment, the required text searching can be performed manually by reading the relevant documents and publications describing studies of the homologous protein, specifically for the purposes of seeing if peptide ligands to the homologous proteins have been described. This information is contained in diverse databases including abstracts, scientific journals, and other public sources. A problem to address is that multiple documents have to be read and screened to determine if they contain the desirable subject matter. Specifically, the reader would need to be reasonably familiar with protein chemistry and the associated scientific nomenclature and be able to correctly identify passages of text, or items in tables or figures, where peptide sequences have been included. The search is further complicated because of the need to search by protein ‘name’ as well as using synonyms and commonly accepted acronyms. Thus, manually identifying potential ligand binding partners to the datasets of homologous proteins is a significant challenge.
A preferred embodiment would be to use computer assisted text searching, employing links to publicly accessible databases, such as MedLine and PubMed Central (PMC, www.pubmedcentral.nih.gov), although the invention is not confined to these databases. Private or commercial databases can be used. These sources are structured in a way that is protracted, yet simple, for humans to read and comprehend but difficult for computers to interpret. In such a computerized approach, scientific literature is queried, using customized search terms, for references to peptide ligands, interacting domains, motifs, sequences, epitopes, etc., to proteins identified from the homologous protein search aspect of the invention. Queries are also made for published sequences within the DIP (Database of interacting proteins) and IntAct (Interacting proteins database) proteins that ‘bind’ to their corresponding partners.
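As a hedged illustration of how customized search terms might be assembled, the sketch below pairs each protein identifier with each ligand-related keyword. The identifiers are names and synonyms of the ESRRG example target given elsewhere in this document; the keywords are drawn from the preceding paragraph. The generic `AND` syntax and the function name `build_queries` are assumptions and do not represent the query language of any particular search engine.

```python
from itertools import product


def build_queries(names: list[str], keywords: list[str]) -> list[str]:
    """Pair every protein identifier with every ligand-related keyword."""
    return [f'"{name}" AND "{kw}"' for name, kw in product(names, keywords)]


# Identifiers of the example target protein and a few of its synonyms.
protein_names = ["ESRRG", "ERR3", "NR3B3"]
# Keywords reflecting the kinds of references sought in the literature.
ligand_keywords = ["peptide ligand", "interacting domain", "motif", "epitope"]

queries = build_queries(protein_names, ligand_keywords)
```

The same combinations could be extended with names of homologous and complement proteins, accession numbers, or E.C. numbers as further identifiers.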
Preferably, the new computer-driven text mining software program HALO can be used to speed this aspect of the invention (
Typically, the process described above identifies several (>100) and preferably 10-20 peptide sequences as candidates, to be further assayed for binding or for their ability to modulate or affect target protein activity. This is a custom designed sample ligand space for the protein, in marked contrast to the usual drug discovery methods which can require synthesis and screening of vast unfocussed libraries of randomly selected compounds. At this point in the process, it is not known if the candidate peptides have the desired property, i.e., affinity, selectivity, specificity, etc., for the target protein. Therefore, once candidate ligands have been identified, their affinity and specificity for the target protein is optimized experimentally.
The invention also embodies methods to improve and refine the protein binding properties (selectivity, specificity, affinity, etc.) of the candidate peptides. A process of amino acid “sequence modulation” may be performed, if required. Here the inventors provide an example. From the set of candidate ligands to a query protein, all sequences longer than 8 (or 10 or 12)-amino acids are subjected to ligand walking (a process similar to epitope walking) to obtain an n-mer (n=8-12) super-library of candidate ligands. The reasons for this are described herein:
It has been demonstrated that linear sequences ~5-6 amino acids long are typically sufficient to provide significant interaction with a protein (McEntyre and Gibson 2004) since the binding energy for a ligand-protein is not distributed evenly across their interface but is localized to small regions of the protein of less than five amino acids that represent energetic focal points of the interaction (Arkin and Wells 2004). The yield of the step-wise, divergent synthesis of peptide ligands on the array is not quantitative (ca. 95-97%), and sequences longer than 10 amino acids tend to lose sequence fidelity.
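The ligand walking step may be sketched as a sliding window over each candidate ligand, as shown below. This is an illustrative sketch only: the peptide sequences are made up, the function names are hypothetical, and passing sequences of n residues or fewer through unchanged is an assumption.

```python
def ligand_walk(sequence: str, n: int = 8) -> list[str]:
    """Return every contiguous n-residue window of a candidate ligand."""
    if len(sequence) <= n:
        # Assumption: sequences no longer than n are kept as they are.
        return [sequence]
    return [sequence[i:i + n] for i in range(len(sequence) - n + 1)]


def build_superlibrary(candidates: list[str], n: int = 8) -> list[str]:
    """Pool the windows of all candidates, dropping repeats but keeping order."""
    seen: dict[str, None] = {}
    for seq in candidates:
        for window in ligand_walk(seq, n):
            seen.setdefault(window, None)
    return list(seen)


# Made-up candidate ligands: an 11-mer and an 8-mer that overlaps it.
superlibrary = build_superlibrary(["RHKILHRLLQE", "KILHRLLQ"], n=8)
```

An 11-residue candidate yields four 8-mers, so the n-mer super-library grows linearly with the length of each candidate.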
Then, a matrix of peptides comprising ‘knowledge based’ permutations of the core amino acids (central 2 or 4 or ‘n’) of each n-mer (8-mer taken as example) peptide in the super-library is created initially for arraying, using a substitution matrix such as BLOSUM (Henikoff and Henikoff 1992), as these residues are generally predicted to contribute maximally to protein-protein or protein-peptide binding interactions. The BLOSUM matrix is derived from complex mathematical analysis of the evolutionary design of proteins that has elucidated the most commonly observed point-mutations between protein structure families. For example, acceptable permutations might include Gly/Ala/Val/Leu/Ile, Phe/Tyr/Trp, Lys/Arg, Glu/Asp, Thr/Ser, Gly/Pro, etc. It is recognized that there are multiple ways to perform these substitutions and variations to the candidate ligands. An artisan would also appreciate that peptides longer or shorter than 8 amino acids may be used, and that the acceptable combinations and permutations may involve more or fewer than 2 amino acids in contiguous or non-contiguous positions. Uncommon substitutions, referred to as high-penalty substitutions as defined by BLOSUM, may also be used to generate sequences not commonly found in nature.
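A minimal sketch of the knowledge-based permutation of the core residues follows. As a loudly stated assumption, the interchangeable residue groups named in the paragraph above are used as a stand-in for a true BLOSUM lookup, which would instead select substitutions by score; the peptide and function names are hypothetical.

```python
from itertools import product

# Interchangeable residue groups named in the text, in one-letter codes.
# Assumption: these stand in for a score-based BLOSUM lookup.
GROUPS = ["GAVLI", "FYW", "KR", "ED", "TS", "GP"]


def allowed(residue: str) -> list[str]:
    """Residues that may replace `residue`, including itself, sorted."""
    options = {residue}
    for group in GROUPS:
        if residue in group:
            options.update(group)
    return sorted(options)


def permute_core(peptide: str, core_size: int = 2) -> list[str]:
    """Vary the central `core_size` residues; keep the flanks unchanged."""
    start = (len(peptide) - core_size) // 2
    core = peptide[start:start + core_size]
    prefix, suffix = peptide[:start], peptide[start + core_size:]
    return [prefix + "".join(combo) + suffix
            for combo in product(*(allowed(r) for r in core))]


# Made-up 8-mer parent; its central two residues are H and R.
variants = permute_core("KILHRLLQ", core_size=2)
```

Because the parent residue is always among the allowed options, the parent peptide itself appears in the output and can serve as an on-array reference.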
High-throughput screening is critically important at this stage, because although a relatively narrow ligand sample space has been provided, it is reasonable to expect automated text-mining to identify a certain number of false positives for a variety of reasons, such as finding nonsense or non-binding peptide sequences. Moreover, since sequence space can expand rapidly (following 20^n, where n is the length of the ligand), being able to increase the number of ligand molecules screened at a time is highly desirable. Screening involves synthesis of the candidate ligands, and their rationally expanded ‘sequence space’ analogs, and directly testing their binding affinity towards the target. The value of the technique is thus enhanced further through systematic (rather than random all-for-all) variations in the binding ligand's chemistry (i.e., peptide sequence) to modulate the binding efficiency, as described herein.
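The growth of sequence space with ligand length can be made concrete with a short calculation, sketched below. The array capacity is taken from the upper end of the range of peptides per chip stated earlier in this document (about 2,000,000); the names used are illustrative.

```python
AMINO_ACIDS = 20  # the standard proteinogenic amino acids


def sequence_space(n: int) -> int:
    """Number of distinct peptides of length n (20 to the power n)."""
    return AMINO_ACIDS ** n


ARRAY_CAPACITY = 2_000_000  # upper end of the chip range stated in this document

for n in (4, 6, 8):
    total = sequence_space(n)
    # How many full chips an exhaustive screen of all n-mers would need.
    chips = -(-total // ARRAY_CAPACITY)  # ceiling division
    print(f"n={n}: {total:,} peptides, about {chips:,} chip(s)")
```

For n = 8 the full space is 25.6 billion peptides, far beyond a single array, which is why the focused, knowledge-based variations are screened rather than all possible sequences.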
Optimization of ligands for protein binding may be performed in numerous ways. In order to facilitate screening of the target protein against several thousand fine-tuned variations of our ligands simultaneously, for example, high-throughput screening can be achieved using a high-density custom microarray in which arrays of candidate ligands (and their permutations) customized for the target protein are built from the ground up using a state of the art automated peptide arraying technique. It will be understood by a skilled artisan that alternative methods may also be used to synthesize and screen peptide ligands in parallel, including other peptide microarray methodologies (e.g., PepSpot, inkjet printing), combinatorial libraries of peptides, and phage display. Affinity constants may also be determined using any number of methods, including on custom peptide microarrays. Iterative screening is often carried out to further refine ligand binding properties (affinity, selectivity, specificity) desirable for the required application.
For iterative screening, sequence modulation of the ligands may be carried out as follows. The core (2 or 4) amino acids of the ligand from the primary screening are kept unchanged while pairs of outlying amino acid residues (e.g., positions 1 & 8 and 2 & 7) in the sequence are permuted using BLOSUM based substitutions. This approach allows a cipher lock-like systematic tuning of the ligand interaction of the high-affinity 8-mer from the primary screen with its target ‘epitope’ in order to refine and optimize its affinity. Alanine scanning to determine immutable amino acids can also assist in controlled expansion of the ligand space.
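The iterative tuning described above may be sketched as follows: pairs of outlying positions are varied while all other residues, including the core, stay fixed, and an alanine scan replaces one position at a time. This is a sketch under assumptions: the residue groups are the interchangeable sets named earlier in this description, used in place of a score-based BLOSUM lookup, and the 8-mer hit and function names are hypothetical.

```python
from itertools import product

# Interchangeable residue groups named in this description (one-letter codes).
# Assumption: these stand in for score-based BLOSUM substitutions.
GROUPS = ["GAVLI", "FYW", "KR", "ED", "TS", "GP"]


def allowed(residue: str) -> list[str]:
    """Residues that may replace `residue`, including itself, sorted."""
    options = {residue}
    for group in GROUPS:
        if residue in group:
            options.update(group)
    return sorted(options)


def permute_pair(peptide: str, pos_a: int, pos_b: int) -> list[str]:
    """Vary two positions (1-based) jointly; all other residues stay fixed."""
    i, j = pos_a - 1, pos_b - 1
    variants = []
    for a, b in product(allowed(peptide[i]), allowed(peptide[j])):
        residues = list(peptide)
        residues[i], residues[j] = a, b
        variants.append("".join(residues))
    return variants


def alanine_scan(peptide: str) -> list[str]:
    """Replace each non-alanine position with alanine, one at a time."""
    return [peptide[:i] + "A" + peptide[i + 1:]
            for i, residue in enumerate(peptide) if residue != "A"]


hit = "KILHRLLQ"                   # made-up 8-mer from a primary screen
outer = permute_pair(hit, 1, 8)    # positions 1 & 8
inner = permute_pair(hit, 2, 7)    # positions 2 & 7
scan = alanine_scan(hit)
```

Positions whose alanine variants lose binding in the screen would be treated as immutable in later rounds, which limits how far the ligand space expands.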
The methodology, when successfully implemented, can significantly advance existing methods for generating high affinity ligands for targeting protein-protein interactions. Automation of this process into a bioinformatics program has been completed; the program can rapidly customize or refine the candidate ligand space to be screened, will vastly expedite the process of hit-to-lead discovery, and will provide a handy tool in the arsenal of drug discovery research. Once an active ligand has been identified, further studies may be required to determine, e.g., mode of interaction, potential side effects, and possibility of improving the drug-characteristics of the ligand. A notable advantage of our approach, in addition to the rapid turnaround from protein sequence to peptide ligand, is that the performance of our process can be continually enhanced as more proteins and their ligands are added to the existing databases (including through our screening and iteration process).
ESRRG (Estrogen-receptor related receptor gamma, a.k.a., ERRγ, ERR gamma, ERR3) is involved in transcriptional activation of gene expression. ESRRG has been proposed to be a prognostic breast cancer biomarker indicative of clinical outcome and sensitivity to hormonal therapy, and it is a putative therapeutic target in prostate cancer.
The ERR proteins (comprising ERR α, β, and γ) are ubiquitous, constitutively active, orphan members of the nuclear receptor superfamily that display a wide range of cell-specific gene transcriptional activities which contribute to cell maintenance, apoptosis, and metabolic pathways (Ariazi and Jordan 2006). The ERRs are closely related to the ERs, specifically within their DNA-binding (DBD) and ligand-binding (LBD) domains. Due to their shared DBD homology, ERRs are known to transactivate ERα genes through estrogen response elements (EREs). There is strong evidence suggesting that the control of overlapping gene regulatory pathways plays a role in cancer therapy response of ER+ breast cancer cells. ERRs also transcribe a different set of genes, through a unique ERR response element (ERRE), that control aspects of plasma membrane generation and the synthesis of steroid hormones, which could augment proliferation rate of tumor cells (Anderson and Clarke 2004; Clarke, Anderson et al. 2004). ESRRG was shown to directly mediate Tamoxifen resistance in a representative invasive lobular carcinoma cell model by activating transcription through cis-acting AP-1 sites (similar to ERα), but only when bound by Tamoxifen (Riggins, Lan et al. 2008). AP-1 activity is significantly upregulated in Tamoxifen-resistant cells (Dumont, Bitonti et al. 1996; Johnston, Lu et al. 1999; Schiff, Reddy et al. 2000; Zhou, Yau et al. 2007).
In a recent study (Galindo, McCormick et al. 2010), a custom oligo-array measuring global microsatellite content among individual genomic DNA samples identified a unique and reproducible pattern of 26 motif-specific microsatellite families that specifically characterize breast cancer, and 4 motif families that were cancer-type specific. The same motifs were found in tumor DNA of the same patients but not in germline DNA samples from healthy volunteers. The results indicate that some cancer patients might possess variable numbers of microsatellites that are predictive of future cancer development. A key determination was that a variable-length microsatellite in the 5′-UTR promoter region of the ESRRG gene may correlate with heritable cancer risk. Of the 22 candidate transcription factors that potentially could bind to this region of ESRRG containing the repeat (plus 100 bp flanking sequences), one (paired box 2, PAX2) is capable of binding the repeat unit itself. This finding suggests another potential mechanism of action, as PAX2 was recently implicated in ER-mediated regulation of ErbB2 (Her2, EGFR2) and resistance to Tamoxifen (Hurtado, Holmes et al. 2008).
Thus, there is mounting evidence for an important role for ESRRG in breast cancer etiology and therapy response, and as a putative biomarker for breast cancer prognosis.
Step 1: Datamining
Protein Sequence
Objective—To obtain the protein or biomarker primary sequence.
Several sources for primary sequence information exist, including protein databases (e.g., PDB, Human Mitochondrial Protein Database, AMPer, UniProt, etc.), mass spectrometry based methods (e.g., SEQUIT), and methodologies based on DNA translation, among others.
The FASTA sequence for ESRRG was obtained from the Protein Databank.
Synonyms—a.k.a. ERR3, ERRG2, KIAA0832, NR3B3 (UniProtKB)
Alternate sources for obtaining protein synonyms include, but are not limited to, the biomedical literature (Entrez PubMed), BIOMINT (gene and protein synonym database; 42 synonyms for ESRRG), UniProt, and PDB.
Gene Information: Gene information for ESRRG was obtained from Entrez. This information also is available from sundry other databases within Entrez (GenBank, UniGene, etc.). Although this information was not directly relevant to the ligand discovery process for ESRRG, it offers useful insights/information about target proteins.
Step 1.1: Dataset of Homologous Proteins 1 (DHP1)
Objective: To determine proteins sharing sequence homology to the entire protein and to domains within it (if known) using BLAST, PSI-BLAST, FASTA, etc. (similar information may be obtained through InterPro, CDD, CDART, or other protein motif, domain, or family databases).
Primary sequence of ESRRG was BLASTed using iterative PSI-BLAST with the following parameters—E (expectation threshold)=0.001 (0.005 for PSI BLAST); Max. target sequences=500; with all other parameters (word size, matrix, etc.) at ‘default’, against the NR (non-redundant protein) database.
Information of known domains or regions within ESRRG—e.g., Ligand-binding (LBD, 220-442) and DNA-binding (DBD, 121-228) domains—was obtained from the CDD (also available through UniProt, BLAST, BLink, etc.). These were also subjected to independent PSI-BLAST queries.
Step 1.2: Dataset of Homologous Proteins 2 (DHP2)
Another way of identifying proteins homologous to the target protein is through InterPro (www.ebi.ac.uk/interpro/). InterPro combines a number of member databases (including Pfam, ProSite, PRINTS, and ProDom) that use different methodologies and a varying degree of biological information on well-characterized proteins to derive protein signatures such as families, domains, repeats, and sites. InterPro permits searching the database by protein sequence (e.g., FASTA) using InterProScan or by protein name (keyword search). The results from an InterPro search list proteins sharing domains, repeats, and functional sites, broken down by ‘parent/child’ and ‘contains/found-in’ relationships to the query protein.
An InterPro query for ESRRG was carried out starting from its FASTA sequence (i.e., using InterProScan).
Step 1.3: IntAct/DIP (Database of Complementary Proteins)
Objective: To determine proteins that are known to bind to query (target) protein and its homologs. This information assists in focused textmining for protein-protein interacting sequences.
Many protein-protein interactions are mediated by small protein modules binding to short linear peptides in their protein complement. IntAct and DIP (Database of Interacting Proteins) are two well known such databases; other similar databases may also exist. These afford a way to determine interacting partners to a target query protein, which can assist in determining interacting linear peptide sequences within these partners, through customized text mining, which are suitable as starting points for ligand discovery.
Proteins that interact with ESRRG and its homologs (e.g., ER, RXR, etc. from DHP1 and DHP2 above) were queried using IntAct. Some of this information is detailed below. In examples below, ER interacts with FOXO3 (Forkhead box Protein O3) and FOXM1; likewise, RXR interacts with a set of other proteins, including RAR alpha and PPAR alpha; and IL12a p35 unit interacts with IL12b (receptor subunit b). Other proteins were datamined in similar fashion (examples not shown).
1.3.1 ER (Nuclear Hormone Receptor)
1.3.2 RXR (Retinoid X Receptor)
1.3.3 IL-12a p35 Subunit (IL12a)
Step 2: Textmining (PubMed Central/PubMed/MedLine)
Objective—To identify known peptide ligands, sequences, motifs, paratopes, etc., within the medical literature database through custom combination keyword and co-occurring term queries.
ESRRG and its homologous proteins were textmined by interrogation of the biomedical literature database using custom keyword query strings (see Table 1 below) followed by manual curation of the text to collate peptide ligands, interacting linear sequences, phage display ligands, etc. Typically, query strings comprised a combination of terms from Table 1 below. For example, “Term 1 AND Term 2 AND Term 3”; “Term 1 AND Term 2 AND Term 4”; “Term 1 AND Term 3 AND Term 4”; “Term 1 AND Term 4”; and so on.
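The combination scheme described above can be sketched in a few lines of Python. This is an illustrative sketch only: the function name `build_queries` and the placeholder terms are assumptions, the actual terms being those of Table 1, and the sketch assumes that Term 1 is the term held fixed in every query, as in the examples given.

```python
from itertools import combinations


def build_queries(terms, anchor, min_terms=2, max_terms=3):
    """Join the anchor term with every subset of the other terms using AND.

    Hypothetical helper: longer (more specific) queries are listed first.
    """
    others = [t for t in terms if t != anchor]
    queries = []
    # The anchor counts as one term, so subsets hold (size - 1) other terms.
    for size in range(max_terms - 1, min_terms - 2, -1):
        for combo in combinations(others, size):
            queries.append(" AND ".join((anchor,) + combo))
    return queries


# Placeholder terms standing in for the entries of Table 1.
terms = ["Term 1", "Term 2", "Term 3", "Term 4"]
queries = build_queries(terms, anchor="Term 1")
for q in queries:
    print(q)
```

Each resulting string could then be submitted to a literature database, with the returned text curated manually as described.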
A sampling of the queries attempted is provided in the following sections, along with context within text (in italics, within “double quotes”), reference (author, publication title in ‘single quotes’, journal reference, etc.), PMID# or PMCID#, and any other relevant details. An exhaustive search can now be carried out using our automated datamining portal Libase on the HALO/Gen website.
Query 1: Estrogen Receptor+Peptide Ligand
ESRRG #01: RIP140 peptide (“LERNNIKQAANNSLLLHLLKSQTIP”) (SEQ ID NO: 2)—(also a PDB entry).
Wang, L.; et al. ‘X-ray crystal structures of the estrogen-related receptor-gamma ligand binding domain in three functional states reveal the molecular basis of small molecule regulation’. J Biol. Chem. 2006 Dec. 8; 281(49):37773-81. Epub 2006 Sep. 21. PMID: 16990259 (ESRRG #01).
ESRRG #02: “RHKILHRLLQEGSPS” (SEQ ID NO: 3)—Steroid receptor coactivator 1 (PDB entry: 1KV6); ‘X-ray structure of the orphan nuclear receptor ERR3 ligand-binding domain in the constitutively active conformation’. PMID: 11864604 (ESRRG #02).
ESRRG #03: “KHKILHRLLQDSS” (SEQ ID NO: 4)—Steroid receptor coactivator 3 (PDB entry: 1X7R); ‘Understanding the selectivity of genistein for human estrogen receptor-beta using X-ray crystallography and computational methods’. PMID: 15576033 (ESRRG #03).
ESRRG #04: “SGSHKLVQLLTTT” (SEQ ID NO: 5)—WAY 697 (PDB entry: 1X76); ‘Structure-Based Design of Estrogen Receptor-Beta Selective Ligands’. J. Am. Chem. Soc. 126: 15106-15119. PMID: 15548008 (ESRRG #04).
ESRRG #05: SMRT Peptide “TNMGLEAIRKALMGKYDQWEE” (SEQ ID NO: 6) (see
SRC-1 NR box 4 peptide (ERβ)—“QAQQKSLLQQLLTE” (SEQ ID NO: 7);
SRC-1 NR box 2 peptide (ERα)—“LTERHKILHRLLQEGSPSD” (SEQ ID NO: 8);
cyclo-KDCILCRLLN (PERM-1) (SEQ ID NO: 9)—See Table 2.
Leduc, A. M. et al. ‘Helix-stabilized cyclic peptides as selective inhibitors of steroid receptor coactivator interactions’. PNAS 2003, 100(20), 11273-8. PMID: 13679575 (ESRRG #05).
ESRRG #06: GGSCPSSHSSLTERH—(SEQ ID NO: 10) (“residues 683-697 of Swiss-Prot entry Q15788”).
Jin, K. S., et al. ‘Small-angle X-ray scattering studies on structures of an ERRalpha ligand binding domain and its complexes with ligands and coactivators’. J. Phys. Chem. B 2008, 112(32), 9603-12. PMID: 18646811 (ESRRG #06)
Abstract—“We also synthesized steroid receptor coactivator-1 (SRC-1), a 15-mer peptide corresponding to the leucine-rich repeat 4 of human SRC-1”.
Results and Discussion, paragraph 1—“In addition, SRC-1, a 15-mer peptide corresponding to the leucine-rich repeat 4 of human SRC-1 (residues 683-697 of Swiss-Prot Entry Q15788), was synthesized”.
ESRRG #07: QQQKPQRRPCSELLKYLTTNDD—(SEQ ID NO: 11)—PPAR Gamma Coactivator-1 (PGC-1) Box 3 peptide.
KENALLKYLLDK—(SEQ ID NO: 12)—TIF-2 Box 3 peptide
Greschik, H., et al. J. Biol. Chem. 2008, 283(29), 20220-30. PMID: 18441008 (ESRRG #07)
“We co-crystallized the ERRa-LBD with a “wild-type” PGC-1α Box3 peptide (198QQQKPQRRPCSELLKYLTTNDD219)—(SEQ ID NO: 13) whereas in the case of the reported structure a shorter, mutant peptide (205RPASELLKYLTT216; C207A) (SEQ ID NO: 14) was used”.
“A favorable contribution of Arg-205, Pro-206, and Cys-207 to the ERRα/PGC-1α interaction is also interesting with respect to the crystal structure of the ERα LBD bound to a TIF-2 box3 peptide” (740KENALLKYLLDK751; PDB ID: 1GWR)—(SEQ ID NO: 21).
ESRRG #08: L3-09—PSEDRGELWRLLSVTERQN (SEQ ID NO: 22)
L3-28—MSVQYPELQRLLMAGSTEL (SEQ ID NO: 23)
L3-80—IPLSGSELSRLLLTEMPEL (SEQ ID NO: 24) (found within text; see
Table 3-Phage display peptides—“Peptides L3-09, L3-28, and L3-80 were the most effective inhibitors reducing the transcriptional activity of ERRα to basal levels”.
From Abad, M. C., et al. J. Steroid Biochem. Mol. Biol. 2008, 108(1-2), 44-54. PMID: 17964775 (ESRRG #08). ESRRG #42 was also found through the same PubMed query; Gaillard, S. et al. ‘Definition of the molecular basis for ERRα-Cofactor interactions’. Mol. Endocrinol. 2007, 21, 62-76.
ESRRG #09: DAFQLRQLILRGLQDD (SEQ ID NO: 35)—CoRNR-box sequence
Roelens, F.; et al. ‘Subtle side-chain modifications of . . . ER alpha and beta’. J. Med. Chem. 2006, 49(25), 7357-65. PMID: 17149865 (ESRRG #09)
ESRRG #10: SRC1 NR box 4—GPQTPQAQQKSLLQQLLTE (SEQ ID NO: 36)
PGC1a—EAEEPSLLKKLLLAPANTQ (SEQ ID NO: 37)
D22 (RIP140, PGC1, DAX1, SHP)—LPYEGSLLLKLLRAPVEEV (SEQ ID NO: 38) (see Table 4 below)
Peptide    Sequence
L3-80      IPLSGSELSRLLLTEMPEL
L3-09      PSEDRGELWRLLSVTERQN
L3-28      MSVQYPELQRLLMAGSTEL
Gowda, et al. Anal Biochem. ‘Development of a coactivator displacement assay for the orphan receptor estrogen-related receptor-gamma using time-resolved fluorescence resonance energy transfer’. 2006 Oct. 1; 357(1):105-15. PMID: 16889744 (ESRRG #10) (Invitrogen Corporation, Drug Discovery Solutions)
“An initial screen of these coregulator peptides bearing the coactivator LXXLL motif, the corepressor LXXI/HIXXXI/L motif, or other interaction motifs from natural coactivator sequences or random phage display peptides indicated that the peptides PGC1alpha, D22, and SRC1-4, known as class III coregulators, interacted most strongly with ERRgamma in the absence of ligand. Unliganded ERRγ also interacted with RIP140 motifs, in agreement with previously reported results”.
Peptide                          Sequence
SRC1 NR box 2                    LTARHKILHRLLQEGSPSD
SRC1 NR box 4                    GPQTPQAQQKSLLQQLLTE
PGC1α                            EAEEPSLLKKLLLAPANTQ
D22 (RIP140, PGC1, DAX1, SHP)    LPYEGSLLLKLLRAPVEEV
ESRRG #11: A63SHKQLSELLRGG75 (CBP LXXLL Motif) (SEQ ID NO: 68)
Klein, F. A.; et al. “Biochemical and NMR mapping of the interface between CREB-binding protein and ligand binding domains of nuclear receptor: beyond the LXXLL motif”. J. Biol. Chem. 2005, 280(7), 2682-92. PMID: 15542861
“The antagonist 4-hydroxytamoxifen induces a conformation of ERR-LBD in which helix H12 is dissociated from the core of the domain (46). Abolition of the interaction of CBP31-90 with ERRγ LBD by 4-hydroxytamoxifen suggests that LXXLL motif I of CBP contacts NR LBDs in a canonical manner and interacts with helix H12 in an agonist conformation. The hydrophobic cleft formed by this helix was identified previously as the site of interaction on PPARγ LBD by NMR studies using the peptide A63SHKQLSELLRGG75 (SEQ ID NO: 68) from CBP”.
PGC-1α-(205-216): RPCSELLKYLTT (SEQ ID NO: 72) (same as in entry ESRRG #07).
ESRRG #12: SRC-1-(682-697): SLTARHKILHRLLQEG (SEQ ID NO: 69) (Same as SRC-1 NR Box 2 in Entry 11).
Kallen, J.; et al. ‘Evidence for ligand-independent transcriptional activation of the hERRalpha: crystal structure of ERRalpha LBD in complex with PPAR coactivator-1 alpha’. J. Biol. Chem. 2004, 279(47), 49330-7. PMID: 15337744
“PGC-1α peptide containing the L3 site (Neosystem; sequence: RPASELLKYLTT (SEQ ID NO: 14), i.e. amino acids 205-216 of PGC-1α with the mutation C207A) was added to the protein solution”.
ESRRG #13: RHKILHRLLQEGSPS (SEQ ID NO: 70)—SRC-1 Coactivator Peptide
Greschik, H.; et al. ‘Structural basis for the deactivation of the ERRG by DES or 4-HT and determinants of selectivity’. J. Biol. Chem. 2004, 279(32), 33639-46. PMID: 15161930.
Greschik, H.; et al. ‘Structural and functional evidence for ligand-independent transcriptional activation by the ERR3’. Mol. Cell. 2002, 9(2), 303-13. PMID: 11864604
“New crystals of the ERRα apoLBD grew in the presence of a 3-fold molar excess of a SRC-1 coactivator peptide (686RHKILHRLLQEGSPS700)” (SEQ ID NO: 70).
ESRRG #14: “SLTERHKILHRLLQEGSPSDI”—(SEQ ID NO: 71) SRC-1.2 (SRC-1 NR Box 2).
Coward, P.; et al. ‘4-HT binds to and deactivates the ERRG’. PNAS 2001, 98(15), 8880-4.
Query 2: NR3B3+peptide ligand
ESRRG #15: Hentschke, M. et al. ‘Characterization of calmodulin binding to the orphan nuclear receptor ERRgamma’. Biol. Chem. 2003, 384(3), 473-82.
No peptide ligand sequences were found.
Query 3: ESRRG+peptide ligand
ESRRG #16: See Wang, et al. entry ESRRG #1.
ESRRG #17: See Gowda, et al. entry ESRRG #10.
ESRRG #18: See Greschik, et al. entry ESRRG #13.
ESRRG #19: See Hentschke, et al. entry ESRRG #15.
Query 4: Nuclear receptor subfamily 3+peptide ligand
ESRRG #20: RPCSELLKYLTT (SEQ ID NO: 72)
Kallen, J.; et al. ‘Crystal structure of hERR alpha in complex with a synthetic inverse agonist’. J. Biol. Chem. 2007, 282(32), 23231-9.
“As FRET acceptor, 1 μl of an N-terminally Cy5-labeled PGC-1a derived peptide (Cy5-RPCSELLKYLTT, (SEQ ID NO: 72), 50 nM final assay concentration) was added”.
ESRRG #21: RGAFQNLFQSV (SEQ ID NO: 73)—AR 20-30.
KENALLRYLLDK (residues 1-12 of SEQ ID NO: 74)—TIF2-III 740-751 (see
He, B.; et al. ‘Structural basis for AR interdomain and coactivator interactions suggests a transition in NR activation function dominance’. Mol. Cell 2004, 16(3), 425-38. PMID: 15525515.
“Peptide binding affinities were determined by fluorescence polarization (Stanley, et al., 2003) using 40 M R1881 and 17b-estradiol and 10 nM AR 20-30 (fluorescein-RGAFQNLFQSV) (SEQ ID NO: 73) and TIF2-III 740-751 (fluorescein-KENALLRYLLDK) (residues 1-12 of SEQ ID NO: 74). GAL4-DNA binding domain and VP16 activation domain fusion peptides were prepared.”
ESRRG #22: No sequence found—but reference to GRIP-I, P300, and DRIP 205.
Harris, J. M.; et al. ‘Characterization of RORalpha coactivator binding interface: a structural basis for ligand independent transcription’. Mol. Endocrinol. 2002, 16(5), 998-1012. PMID: 11981035.
“Previously Lau et al. (21) and Atkins et al. (22) had demonstrated ROR interaction with DRIP205, GRIP-1, and P300 using this technique”.
Query 5: “Retinoic Acid Receptor” and Peptide and Ligand—627 references
Changed query term to Query 6.
Query 6: Retinoic Acid Receptor+Binding+Domain—956 references (manually curated top 60 hits)
ESRRG #23: EKHKILHRLLQDSY (SEQ ID NO: 75)—co-activator peptide.
Chandra V, et al. ‘Structure of the intact PPAR-gamma-RXR-alpha nuclear receptor complex on DNA’. Nature, 2008, Nov. 20; 456(7220), 350-6. PMID: 19043829.
ESRRG #24: PCFXILP (SEQ ID NO: 76)—Dax-1 co-repressor of LRH.
Sablin, E P, et al. ‘The structure of corepressor Dax-1 bound to its target nuclear receptor LRH-1’. PNAS 2008, 105(47), 18390-5. PMID: 19015525.
“Superimposition of the (Dax-1)2:LRH-1 complex with the structures of hLRH-1 bound to the GRIP-1 peptide (34, 37) shows that the position and conformation of the RH in the complex are similar to those of the docked regulatory peptide. However, the RH site of mDax-1 does not have the LXXLL motif present in nuclear receptor coactivators. Instead, the RH site includes a sequence motif, 275-PCFXXLP-281 (SEQ ID NO: 339), conserved among all members of the NR0B1 subfamily. Thus, the exposed hydrophobic residues C276, I279, and L280 from the Dax-1 repressor helix substitute for the three Leu residues from the LXXLL motif of the coactivators”.
ESRRG #25: “The sequences used in this study were SRC2 NR box 2 (residues 686-698), CGSGKHKILHRLLQDSS (SEQ ID NO: 77); and the only LXXLL motif on DAX1, termed DAX1-3 (residues 142-154), CGSGQGSILYSLLTSSK (SEQ ID NO: 78).”
Valadares, N F et al. ‘Ligand induced interaction of thyroid hormone receptor beta with its coregulators’. J. Ster. Biochem. 2008, 112, 205-12.
ESRRG #26: SRC3-1 peptide (AENQRGPLESKGHKKLLQLLTSS) (SEQ ID NO: 79).
Kruse S W et al. ‘Identification of COUP-TFII Orphan Nuclear Receptor as a Retinoic Acid-Activated Receptor’, PLoS Biol. 2008, 6(9), 227.
ESRRG #27: KHKILHRLLQDSS (SEQ ID NO: 80)—Nuclear receptor coactivator 2 peptide (PDB 2plt, 2plu, 2plv)—Also see entry ESRRG #3 and
Nahoum V., et al. ‘Nuclear receptor ligand binding domains: reduction of helix H12 dynamics to favor crystallization’. Acta Cryst. 2008, F64, 614-16. PMID: 18607089
ESRRG #28: No sequence found in text—reference to AA 71-105 of FOXP3. (can be obtained through UniProt or other protein sequence database).
Du, J., et al. ‘Isoform-specific inhibition of RORα-mediated transcriptional activation by Human FOXP3’. J. Immunol. 2008, 180(7), 4785-92. PMID: 18354202.
“To map the region in FOXP3 that is responsible for the interaction with RORα, a yeast 2-hybrid test was used”. . . . “As shown in
ESRRG #29: SMRT (a.k.a. NCoR2) and NCoR Corepressor peptides—No sequences given.
Lammi, J., et al. ‘Corepressor interaction differentiates the permissive and non-permissive retinoid X receptor heterodimers’. ABB 2008, 472(2), 105-12. PMID: 18282463.
ESRRG #30: GRIP1 NR box 2 peptide 686KHKILHRLLQDSS698 (SEQ ID NO: 81).
Lu, J.; ‘The RXRalpha C-terminus T462 is a NMR sensor for coactivator peptide binding’. BBRC 2008, 366(4), 932-7. PMID: 18088598.
“Herein, we report a nuclear magnetic resonance (NMR) study of the RXRα ligand-binding domain complexed with 9-cis-retinoic acid and a glucocorticoid receptor-interacting protein 1 peptide.”
ESRRG #31: SRC1 coactivator peptide.
Ito, M. et al. ‘Ab initio fragment molecular orbital study of molecular interactions between liganded retinoid X receptor and its coactivator; part II: influence of mutations in transcriptional activation function 2 activating domain core on the molecular interactions’. J. Phys. Chem. A. 2008, 112(10), 1986-98. PMID: 18020317.
ESRRG #32: CQQQKPQRRP (PGC1-α regions necessary for binding) (SEQ ID NO: 82).
Kanaya, E. and Jingami, H. ‘The region of CQQQKPQRRP (SEQ ID NO: 82) of PGC-1α interacts with the DNA-binding complex of FXR/RXRα’. BBRC, 2006, 342(3), 734-43. PMID: 16494845.
“The results obtained using truncated PGC-1α proteins suggested that two regions are necessary for PGC-1α to interact with the DNA-binding complex of RXRα/FXR. One is the region of the second leucine-rich motif, and the other is that of the amino acid sequence CQQQKPQRRP (SEQ ID NO: 82), present between the second and third leucine-rich motifs”.
ESRRG #33: LTKTNPILYYMLQK (SEQ ID NO: 83)—Derived from RIP140; PMID: 12549917.
“Interestingly, while RIP140 contains nine copies of the LXXLL interaction motif, its ligand-enhanced interaction with holoreceptor appeared to be attributed to a novel peptide (LTKTNPILYYMLQK) (SEQ ID NO: 83) found in its carboxyl-terminal region (amino acids 1063-1076)”.
Query 7: Estrogen Receptor+Peptide+Ligand (PubMed Central)
ESRRG #34: LPALDPTKRWFFETK (SEQ ID NO: 84)—Phage display peptide with Estrogen-like activity. Venkatesh, N. A synthetic peptide with estrogen-like activity derived from a phage-display peptide library. Peptides 2002, 23(3), 573-80. PMID: 11836009.
ESRRG #35: Several peptides from Phage Display—see image below from literature text. Highlighted sequences were inferred from context within text to be candidate ligands for ESRRG.
Hall, J. M. et al. ‘Development of peptide antagonists that target estrogen receptor beta-coactivator interactions’. Mol. Endocrinol. 2000, 14(12), 2010-23. PMID: 11117531 (also see PMID 11117532, 10567548).
LXXLL
LLHLL
LISLL
LESLL
LYPLL
LTYLL
LLSLL
LIDLL
LWGLL
LWSLL
LMKLL
LVSLL
LGGLL
LEQLL
LLQLL
LLKLL
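Because many of the collated sequences are described in terms of the LXXLL coactivator motif, it can be useful to check a candidate peptide for that motif programmatically. The following is a minimal sketch, assuming a simple regular-expression scan; the helper name `find_lxxll` is hypothetical and is not part of the described method. The example peptides are taken from the entries above.

```python
import re

# Lookahead pattern so that overlapping occurrences are all reported.
LXXLL = re.compile(r"(?=(L..LL))")


def find_lxxll(sequence):
    """Return (1-based start position, matched 5-mer) for each LXXLL occurrence."""
    return [(m.start() + 1, m.group(1)) for m in LXXLL.finditer(sequence.upper())]


peptides = {
    "SRC-1 NR box 2": "LTERHKILHRLLQEGSPSD",
    "TIF-2 Box 3": "KENALLKYLLDK",
    "AR 20-30": "RGAFQNLFQSV",
}
hits = {name: find_lxxll(seq) for name, seq in peptides.items()}
for name, found in hits.items():
    print(name, found)
```

A peptide with no hit is not necessarily uninteresting; AR 20-30, for example, carries no LXXLL sequence yet is reported in the literature cited above as an interacting peptide.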
ESRRG #36: VSWFFE (SEQ ID NO: 100), VSWFFED (SEQ ID NO: 101)—Derived from mutation analysis of PhD peptide LPALDPTKRWFFETK (SEQ ID NO: 84) (see entry ESRRG #34).
Kasher, et al. ‘Design, synthesis, and evaluation of peptides with estrogen like activity’. Peptide Science 2004, 76(5), 404-420. PMID: 15468062.
“VSWFFE (EMP-1) (SEQ ID NO: 100) and VSWFFED (EMP-2) (SEQ ID NO: 101) (EMP: estrogen-mimetic peptide), bind mAb-E2 with high affinity (IC50 of 6 and 30 nM, respectively), recognize ERs with increased affinity (IC50 of 100 μM for ERα, and 100-250 μM for ERβ), and possess estrogenic activity in vivo”.
ESRRG #37: KCILCRLLQ—(H-Lys-cyclo-(d-Cys-Ile-Leu-Cys)-Arg-Leu-Leu-Gln-NH2) (SEQ ID NO: 102).
Leduc, A. M.; et al. ‘Helix-stabilized cyclic peptides as selective inhibitors of steroid receptor-coactivator interactions’. (see entry ESRRG #05).
ESRRG #38: NIFSEVRVYN (SEQ ID NO: 103) AA 795-804 of DRIP150 coactivator protein.
Lee, J and Safe, S. ‘Coactivation of estrogen receptor alpha (ER alpha)/Sp1 by vitamin D receptor interacting protein 150 (DRIP150)’. ABB 2007, 461(2), 200-210. PMID: 17306756.
“Deletion analysis of DRIP150 demonstrates that coactivation requires an alpha-helical NIFSEVRVYN (SEQ ID NO: 103) (amino acids 795-804) motif within 23 amino acid sequence (789-811) in the central region of DRIP150 and similar results were obtained for coactivation of ERalpha by DRIP150”.
ESRRG #39: VLWKLLKVV (SEQ ID NO: 104)—Sequence from R-MGMT.
Teo, et al. ‘The modified human DNA repair enzyme O(6)-methylguanine-DNA methyltransferase is a negative regulator of estrogen receptor-mediated transcription upon alkylation DNA damage’. Mol. Cell Biol. 2001, 21(20), 7105-14. PMID: 11564893.
“R-MGMT, which adopts an altered conformation, utilizes its exposed VLWKLLKVV (SEQ ID NO: 104) peptide domain (codons 98 to 106) to bind ER. This binding blocks ER from association with the LXXLL motif of its coactivator, steroid receptor coactivator-1, and thus represses ER effectively from carrying out transcription that regulates cell growth”.
Query 8: Estrogen Receptor and Peptide and Ligand (125 PMC articles)
ESRRG #40: SGSGLTSRDFGSWYA (SEQ ID NO: 105), EKHKILHRLLQDS (SEQ ID NO: 106).
Kong, et al. ‘Delineation of a unique P-P interaction site on the surface of the ER’. PNAS, 2005, 3593-98. PMID: 15728727, PDBID—2BJ4-A.
“αII Peptide Motif Interacts Specifically with Liganded ERαLBD. Previous reports of αII binding to ERα showed that the interaction occurs within the C-terminal LBD region between residues 282 and 535”.
Query 9: Retinoid X Receptor and Peptide and Ligand (58 PMC articles)
ESRRG #41: Multiple peptides—co-activators, co-repressors, and interacting protein sequences.
Folkertsma, S.; et al. ‘The use of in vitro peptide binding profiles and in silico ligand-receptor interaction profiles to describe ligand induced conformations of the RXRa LBD’. Mol. Endocrinol. 2007 21(1), 30-48. PMID: 17038419.
“FIG. 1, A and B, shows that the remaining 13 peptides equally bind to RXRα with and without 9-cis RA. The estimated Kd app values of the dose-response curves in the absence and presence of 9-cis RA are nearly equal (152 nM in the absence and 155 nM in the presence of compound), which indicates that there is no effect on the peptide binding by 9-cis RA.”.
ESRRG #42: 686RHKILHRLLQEGSPS700 (SEQ ID NO: 107)—SRC-1 Coactivator.
Groot, A. D.; et al. ‘Crystal structure of a novel tetrameric complex of agonist-bound LBD of Biomphalaria glabrata RXR’. JMB, 2005, 354, 841-53. PMID: 16274693.
Query 10: Progesterone Receptor and Peptide and Ligand (40 PMC articles, no hits)
Query 11: RXR+peptide+ligand
ESRRG #43: Crystal Structure Of Human Lrh-1 Bound With Tif-2 Peptide And Phosphatidylglycerol. (Expression_system: Escherichia Coli; Expression_system_common: Bacteria; Mol_id: 2; Synthetic: Yes; Other_details: The Peptide Was Chemically Synthesized. The Sequence Of The Peptide Is Naturally Found In Homo Sapiens (Human).)
Query 12: “ESRRG-binding domain”
ESRRG #44: Gowda K, et al. Anal Biochem. 2006 Oct. 1; 357(1):105-15. (Epub 2006 Jul. 10. PMID: 16889744) ‘Development of a coactivator displacement assay for the orphan receptor estrogen-related receptor-gamma using time-resolved fluorescence resonance energy transfer’. Invitrogen Corporation, Drug Discovery Solutions, Madison, Wis. 53719, USA.
“The estrogen-related receptor-gamma (ERRgamma) is a constitutively active orphan receptor that belongs to the nuclear receptor superfamily and is most closely related to the estrogen receptors. Although its physiological ligand is unknown, ERRgamma has been shown to interact with synthetic estrogenic compounds such as 4-hydroxytamoxifen (4-OHT), tamoxifen, and diethylstilbestrol (DES). To assess how coregulator proteins interact with ERRgamma in response to ligand, an in vitro interaction methodology using time-resolved fluorescence resonance energy transfer (TR-FRET) was developed using glutathione S-transferase (GST)-tagged ERRgamma ligand-binding domain (LBD), a terbium-labeled anti-GST antibody, a fluorescein-labeled peptide containing sequences derived from coregulator proteins, and various ligands. An initial screen of these coregulator peptides bearing the coactivator LXXLL motif, the corepressor LXXI/HIXXXI/L motif, or other interaction motifs from natural coactivator sequences or random phage display peptides indicated that the peptides PGC1alpha, D22, and SRC1-4, known as class III coregulators, interacted most strongly with ERRgamma in the absence of ligand. Given its assay window and biological relevance in energy metabolism and obesity, further studies were conducted with PGC1alpha. Fluorescein-labeled PGC1alpha peptide was displaced from the ERRgamma LBD in the presence of increasing concentrations of 4-OHT and tamoxifen, but DES was less effective in PGC1alpha displacement. The statistical parameter Z′ factor that measures the robustness of the assay was greater than 0.8 for displacement of PGC1alpha from ERRgamma LBD in the presence of saturating 4-OHT over an assay incubation time of 1-6 h, indicating an excellent assay. These findings also suggest that binding of 4-OHT, tamoxifen, or DES to ERRgamma results in differential affinity of coregulators for ERRgamma due to unique ligand-induced conformations”.
ESRRG #45: Wang L, et al. ‘X-ray crystal structures of the estrogen-related receptor-gamma ligand binding domain in three functional states reveal the molecular basis of small molecule regulation’. J Biol Chem. 2006 Dec. 8; 281(49):37773-81. Epub 2006 Sep. 21. (PMID: 16990259) (Discovery Research, GlaxoSmithKline, Research Triangle Park, N.C. 27909, USA).
“X-ray crystal structures of the ligand binding domain (LBD) of the estrogen-related receptor-gamma (ERRgamma) were determined that describe this receptor in three distinct states: unliganded, inverse agonist bound, and agonist bound. Two structures were solved for the unliganded state, the ERRgamma LBD alone, and in complex with a coregulator peptide representing a portion of receptor interacting protein 140 (RIP140). No significant differences were seen between these structures that both exhibited the conformation of ERRgamma seen in studies with other coactivators. Two structures were obtained describing the inverse agonist-bound state, the ERRgamma LBD with 4-hydroxytamoxifen (4-OHT), and the ERRgamma LBD with 4-OHT and a peptide representing a portion of the silencing mediator of retinoid and thyroid hormone action protein (SMRT). The 4-OHT structure was similar to other reported inverse agonist bound structures, showing reorientation of phenylalanine 435 and a displacement of the AF-2 helix relative to the unliganded structures with little other rearrangement occurring. No significant changes to the LBD appear to be induced by peptide binding with the addition of the SMRT peptide to the ERRgamma plus 4-OHT complex. The observed agonist-bound state contains the ERRgamma LBD, a ligand (GSK4716), and the RIP140 peptide and reveals an unexpected rearrangement of the phenol-binding residues. Thermal stability studies show that agonist binding leads to global stabilization of the ligand binding domain. In contrast to the conventional mechanism of nuclear receptor ligand activation, activation of ERRgamma by GSK4716 does not appear to involve a major rearrangement or significant stabilization of the C-terminal helix”. (See Queries 1-5).
ESRRG #46: ‘A signature motif in transcriptional co-activators mediates binding to nuclear receptors’. David M. Heery, et al. Nature Letters 1997, 387, 733-736. (see
ESRRG #47: Zhou D, et al. ‘PNRC: a proline-rich nuclear receptor coregulatory protein that modulates transcriptional activation of multiple nuclear receptors including orphan receptors SF1 (steroidogenic factor 1) and ERRalpha1 (estrogen related receptor alpha-1)’. Mol Endocrinol. 2000 July; 14(7):986-98. PMID: 10894149.
“PNRC (proline-rich nuclear receptor coregulatory protein) was identified using bovine SF1 (steroidogenic factor 1) as the bait in a yeast two-hybrid screening of a human mammary gland cDNA expression library. PNRC is unique in that it has a molecular mass of 35 kDa, significantly smaller than most of the coregulatory proteins reported so far, and it is proline-rich. PNRC's nuclear localization was demonstrated by immunofluorescence and Western blot analyses. In the yeast two-hybrid assays, PNRC interacted with the orphan receptors SF1 and ERRalpha1 in a ligand-independent manner. PNRC was also found to interact with the ligand-binding domains of all the nuclear receptors tested including estrogen receptor (ER), androgen receptor (AR), glucocorticoid receptor (GR), progesterone receptor (PR), thyroid hormone receptor (TR), retinoic acid receptor (RAR), and retinoid X receptor (RXR) in a ligand-dependent manner. Functional AF2 domain is required for nuclear receptors to bind to PNRC. Furthermore, in vitro glutathione-S-transferase pull-down assay was performed to demonstrate a direct contact between PNRC and nuclear receptors such as SF1. Coimmunoprecipitation experiment using Hela cells that express PNRC and ER was performed to confirm the interaction of PNRC and nuclear receptors in vivo in a ligand-dependent manner. PNRC was found to function as a coactivator to enhance the transcriptional activation mediated by SF1, ERR1 (estrogen related receptor alpha-1), PR, and TR. By examining a series of deletion mutants of PNRC using the yeast two-hybrid assay, a 23-amino acid (aa) sequence in the carboxy-terminal region, aa 278-300, was shown to be critical and sufficient for the interaction with nuclear receptors. This region is proline rich and contains a SH3-binding motif, S-D-P-P-S-P-S (SEQ ID NO: 340). Results from the mutagenesis study demonstrated that the two conserved proline (P) residues in this motif are crucial for PNRC to interact with the nuclear receptors.
The exact 23-amino acid sequence was also found in another protein isolated from the same yeast two-hybrid screening study. These two proteins belong to a new family of nuclear receptor coregulatory proteins”.
Candidate Ligand Browse Table
A compendium of identified peptide sequences known to bind to or interact with ESRRG or its sequence homologs is given below in Table 6. Table 6 lists each candidate sequence, its source, the protein with which it interacts, and a reference to the literature source from which it was derived. All the listed information, and the context in the literature from which each peptide was derived (e.g., Abstract, Materials and Methods, etc.), can currently be obtained through the computerized HALO/Gen web portal.
Step 3: Sequence Selection
Objective: To parse candidate ligand table to remove redundant, closely similar, or entrained (shorter sequence within larger candidate peptide ligand) sequences in order to focus the ligand space (see Table 7).
In a first step, candidate peptide ligands for ESRRG were subjected to a multiple sequence alignment algorithm (ClustalW2). Stringent alignment parameters were selected in order to prevent gaps during alignment: penalty for Gap Extension=10.0, Gap Open=100, and End Gaps=“No”. All other parameters were set at default. The sequence alignment was then taken into consideration as candidate ligands were downselected for screening. All sequences contained within larger sequences (entrained), or highly similar (pairwise similarity score of >70%, incremental scale) to other sequences, were removed. Longer sequences were retained in each case. Cysteine-containing sequences were either removed or had cysteine substituted by alanine. Only sequences longer than 8 amino acids were considered; shorter sequences were culled.
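Part of this down-selection can be illustrated with a short Python sketch. It is a simplified illustration under stated assumptions, not the procedure as actually run: the function name `downselect` is hypothetical, cysteine is always substituted by alanine (one of the two options described), the substitution is applied before the entrainment check, and the ClustalW2 pairwise similarity filter (>70%) is omitted because it requires the alignment output.

```python
def downselect(candidates, min_length=9):
    """Apply Cys->Ala substitution, a length cutoff, and entrained-sequence removal.

    Assumption: min_length=9 encodes "longer than 8 amino acids".
    """
    # Substitute cysteine by alanine; a set removes exact duplicates.
    substituted = {seq.upper().replace("C", "A") for seq in candidates}
    # Cull sequences of 8 residues or fewer.
    long_enough = [s for s in substituted if len(s) >= min_length]
    # Visit longest sequences first so the longer sequence is always retained.
    long_enough.sort(key=lambda s: (-len(s), s))
    kept = []
    for seq in long_enough:
        if not any(seq in longer for longer in kept):
            kept.append(seq)
    return kept


candidates = [
    "QQQKPQRRPCSELLKYLTTNDD",  # PGC-1 Box 3 peptide
    "RPCSELLKYLTT",            # entrained within the peptide above
    "RPASELLKYLTT",            # C207A mutant of the same region
    "KENALLKYLLDK",            # TIF-2 Box 3 peptide
    "VSWFFE",                  # shorter than the cutoff
]
selected = downselect(candidates)
print(selected)
```

In this sketch the two 12-mer PGC-1 peptides collapse into the 22-mer after substitution, and the 6-mer is culled by the length cutoff.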
Candidate Ligand Down-Selection from ClustalW2 Input 1
Step 4: Candidate Ligand Walking and Sequence Expansion
Ligands of longer than 8 AAs were ligand (or frame) walked 1 amino acid at a time to create a library of candidate 8-mer parent ligands. Sequence space was then expanded in the manner described below to create an 8-mer superlibrary of parent peptides.
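Ligand (frame) walking as described here amounts to a sliding window of width 8 moved one residue at a time. The following is a minimal sketch under that reading; the function names and the example sequence are hypothetical.

```python
def ligand_walk(sequence, width=8):
    """Return every contiguous `width`-mer, stepping one residue at a time."""
    return [sequence[i:i + width] for i in range(len(sequence) - width + 1)]


def parent_library(ligands, width=8):
    """Walk each ligand and collect the unique 8-mer parents in first-seen order."""
    seen = {}
    for ligand in ligands:
        for frame in ligand_walk(ligand, width):
            seen.setdefault(frame, None)  # dict preserves insertion order
    return list(seen)


# Hypothetical 10-residue ligand: yields 10 - 8 + 1 = 3 parent 8-mers.
frames = ligand_walk("ACDEFGHIKL")
```

A ligand of length n contributes n - 7 parent 8-mers, so the size of the parent library grows linearly with the total length of the retained candidates.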
The BLOSUM62 matrix was divided into tiers, with Tier 1 comprising “most conserved” (lowest penalty) and Tier 3 comprising “least conserved” (highest penalty) mutations observed in nature. Sequences for microarray production for the primary stage (Iteration 1) of screening for protein binding were created by permuting AA positions 4 and/or 5 only (i.e., core of sequence) of parent 8-mer peptides using Tier 1 and Tier 3 substitutions. This allows the systematic variation of the sequence in order to identify the most important motif within the larger candidate ligand sequence that is needed for binding the target protein. All peptides on the array were grown in the C→N direction, with the exposed N-terminal amino acid masked by an acetyl group to prevent folding of the sequence and enable full presentation of the peptide to its target protein.
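The core permutation can be sketched as a Cartesian product over the allowed residues at each permuted position. The tier assignments themselves are not listed in this section, so the substitution table below (`TIER1_EXAMPLE`) is a hypothetical placeholder, as are the function name and the parent sequence. Positions are zero-based in the code, so AA positions 4 and 5 correspond to indices 3 and 4.

```python
from itertools import product

# Hypothetical substitution table: residue -> allowed replacements.
# The real Tier 1 / Tier 3 tables derived from BLOSUM62 are assumptions here.
TIER1_EXAMPLE = {"L": ["I", "M"], "S": ["T"]}


def permute_positions(parent, substitutions, positions=(3, 4)):
    """Vary only the given zero-based positions of a parent peptide.

    Each permuted position may keep its original residue or take any
    substitution listed for it, which covers "position 4 and/or 5".
    The parent itself is the first element of the result.
    """
    options = [[aa] for aa in parent]
    for pos in positions:
        original = parent[pos]
        extra = [s for s in substitutions.get(original, []) if s != original]
        options[pos] = [original] + extra
    return ["".join(combo) for combo in product(*options)]


# Hypothetical parent 8-mer with L at position 4 and S at position 5.
variants = permute_positions("AAALSAAA", TIER1_EXAMPLE)
```

The same function, called with `positions=(1, 2, 5, 6)`, would hold the core fixed and vary the outer residues 2, 3, 6, and 7, which is the pattern described for the iterative arrays.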
Screening was typically carried out by hybridization of the protein (1 μg/mL) against the array for 1 h at 4° C.; detection of bound protein was carried out using a fluorescently labeled anti-ESRRG mAb. (Direct labeling of ESRRG and alternative detection strategies are equally valid.)
Highest binders (based on relative fluorescence signal) were selected and iteratively sequence expanded and arrayed (Iteration 2); and ESRRG binding sequences were selected as above. For iterative arrays, the core positions (AA 4 and 5) were kept unchanged while the outer amino acids (2, 3, 6, and/or 7) were permuted using Tier 3 substitutions.
Step 5: Screening and Assays
Daughter candidate ligands and their new Tier 3 and/or Tier 1 based combinations were synthesized in situ on an addressable peptide microarray and screened against native human ESRRG protein in order to determine sequences that bind to this protein. Bound protein was detected by a fluorescently labeled anti-ESRRG mAb. Results were obtained in fluorescence intensity units, which were used to rank the peptide ligands. Excess (100×) competing biological matrices (human serum, HSA/BSA or bacterial protein extract) were used as backgrounds to select for ligands specific to the protein. Other competing proteins can be added to the background for specificity selection and directed evolution of ligands.
Iteration 1: Results of Screening
The top 50 strongest binding ligands (ranked by Log2 [Signal/Background] and fluorescence signal) from Iteration 1 screening against ESRRG are listed in Table 9. All peptides were clearly distinct from their parent 8-mer and showed significantly higher signal from protein binding than their respective parent sequences.
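The ranking criterion named above (Log2 [Signal/Background], with fluorescence signal as the secondary key) can be sketched as follows. The data layout, the function name, and the peptide labels are hypothetical; the readings are placeholders and are not values from Table 9.

```python
import math


def rank_binders(readings, top_n=50):
    """Rank peptides by log2(signal/background), then by raw signal.

    `readings` maps a peptide identifier to a (signal, background) pair
    in fluorescence intensity units. Non-positive values are skipped
    because the log ratio is undefined for them.
    """
    scored = []
    for peptide, (signal, background) in readings.items():
        if signal > 0 and background > 0:
            scored.append((peptide, math.log2(signal / background), signal))
    scored.sort(key=lambda row: (row[1], row[2]), reverse=True)
    return scored[:top_n]


# Placeholder readings for illustration only.
demo = {"pep1": (800.0, 100.0), "pep2": (400.0, 100.0), "pep3": (1600.0, 100.0)}
ranked = rank_binders(demo)
```

Using the log ratio as the primary key keeps a spot with high raw signal but equally high local background from outranking a cleaner, more specific binder.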
Iteration 2: Results of Screening
Iterative sequence expansion and arraying of 10 highest binding ligands to ESRRG from Iteration 1 was carried out and the new peptide sequences were synthesized on an array and screened against the protein, as described above. In this step, the central 2 amino acids (core) in peptide sequences were maintained unchanged while outlying amino acids (2, 3, 6, and 7) were permuted (using Tier 1 and Tier 3 substitutions) in a ‘cipher’ or ‘combination’ lock approach. A list of the top 50 ligands from iterative screening is given in Table 10. Further affinity maturation of sequence was observed, with multifold enhancement in binding signal of Iteration 2 sequences (i.e., Granddaughters) compared to their respective parents from Iteration 1 (i.e., Daughters) on the same array.
Binding Affinity Measurement
The binding affinities (KD) of more than 150 of the strongest-binding ligands from Iteration 1 and Iteration 2 screening were measured using a custom peptide microarray “density” chip-based experiment. Several ligands with nM range affinities were obtained. A list of the 30 highest affinity ligands discovered for ESRRG is given in Table 11.
In Vitro Assay
A subset (14) of the highest affinity peptide ligands to ESRRG was selected to be assayed for inverse agonist activity for the interaction of ESRRG with its co-activator PGC1α using TR-FRET (time-resolved Förster Resonance Energy Transfer) (
Annexins (ANX) are a family of ubiquitous, Ca2+ dependent, phospholipid binding cellular proteins, common to plants and animals, many of which are involved in membrane organization, membrane traffic, and regulation of Ca2+ concentrations within cells. Deregulation in annexin expression and activity has been correlated with several human diseases characterized as annexinopathies. Certain annexins (e.g. ANX-A1 and ANX-A5) are also proposed to be disease biomarkers, such as for lung exposure to diesel exhaust particles (Lewis, Rao et al. 2007).
All annexins share a core domain of four similar repeats, approximately 70 amino acids long, but comprise varying amino terminal regions. The vertebrate annexins share only 45-55% homology overall but display ca. 80% homology among the repeat regions, and they reveal conservation of their secondary and tertiary structures. Thus, candidate ligands (to shared regions of homology; i.e., excluding ligands to variable N-terminal regions) are expected to be shared among the annexins, using the principle applied in this approach.
Candidate ligands to Annexin A1 and A5, which share 44% overall homology, were obtained through data and textmining strategies that have been described in detail previously. The following is a brief description.
Step 1.0: Datamining
Protein Sequence
Synonyms (UniProtKB)—a.k.a. ANX-1, Annexin I, ANX-V, Lipocortin I, Calpactin-2, Calpactin II, Chromobindin-9, p35, Phospholipase A2 inhibitory protein.
Synonyms (UniProtKB)—a.k.a. ANX-5, Annexin V, ANX-V, Lipocortin V, Endonexin II, Calphobindin I, CBP-1, Placental anticoagulant protein 1 (PAP-1), Placental anticoagulant protein 4 (PP4), Thromboplastin inhibitor, Vascular anticoagulant alpha, Anchorin CII.
Step 1.1: Dataset of Homologous Proteins (DHP1)
The primary sequences of ANXA1 and ANXA5 were ‘BLAST’ed iteratively using PSI-BLAST with the following parameters: E (expectation threshold)—0.001 (0.005 for PSI BLAST); Max. target sequences—500; with all other parameters (word size, matrix, etc.) at ‘default’, against the NR (non-redundant protein) database.
An identical set of sequence-homologous proteins was found through iterative PSI-BLAST (total of 3 iterations). The proteins with sequence homologies to ANXA1 and ANXA5 are given in the above table. These homologous sequence proteins are used in subsequent text mining searches (Step 2 below). The text mining searches identify words, phrases, and text containing the names of the homologous sequence proteins co-occurring with words, phrases, text, abbreviations, or annotation indicative of peptide sequences and peptide ligands.
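For readers who wish to reproduce a comparable search locally, the parameters stated above can be mapped onto the NCBI BLAST+ `psiblast` command line. The sketch below only assembles the argument list and does not run the search. It rests on several assumptions: that a local copy of the NR database is available, that the 0.005 value corresponds to the PSI-BLAST inclusion threshold (`-inclusion_ethresh`), and that the file names are hypothetical. The searches described in the text may equally have been run through a web interface.

```python
def psiblast_command(query_fasta, out_path, db="nr"):
    """Build an argument list for NCBI BLAST+ psiblast (not executed here)."""
    return [
        "psiblast",
        "-query", query_fasta,           # FASTA file with the ANXA1 or ANXA5 sequence
        "-db", db,                        # non-redundant protein database
        "-evalue", "0.001",               # expectation threshold
        "-inclusion_ethresh", "0.005",    # assumed mapping of the PSI-BLAST threshold
        "-num_iterations", "3",           # total of 3 iterations
        "-max_target_seqs", "500",        # max. target sequences
        "-out", out_path,
    ]


# Hypothetical file names, for illustration only.
cmd = psiblast_command("anxa5.fasta", "anxa5_psiblast.txt")
```

The list can be passed to `subprocess.run(cmd, check=True)` on a machine where BLAST+ and the database are installed.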
Step 1.2: InterPro (DHP2)
An InterPro query for ANXA1 and ANXA5 was carried out starting from their FASTA sequences (i.e., using InterProScan). Both proteins comprise a common ‘annexin repeat’ region, as inferred earlier. Various members of the annexin family were found as children of the parent Annexin: ANXA 1, 3, 5, 13, etc.
Step 1.3: IntAct/DIP (Database of Complementary ‘Interacting’ Proteins)
IntAct is an open-source database system with analysis tools for protein interaction data. An IntAct database search was performed. Proteins known to interact with ANXA1 and ANXA5 are listed below.
ANXA5—DNA methyltransferase 1, DNMT1; basophilic leukemia-expressed protein; ATP-dependent helicase ATRX; translation initiation factor, IF-2; farnesyl diphosphate farnesyltransferase, FPP.
ANXA1—ubiquitin thioesterase 25; protein phosphatase 2C; beta-actin; glucose transporter type 5, Glut4; IKBA, I-kappa-B-alpha, major histocompatibility complex enhancer-binding protein, MAD3.
Other annexin family proteins interact with F-actin, profilin, non-erythroid spectrin (fodrin), calspectin, synapsin, GFAP, etc.
This information on interacting proteins is used in subsequent text mining (Step 2 below). Text mining selectively identifies words, phrases, or text containing the names of the interacting proteins, co-occurring in text containing words, phrases, abbreviations, or annotation associated with peptide amino acid sequences or peptide ligands. By this process, information is obtained that indicates possible amino acid sequences involved in interactions with ANXA1 and ANXA5. From this sequence information, candidate peptide ligands able to interact with ANXA1 and ANXA5 will be sourced.
Step 2: Textmining (PubMed Central)
Text mining is performed where the search terms include the proteins and homologs identified in Step 1.0 along with co-occurring words, text, phrases, or abbreviations that designate amino acid sequences and peptide ligands; e.g., peptide, ligand, motif, sequence, fragment, region, epitope, paratope, amino acid, etc.
For example, a PubMed query using the terms ‘Annexin and Peptide and Ligand’ was carried out, which leads to several references including a review on ‘Annexins’ (Genome Biology, 2004; Moss and Morgan 2004). Perusal of the journal article further provided proteins that interact with Annexins (a subset of those also obtained through a query of the IntAct database). A subsequent search of PubMed using combination keywords was conducted. For example,
1. (Annexin A5) and (Collagenase Type II) and binding (or peptide/ligand/motif/epitope/region/sequence, etc.);
2. (Annexin A7) and Sorcin and binding (or peptide/ligand/motif/epitope/region/sequence, etc.); and so on.
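The combination-keyword searches above follow a regular pattern (target protein, interacting partner, and one of several sequence-indicating terms), so the query strings can be generated programmatically. The sketch below only builds Boolean query text and makes no calls to any search service; the function name and the exact quoting and operator style are assumptions.

```python
# Terms that designate amino acid sequences and peptide ligands (from Step 2).
SEQUENCE_TERMS = [
    "binding", "peptide", "ligand", "motif", "sequence",
    "fragment", "region", "epitope", "paratope", "amino acid",
]


def build_query(target, partner=None, terms=SEQUENCE_TERMS):
    """Combine a target, an optional interacting partner, and sequence terms."""
    parts = ['("%s")' % target]
    if partner:
        parts.append('("%s")' % partner)
    parts.append("(" + " OR ".join('"%s"' % t for t in terms) + ")")
    return " AND ".join(parts)


def build_queries(pairs, terms=SEQUENCE_TERMS):
    """One query per (target, partner) pair, e.g. from the IntAct results."""
    return [build_query(t, p, terms) for t, p in pairs]


queries = build_queries(
    [("Annexin A5", "Collagenase Type II"), ("Annexin A7", "Sorcin")],
    terms=["binding", "peptide"],
)
```

Generating queries from the lists of homologous and interacting proteins keeps the search systematic, while the manual curation described next remains the step that actually extracts peptide sequences from the retrieved literature.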
Ensuing literature was curated manually for references to peptide sequences, motifs, etc.; or referrals to other literature containing the same. A non-exhaustive list of the candidate ligand table for Annexin A1 and Annexin A5 is shown (Table 12).
The same candidate ligand set was used for ANXA1 and ANXA5 screening, due to significant shared homology and homologous proteins between Annexin family proteins.
Step 3: Candidate Ligand Selection for Array
A subset of ligands (Table 13) was selected for automated sequence expansion and peptide microarray production, as previously described. Only the core 2 amino acids were permuted to create an 8-mer superlibrary.
Step 4: Screening and Data Analysis
Study 1: The intent of this study was to demonstrate that a customized ligand space shared by closely similar proteins may be used to obtain distinct ligands for those proteins. The example provided examines the case of ANXA1 and ANXA5.
Both ANXA1 and ANXA5 were directly labeled with a fluorescent tag (Cy 3 and Cy 5, respectively) and independently screened against a custom microarray (ca. 3900) of 8-mer peptides (from ligand walking of ANXA5 candidate ligands) and their permutations generated from the substitution matrix (Table 14). Due to the expected presence of these proteins in serum, each protein screening was carried out in the presence of a 100-fold excess of BSA that simulated a complex background of proteins.
The results of hybridization of fluorescently labeled ANXA1 and ANXA5 against an identical peptide microarray are shown below. Input sequences clearly did not favor any single amino acid (
Iterative screening for affinity maturation is underway. It is anticipated that affinity maturation will provide high-affinity peptide ligands that can selectively and specifically bind/detect/modulate only ANXA1 or ANXA5.
Study 2: On-chip binding affinity measurements were carried out for a set of highest-binding Iteration 1 (Iter 1) ligands to ANXA5 from the primary screening experiment. A range of affinities, from nM to low mM, was obtained. The sequences showing the highest affinities, together with their affinity values, are displayed in Table 15. The lineage of all Iter 1 ligands could be traced easily to their respective parents. The two families of highest Iter 1 binders came from two interacting proteins of ANXA5: chondrocalcin and αvβ5-integrin (see Table 15). Iterative array generation and screening is anticipated to lead to further affinity maturation to provide Iter 2 (and higher generation) ligands with nM to sub-nM affinities.
Study 3: The methodology could be used for candidate ligand generation using HALO/Gen and direct competitive screening of a customized peptide ligand array comprising shared candidate ligands to highly sequence similar proteins within a family (e.g., annexins, estrogen receptors (ERs), ERRs, interleukins) in order to evolve selective or highly-specific high-affinity ligands to such proteins. In this example, ANXA1 and ANXA5, orthogonally labeled with Cy-3 and Cy-5 fluorescent tags, were screened simultaneously against a custom ligand microarray of ca. 13,000 ligands generated from Tier-1 permuted sequence combinations of the entire candidate ligand library for annexins (Table 13). The results (consensus chart of high affinity ligand motifs) of the competitive screening study are demonstrated in
The studies described hereinabove validate the methodology of the present invention. Proteins could be datamined for homology and textmined for candidate ligands. Array screening allowed parallel interrogation of several thousand rationally designed peptide ligands leading to the discovery of several high-affinity, high value candidates, which are expected to have multiple applications including in drug discovery and as antibody replacements. Since the design of the customized ligand space sampled through screening stems only from knowledge of protein sequence, it will be possible by this approach to derive high affinity peptide ligands to even proteins whose structures are not available (e.g., G-Protein Coupled Receptors, GPCRs). This technology offers the potential to very quickly generate lead candidates for major disease protein targets.
It is contemplated that any embodiment discussed in this specification can be implemented with respect to any method, kit, reagent, or composition of the invention, and vice versa. Furthermore, compositions of the invention can be used to achieve methods of the invention.
It will be understood that particular embodiments described herein are shown by way of illustration and not as limitations of the invention. The principal features of this invention can be employed in various embodiments without departing from the scope of the invention. Those skilled in the art will recognize, or be able to ascertain using no more than routine experimentation, numerous equivalents to the specific procedures described herein. Such equivalents are considered to be within the scope of this invention and are covered by the claims.
All publications and patent applications mentioned in the specification are indicative of the level of skill of those skilled in the art to which this invention pertains. All publications and patent applications are herein incorporated by reference to the same extent as if each individual publication or patent application was specifically and individually indicated to be incorporated by reference.
The use of the word “a” or “an” when used in conjunction with the term “comprising” in the claims and/or the specification may mean “one,” but it is also consistent with the meaning of “one or more,” “at least one,” and “one or more than one.” The use of the term “or” in the claims is used to mean “and/or” unless explicitly indicated to refer to alternatives only or the alternatives are mutually exclusive, although the disclosure supports a definition that refers to only alternatives and “and/or.” Throughout this application, the term “about” is used to indicate that a value includes the inherent variation of error for the device, the method being employed to determine the value, or the variation that exists among the study subjects.
As used in this specification and claim(s), the words “comprising” (and any form of comprising, such as “comprise” and “comprises”), “having” (and any form of having, such as “have” and “has”), “including” (and any form of including, such as “includes” and “include”) or “containing” (and any form of containing, such as “contains” and “contain”) are inclusive or open-ended and do not exclude additional, unrecited elements or method steps.
For the purposes of this invention, homologous proteins refer to molecules that possess an amino acid sequence that matches or partially matches the amino acid sequence found in the target or query protein; each such molecule is a “homologous protein.”
Peptide is used to denote an oligomer in which the monomers are alpha amino acids joined together through amide bonds. A “peptide” can also be referred to as a “polypeptide”. In the context of this invention, one should appreciate that the amino acids may be the L-optical isomer or the D-optical isomer.
Standard single letter abbreviations for amino acids are used (e.g., P for proline). When incorporated into a protein, an amino acid is referred to as an “amino acid residue”. The term “protein” is understood to include the terms “polypeptide” and “peptide” (which, at times, may be used interchangeably herein) within its meaning. The polypeptide's “primary structure” is the particular amino acid sequence of a given protein written from the amino-terminus to carboxy-terminus.
As is the case for all proteins, the precise chemical structure depends on a number of factors. As ionizable amino and carboxyl groups are present in the molecule, a particular protein may be obtained in an acidic or basic salt, or in a neutral form. All such forms are included in the definition.
Furthermore, the primary amino acid sequence may be augmented by derivatization using sugar moieties (glycosylation) or by other supplementary molecules such as lipids, phosphate, acetyl groups, and the like. The primary amino acid structure may also aggregate to form complexes. In any event, such modifications are included in the definition.
Also, individual amino acid residues in the chain may be modified by oxidation, reduction, or other derivatization, and the protein may be cleaved to obtain fragments which retain binding activity. Such alterations which do not destroy binding activity do not remove the protein sequence from the definition.
In the case of naturally occurring proteins, an amino acid residue's R group differentiates the 20 amino acids from which proteins are synthesized, although one or more amino acid residues in a protein may be derivatized or modified following incorporation into protein in biological systems (e.g., by glycosylation and/or by the formation of cystine through the oxidation of the thiol side chains of two non-adjacent cysteine amino acid residues, resulting in a disulfide covalent bond that frequently plays an important role in stabilizing the folded conformation of a protein, etc.).
As those in the art will appreciate, non-naturally occurring amino acids can also be incorporated into proteins, particularly those produced by synthetic methods, including solid state and other automated synthesis methods. Examples of such amino acids include, without limitation, α-amino isobutyric acid, 4-amino butyric acid, L-amino butyric acid, 6-amino hexanoic acid, 2-amino isobutyric acid, 3-amino propionic acid, ornithine, norleucine, norvaline, hydroxyproline, sarcosine, citrulline, cysteic acid, t-butylglycine, t-butylalanine, phenylglycine, cyclohexylalanine, β-alanine, fluoro-amino acids, designer amino acids (e.g., β-methyl amino acids, α-methyl amino acids, N-α-methyl amino acids) and amino acid analogs in general.
In addition, when an α-carbon atom has four different groups (as is the case with the 20 amino acids used by biological systems to synthesize proteins, except for glycine, which has two hydrogen atoms bonded to the α-carbon atom), two different enantiomeric forms of each amino acid exist, designated D and L. In mammals, only L-amino acids are incorporated into naturally occurring polypeptides. Of course, the instant invention envisions proteins incorporating one or more D and L-amino acids, as well as proteins comprised of just D or L-amino acid residues.
To confer resistance to proteolysis, peptide bonds may also be substituted by retro-inverso pseudopeptide bonds. According to this modification, the amino acid sequence of the modified peptide may be identical to the sequence of the native L-amino acid sequence, except that one or more of the peptide bonds are replaced by a retro-inverso pseudopeptide bond. The synthesis of peptides with one or more reduced retro-inverso pseudopeptide bonds is known in the art.
A protein is “synthetic” when produced by in vitro chemical (e.g., solid phase peptide synthesis) or enzymatic synthesis. The peptides of this invention, including analogs and other modified variants, may generally be prepared following known techniques. Preferably, synthetic production of the peptides of the invention may be according to the solid phase synthesis method. Alternatively, peptides of this invention may be prepared in recombinant systems using polynucleotide sequences encoding the peptides. It is understood that a peptide of this invention may contain more than one of the above described modifications within the same peptide.
Herein, the following abbreviations may be used for the following amino acids (and residues thereof): alanine (Ala, A); arginine (Arg, R); asparagine (Asn, N); aspartic acid (Asp, D); cysteine (Cys, C); glycine (Gly, G); glutamic acid (Glu, E); glutamine (Gln, Q); histidine (His, H); isoleucine (Ile, I); leucine (Leu, L); lysine (Lys, K); methionine (Met, M); phenylalanine (Phe, F); proline (Pro, P); serine (Ser, S); threonine (Thr, T); tryptophan (Trp, W); tyrosine (Tyr, Y); and valine (Val, V). Non-polar (hydrophobic) amino acids include alanine, leucine, isoleucine, valine, proline, phenylalanine, tryptophan, and methionine. Neutral amino acids include glycine, serine, threonine, cysteine, tyrosine, asparagine, and glutamine. Positively charged (basic) amino acids include arginine, lysine and histidine. Negatively charged (acidic) amino acids include aspartic acid and glutamic acid.
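The four groupings above can be expressed as a small lookup table keyed by single-letter code, which is convenient when annotating peptide sequences. This is an illustrative sketch only; the dictionary and function names are hypothetical.

```python
# Residue classes exactly as grouped in the preceding paragraph.
RESIDUE_CLASSES = {
    "non-polar": set("ALIVPFWM"),  # Ala, Leu, Ile, Val, Pro, Phe, Trp, Met
    "neutral": set("GSTCYNQ"),     # Gly, Ser, Thr, Cys, Tyr, Asn, Gln
    "basic": set("RKH"),           # Arg, Lys, His
    "acidic": set("DE"),           # Asp, Glu
}


def classify_residue(code):
    """Return the class name for a single-letter amino acid code."""
    code = code.upper()
    for name, members in RESIDUE_CLASSES.items():
        if code in members:
            return name
    raise ValueError("not one of the 20 standard residues: %r" % code)


def class_profile(peptide):
    """Map each residue of a peptide to its class, in sequence order."""
    return [classify_residue(aa) for aa in peptide]
```

For instance, `class_profile` applied to the motif SDPPSPS quoted earlier labels the serines as neutral, the aspartic acid as acidic, and the prolines as non-polar.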
In addition, proteins comprising multiple polypeptide subunits (e.g., DNA polymerase III, RNA polymerase II) or other components (for example, an RNA molecule, as occurs in telomerase) will also be understood to be included within the meaning of “protein” as used herein. Similarly, fragments of proteins and polypeptides are also within the scope of the invention and may be referred to herein as “proteins”.
A peptidomimetic is a molecule that mimics the biological activity of a peptide but is no longer peptidic in chemical nature. By strict definition, a peptidomimetic is a molecule that no longer contains any peptidic bonds. However, the term peptidomimetic is sometimes used to describe molecules that are no longer completely peptidic in nature, such as pseudo-peptides, semi-peptides, and peptoids. Whether completely or partially non-peptide, peptidomimetics according to this invention provide a spatial arrangement of reactive chemical moieties that closely resembles the three-dimensional arrangement of active groups in the peptide on which the peptidomimetic is based. As a result of the similar active-site geometry, the peptidomimetic has effects on biological systems which are similar to the biological activity of the peptide.
The present invention encompasses peptidomimetic compositions which are analogs that mimic the activity of biologically active peptides according to the invention.
The term “or combinations thereof” as used herein refers to all permutations and combinations of the listed items preceding the term. For example, “A, B, C, or combinations thereof” is intended to include at least one of: A, B, C, AB, AC, BC, or ABC, and if order is important in a particular context, also BA, CA, CB, CBA, BCA, ACB, BAC, or CAB. Continuing with this example, expressly included are combinations that contain repeats of one or more items or terms, such as BB, AAA, AAB, BBC, AAABCCCC, CBBAAA, CABABB, and so forth. The skilled artisan will understand that typically there is no limit on the number of items or terms in any combination, unless otherwise apparent from the context.
As used herein, words of approximation such as, without limitation, “about”, “substantial” or “substantially” refer to a condition that when so modified is understood to not necessarily be absolute or perfect but would be considered close enough to those of ordinary skill in the art to warrant designating the condition as being present. The extent to which the description may vary will depend on how great a change can be instituted and still have one of ordinary skill in the art recognize the modified feature as still having the required characteristics and capabilities of the unmodified feature. In general, but subject to the preceding discussion, a numerical value herein that is modified by a word of approximation such as “about” may vary from the stated value by at least ±1, 2, 3, 4, 5, 6, 7, 10, 12 or 15%.
All of the compositions and/or methods disclosed and claimed herein can be made and executed without undue experimentation in light of the present disclosure. While the compositions and methods of this invention have been described in terms of preferred embodiments, it will be apparent to those of skill in the art that variations may be applied to the compositions and/or methods and in the steps or in the sequence of steps of the method described herein without departing from the concept, spirit and scope of the invention. All such similar substitutes and modifications apparent to those skilled in the art are deemed to be within the spirit, scope and concept of the invention as defined by the appended claims.
This application is a non-provisional application of U.S. Provisional Patent Application Ser. No. 61/452,025, filed on Mar. 11, 2011 and entitled “METHODS FOR DISCOVERING MOLECULES THAT BIND TO PROTEINS,” the entire contents of which are incorporated herein by reference.
Number | Date | Country
---|---|---
61452025 | Mar 2011 | US