Representing and understanding the three-dimensional structural information of biological molecules is becoming a critical step in the rational drug discovery process. With the advent of massive virtual chemical library screening, as well as the recent advancements in X-ray crystallography, NMR and homology modeling techniques, the amount of structural information is increasing rapidly. The traditional analysis methods are inadequate and inefficient in dealing with such massive structural information.
The past decade has seen an explosion of the three-dimensional structural information of biologically important molecules, due to the recent developments of X-ray crystallography, NMR and molecular modeling techniques. There are currently more than 20,000 structures deposited in the Protein Data Bank, and a significant portion of these structures contain ligands bound to macromolecules. In addition, combinatorial chemistry and virtual library screening are becoming routine procedures in the drug discovery process. This process generates thousands to millions of virtual protein-ligand complex structures, making detailed examination of these structures a daunting task. Representing the three-dimensional structural information of macromolecules efficiently is challenging because of the complexity of identifying residues and atomic interactions. Representing the covalent or non-covalent interactions between molecules poses an even greater challenge, because in addition to the geometric location of each interaction, the direction, type, and magnitude of the interaction are also important and need to be captured. Understanding the intermolecular interactions between proteins and their ligands provides insights into the functional mechanism of the proteins. It is important for structure-based drug design to understand the key forces between small molecules (SMs) and proteins and to be able to compare different orientations or different small molecules binding to the same receptor site, or different binding sites.
Traditionally, understanding and comparing the interactions between proteins and ligands is achieved by visually inspecting an individual structure with structure-rendering software on a graphic terminal. The inspection is sometimes facilitated by other software tools that generate 2-D or 3-D schematic representations of the interactions (e.g., LIGPLOT™). Such time-consuming processes require human intervention and become more and more tedious as the number of complex structures increases. It is important for successful drug discovery to have a tool that allows this massive amount of structural information to be organized and analyzed.
More recently, structure-based virtual chemical library screening has become a common procedure in the drug discovery process. Virtual library screening typically generates hundreds of thousands of virtual protein-ligand complex structures. Effectively mining this massive structural library becomes a tremendous task, as it is impossible to analyze the structures individually. Traditionally, different types of empirical docking scores and some pharmacophoric filters are used to sift the docking results for tight binders with desired binding interactions. However, these methods have limitations. Correlation between good docking scores and high activity is not always satisfactory. The docking scores are an overall summation of interaction and do not discern differences in binding modes. Therefore, a method that allows accurate representation of the interaction and fast analysis of a large number of structures is in great demand.
Energy-based scoring schemes for ranking predicted poses from receptor-based virtual screening (docking, or VS) are well known. In order to address the limitations inherent in traditional scoring functions, a variety of “knowledge-based” or “target-biased” approaches have been developed that impose constraints based on ligand or receptor pharmacophores thought to be required for activity. However, the success of the VS strategy is dependent on the application of constraints derived from knowledge of how small molecule inhibitors bind at the active site. These constraints typically filter virtual libraries based on the presence of known binding motifs, or the ability to satisfy key interactions with the receptor. However, the ability to apply constraints during VS that predict the selectivity of inhibitors for one protein over another is a much more challenging problem that has not been widely addressed.
A method is provided for generating a structural interaction fingerprint (SIFt). The SIFt is in the form of an information string which includes a plurality of information blocks, and each information block includes a plurality of information units. The method includes the steps of selecting a plurality of positions (selected positions) on a target molecule where each selected position corresponds to an information block in the information string; selecting a plurality of interaction types and calculating a value that is indicative of the characteristic of each interaction type at each selected position of the target molecule; assigning the value to the corresponding information unit thereby indicating the characteristic of that particular interaction type at the corresponding selected position; and joining the information units of each selected position together to form the corresponding information blocks, which are joined together to generate a SIFt.
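The steps above can be illustrated with a minimal Python sketch. The interaction type names, the residue labels, and the function name build_sift are hypothetical and chosen only for illustration; binary values are assumed for the information units.

```python
from typing import Dict, List, Sequence, Set

# Hypothetical interaction types; the method leaves the choice open.
INTERACTION_TYPES = ("contact", "polar", "nonpolar", "hbond_donor", "hbond_acceptor")


def build_sift(positions: Sequence[str],
               observed: Dict[str, Set[str]],
               interaction_types: Sequence[str] = INTERACTION_TYPES) -> str:
    """Join one information block per selected position into a single string.

    Each block holds one binary information unit per interaction type:
    "1" if that interaction type was observed at the position, else "0".
    """
    blocks: List[str] = []
    for pos in positions:
        present = observed.get(pos, set())
        block = "".join("1" if t in present else "0" for t in interaction_types)
        blocks.append(block)
    return "".join(blocks)


# Illustrative input: which interaction types were detected at each position.
positions = ["LYS53", "GLU71", "MET109"]
observed = {
    "LYS53": {"contact", "polar", "hbond_donor"},
    "MET109": {"contact", "nonpolar"},
}
sift = build_sift(positions, observed)
print(sift)
```

Because every SIFt built over the same selected positions and interaction types has the same length and bit order, two such strings can be compared position by position.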
The SIFt methodology can include an interaction profile based approach termed profile-SIFt, or p-SIFt. The p-SIFt can be derived from a collection of SIFts, and can measure the conservation of interactions observed in clusters of protein-ligand complexes. A p-SIFt can be used to generate target-specific knowledge-based filters for virtual screening as well as provide an understanding of the interaction patterns responsible for inhibitor selectivity.
Interaction profiling and p-SIFt can be a powerful approach to identify and understand interactions that small molecules exploit in order to bind to a target molecule. The information encoded in a p-SIFt can be used to selectively filter virtual libraries for ligands that are inhibitors to a particular target molecule.
SIFts are described, for example, in U.S. Patent Application Nos. 60/484,308, filed Jul. 3, 2003, and 60/524,083, filed Nov. 24, 2003; in PCT application No. US04/20992, filed Jul. 1, 2004; and in U.S. Patent Application No. 60/602,852, filed Aug. 20, 2004, each of which is incorporated by reference in its entirety.
The target molecule can be a protein or a fragment thereof such as a peptide (e.g., polypeptide or oligopeptide). Alternatively, a target molecule can be a nucleic acid. In certain circumstances, the ligand can be a peptide, a nucleic acid, or even a small molecule (e.g., an organic molecule (e.g., molecular weight equal to or less than 1,500 daltons) that is neither a peptide nor a nucleic acid). In certain circumstances, both the target molecule and the ligand can be proteins. In this case, the SIFt can be descriptive of protein-protein interactions.
Note that the target molecule is forming a complex with a ligand (i.e., the binary complex), and the selected positions are the positions on the target molecule that participate in intermolecular interaction with the ligand. These positions can be obtained from a three-dimensional structure of a binary complex formed between the target molecule and the ligand. The three-dimensional structure can be derived from an experimental method or a prediction method such as, for example, an in silico prediction method. In one embodiment, a set of selected positions can be obtained from comparing the common positions (e.g., residues or bases) of the target molecule that participate in intermolecular interactions among a set of target molecule-ligand structures. The target molecule can be the same or different in the set of target molecule-ligand structures.
For a protein or peptide target molecule, each selected position can include one or more secondary structure elements (e.g., an α-helix or a β-strand), amino acid residues (e.g., a lysine residue), main chain atom groups (e.g., the α-carbon of a particular amino acid residue), side chain atom groups (e.g., the butylamine group of a Lys), or individual atoms of the target molecule. As to a nucleic acid target molecule, each selected position can include one or more bases, functional groups, or individual atoms of the target molecule.
The value that is assigned to a particular information unit can be a binary value or a numeric value selected from a scale or range of numbers. The binary value indicates whether a particular interaction type is present (1) or absent (0) at the corresponding selected position of the target molecule, whereas the numeric value indicates the magnitude of a particular interaction type at the corresponding selected position of the target molecule (e.g., a value of “3” in a scale that ranges from “0” to “5”).
As mentioned above, the value indicates the characteristic of a particular interaction type at that selected position. Note that the interaction types represent different types of intermolecular interactions between the target molecule and the ligand. For example, the interaction type can be classified as contact interaction. One can detect the presence of contact interaction between a target molecule and a ligand at a selected position (e.g., a protein residue) according to a number of methods. In one embodiment, the target molecule-ligand pair is considered to have established contact interaction at a selected position if the interaction involves a change or reduction in the accessible surface area at that position of the target molecule upon forming a complex with the ligand. Alternatively, one can measure the intermolecular distance between a target molecule and a ligand at a selected position to determine whether contact interaction occurs at that position (i.e., whether the intermolecular distance is within the predetermined distance cutoff limit). In one embodiment, the target molecule-ligand pair is considered to be interacting if the interatomic contact distance between the target molecule and the ligand is equal to or less than 10 Å (e.g., equal to or less than 6 Å, or even 4 Å). The interaction type can be further classified as polar interaction, non-polar interaction, and/or hydrogen bonding interaction, depending on the nature of the interactions. In one embodiment, the hydrogen bonding interaction can involve a hydrogen bond donor in the target molecule and a hydrogen bond acceptor in the ligand at the selected position. In one embodiment, the hydrogen bonding interaction can involve a hydrogen bond acceptor in the target molecule and a hydrogen bond donor in the ligand at the selected position. Note that intermolecular interactions can be characterized by an interaction energy-based approach.
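The distance-based contact test described above can be sketched as follows. This is a minimal illustration that assumes atomic coordinates in Å are already available; the function name contact_bits, the residue labels, and the coordinates are hypothetical.

```python
import math
from typing import Dict, List, Tuple

Coord = Tuple[float, float, float]


def contact_bits(target_positions: Dict[str, List[Coord]],
                 ligand_atoms: List[Coord],
                 cutoff: float = 4.0) -> Dict[str, int]:
    """Return 1 for each selected position that has any atom within
    `cutoff` angstroms of any ligand atom, and 0 otherwise."""
    return {
        name: int(any(math.dist(a, l) <= cutoff
                      for a in atoms for l in ligand_atoms))
        for name, atoms in target_positions.items()
    }


# Illustrative coordinates only (one atom per position for brevity).
target_positions = {"LYS53": [(0.0, 0.0, 0.0)], "GLU71": [(10.0, 0.0, 0.0)]}
ligand_atoms = [(3.0, 0.0, 0.0)]

tight = contact_bits(target_positions, ligand_atoms, cutoff=4.0)
loose = contact_bits(target_positions, ligand_atoms, cutoff=10.0)
print(tight, loose)
```

Keeping the cutoff as a parameter makes it straightforward to compare the 4 Å, 6 Å and 10 Å limits mentioned in the text on the same structure.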
The interaction type can be characterized by the contribution of the selected position to the interaction energy between a target molecule and a ligand, where the total interaction energy between the target and the ligand is summed over all positions. The interaction energy may be computed by a variety of scoring functions or intermolecular force-fields such as common ligand-receptor docking scoring functions (e.g., Dock, Gold, ChemScore, FlexX score, PMF, ScreenScore, Drugscore, etc.) or intermolecular potential energy functions or force-fields (e.g., CHARMM, Amber, OPLS, etc.). The interaction energy calculated for each information unit (which corresponds to a selected position) may take the form of a real number (e.g., −43.2 kcal/mol), an integer (e.g., −43 kcal/mol), or an integer representing a binned form of the interaction energy. In the latter case, the energy range of the function is divided into bins (e.g., −70 to −50 kcal/mol, −50 to −20 kcal/mol, −20 to 0 kcal/mol, or 0 to 10 kcal/mol) where the interaction energy is represented as an integer identifying the bin (in this case, for example, 1, 2, 3, or 4).
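The binned representation can be sketched in Python using the example bin boundaries given above. How a value falling exactly on a boundary is assigned is not specified in the text; this sketch assumes it goes to the higher-numbered bin, and the per-position energies are invented for illustration.

```python
from bisect import bisect_right
from typing import Dict

# Bin edges in kcal/mol, taken from the example ranges in the text.
BIN_EDGES = [-70.0, -50.0, -20.0, 0.0, 10.0]


def energy_bin(energy: float) -> int:
    """Map an interaction energy to a bin index 1..4.

    Assumption: a value on an interior boundary falls into the upper bin;
    values outside the covered range are rejected.
    """
    if not BIN_EDGES[0] <= energy <= BIN_EDGES[-1]:
        raise ValueError(f"energy {energy} is outside the binned range")
    return min(bisect_right(BIN_EDGES, energy), len(BIN_EDGES) - 1)


# Hypothetical per-position contributions to the interaction energy.
per_position: Dict[str, float] = {"LYS53": -43.2, "GLU71": -12.5, "MET109": 3.1}

binned = {pos: energy_bin(e) for pos, e in per_position.items()}
total = sum(per_position.values())  # total energy is the sum over positions
print(binned, total)
```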
In one aspect, a method is provided for generating a profile-structural interaction fingerprint (p-SIFt) in the form of an information string which comprises a plurality of information blocks wherein each information block comprises a plurality of information units. The method includes selecting a plurality of selected positions on a plurality of target molecules, wherein each selected position corresponds to an information block in the information string. Each target molecule forms a complex with a ligand. The method includes selecting a plurality of interaction types and calculating an aggregate value that is indicative of a characteristic of each interaction type at each selected position of the plurality of target molecules. The value is assigned to a corresponding information unit, the information unit indicating a characteristic of the interaction type at the corresponding selected position. The information units of each selected position are joined together to form corresponding information blocks, and the information blocks are joined together to generate a first p-SIFt.
The method can include comparing the first p-SIFt to a SIFt. The method can include generating a second p-SIFt. The first p-SIFt can be compared to the second p-SIFt. Comparing can include subtracting one of the first p-SIFt and the second p-SIFt from the other.
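As a rough illustration of building and subtracting p-SIFts, the sketch below assumes that the aggregate value for each information unit is the fraction of SIFts in a group in which that bit is set. The text does not fix the aggregate to this choice, and the function names and example fingerprints are hypothetical.

```python
from typing import List, Sequence


def build_psift(sifts: Sequence[str]) -> List[float]:
    """Aggregate equal-length binary SIFts into a per-bit frequency profile.

    Assumption: the aggregate value is the fraction of SIFts with the bit set.
    """
    if not sifts:
        raise ValueError("at least one SIFt is required")
    length = len(sifts[0])
    if any(len(s) != length for s in sifts):
        raise ValueError("all SIFts must have the same length")
    return [sum(int(s[i]) for s in sifts) / len(sifts) for i in range(length)]


def subtract_psifts(first: Sequence[float], second: Sequence[float]) -> List[float]:
    """Element-wise difference; positive values are enriched in `first`."""
    if len(first) != len(second):
        raise ValueError("p-SIFts must have the same length")
    return [a - b for a, b in zip(first, second)]


# Two illustrative groups of complexes, e.g. inhibitors of two targets.
psift_a = build_psift(["1100", "1000"])
psift_b = build_psift(["1001", "0001"])
difference = subtract_psifts(psift_a, psift_b)
print(psift_a, psift_b, difference)
```

Bits with a large positive or negative difference point to interactions that are conserved in one group but not the other, which is the kind of signal that could be relevant to selectivity.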
In another aspect, a method of describing target molecule-ligand interactions includes generating a first plurality of SIFts for a first plurality of target-molecule-ligand complexes, and compiling the first plurality of SIFts to generate a first p-SIFt. The method can include generating a second SIFt or a second p-SIFt, and comparing it to the first p-SIFt. The method can include creating a target molecule-test ligand complex model and generating a SIFt for the model. The SIFt for the model can be compared to the first p-SIFt.
In another aspect, a computer program is provided for generating a profile structural interaction fingerprint (p-SIFt) in the form of an information string which comprises a plurality of information blocks, wherein each information block comprises a plurality of information units. The computer program includes instructions for causing a computer system to select a plurality of selected positions on a plurality of target molecules, where each selected position corresponds to an information block in the information string. Each target molecule forms a complex with a ligand. The computer program also includes instructions for causing the computer to select a plurality of interaction types and calculate an aggregate value that is indicative of a characteristic of each interaction type at each selected position of the plurality of target molecules. The value is assigned to a corresponding information unit, the information unit indicating a characteristic of the interaction type at the corresponding selected position. The information units of each selected position are joined together to form corresponding information blocks, and the information blocks are joined together to generate a first p-SIFt. The computer program can cause the computer system to generate a second p-SIFt, and to compare the first p-SIFt to the second p-SIFt.
In one aspect, a method of predicting the interaction pattern between a target molecule and a test ligand is provided. A test ligand is a ligand whose affinity to the target molecule is under examination. The prediction method involves identifying a plurality of selected positions between the target molecule and a first ligand, wherein the first ligand is known to bind to the target molecule (i.e., the affinity between the first ligand and the target molecule is known). As described above, selected positions are positions on the target molecule that participate in intermolecular interactions with the ligand (here, the first ligand). Based on the selected positions, the method then involves generating a first structural interaction fingerprint (SIFt) as described above (i.e., formation of an information string that includes a plurality of information blocks, where each information block includes a plurality of information units, and where each information unit is assigned a calculated value indicative of the presence/absence or the magnitude of a particular interaction type at the selected position of the target molecule to which the information unit/block corresponds). Using the same selected positions, the method then involves the generation of a second SIFt between the same target molecule and a second ligand (i.e., a test ligand) employing the same steps as described above. Finally, the method involves comparing the first SIFt with the second SIFt to determine the level of overlapping between the first and second SIFts. A pattern of substantial overlapping between the two SIFts predicts that the second ligand interacts with the target molecule in a similar pattern as the first ligand. In one embodiment, the first ligand is the natural ligand of the target molecule. In one embodiment, the first ligand is a ligand of known affinity to the target molecule.
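One simple way to quantify the level of overlap between the first and second SIFts is sketched below. The measure used here (the fraction of interactions made by the first ligand that the test ligand also makes) is an assumption for illustration; the text does not prescribe a particular measure at this point.

```python
def overlap_fraction(reference: str, test: str) -> float:
    """Fraction of the reference SIFt's set bits that are also set in the test SIFt.

    Both SIFts must be built over the same selected positions and
    interaction types so that bit i means the same thing in each.
    """
    if len(reference) != len(test):
        raise ValueError("SIFts must have the same length")
    reference_on = [i for i, bit in enumerate(reference) if bit == "1"]
    if not reference_on:
        return 0.0
    shared = sum(1 for i in reference_on if test[i] == "1")
    return shared / len(reference_on)


first_sift = "110100"   # illustrative SIFt of the known (first) ligand
second_sift = "110001"  # illustrative SIFt of the test (second) ligand
score = overlap_fraction(first_sift, second_sift)
print(round(score, 3))
```

A threshold on this score for calling the overlap "substantial" would have to be chosen for each target; none is implied here.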
In one aspect, a method of generating a structural interaction fingerprint (SIFt) database is provided. The method involves (1) identifying a plurality of selected positions on a target molecule (which forms a complex with a first ligand) and (2) generating a first SIFt of the database as described above (i.e., formation of an information string that includes a plurality of information blocks where each information block includes a plurality of information units, and where each information unit is assigned a calculated value indicative of the presence/absence or the magnitude of a particular interaction type at the selected position of the target molecule to which the information unit/block corresponds). The method then requires that steps (1) and (2) be repeated using the same target molecule but a different ligand such that another SIFt can be generated and added to the database. The method then repeats steps (1) and (2) with different ligands and generates more SIFts until the database contains a desired number of SIFts. In one embodiment, the method further involves analyzing the SIFts of the database to generate one or more interaction patterns between the target molecule and the ligands. Typically, ligands that belong to a particular interaction pattern bind to the target molecule in a similar manner. In one embodiment, the method further involves comparing one or more interaction patterns of the database with a SIFt generated by using the same target molecule and a test ligand. A test ligand is a ligand that was not employed in generating the database. From the degree of similarity between the SIFt generated using the test ligand and the interaction pattern, one can predict whether or not the test ligand binds to the target molecule in a similar manner.
One can even predict whether or not the test ligand belongs to the same family of ligands used to generate the database. In one embodiment, the method further includes the step of storing the database in a computer readable medium.
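A SIFt database and a lookup for a test ligand might be sketched as follows. The JSON serialization, the ligand names, and the matching-bit ranking are all assumptions made for illustration rather than features required by the method.

```python
import json
from typing import Dict, List, Tuple


def matching_bits(a: str, b: str) -> int:
    """Count the positions at which two equal-length SIFts agree."""
    if len(a) != len(b):
        raise ValueError("SIFts must have the same length")
    return sum(x == y for x, y in zip(a, b))


def nearest_entries(database: Dict[str, str], query: str,
                    top_n: int = 1) -> List[Tuple[str, str]]:
    """Rank database entries by the number of bits they share with the query."""
    ranked = sorted(database.items(),
                    key=lambda item: matching_bits(item[1], query),
                    reverse=True)
    return ranked[:top_n]


# Hypothetical database: one SIFt per ligand, all against the same target.
database = {"ligand_A": "110100", "ligand_B": "110110", "ligand_C": "001001"}

# Serialize to a text form that could be stored on a computer-readable medium.
serialized = json.dumps(database)
restored = json.loads(serialized)

best = nearest_entries(restored, "110111")  # SIFt of a test ligand
print(best)
```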
In one aspect, a method of analyzing the interaction pattern of two or more related target molecules is provided. The method includes conducting sequence and structural alignments among the related target molecules to derive a uniform residue or base numbering system. The method then involves identifying a plurality of selected positions on the target molecule of each target molecule-ligand complex using the uniform residue or base numbering system. This is followed by generating a SIFt for each target molecule-ligand complex as described above and comparing different SIFt patterns. The interactions can be conserved or unconserved.
The method can include compiling the SIFts to identify selected interactions that are conserved among the complexes. The method can include calculating a score for each interaction among the target molecule-ligand complexes. The score can include a conservation score. The method can include compiling the SIFts to form a p-SIFt from the calculated conservation score, or comparing a SIFt generated from a test ligand with a p-SIFt generated from a group of target molecule-ligand complexes, thereby predicting whether the test ligand interacts with the target molecule in a pattern similar to that of the group. The method can include comparing two p-SIFts, thereby predicting whether two groups of structures share conserved binding interactions and/or have a similar binding pattern.
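The comparison of a test-ligand SIFt against a p-SIFt can be illustrated with the following sketch. It assumes the conservation score of each bit is its frequency across the group, and it scores a test SIFt by the conservation-weighted fraction of profile interactions it reproduces; both choices, and the function names, are hypothetical.

```python
from typing import List, Sequence


def conservation_profile(sifts: Sequence[str]) -> List[float]:
    """Per-bit conservation score, assumed here to be the bit frequency."""
    if not sifts:
        raise ValueError("at least one SIFt is required")
    length = len(sifts[0])
    if any(len(s) != length for s in sifts):
        raise ValueError("all SIFts must have the same length")
    return [sum(s[i] == "1" for s in sifts) / len(sifts) for i in range(length)]


def profile_score(psift: Sequence[float], sift: str) -> float:
    """Conservation-weighted fraction of the profile matched by one SIFt."""
    if len(psift) != len(sift):
        raise ValueError("p-SIFt and SIFt must have the same length")
    total = sum(psift)
    if total == 0:
        return 0.0
    return sum(p for p, bit in zip(psift, sift) if bit == "1") / total


group = ["1101", "1100", "1001"]          # SIFts of known binders (illustrative)
psift = conservation_profile(group)
similar = profile_score(psift, "1100")    # test ligand sharing conserved bits
dissimilar = profile_score(psift, "0010")  # test ligand sharing none
print(round(similar, 3), dissimilar)
```

Used as a filter, a score of this kind favors docked poses that reproduce the highly conserved interactions and is largely insensitive to interactions that are rare in the reference group.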
In another aspect, a method is provided for generating an R-group-structural interaction fingerprint (r-SIFt) in the form of an information string which includes a plurality of information blocks where each information block includes a plurality of information units.
The method includes selecting a plurality of selected positions on a first ligand. Each selected position corresponds to an information block in the information string. The first ligand forms a complex with a target molecule. The method includes selecting an interaction type and calculating a value that is indicative of a characteristic of the interaction type at each selected position of the first ligand, and assigning the value to a corresponding information unit, the information unit indicating a characteristic of the interaction type at the corresponding selected position. The method also includes joining the information units of each selected position together to form corresponding information blocks, and joining the information blocks together to generate an r-SIFt.
The target molecule can be a protein, a peptide, or a nucleic acid. The first ligand can be a small molecule, a peptide, a protein or a nucleic acid. The value that is assigned to an information unit can be a binary value which indicates the presence or absence of a particular interaction type at the corresponding selected position. The interaction type can be contact interaction.
The method can include selecting a plurality of selected positions on a plurality of ligands, where each selected position corresponds to an information block in the information string. Each of the plurality of ligands forms a complex with the target molecule. The method includes calculating a value that is indicative of a characteristic of the interaction type at each selected position of the plurality of ligands, and assigning the value to a corresponding information unit, the information unit indicating a characteristic of the interaction type at the corresponding selected position. The method also includes joining the information units of each selected position together to form corresponding information blocks, and joining the information blocks together to generate an r-SIFt for each of the plurality of ligands.
The plurality of ligands can be selected from a combinatorial library. The method can include comparing one r-SIFt to a second r-SIFt. The method can include grouping the r-SIFts based on the comparison.
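The r-SIFt construction and grouping can be sketched as follows, assuming a single binary contact interaction type and selected positions that correspond to a shared core and two R-groups. The position labels, compound names, and the rule of grouping only identical fingerprints are illustrative assumptions.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def build_rsift(r_positions: Sequence[str], contacts: Dict[str, bool]) -> str:
    """One bit per selected ligand position: "1" if that part of the ligand
    contacts the target molecule, else "0"."""
    return "".join("1" if contacts.get(p, False) else "0" for p in r_positions)


def group_by_rsift(rsifts: Dict[str, str]) -> Dict[str, List[str]]:
    """Group ligands whose r-SIFts are identical (the simplest comparison)."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for ligand, fingerprint in rsifts.items():
        groups[fingerprint].append(ligand)
    return dict(groups)


# Selected positions on the ligands of a hypothetical combinatorial library.
r_positions = ["core", "R1", "R2"]
library_contacts = {
    "cmpd_1": {"core": True, "R1": True},
    "cmpd_2": {"core": True, "R1": True},
    "cmpd_3": {"core": True, "R2": True},
}

rsifts = {name: build_rsift(r_positions, c) for name, c in library_contacts.items()}
groups = group_by_rsift(rsifts)
print(groups)
```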
The method can include classifying each of the plurality of ligands into a class according to the degree of similarity of their respective r-SIFts to the r-SIFt of the first ligand. The method can include determining a chemical or physical property of the selected positions of the plurality of ligands. The chemical or physical property can be correlated with the class. The method can include determining a chemical or physical property for a part of a compound and classifying the compound into a class. The chemical or physical property can be F_COUNT, P_COUNT, S_COUNT, CL_COUNT, BR_COUNT, ALOGP, MOLECULAR_POLARSURFACEAREA, NUM_H_ACCEPTORS, NUM_H_DONORS, NUM_ATOMS, NUM_HYDROGENS, NUM_POSITIVEATOMS, NUM_ROTATABLEBONDS, NUM_BRIDGEBONDS, NUM_RINGS, NUM_AROMATICRINGS, NUM_RINGASSEMBLIES, NUM_CHAINS, NUM_CHAINASSEMBLIES, NUM_STEREOBONDS, NUM_UNKNOWNSTEREOBONDS, NUM_ATOMCLASSES, LOGD, or MOLECULAR_WEIGHT.
In another aspect, a computer program is provided for generating an R-group structural interaction fingerprint (r-SIFt) in the form of an information string which includes a plurality of information blocks, wherein each information block includes a plurality of information units. The computer program includes instructions for causing a computer system to select a plurality of selected positions on a first ligand, where each selected position corresponds to an information block in the information string, and the first ligand forms a complex with a target molecule. The computer program includes instructions to select an interaction type and calculate a value that is indicative of a characteristic of the interaction type at each selected position of the first ligand, and assign the value to a corresponding information unit, the information unit indicating a characteristic of the interaction type at the corresponding selected position. The computer program also includes instructions to join the information units of each selected position together to form corresponding information blocks, and join the information blocks together to generate an r-SIFt. The computer program can include instructions for causing the computer system to generate a second r-SIFt.
As used herein, the target molecules are related if they exhibit at least 20% sequence similarity or a structural similarity with a root-mean squared deviation over the aligned positions no greater than 4 Å (e.g., 6 Å). In yet another embodiment, the target molecules are related if they exhibit at least 20% protein sequence similarity with a root-mean squared deviation over the aligned positions no greater than 6 Å. For protein target molecules, sequence and structural alignments are commonly applied within the structural biology field. Available databases include the PFAM database (http://www.sanger.ac.uk/software/Pfam/index.shtml), which contains protein sequence alignments, and the SCOP database (http://scop.mrc-lmb.cam.ac.uk/scop/), which contains protein structural alignments.
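The relatedness criterion can be sketched as a small check. The sketch assumes the two structures have already been superimposed and that the aligned positions are supplied as paired coordinates; the RMSD cutoff is left as a parameter because the text mentions both 4 Å and 6 Å, and the function names are hypothetical.

```python
import math
from typing import Sequence, Tuple

Coord = Tuple[float, float, float]


def rmsd(coords_a: Sequence[Coord], coords_b: Sequence[Coord]) -> float:
    """Root-mean-square deviation over paired, already superimposed positions."""
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("coordinate lists must be non-empty and of equal length")
    squared = sum(math.dist(a, b) ** 2 for a, b in zip(coords_a, coords_b))
    return math.sqrt(squared / len(coords_a))


def are_related(sequence_similarity: float, rmsd_value: float,
                min_similarity: float = 0.20, max_rmsd: float = 4.0) -> bool:
    """First criterion in the text: sufficient sequence similarity OR
    sufficient structural similarity. Cutoffs are adjustable assumptions."""
    return sequence_similarity >= min_similarity or rmsd_value <= max_rmsd


# Illustrative aligned positions: structure B is shifted 3 angstroms along z.
structure_a = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
structure_b = [(0.0, 0.0, 3.0), (1.0, 0.0, 3.0)]
deviation = rmsd(structure_a, structure_b)
print(deviation, are_related(0.10, deviation))
```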
In some embodiments, at least one interaction type includes a chemical or physical property of a part of the ligand interacting with each selected position. In other embodiments, each interaction type includes a chemical and physical property of a part of the ligand interacting with each selected position. The interaction types can include information bits about the chemical composition of a ligand (e.g., various R groups in a combinatorial library), or an experimentally determined or computed property of the part of the ligand interacting with the selected position. For example, interaction types can include information bits representing varying groups of a combinatorial library. Properties and descriptors of a molecule or part of a molecule can include fragment constant descriptors (e.g., hydrophobic, hydrogen bond acceptor, hydrogen bond donor, hydrophobic aliphatic, hydrophobic aromatic, negative charge, negative ionizable, positive charge, positive ionizable, or aromatic ring), electronic descriptors (e.g., charge, partial positive surface area, partial negative surface area, dipole moment, atomic polarizability, polar surface area), topological descriptors (e.g., Wiener index, Zagreb index, Hosoya index), molecular flexibility index, spatial descriptors (e.g., shadow indices, molecular surface area, density, principal moment of inertia, molecular volume), structural descriptors (e.g., number of chiral centers, molecular weight, number of rotatable bonds), or thermodynamic descriptors (e.g., partition coefficient, desolvation free energies for water and octanol, pKa). The interaction type can also include a chemical fingerprint for a part of the ligand interacting with the selected position of the target molecule. A chemical fingerprint is a string of values (usually an array of binary bits) that contains the unique information about the chemical makeup (e.g., atoms, substructures, chirality) of the molecule.
In some embodiments, the interaction types can also include information about the selected position in the target molecule, such as variables measuring the sequence conservation, structural conservation and flexibility of the selected position of the target molecule.
In a further aspect, a computer-readable data storage medium is provided. The medium includes a data storage material encoded with a computer-readable database. The database includes a plurality of SIFts generated from a target molecule and a plurality of ligands. Each SIFt is in the form of an information string that includes a plurality of information blocks, and each information block includes a plurality of information units.
The target molecule interacts with each ligand at a plurality of selected positions on the target molecule via a number of interaction types. As described above, selected positions are positions on the target molecule that participate in intermolecular interaction with the ligand.
The magnitude of each interaction type at each selected position is calculated and represented by a value, which is assigned to a corresponding information unit. The target molecule can be a protein, a peptide, or a nucleic acid, and the ligand can be a small molecule, a peptide, a protein or a nucleic acid. In one embodiment, the value that is assigned to an information unit is a binary value, which indicates the presence or absence of a particular interaction type at the corresponding selected position. In one embodiment, the value that is assigned to an information unit is selected from a range of scaled numeric values, which indicates the magnitude of a particular interaction type at the corresponding selected position. For a protein/peptide target molecule, each selected position can include one or more amino acid residues, main chain atom groups, side chain atom groups, or individual atoms of the target molecule. For a nucleic acid target molecule, each selected position can include one or more bases, functional groups, or individual atoms of the target molecule. In one embodiment, the interaction type can be a contact interaction. For example, the interatomic contact distance between the target molecule and the ligand can be equal to or less than 10 Å (e.g., equal to or less than 6 Å, or even 4 Å) for the target molecule-ligand pair to be considered as having contact interaction. As another example, the contact interaction can include a change in the accessible surface area of the target molecule upon forming a complex with the ligand. In one embodiment, the interaction type can be a polar interaction, a non-polar interaction, or a hydrogen bond interaction. In one embodiment, the hydrogen bond interaction can include a hydrogen bond donor in the target molecule and a hydrogen bond acceptor in the ligand at the corresponding selected position.
In one embodiment, the hydrogen bond interaction can include a hydrogen bond acceptor in the target molecule and a hydrogen bond donor in the ligand at the corresponding selected position.
In yet a further aspect, a computer program for generating a SIFt that is in the form of an information string comprising a plurality of information blocks, where each information block includes a plurality of information units is provided. The computer program contains instructions for causing a computer system to select a plurality of positions (selected positions) on a target molecule (which is forming a complex with a ligand). The selected positions are positions on the target molecule that participate in intermolecular interaction with the ligand. Each selected position corresponds to an information block in the information string. The computer program can perform one or more of the following steps: select a plurality of interaction types that exist between the target molecule and the ligand; calculate a value that is indicative of the characteristic of each interaction type at each selected position of the target molecule; assign the value to the corresponding information unit so as to indicate the characteristic of that particular interaction type at the corresponding selected position; join the information units of each selected position together to form the corresponding information blocks; and join the information blocks to generate a SIFt. The target molecule can be a protein, a peptide, or a nucleic acid, and the ligand can be a small molecule, a peptide, or a nucleic acid. In one embodiment, the value that is assigned to an information unit is a binary value, which indicates the presence or absence of a particular interaction type at the corresponding selected position. In one embodiment, the value that is assigned to an information unit is selected from a range of scaled numeric values, which indicates the magnitude of a particular interaction type at the corresponding selected position. In one embodiment, the selected positions are obtained from a three-dimensional structure of a binary complex formed between the target molecule and the ligand. 
Such a three-dimensional structure may be derived from an experimental method or a prediction method such as, for example, an in silico prediction method. For a protein/peptide target molecule, each selected position can include one or more amino acid residues, main chain atom groups, side chain atom groups, or individual atoms of the target molecule. For a nucleic acid target molecule, each selected position can include one or more bases, functional groups, or individual atoms of the target molecule. The interaction types represent different types of intermolecular interactions between the target molecule and the ligand and can be characterized by a binding energy-based approach. In one embodiment, the interaction type can be a contact interaction. For example, the interatomic contact distance between the target molecule and the ligand can be equal to or less than 10 Å (e.g., equal to or less than 6 Å, or even 4 Å) for the target molecule-ligand pair to be considered as having a contact interaction. As another example, the contact interaction can include a change in the accessible surface area of the target molecule upon forming a complex with the ligand. In one embodiment, the interaction type can be a polar interaction, a non-polar interaction, or a hydrogen bond interaction. In one embodiment, the hydrogen bond interaction can include a hydrogen bond donor in the target molecule and a hydrogen bond acceptor in the ligand at the corresponding selected position. In one embodiment, the hydrogen bond interaction can include a hydrogen bond acceptor in the target molecule and a hydrogen bond donor in the ligand at the corresponding selected position. In one embodiment, the computer program can further include instructions to store the SIFt in a database. In one embodiment, the computer program can include instructions for generating a plurality of SIFts by repeating the steps recited above using, e.g., the same target molecule and selected positions, but different ligands.
The plurality of SIFts may then be stored in a database. In one embodiment, the computer program can further include instructions to generate a SIFt using the same target molecule and a test ligand, and to compare this SIFt with another SIFt (e.g., generated using the same target and a known ligand) or another group of SIFts (i.e., either one SIFt or a plurality of SIFts forming an interaction pattern). Various methods can be used to compare the generated SIFt with one or more other SIFts. For example, a comparison can be performed using a simple sum of matching bits (units) across the entire SIFt, or by the application of one or more similarity measures (including, e.g., Tanimoto coefficient, Euclidean distance, cosine correlation coefficient, correlation, half square Euclidean distance, and city block distance). Furthermore, a library of SIFts can be compared by, for example, first carrying out all pairwise comparisons using one of the similarity measures mentioned above and then applying hierarchical clustering to group SIFts according to their similarity. The clustering can use, for example, one or more common cluster similarity methods (including, e.g., UPGMA (Unweighted Pair-Group Method with Arithmetic mean), WPGMA (Weighted Pair-Group Method with Arithmetic mean), single linkage, complete linkage, and Ward's method).
As used herein, a target molecule generally refers to a biomolecule whose functions are desired to be modulated. A target molecule contains a region (i.e., binding site) that allows it to bind to one or more ligands that satisfy the binding criteria. A target molecule can be a macromolecule such as a protein (or polypeptide) or a nucleic acid. A target molecule is typically a bio-macromolecule whose functions can be altered when it is bound to a molecule (i.e., ligand) that fits its binding or active site.
As used herein, a ligand refers to a molecule that binds to the binding or active site of a target molecule. A ligand is typically a smaller molecule than a target molecule and typically binds to a target molecule with high affinity (e.g., with a Kd of at least 1 mM). A ligand can be a natural ligand or substrate (i.e., naturally occurring in a biological system) of the target molecule, e.g., ATP for certain kinases such as p38. A ligand can also be a small molecule inhibitor, e.g., SB203580, which is a well-known inhibitor of p38.
As used herein, a naturally occurring amino acid is defined as one of the twenty amino acids naturally occurring in proteins. These naturally occurring amino acids are the L-isomers of glycine, alanine, valine, leucine, isoleucine, serine, methionine, threonine, phenylalanine, tyrosine, tryptophan, cysteine, proline, histidine, aspartic acid, asparagine, glutamic acid, glutamine, arginine, and lysine. A so-called “unnatural” amino acid is any amino acid other than the twenty named above. Included are D-isomers of the twenty amino acids named above, D or L isomers or racemic mixtures of selenocysteine and selenomethionine, and the D or L forms (or racemic mixtures) of, e.g., nor-leucine, para-nitrophenylalanine, homophenylalanine, para-fluorophenylalanine, 3-amino-2-benzylpropionic acid, homoarginine, and the like. These unnatural amino acids may be used, e.g., in rational drug design in developing inhibitors and/or binding molecules to modulate a protein's activity.
An amino acid is a molecule having a structure in which a central carbon atom (the α-carbon atom) is linked to a hydrogen atom, a carboxylic acid group (the carbon atom of which is referred to herein as a “carboxyl carbon atom”), an amino group (the nitrogen atom of which is referred to herein as an “amino nitrogen atom”), and a side chain group. For example, the side chain group of alanine is a methyl group. Any atom that is not part of a side chain group is a main chain atom, e.g., the α-carbon atom or the hydrogen atom bonded to this carbon atom.
A positively charged amino acid is any naturally occurring or unnatural amino acid having a side chain that is positively charged under normal physiological conditions. The positively charged, naturally occurring amino acids are arginine, lysine, and histidine. A negatively charged amino acid is any naturally occurring or unnatural amino acid having a side chain that is negatively charged under normal physiological conditions. Examples of negatively charged, naturally occurring amino acids are aspartic acid and glutamic acid. A hydrophobic amino acid is any naturally occurring or unnatural amino acid that contains a hydrophobic side chain group. Examples of naturally occurring hydrophobic amino acids are alanine, leucine, isoleucine, valine, proline, phenylalanine, tryptophan, and methionine. An uncharged, hydrophilic amino acid is any naturally occurring or unnatural amino acid that contains a hydrophilic side chain group, but is uncharged at physiological pH. Examples of naturally occurring uncharged, hydrophilic amino acids are serine, threonine, tyrosine, asparagine, glutamine, and cysteine.
As used herein, a polypeptide refers to a polymer of two or more amino acids (i.e., amino acid residues) linked via peptide bonds. A peptide bond forms when the carboxyl carbon atom of the carboxylic acid group bonded to the α-carbon of one amino acid (or amino acid residue) becomes covalently bound to the amino nitrogen atom of the amino group bonded to the α-carbon of an adjacent amino acid. A protein that includes more than one polypeptide subunit (e.g., DNA polymerase III, RNA polymerase II) or other components (e.g., an RNA molecule, as occurs in telomerase) will also be understood to be included within the meaning of “polypeptide” as used herein. Similarly, fragments of full-length proteins are also “polypeptides”.
The amino acid sequence of a given naturally occurring polypeptide (i.e., the polypeptide's “primary structure”) can be determined by the nucleotide sequence of the coding portion of a mRNA, which is in turn specified by genetic information, typically genomic DNA (including organelle DNA, e.g., mitochondrial or chloroplast DNA).
The secondary structure of a polypeptide refers to the local regular structure of a polypeptide segment, without considering the conformations of the side chains of its residues. Common secondary structure elements include the α-helix and the β-strand. The tertiary structure refers to the three-dimensional arrangement of all atoms in a polypeptide chain.
An amino acid residue of a polypeptide interacts with adjacent residues (e.g., residues that are adjacent in the primary, secondary or tertiary structure of a polypeptide) as well as with ligands or substrates based, in part, on the type of side chain group present. For example, hydrophobic amino acids are more likely to interact with other hydrophobic amino acids or hydrophobic molecules. Similarly, hydrophilic amino acids are more likely to interact with other hydrophilic amino acids or hydrophilic molecules. These types of interactions can be identified and characterized as discussed herein based upon a residue's chemical characteristics as well as its interaction with adjacent atoms or molecules.
As used herein, a nucleic acid refers to DNA and RNA, which are both linear polymers of nucleotide subunits. Each nucleotide unit contains a base, a sugar and a phosphate. In DNA, the sugar is deoxyribose, and there are four types of bases: adenine (A), thymine (T), guanine (G), and cytosine (C). In RNA, the sugar is ribose, and the bases are adenine (A), uracil (U), guanine (G), and cytosine (C). In both DNA and RNA, the base is linked to the sugar moiety through a beta-glycosyl linkage, and the nucleotide units are joined together through phosphodiester bonds with phosphates at O3' and O5' of the sugars.
The details of one or more embodiments are set forth in the accompanying drawings and the description below. Other features, objects, and advantages will be apparent from the description and drawings, and from the claims.
Also, the residues are grouped into six different regions as described in
the Y-axis represents the conservation scores of the interaction bits.
FIG. 10 is a graph depicting a p-SIFt.
FIGS. 12(a)-12(d) are images showing p-SIFt information mapped on to a structure of a complex between a target molecule and a ligand.
FIGS. 15(a)-15(b) are graphs depicting enrichment of a chemical library.
FIGS. 16(a)-16(c) are graphs depicting Z scores for a p-SIFt with SIFts.
a is a drawing showing an overlay of 150 poses of 1ouk-inh docked onto the human p38 structure.
b is a hierarchical clustering of the r-SIFts of 150 1ouk-inh docking poses. Each r-SIFt is represented as one horizontal line in the heat map, and only ON-bits (1) are shown. The interaction bits are colored according to their respective molecular fragments (red: core, blue: R1, purple: R2, green: R3). The left side of the heat map shows the dendrogram of the hierarchical clustering result. The r-SIFts in the heat map are rearranged according to the order given by the clustering. Four major clusters (labeled 1-4) identified from the dendrogram are labeled on the right side of the r-SIFt heat map. The line of blocks above the heat map indicates the locations of the corresponding binding site residues in the protein. The residues are grouped into six different regions as described previously.
c-f each display an overlay of the docking poses within each cluster (1-4), shown in the same reference frame as
a shows a hierarchical clustering of docking poses of five different compounds docked onto p38 structure (1ouk). The bit-coloring scheme and structure layout are identical to those in
b shows structures of the best docking pose of each of the five molecules, shown within the same active site of target molecule structure of 1ouk. The co-crystal structure of 1ouk-inh is shown as thin yellow line model for comparison.
c-g show structures of the docking poses of each compound (three poses per molecule) used in
a shows a classification of the 2208 1ouk-inh R1 library compounds based on their r-SIFt similarities. The coloring scheme is the same as in
b shows the 3D structures of 200 example compounds in the native cluster. The co-crystal structure is shown as yellow stick model.
c shows examples of compounds in native and non-native clusters. The R1 attachment points are labeled.
As used herein and in the appended claims, the singular forms “a,” “and,” and “the” include plural referents unless the context clearly dictates otherwise. Thus, for example, reference to “a protein” includes a plurality of proteins and reference to “the polypeptide” generally includes reference to one or more polypeptides and equivalents thereof known to those skilled in the art, and so forth.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art. Although any methods, devices and materials similar or equivalent to those described herein may be used, the typical methods, devices and materials are now described.
All publications mentioned herein are incorporated herein by reference in full for the purpose of describing and disclosing the databases, proteins, and methodologies described in the publications that might be used in connection with the presently described techniques. The publications discussed above and throughout the text are provided solely for their disclosure prior to the filing date of the present application. Nothing herein is to be construed as an admission that the inventors are not entitled to antedate such disclosure by virtue of prior invention.
Techniques are provided for a simple and robust method for representing and analyzing three-dimensional target molecule-ligand interactions. This method generates a structural interaction fingerprint (SIFt), a representation of the interactions in three-dimensional binary complexes, i.e., target molecule-ligand (e.g., protein-ligand or nucleic acid-ligand) complexes. The representation is in the form of an information string (e.g., a binary bit string) containing a plurality of information blocks, each of which, in turn, contains a plurality of information units. Before one constructs a SIFt, one has to select the binary (target molecule-ligand) complexes.
I. Selection of Three-Dimensional Binary Complex Structures
The SIFt-based method employs a set of three-dimensional binary structures (e.g., molecular docking results) to generate a set of SIFts. The set of structures can be obtained from different poses of a selected pair of target molecule (e.g., a protein such as a kinase) and ligand (e.g., a natural ligand or an inhibitor). See, e.g., Example 1, wherein the set of structures was obtained from 100 different poses of a pyridinyl imidazole inhibitor docked onto a single protein kinase p38 structure. In another aspect, the set of structures can be obtained from structural data (e.g., docking results) of a number of different ligands interacting with a single target molecule. See, e.g., Example 2, wherein the set of structures was obtained from docking a group of different small molecules (a library of 1,016 small molecules) onto the same target molecule (a protein kinase p38 structure). In a further aspect, the set of structures can be obtained from different target molecules and different ligands (see, e.g., Example 3, wherein both the target molecules (protein kinases) and ligands are different). Using different target molecules requires additional structural and sequence alignment steps, which will be further discussed below. Once a set of structures has been obtained, one can proceed to construct SIFts.
(i) Identification of the Selected Positions of a Target Molecule
The next step involves selection of a set of positions (“selected positions”) on the target molecule of each of the structures where each of these selected positions is commonly involved in interactions (e.g., non-covalent interaction) between the target molecule and the ligand. These positions serve as reference points covering all of the interactions in the target molecule-ligand complex, and are then used as the common reference frame for constructing SIFts.
The selected positions are defined as regions of the target molecule that are in contact with the ligand. Different methods have been developed to determine whether contacts have been made between the target molecule and the ligand in the context of a particular interaction. Below is a description of two exemplary methods.
For example, the program AREAIMOL of the CCP4 suite (which refers to “Collaborative Computational Project, Number 4.” See The CCP4 suite: programs for protein crystallography. Acta Cryst., D50, 760-763, 1994; and Lee et al., J. Mol. Biol. 55:379-400, 1971) can be used to identify the target molecule atoms that are involved in the non-covalent intermolecular interactions with the ligand. AREAIMOL evaluates the solvent accessible area by rolling a 1.4 Å probe sphere over the van der Waals surface of the target molecule and of the target molecule-ligand complex. Note that solvent molecules can be excluded for the sake of simplicity, although in theory well-ordered solvent molecules can be included and treated in the same way as target molecule atoms. For protein target molecules, if non-hydrogen atoms show a decrease in solvent accessibility upon ligand binding and are also within 4.5 Å of any of the non-hydrogen atoms of the ligand, these atoms are identified as ligand binding atoms and the residues corresponding to them are identified as selected positions. The determination of selected positions in a nucleic acid can be done in a similar manner.
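The two-part criterion described above (a decrease in accessibility upon binding, plus proximity within 4.5 Å of a ligand non-hydrogen atom) can be illustrated with a short Python sketch. It is not the AREAIMOL procedure itself: the per-atom accessible areas are assumed to have been computed beforehand by an external program, and the `Atom` record, its field names, and `select_positions` are hypothetical names introduced only for this illustration.

```python
from dataclasses import dataclass
from math import dist


@dataclass(frozen=True)
class Atom:
    residue: int              # residue number (0 is used here for ligand atoms)
    name: str                 # atom name, non-hydrogen atoms only
    xyz: tuple                # Cartesian coordinates in angstroms
    area_free: float = 0.0    # accessible area in the unbound target (assumed precomputed)
    area_bound: float = 0.0   # accessible area in the complex (assumed precomputed)


def select_positions(target_atoms, ligand_atoms, cutoff=4.5):
    """Return residue numbers whose atoms lose accessibility and lie near the ligand."""
    selected = set()
    for atom in target_atoms:
        buried = atom.area_bound < atom.area_free
        near = any(dist(atom.xyz, lig.xyz) <= cutoff for lig in ligand_atoms)
        if buried and near:
            selected.add(atom.residue)
    return sorted(selected)


# Made-up coordinates and areas, for illustration only.
target = [
    Atom(53, "CB", (0.0, 0.0, 0.0), 20.0, 5.0),     # buried and 3.0 A from the ligand
    Atom(106, "OG1", (10.0, 0.0, 0.0), 15.0, 15.0),  # accessibility unchanged
    Atom(109, "N", (3.0, 4.0, 0.0), 12.0, 2.0),      # buried and 4.0 A from the ligand
    Atom(168, "CA", (20.0, 0.0, 0.0), 8.0, 3.0),     # buried but far from the ligand
]
ligand = [Atom(0, "C1", (3.0, 0.0, 0.0))]
positions = select_positions(target, ligand)
```

Both conditions must hold for a residue to be selected, so residue 168 (less accessible but distant) and residue 106 (unchanged accessibility) are excluded in this toy input.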
As to hydrogen bonding interactions between the target molecule and the ligand, one can employ programs such as HBPLUS. See McDonald et al., J. Mol. Biol. 238:777-793, 1994. HBPLUS calculates and lists all possible hydrogen bond donor and acceptor pairs in the complex.
For a set of structures using the same target molecule, after all the ligand binding atoms and their respective residues or bases have been identified, these ligand binding positions are computed and defined as the “selected positions” of the target molecule. As mentioned above, different target molecules can be used. In such circumstances, additional structural and sequence alignment steps are required to convert different but related target molecules into a standard residue numbering system so that a common framework can be employed for constructing the SIFts (see, e.g., Example 3). In some cases, the selected positions can be modified after a SIFt is first constructed, for example, if a subset of the selected positions is found to be more important than other positions in the initial SIFt.
(ii) Determination and Calculation of Interaction Types
After identification of the selected positions (i.e., regions of the target molecule where intermolecular interactions take place), one has to determine and calculate the types of interactions present at these positions. In one embodiment, the target molecule can be a polypeptide or a protein and seven interaction types can be employed based on the AREAIMOL and HBPLUS results. The presence or absence of the interaction types can be calculated at each selected position based on the following inquiries: 1) whether or not it is in contact with the ligand; 2) whether or not any peptide backbone atom is involved in the contact; 3) whether or not any side-chain atom is involved in the binding; 4) whether or not polar interaction is involved; 5) whether or not non-polar interaction is involved; 6) whether or not this residue provides hydrogen bond acceptor(s); and 7) whether or not it provides hydrogen-bond donor(s). The answer to each inquiry constitutes an information unit (in this embodiment, a bit) that corresponds to a particular selected position. By joining the information units together, an information block is formed (in this embodiment, a seven-bit-long block). The entire SIFt can then be constructed by sequentially concatenating the information blocks of each of the selected positions, in ascending position number (e.g., residue number) order.
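The construction described above, one seven-bit block per selected position joined in ascending residue order, might be sketched in Python as follows. The interaction-type labels, the `build_sift` function, and the input dictionary are hypothetical; in practice the per-residue answers would come from programs such as AREAIMOL and HBPLUS.

```python
# Bit order follows the seven inquiries listed in the text.
INTERACTION_TYPES = (
    "contact",      # 1) in contact with the ligand
    "backbone",     # 2) a backbone atom is involved
    "side_chain",   # 3) a side-chain atom is involved
    "polar",        # 4) polar interaction
    "non_polar",    # 5) non-polar interaction
    "hb_acceptor",  # 6) residue provides hydrogen bond acceptor(s)
    "hb_donor",     # 7) residue provides hydrogen bond donor(s)
)


def build_sift(selected_positions, interactions):
    """Concatenate one seven-bit block per selected position, in ascending order.

    `interactions` maps a residue number to the set of interaction-type
    labels observed at that residue; missing residues give an all-zero block.
    """
    blocks = []
    for position in sorted(selected_positions):
        observed = interactions.get(position, set())
        block = "".join("1" if t in observed else "0" for t in INTERACTION_TYPES)
        blocks.append(block)
    return "".join(blocks)


# Made-up observations for two selected positions.
observed = {
    53: {"contact", "side_chain", "non_polar"},
    109: {"contact", "backbone", "polar", "hb_donor"},
}
sift = build_sift([109, 53], observed)
```

Because every structure in a set is encoded against the same selected positions and the same bit order, the resulting strings have equal length and can be compared bit by bit.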
The SIFts resulting from a set of structures are therefore of the same length, and each information unit (e.g., bit) in the fingerprint represents the strength or the presence/absence of a particular interaction type at a particular selected position. As a result, the SIFts are directly comparable. Once SIFts are generated from a set of structures, one can perform analyses of the SIFts to obtain valuable interaction patterns and information (e.g., the degree of binding conservation among the target molecule-ligand pairs).
The interaction types can be classified in a number of ways. For example, the interaction types can be fragment constants descriptors (e.g., hydrophobicity, hydrogen bond acceptor, hydrogen bond donor), electronic descriptors (e.g., charge, partial positive surface area, partial negative surface area, dipole moment, atomic polarizability), topological descriptors (e.g., Wiener index, Zagreb index, Hosoya index), molecular flexibility indices, spatial descriptors (e.g., shadow indices, molecular surface area, density, principal moment of inertia, molecular volume), structural descriptors (e.g., number of chiral centers, molecular weight, number of rotatable bonds), or thermodynamic descriptors (e.g., partition coefficient, desolvation free energies for water and octanol, pKa).
Hydrophobicity is a measure of the thermodynamics of the partitioning of a molecule or part of a molecule between water and a non-aqueous phase (e.g., an organic solvent), in particular, the free energy change (ΔG°transfer) associated with transferring a molecule or part of the molecule from a non-aqueous phase to water. In one popular definition (CATALYST™, Accelrys Inc., San Diego, CA 92121, USA), a contiguous set of atoms is defined as hydrophobic if the atoms are not adjacent to any concentrations of charge (charged atoms or electronegative atoms), in a conformation such that the atoms have surface accessibility. Some examples of hydrophobic groups include phenyl, cycloalkyl, isopropyl, and methyl.
(i) Measurement of Similarity of SIFts
As discussed above, each SIFt represents the interaction pattern between a target molecule and a ligand. It follows that similar SIFts reflect similar interaction patterns among the target molecule-ligand pairs.
Different methods can be employed to measure similarity between SIFts. For example, one can use the Tanimoto coefficient (Tc; see Willett, J. Chem. Inf. Comput. Sci. 38:983-996, 1998), which provides a quantitative measure of the similarity. Using the bit-string embodiment described above, the Tc between bit-strings A and B is defined as:
$$T_c(A, B) = \frac{|A \cap B|}{|A \cup B|}$$
where |A ∩ B| is the number of ON-bits common to both A and B and |A ∪ B| is the number of ON-bits present in either A or B.
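For bit-string SIFts, the ratio of common ON-bits to ON-bits present in either string can be computed directly. The Python sketch below is a minimal illustration; the function name `tanimoto` and the choice to return 1.0 when neither string has any ON-bit are assumptions made here, not something the text specifies.

```python
def tanimoto(a, b):
    """Tanimoto coefficient between two equal-length bit strings of '0'/'1'."""
    if len(a) != len(b):
        raise ValueError("SIFts must have the same length")
    common = sum(1 for x, y in zip(a, b) if x == "1" and y == "1")
    either = sum(1 for x, y in zip(a, b) if x == "1" or y == "1")
    # Assumption: two strings with no ON-bits at all are treated as identical.
    return common / either if either else 1.0


tc = tanimoto("1101", "1001")  # 2 common ON-bits out of 3 ON in either string
```

The coefficient ranges from 0 (no shared ON-bits) to 1 (identical ON-bit patterns).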
(ii) Classification of SIFts Based on Similarity
Based on the similarity measurements, one can classify similar SIFts displaying similar interaction patterns for further analysis, using methods such as hierarchical clustering. From the clustering results, structures can be clustered into groups having similar binding modes.
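Grouping SIFts by similarity can be illustrated with a small agglomerative clustering sketch in Python. It uses average linkage over pairwise Tanimoto coefficients (the UPGMA criterion), which is only one of several possible linkage methods, and it stops at a requested number of clusters rather than building a full dendrogram. The function names and the stopping rule are assumptions made for this illustration.

```python
def tanimoto(a, b):
    """Tanimoto coefficient between two equal-length bit strings."""
    common = sum(1 for x, y in zip(a, b) if x == "1" and y == "1")
    either = sum(1 for x, y in zip(a, b) if x == "1" or y == "1")
    return common / either if either else 1.0


def average_linkage_clusters(sifts, n_clusters):
    """Merge the two most similar clusters until n_clusters remain.

    Cluster similarity is the mean pairwise Tanimoto coefficient between
    members (average linkage). Returns lists of indices into `sifts`.
    """
    if n_clusters < 1:
        raise ValueError("n_clusters must be at least 1")
    clusters = [[i] for i in range(len(sifts))]

    def similarity(c1, c2):
        total = sum(tanimoto(sifts[i], sifts[j]) for i in c1 for j in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > n_clusters:
        # Find the most similar pair of clusters (first pair wins on ties).
        _, p, q = max(
            ((similarity(clusters[p], clusters[q]), p, q)
             for p in range(len(clusters))
             for q in range(p + 1, len(clusters))),
            key=lambda item: item[0],
        )
        clusters[p] = clusters[p] + clusters[q]
        del clusters[q]
    return [sorted(c) for c in clusters]


# Four made-up SIFts: two share bits at the start, two at the end.
example = ["1110000", "1100000", "0001110", "0000111"]
groups = average_linkage_clusters(example, 2)
```

This brute-force version recomputes similarities at every merge, which is adequate for a few hundred poses; a dedicated clustering library would be the more practical choice for large sets.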
To analyze and compare the interaction patterns within a group or between groups, a p-SIFt can be generated by quantifying the degree of similarity of each information unit at each selected position within the SIFts. One example is to calculate an interaction conservation score for each information unit (e.g., bit) within each group. This score represents the percentage of SIFts in which the bit is ON (i.e., occurrence or presence of the interaction type) at this particular selected position. The higher the score, the more conserved this interaction type is within this group. Variations in the conservation scores between two groups reveal the differences in their interaction patterns.
The p-SIFt approach is similar to profile-based techniques that have proven to be very useful in the analysis and database mining of groups of protein sequences and structures. See, for example, Gribskov, M.; et al., Proc. Natl. Acad. Sci. USA 1987, 84, 4355-4358; Gribskov, M.; et al., Methods Enzymol. 1990, 183, 146-159; Wang, G.; and Dunbrack, R. L., Jr. Protein Sci. 2004, 13, 1612-1626; Mehta, P. K.; et al., Proteins 1999, 35, 387-400; Rice, D. W.; and Eisenberg, D. J. Mol. Biol. 1997, 267, 1026-1038; and Koonin, E. V.; et al., Adv. Protein Chem. 2000, 54, 245-275; each of which is incorporated by reference in its entirety. The sequence profile can be constructed from a set of multiply aligned sequences or structures of a probe family and is used to identify distant relationships to a database of target proteins. The profile is essentially a sequence position-specific scoring matrix encoding the probability of finding any of the 20 amino acid residues at that position in the target. In the case of p-SIFt, the SIFts derived from a set of probe structures are used to derive a position-dependent profile encoding the probability that a given interaction at that position is present. The probe set of structures can correspond to members of a gene family, e.g., kinases, or to sub-families of structures representing ligands with a particular activity or selectivity profile.
A structural interaction fingerprint profile (p-SIFt) represents the degree to which interactions are conserved across a set of ligand-receptor complexes. The p-SIFt, P(r), is derived from an array, denoted below as b, of SIFt patterns. The array has length N, the total number of protein-ligand complexes, and width K, the number of SIFt fingerprint bits. The value of each element of P(r) is derived by averaging the elements in each column of the SIFt matrix, yielding a numerical interaction frequency that varies from 0 to 1 for unobserved to fully conserved, respectively. The SIFt array, b, and resulting P(r) are given by
$$b = \begin{bmatrix} b_{1,1} & b_{1,2} & \cdots & b_{1,K} \\ b_{2,1} & b_{2,2} & \cdots & b_{2,K} \\ \vdots & \vdots & \ddots & \vdots \\ b_{N,1} & b_{N,2} & \cdots & b_{N,K} \end{bmatrix}$$
and
$$P(r) = [P_1 \; P_2 \; P_3 \; P_4 \; \cdots \; P_K],$$
where b_{i,r} is the binary bit value in SIFt i = 1, ..., N at position r = 1, ..., K. The value in the p-SIFt at position r is given by
$$P_r = \frac{1}{N} \sum_{i=1}^{N} b_{i,r}$$
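The column-averaging step that yields the p-SIFt can be sketched in a few lines of Python. The function name `p_sift` and the use of '0'/'1' strings as input are assumptions made for this illustration.

```python
def p_sift(sifts):
    """Average each bit position over N equal-length SIFt bit strings.

    Returns a list of K interaction frequencies, each between 0 and 1.
    """
    if not sifts:
        raise ValueError("at least one SIFt is required")
    n = len(sifts)
    k = len(sifts[0])
    if any(len(s) != k for s in sifts):
        raise ValueError("all SIFts must have the same length")
    return [sum(int(s[r]) for s in sifts) / n for r in range(k)]


# Four made-up SIFts of four bits each (N = 4, K = 4).
profile = p_sift(["1100", "1010", "1000", "1110"])
```

In this toy input the first bit is ON in every SIFt (frequency 1.0, fully conserved) and the last bit is never ON (frequency 0.0, unobserved).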
(iii) Measurement of Similarity between SIFts and/or p-SIFts
A Tanimoto coefficient can measure the similarity between two SIFts, between two p-SIFts, between a SIFt and a p-SIFt, or between two r-SIFts (see, for example, Willett, P. J. Chem. Inf. Comput. Sci. 1998, 38, 983-996, which is incorporated by reference in its entirety). A set of SIFt patterns can be clustered using the Tanimoto similarity measure by applying standard hierarchical clustering algorithms. See, for example, Deng, Z., et al., J. Med. Chem. 2004, 47, 337-344; Dubes, R., and Jain, A. K. Adv. Comput. 1980, 19, 113-228; and Raymond, J. W. et al., J. Mol. Graph. Model. 2003, 21, 421-433, each of which is incorporated by reference in its entirety.
The statistical Z score can measure how significant the similarity between a SIFt and a target p-SIFt (i.e., a group of structures) is with respect to a certain background. The Z score is an indication of how many standard deviations an observation differs from the mean. The Z score can be defined as:
$$Z = \frac{\chi_{target} - \langle \chi_b \rangle}{\sigma_b}$$
where target refers to a target molecule, χ_target is the Tanimoto coefficient of the SIFt against the target p-SIFt, and ⟨χ_b⟩ and σ_b are the mean and standard deviation, respectively, of the Tanimoto coefficients of all the SIFts in the background set against the same target p-SIFt. A background set can include dummy SIFts having the same length as the target SIFt or p-SIFt. Each position in the dummy SIFt bit string is randomly 1 or 0, where the probability of being 1 is equal to the value in the target SIFt or p-SIFt. Alternatively, the background set can be a set of SIFts derived from structures.
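A Z score of this kind, including a dummy background drawn from the p-SIFt frequencies, might be computed as in the Python sketch below. Two choices here are assumptions not fixed by the text: the Tanimoto coefficient is extended to real-valued profiles using the common continuous form (dot product divided by the sum of squared norms minus the dot product), and the population standard deviation is used for the background. All function names are hypothetical.

```python
import random
from statistics import mean, pstdev


def tanimoto_continuous(a, b):
    """Continuous Tanimoto form; reduces to the bit-string ratio for 0/1 vectors."""
    ab = sum(x * y for x, y in zip(a, b))
    denom = sum(x * x for x in a) + sum(y * y for y in b) - ab
    return ab / denom if denom else 1.0


def dummy_background(profile, size, rng):
    """Random SIFts whose bit r is 1 with probability profile[r]."""
    return [[1 if rng.random() < p else 0 for p in profile] for _ in range(size)]


def z_score(sift, profile, background):
    """(score of sift - mean background score) / background standard deviation."""
    scores = [tanimoto_continuous(s, profile) for s in background]
    sigma = pstdev(scores)
    if sigma == 0:
        raise ValueError("background scores have zero variance")
    return (tanimoto_continuous(sift, profile) - mean(scores)) / sigma


profile = [1.0, 0.5, 0.5, 0.0]

# Small fixed background so the arithmetic can be checked by hand.
fixed_background = [[1, 0, 0, 0], [1, 1, 1, 0]]
z_fixed = z_score([1, 1, 0, 0], profile, fixed_background)

# Random dummy background, seeded so the run is repeatable.
random_background = dummy_background(profile, 500, random.Random(0))
z_random = z_score([1, 1, 0, 0], profile, random_background)
```

With the fixed background, the two background scores are 2/3 and 0.8 and the query scores 0.75, so the Z score works out to 0.25; the value obtained with the random background depends on the seed and sample size.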
A convenient way to compare p-SIFts is to calculate a difference profile by the subtraction of one p-SIFt from another. Another way to compare two SIFt, p-SIFt, or r-SIFt patterns a and b is the cosine coefficient, given by:
$$\cos(a, b) = \frac{\sum_{i=1}^{N} a_i b_i}{\sqrt{\sum_{i=1}^{N} a_i^2 \, \sum_{i=1}^{N} b_i^2}}$$
where N denotes the number of bits in the SIFt patterns. The cosine coefficient can be applied to measure the similarity between a difference profile, d, and a SIFt pattern, c, where
$$\cos(d, c) = \frac{\sum_{i=1}^{N} d_i c_i}{\sqrt{\sum_{i=1}^{N} d_i^2 \, \sum_{i=1}^{N} c_i^2}}$$
In this case, the cosine coefficient measures the similarity between the difference profile, d, and a SIFt pattern, c, by varying in value from 1 to −1. If the difference profile is given by d = a − b, then a positive value of the cosine coefficient indicates that c is more similar to a than to b, whereas a negative value indicates that c is more similar to b than to a. The cosine coefficient score is most sensitive to the bits that differentiate a and b. Consequently, the cosine coefficient may be useful in predicting selectivity between inhibitors.
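The use of a difference profile with the cosine coefficient can be illustrated with a short Python sketch. The profiles and patterns below are made-up numbers, and the function names are hypothetical.

```python
from math import sqrt


def cosine(a, b):
    """Cosine coefficient between two equal-length numeric patterns."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = sqrt(sum(x * x for x in a) * sum(y * y for y in b))
    if norm == 0:
        raise ValueError("cosine is undefined for an all-zero pattern")
    return dot / norm


def difference_profile(a, b):
    """Element-wise subtraction d = a - b of two p-SIFts."""
    return [x - y for x, y in zip(a, b)]


# Two made-up p-SIFts that differ only at bit positions 2 and 3.
profile_a = [1.0, 0.8, 0.0, 0.2]
profile_b = [1.0, 0.0, 0.8, 0.2]
d = difference_profile(profile_a, profile_b)

# A pattern with bit 2 ON resembles group a; one with bit 3 ON resembles group b.
score_like_a = cosine(d, [1, 1, 0, 0])
score_like_b = cosine(d, [1, 0, 1, 1])
```

Bits that are equally frequent in both profiles cancel in the difference, which is why the score responds mainly to the bits that distinguish the two groups.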
Once a three-dimensional structure has been derived and selected positions (e.g., binding site residues) identified, a plurality of intermolecular interaction types occurring at each selected position is determined and measured, using any computational methods well known in the art. These interaction types can also include chemical and physical properties of the part of a ligand interacting with each selected position, and sequence conservation, structural conservation and flexibility properties of each selected position.
A SIFt for each target molecule-ligand complex structure is generated. The SIFt includes a numeric (e.g., binary) code representation of each interaction type determined/measured for each of the selected positions of the target molecule.
The SIFt containing information regarding the characteristics of the interaction types at each selected position is stored within a database for subsequent retrieval and analysis. Alternatively, the SIFt can be used to query a database, generate a p-SIFt comprising possible alternative ligands that fit the SIFt, and/or define a structure based upon the type of SIFt obtained.
In one embodiment, a primary amino acid sequence of a polypeptide target molecule that is encoded by a selected genetic sequence is determined, and a three-dimensional structure is generated by homology modeling techniques. This aspect is generally represented in
In one embodiment, a ligand's three-dimensional structure is also obtained by similar techniques (e.g., modeling techniques and/or experimental crystallization techniques). For example, many protein molecules are co-crystallized with substrates and/or ligands. The three-dimensional ligand binding structure can then be modeled using programs that demonstrate interactions with a putative protein target molecule or binding domain thereof. Thus, one of skill in the art utilizing the 3D-protein structure and/or the 3D-ligand structure can obtain interaction data for the molecules being characterized. The ligand molecule may be any of a number of different types of compositions such as organic molecules, inorganic molecules, ions, proteins, protein fragments, nucleotides, RNA, DNA or other molecules representative of substrates, ligands, co-factors, and the like. In one embodiment, the ligand is obtained from a library of molecules.
Upon formation of the 3D complex structure, the interaction of the target molecule with a ligand is computed. Positions (e.g., amino acid residues) that play a role in the interaction with the ligand are selected. This is generally represented in
The target molecule-ligand interactions that are modeled result in the identification of certain selected positions (e.g., amino acid residues or bases) as well as the nature of interaction types between the ligand and the target molecule. The interaction types between a ligand and a particular selected position will depend upon the chemical-physical characteristics of the selected position in the target molecule as well as the nature of atoms or groups of atoms present in the ligand. For example, one of skill in the art will recognize that various equilibrium binding constants or binding energy values will be determinative in the type of interactions that will occur. This process is represented in
The selected positions that play a role in interacting with the ligand as well as the interaction types that occur with each selected position are then used to generate a SIFt (see, e.g.,
In certain embodiments, the SIFt fingerprint records the presence or absence of an interaction with a protein. The information unit containing this information can simply indicate whether or not a residue is involved in a particular interaction. In other embodiments, the SIFt can also include other chemical information about the ligand. In one example, a SIFt can include an information unit that contains information about a combinatorial library, which can include a core and variable groups (in some examples, two, three or more R groups). Specifically, a small molecule library can be converted into a core and variable groups, a SIFt pattern can be created for each library member, and information units can be turned on or off at each of the selected positions based on the nature of the contact between the core and variable groups and the protein target. In another example, a SIFt can include an information unit that contains chemical feature information. For example, a series of chemical features can be mapped onto the ligand molecule. Each residue can be represented by an information block of a series of information units, each of which can be turned on or off depending on whether this residue is interacting with a particular chemical feature on the ligand. Examples of suitable chemical features include hydrophobic, hydrogen bond donor, hydrogen bond acceptor, negatively charged, positively charged, etc. In another example, a computed or experimentally determined property can be included in a SIFt.
Information blocks that include these properties can be used to identify chemical groups that are associated with specific residues of the protein.
A Tanimoto coefficient can be used as the similarity measurement between two r-SIFts. When a group of docking poses is generated for a target molecule-ligand complex, the best docking poses (i.e., those with top FlexX scores) for each compound can be examined, and a best pose selected for each. The selected pose can make conserved interactions with the target. An agglomerative hierarchical clustering can be applied to analyze and reorganize a group of poses, for example using Tanimoto coefficients as the similarity measurement. A dendrogram prepared from the clustering results can reveal clusters of protein-ligand complex structures; poses that cluster together can have similar binding interactions.
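By way of a non-limiting illustration, the Tanimoto comparison and agglomerative clustering described above can be sketched as follows. The function names, the use of average linkage, the stopping threshold, and the convention of returning 0.0 for two all-zero fingerprints are assumptions made for this sketch and are not taken from the described embodiments.

```python
from itertools import combinations


def tanimoto(a, b):
    """Tanimoto coefficient between two equal-length bit strings of '0'/'1'."""
    if len(a) != len(b):
        raise ValueError("fingerprints must have the same length")
    both = sum(1 for x, y in zip(a, b) if x == "1" and y == "1")
    either = sum(1 for x, y in zip(a, b) if x == "1" or y == "1")
    # Assumed convention: two fingerprints with no bits set score 0.0.
    return both / either if either else 0.0


def cluster(fingerprints, threshold):
    """Average-linkage agglomerative clustering of fingerprints.

    Starts with one cluster per fingerprint and repeatedly merges the most
    similar pair of clusters until the best similarity drops below
    `threshold`. Returns a list of clusters, each a list of indices.
    """
    clusters = [[i] for i in range(len(fingerprints))]

    def similarity(c1, c2):
        total = sum(tanimoto(fingerprints[i], fingerprints[j])
                    for i in c1 for j in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > 1:
        i, j = max(combinations(range(len(clusters)), 2),
                   key=lambda p: similarity(clusters[p[0]], clusters[p[1]]))
        if similarity(clusters[i], clusters[j]) < threshold:
            break
        clusters[i] = clusters[i] + clusters[j]  # i < j, so index i stays valid
        del clusters[j]
    return clusters


if __name__ == "__main__":
    poses = ["1100", "1110", "0011", "0001"]
    print(cluster(poses, threshold=0.4))
```

Recording the order and similarity of each merge, rather than stopping at a threshold, would give the data needed to draw a dendrogram.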
Combining SIFt-based approaches and conventional scoring functions can yield better results in reproducing the true binding modes of the compounds and better library enrichment performance. When docking known ligands, the best pose given by a conventional scoring function may not adopt the native binding mode; however, a good placement with the correct binding mode can usually be found among the top 10 poses.
As discussed above, one embodiment involves the use of a seven-bit information block (e.g., contact, main-chain atom group, side-chain atom group, polar, non-polar, hydrogen bond donor, hydrogen bond acceptor) to represent the interaction pattern of each selected position of the target molecules (e.g., binding site residue of a protein target molecule). In such an embodiment, the interaction pattern represents the binding modes formed from seven different interaction types. Although such an implementation is able to successfully organize, analyze and mine a large structural library in a meaningful way, a 7-bit-long binary string does not represent all the intermolecular interactions occurring at a particular selected position. The richness of information can be improved by incorporating more bits representing other interaction types. For example, one can focus on functional groups instead of the entire residue as the basic unit, or take solvent molecules into consideration, or substitute the Boolean bits with ordinal or continuous data that reflect the strength and energetics of the interaction types. Such an enriched SIFt provides a “higher-resolution” picture of the target molecule-ligand binary complex. In situations where computational speed is a critical issue, “lower-resolution” SIFts using fewer information units may be used. Accordingly, the information units for a particular selected position (i.e., the size of the information block) may range from 1-50 units or more. Simpler SIFts can be constructed in less time at the expense of richness of information. One skilled in the art can design, select, and identify the number of information units (and thus the size of the information block) for a particular selected position based upon the details and speed desired. For example, shorter information strings (containing, e.g., 2-3 information units per information block) may be useful during the initial screening of a huge virtual library.
On the other hand, longer information strings (and hence longer SIFts) provide more information at the expense of quick performance and are more useful for detailed structural analysis such as comparing groups of closely related structures. Choosing the right size of SIFt is a matter of finding a proper balance between these two competing considerations, with that balance dictated by the needs of a given situation. Another variable is the relative weight given to each interaction type. In one embodiment, information units reflecting each interaction type can contribute equally to the total similarity score. It is also possible to tailor them in a different way by focusing on one or more particular interaction types, while down-playing other kinds of interactions.
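As a non-limiting sketch of how the relative weight of each interaction type might enter the similarity score, a weighted form of the Tanimoto coefficient can be written as follows. The weighted formula, the function name, and the example weights are assumptions made for illustration; the described embodiments do not prescribe a particular weighting scheme.

```python
def weighted_tanimoto(a, b, weights):
    """Weighted Tanimoto similarity between two fingerprints.

    a, b    -- lists of 0/1 integers made of concatenated information blocks
    weights -- one non-negative weight per information unit within a block;
               the same weights are reused for every block
    """
    block = len(weights)
    if len(a) != len(b) or len(a) % block != 0:
        raise ValueError("fingerprints must be equal-length multiples of the block size")
    shared = 0.0
    union = 0.0
    for k, (x, y) in enumerate(zip(a, b)):
        w = weights[k % block]      # weight of this interaction type
        shared += w * (x & y)       # bit set in both fingerprints
        union += w * (x | y)        # bit set in at least one fingerprint
    # Assumed convention: no weighted bits set anywhere scores 0.0.
    return shared / union if union else 0.0


if __name__ == "__main__":
    fp1 = [1, 1, 0, 1, 0, 0]   # two positions, three units per block
    fp2 = [1, 0, 0, 1, 1, 0]
    print(weighted_tanimoto(fp1, fp2, [1, 1, 1]))  # equal contribution
    print(weighted_tanimoto(fp1, fp2, [1, 0, 0]))  # only the first type counts
```

With equal weights the function reduces to the ordinary Tanimoto coefficient, which corresponds to the embodiment in which every interaction type contributes equally.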
Another embodiment uses an information block to represent positions on a ligand (e.g., the bits represent a core and R groups). The number of bits in the information block can be selected with regard to the structure of a compound, e.g., the number of R-groups present. Each bit can have a 0 or 1 value, for example, to represent the presence of a contact between an atom at that position on the ligand and an atom of the target molecule. r-SIFt is a variation of structural interaction fingerprint (SIFt). r-SIFt incorporates the binding information about different variable R-groups of a compound into the fingerprint. It was specifically designed for processing and analyzing virtual screening results of combinatorial libraries. In SIFt, the interaction bits represent the presence or absence of different types of interactions (contact, polar interaction, hydrogen bonds, hydrophobic interaction, etc.) occurring at each selected residue, whereas in r-SIFt, the interaction bits represent whether or not a certain R-group or core fragment of the compound makes contact interaction (i.e., within a distance threshold) with a particular protein residue.
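For illustration only, an r-SIFt of the kind just described can be sketched as one contact bit per (residue, fragment) pair. The input layout (dictionaries of atomic coordinates), the function name, and the 4.0 Å default cutoff are hypothetical choices made for this sketch; the text above specifies only that a distance threshold is used.

```python
import math


def r_sift(fragments, residues, cutoff=4.0):
    """Build an r-SIFt bit string.

    fragments -- dict mapping a fragment name (e.g. "core", "R1") to a list
                 of (x, y, z) ligand atom coordinates
    residues  -- dict mapping a residue number to a list of (x, y, z)
                 protein atom coordinates
    cutoff    -- contact distance threshold (assumed value, same units as
                 the coordinates)

    Residues are visited in ascending number order; within each residue,
    fragments are visited in the order given in `fragments`.
    """
    bits = []
    for resnum in sorted(residues):
        for name in fragments:
            in_contact = any(
                math.dist(p, q) <= cutoff
                for p in fragments[name]
                for q in residues[resnum]
            )
            bits.append("1" if in_contact else "0")
    return "".join(bits)


if __name__ == "__main__":
    ligand = {"core": [(0.0, 0.0, 0.0)], "R1": [(10.0, 0.0, 0.0)]}
    site = {106: [(12.0, 0.0, 0.0)], 35: [(0.0, 3.0, 0.0)]}
    print(r_sift(ligand, site))
```

In this toy input the core contacts only residue 35 and the R1 group contacts only residue 106, so each residue's two-bit block has a single bit turned on.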
One advantageous feature of the SIFt-based method is that it is generic. The SIFt method works well for the protein target molecule and small molecule ligand system, and can also work for other systems including protein-protein, nucleic acid-ligand, nucleic acid-protein/polypeptide systems, and the like. Indeed, the methods and systems can be applied to amino acid sequences as well as nucleotide sequences. For example, the methods can be applied to a nucleotide sequence or an amino acid sequence which corresponds to the nucleotide sequence in question. If the coding sequence is not known, translation from the nucleotide sequence to the amino acid sequence may be performed in all frames of the nucleotide sequence. Programs that can translate a nucleotide sequence are known in the art.
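As a minimal sketch of translating a nucleotide sequence in all frames, the following uses the standard genetic code and reports three forward and three reverse-complement frames. The function names and the frame labels are assumptions of this sketch, and it is not one of the translation programs referred to above.

```python
BASES = "TCAG"
# Standard genetic code, codons ordered with T, C, A, G at each position.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def translate(seq):
    """Translate a DNA string codon by codon; unknown codons become 'X'."""
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3)
    )


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def six_frame_translation(dna):
    """Translate a DNA sequence in three forward and three reverse frames."""
    dna = dna.upper()
    rc = reverse_complement(dna)
    frames = {}
    for offset in range(3):
        frames[f"+{offset + 1}"] = translate(dna[offset:])
        frames[f"-{offset + 1}"] = translate(rc[offset:])
    return frames


if __name__ == "__main__":
    for label, protein in six_frame_translation("ATGGCCTAA").items():
        print(label, protein)
```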
In one embodiment, the method can start by identifying a primary amino acid sequence of a protein. A number of source databases are available, as described below, that contain nucleotide sequences and/or deduced amino acid sequences for use with this step.
The primary direct experimental methods for determining the structure of proteins involved in particular interactions are X-ray crystallography, relying on the interaction of electron clouds with X-rays; and liquid nuclear magnetic resonance (NMR), relying on correlations between polarized nuclear spins interacting via indirect dipole-dipole interactions. X-ray methods provide information on the location of every heavy atom in a crystal of interest, accurate to 0.5-2.0 Å (1 Å = 10⁻¹⁰ m).
A number of databases are available that contain 3D protein structures and/or structures showing 3D protein-ligand interactions. For example, protein-protein interaction databases include the Biomolecular Interaction Network Database (BIND), which is a database designed to store full descriptions of interactions, molecular complexes and pathways; Database of Interacting Proteins (DIP), which catalogs experimentally determined interactions between proteins; an Object Oriented Database for Protein-Protein Interactions (INTERACT); and Pronet Online, which provides protein-protein interaction data and is maintained by Myriad Genetics. Other structural databases include Cambridge Crystallographic Data Centre; CATH-Protein Structure Classification; SCOP (Structural Classification of Proteins), based upon 3D fold classifications; PARTS LIST, which dynamically performs comparative fold surveys and is built on top of SCOP's fold classification and acts as an accompanying annotation; PDB (Protein Data Bank), which is an international repository for the processing and distribution of 3D macromolecular structure data primarily determined experimentally by X-ray crystallography and NMR; PRESAGE, a database for structural genomics; Structural Biology Software Database, a software database maintained by University of Illinois; BiMSSECOST, a conformational database for amino acid residues in proteins; BioMagResBank, a repository for data on proteins, peptides, and nucleic acids from NMR spectroscopy; SWISS-3DIMAGE 3D, which contains images of proteins and other biological macromolecules; SWISS-MODEL, a repository of structures generated by protein modeling; and the Cambridge Structural Database (CSD) of the Cambridge Crystallographic Data Center (CCDC). Other sources of primary amino acid sequence, modeled 3D structures and other crystallographical data will be apparent to those of skill in the art.
The various techniques, methods, and aspects described above can be implemented in part or in whole using computer-based systems and methods. Additionally, computer-based systems and methods can be used to augment or enhance the functionality described above, increase the speed at which the functions can be performed, and provide additional features and aspects as a part of or in addition to those described elsewhere in this document. Various computer-based systems, methods and implementations in accordance with the above-described technology are presented below.
In one implementation, a general-purpose computer may have an internal or external memory for storing data and programs such as an operating system (e.g., DOS, Windows 2000™, Windows XP™, Windows NT™, OS/2, UNIX or Linux) and one or more application programs. Examples of application programs include computer programs implementing the techniques described herein, authoring applications (e.g., word processing programs, database programs, spreadsheet programs, or graphics programs) capable of generating documents or other electronic content; client applications (e.g., an Internet Service Provider (ISP) client, an e-mail client, or an instant messaging (IM) client) capable of communicating with other computer users, accessing various computer resources, and viewing, creating, or otherwise manipulating electronic content; and browser applications (e.g., Microsoft's Internet Explorer) capable of rendering standard Internet content and other content formatted according to standard protocols such as the Hypertext Transfer Protocol (HTTP).
One or more of the application programs may be installed on the internal or external storage of the general-purpose computer. Alternatively, in another implementation, application programs may be externally stored in and/or performed by one or more device(s) external to the general-purpose computer.
The general-purpose computer includes a central processing unit (CPU) for executing instructions in response to commands, and a communication device for sending and receiving data. One example of the communication device is a modem. Other examples include a transceiver, a communication card, a satellite dish, an antenna, a network adapter, or some other mechanism capable of transmitting and receiving data over a communications link through a wired or wireless data pathway.
The general-purpose computer may include an input/output interface that enables wired or wireless connection to various peripheral devices. Examples of peripheral devices include, but are not limited to, a mouse, a mobile phone, a personal digital assistant (PDA), a keyboard, a display monitor with or without a touch screen input, and an audiovisual input device. In another implementation, the peripheral devices may themselves include the functionality of the general-purpose computer. For example, the mobile phone or the PDA may include computing and networking capabilities and function as a general-purpose computer by accessing the delivery network and communicating with other computer systems. Examples of a delivery network include the Internet, the World Wide Web, WANs, LANs, analog or digital wired and wireless telephone networks (e.g., Public Switched Telephone Network (PSTN), Integrated Services Digital Network (ISDN), and Digital Subscriber Line (xDSL)), radio, television, cable, or satellite systems, and other delivery mechanisms for carrying data. A communications link may include communication pathways that enable communications through one or more delivery networks.
In one implementation, a processor-based system (e.g., a general-purpose computer) can include a main memory, preferably random access memory (RAM), and can also include a secondary memory. The secondary memory can include, for example, a hard disk drive and/or a removable storage drive, representing a floppy disk drive, a magnetic tape drive, an optical disk drive, etc. The removable storage drive reads from and/or writes to a removable storage medium. A removable storage medium can include a floppy disk, magnetic tape, optical disk, etc., which can be removed from the storage drive used to perform read and write operations. As will be appreciated, the removable storage medium can include computer software and/or data.
In alternative embodiments, the secondary memory may include other similar means for allowing computer programs or other instructions to be loaded into a computer system. Such means can include, for example, a removable storage unit and an interface. Examples of such can include a program cartridge and cartridge interface (such as that found in video game devices), a removable memory chip (such as an EPROM or PROM) and associated socket, and other removable storage units and interfaces, which allow software and data to be transferred from the removable storage unit to the computer system.
In one embodiment, the computer system can also include a communications interface that allows software and data to be transferred between the computer system and external devices. Examples of communications interfaces can include a modem, a network interface (such as, for example, an Ethernet card), a communications port, and a PCMCIA slot and card. Software and data transferred via a communications interface are in the form of signals, which can be electronic, electromagnetic, optical or other signals capable of being received by a communications interface. These signals are provided to the communications interface via a channel capable of carrying signals, which can be implemented using a wireless medium, wire or cable, fiber optics or other communications medium. Some examples of a channel can include a phone line, a cellular phone link, an RF link, a network interface, and other suitable communications channels.
In this document, the terms “computer program medium” and “computer usable medium” are generally used to refer to media such as a removable storage device, a disk capable of installation in a disk drive, and signals on a channel. These computer program products provide software or program instructions to a computer system.
Computer programs (also called computer control logic) are stored in the main memory and/or secondary memory. Computer programs can also be received via a communications interface. Such computer programs, when executed, enable the computer system to perform the features as discussed herein. In particular, the computer programs, when executed, enable the processor to perform the described techniques. Accordingly, such computer programs represent controllers of the computer system.
In an embodiment where the elements are implemented using software, the software may be stored in, or transmitted via, a computer program product and loaded into a computer system using, for example, a removable storage drive, hard drive or communications interface. The control logic (software), when executed by the processor, causes the processor to perform the functions of the techniques described herein.
In another embodiment, the elements are implemented primarily in hardware using, for example, hardware components such as PAL (Programmable Array Logic) devices, application specific integrated circuits (ASICs), or other suitable hardware components. Implementation of a hardware state machine so as to perform the functions described herein will be apparent to a person skilled in the relevant art(s). In yet another embodiment, elements are implemented using a combination of both hardware and software.
In another embodiment, the computer-based methods can be accessed or implemented over the World Wide Web by providing access via a Web Page to the methods described herein. Accordingly, the Web Page is identified by a Uniform Resource Locator (URL). The URL denotes both the server and the particular file or page on the server. In this embodiment, it is envisioned that a client computer system interacts with a browser to select a particular URL, which in turn causes the browser to send a request for that URL or page to the server identified in the URL. Typically the server responds to the request by retrieving the requested page and transmitting the data for that page back to the requesting client computer system (the client/server interaction is typically performed in accordance with the Hypertext Transfer Protocol (HTTP)). The selected page is then displayed to the user on the client's display screen. The client may then cause the server containing a computer program to launch an application to, for example, perform an analysis according to the described techniques. In another implementation, the server may download an application to be run on the client to perform an analysis according to the described techniques.
The described techniques open up the possibility of using an informatics approach in three-dimensional structure analysis and structure-based drug discovery. One application is in the virtual chemical library screening process. As discussed herein, SIFt can serve as a post-docking molecular organizer and filter. Docking poses can be organized based on their overall interaction patterns or binding modes. Furthermore, any previously acquired knowledge can be applied as structural constraints to filter out unwanted poses. Compared to pharmacophore-based filters, the SIFt-based method is far more generic, flexible and easy to apply. In combination with other pre-existing approaches such as empirical docking scores, the SIFt-based method can weed out more false-positive compounds with undesirable properties, leaving a smaller but better pool of lead compounds, and thus significantly improve the hit rate.
In addition, the SIFt-based approach can be applied in designing, refining and pruning target-focused chemical libraries. As shown in examples 4 and 11, different embodiments of SIFt (e.g., r-SIFt) can be very effective tools for discriminating compounds with different binding modes. With r-SIFt, one can easily distinguish compounds that bind to the target molecule with desirable binding mode(s) and others that do not. Based on this compound classification result, we can then generate prediction models (e.g., decision tree, neural network, support-vector machine) to predict the binding modes of compounds using their chemical properties as predictors. Such prediction models can be applied in the early stage of virtual library screening to filter out undesirable compounds in order to generate a smaller, target-specific pool of compounds.
Besides processing the virtual structures generated during chemical library screening, the SIFt-based method can be used to analyze experimentally determined structures. Furthermore, the methods are not limited to structures involving one particular target molecule; the method is generic enough to work for structures of a family of target molecules (e.g., the kinase family). The prerequisite is that these target molecules are structurally related, so that a common framework of the ligand-binding site can be constructed. By using this method, distinct sub-groups of target molecule-ligand (e.g., enzyme-inhibitor) complex structures, each of which represents a distinct overall interaction pattern, can be identified. The identified sub-groups of these target molecule-ligand complexes can also be classified according to other grouping criteria, such as grouping by different target molecule, by different types of ligands, or by different conformations.
Quantitative comparisons of these clusters would reveal interaction patterns specific for a particular group and thus could provide structural insight into the mechanism of binding activity and selectivity. In addition, the p-SIFt can capture the common features among a group of ligand-target molecule structures. It can be used to compare different groups of structures, and to correlate the differences or commonality in their SIFt profiles to their activities.
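One way to picture how a profile can capture the common features of a group of structures is to compute, for each bit position, the fraction of fingerprints in the group that have that bit turned on. This per-bit frequency reading of the p-SIFt, together with the function names and the 0.9 conservation cutoff, is an assumption adopted for this sketch.

```python
def profile(fingerprints):
    """Per-bit frequency profile of a group of equal-length bit strings.

    Returns a list of floats in [0, 1]; element i is the fraction of
    fingerprints whose bit i is '1'.
    """
    if not fingerprints:
        raise ValueError("at least one fingerprint is required")
    length = len(fingerprints[0])
    if any(len(fp) != length for fp in fingerprints):
        raise ValueError("all fingerprints must have the same length")
    n = len(fingerprints)
    return [sum(fp[i] == "1" for fp in fingerprints) / n for i in range(length)]


def conserved_bits(profile_values, min_fraction=0.9):
    """Indices of bits present in at least `min_fraction` of the group."""
    return [i for i, f in enumerate(profile_values) if f >= min_fraction]


if __name__ == "__main__":
    group = ["1100", "1010", "1000"]
    p = profile(group)
    print(p)
    print(conserved_bits(p))
```

Profiles computed for two groups of structures can then be compared position by position to highlight interactions that are frequent in one group and rare in the other.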
In sum, the methods of characterization and generation of information strings representing SIFts provided by the described techniques are an improvement over conventional characterization methodologies that typically rely on sequence-based comparisons. The SIFt facilitates and integrates several desirable functionalities including structural data visualization, organization, analysis, and mining together, making it a powerful tool for analyzing and profiling three-dimensional binding interactions. As mentioned above, a particularly useful feature of this method is that it compares and reveals associations (e.g., binding similarities) between dissimilar target molecules (e.g., proteins that may have functional or behavioral analogies that are not otherwise apparent due to differences in the protein sequence).
The described techniques (including SIFt-based methods, computer implementations, systems, and databases) disclosed herein translate three-dimensional intermolecular interactions into simple, linear information strings, thereby making it possible to efficiently analyze large libraries of structures using the mathematics and informatics methods described herein. Although conceptually simple, the described techniques provide a novel method of visualizing, organizing, analyzing, and mining 3D structural information. The SIFt method organizes target molecule-ligand complex structures into groups based on their interaction patterns. Intermolecular interactions between target molecules and ligands are visualized and can be easily comprehended using a heat-map of the SIFts for data visualization. Specifically, each line represents one fingerprint (or SIFt), and each bit in the SIFt is colored or shaded according to its value. Using the described techniques, conserved/unconserved interactions within or among different sub-groups of structures can be compared and quantified (data analysis). In addition, by representing the target molecule-ligand complex structures using SIFts, a query can be performed on the interactions to select complexes (or ligands) that satisfy predefined criteria (e.g., a certain interaction pattern or binding mode, or even a particular interaction type occurring at a selected position), in a way similar to querying a database (data mining).
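The database-style query described above can be sketched, without limitation, as a filter over a dictionary of fingerprints. The data layout, the function names, and the interaction-type labels used in the example are hypothetical and serve only to illustrate selecting complexes by a particular interaction type at a selected position.

```python
def bit_index(residues, types, residue, interaction):
    """Position of one (residue, interaction type) bit in the fingerprint.

    residues -- list of selected positions in fingerprint order
    types    -- tuple of interaction-type labels in information-block order
    """
    return residues.index(residue) * len(types) + types.index(interaction)


def query(library, residues, types, required):
    """Select complexes whose fingerprints satisfy every required interaction.

    library  -- dict mapping a complex (or pose) name to its bit string
    required -- list of (residue, interaction) pairs that must all be on
    """
    indices = [bit_index(residues, types, r, t) for r, t in required]
    return [
        name for name, fp in library.items()
        if all(fp[i] == "1" for i in indices)
    ]


if __name__ == "__main__":
    residues = [53, 109]
    types = ("contact", "hbond")          # two-unit block for brevity
    library = {"pose1": "1110", "pose2": "1000", "pose3": "0011"}
    print(query(library, residues, types, [(109, "hbond")]))
    print(query(library, residues, types, [(53, "contact")]))
```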
The following examples are provided to illustrate the practice of the described techniques, and in no way limit the scope of the claims.
Color versions of
The protein kinase family exemplifies the challenges presented by the large amount of structural data being generated not only on specific drug targets, but also at the gene family level. See, for example, Cohen, P. Nat. Rev. Drug Discov. 2002, 1, 309-315; ter Haar, E.; et al. Mini. Rev. Med. Chem. 2004, 4, 235-253; Manning, G.; et al. Science 2002, 298, 1912-1934; and Vieth, M.; et al. Biochim. Biophys. Acta 2004, 1697, 243-257, each of which is incorporated by reference in its entirety. For example, there exist over 100 structures of protein kinase small molecule complexes that have been deposited in the public domain including 34 different kinase family members. In the Examples below, p-SIFt is applied to analyzing the similarities and differences between ATP, p38 and CDK2 inhibitors binding to the protein kinase family. p-SIFt was able to not only enrich p38 and CDK2 inhibitors, but also importantly show how it can be selective in its enrichment.
Since the majority of kinase inhibitors bind to a conserved ATP site on the enzyme, the ability to understand the selectivity profile for an inhibitor is critical. In silico approaches to predict which kinase inhibitors may cross-react would help avoid downstream toxicity issues as well as enable “target-hopping”, where an inhibitor to a given kinase is used to discover a lead inhibitor for a new target (see, for example, Singh, J.; et al. Bioorg. Med. Chem. Lett. 2003, 13, 4355-4359, which is incorporated by reference in its entirety). Knowledge-based filters applied to virtual libraries during VS preferably enrich libraries with ligands that are likely kinase inhibitors, and are target specific, biasing hits away from undesired “anti-targets” while selecting for ligands that satisfy particular specificity conferring interactions in the target. p-SIFt is a useful tool for analyzing complexes from X-ray and NMR, and for database mining for the selective enrichment of compounds against specific drug targets.
In Example 1, a set of molecular docking results was generated employing the crystal structure of p38 in complex with a pyridinyl imidazole inhibitor SB203580 (PDB accession code: 1a9u). See, e.g., Wang et al. Structure, 1998, 6(9), 1117-1128. The docking program FlexX (see Rarey et al. J. Mol. Biol., 1996, 261, 470-489) in Sybyl (version 6.8, Tripos, Inc., St. Louis, MO) was used to dock SB203580 onto the crystal structure of p38. In this single ligand study, 100 poses of SB203580 generated by FlexX were retained for subsequent analyses. The ligand binding site was defined using a cutoff radius of 12 Å from the SB203580 ligand (i.e., the conformation in the crystal structure) combined with a core sub-pocket cutoff distance of 4 Å. The FlexX scoring function was used for scoring the docking. For each ligand being studied, ChemScore, Gscore, PMF Score, Dscore, and Consensus Score were evaluated using the Cscore utility in Sybyl. For references of the just-mentioned applications, see, e.g., Eldridge et al. J. Comput.-Aided Mol. Des. 1997, 11, 425-445; Jones et al. J. Mol. Biol. 1997, 267, 727-748; Muegge et al. J. Med. Chem., 1999, 42(5), 791-804; Gohlke et al. J. Mol. Biol., 2000, 295, 337-356; and Charifson et al. J. Med. Chem., 1999, 42(25), 5100-5109.
In Example 2, the experiment described was designed to evaluate the database enrichment potential of SIFt by docking a diverse set of compounds spiked with known actives onto the same target protein structure. To this end, 16 known p38 inhibitors were combined with 1,000 small molecules with diverse chemical structures compiled internally.
These inhibitors were pyridinylimidazoles and analogs, covering the majority of the p38 inhibitor families reported thus far, as previously discussed by Adams and Lee (see Adams and Lee. Current Opinion Drug Discovery & Development. 1999, 2, 96-109). These 1,016 compounds were docked onto the p38 structure (1a9u) using FlexX distributed across 50 dual processor nodes of a Linux computing farm. For each ligand, 30 different poses generated from the docking experiment were retained, generating a library of 30,480 (30×1,016) docked ligand structures for subsequent interaction fingerprint analysis. The performance of database enrichment was measured by the enrichment factor (EF), calculated based on the ability to recover 14 out of 16 (87.5%) known inhibitors. For reference, see, e.g., Pearlman et al. J. Med. Chem. 2001, 44, 502-511. In both docking experiments, three-dimensional conformers of the ligands were generated using OMEGA (OpenEye Scientific Software, Inc., Santa Fe, NM).
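To make the enrichment measurement concrete, the sketch below computes an enrichment factor at the point in a ranked list where a given number of known actives has been recovered. The formula used, (actives found / compounds examined) divided by (total actives / library size), is one commonly used definition and is an assumption here; the exact definition applied in Example 2 follows the cited reference and is not reproduced. The function name and the toy ranking are hypothetical.

```python
def enrichment_factor(ranked_ids, active_ids, recover):
    """Enrichment factor at the rank where `recover` actives have been found.

    ranked_ids -- compound identifiers ordered from best to worst score
    active_ids -- identifiers of the known actives (all present in ranked_ids)
    recover    -- number of actives that must be recovered
    """
    actives = set(active_ids)
    found = 0
    for n, compound in enumerate(ranked_ids, start=1):
        if compound in actives:
            found += 1
            if found == recover:
                hit_rate_in_subset = found / n
                hit_rate_in_library = len(actives) / len(ranked_ids)
                return hit_rate_in_subset / hit_rate_in_library
    raise ValueError("fewer than `recover` actives appear in the ranking")


if __name__ == "__main__":
    # Library size and recovery fraction quoted in Example 2.
    print(30 * 1016)   # number of docked ligand structures
    print(14 / 16)     # fraction of known inhibitors to recover

    # Toy ranking (hypothetical identifiers): 3 actives among 8 compounds.
    ranking = ["a1", "d1", "a2", "d2", "d3", "d4", "a3", "d5"]
    print(enrichment_factor(ranking, ["a1", "a2", "a3"], recover=2))
```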
In Example 3, the SIFt-based method was also used to analyze a family of experimentally determined structures. Specifically, a panel of 89 X-ray crystal structures of protein kinase-ligand complexes was selected from the PDB. The selection criteria included:
1) the structures must contain ligands (either ATP, GTP or other inhibitors) present in their ATP-binding pockets; 2) most of the ATP binding site residues are visible and present in the crystal structures. These 89 protein kinase-inhibitor complexes include 25 different kinases, covering 14 different protein kinase subfamilies as classified by Hanks and Quinn. See Hanks and Hunter FASEB J. 1995, 9, 576-596 and Hanks and Quinn Methods Enzymol., 1991, 200, 38-62. In all, the kinase structures contain 54 unique compounds representing a variety of chemical structures (see Table 1).
In each of Examples 1-3, the first step in the construction of SIFts is to identify a list of selected positions or binding site residues that are common to all complex structures being studied. The resulting panel of ligand binding site residues, which covered all of the interactions occurring between the target protein and the ligands, was then used as the common reference frame to construct the interaction fingerprints.
For a group of structures involving the same target protein (experiments such as those described in Examples 1 and 2), the ligand binding site is defined as the list of residues comprising the union of all residues involved in ligand binding over the entire library of structures. For a group of structures involving different target molecules (such as the experiment described in Example 3), additional structural and sequence pre-alignment steps were required as described immediately below.
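For structures sharing the same target protein, the union step described above can be sketched as follows. The input layout (a dictionary mapping each structure to the set of residue numbers that contact the ligand in that structure) and the function name are assumptions of this sketch.

```python
def common_binding_site(contacts_per_structure):
    """Union of ligand-contacting residues over a library of structures.

    contacts_per_structure -- dict mapping a structure (or pose) name to the
                              set of residue numbers in contact with the ligand
    Returns the residue numbers sorted in ascending order, for use as the
    common reference frame of the fingerprints.
    """
    site = set()
    for residues in contacts_per_structure.values():
        site.update(residues)
    return sorted(site)


if __name__ == "__main__":
    library = {"pose1": {30, 53}, "pose2": {53, 109}, "pose3": {109}}
    print(common_binding_site(library))
```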
In Example 3, the crystal structure of murine PKA complexed with ATP and a peptidic inhibitor PKI (PDB accession number: 1ATP; see Zheng et al. Acta Cryst. 1993, D49, 362-365) was used as the reference model for structural and sequence alignment. Initial amino acid sequence alignment of the catalytic cores of these kinases was taken from the Protein Kinase Resources (see Smith et al. TIBS, 1997, 22(11), 444-446). Structural alignment of the kinase structures was carried out manually and focused primarily on the vicinity of the ATP binding sites. Based on the structural alignment results, sequence alignments were carefully checked and adjusted if necessary, so that all structurally equivalent residues match each other in the sequence alignment. After the sequence and structural alignments, the residues of the non-murine PKA protein kinases were renumbered and tallied to the murine PKA residue numbering system, resulting in a uniform residue numbering system for all kinases analyzed. Identification of the list of ligand binding sites was carried out as previously described using the new PKA-equivalent residue numbers.
In each of Examples 1-3, after all the ligand binding site residues were identified and all the protein-ligand intermolecular interactions were calculated, the next step was to classify these interactions, as described previously in the “Detailed Description” Section.
Seven different types of interactions occurring at each binding residue were extracted and classified from the AREAIMOL and HBPLUS results. The inquiries were: 1) whether or not the residue is in contact with the ligand; 2) whether or not any main-chain atom is involved in the contact; 3) whether or not any side-chain atom is involved in the contact; 4) whether or not a polar interaction is involved; 5) whether or not a non-polar interaction is involved; 6) whether or not the residue provides hydrogen bond acceptor(s); 7) whether or not the residue provides hydrogen bond donor(s). By doing so, each residue was represented by a seven-bit-long bit string. The whole interaction fingerprint of the complex was finally constructed by sequentially concatenating the bit strings of the binding site residues in ascending residue number order. Therefore, all interaction fingerprints are of the same length, and each bit in the fingerprint represents the presence or absence of a particular interaction at a particular binding site residue.
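The construction described above can be illustrated with a minimal sketch. The function and key names (`residue_bits`, `build_sift`, `BIT_NAMES`) are hypothetical, and the per-residue interaction flags are assumed to have already been derived from the contact and hydrogen-bond calculations; this is not the implementation used in the Examples.

```python
from typing import Dict, List

# Order of the seven per-residue inquiries listed above (names are illustrative).
BIT_NAMES = [
    "contact", "main_chain", "side_chain",
    "polar", "non_polar", "hb_acceptor", "hb_donor",
]


def residue_bits(flags: Dict[str, bool]) -> str:
    """Encode one residue as a seven-character bit string."""
    return "".join("1" if flags.get(name, False) else "0" for name in BIT_NAMES)


def build_sift(binding_site: List[int],
               interactions: Dict[int, Dict[str, bool]]) -> str:
    """Concatenate per-residue bit strings in ascending residue number order.

    Residues with no recorded interaction contribute seven zero bits, so every
    fingerprint built on the same binding site has the same length.
    """
    return "".join(
        residue_bits(interactions.get(res, {})) for res in sorted(binding_site)
    )
```

Because every complex is encoded against the same residue list, bit position k always refers to the same residue and interaction type, which is what allows fingerprints to be compared bit by bit.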
As described above in Example 1, the SIFt-based method was applied to analyze the result of a typical docking study. The docking study resulted in 100 docking poses of a small molecule inhibitor (SB203580) of p38, for which the crystal structure was known (PDB entry 1a9u). The poses adopted diverse binding modes, varied in their orientations and positions relative to the target protein and were complex to interpret visually (see
Traditionally, various scoring functions have been used to rank poses from docking studies. Scoring function scores provide an estimate of the binding strength of the compounds in order to identify the potential “good binders” from a large pool of poses, such that a selection of top scoring compounds derived from a rank ordered list of docked ligands will be enriched with active compounds. Scoring functions can be useful in discriminating the poses in the different SIFt clusters (i.e., different binding modes). In
Other clusters also overlap with each other in their docking scores. Clearly, the PMF score is a poor scoring function for discriminating poses with the true binding mode from irrelevant poses in the experiment. In an attempt to broaden the analysis of scoring functions, a consensus scoring function that consists of five commonly used scoring functions was also examined (see
The application of the SIFt-based method was extended to other ensembles of structures involving different proteins and a diverse set of small molecules. In Example 3, 89 known crystal structures of the protein kinase family that had been deposited in the Protein Data Bank were chosen. As mentioned above, they represent 14 different protein kinase subfamilies and 54 unique kinase small molecule ligands/inhibitors. The structure and sequence homology among protein kinases enabled us to analyze these structures using the SIFt-based approach.
A total of 56 residues were identified as the ligand binding site (see
The heat-map and the results from hierarchical clustering are shown in
Comparison of these fingerprints also revealed interactions that are conserved or highly variable among the structures. For instance, contact interactions with residue 57 (in PKA numbering, within the Gly-rich loop) and residue 70 (also in PKA numbering) are strictly conserved among all of the 89 protein kinase-ligand structures. Other highly conserved interactions include contacts with residues 49, 72, 120, 121, 123, 173, 184, etc. (see
The SIFt-based method provides a new and powerful tool for lead discovery and lead optimization, enabling the search for molecules in a chemical database on the basis of expected interaction patterns to a target molecule. This application was specifically tested in Example 2, where a virtual screen for a set of 16 known p38 inhibitors spiked into a diverse library of 1,000 commercially available compounds was performed. These p38 inhibitors were all ATP-competitive inhibitors, and despite representing varied chemical templates had similarities to the pyridinylimidazole series (i.e., SB203580-like) for which the crystal structure of the complex was known (1a9u).
These inhibitors and the random collection of chemical compounds were docked using FlexX onto the crystal structure of p38 (1a9u), and how well these known inhibitors could be enriched using commonly used scoring functions was assessed. These were then compared with the results from a SIFt-based enrichment involving filtering of the compounds based on their similarities in interaction patterns (measured by Tanimoto coefficient) to SB203580, a known pyridinylimidazole inhibitor of p38 for which the X-ray crystal structure was known. The rationale for SIFt-based enrichment is that these 16 known inhibitors, being analogs of the pyridinylimidazole series, are expected to bind to p38 with similar overall binding modes.
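The similarity measure mentioned above can be sketched as follows. This is the standard Tanimoto coefficient for two binary strings (bits set in both, divided by bits set in either); the function name and the convention of returning 0.0 when neither fingerprint has any bit set are assumptions made for illustration.

```python
def tanimoto(a: str, b: str) -> float:
    """Tanimoto coefficient between two equal-length bit strings of '0'/'1'."""
    if len(a) != len(b):
        raise ValueError("fingerprints must have the same length")
    both = sum(1 for x, y in zip(a, b) if x == "1" and y == "1")
    either = sum(1 for x, y in zip(a, b) if x == "1" or y == "1")
    # Assumed convention: two all-zero fingerprints are given similarity 0.0.
    return both / either if either else 0.0
```

In a filtering step of the kind described, each docked pose would be scored by its coefficient against the fingerprint of the reference complex, and poses below a chosen threshold would be discarded.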
*EF is defined as: EF = (Hits_sampled/N_sampled)/(Hits_total/N_total), where Hits_sampled is the number of known inhibitors recovered in the sampled fraction of N_sampled poses, and Hits_total is the number of known inhibitors present in the whole library of N_total compounds. Here each EF was calculated based on the ability to recover 14 out of 16 known p38 inhibitors spiked into a random library of 1,000 compounds.
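The enrichment factor defined above can be written as a short function. The function name is hypothetical, and the numbers in the usage note are purely illustrative rather than results from the Examples.

```python
def enrichment_factor(hits_sampled: int, n_sampled: int,
                      hits_total: int, n_total: int) -> float:
    """EF = (hits_sampled / n_sampled) / (hits_total / n_total)."""
    if n_sampled <= 0 or n_total <= 0 or hits_total <= 0:
        raise ValueError("sample size, library size and total hits must be positive")
    return (hits_sampled / n_sampled) / (hits_total / n_total)
```

For instance, recovering 8 of 16 actives in the first 100 entries of a 1,000-entry ranked list gives `enrichment_factor(8, 100, 16, 1000)`, i.e. a five-fold enrichment over random selection.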
These two examples illustrate two other embodiments of SIFt implementation that include the chemical information about the ligands into their SIFt patterns. In Example 4, the information about core and variable groups (R-groups) of a compound is embedded into the SIFts (e.g., r-SIFts); in Example 5, the pharmacophoric features of the compound are used.
In Example 4, the same set of 100 docking poses of SB203580 docked onto p38 used in Examples 1 and 2 was also used. The SB203580 molecule was decomposed into core, R1, R2 and R3 groups as shown in
Grouping of the SIFt patterns was carried out using the same hierarchical clustering method as described in Example 1.
In Example 5, the same set of SB203580 docking poses was used. This time, however, each atom of the molecule was assigned to one or more of seven different chemical features: hydrogen bond acceptor, hydrogen bond donor, hydrophobic, polar, negatively charged, positively charged, or aromatic ring atom. Some atoms fell into more than one of these categories. When constructing the new SIFt patterns, seven binary bits were used to represent each binding site residue, each bit indicating one of the above seven chemical features. If the residue was within 4.0 Angstroms of any ligand atom belonging to a particular chemical feature category, then the corresponding bit was turned ON (1); otherwise it remained OFF (0). The final SIFt was constructed by concatenating the binary strings for all binding site residues together, in the same order as used in Examples 1 and 4.
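A minimal sketch of this feature-based construction is given below. All names are hypothetical. It assumes that a residue counts as "within 4.0 Angstroms" of a ligand atom when any one of its atoms lies at or inside that distance, which is one reasonable reading of the description rather than a statement of the method actually used.

```python
import math
from typing import Dict, List, Sequence, Set, Tuple

Coord = Tuple[float, float, float]

# The seven chemical feature categories, in an assumed fixed bit order.
FEATURES = [
    "hb_acceptor", "hb_donor", "hydrophobic", "polar",
    "negative", "positive", "aromatic",
]


def feature_sift(residue_atoms: Dict[int, List[Coord]],
                 ligand_atoms: Sequence[Tuple[Coord, Set[str]]],
                 cutoff: float = 4.0) -> str:
    """Seven bits per residue; a bit is ON if any residue atom lies within
    `cutoff` Angstroms of a ligand atom carrying that chemical feature."""
    bits = []
    for res in sorted(residue_atoms):  # ascending residue number order
        for feat in FEATURES:
            on = any(
                feat in feats and math.dist(r_xyz, l_xyz) <= cutoff
                for l_xyz, feats in ligand_atoms
                for r_xyz in residue_atoms[res]
            )
            bits.append("1" if on else "0")
    return "".join(bits)
```

Since a ligand atom may carry several features, a single close contact can switch on more than one bit for the same residue.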
In both Examples 4 and 5, the two different constructions of the SIFt pattern provided richer information about the chemical environment around the binding site. Hierarchical clustering of these two sets of new SIFts gave similar performance in terms of separating the different binding modes of the poses, and the results were comparable with those given by the previous construction of SIFt described in Example 1. This indicates that SIFt patterns incorporating R-group information or chemical feature information are both useful ways of representing the structural information, complementary to the previous construction of SIFt.
This example demonstrates one of many potential applications of the p-SIFt. A p-SIFt represents the degree of similarity for an interaction occurring at a particular binding site among a group of structures. In this example, the value at each position is the average of all the interaction bit values occurring at this particular position within a group of SIFts.
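The averaging described above can be sketched in a few lines. The function name is hypothetical, and the fingerprints are assumed to be equal-length strings of '0' and '1'.

```python
from typing import List


def p_sift(sifts: List[str]) -> List[float]:
    """Per-position average of the bit values over a group of SIFts."""
    if not sifts:
        raise ValueError("at least one SIFt is required")
    length = len(sifts[0])
    if any(len(s) != length for s in sifts):
        raise ValueError("all SIFts must have the same length")
    n = len(sifts)
    return [sum(int(s[i]) for s in sifts) / n for i in range(length)]
```

Each value in the resulting profile lies between 0 and 1 and can be read as the fraction of structures in the group that show the corresponding interaction.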
The above p-SIFt was used to enrich p38 inhibitors from a large library. The idea behind the approach is that if a compound adopts an interaction pattern similar to that of previously known inhibitors (i.e., a p-SIFt), then it is more likely to be a true inhibitor. The statistical Z score was used to measure how significant the similarity between a SIFt and a target profile is above a certain background. The Z score is defined as
Z = (x − <xb>) / σ
where x is the Tanimoto coefficient of the SIFt against the target profile, and <xb> and σ are the mean and standard deviation, respectively, of the Tanimoto coefficients of all the SIFts in the background set against the same target profile. The background set was used to construct a reference distribution upon which the comparisons were based.
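A small sketch of the Z score calculation follows, taking the Tanimoto coefficients as given inputs: the score is the coefficient of the SIFt minus the background mean, divided by the background standard deviation. The function name is hypothetical, and the use of the population standard deviation is an assumption, since the text does not say whether the population or sample form was used.

```python
from statistics import mean, pstdev
from typing import Sequence


def z_score(x: float, background: Sequence[float]) -> float:
    """Z = (x - mean(background)) / std(background).

    `x` is the Tanimoto coefficient of one SIFt against the target profile;
    `background` holds the coefficients of the background SIFts against the
    same profile. The population standard deviation is assumed here.
    """
    sigma = pstdev(background)
    if sigma == 0:
        raise ValueError("background has zero variance; Z score is undefined")
    return (x - mean(background)) / sigma
```

A large positive value indicates that a pose resembles the target profile much more closely than a typical member of the background set.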
A library comprising sixteen known p38 inhibitors and 1000 random compounds was docked onto the p38 target molecule. For each compound, 10 poses were retained for subsequent analysis. Poses were ranked according to their SIFt Z scores against the p38 p-SIFt, generated from four co-crystal structures. The background set used in the Z score calculation included all of the docking poses. For each compound, the pose with the highest Tanimoto coefficient against the p38 profile was selected, and then all 1016 best poses were ranked according to their Z score. The database enrichment curves are shown in
From
A panel of 93 X-ray crystal structures of protein kinase-ligand complexes was selected from the PDB. The selection criteria included the following: (i) the structures were complexed with small molecules (either ATP, ATP-analogs or inhibitors) present in their ATP binding pockets; and (ii) most of the ATP binding site residues were visible and present in the crystal structures.
The crystal structures of p38 in complex with a pyridinyl imidazole inhibitor SB203580 (PDB code 1a9u) and of CDK2 complexed with 4-[3-Hydroxyanilino]-6,7-Dimethoxyquinazoline (PDB code 1di8) were used for docking studies. In each case the ligand-binding site was defined from the bound ligand using a cut-off of 10 Å. Bound waters were removed from the binding sites and the receptors were protonated at pH 7.4.
The set of known inhibitors of p38 was chosen to span several major p38 inhibitor chemotypes (see, for example, Adams, J.; and Lee, D. Curr. Opin. Drug Discovery Dev. 1999, 2, 96-109, which is incorporated by reference in its entirety). Inhibitors of CDK2 were 54 active compounds collected from the literature (see, for example, Claussen, H.; et al. Current Drug Discovery Technologies 2004, 1, 49-60, which is incorporated by reference in its entirety). These known inhibitors for p38 and CDK2 were combined with 1000 small molecules compiled internally. To ensure diversity, the decoy set was selected on the basis of structural and property diversity using the extended connectivity fingerprints (ECFP), molecular weight, and LogP in PipelinePilot. A 3D version of the ligand database was generated with the program Corina, with options set to generate flexible ring conformers and stereoisomers.
The docking program FlexX in Sybyl was used to dock the ligands onto the crystal structures of p38 and CDK2 (see, for example, Rarey, M.; et al. J. Mol. Biol. 1996, 261, 470-489; and Kramer, B.; et al. Proteins 1999, 37, 228-241, each of which is incorporated by reference in its entirety). In each study 30 ligand poses generated by FlexX were retained for subsequent analyses. The FlexX scoring function was used for scoring the docking.
A background set of SIFt patterns was used to define a reference distribution upon which the comparisons were based. For the kinase crystal structures analysis, a background set of dummy SIFts around an all kinase p-SIFt was generated. The p-SIFt from all 93 kinase crystal structures was first calculated (see
For the 93 structures, 56 ligand binding site residues were used to construct SIFts. Those playing a significant role in interactions with ligands are listed in Table 3, along with their uniform PKA residue numbering.
Table 3 presents a summary of the raw frequencies observed for contact interactions. Only those residues having a frequency greater than 0.4 for any subgroup are listed. Residues having an interaction frequency of ≧0.7 are considered to be conserved; those less than 0.7 but greater than or equal to 0.4 are considered to be intermediate; and those less than 0.4, variable. Entries in the annotation columns including * indicate that the frequency was defined as conserved (≧0.7) for all subgroups independently. Wherever possible, information on the context of the interaction in binding ATP or inhibitors is included as an annotation.
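The three frequency categories defined above map directly onto a small helper. The function name and parameter names are hypothetical; the default thresholds are the 0.7 and 0.4 values stated in the text.

```python
def classify_frequency(freq: float,
                       conserved: float = 0.7,
                       intermediate: float = 0.4) -> str:
    """Label an interaction frequency as conserved, intermediate or variable."""
    if freq >= conserved:
        return "conserved"
    if freq >= intermediate:
        return "intermediate"
    return "variable"
```

Applying this helper to every position of a contact-only profile yields the per-residue labels of the kind summarized in the table.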
The hierarchical clustering of SIFts computed for the 93 kinase structures, described above, revealed three major clusters representing three dominant interaction patterns present in the ligand-kinase complexes. Cluster 1 is composed of 9 structures of small molecule inhibitors interacting with p38 kinase (herein referred to as the p38 cluster).
Similarly, Cluster 2 is composed of 20 structures for complexes involving inhibitors of CDK2 kinase (denoted as the CDK2 cluster). The largest distinct group, Cluster 3, is made up of 9 ATP and 16 ATP-analogs complexed with different kinases, which will be termed the ATP-group (ATPg) cluster. The remaining roughly one third of the structures do not belong to any particular cluster. It is noteworthy that the hierarchical clustering procedure, based solely on ligand-receptor interaction features, is able to group structures into meaningful clusters where variable ligands have similar interactions with a fixed receptor (p38 and CDK2 clusters) and where very similar ligands interact in a highly conserved way with a diverse set of kinase receptors (ATPg cluster).
The p-SIFts may be derived using a reduced set of interaction features to represent each interaction. Thus, while a SIFt can utilize 7 bits to characterize the interaction at each residue, a simplified p-SIFt can be derived from only the interaction frequencies of the contact bit at each residue. In order to simplify the analyses, results presented in Examples 7-10 were based on contact-only p-SIFts.
As an initial application, p-SIFts provided a useful tool to overview the interaction patterns observed between ligands and protein kinases. For this purpose, it can be convenient to define categories from the contact-only p-SIFts to characterize the observed interactions, e.g., conserved ≧0.7, 0.4 ≦ intermediate <0.7, variable <0.4, as denoted by dashed lines on the plot in
The 25 members of the ATPg cluster consisted of 9 structures of ATP complexed with 3 different kinases and 16 structures of ATP analogs complexed with 6 kinases. The ATPg p-SIFt computed from the ATPg cluster SIFts is shown in the top panel of
The green blocks below the p38 p-SIFt denote residues making up the hydrophobic pocket of the kinase. For the 9 ATP complexes, 18 out of 23 contacts were classified as conserved between the kinases and the ribose, triphosphate and adenine moieties. Moreover, there were no completely variable positions. Interestingly, even for these ATP-only structures, four interactions fell in the intermediate conservation range. Interactions between the γ-phosphate and residues 54 and 55, making up the tip of the glycine rich loop in the kinases, were dependent on the conformation of this flexible region of the binding site and were observed only in approximately half of the structures. Contact between the β-phosphate of ATP and residue 171 was primarily determined by the conformation of the ATP phosphate groups. In approximately 60% of the structures, the α-β-phosphate pyrophosphate bond was rotated such that the β-phosphate was oriented away from residue 171 and towards the glycine-rich loop (see
When the ATP-analogs were considered in addition to the ATP complexes, the degree of variability increased. In particular, interactions with residues 104, 122, and 168 shifted from conserved to variable. The extent of variability is clear when the ATPg p-SIFt is compared to the ATP only p-SIFt, as shown in
Table 4 shows a summary of conserved, intermediate, and variable interactions observed across all of the 93 kinase structures and for each of the ATPg, p38, and CDK2 structure clusters. The total number of residues interacting with ligands in each group is denoted as “Contact Residues”. The number of conserved interactions beyond the canonical set observed for all ligands appears in the row labeled “Unique Conserved”.
The contact p-SIFts derived for the ATPg, CDK2, and p38 clusters plotted in
For the p38-ATPg and p38-CDK2 difference profiles, the key distinctions were determined in part by the identity of the residue at position 120. Referred to as the “gatekeeper” residue, it controlled the relative access to the hydrophobic pocket of the ATP site, a region not occupied by ATP. Bulky residues at position 120, such as the Phe in CDK2, restricted access to the hydrophobic pocket, limiting the contacts available to a putative inhibitor. The small Thr “gatekeeper” in p38 rendered the residues making up the hydrophobic pocket accessible to small molecule inhibitors. That small molecule inhibitors of p38 exploit these interactions was clearly evident from the p38 p-SIFt (
In contrast, the CDK2 p-SIFt was more similar to the ATPg p-SIFt as can be observed in the CDK2-ATP difference profile. Unlike p38, in CDK2 the Phe “gatekeeper” residue blocked access to the hydrophobic pocket. As a result, many of the residues accessible to CDK2 inhibitors were those that also interact with ATP. In fact, all of the conserved residues observed in the CDK2 p-SIFt were also conserved in the ATPg p-SIFt.
The main positive difference regions of the CDK2-ATP difference profile, corresponding to intermediate level conserved interactions in the CDK2 p-SIFt that occur with low frequency in the ATP p-SIFt, are colored white in
Unlike contacts with the hydrophobic pocket, several interactions conserved in the p38 cluster were common to CDK2, as well as other non-ATP inhibitors, and are colored red in
Approximately 20% of the contact interactions were conserved in each of the ATPg, CDK2, and p38 p-SIFts as well as over the 93 structures as a whole. These are denoted in Table 3 by the highlighted annotations and form a canonical set of interactions that were evidently fundamental for kinase binding at the ATP site. Further analysis of the full length SIFts revealed that among this set are interactions with residues at positions 121 and 123, which were involved in hydrogen bonding to the adenine moiety of ATP, the “gatekeeper” residue, position 57 in the glycine rich loop, position 70 that for ATP involved hydrophobic interactions between adenine and β3, and position 72 involving the ATP phosphates interacting with β3. The residues involved in the canonical set of interactions are colored green in
The canonical interactions comprise an essential kinase-binding signature for compounds targeting the ATP binding site. Although as noted in Table 3, additional conserved interactions existed for the ATPg, p38, and CDK2 clusters, the canonical interactions were common to all inhibitors and may be used as a basic kinase-like binding filter in virtual screening.
Hierarchical clustering of the SIFts computed from the 93 kinase x-ray structures resulted in the identification of the p38 and CDK2 clusters because they represent two fundamentally different sets of small molecule inhibitors in terms of interactions with the ATP binding site. However, the SIFts within each cluster were not homogenous. In particular, the p38 cluster revealed interesting details about the relationship between interaction patterns and inhibitor selectivity.
Clustering of the nine structures of the p38 cluster identified three distinct SIFt sub-clusters representing two distinct classes of inhibitors (shown in
Results from the analysis of the p38 cluster illustrated the power of the p-SIFt approach, namely, the ability to quantify the similarities and differences in the interaction patterns of inhibitors to a given target. Moreover, the ability to derive p-SIFts and difference profiles that quantify key conserved interactions can aid in inferring the structural basis for inhibitor potency and selectivity. The detailed binding signature information encoded in the p-SIFts makes them ideal filters for screening virtual libraries, as discussed below.
In the preceding examples, clear conservation patterns of interactions for ATPg, p38, and CDK2 clusters have been identified. A canonical set of conserved interactions common to all ligands bound to kinases at the ATP binding site was also identified. These binding signatures can be applied to virtual screening for protein kinase inhibitors.
The success of VS methodologies is typically cast in terms of enrichment studies designed to measure the percentage of known actives identified as a function of the fraction of the database screened. Often, the results of these studies indicate that the performance of scoring functions is target specific, for example, leading to significant enrichment of actives for docking against the estrogen receptor but performing poorly against kinase targets (see, for example, Halgren, T. A.; et al. J. Med. Chem. 2004, 47, 1750-1759, which is incorporated by reference in its entirety). Unfortunately, knowledge of the optimal scoring function to apply in a virtual screen against a novel target is not available a priori. As a result, it is often necessary to undertake lengthy validation studies to select a suitable scoring function, or alternatively, to construct a customized scoring scheme optimized for the target of interest. These problems are compounded when VS is carried out against multiple targets.
Some of these difficulties can be addressed by applying p-SIFts to the ranking and filtering of VS results. p-SIFts can be used as target-specific molecular filters encoding binding signatures that are consistent with a particular target specific group of known active inhibitors. Moreover, by comparing the SIFt for each docked solution with a kinase-specific, or binding mode specific, p-SIFt, each p-SIFt is in effect a target specific scoring function. A p-SIFt can be applied in a VS workflow tailored to a specific target without having to rely on the ambiguities of energy based scoring.
To this end, the performance of p-SIFt based scoring in a typical database enrichment application using p38 and CDK2 as targets was tested. In addition, the degree to which the ATP, CDK2, and p38 p-SIFts were selective toward observed kinase inhibitor binding modes was also assessed. Finally, for the generation of enrichment curves and selectivity assessment tests, full-length p-SIFts derived from 7-bit SIFts were used.
Three strategies for virtual screening post-processing were explored. For all three strategies a list of docking poses was generated using the program FlexX, and the top 30 poses were retained using the FlexX scoring function. The output obtained from docking N ligands is then an N×M matrix consisting of M poses (here, M=30) for each docked ligand.
The aim of post-processing the docking results was to arrive at a rank ordered list of N ligands, consisting of a single pose per ligand, which is enriched with actives. In general, the degree of enrichment depends on the success of the post-processing strategy.
All three strategies required the selection of a single pose per ligand and then subsequent ranking of those ligands. One approach utilized energy-based scoring functions both for selecting the top pose per ligand and for ordering the ligand list. This strategy is referred to as Traditional scoring. The second approach involved using a p-SIFt (instead of an energy function) both to select the single best pose per ligand and to order the ligand list, an approach referred to as p-SIFt scoring. The final approach was a hybrid of the two approaches, in which the p-SIFt was used to filter out undesirable poses, and then an energy-based scoring function was used to select the best pose per ligand and to create an ordered list of ligands. This strategy is called Hybrid scoring. Other strategies that make use of a p-SIFt can be used.
In all three cases, the overall post-processing scheme used to select a single pose per ligand from the poses generated by docking consisted of four general steps, namely,
(a) re-scoring: each pose generated (N×30) is scored using standard scoring functions and p-SIFts;
(b) filtering: unrealistic poses are removed;
(c) final pose selection: a single pose per ligand is selected; and
(d) ranking: the N ligands are rank ordered.
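Steps (b) through (d) can be sketched as a single routine, assuming step (a) has already attached an energy-based score and a Z value to each pose. The `Pose` fields and the function name are hypothetical. The sketch also assumes that higher values are better for both scores and that the optional filter retains poses whose Z value meets a cutoff; both are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, List, Optional


@dataclass
class Pose:
    ligand: str
    energy_score: float  # assumption: larger means a better score
    z_target: float      # Z value of the pose SIFt against the target p-SIFt


def post_process(poses: Iterable[Pose],
                 key: Callable[[Pose], float],
                 keep: Optional[Callable[[Pose], bool]] = None) -> List[Pose]:
    """Filter poses, pick the best pose per ligand, and rank the ligands."""
    best: Dict[str, Pose] = {}
    for pose in poses:
        if keep is not None and not keep(pose):
            continue  # step (b): optional filtering
        current = best.get(pose.ligand)
        if current is None or key(pose) > key(current):
            best[pose.ligand] = pose  # step (c): one pose per ligand
    return sorted(best.values(), key=key, reverse=True)  # step (d): ranking
```

With this routine, a score-only strategy would call `post_process(poses, key=lambda p: p.energy_score)`, a profile-only strategy would use `key=lambda p: p.z_target`, and a hybrid strategy would combine the energy key with a filter such as `keep=lambda p: p.z_target >= cutoff` for some chosen cutoff.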
The three scoring and post-processing strategies carried out steps (a)-(d) in different ways, as summarized in Table 5. For the Traditional scoring and Hybrid scoring schemes, docked poses were re-scored using several widely applied scoring functions computed using the Cscore utility in Sybyl. For the p-SIFt scoring and Hybrid scoring protocols, a value of Z between the SIFt for the pose and the target p-SIFt, Ztarget (where target is CDK2 or p38), was also computed.
Table 5 summarizes the post-processing schemes applied to score the ligand poses generated from the docking experiments. In Table 5, scoring function refers to one of ChemScore, Gscore, PMF Score, Dscore, or Consensus Score. See, for example, Eldridge, M.; et al. J. Comput. Aided Mol. Des. 1997, 11, 425-445; Jones, G.; et al. J. Mol. Biol. 1997, 267, 727-748; Muegge, I.; and Martin, Y. C. J. Med. Chem. 1999, 42, 791-804; Meng, C.; et al. J. Comp. Chem. 1992, 13, 505-524; and Charifson, P. S.; et al. J. Med. Chem. 1999, 42, 5100-5109, each of which is incorporated by reference in its entirety. For both the Traditional scoring and Hybrid scoring schemes, the same scoring function was used for final pose selection (step (c)) and ligand ranking (step (d)).
The filtering step (b) applied in the Hybrid scoring scheme involved filtering out any poses having ZCDK2≧4.5 and Zp38≧5.0, for VS against CDK2 or p38, respectively. The Ztarget cutoffs were chosen to be at the lowest value of the Ztarget distribution observed for the CDK2 and p38 X-ray structures. In addition, a canonical interaction filter was applied to each pose such that SIFts not satisfying the subset of interactions having an interaction frequency of 1 in the all-kinase p-SIFt were removed.
Incorrect ligand poses can be eliminated from the pool of poses that will be considered for final selection during the filtering step. The aim is to reduce the number of false positive poses while retaining all plausible true positive poses. The filtering step is optional and for comparison purposes was omitted in order to generate results based only on scoring functions (Traditional scoring) and only on p-SIFts (p-SIFt scoring).
Step (c) involves selecting a single pose per ligand from the set of poses that have passed all of the filters, if any, applied in step (b). Enrichment curves and factors were computed by rank ordering (step (d)) the final set of ligand poses using the schemes outlined in Table 5.
A database containing known inhibitors of both p38 and CDK2 and a background of 1000 diverse commercially available compounds was docked against the X-ray structures of CDK2 (PDB code 1di8) and p38 (PDB code 1a9u). The ability of the p-SIFt VS protocol to identify known actives was quantified by computing enrichment curves. The enrichment curves plot the percentage of actives recovered as a function of the percentage of the database screened. Enrichment curves and cumulative enrichment factors for p38 are presented in
From
Enrichment curves were derived using Traditional scoring, p-SIFt scoring, and Hybrid scoring, and are presented in
Attaining database enrichments for CDK2 comparable to those obtained for p38 was a considerably more challenging task for VS. The large gatekeeper residue in CDK2 restricted the number of residues accessible in the ATP binding site. The p-SIFt for CDK2 sampled fewer residues compared to the p-SIFt for p38 and conserved interactions were distributed over a relatively small spatial region. As a result, for CDK2 there were fewer constraints to generate ligand placements and it was therefore easier to generate poses that satisfied conserved interactions in CDK2. In effect, the CDK2 p-SIFt was less selective against false poses as evidenced by the poorer performance of p-SIFt scoring for CDK2 versus p38.
The difference profiles presented in
The results of the self-recognition experiment are shown in
It was clear from
The greatest separation in Z-score distributions was obtained for p38 (
Several chemical libraries and ensembles of docking poses were generated for analysis. The crystal structure of MAP kinase p38 (PDB accession code: 1ouk) was used as the target molecule in all of the virtual screening experiments (see, for example, Fitzgerald, C.E.; et al., Nat. Struct. Biol., 2003, 10, 764-769, which is incorporated by reference in its entirety). The first library of docking poses was used to demonstrate the ability of r-SIFt to efficiently organize and visualize various binding modes. The pyridinyl imidazole inhibitor co-crystallized with p38 in the 1ouk structure, “1ouk-inh”, which has been identified as a very selective and potent p38 inhibitor, was docked with p38. The 150 poses with the highest CScores were retained for subsequent analysis. The docking experiment was carried out with FlexX in Sybyl. The ligand binding site was defined using a cutoff radius of 10 Å from the 1ouk-inh ligand (i.e., the conformation in the crystal structure) combined with a core sub-pocket cutoff distance of 4 Å. The FlexX scoring function was used for docking. Five different scoring functions (Fscore, ChemScore, Gscore, PMF Score, and Dscore) were used as voting scores for the Consensus Score in the Cscore utility in Sybyl.
In order to compare and contrast the r-SIFt patterns of different compound structures, docking experiments were performed for five chemically distinct compounds (
In addition, to test the r-SIFt based library filtering strategy, four different combinatorial libraries were enumerated using three distinct p38 inhibitors as template scaffolds: 1ouk-inh, SKF-86002, and Amgen-10. In order to simplify the analysis and to make the results more interpretable, only one R-group in each library was varied. A common monomer library containing 10,000 aryl bromides was used as the source of reactants in the enumeration of these libraries (see, for example, ACD: Available Chemical Directory (version 2004.2), MDL Information Systems: San Leandro, CA). Three libraries were generated by varying the R-1 group of each of the three templates. The fourth library was enumerated by varying the R-2 group of 1ouk-inh. Based on the co-crystal structures of 1ouk and other similar inhibitors (1a9u, 1bl6, 1bl7, 1bmk, 1ouk, etc.), in the “native binding mode”, the R-1 groups were expected to interact with the hydrophobic pocket of p38 (see, for example, Radzio-Andzelm, E.; and Taylor, S. S. Structure, 1994, 2, 345-355, which is incorporated by reference in its entirety). The R-2 portion of 1ouk-inh, on the other hand, was positioned in the vicinity of the adenine binding site in the hinge region. These four libraries were named 1ouk-inh-R1, SKF-86002-R1, amgen-10-R1 and 1ouk-inh-R2, respectively. Library enumeration processes were carried out using Pipeline Pilot (Pipeline Pilot™ (version 3.0), Scitegic Inc., San Diego, Calif., U.S.A.). All the reaction products were pre-filtered by removing salts, inorganic compounds as well as molecules with molecular weight less than 400. From the remaining library, a subset of molecules (maximum number 2,000) with maximal chemical diversity was sampled for further analysis. The total numbers of selected compounds in each library were: 1ouk-inh-R1, 2208; 1ouk-inh-R2, 2450; SKF-86002-R1, 2442; amgen-10-R1, 1750.
These four libraries were docked onto the p38 target molecule (1ouk), using the same docking procedure. The docking experiments were able to reproduce the native co-crystal structure of 1ouk-inh, with an RMSD less than 0.4 Å, confirming the validity of the docking procedure.
Calculation of 2-D Descriptors of the Ligands
Molecular descriptors of the R-group monomers (after substituting the bromide with a hydrogen atom) were calculated using Pipeline Pilot™. In order to make the method more amenable to large libraries, the time-consuming calculation of 3D descriptors was omitted. A total of 37 2D descriptors were generated.
The molecular descriptor set was further processed by removing variables with little or no variance across the whole library. In addition, descriptors with high redundancy and multicollinearity were removed. This cleaning step was carried out using the unsupervised forward selection (UFS) algorithm with the stopping criteria of an R²max (i.e., the squared multiple correlation coefficient, SMCC) cutoff of 0.95 and a minimum standard deviation of variables of 0.05 (see, for example, Whitley, D. C.; et al., J. Chem. Inf. Comput. Sci., 2000, 40, 1160-1168, which is incorporated by reference in its entirety). The final non-redundant set contains 26 descriptors, including: F_COUNT, P_COUNT, S_COUNT, CL_COUNT, BR_COUNT, ALOGP, MOLECULAR_POLARSURFACEAREA, NUM_H_ACCEPTORS, NUM_H_DONORS, NUM_ATOMS, NUM_HYDROGENS, NUM_POSITIVEATOMS, NUM_ROTATABLEBONDS, NUM_BRIDGEBONDS, NUM_RINGS, NUM_AROMATICRINGS, NUM_RINGASSEMBLIES, NUM_CHAINS, NUM_CHAINASSEMBLIES, NUM_STEREOBONDS, NUM_UNKNOWNSTEREOBONDS, NUM_ATOMCLASSES, LOGD, and MOLECULAR_WEIGHT.
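The descriptor cleaning step can be illustrated with a simplified sketch. It is not the UFS algorithm itself (UFS also chooses the order in which descriptors are considered); it only shows the two stopping criteria in action, visiting descriptors in input order. The function name `clean_descriptors` and the dictionary layout are assumptions made for illustration.

```python
from math import sqrt

def clean_descriptors(columns, r2_max=0.95, min_sd=0.05):
    """Simplified, order-dependent stand-in for UFS (an assumption, not the
    published algorithm). `columns` maps descriptor name -> list of values,
    one value per compound. Returns the names of the retained descriptors."""
    basis = []  # orthonormal vectors spanning the descriptors accepted so far
    kept = []
    for name, values in columns.items():
        n = len(values)
        mean = sum(values) / n
        centered = [v - mean for v in values]
        sum_sq = sum(c * c for c in centered)
        # Criterion 1: drop descriptors with little or no variance.
        if sqrt(sum_sq / (n - 1)) < min_sd:
            continue
        unit = [c / sqrt(sum_sq) for c in centered]
        # Criterion 2: R^2 of this descriptor regressed on the accepted ones
        # equals the squared length of its projection onto their span.
        projections = [sum(u * b for u, b in zip(unit, vec)) for vec in basis]
        if sum(p * p for p in projections) > r2_max:
            continue
        # Gram-Schmidt step: keep only the part not explained by the basis.
        residual = [u - sum(p * vec[i] for p, vec in zip(projections, basis))
                    for i, u in enumerate(unit)]
        res_norm = sqrt(sum(r * r for r in residual))
        if res_norm == 0.0:
            continue
        basis.append([r / res_norm for r in residual])
        kept.append(name)
    return kept
```

With a toy input in which one column is an exact multiple of another and one column is constant, only the independent, varying columns are retained.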
Generation of r-SIFts
The panel of 56 residues of p38 previously identified as the kinase ligand binding site was used as the reference frame for r-SIFt construction. These residues are located in the vicinity of the ATP binding pocket in the cleft of the N-terminal and C-terminal domains, as well as at the substrate-binding site. See above.
The implementation of r-SIFt used was based on the contact distance between heavy atoms of a residue and different fragments of the ligands. A four-bit binary string was used to represent the interactions involving each binding site residue, each bit representing whether or not a particular fragment (core, R-1, R-2 or R-3) is within a certain distance cutoff of that residue. In the case of SKF-86002 and Cmp-59076, three bits were used, as these compounds do not have an R-3 group. The distance cutoff was set to 3.5 Å. If any heavy atom of a particular fragment was within 3.5 Å of any heavy atom of the residue, the corresponding bit was turned on (1); otherwise the bit remained off (0). The final fingerprints were constructed by concatenating all 56 small bit strings in ascending residue number order. The total length of each r-SIFt pattern was 56×4=224 bits, except for SKF-86002-R1, in which R-3 was absent. The length of the r-SIFts in SKF-86002-R1 was 56×3=168 bits.
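The construction described above can be sketched as follows. This is a minimal illustration, not the implementation actually used: the function names, the dictionary-based inputs, and the treatment of "within 3.5 Å" as an inclusive comparison are all assumptions.

```python
from math import dist

CUTOFF = 3.5  # contact cutoff in angstroms, as stated in the text

def fragment_contacts(residue_atoms, fragment_atoms, cutoff=CUTOFF):
    """True if any residue heavy atom lies within `cutoff` of any fragment
    heavy atom. Atoms are (x, y, z) tuples; "within" is taken as <= here."""
    return any(dist(a, b) <= cutoff
               for a in residue_atoms for b in fragment_atoms)

def build_rsift(site_residues, ligand_fragments,
                fragment_order=("core", "R-1", "R-2", "R-3")):
    """site_residues: residue number -> list of heavy-atom coordinates.
    ligand_fragments: fragment name -> list of heavy-atom coordinates.
    Returns a flat list of 0/1 bits, one group per residue in ascending
    residue number order, one bit per fragment within each group."""
    bits = []
    for resnum in sorted(site_residues):
        for name in fragment_order:
            atoms = ligand_fragments.get(name, [])  # absent fragment -> bit off
            hit = fragment_contacts(site_residues[resnum], atoms)
            bits.append(1 if hit else 0)
    return bits
```

For a ligand without an R-3 group, passing `fragment_order=("core", "R-1", "R-2")` yields three bits per residue, so 56 residues give 168 bits rather than 224.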
Analysis and Clustering of r-SIFts
The Tanimoto coefficient was used as the similarity measurement between two r-SIFts. For the ensemble of 150 1ouk-inh poses, 1ouk-inh-R1 and 1ouk-inh-R2, the co-crystal structure of the inhibitor was used as the reference structure. For SKF-86002 and Amgen-10, no co-crystal structure was available. The best docking poses (i.e., those with top FlexX scores) for these compounds were examined, and a best pose was selected for each. These two best poses were consistent with the expected binding modes as observed in the co-crystal structures of similar inhibitors (1ouk, 1a9u, 1b16, 1b17, 1bmk, 1ove, etc.) and made all the conserved interactions with the target that were observed in other p38 structures. These were used as the reference structures. Agglomerative hierarchical clustering was applied to analyze and reorganize each library of poses, using Tanimoto coefficients as the similarity measurement. Clusters of protein-ligand complex structures were selected based on the dendrogram of their r-SIFts.
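For two bit strings, the Tanimoto coefficient is the number of bits set in both divided by the number of bits set in either. A minimal sketch follows; the function name and the choice to return 0.0 when both fingerprints are empty are assumptions made for illustration.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto coefficient between two equal-length 0/1 fingerprints:
    (bits on in both) / (bits on in at least one)."""
    if len(fp_a) != len(fp_b):
        raise ValueError("fingerprints must have the same length")
    both = sum(1 for a, b in zip(fp_a, fp_b) if a and b)
    either = sum(1 for a, b in zip(fp_a, fp_b) if a or b)
    # Assumed convention: two all-zero fingerprints score 0.0.
    return both / either if either else 0.0
```

A distance suitable for hierarchical clustering can then be taken as 1 minus this value.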
Combining SIFt-based approaches and conventional scoring functions can yield better results in reproducing the true binding modes of the compounds and better library enrichment performance. When docking known p38 inhibitors, the best pose given by a conventional scoring function may not adopt the native binding mode; however, a good placement with the correct binding mode can usually be found among the top 10 poses. For p38 inhibitors, retaining the top 10 poses and then selecting the pose with the best binding mode based on SIFt similarity gave much better enrichment performance than using the conventional scoring function alone. Here, a similar strategy was applied to process the docking results of the combinatorial libraries. The r-SIFt patterns were calculated for the top-scoring poses (i.e., those with the best FlexX scores) of each compound. Tanimoto coefficients were calculated against the r-SIFt of the reference structure (either the co-crystal structure or the best predicted pose as described above). The pose with the highest Tanimoto coefficient was selected as the best pose for the compound and used in subsequent ranking or hierarchical clustering. All hierarchical clustering calculations of the r-SIFts were carried out using Spotfire™.
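The two-stage pose triage described above might be sketched as follows. The pose records, field names, and the assumption that a lower docking score is better are illustrative only and would need to match the scoring function actually used.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto coefficient between two equal-length 0/1 fingerprints."""
    both = sum(1 for a, b in zip(fp_a, fp_b) if a and b)
    either = sum(1 for a, b in zip(fp_a, fp_b) if a or b)
    return both / either if either else 0.0

def select_best_pose(poses, reference_fp, top_n=10, lower_is_better=True):
    """poses: list of dicts with hypothetical keys 'score' and 'rsift'.
    Stage 1 keeps the `top_n` poses by docking score; stage 2 returns the
    shortlisted pose whose r-SIFt is most similar to the reference."""
    if not poses:
        raise ValueError("no poses supplied")
    ranked = sorted(poses, key=lambda p: p["score"],
                    reverse=not lower_is_better)
    shortlist = ranked[:top_n]
    # max() returns the first maximum, so ties go to the better-scored pose.
    return max(shortlist, key=lambda p: tanimoto(p["rsift"], reference_fp))
```

With `top_n=1` the function reduces to picking the top-scored pose, which makes it easy to compare the combined scheme against the scoring function alone.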
Construction of Decision Tree Classification Models
Hierarchical clustering grouped poses into different clusters according to their binding modes. By visual inspection, the cluster in which compounds adopt the native binding mode was identified. These compounds were classified as native; that is, they were "dockable" hits, because they were predicted by the docking program to be able to interact with the target molecule in a way similar to known active inhibitor(s). Compounds having a predicted binding mode different from the native structure were classified as non-native.
After classifying the compounds, decision tree models were generated using CART™ (version 5, Salford Systems; see, for example, Steinberg, D.; and Colla, P. CART: tree-structured non-parametric data analysis. San Diego, CA: Salford Systems, 1995). The non-redundant set of 2-D descriptors was used as the predictive variables, and the binding mode class (native or non-native) as the target variable. The decision trees were formed from a set of nodes and leaves (end nodes). Each node contains a bifurcation of the path based on the value of a particular descriptor. The trees were generated using tenfold cross-validation, randomly assigning 90% of the data points as the training set and 10% as the test set. Equal weights were applied to both native and non-native classes. The performance of the model was measured by prediction accuracies for both classes in the training set as well as in the test set.
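To make the node bifurcation and the per-class accuracy measure concrete, the following sketch finds the single best threshold split over a set of descriptors using Gini impurity, and computes accuracy separately for each class. It is not the CART™ software: it builds only one node, and it omits class weighting, pruning and cross-validation. All function names are hypothetical.

```python
def gini(labels, positive="native"):
    """Two-class Gini impurity, 2p(1-p), of a list of class labels."""
    if not labels:
        return 0.0
    p = sum(1 for y in labels if y == positive) / len(labels)
    return 2.0 * p * (1.0 - p)

def best_split(rows, labels):
    """rows: list of dicts mapping descriptor name -> numeric value.
    Returns (descriptor, threshold, weighted_gini) for the binary split
    'value <= threshold' that minimizes the weighted child impurity."""
    best = None
    n = len(rows)
    for desc in rows[0]:
        values = sorted({r[desc] for r in rows})
        # Candidate thresholds are midpoints between adjacent distinct values.
        for lo, hi in zip(values, values[1:]):
            thr = (lo + hi) / 2.0
            left = [y for r, y in zip(rows, labels) if r[desc] <= thr]
            right = [y for r, y in zip(rows, labels) if r[desc] > thr]
            score = (len(left) * gini(left) + len(right) * gini(right)) / n
            if best is None or score < best[2]:
                best = (desc, thr, score)
    return best

def class_accuracies(true_labels, predicted_labels):
    """Fraction of correctly predicted compounds, reported per true class."""
    result = {}
    for cls in set(true_labels):
        idx = [i for i, y in enumerate(true_labels) if y == cls]
        correct = sum(1 for i in idx if predicted_labels[i] == cls)
        result[cls] = correct / len(idx)
    return result
```

Reporting accuracy per class, rather than a single overall figure, keeps a model that simply predicts the majority class from appearing to perform well.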
Organization of the 1ouk-inh Docking Poses Ensemble
150 poses of 1ouk-inh docked to p38 were generated for r-SIFt analysis.
In addition to its sensitivity to binding mode variations, r-SIFt provided a method for easy visualization and interpretation of how molecules bind to an active site.
Comparison of r-SIFts of Different p38 Inhibitors
Docking experiments were performed using four known p38 inhibitors (1ouk-inh, SB203580, SKF-86002 and Amgen-10) and a compound with no p38 inhibition activity (Cmp-59076). These compounds exhibit different chemical scaffolds (
Not surprisingly, the r-SIFt patterns are first clustered together by compound. Furthermore, the distance between two clusters in the dendrogram reflects the degree of similarity in the binding mode. For all four p38 inhibitors (1ouk-inh, SB203580, SKF-86002, Amgen-10), the overall positions of the molecular fragments within their r-SIFts were consistent. In most cases, the R-2 group (purple bits) was in contact with the hinge region, whereas the R-1 group (blue bits) was highly concentrated in the hydrophobic pocket region (made up of residues from β3-β4 and some residues in β5 immediately preceding the hinge region). This shows that different p38 inhibitors bound to the target molecule with a very consistent overall interaction pattern. Cmp-59076, which displayed a completely different binding mode, was the most distant from the other inhibitors in the dendrogram.
A more detailed investigation of the r-SIFt patterns revealed some degree of variation between different known inhibitors. For example, the R-2 group of 1ouk-inh (purple bits in
Analysis of Combinatorial Libraries
To search for the rules governing the behavior of the compounds within a target molecule, four combinatorial libraries were enumerated. r-SIFt was then used to help investigate their "dockability" or hit potential, that is, whether or not they were able to dock onto the target with the expected binding mode. After generating r-SIFts, a hierarchical clustering analysis was carried out to separate different binding modes.
However, no single descriptor alone was able to successfully explain the classification variance. A more complex predictive model that involves a combination of different descriptors was required.
The CART™ decision tree method was used to build classification models. A decision tree model was generated for each of the four combinatorial libraries, using a non-redundant set of 2D molecular descriptors as predictive variables.
The performances of these decision tree models were evaluated by the prediction accuracies for both native and non-native classes. The results are summarized in Table 6.
Tenfold cross-validation was used during the construction process: each time, 90% of the data points (randomly selected) were used to build the model while 10% of the data were set aside as a test set for validation. The accuracies for the test sets set aside during decision tree construction were a better performance indicator. All four models gave reasonably good and balanced performances, with test-set accuracies in the range of 70-80% for both native and non-native classes of molecules (see Table 7 below).
To test the expandability of these predictive models, decision trees were regenerated after randomly setting aside 25% of the original library as the evaluation set. Models were built using the remaining 75% of the data, with exactly the same parameter settings and tenfold cross-validation. Each model was then applied to its respective evaluation set, which was never used in the model building process. The prediction accuracies were all comparable to those shown in Table 7, indicating that the models are fully expandable and can be applied to filter very large combinatorial libraries.
The three R-1 libraries were derived from different scaffolds. Since the variable R-1 groups in these libraries all target the same hydrophobic binding pocket, it was reasonable to expect that the rules derived from these libraries should be closely related to each other. To test this hypothesis, each decision tree model was used to predict the other two R-1 libraries. The cross-library prediction results are summarized in Table 8.
1ouk-inh-R1 and SKF-86002-R1 were interchangeable, with their cross-library prediction accuracies remaining at 71-78% for both classes of molecules, a performance comparable to their self-prediction accuracies (Table 7). Interestingly, all prediction performances involving the Amgen-10 library showed different accuracies for native and non-native classes: the native classes were predicted more accurately (81-90%) than the non-native molecules (only 50-62%).
and applying this predictive model to filter the original large combinatorial library.
r-SIFt is a variation of SIFt designed for dealing with compound libraries. r-SIFt embeds the binding information of various R-groups of a combinatorial library into a fingerprint. r-SIFt has several desirable features. First, it is extremely sensitive to subtle variations in the placement of ligands within the active site; second, when represented as a heat map, the r-SIFt patterns provide a convenient way to directly visualize how the ligand molecules interact with various regions of the target molecule; third, calculation of r-SIFt patterns is less time consuming since it only involves simple contact distances between heavy atom pairs. The r-SIFt based method offers two advantages that are useful for library design. A modified docking pose triage scheme that combines both a traditional scoring function and the SIFt ranking provided much better confidence in generating the true binding placement of the compounds. It therefore gave a superior database enrichment performance over traditional schemes. As one application of r-SIFt, this method was used to analyze several ensembles of docking results and to accurately differentiate compounds in a library based on their ability to bind to the target with the expected binding mode (virtual hits and non-hits). Based on the r-SIFt classification, 2D descriptors of the compounds were used to build general predictive models that can filter large libraries.
SIFt, p-SIFt and r-SIFt can enforce different layers of target molecule-ligand constraints that are valuable in designing chemical libraries and mining virtual screening results. r-SIFt, as a variation of SIFt, incorporates the binding information of different fragments of a compound into the fingerprints, thus allowing analysis of the 3-D structures of the compounds in the context of the target native site based on how different fragments of these compounds interact with the target molecule. r-SIFt provides flexibility to a user, who can select various types of binding information to incorporate in the fingerprints, depending on the specific needs of the analysis.
The r-SIFt based approach provides a method complementary to other conventional filtering methods, such as the 3-D pharmacophore model. A fundamental difference between these two methods is that in the SIFt-based method, compounds are actually docked onto the target molecule to see how they behave, that is, whether or not they are able to make the predicted interactions, as expected by the pharmacophore model or known SIFt patterns/profiles.
The r-SIFt pattern, as implemented above, provided information about the overall orientation and position of a ligand molecule relative to the binding site of the target molecule. It did not provide, however, more detailed information about what kinds of interactions (hydrophobic, polar, hydrogen bonds, etc.) are involved. Oftentimes such detailed binding information can be highly valuable and can be used as effective constraints in designing a library. SIFt and p-SIFt contain such details and offer other perspectives. One can combine SIFt and r-SIFt to construct more constraints and to carry out a more careful and in-depth analysis of the pilot library in order to classify native and non-native compounds. One can also apply more than one type of SIFt in the library design process. For example, r-SIFt can be used to search for molecules in which a particular R-group occupies a specific region of the target molecule, and SIFt can be applied to further search for molecules making specific interactions (e.g., hydrogen bonds, hydrophobic interactions) with particular residues/sub-regions. Such double constraints would generate a pool of native molecules that are more specific and selective.
r-SIFt can offer a sensitive and efficient method to discriminate between different binding modes of the ligands and therefore can be used as a powerful filter, especially during the initial filtering steps, to effectively remove compounds with undesirable interaction patterns with the target. SIFt- and r-SIFt-based approaches have proven to be effective tools for organizing, visualizing, and analyzing large libraries of structures such as docking poses. The SIFt-based library design and pruning method described here provides a new strategy that is complementary to other conventional methods.
A number of embodiments have been described. Nevertheless, it will be understood that various modifications may be made. Accordingly, other embodiments are within the scope of the following claims.
This application claims priority to U.S. Application No. 60/672,018, filed Apr. 18, 2005, and U.S. Application No. 60/602,852, filed Aug. 20, 2004, and is a continuation-in-part of application Ser. No. PCT/US2004/020992, filed Jul. 1, 2004 and designating the United States, which claims priority to U.S. Application No. 60/524,083, filed Nov. 24, 2003 and U.S. Application No. 60/484,308, filed Jul. 3, 2003, each of which is incorporated by reference in its entirety. The invention relates to methods for representing and analyzing molecule-ligand intermolecular interactions.
Number | Date | Country
---|---|---
60672018 | Apr 2005 | US
60602852 | Aug 2004 | US
60484308 | Jul 2003 | US
60524083 | Nov 2003 | US

Relation | Number | Date | Country
---|---|---|---
Parent | PCT/US04/20992 | Jul 2004 | US
Child | 11206034 | Aug 2005 | US