Structural interaction fingerprint

BACKGROUND

Representing and understanding the three-dimensional structural information of biological molecules is becoming a critical step in the rational drug discovery process. With the advent of massive virtual chemical library screening, as well as the recent advancements in X-ray crystallography, NMR and homology modeling techniques, the amount of structural information is increasing rapidly. The traditional analysis methods are inadequate and inefficient in dealing with such massive structural information.

The past decade has seen an explosion of the three-dimensional structural information of biologically important molecules, due to the recent developments of X-ray crystallography, NMR and molecular modeling techniques. There are currently more than 20,000 structures deposited in the Protein Data Bank, and a significant portion of these structures contain ligands bound to macromolecules. In addition, combinatorial chemistry and virtual library screening are becoming routine procedures in the drug discovery process. This process generates thousands to millions of virtual protein-ligand complex structures, making detailed examination of these structures a daunting task. Representing the three-dimensional structural information of macromolecules efficiently is challenging because of the complexity of identifying residues and atomic interactions. Representing the covalent or non-covalent interactions between molecules is even more difficult, because in addition to the geometric location of each interaction, the direction, type, and magnitude of the interaction are also important and need to be captured. Understanding the intermolecular interactions between proteins and their ligands provides insights into the functional mechanism of the proteins. It is important for structure-based drug design to understand the key forces between small molecules (SMs) and proteins and to be able to compare different orientations or different small molecules binding to the same receptor site, or different binding sites.

Traditionally, understanding and comparing the interactions between proteins and ligands is achieved by visually inspecting an individual structure with structure-rendering software on a graphic terminal. The inspection is sometimes facilitated by other software tools that generate 2-D or 3-D schematic representations of the interactions (e.g., LIGPLOT™). Such time-consuming processes require human intervention and become more and more tedious as the number of complex structures increases. It is important for successful drug discovery to have a tool that allows this massive amount of structural information to be organized and analyzed.

More recently, structure-based virtual chemical library screening has become a common procedure in the drug discovery process. Virtual library screening typically generates hundreds of thousands of virtual protein-ligand complex structures. Effectively mining this massive structural library becomes a tremendous task, as it is impossible to analyze the structures individually. Traditionally, different types of empirical docking scores and some pharmacophoric filters are used to sift the docking results for tight binders with desired binding interactions. However, these methods have limitations. Correlation between good docking scores and high activity is not always satisfactory. The docking scores are an overall summation of interaction and do not discern differences in binding modes. Therefore, a method that allows accurate representation of the interaction and fast analysis of a large number of structures is in great demand.

Energy-based scoring schemes for ranking predicted poses from receptor-based virtual screening (docking, or VS) are well known. In order to address the limitations inherent in traditional scoring functions, a variety of “knowledge-based” or “target-biased” approaches have been developed that impose constraints based on ligand or receptor pharmacophores thought to be required for activity. However, the success of the VS strategy is dependent on the application of constraints derived from knowledge of how small molecule inhibitors bind at the active site. These constraints typically filter virtual libraries based on the presence of known binding motifs, or the ability to satisfy key interactions with the receptor. However, the ability to apply constraints during VS that predict the selectivity of inhibitors for one protein over another is a much more challenging problem that has not been widely addressed.

SUMMARY

A method is provided for generating a structural interaction fingerprint (SIFt). The SIFt is in the form of an information string which includes a plurality of information blocks, and each information block includes a plurality of information units. The method includes the steps of selecting a plurality of positions (selected positions) on a target molecule where each selected position corresponds to an information block in the information string; selecting a plurality of interaction types and calculating a value that is indicative of the characteristic of each interaction type at each selected position of the target molecule; assigning the value to the corresponding information unit thereby indicating the characteristic of that particular interaction type at the corresponding selected position; and joining the information units of each selected position together to form the corresponding information blocks, which are joined together to generate a SIFt.
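The steps above can be illustrated with a minimal Python sketch. It assumes binary information units, and the interaction type names, residue labels, and the `build_sift` function are hypothetical choices made only for this illustration; they are not taken from the method itself.

```python
from typing import Dict, List, Set

# Hypothetical interaction types, chosen only for illustration.
INTERACTION_TYPES = ["contact", "polar", "nonpolar", "hbond_donor", "hbond_acceptor"]


def build_sift(positions: List[str],
               observed: Dict[str, Set[str]],
               interaction_types: List[str] = INTERACTION_TYPES) -> str:
    """Return a binary SIFt: one block per position, one unit per interaction type."""
    blocks = []
    for pos in positions:
        present = observed.get(pos, set())
        # Each information unit is "1" if the interaction type occurs at this position.
        units = ["1" if t in present else "0" for t in interaction_types]
        blocks.append("".join(units))  # information units -> information block
    return "".join(blocks)             # information blocks -> information string


# Hypothetical selected positions and observed interactions.
positions = ["LYS33", "GLU81", "LEU83"]
observed = {
    "LYS33": {"contact", "polar", "hbond_donor"},
    "LEU83": {"contact", "nonpolar"},
}
sift = build_sift(positions, observed)
print(sift)
```

Keeping the order of positions and interaction types fixed is what makes two fingerprints comparable unit by unit.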

The SIFt methodology can include an interaction profile based approach termed profile-SIFt, or p-SIFt. The p-SIFt can be derived from a collection of SIFts, and can measure the conservation of interactions observed in clusters of protein-ligand complexes. A p-SIFt can be used to generate target-specific knowledge-based filters for virtual screening as well as provide an understanding of the interaction patterns responsible for inhibitor selectivity.

Interaction profiling and p-SIFt can be a powerful approach to identify and understand interactions that small molecules exploit in order to bind to a target molecule. The information encoded in a p-SIFt can be used to selectively filter virtual libraries for ligands that are inhibitors to a particular target molecule.

SIFts are described, for example, in U.S. Patent Application Nos. 60/484,308, filed Jul. 3, 2003, and 60/524,083, filed Nov. 24, 2003, in PCT application No. US04/20992, filed Jul. 1, 2004; and U.S. Patent Application No. 60/602,852, filed Aug. 20, 2004, each of which is incorporated by reference in its entirety.

The target molecule can be a protein or a fragment thereof such as a peptide (e.g., polypeptide or oligopeptide). Alternatively, a target molecule can be a nucleic acid. In certain circumstances, the ligand can be a peptide, a nucleic acid, or even a small molecule (e.g., an organic molecule (e.g., molecular weight equal to or less than 1,500 daltons) that is neither a peptide nor a nucleic acid). In certain circumstances, both the target molecule and the ligand can be proteins. In this case, the SIFt can be descriptive of protein-protein interactions.

Note that the target molecule is forming a complex with a ligand (i.e., the binary complex), and the selected positions are the positions on the target molecule that participate in intermolecular interaction with the ligand. These positions can be obtained from a three-dimensional structure of a binary complex formed between the target molecule and the ligand. The three-dimensional structure can be derived from an experimental method or a prediction method such as, for example, an in silico prediction method. In one embodiment, a set of selected positions can be obtained from comparing the common positions (e.g., residues or bases) of the target molecule that participate in intermolecular interactions among a set of target molecule-ligand structures. The target molecule can be the same or different in the set of target molecule-ligand structures.

For a protein or peptide target molecule, each selected position can include one or more secondary structure elements (e.g., an α-helix or a β-strand), amino acid residues (e.g., a lysine residue), main chain atom groups (e.g., the α-carbon of a particular amino acid residue), side chain atom groups (e.g., the butylamine group of a Lys), or individual atoms of the target molecule. As to a nucleic acid target molecule, each selected position can include one or more bases, functional groups, or individual atoms of the target molecule.

The value that is assigned to a particular information unit can be a binary value or a numeric value selected from a scale or range of numbers. The binary value indicates whether a particular interaction type is present (1) or absent (0) at the corresponding selected position of the target molecule, whereas the numeric value indicates the magnitude of a particular interaction type at the corresponding selected position of the target molecule (e.g., a value of “3” in a scale that ranges from “0” to “5”).

As mentioned above, the value indicates the characteristic of a particular interaction type at that selected position. Note that the interaction types represent different types of intermolecular interactions between the target molecule and the ligand. For example, the interaction type can be classified as contact interaction. One can detect the presence of contact interaction between a target molecule and a ligand at a selected position (e.g., a protein residue) according to a number of methods. In one embodiment, the target molecule-ligand pair is considered to have established contact interaction at a selected position if the interaction involves a change or reduction in the accessible surface area at that position of the target molecule upon forming a complex with the ligand. Alternatively, one can measure the intermolecular distance between a target molecule and a ligand at a selected position to determine whether contact interaction occurs at that position (i.e., whether the intermolecular distance is within the predetermined distance cutoff limit). In one embodiment, the target molecule-ligand pair is considered to be interacting if the interatomic contact distance between the target molecule and the ligand is equal to or less than 10 Å (e.g., equal to or less than 6 Å, or even 4 Å). The interaction type can be further classified as polar interaction, non-polar interaction, and/or hydrogen bonding interaction, depending on the nature of the interactions. In one embodiment, the hydrogen bonding interaction can involve a hydrogen bond donor in the target molecule and a hydrogen bond acceptor in the ligand at the selected position. In one embodiment, the hydrogen bonding interaction can involve a hydrogen bond acceptor in the target molecule and a hydrogen bond donor in the ligand at the selected position. Note that intermolecular interactions can be characterized by interaction energy-based approach. 
The interaction type can be characterized by the contribution of the selected position to the interaction energy between a target molecule and a ligand where the total interaction energy between the target and the ligand is summed over all positions. The interaction energy may be computed by a variety of scoring functions or intermolecular force-fields such as common ligand-receptor docking scoring functions (e.g., Dock, Gold, ChemScore, FlexX score, PMF, Screencore, Drugscore, etc.) or intermolecular potential energy functions or force-fields (e.g., CHARMM, Amber, OPLS, etc.). The interaction energy calculated for each information unit (which corresponds to a selected position) may take the form of a real number (e.g., −43.2 kcal/mol), integer (e.g., −43 kcal/mol), or an integer representing a binned form of the interaction energy. In the latter case, the energy range of the function is divided into bins (e.g., −70 to −50 kcal/mol, −50 to −20 kcal/mol, −20 to 0 kcal/mol, or 0 to 10 kcal/mol) where the interaction energy is represented as an integer identifying the bin (in this case for example 1, 2, 3, or 4).
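The binned representation can be sketched as follows, using the example bin ranges given above. The function name `energy_bin`, the treatment of values that fall exactly on a bin boundary, and the handling of out-of-range energies are assumptions made for this sketch; the text does not specify them.

```python
from bisect import bisect_right
from typing import Optional

# Bin edges in kcal/mol, taken from the example ranges in the text.
BIN_EDGES = [-70.0, -50.0, -20.0, 0.0, 10.0]


def energy_bin(energy: float) -> Optional[int]:
    """Map a per-position interaction energy to a bin index 1..4.

    Returns None when the energy lies outside the covered range.
    """
    if energy < BIN_EDGES[0] or energy > BIN_EDGES[-1]:
        return None
    # Assumption: a value on an interior edge is assigned to the higher bin.
    idx = bisect_right(BIN_EDGES, energy)
    # The upper bound of the last bin is treated as inclusive.
    return min(idx, len(BIN_EDGES) - 1)


print(energy_bin(-43.2))  # falls in the -50 to -20 kcal/mol bin
```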

In one aspect, a method is provided for generating a profile-structural interaction fingerprint (p-SIFt) in the form of an information string which comprises a plurality of information blocks wherein each information block comprises a plurality of information units. The method includes selecting a plurality of selected positions on a plurality of target molecules, wherein each selected position corresponds to an information block in the information string. Each target molecule forms a complex with a ligand. The method includes selecting a plurality of interaction types and calculating an aggregate value that is indicative of a characteristic of each interaction type at each selected position of the plurality of target molecules. The value is assigned to a corresponding information unit, the information unit indicating a characteristic of the interaction type at the corresponding selected position. The information units of each selected position are joined together to form corresponding information blocks, and the information blocks are joined together to generate a first p-SIFt.

The method can include comparing the first p-SIFt to a SIFt. The method can include generating a second p-SIFt. The first p-SIFt can be compared to the second p-SIFt. Comparing can include subtracting one of the first p-SIFt and the second p-SIFt from the other.
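As a rough illustration, a p-SIFt can be computed from a collection of equal-length binary SIFts and two p-SIFts can be subtracted unit by unit. The sketch assumes that the aggregate value is the fraction of complexes in which each unit is set; that choice, and the function names `profile_sift` and `subtract_profiles`, are assumptions for illustration only.

```python
from typing import List


def profile_sift(sifts: List[str]) -> List[float]:
    """Aggregate binary SIFts into a profile of per-unit frequencies."""
    if not sifts:
        raise ValueError("at least one SIFt is required")
    length = len(sifts[0])
    if any(len(s) != length for s in sifts):
        raise ValueError("all SIFts must have the same length")
    n = len(sifts)
    # Fraction of complexes in which each information unit is set.
    return [sum(s[i] == "1" for s in sifts) / n for i in range(length)]


def subtract_profiles(first: List[float], second: List[float]) -> List[float]:
    """Unit-wise difference; positive values are more conserved in `first`."""
    if len(first) != len(second):
        raise ValueError("profiles must have the same length")
    return [a - b for a, b in zip(first, second)]


# Hypothetical fingerprints for two groups of complexes.
p1 = profile_sift(["1100", "1010"])
p2 = profile_sift(["1000", "1000"])
print(subtract_profiles(p1, p2))
```

Units with a large positive or negative difference point to interactions that distinguish one group of complexes from the other.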

In another aspect, a method of describing target molecule-ligand interactions includes generating a first plurality of SIFts for a first plurality of target-molecule-ligand complexes, and compiling the first plurality of SIFts to generate a first p-SIFt. The method can include generating a second SIFt or a second p-SIFt, and comparing it to the first p-SIFt. The method can include creating a target molecule-test ligand complex model and generating a SIFt for the model. The SIFt for the model can be compared to the first p-SIFt.

In another aspect, a computer program is provided for generating a profile structural interaction fingerprint (p-SIFt) in the form of an information string which comprises a plurality of information blocks, wherein each information block comprises a plurality of information units. The computer program includes instructions for causing a computer system to select a plurality of selected positions on a plurality of target molecules, where each selected position corresponds to an information block in the information string. Each target molecule forms a complex with a ligand. The computer program also includes instructions for causing the computer to select a plurality of interaction types and calculate an aggregate value that is indicative of a characteristic of each interaction type at each selected position of the plurality of target molecules. The value is assigned to a corresponding information unit, the information unit indicating a characteristic of the interaction type at the corresponding selected position. The information units of each selected position are joined together to form corresponding information blocks, and the information blocks are joined together to generate a first p-SIFt. The computer program can cause the computer system to generate a second p-SIFt, and to compare the first p-SIFt to the second p-SIFt.

In one aspect, a method of predicting the interaction pattern between a target molecule and a test ligand is provided. A test ligand is a ligand whose affinity to the target molecule is under examination. The prediction method involves identifying a plurality of selected positions between the target molecule and a first ligand, wherein the first ligand is known to bind to the target molecule (i.e., the affinity between the first ligand and the target molecule is known). As described above, selected positions are positions on the target molecule that participate in intermolecular interactions with the ligand (here, the first ligand). Based on the selected positions, the method then involves generating a first structural interaction fingerprint (SIFt) as described above (i.e., formation of an information string that includes a plurality of information blocks, where each information block includes a plurality of information units, and where each information unit is assigned a calculated value indicative of the presence/absence or the magnitude of a particular interaction type at the selected position of the target molecule to which the information unit/block corresponds). Using the same selected positions, the method then involves the generation of a second SIFt between the same target molecule and a second ligand (i.e., a test ligand) employing the same steps as described above. Finally, the method involves comparing the first SIFt with the second SIFt to determine the level of overlapping between the first and second SIFts. A pattern of substantial overlapping between the two SIFts predicts that the second ligand interacts with the target molecule in a similar pattern as the first ligand. In one embodiment, the first ligand is the natural ligand of the target molecule. In one embodiment, the first ligand is a ligand of known affinity to the target molecule.

In one aspect, a method of generating a structural interaction fingerprint (SIFt) database is provided. The method involves (1) identifying a plurality of selected positions on a target molecule (which forms a complex with a first ligand) and (2) generating a first SIFt of the database as described above (i.e., formation of an information string that includes a plurality of information blocks where each information block includes a plurality of information units, and where each information unit is assigned a calculated value indicative of the presence/absence or the magnitude of a particular interaction type at the selected position of the target molecule to which the information unit/block corresponds). The method then requires that steps (1) and (2) be repeated using the same target molecule but a different ligand such that another SIFt can be generated and added to the database. The method then repeats steps (1) and (2) with different ligands and generates more SIFts until the database contains a desired number of SIFts. In one embodiment, the method further involves analyzing the SIFts of the database to generate one or more interaction patterns between the target molecule and the ligands. Typically, ligands that belong to a particular interaction pattern bind to the target molecule in a similar manner. In one embodiment, the method further involves comparing one or more interaction patterns of the database with a SIFt generated by using the same target molecule and a test ligand. A test ligand is a ligand that was not employed in generating the database. From the degree of similarity between the SIFt generated using the test ligand and the interaction pattern, one can predict whether or not the test ligand binds to the target molecule in a similar manner.

One can even predict whether or not the test ligand belongs to the same family of ligands used to generate the database. In one embodiment, the method further includes the step of storing the database in a computer readable medium.
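A SIFt database of this kind could be sketched as a small keyed collection that is serialized for storage. The dictionary layout, the field names, the target and ligand identifiers, and the use of JSON are all assumptions made for this sketch; no storage format is prescribed above.

```python
import json
from typing import List


def make_database(target: str, positions: List[str], units_per_position: int) -> dict:
    """Create an empty SIFt database for one target and a fixed set of positions."""
    return {
        "target": target,
        "positions": list(positions),
        "units_per_position": units_per_position,
        "sifts": {},
    }


def add_sift(db: dict, ligand_id: str, sift: str) -> None:
    """Add one SIFt, checking that it matches the database layout."""
    expected = len(db["positions"]) * db["units_per_position"]
    if len(sift) != expected:
        raise ValueError(f"expected {expected} units, got {len(sift)}")
    db["sifts"][ligand_id] = sift


# Hypothetical target, positions, and ligands.
db = make_database("kinase_X", ["LYS33", "GLU81"], units_per_position=3)
add_sift(db, "ligand_A", "110000")
add_sift(db, "ligand_B", "110100")

# Serialize so the database can be written to a computer readable medium.
serialized = json.dumps(db, indent=2)
```

Storing the positions and the number of units per position alongside the fingerprints keeps every entry interpretable when a test ligand is compared against the database later.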

In one aspect, a method of analyzing the interaction pattern of two or more related target molecules is provided. The method includes conducting sequence and structural alignments among each of the related target molecules to derive a uniform residue or base numbering system. The method then involves identifying a plurality of selected positions on the target molecule of each target molecule-ligand complex using the uniform residue or base numbering system. This is followed by generating a SIFt for each target molecule-ligand complex as described above and comparing different SIFt patterns. The interactions can be conserved or unconserved.
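One simple way to obtain a uniform numbering is to number positions by alignment column. The sketch below assumes that a gapped alignment (with `-` as the gap character) has already been produced by some alignment tool; the function name `column_map` and the example sequences are hypothetical.

```python
from typing import Dict


def column_map(aligned_seq: str, first_residue_number: int = 1) -> Dict[int, int]:
    """Map each native residue number to its 0-based alignment column."""
    mapping: Dict[int, int] = {}
    number = first_residue_number
    for col, symbol in enumerate(aligned_seq):
        if symbol != "-":          # gaps consume a column but no residue
            mapping[number] = col
            number += 1
    return mapping


# Two hypothetical aligned sequences from related targets.
map_a = column_map("A-CD")
map_b = column_map("AB-D", first_residue_number=10)
print(map_a, map_b)
```

Residue 3 of the first target and residue 12 of the second both map to column 3, so they would share one information block in fingerprints built on the uniform numbering.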

The method can include compiling the SIFts to identify selected interactions that are conserved among the complexes. The method can include calculating a score for each interaction among the target molecule-ligand complexes. The score can include a conservation score. The method can include compiling the SIFts to form a p-SIFt from the calculated conservation score, or comparing a SIFt generated from a test ligand with a p-SIFt generated from a group of target molecule-ligand complexes, thereby predicting whether the test ligand interacts with the target molecule in a similar pattern with the group. The method can include comparing two p-SIFts, thereby predicting whether two groups of structures share conserved binding interactions, and/or have similar binding pattern.
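Comparing a SIFt from a test ligand with a p-SIFt might look like the following sketch. It assumes that the p-SIFt holds a conservation score between 0 and 1 for each unit and that the comparison is the fraction of total conservation recovered by the test SIFt; this scoring rule and the name `profile_score` are illustrative assumptions, not the scoring defined by the method.

```python
from typing import List


def profile_score(sift: str, profile: List[float]) -> float:
    """Fraction of the profile's total conservation matched by set units of the SIFt."""
    if len(sift) != len(profile):
        raise ValueError("SIFt and profile must have the same length")
    total = sum(profile)
    if total == 0:
        return 0.0
    matched = sum(p for bit, p in zip(sift, profile) if bit == "1")
    return matched / total


# Hypothetical p-SIFt with per-unit conservation scores.
profile = [1.0, 0.5, 0.5, 0.0]
print(profile_score("1100", profile))
```

A test ligand that reproduces the highly conserved interactions scores close to 1, while one that only makes unconserved interactions scores close to 0.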

In another aspect, a method is provided for generating an R-group-structural interaction fingerprint (r-SIFt) in the form of an information string which includes a plurality of information blocks where each information block includes a plurality of information units.

The method includes selecting a plurality of selected positions on a first ligand. Each selected position corresponds to an information block in the information string. The first ligand forms a complex with a target molecule. The method includes selecting an interaction type and calculating a value that is indicative of a characteristic of the interaction type at each selected position of the first ligand, and assigning the value to a corresponding information unit, the information unit indicating a characteristic of the interaction type at the corresponding selected position. The method also includes joining the information units of each selected position together to form corresponding information blocks, and joining the information blocks together to generate an r-SIFt.

The target molecule can be a protein, a peptide, or a nucleic acid. The first ligand can be a small molecule, a peptide, a protein or a nucleic acid. The value that is assigned to an information unit can be a binary value which indicates the presence or absence of a particular interaction type at the corresponding selected position. The interaction type can be contact interaction.

The method can include selecting a plurality of selected positions on a plurality of ligands, where each selected position corresponds to an information block in the information string. Each of the plurality of ligands forms a complex with the target molecule. The method includes calculating a value that is indicative of a characteristic of the interaction type at each selected position of the plurality of ligands, and assigning the value to a corresponding information unit, the information unit indicating a characteristic of the interaction type at the corresponding selected position. The method also includes joining the information units of each selected position together to form corresponding information blocks, and joining the information blocks together to generate an r-SIFt for each of the plurality of ligands.

The plurality of ligands can be selected from a combinatorial library. The method can include comparing one r-SIFt to a second r-SIFt. The method can include grouping an r-SIFt based on the comparison.
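An r-SIFt with a single binary contact unit per ligand position can be sketched as below. The sketch assumes that each selected position is an R group given as a list of atom coordinates, that contact means any atom pair within a distance cutoff (4 Å here), and that R groups are ordered by name; the function name `r_sift` and the coordinates are hypothetical.

```python
import math
from typing import Dict, List, Sequence, Tuple

Point = Tuple[float, float, float]


def r_sift(r_groups: Dict[str, List[Point]],
           target_atoms: Sequence[Point],
           cutoff: float = 4.0) -> str:
    """Return one contact bit per R group, in sorted R-group order."""
    bits = []
    for name in sorted(r_groups):
        # The R group is in contact if any of its atoms lies within the cutoff
        # of any target atom.
        in_contact = any(
            math.dist(atom, t) <= cutoff
            for atom in r_groups[name]
            for t in target_atoms
        )
        bits.append("1" if in_contact else "0")
    return "".join(bits)


# Hypothetical coordinates in angstroms.
target_atoms = [(0.0, 0.0, 0.0)]
ligand = {"R1": [(3.0, 0.0, 0.0)], "R2": [(10.0, 0.0, 0.0)]}
print(r_sift(ligand, target_atoms))
```

Because the positions are defined on the ligand scaffold, members of a combinatorial library that share the scaffold yield r-SIFts of equal length that can be compared and grouped directly.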

The method can include classifying each of the plurality of ligands into a class according to the degree of similarity of their respective r-SIFts to the r-SIFt of the first ligand. The method can include determining a chemical or physical property of the selected positions of the plurality of ligands. The chemical or physical property can be correlated with the class. The method can include determining a chemical or physical property for a part of a compound and classifying the compound into a class. The chemical or physical property can be F_COUNT, P_COUNT, S_COUNT, CL_COUNT, BR_COUNT, ALOGP, MOLECULAR_POLARSURFACEAREA, NUM_H_ACCEPTORS, NUM_H_DONORS, NUM_ATOMS, NUM_HYDROGENS, NUM_POSITIVEATOMS, NUM_ROTATABLEBONDS, NUM_BRIDGEBONDS, NUM_RINGS, NUM_AROMATICRINGS, NUM_RINGASSEMBLIES, NUM_CHAINS, NUM_CHAINASSEMBLIES, NUM_STEREOBONDS, NUM_UNKNOWNSTEREOBONDS, NUM_ATOMCLASSES, LOGD, or MOLECULAR_WEIGHT.

In another aspect, a computer program is provided for generating an R-group structural interaction fingerprint (r-SIFt) in the form of an information string which includes a plurality of information blocks, wherein each information block includes a plurality of information units. The computer program includes instructions for causing a computer system to select a plurality of selected positions on a first ligand, where each selected position corresponds to an information block in the information string, and the first ligand forms a complex with a target molecule. The computer program includes instructions to select an interaction type and calculate a value that is indicative of a characteristic of the interaction type at each selected position of the first ligand, and assign the value to a corresponding information unit, the information unit indicating a characteristic of the interaction type at the corresponding selected position. The computer program also includes instructions to join the information units of each selected position together to form corresponding information blocks, and join the information blocks together to generate an r-SIFt. The computer program can include instructions for causing the computer system to generate a second r-SIFt.

As used herein, the target molecules are related if they exhibit at least 20% sequence similarity or a structural similarity with a root-mean squared deviation over the aligned positions no greater than 4 Å (e.g., 6 Å). In yet another embodiment, the target molecules are related if they exhibit at least 20% protein sequence similarity with a root-mean squared deviation over the aligned positions no greater than 6 Å. For protein target molecules, sequence and structural alignments are commonly applied within the structural biology field. There are databases including the PFAM database that includes protein sequence alignments (http://www.sanger.ac.uk/software/Pfam/index.shtml) and the SCOP database (http://scop.mrc-lmb.cam.ac.uk/scop/) that contains protein structural alignments.
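The relatedness criterion can be made concrete with a short sketch. It assumes that the two structures have already been superimposed and that the coordinate lists contain only the aligned positions, in matching order; the function names `rmsd` and `are_related`, and the default thresholds of 20% and 4 Å, are written here only to mirror the first definition above.

```python
import math
from typing import Sequence, Tuple

Point = Tuple[float, float, float]


def rmsd(a: Sequence[Point], b: Sequence[Point]) -> float:
    """Root-mean squared deviation over paired, pre-superimposed positions."""
    if len(a) != len(b) or not a:
        raise ValueError("coordinate lists must be non-empty and of equal length")
    squared = sum(math.dist(p, q) ** 2 for p, q in zip(a, b))
    return math.sqrt(squared / len(a))


def are_related(seq_similarity_pct: float, rmsd_value: float,
                min_similarity: float = 20.0, max_rmsd: float = 4.0) -> bool:
    """Targets are related if either the sequence or the structural test passes."""
    return seq_similarity_pct >= min_similarity or rmsd_value <= max_rmsd


# Hypothetical aligned coordinates in angstroms.
deviation = rmsd([(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)],
                 [(0.0, 0.0, 2.0), (1.0, 0.0, 2.0)])
print(deviation, are_related(15.0, deviation))
```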

In some embodiments, at least one interaction type includes a chemical or physical property of a part of the ligand interacting with each selected position. In other embodiments, each interaction type includes a chemical and physical property of a part of the ligand interacting with each selected position. The interaction types can include information bits about the chemical composition of a ligand (e.g., various R groups in a combinatorial library), or an experimentally determined or computed property of the part of the ligand interacting with the selected position. For example, interaction types can include information bits representing varying groups of a combinatorial library. Properties and descriptors of a molecule or part of a molecule can include fragment constant descriptors (e.g., hydrophobic, hydrogen bond acceptor, hydrogen bond donor, hydrophobic aliphatic, hydrophobic aromatic, negative charge, negative ionizable, positive charge, positive ionizable, or aromatic ring), electronic descriptors (e.g., charge, partial positive surface area, partial negative surface area, dipole moment, atomic polarizability, polar surface area), topological descriptors (e.g., Wiener index, Zagreb index, Hosoya index), molecular flexibility index, spatial descriptors (e.g., shadow indices, molecular surface area, density, principal moment of inertia, molecular volume), structural descriptors (e.g., number of chiral centers, molecular weight, number of rotatable bonds), or thermodynamic descriptors (e.g., partition coefficient, desolvation free energies for water and octanol, pKa). The interaction type can also include a chemical fingerprint for a part of the ligand interacting with the selected position of the target molecule. A chemical fingerprint is a string of values (usually an array of binary bits) that contains the unique information about the chemical makeup (e.g., atoms, substructures, chirality) of the molecule. 
In some embodiments, the interaction types can also include information about the selected position in the target molecule, such as variables measuring the sequence conservation, structural conservation and flexibility of the selected position of the target molecule.

In a further aspect, a computer-readable data storage medium is provided. The medium includes a data storage material encoded with a computer-readable database. The database includes a plurality of SIFts generated from a target molecule and a plurality of ligands. Each SIFt is in the form of an information string that includes a plurality of information blocks, and each information block includes a plurality of information units.

The target molecule interacts with each ligand at a plurality of selected positions on the target molecule via a number of interaction types. As described above, selected positions are positions on the target molecule that participate in intermolecular interaction with the ligand.

The magnitude of each interaction type at each selected position is calculated and represented by a value, which is assigned to a corresponding information unit. The target molecule can be a protein, a peptide, or a nucleic acid, and the ligand can be a small molecule, a peptide, a protein or a nucleic acid. In one embodiment, the value that is assigned to an information unit is a binary value, which indicates the presence or absence of a particular interaction type at the corresponding selected position. In one embodiment, the value that is assigned to an information unit is selected from a range of scaled numeric values, which indicates the magnitude of a particular interaction type at the corresponding selected position. For a protein/peptide target molecule, each selected position can include one or more amino acid residues, main chain atom groups, side chain atom groups, or individual atoms of the target molecule. For a nucleic acid target molecule, each selected position can include one or more bases, functional groups, or individual atoms of the target molecule. In one embodiment, the interaction type can be a contact interaction. For example, the interatomic contact distance between the target molecule and the ligand can be equal to or less than 10 Å (e.g., equal to or less than 6 Å, or even 4 Å) for the target molecule-ligand pair to be considered as having contact interaction. As another example, the contact interaction can include a change in the accessible surface area of the target molecule upon forming a complex with the ligand. In one embodiment, the interaction type can be a polar interaction, a non-polar interaction, or a hydrogen bond interaction. In one embodiment, the hydrogen bond interaction can include a hydrogen bond donor in the target molecule and a hydrogen bond acceptor in the ligand at the corresponding selected position. 
In one embodiment, the hydrogen bond interaction can include a hydrogen bond acceptor in the target molecule and a hydrogen bond donor in the ligand at the corresponding selected position.

In yet a further aspect, a computer program for generating a SIFt that is in the form of an information string comprising a plurality of information blocks, where each information block includes a plurality of information units is provided. The computer program contains instructions for causing a computer system to select a plurality of positions (selected positions) on a target molecule (which is forming a complex with a ligand). The selected positions are positions on the target molecule that participate in intermolecular interaction with the ligand. Each selected position corresponds to an information block in the information string. The computer program can perform one or more of the following steps: select a plurality of interaction types that exist between the target molecule and the ligand; calculate a value that is indicative of the characteristic of each interaction type at each selected position of the target molecule; assign the value to the corresponding information unit so as to indicate the characteristic of that particular interaction type at the corresponding selected position; join the information units of each selected position together to form the corresponding information blocks; and join the information blocks to generate a SIFt. The target molecule can be a protein, a peptide, or a nucleic acid, and the ligand can be a small molecule, a peptide, or a nucleic acid. In one embodiment, the value that is assigned to an information unit is a binary value, which indicates the presence or absence of a particular interaction type at the corresponding selected position. In one embodiment, the value that is assigned to an information unit is selected from a range of scaled numeric values, which indicates the magnitude of a particular interaction type at the corresponding selected position. In one embodiment, the selected positions are obtained from a three-dimensional structure of a binary complex formed between the target molecule and the ligand. 
Such a three-dimensional structure may be derived from an experimental method or a prediction method such as, for example, an in silico prediction method. For a protein/peptide target molecule, each selected position can include one or more amino acid residues, main chain atom groups, side chain atom groups, or individual atoms of the target molecule. For a nucleic acid target molecule, each selected position can include one or more bases, functional groups, or individual atoms of the target molecule. The interaction types represent different types of intermolecular interactions between the target molecule and the ligand and can be characterized by a binding energy-based approach. In one embodiment, the interaction type can be a contact interaction. For example, the interatomic contact distance between the target molecule and the ligand can be equal to or less than 10 Å (e.g., equal to or less than 6 Å, or even 4 Å) for the target molecule-ligand pair to be considered as having contact interaction. As another example, the contact interaction can include a change in the accessible surface area of the target molecule upon forming a complex with the ligand. In one embodiment, the interaction type can be a polar interaction, a non-polar interaction, or a hydrogen bond interaction. In one embodiment, the hydrogen bond interaction can include a hydrogen bond donor in the target molecule and a hydrogen bond acceptor in the ligand at the corresponding selected position. In one embodiment, the hydrogen bond interaction can include a hydrogen bond acceptor in the target molecule and a hydrogen bond donor in the ligand at the corresponding selected position. In one embodiment, the computer program can further include instructions to store the SIFt in a database. In one embodiment, the computer program can include instructions for generating a plurality of SIFts by repeating the steps recited above using, e.g., the same target molecule and selected positions, but different ligands.
The plurality of SIFts may then be stored in a database. In one embodiment, the computer program can further include instructions to generate a SIFt using the same target molecule and a test ligand, and to compare this SIFt with another SIFt (e.g., generated using the same target and a known ligand) or another group of SIFts (i.e., either one SIFt or a plurality of SIFts forming an interaction pattern). Various methods can be used to compare the generated SIFt with one or more other SIFts. For example, a comparison can be performed using a simple sum of matching bits (units) across the entire SIFt, or by the application of one or more similarity measures (including, e.g., Tanimoto coefficient, Euclidean distance, cosine correlation coefficient, correlation, half square Euclidean distance, and city block distance). Furthermore, a library of SIFts can be compared by, for example, first carrying out all pairwise comparisons using one of the similarity measures mentioned above and then applying hierarchical clustering to group SIFts according to their similarity. The clustering can use, for example, one or more common cluster similarity methods (including, e.g., UPGMA (Unweighted Pair-Group Method with Arithmetic mean), WPGMA (Weighted Pair-Group Method with Arithmetic mean), single linkage, complete linkage, and Ward's method).
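As a minimal sketch of some of the comparison measures named above, the following Python computes the sum of matching units, the Euclidean distance, the city block distance, and the cosine coefficient for two equal-length fingerprints given as lists of numbers. The function names and the toy fingerprints are hypothetical, and the textbook definitions of these measures are assumed, since the text does not spell them out.

```python
import math

def matching_units(a, b):
    """Number of positions at which the two fingerprints hold the same value."""
    return sum(1 for x, y in zip(a, b) if x == y)

def euclidean(a, b):
    """Square root of the summed squared differences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def city_block(a, b):
    """Sum of absolute differences (Manhattan distance)."""
    return sum(abs(x - y) for x, y in zip(a, b))

def cosine(a, b):
    """Dot product divided by the product of the vector norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy binary fingerprints (illustrative only).
fp1 = [1, 0, 1, 1]
fp2 = [1, 1, 0, 1]
n_match = matching_units(fp1, fp2)
d_euclid = euclidean(fp1, fp2)
d_city = city_block(fp1, fp2)
s_cos = cosine(fp1, fp2)
```

The same functions accept scaled numeric values in place of binary values, which corresponds to the embodiment in which an information unit encodes the magnitude of an interaction.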

As used herein, a target molecule generally refers to a biomolecule whose functions are desired to be modulated. A target molecule contains a region (i.e., binding site) that allows it to bind to one or more ligands that satisfy the binding criteria. A target molecule can be a macromolecule such as a protein (or polypeptide) or a nucleic acid. A target molecule is typically a bio-macromolecule whose functions can be altered when it is bound to a molecule (i.e., ligand) that fits its binding or active site.

As used herein, a ligand refers to a molecule that binds to the binding or active site of a target molecule. A ligand is typically a smaller molecule than a target molecule and typically binds to a target molecule with high affinity (e.g., with a K_d of at least 1 mM). A ligand can be a natural ligand or substrate (i.e., naturally occurring in a biological system) to the target molecule, e.g., ATP to certain kinases such as p38. A ligand can also be a small molecule inhibitor, e.g., SB203580, which is a well-known inhibitor of p38.

As used herein, a naturally occurring amino acid is defined as one of the twenty amino acids naturally occurring in proteins. These naturally occurring amino acids are the L-isomers of glycine, alanine, valine, leucine, isoleucine, serine, methionine, threonine, phenylalanine, tyrosine, tryptophan, cysteine, proline, histidine, aspartic acid, asparagine, glutamic acid, glutamine, arginine, and lysine. A so-called “unnatural” amino acid is any amino acid other than the twenty named above. Included are D-isomers of the twenty amino acids named above, D or L isomers or racemic mixtures of selenocysteine and selenomethionine, and the D or L forms (or racemic mixtures) of, e.g., nor-leucine, para-nitrophenylalanine, homophenylalanine, para-fluorophenylalanine, 3-amino-2-benzylpropionic acid, homoarginine, and the like. These unnatural amino acids may be used, e.g., in rational drug design in developing inhibitors and/or binding molecules to modulate a protein's activity.

An amino acid is a molecule having the structure where a central carbon atom (the α-carbon atom) is linked to a hydrogen atom, a carboxylic acid group (the carbon atom of which is referred to herein as a “carboxyl carbon atom”), an amino group (the nitrogen atom of which is referred to herein as an “amino nitrogen atom”), and a side chain group that is linked to the α-carbon atom. For example, the side chain group of alanine is a methyl group. Any atom that is not part of a side chain group is a main chain atom, e.g., the α-carbon atom or the hydrogen that joins this carbon atom.

A positively charged amino acid is any naturally occurring or unnatural amino acid having a side chain that is positively charged under normal physiological conditions. The positively charged, naturally occurring amino acids are arginine, lysine, and histidine. A negatively charged amino acid is any naturally occurring or unnatural amino acid having a side chain that is negatively charged under normal physiological conditions. Examples of negatively charged, naturally occurring amino acids are aspartic acid and glutamic acid. A hydrophobic amino acid is any naturally occurring or unnatural amino acid that contains a hydrophobic side chain group. Examples of naturally occurring hydrophobic amino acids are alanine, leucine, isoleucine, valine, proline, phenylalanine, tryptophan, and methionine. An uncharged, hydrophilic amino acid is any naturally occurring or unnatural amino acid that contains a hydrophilic side chain group, but is uncharged at physiological pH. Examples of naturally occurring uncharged, hydrophilic amino acids are serine, threonine, tyrosine, asparagine, glutamine, and cysteine.

As used herein, a polypeptide refers to a polymer of two or more amino acids linked via a peptide bond (i.e., amino acid residues), and occurs when the carboxyl carbon atom of the carboxylic acid group bonded to the α-carbon of one amino acid (or amino acid residue) becomes covalently bound to the amino nitrogen atom of the amino group bonded to the α-carbon of an adjacent amino acid. A protein that includes one or more polypeptide subunits (e.g., DNA polymerase III, RNA polymerase II) or other components (e.g., an RNA molecule, as occurs in telomerase) will also be understood to be included within the meaning of “polypeptide” as used herein. Similarly, fragments of full-length proteins are also “polypeptides”.

The amino acid sequence of a given naturally occurring polypeptide (i.e., the polypeptide's “primary structure”) can be determined by the nucleotide sequence of the coding portion of a mRNA, which is in turn specified by genetic information, typically genomic DNA (including organelle DNA, e.g., mitochondrial or chloroplast DNA).

The secondary structure of a polypeptide refers to the local regular structure of a polypeptide segment, without considering the conformations of the side chains of its residues. Common secondary structure elements include the α-helix and the β-strand. The tertiary structure refers to the three-dimensional arrangement of all atoms in a polypeptide chain.

An amino acid residue of a polypeptide interacts with adjacent residues (e.g., residues that are adjacent in primary, secondary or tertiary structure of a polypeptide) as well as with ligands or substrates based, in part, on the type of side chain group present. For example, hydrophobic amino acids are more likely to interact with other hydrophobic amino acids or hydrophobic molecules. Similarly, hydrophilic amino acids are more likely to interact with other hydrophilic amino acids or hydrophilic molecules. These types of interactions can be identified and characterized as discussed herein based upon a residue's chemical characteristics as well as its interaction with adjacent atoms or molecules.

As used herein, a nucleic acid refers to DNA and RNA, which are both linear polymers of nucleotide subunits. Each nucleotide unit contains a base, a sugar and a phosphate. In DNA, the sugar is deoxyribose, and there are four types of bases: adenine (A), thymine (T), guanine (G), and cytosine (C). In RNA, the sugar is ribose, and the bases are adenine (A), uracil (U), guanine (G), and cytosine (C). In both DNA and RNA, the base is linked to the sugar moiety through a beta-glycosyl linkage, and the nucleotide units are joined together through phosphodiester bonds with phosphates at O3' and O5' of the sugars.

The details of one or more embodiments are set forth in the accompanying drawings and the description below. Other features, objects, and advantages will be apparent from the description and drawings, and from the claims.

DESCRIPTION OF DRAWINGS

FIG. 1 is a flow chart depicting a method of generating a SIFt.

FIG. 2A is an overlay of 100 different docking poses of SB203580 (shown in cyan stick models) in the vicinity of the target protein human p38 (PDB accession code: 1a9u). p38 is shown as ribbon model, and the shades represent different sub-regions of the 34 ligand binding site residues: R—Gly-rich loop, G—segment from -β3 to β4 (including αC), B—β5 and hinge region, M—catalytic loop, Y—Mg loop, O—activation segment. A color version of this figure can be found in Deng, Z.; Chuaqui, C.; Singh, J. “Structural Interaction Fingerprint (SIFt): A novel method for analyzing three-dimensional protein-ligand binding interaction,” J. Med. Chem, 47: 337-344 (2004).

FIG. 2B is a hierarchical clustering of the SIFts of 100 SB203580 docking poses. A color version of this figure can be found in Deng, Z. et al., J. Med. Chem, 47: 337-344 (2004). Each SIFt is represented as one line in the heat map in the middle of the figure, and only ON-bits (1) are shown as blocks. The area to the right of the heat map shows the hierarchical clustering results on the fingerprints, including the dendrogram and the reorganized distance matrix. Colors (represented here as shades of gray) in the distance matrix correspond to the actual pair-wise distance between two SIFts, with dark red (e.g., cutting from top right to bottom left) being the most similar and dark blue (e.g., in the northwest and southeast corners) being the least similar. SIFts in the heat map are rearranged according to the order given by hierarchical clustering. The seven major clusters (labeled 1-7) identified from the dendrogram are marked on the left side of the SIFt heat map. The three lines of blocks above the heat map indicate the locations of the corresponding binding site residues and the bits. In the middle line (alternating shades of gray), each block represents a particular binding site residue, arranged in ascending residue numbers. Within each residue there are seven different binding bits, represented by seven smaller blocks in the third line. Also, the residues are grouped into six different regions as described in FIG. 2A, as indicated in the first line.

FIG. 2C-2I collectively are overlays of the poses within each of the seven clusters (labeled 1-7), in the same reference frame as FIG. 2A. The crystal structure of SB203580 in the 1a9u structure is also shown in each figure as stick model. Color versions of these figures can be found in Deng, Z. et al., J. Med. Chem, 47: 337-344 (2004). Among the binding site residues, only those in contact with the respective clusters are shaded, using the same scheme as in FIG. 2A.

FIG. 3A is a graph showing the PMF docking scores as a function of SIFt cluster number.

FIG. 3B is a graph showing the Consensus docking score as a function of SIFt cluster number.

FIG. 4A is a representation of ligand binding site residues of protein kinases. Shown are the murine PKA (ribbon model) and the ATP molecule (stick model) of the crystal structure 1atp, which was used as the reference structure for the kinase SIFt construction. Residues are grouped into five different regions, shown in shades of gray. The grouping and shading scheme are the same as in FIG. 2A. A color version of this figure can be found in Deng, Z. et al., J. Med. Chem, 47: 337-344 (2004).

FIG. 4B is a hierarchical clustering of SIFts of 89 protein kinase crystal structures. On the right are the dendrogram and the corresponding reorganized distance matrix map. SIFts are reorganized according to the order given by the dendrogram. Six different regions are labeled above the SIFt heat map. Three major clusters (1-3) are labeled on the left side of the heat map. A color version of this figure can be found in Deng, Z. et al., J. Med. Chem, 47: 337-344 (2004).

FIG. 4C is a comparison of the structures of the three different binding modes from FIG. 4B. Three representatives are shown for each cluster.

FIGS. 5A and 5B are graphs showing the comparison of database enrichment using SIFt with ChemScore (FIG. 5A) and PMF score (FIG. 5B). Sixteen known p38 inhibitors were diluted in 1,000 diverse compounds. For each compound, 30 different docking poses were retained and their respective ChemScores and Tanimoto coefficients (compared with the crystal structure 1a9u) were calculated. The best Tanimoto coefficient among the 30 docking poses of a compound is plotted against the best ChemScore or PMF score of the same molecule. The dark dots in the figures represent the 16 known inhibitors, and the lighter dots represent the 1,000 random compounds. The dotted lines indicate the corresponding cut-off scores used to filter the docking poses in order to recover 14 out of 16 (87.5%) known inhibitors. Color versions of these figures can be found in Deng, Z. et al., J. Med. Chem, 47: 337-344 (2004).

FIG. 6 is a schematic example of an embodiment (i.e., bit-string) of the method of FIG. 1.

FIG. 7A is a schematic diagram depicting the decomposition of a molecule into a core and variable groups.

FIG. 7B is a hierarchical clustering of the SIFts of 100 docking poses. The SIFts are constructed to represent different R-groups and the core of the molecule. Each selected position of the target molecule is made up of four binary bits, representing core, R1, R2, R3, and R4, respectively. Each SIFt is shown as one line in the heat map in the left of the figure, and only ON-bits are shown. The shades (colors) of the heat map blocks indicate different R-groups: red—core, blue—R1, yellow—R2, green—R3. The right side of the figure shows the hierarchical clustering results on the fingerprints, including the dendrogram and the reorganized distance matrix. SIFts in the heat map are reorganized according to the order given by the hierarchical clustering. The shaded (colored) bar on top of the SIFt heat map represents five corresponding kinase structural sub-regions in the fingerprints. These sub-regions, each shaded (colored) differently, include the Gly-rich loop (G-loop), the region spanning from β3 to β4 (β3 to β4), β5 and the hinge region, catalytic loop and magnesium loop.

FIG. 8 is a hierarchical clustering of the SIFts of the 100 docking poses. Here the SIFt patterns contain 7 bits per selected position, each representing one of the seven chemical features of the molecule: red -hydrogen bond acceptor (HBA), blue—hydrogen bond donor (HBD), yellow—hydrophobic (HPH), green—polar (POL), cyan—negatively charged (NEG), orange—positively charged (POS), black—aromatic ring (AROM). The hierarchical clustering is based on the new SIFt patterns incorporating the chemical features of the molecules.

FIG. 9A is a p-SIFt generated from the SIFt patterns of four p38 crystal structures: 1a9u, 1b16, 1b17, and 1bmk. The X-axis shows the p38 residue numbers of the interaction bits; the Y-axis represents the conservation scores of the interaction bits.

FIG. 9B shows the p38 inhibitor database enrichment performance using the SIFt-based approach. A library composed of 16 known p38 inhibitors and 1000 random compounds was docked onto the p38 target molecule and enriched using the SIFt-based Z score ranking method. The X-axis is the percentage of the whole library collected, and the Y-axis is the percentage of active compounds harvested. For comparison, the enrichment performances by two conventional scoring functions (ChemScore and PMF Score) are also shown.

FIG. 10 is a graph depicting a p-SIFt.

FIG. 11 is a graph depicting p-SIFts for different groups of target molecules.

FIGS. 12(a)-12(d) are images showing p-SIFt information mapped on to a structure of a complex between a target molecule and a ligand.

FIG. 13 is a graph depicting difference profiles.

FIG. 14 is a graph depicting a difference profile.

FIGS. 15(a)-15(b) are graphs depicting enrichment of a chemical library.

FIGS. 16(a)-16(c) are graphs depicting Z scores for a p-SIFt with SIFts.

FIG. 17a is a drawing showing an overlay of 150 poses of 1ouk-inh docked onto the human p38 structure.

FIG. 17b is a hierarchical clustering of the r-SIFts of 150 1ouk-inh docking poses. Each r-SIFt is represented as one horizontal line in the heat map, and only ON-bits (1) are shown. The interaction bits are colored according to their respective molecular fragments (red—core, blue—R1, purple—R2, green—R3). The left side of the heat map shows the dendrogram of the hierarchical clustering result. r-SIFts in the heat map are rearranged according to the order given by clustering. Four major clusters (labeled 1-4) identified from the dendrogram are labeled on the right side of the r-SIFt heat map. The line of blocks above the heat map indicates the locations of the corresponding binding site residues in the protein. The residues are grouped into six different regions as described previously.

FIGS. 17c-f each display an overlay of the docking poses within each cluster (1-4), shown in the same reference frame as FIG. 17a.

FIG. 18 shows 2D chemical structures of kinase inhibitors and their R group definitions.

FIG. 19a shows a hierarchical clustering of docking poses of five different compounds docked onto the p38 structure (1ouk). The bit-coloring scheme and structure layout are identical to those in FIG. 17b. For each compound, three poses with the best r-SIFt Tanimoto coefficients (see text) were chosen and analyzed. Since SKF-86002 does not contain R3, the r-SIFt patterns on display were constructed by omitting all the R3 bits (if present). For comparison purposes, the r-SIFt of the co-crystal structure of 1ouk is included.

FIG. 19b shows structures of the best docking pose of each of the five molecules, shown within the same active site of the target molecule structure 1ouk. The co-crystal structure of 1ouk-inh is shown as a thin yellow line model for comparison.

FIGS. 19c-g show structures of the docking poses of each compound (three poses per molecule) used in FIG. 19a, shown in the same reference frame as in FIG. 19b.

FIG. 20a shows a classification of the 2208 1ouk-inh R1 library compounds based on their r-SIFt similarities. The coloring scheme is the same as in FIG. 17b. For clarity, the dendrogram as shown in FIGS. 17b and 19a is replaced with a Tanimoto coefficient distance matrix of the r-SIFt patterns. The compound order in the distance matrix matches that in the SIFt heat map, and the coloring gradient in the distance matrix corresponds to the values of the Tanimoto similarity score, from dark red (highest similarity) to dark blue (least similar). Compounds that display a native binding mode similar to the co-crystal structure (FIG. 20b), in which R1 is correctly located in the hydrophobic region, are labeled as the native cluster, and the rest of the compounds are labeled as non-native.

FIG. 20b shows the 3D structures of 200 example compounds in the native cluster. The co-crystal structure is shown as a yellow stick model.

FIG. 20c shows examples of compounds in native and non-native clusters. The R1 attachment points are labeled.

FIG. 21 is a decision tree predictive model for the 1ouk-inh R1 library.

FIG. 22 is an r-SIFt based library design flowchart.

DETAILED DESCRIPTION

As used herein and in the appended claims, the singular forms “a,” “an,” and “the” include plural referents unless the context clearly dictates otherwise. Thus, for example, reference to “a protein” includes a plurality of proteins and reference to “the polypeptide” generally includes reference to one or more polypeptides and equivalents thereof known to those skilled in the art, and so forth.

Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art. Although any methods, devices and materials similar or equivalent to those described herein may be used, the typical methods, devices and materials are now described.

All publications mentioned herein are incorporated herein by reference in full for the purpose of describing and disclosing the databases, proteins, and methodologies described in the publications that might be used in connection with the presently described techniques. The publications discussed above and throughout the text are provided solely for their disclosure prior to the filing date of the present application. Nothing herein is to be construed as an admission that the inventors are not entitled to antedate such disclosure by virtue of prior invention.

Techniques are provided for a simple and robust method for representing and analyzing three-dimensional target molecule-ligand interactions. This method generates a structural interaction fingerprint (SIFt)—a representation of the interactions in the three-dimensional binary complexes, i.e., target molecule-ligand (e.g., protein-ligand or nucleic acid-ligand) complexes. The representation is in the form of an information string (e.g., a binary bit string) containing a plurality of information blocks; each of which, in turn, contains a plurality of information units. Before one constructs a SIFt, one has to select the binary (target molecule-ligand) complexes.

A. Construction and Analysis of SIFts

I. Selection of Three-Dimensional Binary Complex Structures

The SIFt-based method employs a set of three-dimensional binary structures (e.g., the molecular docking results) to generate a set of SIFts. The set of structures can be obtained from different poses of a selected pair of target molecule (e.g., a protein such as a kinase) and ligand (e.g., a natural ligand or an inhibitor). See, e.g., Example 1, wherein the set of structures was obtained from 100 different poses of a pyridinyl imidazole inhibitor docked onto a single protein kinase p38 structure. In another aspect, the set of structures can be obtained from structural data (e.g., docking results) of a number of different ligands interacting with a single target molecule. See, e.g., Example 2, wherein the set of structures was obtained from docking a group of different small molecules (a library of 1,016 small molecules) onto the same target molecule (a protein kinase p38 structure). In a further aspect, the set of structures can be obtained from different target molecules and different ligands (see, e.g., Example 3, wherein both the target molecules (protein kinases) and ligands are different). Using different target molecules requires additional structural and sequence alignment steps, which will be further discussed below. Once a set of structures has been obtained, one can proceed to construct SIFts.

II. Construction of a SIFt

(i) Identification of the Selected Positions of a Target Molecule

The next step involves selection of a set of positions (“selected positions”) on the target molecule of each of the structures where each of these selected positions is commonly involved in interactions (e.g., non-covalent interaction) between the target molecule and the ligand. These positions serve as reference points covering all of the interactions in the target molecule-ligand complex, and are then used as the common reference frame for constructing SIFts.

The selected positions are defined as regions of the target molecule that are in contact with the ligand. Different methods have been developed to determine whether contacts have been made between the target molecule and the ligand in the context of a particular interaction. Below is a description of two exemplary methods.

For example, the program AREAIMOL of the CCP4 suite (which refers to “Collaborative Computational Project, Number 4.” See The CCP4 suite: programs for protein crystallography. Acta Cryst., D50, 760-763, 1994; and Lee et al., J. Mol. Biol. 55:379-400, 1971) can be used to identify the target molecule atoms that are involved in the non-covalent intermolecular interactions with the ligand. AREAIMOL evaluates the solvent accessible area by rolling a probe sphere of radius 1.4 Å over the van der Waals surface of the target molecule and of the target molecule-ligand complex. Note that solvent molecules can be excluded for the sake of simplicity, although in theory well-ordered solvent molecules can be included and treated in the same way as target molecule atoms. For protein target molecules, if the solvent accessibility of non-hydrogen atoms decreases upon ligand binding and these atoms are also within 4.5 Å of any of the non-hydrogen atoms of the ligand, the residues corresponding to these atoms are identified as selected positions (or ligand binding atoms). The determination of selected positions in a nucleic acid can be done in a similar manner.
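As a minimal sketch of the distance part of this criterion only, the following Python collects the residues that have at least one non-hydrogen atom within 4.5 Å of a ligand atom. The function name `residues_within_cutoff`, the tuple layout of the atoms, and the toy coordinates are hypothetical; the change in solvent accessibility that AREAIMOL evaluates is not reproduced here, so this is a simplification of the described procedure and not a replacement for it.

```python
import math

def residues_within_cutoff(target_atoms, ligand_atoms, cutoff=4.5):
    """Return sorted residue numbers with any atom within `cutoff` angstroms of the ligand.

    target_atoms: iterable of (residue_number, x, y, z) for non-hydrogen atoms.
    ligand_atoms: iterable of (x, y, z) for non-hydrogen atoms.
    """
    selected = set()
    for res, x, y, z in target_atoms:
        for lx, ly, lz in ligand_atoms:
            if math.dist((x, y, z), (lx, ly, lz)) <= cutoff:
                selected.add(res)
                break  # one close ligand atom is enough for this target atom
    return sorted(selected)

# Toy coordinates (illustrative only, not taken from any real structure).
target = [
    (30, 0.0, 0.0, 0.0),
    (30, 1.5, 0.0, 0.0),
    (53, 10.0, 0.0, 0.0),
    (106, 3.0, 4.0, 0.0),
]
ligand = [(0.0, 3.0, 0.0), (0.0, 4.0, 0.0)]
contacts = residues_within_cutoff(target, ligand)
```

In this toy input, residues 30 and 106 each have an atom within the cutoff, while residue 53 does not.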

As to hydrogen bonding interactions between the target molecule and the ligand, one can employ programs such as HBPLUS. See McDonald et al., J. Mol. Biol. 238:777-793, 1994. HBPLUS calculates and lists all possible hydrogen bond donor and acceptor pairs in the complex.

For a set of structures using the same target molecule, after all the ligand binding atoms and their respective residues or bases have been identified, these ligand binding positions are computed and defined as the “selected positions” of the target molecule. As mentioned above, different target molecules can be used. In such circumstances, additional structural and sequence alignment steps are required to convert different but related target molecules into a standard residue numbering system so that a common framework can be employed for constructing the SIFts (see, e.g., Example 3). In some cases, the selected positions can be modified after a SIFt is first constructed, for example, if a subset of the selected positions is found to be more important than other positions in the initial SIFt.

(ii) Determination and Calculation of Interaction Types

After identification of the selected positions (i.e., regions of the target molecule where intermolecular interactions take place), one has to determine and calculate the types of interactions present at these positions. In one embodiment, the target molecule can be a polypeptide or a protein and seven interaction types can be employed based on the AREAIMOL and HBPLUS results. The presence or absence of the interaction types can be calculated at each selected position based on the following inquiries: 1) whether or not it is in contact with the ligand; 2) whether or not any peptide backbone atom is involved in the contact; 3) whether or not any side-chain atom is involved in the binding; 4) whether or not polar interaction is involved; 5) whether or not non-polar interaction is involved; 6) whether or not this residue provides hydrogen bond acceptor(s); and 7) whether or not it provides hydrogen-bond donor(s). The answer to each inquiry constitutes an information unit (in this embodiment, a bit) that corresponds to a particular selected position. By joining the information units together, an information block is formed (in this embodiment, a seven-bit-long block). The entire SIFt can then be constructed by sequentially concatenating the information blocks of each of the selected positions, according to ascending position number (e.g., residue number) order.
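The seven-bit embodiment described above can be sketched as follows, assuming that the answers to the seven inquiries are already available for each selected position. The labels in `INTERACTION_TYPES`, the function name `build_sift`, and the toy input are hypothetical and only illustrate the concatenation step.

```python
# One label per inquiry, in the order listed in the text; the names are hypothetical.
INTERACTION_TYPES = (
    "contact", "backbone", "sidechain", "polar",
    "nonpolar", "hb_acceptor", "hb_donor",
)

def build_sift(selected_positions, interactions):
    """Concatenate one seven-bit block per selected position, in ascending position order.

    interactions maps a position number to the set of interaction types found there.
    A selected position with no entry yields an all-zero block.
    """
    blocks = []
    for pos in sorted(selected_positions):
        present = interactions.get(pos, set())
        blocks.append("".join("1" if t in present else "0" for t in INTERACTION_TYPES))
    return "".join(blocks)

# Toy input (illustrative only).
positions = [106, 30, 53]
observed = {
    30: {"contact", "sidechain", "nonpolar"},
    106: {"contact", "backbone", "polar", "hb_donor"},
}
sift = build_sift(positions, observed)
```

Because every structure in a set is encoded against the same selected positions, the resulting strings have equal length (here 3 positions × 7 bits = 21 bits) and can be compared bit by bit.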

The SIFts resulting from a set of structures are therefore of the same length, and each information unit (e.g., bit) in the fingerprint represents the strength or the presence/absence of a particular interaction type at a particular selected position. As a result, the SIFts are directly comparable. Once SIFts are generated from a set of structures, one can perform analyses of the SIFts to obtain valuable interaction patterns and information (e.g., the degree of binding conservation among the target molecule-ligand pairs).

The interaction types can be classified in a number of ways. For example, the interaction types can be fragment constants descriptors (e.g., hydrophobicity, hydrogen bond acceptor, hydrogen bond donor), electronic descriptors (e.g., charge, partial positive surface area, partial negative surface area, dipole moment, atomic polarizability), topological descriptors (e.g., Wiener index, Zagreb index, Hosoya index), molecular flexibility indices, spatial descriptors (e.g., shadow indices, molecular surface area, density, principal moment of inertia, molecular volume), structural descriptors (e.g., number of chiral centers, molecular weight, number of rotatable bonds), or thermodynamic descriptors (e.g., partition coefficient, desolvation free energies for water and octanol, pK_a).

Hydrophobicity is a measure of the thermodynamics of the partitioning of a molecule or part of a molecule between water and a non-aqueous phase (e.g., an organic solvent), in particular, the free energy change (ΔG⁰_transfer) associated with transferring a molecule or part of the molecule from a non-aqueous phase to water. In one popular definition (CATALYST™, Accelrys Inc., San Diego, CA 92121, USA), a contiguous set of atoms is defined as hydrophobic if the atoms are not adjacent to any concentrations of charge (charged atoms or electronegative atoms), in a conformation such that the atoms have surface accessibility. Some examples of hydrophobic groups include phenyl, cycloalkyl, isopropyl, and methyl.

III. Analysis of SIFts

(i) Measurement of Similarity of SIFts

As discussed above, each SIFt represents the interaction pattern between a target molecule and a ligand. It follows that similar SIFts reflect similar interaction patterns among the target molecule-ligand pairs.

Different methods can be employed to measure similarity between SIFts. For example, one can use the Tanimoto coefficient (Tc, see Willett, J. Chem. Inf. Comput. Sci. 38:983-996, 1998), which provides a quantitative measure of the similarity. Using the bit-string embodiment described above, the Tc between bit-strings A and B is defined as:
$Tc(A, B) = \frac{|A \cap B|}{|A \cup B|}$

where |A ∩ B| is the number of ON-bits common to both A and B and |A ∪ B| is the number of ON-bits present in either A or B.
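The Tc definition can be sketched for bit-strings as follows. The function name `tanimoto` is hypothetical, and returning 0.0 when neither string has an ON-bit is an assumption made for this sketch, since the definition leaves that case undefined.

```python
def tanimoto(a, b):
    """Tanimoto coefficient of two equal-length bit-strings of '0' and '1' characters."""
    if len(a) != len(b):
        raise ValueError("bit-strings must have the same length")
    common = sum(1 for x, y in zip(a, b) if x == "1" and y == "1")  # |A ∩ B|
    either = sum(1 for x, y in zip(a, b) if x == "1" or y == "1")   # |A ∪ B|
    # Assumption for this sketch: two all-zero strings score 0.0.
    return common / either if either else 0.0

# Toy bit-strings (illustrative only): 2 ON-bits in common, 4 ON-bits in either.
tc = tanimoto("1011000", "1001001")
```

Here `tc` evaluates to 0.5, since the two strings share two ON-bits out of four that are ON in at least one of them.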

(ii) Classification of SIFts Based on Similarity

Based on the similarity measurements, one can classify similar SIFts displaying similar interaction patterns for further analysis, using methods such as hierarchical clustering. From the clustering results, structures can be clustered into groups having similar binding modes.

To analyze and compare the interaction patterns within a group or between groups, a p-SIFt can be generated by quantifying the degree of similarity of each information unit at each selected position within the SIFts. One example is to calculate an interaction conservation score for each information unit (e.g., bit) among each group. This score represents the percentage of SIFts that are ON (i.e., occurrence or presence of the interaction type) at this particular selected position. The higher the score, the more conserved this interaction type is within this group. Variations in the conservation scores between two groups reveal the differences of their interaction patterns.

The p-SIFt approach is similar to profile-based techniques that have proven to be very useful in the analysis and database mining of groups of protein sequences and structures. See, for example, Gribskov, M.; et al., Proc. Natl. Acad. Sci. USA 1987, 84, 4355-4358; Gribskov, M.; et al., Methods Enzymol. 1990, 183, 146-159; Wang, G.; and Dunbrack, R. L., Jr. Protein Sci. 2004, 13, 1612-1626; Mehta, P. K.; et al., Proteins 1999, 35, 387-400; Rice, D. W.; and Eisenberg, D. J. Mol. Biol. 1997, 267, 1026-1038; and Koonin, E. V.; et al., Adv. Protein Chem. 2000, 54, 245-275; each of which is incorporated by reference in its entirety. The sequence profile can be constructed from a set of multiply aligned sequences or structures of a probe family and is used to identify distant relationships to a database of target proteins. The profile is essentially a sequence position-specific scoring matrix encoding the probability of finding any of the 20 amino acid residues at that position in the target. In the case of p-SIFt, the SIFts derived from a set of probe structures are used to derive a position-dependent profile encoding the probability that a given interaction at that position is present. The probe set of structures can correspond to members of a gene family, e.g., kinases, or to sub-families of structures representing ligands with a particular activity or selectivity profile.

A structural interaction fingerprint profile (p-SIFt) represents the degree to which interactions are conserved across a set of ligand-receptor complexes. The p-SIFt, P(r), is derived from an array, denoted below as b, of SIFt patterns. The array has length N, the total number of protein-ligand complexes, and width K, the number of SIFt fingerprint bits. The value of each element of P(r) is derived by averaging the elements in each column of the SIFt matrix, yielding a numerical interaction frequency that varies from 0 to 1 for unobserved to fully conserved, respectively. The SIFt array, b, and resulting P(r) are given by
$b = \begin{pmatrix} b_{1, 1} & b_{1, 2} & b_{1, 3} & \dots & b_{1, K} \\ b_{2, 1} & b_{2, 2} & b_{2, 3} & \dots & b_{2, K} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ b_{N, 1} & b_{N, 2} & b_{N, 3} & \dots & b_{N, K} \end{pmatrix}$

and

$P (r) = [P_{1} \; P_{2} \; P_{3} \; P_{4} \; \dots \; P_{K}],$

where b_i,r is the binary bit value in SIFt i (i = 1, ..., N) at position r (r = 1, ..., K). The value of the p-SIFt at position r is given by
$P (r) = \sum_{i = 1}^{N} b_{i, r} / N .$
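
As a minimal sketch of the column averaging described above, a p-SIFt can be computed from a list of equal-length bit strings. The function name `p_sift` and the four example fingerprints are hypothetical and are used only for illustration.

```python
def p_sift(sifts):
    """Column-wise mean of N equal-length bit strings: P(r) = sum_i b[i][r] / N."""
    if not sifts:
        raise ValueError("need at least one SIFt")
    k = len(sifts[0])
    if any(len(s) != k for s in sifts):
        raise ValueError("all SIFts must have the same length K")
    n = len(sifts)
    # One interaction frequency per bit position r, between 0 and 1.
    return [sum(int(s[r]) for s in sifts) / n for r in range(k)]


# Hypothetical group of N = 4 SIFts with K = 4 bits each.
group = ["1101", "1001", "1100", "1000"]
profile = p_sift(group)
print(profile)  # [1.0, 0.5, 0.0, 0.5]
```

In this example the first bit is fully conserved (1.0), the third is never observed (0.0), and the remaining bits are ON in half of the complexes.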

(iii) Measurement of Similarity between SIFts and/or p-SIFts

A Tanimoto coefficient can measure the similarity between two SIFts, between two p-SIFts, between a SIFt and a p-SIFt, or between two r-SIFts (see, for example, Willett, P. J. Chem. Inf. Comput. Sci. 1998, 38, 983-996, which is incorporated by reference in its entirety). A set of SIFt patterns can be clustered using the Tanimoto similarity measure by applying standard hierarchical clustering algorithms. See, for example, Deng, Z., et al., J. Med. Chem. 2004, 47, 337-344; Dubes, R., and Jain, A. K. Adv. Comput. 1980, 19, 113-228; and Raymond, J. W. et al., J. Mol. Graph. Model. 2003, 21, 421-433, each of which is incorporated by reference in its entirety.
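
The clustering of SIFt patterns by Tanimoto similarity can be illustrated with the following sketch. It assumes average linkage and a hypothetical stopping threshold `min_similarity`; the cited references describe other linkage rules that could equally be used, and the names `cluster_sifts` and `tanimoto` are chosen only for this example.

```python
from itertools import combinations


def tanimoto(a, b):
    """Tanimoto coefficient between two equal-length '0'/'1' strings."""
    common = sum(x == "1" and y == "1" for x, y in zip(a, b))
    either = sum(x == "1" or y == "1" for x, y in zip(a, b))
    return common / either if either else 1.0


def cluster_sifts(sifts, min_similarity=0.6):
    """Average-linkage agglomerative clustering; returns lists of SIFt indices."""
    clusters = [[i] for i in range(len(sifts))]

    def linkage(c1, c2):
        # Mean pairwise Tanimoto similarity between members of two clusters.
        total = sum(tanimoto(sifts[i], sifts[j]) for i in c1 for j in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > 1:
        # Find the most similar pair of clusters.
        (i, j), best = max(
            (((i, j), linkage(clusters[i], clusters[j]))
             for i, j in combinations(range(len(clusters)), 2)),
            key=lambda item: item[1],
        )
        if best < min_similarity:
            break  # remaining clusters are too dissimilar to merge
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return [sorted(c) for c in clusters]


# Hypothetical SIFts: two poses contact the first half of the site, two the second.
sifts = ["111000", "110000", "000111", "000011"]
result = cluster_sifts(sifts)
print(sorted(result))  # [[0, 1], [2, 3]]
```

Recording the similarity at which each merge occurs, rather than stopping at a threshold, would yield the data needed to draw a dendrogram.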

The statistical Z score can measure how significant the similarity between a SIFt and a target p-SIFt (i.e., a group of structures) is with respect to a certain background. The Z score is an indication of how many standard deviations an observation differs from the mean. The Z score can be defined as:
$Z_{target} = \frac{\chi_{target} - \langle \chi_{b} \rangle}{\sigma_{b}},$

where target refers to a target molecule, χ_target is the Tanimoto coefficient of the SIFt against the target p-SIFt, and <χ_b> and σ_b are the mean and standard deviation, respectively, of the Tanimoto coefficients of all the SIFts in the background set against the same target p-SIFt. A background set can include dummy SIFts having the same length as the target SIFt or p-SIFt. Each position in the dummy SIFt bit string is randomly 1 or 0, where the probability of being 1 is equal to the value in the target SIFt or p-SIFt. Alternatively, the background set can be a set of SIFts derived from structures.
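
The Z score with a randomly generated background set can be sketched as follows. Because a p-SIFt is real-valued, the sketch assumes the continuous form of the Tanimoto coefficient, which reduces to the bit-count definition for 0/1 vectors; that choice, the background size of 1000, the fixed random seed, and all function names are assumptions made for illustration only.

```python
import random
from statistics import mean, pstdev


def tanimoto_continuous(a, b):
    """Tanimoto coefficient for real-valued vectors (assumed generalization)."""
    dot = sum(x * y for x, y in zip(a, b))
    denom = sum(x * x for x in a) + sum(y * y for y in b) - dot
    return dot / denom if denom else 1.0


def dummy_sift(profile, rng):
    """Random bit vector whose bit r is 1 with probability profile[r]."""
    return [1 if rng.random() < p else 0 for p in profile]


def z_score(sift, profile, n_background=1000, seed=0):
    """Z score of a SIFt against a target p-SIFt, using dummy background SIFts."""
    rng = random.Random(seed)
    background = [
        tanimoto_continuous(dummy_sift(profile, rng), profile)
        for _ in range(n_background)
    ]
    sigma = pstdev(background)
    if sigma == 0:
        raise ValueError("background scores have zero variance")
    return (tanimoto_continuous(sift, profile) - mean(background)) / sigma


# Hypothetical target p-SIFt and two query SIFts.
target_profile = [1.0, 0.5, 0.0, 0.5, 0.9, 0.1]
matching = [1, 1, 0, 1, 1, 0]
mismatching = [0, 0, 1, 0, 0, 1]
print(z_score(matching, target_profile))
print(z_score(mismatching, target_profile))
```

A query that turns ON the bits favored by the profile receives a higher Z score than one that turns ON the disfavored bits; the absolute values depend on the background set chosen.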

A convenient way to compare p-SIFts is to calculate a difference profile by the subtraction of one p-SIFt from another. Another way to compare two SIFt, p-SIFt, or r-SIFt patterns a and b is the cosine coefficient, given by:
$\cos = \frac{\sum_{i = 1}^{N} a_{i} \cdot b_{i}}{\sqrt{\sum_{i = 1}^{N} a_{i}^{2} \cdot \sum_{i = 1}^{N} b_{i}^{2}}}$

where N denotes the number of bits in the SIFt patterns. The cosine coefficient can be applied to measure the similarity between a difference profile, d, and a SIFt pattern, c, where
$\cos = \frac{\sum_{i = 1}^{N} d_{i} \cdot c_{i}}{\sqrt{\sum_{i = 1}^{N} d_{i}^{2} \cdot \sum_{i = 1}^{N} c_{i}^{2}}} .$

In this case, the cosine coefficient measures the similarity between the difference profile, d, and a SIFt pattern, c, and varies in value from 1 to −1. If the difference profile is given by d=a-b, then a positive value of the cosine coefficient indicates that c is more similar to a than to b, whereas a negative value indicates that c is more similar to b than to a. The cosine coefficient score is most sensitive to the bits that differentiate a and b. Consequently, the cosine coefficient may be useful in predicting selectivity between inhibitors.
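
The use of a difference profile with the cosine coefficient can be illustrated by the following sketch. The two p-SIFts `p_a` and `p_b`, the query patterns `c1` and `c2`, and the convention of returning 0.0 for a zero-length vector are hypothetical choices made for this example.

```python
from math import sqrt


def cosine(a, b):
    """Cosine coefficient between two equal-length numeric vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = sqrt(sum(x * x for x in a) * sum(y * y for y in b))
    # Assumed convention: similarity to an all-zero vector is 0.0.
    return dot / norm if norm else 0.0


# Hypothetical p-SIFts for two groups of complexes.
p_a = [1.0, 0.8, 0.1, 0.0]
p_b = [1.0, 0.1, 0.9, 0.0]

# Difference profile d = a - b: shared bits cancel, distinguishing bits remain.
d = [x - y for x, y in zip(p_a, p_b)]

c1 = [1, 1, 0, 0]  # has the interaction typical of group a
c2 = [1, 0, 1, 0]  # has the interaction typical of group b
print(cosine(d, c1))  # positive: c1 resembles a more than b
print(cosine(d, c2))  # negative: c2 resembles b more than a
```

The first bit is ON in both query patterns but contributes nothing to either score because it is equally conserved in both groups, which is why the score is most sensitive to the differentiating bits.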

B. A High-Level View of the SIFt-Based Method

FIG. 1 shows a high-level view of an exemplary method for generating a SIFt. The method utilizes entries contained in structural databases containing data from various sources, e.g., X-ray crystallography, NMR, protein modeling, and/or protein/ligand interaction simulations. Three-dimensional data/structures of one or more complexes are retrieved from a database. Using any of a variety of computational methods well known to those in the art, a set of positions (e.g., amino acid residues or bases) that interact with a putative ligand or binding molecule is selected.

Once a three dimensional structure has been derived and selected positions (e.g., binding site residues) identified, a plurality of intermolecular interaction types occurring at each selected position is determined and measured, using any computational methods well known in the art. These interaction types can also include chemical and physical properties of the part of a ligand interacting with each selected position, and sequence conservation, structural conservation and flexibility properties of each selected position.

A SIFt for each target molecule-ligand complex structure is generated. The SIFt includes a numeric (e.g., binary) code representation of each interaction type determined/measured for each of the selected positions of the target molecule.

The SIFt containing information regarding characteristics of the interaction types at each selected position is stored within a database for subsequent retrieval and analysis. Alternatively, the SIFt can be used to query a database, generate a p-SIFt comprising possible alternative ligands that fit the SIFt, and/or define a structure based upon the type of SIFt obtained.

In one embodiment, a primary amino acid sequence of a polypeptide target molecule that is encoded by a selected genetic sequence is determined, and a three-dimensional structure is generated by homology modeling techniques. This aspect is generally represented in FIG. 1. As mentioned above, a three-dimensional model of a particular target molecule may be predicted computationally or determined in whole or in part based on experimental information. For example, X-ray crystallographic information may be used to identify a protein structure and provide information for constructing a three dimensional model of the protein target molecule.

In one embodiment, a ligand's three-dimensional structure is also obtained by similar techniques (e.g., modeling techniques and/or experimental crystallization techniques). For example, many protein molecules are co-crystallized with substrates and/or ligands. The three-dimensional ligand binding structure can then be modeled using programs that demonstrate interactions with a putative protein target molecule or binding domain thereof. Thus, one of skill in the art utilizing the 3D-protein structure and/or the 3D-ligand structure can obtain interaction data for the molecules being characterized. The ligand molecule may be any of a number of different types of compositions such as organic molecules, inorganic molecules, ions, proteins, protein fragments, nucleotides, RNA, DNA or other molecules representative of substrates, ligands, co-factors, and the like. In one embodiment, the ligand is obtained from a library of molecules.

Upon formation of the 3D complex structure, the interaction of the target molecule with a ligand is computed. Positions (e.g., amino acid residues) that play a role in the interaction with the ligand are selected. This is generally represented in FIG. 1. Particular atoms in the ligand can be identified as interacting with particular amino acid residues or bases of the target molecule. The criteria for determining an interaction (e.g., distance (e.g., in angstroms) between various atoms) can be adjusted using techniques in the modeling programs as mentioned above or by techniques known to those skilled in the art.
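
One simple distance criterion for selecting positions can be sketched as follows. The 4.5 Å cutoff, the residue identifiers, the coordinates, and the function name `contact_residues` are all hypothetical; in practice the criterion would be adjusted as described above.

```python
from math import dist


def contact_residues(protein_atoms, ligand_atoms, cutoff=4.5):
    """Return sorted residue ids having any atom within `cutoff` angstroms of any ligand atom.

    protein_atoms: iterable of (residue_id, (x, y, z)) pairs, one per atom.
    ligand_atoms: iterable of (x, y, z) coordinates.
    """
    selected = set()
    for residue_id, coord in protein_atoms:
        if any(dist(coord, lig) <= cutoff for lig in ligand_atoms):
            selected.add(residue_id)
    return sorted(selected)


# Hypothetical coordinates in angstroms.
protein_atoms = [
    ("LYS33", (0.0, 0.0, 0.0)),
    ("LYS33", (1.5, 0.0, 0.0)),
    ("GLU81", (10.0, 0.0, 0.0)),
    ("LEU83", (3.0, 4.0, 0.0)),
]
ligand_atoms = [(3.0, 0.0, 0.0), (4.0, 1.0, 0.0)]
print(contact_residues(protein_atoms, ligand_atoms))  # ['LEU83', 'LYS33']
```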

The target molecule-ligand interactions that are modeled result in the identification of certain selected positions (e.g., amino acid residues or bases) as well as the nature of interaction types between the ligand and the target molecule. The interaction types between a ligand and a particular selected position will depend upon the chemical-physical characteristics of the selected position in the target molecule as well as the nature of atoms or groups of atoms present in the ligand. For example, one of skill in the art will recognize that various equilibrium binding constants or binding energy values will be determinative in the type of interactions that will occur. This process is represented in FIG. 1.

The selected positions that play a role in interacting with the ligand as well as the interaction types that occur with each selected position are then used to generate a SIFt (see, e.g., FIG. 1). This SIFt can be represented by a series of numerical values (e.g., binary numbers) corresponding to each selected position and each interaction type. The selected position and interaction type form a SIFt that can be used to compare or distinguish the target molecule (or a family of target molecules) from other target molecules. Using the SIFt as a tool for comparison, target molecules (e.g., proteins or polypeptides) may be structurally or functionally associated when they share commonalities in the SIFts. This latter process is represented in FIG. 1. For example, by aligning the SIFts of two protein target molecules, a functional relationship can be determined based upon the degree of alignment (e.g., homology) between the two information strings or SIFts. Various statistical measurements and limits can be placed upon the alignment to discriminate between random and related alignments. Similarly, a p-SIFt can be generated for a group of target molecules to reveal the interactions that are conserved or variable across the group. Groups of target molecules can be compared by comparing their p-SIFts. In this way, groups of target molecules can be characterized by shared interactions, or distinguished by differing interactions. Accordingly, a powerful tool is provided to associate target molecules in a manner that does not rely on sequence or homology matching/comparisons alone, and to allow for the association of otherwise dissimilar target molecules that can be functionally related by their SIFts.

In certain embodiments, the SIFt fingerprint records the presence or absence of an interaction with a protein. The information unit containing this information can be simple to indicate whether a residue is involved in a particular interaction or not. In other embodiments, the SIFt can also include other chemical information about the ligand. In one example, a SIFt can include an information unit that contains information about a combinatorial library, which can include a core and variable group (in some examples, two, three or more R groups). Specifically, a small molecule library can be converted into a core and variable groups, a SIFt pattern can be created for each library member, information units can be turned on or off at each of the selected positions based on the nature of the contact between the core and variable groups with the protein target. In another example, a SIFt can include an information unit that contains chemical feature information. For example, a series of chemical features can be mapped onto the ligand molecule. Each residue can be represented by an information block of a series of information units, each of which can be turned on or off depending on whether this residue is interacting with a particular chemical feature on the ligand. Examples of suitable chemical features include hydrophobic, hydrogen bond donor, hydrogen bond acceptor, negatively charged, positively charged, etc. In another example, a computed or experimentally determined property can be included in a SIFt.

Information blocks that include these properties can be used to identify chemical groups that are associated with specific residues of the protein.

A Tanimoto coefficient can be used as the similarity measurement between two r-SIFts. When a group of docking poses is generated for a target molecule-ligand complex, the best docking poses (i.e., with top FlexX scores) for each compound can be examined, and a best pose selected for each. The selected pose can make conserved interactions with the target. An agglomerative hierarchical clustering can be applied to analyze and reorganize a group of poses, for example using Tanimoto coefficients as the similarity measurement. A dendrogram prepared from the clustering results can reveal clusters of protein-ligand complex structures. Poses that cluster together can have similar binding interactions.

Combining SIFt-based approaches and conventional scoring functions can yield better results in reproducing the true binding modes of the compounds and better library enrichment performance. When docking known ligands, the best pose given by a conventional scoring function may not adopt the native binding mode; however, a good placement with the correct binding mode usually can be found among the top 10 poses.
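
One possible way of combining a conventional score with a SIFt comparison is sketched below: the top-ranked poses by docking score are retained, and the pose whose SIFt is most similar to a reference SIFt is chosen from among them. The two-stage scheme, the assumption that a lower docking score is better, and the names `pick_pose` and `tanimoto` are illustrative assumptions rather than a prescribed procedure.

```python
def tanimoto(a, b):
    """Tanimoto coefficient between two equal-length '0'/'1' strings."""
    common = sum(x == "1" and y == "1" for x, y in zip(a, b))
    either = sum(x == "1" or y == "1" for x, y in zip(a, b))
    return common / either if either else 1.0


def pick_pose(poses, reference_sift, top_n=10):
    """Select a pose from the top_n by docking score using SIFt similarity.

    poses: list of (pose_id, docking_score, sift) tuples.
    Assumes a lower (more negative) docking score is better.
    """
    top = sorted(poses, key=lambda p: p[1])[:top_n]
    return max(top, key=lambda p: tanimoto(p[2], reference_sift))


# Hypothetical poses for one compound and a reference SIFt for the known binding mode.
poses = [
    ("pose1", -30.0, "0011"),
    ("pose2", -28.5, "1100"),
    ("pose3", -12.0, "1110"),
]
best = pick_pose(poses, reference_sift="1100", top_n=2)
print(best[0])  # pose2: not the best score, but it reproduces the reference pattern
```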

C. Embodiments and Applications

As discussed above, one embodiment involves the use of a seven-bit information block (e.g., contact, main-chain atom group, side-chain atom group, polar, non-polar, hydrogen bond donor, hydrogen bond acceptor) to represent the interaction pattern of each selected position of the target molecules (e.g., binding site residue of a protein target molecule). In such an embodiment, the interaction pattern represents the binding modes formed from seven different interaction types. Although such an implementation is able to successfully organize, analyze and mine a large structural library in a meaningful way, a 7-bit-long binary string does not represent all the intermolecular interactions occurring at a particular selected position. The richness of information can be improved by incorporating more bits representing other interaction types. For example, one can focus on functional groups instead of the entire residue as the basic unit, or take solvent molecules into consideration, or substitute the Boolean bits with ordinal or continuous data that reflect the strength and energetics of the interaction types. Such an enriched SIFt provides a “higher-resolution” picture of the target molecule-ligand binary complex. In situations where computational speed is a critical issue, “lower-resolution” SIFts using fewer information units may be used. Accordingly, the information units for a particular selected position (i.e., the size of the information block) may range from 1-50 units or more. Simpler SIFts can be constructed in less time at the expense of richness of information. One skilled in the art can design, select, and identify the number of information units (and thus the size of the information block) for a particular selected position based upon the details and speed desired. For example, shorter information strings (containing, e.g., 2-3 information units per information block) may be useful during the initial screening of a huge virtual library.
On the other hand, longer information strings (and hence longer SIFts) provide more information at the expense of quick performance and are more useful for detailed structural analysis such as comparing groups of closely related structures. Choosing the right size of SIFt is a matter of finding a proper balance between these two competing considerations, with that balance dictated by the needs of a given situation. Another variable is the relative weight given to each interaction type. In one embodiment, information units reflecting each interaction type can contribute equally to the total similarity score. It is also possible to tailor them in a different way by focusing on one or more particular interaction types, while down-playing other kinds of interactions.
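
The seven-bit information block embodiment can be illustrated with the following sketch, which concatenates one block per selected position into a single bit string. The bit order, the label names in `INTERACTION_TYPES`, the residue identifiers, and the function name `encode_sift` are assumptions for this example; a shorter or longer block is obtained simply by changing the tuple of interaction types.

```python
# Assumed bit order within each information block.
INTERACTION_TYPES = (
    "contact", "main_chain", "side_chain", "polar",
    "non_polar", "hbond_donor", "hbond_acceptor",
)


def encode_sift(positions, interactions):
    """Build a SIFt bit string from per-residue interaction sets.

    positions: ordered list of selected residue ids.
    interactions: dict mapping residue id -> set of interaction type names.
    """
    blocks = []
    for residue in positions:
        observed = interactions.get(residue, set())
        unknown = observed - set(INTERACTION_TYPES)
        if unknown:
            raise ValueError(f"unknown interaction types: {sorted(unknown)}")
        # One bit per interaction type; residues with no entry give an all-zero block.
        blocks.append("".join("1" if t in observed else "0" for t in INTERACTION_TYPES))
    return "".join(blocks)


# Hypothetical binding site with three selected positions.
positions = ["LYS33", "GLU81", "LEU83"]
interactions = {
    "LYS33": {"contact", "side_chain", "polar", "hbond_donor"},
    "LEU83": {"contact", "main_chain", "non_polar"},
}
sift = encode_sift(positions, interactions)
print(sift)  # 101101000000001100100
```

Because every complex is encoded over the same ordered list of positions, the resulting strings have equal length and can be compared bit by bit.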

Another embodiment uses an information block to represent positions on a ligand (e.g., the bits represent a core and R groups). The number of bits in the information block can be selected with regard to the structure of a compound, e.g., the number of R-groups present. Each bit can have a 0 or 1 value, for example, to represent the presence of a contact between an atom at that position on the ligand and an atom of the target molecule. r-SIFt is a variation of structural interaction fingerprint (SIFt). r-SIFt incorporates the binding information about different variable R-groups of a compound into the fingerprint. It was specifically designed for processing and analyzing virtual screening results of combinatorial libraries. In SIFt, the interaction bits represent the presence or absence of different types of interactions (contact, polar interaction, hydrogen bonds, hydrophobic interaction, etc.) occurring at each selected residue, whereas in r-SIFt, the interaction bits represent whether or not a certain R-group or core fragment of the compound makes contact interaction (i.e., within a distance threshold) with a particular protein residue.
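
An r-SIFt of the kind described above can be sketched by assigning one bit to each pairing of a selected residue with a ligand fragment (core or R-group). The 4.5 Å distance threshold, the fragment names, the coordinates, and the function name `encode_r_sift` are hypothetical.

```python
from math import dist


def encode_r_sift(residues, fragments, cutoff=4.5):
    """Build an r-SIFt bit string with one bit per (residue, fragment) pair.

    residues: ordered list of (residue_id, [atom coordinates]).
    fragments: ordered list of (fragment_name, [atom coordinates]), e.g. core, R1, R2.
    A bit is ON when any residue atom lies within `cutoff` angstroms of any fragment atom.
    """
    bits = []
    for _, res_atoms in residues:
        for _, frag_atoms in fragments:
            touching = any(
                dist(r, f) <= cutoff for r in res_atoms for f in frag_atoms
            )
            bits.append("1" if touching else "0")
    return "".join(bits)


# Hypothetical two-residue site and a compound split into a core and two R-groups.
residues = [
    ("LYS33", [(0.0, 0.0, 0.0)]),
    ("LEU83", [(10.0, 0.0, 0.0)]),
]
fragments = [
    ("core", [(3.0, 0.0, 0.0)]),
    ("R1", [(8.0, 0.0, 0.0)]),
    ("R2", [(20.0, 0.0, 0.0)]),
]
print(encode_r_sift(residues, fragments))  # 100010
```

Here the core contacts only the first residue, R1 contacts only the second, and R2 makes no contact, so each three-bit block records which part of the compound reaches that residue.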

One advantageous feature of the SIFt-based method is that it is generic. The SIFt method works well for the protein target molecule and small molecule ligand system, and can also work for other systems including protein-protein, nucleic acid-ligand, nucleic acid-protein/polypeptide systems, and the like. Indeed, the methods and systems can be applied to amino acid sequences, as well as nucleotide sequences. For example, the methods can be applied to a nucleotide sequence or an amino acid sequence which corresponds to the nucleotide sequence in question. If the coding sequence is not known, translation from the nucleotide sequence to the amino acid sequence may be performed in all frames of the nucleotide sequence. Programs that can translate a nucleotide sequence are known in the art.

In one embodiment, the method can start by identifying a primary amino acid sequence of a protein. A number of source databases are available, as described below, that contain nucleotide sequences and/or deduced amino acid sequences for use with this step.

The primary direct experimental methods for determining the structure of proteins involved in particular interactions are X-ray crystallography, relying on the interaction of electron clouds with X-rays; and liquid nuclear magnetic resonance (NMR), relying on correlations between polarized nuclear spins interacting via indirect dipole-dipole interactions. X-ray methods provide information on the location of every heavy atom in a crystal of interest, accurate to 0.5-2.0 Å (1 Å=10⁻¹⁰m).

A number of databases are available that contain 3D protein structures and/or structures showing 3D protein-ligand interactions. For example, protein-protein interaction databases include the Biomolecular Interaction Network Database (BIND), which is a database designed to store full descriptions of interactions, molecular complexes and pathways; Database of Interacting Proteins (DIP), which catalogs experimentally determined interactions between proteins; an Object Oriented Database for Protein-Protein Interactions (INTERACT); and Pronet Online, which provides protein-protein interaction data and is maintained by Myriad Genetics. Other structural databases include Cambridge Crystallographic Data Centre; CATH-Protein Structure Classification; SCOP (Structural Classification of Proteins), based upon 3D fold classifications; PARTS LIST, which dynamically performs comparative fold surveys and is built on top of SCOP's fold classification and acts as an accompanying annotation; PDB (Protein Data Bank), which is an international repository for the processing and distribution of 3D macromolecular structure data primarily determined experimentally by X-ray crystallography and NMR; PRESAGE, a database for structural genomics; Structural Biology Software Database, a software database maintained by University of Illinois; BiMSSECOST, a conformational database for amino acid residues in proteins; BioMagResBank, a repository for data on proteins, peptides, and nucleic acids from NMR spectroscopy; SWISS-3DIMAGE 3D, which contains images of proteins and other biological macromolecules; SWISS-MODEL, a repository of structures generated by protein modeling; and the Cambridge Structural Database (CSD) of the Cambridge Crystallographic Data Center (CCDC). Other sources of primary amino acid sequence, modeled 3D structures and other crystallographical data will be apparent to those of skill in the art.

The various techniques, methods, and aspects described above can be implemented in part or in whole using computer-based systems and methods. Additionally, computer-based systems and methods can be used to augment or enhance the functionality described above, increase the speed at which the functions can be performed, and provide additional features and aspects as a part of or in addition to those described elsewhere in this document. Various computer-based systems, methods and implementations in accordance with the above-described technology are presented below.

In one implementation, a general-purpose computer may have an internal or external memory for storing data and programs such as an operating system (e.g., DOS, Windows 2000™, Windows XP™, Windows NT™, OS/2, UNIX or Linux) and one or more application programs. Examples of application programs include computer programs implementing the techniques described herein, authoring applications (e.g., word processing programs, database programs, spreadsheet programs, or graphics programs) capable of generating documents or other electronic content; client applications (e.g., an Internet Service Provider (ISP) client, an e-mail client, or an instant messaging (IM) client) capable of communicating with other computer users, accessing various computer resources, and viewing, creating, or otherwise manipulating electronic content; and browser applications (e.g., Microsoft's Internet Explorer) capable of rendering standard Internet content and other content formatted according to standard protocols such as the Hypertext Transfer Protocol (HTTP).

One or more of the application programs may be installed on the internal or external storage of the general-purpose computer. Alternatively, in another implementation, application programs may be externally stored in and/or performed by one or more device(s) external to the general-purpose computer.

The general-purpose computer includes a central processing unit (CPU) for executing instructions in response to commands, and a communication device for sending and receiving data. One example of the communication device is a modem. Other examples include a transceiver, a communication card, a satellite dish, an antenna, a network adapter, or some other mechanism capable of transmitting and receiving data over a communications link through a wired or wireless data pathway.

The general-purpose computer may include an input/output interface that enables wired or wireless connection to various peripheral devices. Examples of peripheral devices include, but are not limited to, a mouse, a mobile phone, a personal digital assistant (PDA), a keyboard, a display monitor with or without a touch screen input, and an audiovisual input device. In another implementation, the peripheral devices may themselves include the functionality of the general-purpose computer. For example, the mobile phone or the PDA may include computing and networking capabilities and function as a general-purpose computer by accessing the delivery network and communicating with other computer systems. Examples of a delivery network include the Internet, the World Wide Web, WANs, LANs, analog or digital wired and wireless telephone networks (e.g., Public Switched Telephone Network (PSTN), Integrated Services Digital Network (ISDN), and Digital Subscriber Line (xDSL)), radio, television, cable, or satellite systems, and other delivery mechanisms for carrying data. A communications link may include communication pathways that enable communications through one or more delivery networks.

In one implementation, a processor-based system (e.g., a general-purpose computer) can include a main memory, preferably random access memory (RAM), and can also include a secondary memory. The secondary memory can include, for example, a hard disk drive and/or a removable storage drive, representing a floppy disk drive, a magnetic tape drive, an optical disk drive, etc. The removable storage drive reads from and/or writes to a removable storage medium. A removable storage medium can include a floppy disk, magnetic tape, optical disk, etc., which can be removed from the storage drive used to perform read and write operations. As will be appreciated, the removable storage medium can include computer software and/or data.

In alternative embodiments, the secondary memory may include other similar means for allowing computer programs or other instructions to be loaded into a computer system. Such means can include, for example, a removable storage unit and an interface. Examples of such can include a program cartridge and cartridge interface (such as that found in video game devices), a removable memory chip (such as an EPROM or PROM) and associated socket, and other removable storage units and interfaces, which allow software and data to be transferred from the removable storage unit to the computer system.

In one embodiment, the computer system can also include a communications interface that allows software and data to be transferred between the computer system and external devices. Examples of communications interfaces can include a modem, a network interface (such as, for example, an Ethernet card), a communications port, and a PCMCIA slot and card. Software and data transferred via a communications interface are in the form of signals, which can be electronic, electromagnetic, optical or other signals capable of being received by a communications interface. These signals are provided to the communications interface via a channel capable of carrying signals, which can be implemented using a wireless medium, wire or cable, fiber optics or other communications medium. Some examples of a channel can include a phone line, a cellular phone link, an RF link, a network interface, and other suitable communications channels.

In this document, the terms “computer program medium” and “computer usable medium” are generally used to refer to media such as a removable storage device, a disk capable of installation in a disk drive, and signals on a channel. These computer program products provide software or program instructions to a computer system.

Computer programs (also called computer control logic) are stored in the main memory and/or secondary memory. Computer programs can also be received via a communications interface. Such computer programs, when executed, enable the computer system to perform the features as discussed herein. In particular, the computer programs, when executed, enable the processor to perform the described techniques. Accordingly, such computer programs represent controllers of the computer system.

In an embodiment where the elements are implemented using software, the software may be stored in, or transmitted via, a computer program product and loaded into a computer system using, for example, a removable storage drive, hard drive or communications interface. The control logic (software), when executed by the processor, causes the processor to perform the functions of the techniques described herein.

In another embodiment, the elements are implemented primarily in hardware using, for example, hardware components such as PAL (Programmable Array Logic) devices, application specific integrated circuits (ASICs), or other suitable hardware components. Implementation of a hardware state machine so as to perform the functions described herein will be apparent to a person skilled in the relevant art(s). In yet another embodiment, elements are implemented using a combination of both hardware and software.

In another embodiment, the computer-based methods can be accessed or implemented over the World Wide Web by providing access via a Web Page to the methods described herein. Accordingly, the Web Page is identified by a Uniform Resource Locator (URL). The URL denotes both the server and the particular file or page on the server. In this embodiment, it is envisioned that a client computer system interacts with a browser to select a particular URL, which in turn causes the browser to send a request for that URL or page to the server identified in the URL. Typically the server responds to the request by retrieving the requested page and transmitting the data for that page back to the requesting client computer system (the client/server interaction is typically performed in accordance with the Hypertext Transfer Protocol (HTTP)). The selected page is then displayed to the user on the client's display screen. The client may then cause the server containing a computer program to launch an application to, for example, perform an analysis according to the described techniques. In another implementation, the server may download an application to be run on the client to perform an analysis according to the described techniques.

The described techniques open up the possibility of using an informatics approach in three-dimensional structure analysis and structure-based drug discovery. One application is in virtual chemical library screening. As discussed herein, SIFt can serve as a post-docking molecular organizer and filter. Docking poses can be organized based on their overall interaction patterns or binding modes. Furthermore, any previously acquired knowledge can be applied as structural constraints to filter out unwanted poses, giving a smaller and better pool of lead compounds. Compared to pharmacophore-based filters, the SIFt-based method is more generic, more flexible and easier to apply. In combination with other pre-existing approaches such as empirical docking scores, the SIFt-based method can weed out more false-positive compounds with undesirable properties, and thus significantly improve the hit rate.
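
The use of prior knowledge as a structural constraint can be sketched as a filter on selected bits of each pose's SIFt. The bit indices, the pose names, and the function name `passes_constraints` are hypothetical; which bits to require ON or OFF would depend on what is known about the target.

```python
def passes_constraints(sift, required_on=(), required_off=()):
    """Return True if the SIFt has every required_on bit set and every required_off bit clear."""
    return (
        all(sift[i] == "1" for i in required_on)
        and all(sift[i] == "0" for i in required_off)
    )


# Hypothetical docking poses encoded as two seven-bit information blocks each.
poses = {
    "pose1": "10110100000000",
    "pose2": "00000001100100",
    "pose3": "10100001100100",
}

# Hypothetical prior knowledge: the contact at bit 0 must be present,
# and the interaction at bit 8 must be absent.
kept = [
    name for name, sift in poses.items()
    if passes_constraints(sift, required_on=[0], required_off=[8])
]
print(kept)  # ['pose1']
```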

In addition, the SIFt-based approach can be applied in designing, refining and pruning target-focused chemical libraries. As shown in examples 4 and 11, different embodiments of SIFt (e.g., r-SIFt) can be very effective tools for discriminating compounds with different binding modes. With r-SIFt, one can easily distinguish compounds that bind to the target molecule with desirable binding mode(s) and others that do not. Based on this compound classification result, we can then generate prediction models (e.g., decision tree, neural network, support-vector machine) to predict the binding modes of compounds using their chemical properties as predictors. Such prediction models can be applied in the early stage of virtual library screening to filter out undesirable compounds in order to generate a smaller, target-specific pool of compounds.

Besides processing the virtual structures generated during chemical library screening, the SIFt-based method can be used to analyze experimentally determined structures. Furthermore, the methods are not limited to structures involving one particular target molecule; the method is generic enough to work for structures of a family of target molecules (e.g., the kinase family). The prerequisite is that these target molecules are structurally related, so that a common framework of the ligand-binding site can be constructed. By using this method, distinct sub-groups of target molecule-ligand (e.g., enzyme-inhibitor) complex structures, each of which represents a distinct overall interaction pattern, can be identified. The identified sub-groups of these target molecule-ligand complexes can also be classified according to other grouping criteria, such as grouping by different target molecule, by different types of ligands, or by different conformations.

Quantitative comparisons of these clusters would reveal interaction patterns specific for a particular group and thus could provide structural insight into the mechanism of binding activity and selectivity. In addition, the p-SIFt can capture the common features among a group of ligand-target molecule structures. It can be used to compare different groups of structures, and to correlate the differences or commonality in their SIFt profiles to their activities.

In sum, the methods of characterization and generation of information strings representing SIFts provided by the described techniques are an improvement over conventional characterization methodologies that typically rely on sequence-based comparisons. The SIFt facilitates and integrates several desirable functionalities including structural data visualization, organization, analysis, and mining together, making it a powerful tool for analyzing and profiling three-dimensional binding interactions. As mentioned above, a particularly useful feature of this method is that it compares and reveals associations (e.g., binding similarities) between dissimilar target molecules (e.g., proteins that may have functional or behavioral analogies that are not otherwise apparent due to differences in the protein sequence).

The described techniques (including SIFt-based methods, computer implementations, systems, and databases) disclosed herein translate three-dimensional intermolecular interactions into simple, linear information strings, thereby making it possible to efficiently analyze large libraries of structures using the mathematics and informatics methods described herein. Although conceptually simple, the described techniques provide a novel method of visualizing, organizing, analyzing, and mining 3D structural information. The SIFt method organizes target molecule-ligand complex structures into groups based on their interaction patterns. Intermolecular interactions between target molecules and ligands are visualized and can be easily comprehended using the heat-map of the SIFts for data visualization. Specifically, each line represents one fingerprint (or SIFt), and each bit in the SIFt is colored or shaded according to its value. Using the described techniques, conserved/unconserved interactions within or among different sub-groups of structures (data analysis) can be compared and quantified. In addition, by representing the target molecule-ligand complex structures using SIFts, a query can be performed on the interactions to select complexes (or ligands) that satisfy predefined criteria (e.g., a certain interaction pattern or binding mode, or even a particular interaction type occurring at a selected position), in a way similar to querying a database (data mining).

EXAMPLES

The following examples are provided to illustrate the practice of the described techniques, and in no way limit the scope of the claims.

Color versions of FIGS. 2A-5B can be found in Deng, Z.; Chuaqui, C.; Singh, J. “Structural Interaction Fingerprint (SIFt): A novel method for analyzing three-dimensional protein-ligand binding interaction,” J. Med. Chem, 47: 337-344 (2004). Color versions of FIGS. 10-16(c) can be found in Chuaqui, C.; Deng, Z.; Singh, J. “p-SIFt: Interaction profiles of protein kinase-inhibitor complexes and their application to virtual screening”. J. Med. Chem., 2005, 48, 121-133.

The protein kinase family exemplifies the challenges presented by the large amount of structural data being generated not only on specific drug targets, but also at the gene family level. See, for example, Cohen, P. Nat. Rev. Drug Discov. 2002, 1, 309-315; ter Haar, E.; et al. Mini. Rev. Med. Chem. 2004, 4, 235-253; Manning, G.; et al. Science 2002, 298, 1912-1934; and Vieth, M.; et al. Biochim. Biophys. Acta 2004, 1697, 243-257, each of which is incorporated by reference in its entirety. For example, there exist over 100 structures of protein kinase small molecule complexes that have been deposited in the public domain including 34 different kinase family members. In the Examples below, p-SIFt is applied to analyzing the similarities and differences between ATP, p38 and CDK2 inhibitors binding to the protein kinase family. p-SIFt was able to not only enrich p38 and CDK2 inhibitors, but also importantly show how it can be selective in its enrichment.

Since the majority of kinase inhibitors bind to a conserved ATP site on the enzyme, the ability to understand the selectivity profile for an inhibitor is critical. In silico approaches to predict which kinase inhibitors may cross-react would help avoid downstream toxicity issues as well as enable “target-hopping”, where an inhibitor to a given kinase is used to discover a lead inhibitor for a new target (see, for example, Singh, J.; et al. Bioorg. Med. Chem. Lett. 2003, 13, 4355-4359, which is incorporated by reference in its entirety). Knowledge-based filters applied to virtual libraries during virtual screening (VS) preferably enrich libraries with ligands that are likely kinase inhibitors, and are target specific, biasing hits away from undesired “anti-targets” while selecting for ligands that satisfy particular specificity-conferring interactions in the target. p-SIFt is a useful tool for analyzing complexes from X-ray and NMR, and for database mining for the selective enrichment of compounds against specific drug targets.

Examples 1-3

In Example 1, a set of molecular docking results was generated employing the crystal structure of p38 in complex with a pyridinyl imidazole inhibitor SB203580 (PDB accession code: 1a9u). See, e.g., Wang et al. Structure, 1998, 6(9), 1117-1128. The docking program FlexX (see Rarey et al. J. Mol. Biol., 1996, 261, 470-489) in Sybyl (version 6.8, Tripos, Inc., St. Louis, MO) was used to dock SB203580 onto the crystal structure of p38. In this single ligand study, 100 poses of SB203580 generated by FlexX were retained for subsequent analyses. The ligand binding site was defined using a cutoff radius of 12 Å from the SB203580 ligand (i.e., the conformation in the crystal structure) combined with a core sub-pocket cutoff distance of 4 Å. The FlexX scoring function was used for scoring the docking. For each ligand being studied, ChemScore, Gscore, PMF Score, Dscore, and Consensus Score were evaluated using the Cscore utility in Sybyl. For references of the just-mentioned applications, see, e.g., Eldridge et al. J. Comput.-Aided Mol. Des. 1997, 11, 425-445; Jones et al. J. Mol. Biol. 1997, 267, 727-748; Muegge et al. J. Med. Chem., 1999, 42(5), 791-804; Gohlke et al. J. Mol. Biol., 2000, 295, 337-356; and Charifson et al. J. Med. Chem., 1999, 42(25), 5100-5109. FIG. 2A shows the 100 poses generated in this experiment, which adopted different orientations and positions in the ATP binding site of the kinase.

In Example 2, the experiment described was designed to evaluate the database enrichment potential of SIFt by docking a diverse set of compounds spiked with known actives onto the same target protein structure. To this end, 16 known p38 inhibitors were combined with 1,000 small molecules with diverse chemical structures compiled internally.

These inhibitors were pyridinylimidazoles and analogs, covering the majority of the p38 inhibitor families reported thus far, as previously discussed by Adams and Lee (see Adams and Lee. Current Opinion Drug Discovery & Development. 1999, 2, 96-109). These 1,016 compounds were docked onto the p38 structure (1a9u) using FlexX distributed across 50 dual processor nodes of a Linux computing farm. For each ligand, 30 different poses generated from the docking experiment were retained, generating a library of 30,480 (30×1,016) docked ligand structures for subsequent interaction fingerprint analysis. The performance of database enrichment was measured by the enrichment factor (EF), calculated based on the ability to recover 14 out of 16 (87.5%) known inhibitors. For reference, see, e.g., Pearlman et al. J. Med. Chem. 2001, 44, 502-511. In both docking experiments, three-dimensional conformers of the ligands were generated using OMEGA (OpenEye Scientific Software, Inc., Santa Fe, NM).

In Example 3, the SIFt-based method was also used to analyze a family of experimentally determined structures. Specifically, a panel of 89 X-ray crystal structures of protein kinase-ligand complexes was selected from the PDB. The selection criteria were: 1) the structures must contain ligands (either ATP, GTP or other inhibitors) in their ATP-binding pockets; and 2) most of the ATP binding site residues must be visible and present in the crystal structures.

These 89 protein kinase-inhibitor complexes include 25 different kinases, covering 14 different protein kinase subfamilies as classified by Hanks and Quinn. See Hanks and Hunter FASEB J. 1995, 9, 576-596 and Hanks and Quinn Methods Enzymol., 1991, 200, 38-62. In all, the kinase structures contain 54 unique compounds representing a variety of chemical structures (see Table 1).

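TABLE 1. List of 89 Crystal Structures of Protein Kinase-Ligand Complexes

Protein Kinase | PDB accession code | Ligand
Bovine PKA | 1ydt | H89
Bovine PKA | 1ydr | H7
Bovine PKA | 1yds | H8
Bovine PKA | 1stc | Staurosporine
Murine PKA | 1l3r | ADP
Murine PKA | 1fmo | Adenosine
Murine PKA | 1jbp | ADP
Murine PKA | 1atp | ATP
Murine PKA | 1bx6 | Balanol
Porcine PKA | 1cdk | AMPPNP
Human CDK2 | 1jvp | PKF049-365
Human CDK2 | 1fin | ATP
Human CDK2 | 1jst | ATP
Human CDK2 | 1elv | NU2058
Human CDK2 | 1b38 | ATP
Human CDK2 | 1jsv | U55
Human CDK2 | 1hck | ATP
Human CDK2 | 1gy3 | ATP
Human CDK2 | 1b39 | ATP
Human CDK2 | 1gij | 2PU
Human CDK2 | 1gii | 1PU
Human CDK2 | 1gih | 1PU
Human CDK2 | 1fql | ATP
Human CDK2 | 1aql | Staurosporine
Human CDK2 | 1ckp | Purvalanol
Human CDK2 | 1e1x | NU6027
Human CDK2 | 1g5s | H717
Human CDK2 | 1fvt | 4-(5-BROMO-2-OXO-2H-INDOL-3-YLAZO)-BENZENESULFONAMIDE
Human CDK2 | 1ke5 | LS1
Human CDK2 | 1ke9 | LS5
Human CDK2 | 1dm2 | Hymenialdisine
Human CDK2 | 1fvv | 4-[(7-OXO-7H-THIAZOLO[5,4-E]INDOL-8-YLMETHYL)-AMINO]-N-PYRIDIN-2-YL-BENZENESULFONAMIDE
Human CDK2 | 1di8 | 4-[3-HYDROXYANILINO]-6,7-DIMETHOXYQUINAZOLINE
Human CDK2 | 1e9h | INDIRUBIN-5-SULPHONATE
Human CDK2 | 1ke8 | LS4
Human CDK2 | 1ke7 | LS3
Human CDK2 | 1ke6 | LS2
S. pombe Ck-1 alpha | 1csn | ATP
S. pombe Ck-1 alpha | 2csn | CKI7
S. pombe Ck-1 alpha | 1eh4 | IC261
Human c-src | 1byg | Staurosporine
Human c-src | 1ksw | NBS
Human c-src | 2src | AMPPNP
Human CK-2 alpha | 1jwh | AMPPNP
Human DAP | 1jkk | AMPPNP
Human DAP | 1jkl | AMPPNP
Human DAP | 1ig1 | AMPPNP
Human ERK2 | 1pme | SB202190
Human FGFR | 2fgi | PD173074
Human FGFR | 1agw | SU4984
Human FGFR | 1fgi | SU5402
Human HCK | 1ad5 | AMPPNP
Human HCK | 1qcf | PP1
Human HCK | 2hck | Quercetin
Human IGFR | 1jqh | ACP
Human IGFR | 1k3a | ACP
Human INSR | 1i44 | AMPPNP
Human INSR | 1ir3 | AMPPNP
Human INSR | 1gag | Full-name
Human JNK3 | 1jnk | AMPPNP
Human LCK | 1qpd | Staurosporine
Human LCK | 1qpe | PP2
Human LCK | 1qpj | Staurosporine
Human LCK | 1qpc | AMPPNP
Human P38 | 1kv1 | BMU
Human P38 | 1kv2 | BIRB-796
Human P38 | 1bmk | SB218655
Human P38 | 1bl7 | SB220025
Human P38 | 1di9 | 4-[3-METHYLSULFANYLANILINO]-6,7-DIMETHOXYQUINAZOLINE
Human P38 | 1a9u | SB203580
Human P38 | 1bl6 | SB216995
Murine ABL | 1iep | STI-571
Murine ABL | 1fpu | STI-571
Murine CHAK | 1iah | ADP
Murine CHAK | 1ia9 | AMPPNP
Maize CK-2 alpha | 1lp4 | AMPPNP
Maize CK-2 alpha | 1daw | AMPPNP
Maize CK-2 alpha | 1j91 | TBS
Maize CK-2 alpha | 1ds5 | AMP
Maize CK-2 alpha | 1day | GNP
Maize CK-2 alpha | 1f0q | Emodin
Murine NUK | 1jpa | AMPPNP
Human P38-gamma | 1cm8 | AMPPNP
Rat ERK2 | 1gol | ATP
Rat ERK2 | 4erk | Olomoucine
Rat ERK2 | 3erk | SB220025
Rabbit PHK | 1phk | ATP
Rabbit PHK | 1ql6 | ATP
Rabbit PHK | 2phk | ATP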

In each of Examples 1-3, the first step in the construction of SIFts was to identify a list of selected positions or binding site residues that are common to all complex structures being studied. The resulting panel of ligand binding site residues, which covered all of the interactions occurring between the target protein and the ligands, was then used as the common reference frame to construct the interaction fingerprints.

For a group of structures involving the same target protein (experiments such as those described in Examples 1 and 2), the ligand binding site is defined as the list of residues comprising the union of all residues involved in ligand binding over the entire library of structures. For a group of structures involving different target molecules (such as the experiment described in Example 3), additional structural and sequence pre-alignment steps were required as described immediately below.

In Example 3, the crystal structure of murine PKA complexed with ATP and a peptidic inhibitor PKI (PDB accession number: 1ATP; see Zheng et al. Acta Cryst. 1993, D49, 362-365) was used as the reference model for structural and sequence alignment. Initial amino acid sequence alignment of the catalytic cores of these kinases was taken from the Protein Kinase Resources (see Smith et al. TIBS, 1997, 22(11), 444-446). Structural alignment of the kinase structures was carried out manually and focused primarily on the vicinity of the ATP binding sites. Based on the structural alignment results, sequence alignments were carefully checked and adjusted if necessary, so that all structurally equivalent residues match each other in the sequence alignment. After the sequence and structural alignments, the residues of the non-murine PKA protein kinases were renumbered and tallied to the murine PKA residue numbering system, resulting in a uniform residue numbering system for all kinases analyzed. Identification of the list of ligand binding sites was carried out as previously described using the new PKA-equivalent residue numbers.

In each of Examples 1-3, after all the ligand binding site residues were identified and all the protein-ligand intermolecular interactions were calculated, the next step was to classify these interactions, as described previously in the “Detailed Description” Section.

Seven different types of interactions occurring at each binding site residue were extracted and classified from the AREAIMOL and HBPLUS results. The inquiries for each residue were: 1) whether or not it is in contact with the ligand; 2) whether or not any main-chain atom is involved in the contact; 3) whether or not any side-chain atom is involved in the binding; 4) whether or not a polar interaction is involved; 5) whether or not a non-polar interaction is involved; 6) whether or not the residue provides hydrogen bond acceptor(s); 7) whether or not it provides hydrogen bond donor(s). By doing so, each residue was represented by a seven-bit-long bit string. The whole interaction fingerprint of the complex was finally constructed by sequentially concatenating the bit strings of the binding site residues, in ascending residue number order. Therefore, all interaction fingerprints are of the same length, and each bit in the fingerprint represents the presence or absence of a particular interaction at a particular binding site residue.
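The construction just described can be illustrated with a minimal sketch. It assumes the seven per-residue answers have already been derived (e.g., from contact and hydrogen-bond calculations) and are supplied as a dictionary of flags; the names BIT_NAMES, residue_bits and build_sift, and the example residue numbers, are hypothetical and are not part of the described implementation.

```python
from typing import Dict, List

# Assumed labels for the seven inquiries, in the order listed in the text.
BIT_NAMES = ["contact", "main_chain", "side_chain", "polar",
             "non_polar", "hb_acceptor", "hb_donor"]


def residue_bits(flags: Dict[str, bool]) -> List[int]:
    """Return the seven-bit string for one binding site residue."""
    return [1 if flags.get(name, False) else 0 for name in BIT_NAMES]


def build_sift(interactions: Dict[int, Dict[str, bool]],
               site_residues: List[int]) -> List[int]:
    """Concatenate per-residue bit strings in ascending residue number order.

    Residues of the common binding site that make no interaction in this
    complex contribute seven OFF bits, so every fingerprint has equal length.
    """
    bits: List[int] = []
    for res in sorted(site_residues):
        bits.extend(residue_bits(interactions.get(res, {})))
    return bits


# Hypothetical three-residue binding site and one pose.
site = [53, 30, 35]
pose = {
    35: {"contact": True, "side_chain": True, "non_polar": True},
    53: {"contact": True, "main_chain": True, "polar": True, "hb_donor": True},
}
sift = build_sift(pose, site)
print("".join(str(b) for b in sift))
```

Because the residue list is fixed for the whole library, bit positions are directly comparable between complexes.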

As described above in Example 1, the SIFt-based method was applied to analyze the result of a typical docking study. The docking study resulted in 100 docking poses of a small molecule inhibitor (SB203580) of p38, for which the crystal structure was known (PDB entry 1a9u). The poses adopted diverse binding modes, varied in their orientations and positions relative to the target protein and were complex to interpret visually (see FIG. 2A). A total of 34 protein residues in the vicinity of the ATP binding pocket were identified as the ligand binding site. These binding site residues were located in different sub-regions of the kinase structure. SIFts were generated for all complexes, each of which was composed of 238 (7×34) binary bits. The hierarchical clustering result of these fingerprints is shown in FIG. 2B with the fingerprint Tanimoto similarity matrix represented as a heat-map. The dendrogram revealed seven major clusters, labeled 1 to 7, respectively. FIG. 2B shows that the clustering by their SIFt patterns has separated the poses into different groups with distinct binding interactions. FIGS. 2C-2I depict the structures of each major cluster, each of which was put in the same reference frame. Interestingly, each of these seven clusters was comprised of poses having similar binding modes with the receptor. Cluster 1 contained molecules similar to the known X-ray crystal structure. Clusters 2-5 were similar in position but represented distinct binding modes that resulted in dissimilar interactions with the Gly-rich loop and the catalytic loop of p38. Finally, clusters 6 and 7 were outside the ATP binding site. Reassuringly, the degree of variation between clusters observed visually in their binding interactions appears to correlate with their distance in the dendrogram. For example, groups 1, 4, 6 and 7 each showed very little structural variation, as represented by tight clusters in the dendrogram, whereas groups 3 and 5 showed relatively more diversity in their structures as well as in their fingerprints. Furthermore, clusters 1 and 7 had very little in common and were farthest from each other in the dendrogram. In summary, visual inspection confirms that SIFt is useful in separating docking poses into distinct clusters that reveal distinct binding interactions.
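By way of illustration only, the Tanimoto comparison and grouping of binary SIFts might be sketched as follows. The text does not state the linkage rule or any distance cutoff, so the average-linkage merging and the threshold parameter used here are assumptions, as are the function names and the toy fingerprints.

```python
from typing import List


def tanimoto(a: List[int], b: List[int]) -> float:
    """Tanimoto coefficient of two binary fingerprints of equal length."""
    both = sum(1 for x, y in zip(a, b) if x and y)
    either = sum(1 for x, y in zip(a, b) if x or y)
    return both / either if either else 1.0


def cluster(fps: List[List[int]], threshold: float) -> List[List[int]]:
    """Agglomerative clustering on the distance 1 - Tanimoto.

    Assumption: average linkage; merging stops once the two closest
    clusters are farther apart than `threshold`.
    """
    clusters = [[i] for i in range(len(fps))]

    def dist(c1: List[int], c2: List[int]) -> float:
        total = sum(1.0 - tanimoto(fps[i], fps[j]) for i in c1 for j in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > 1:
        d, i, j = min((dist(clusters[i], clusters[j]), i, j)
                      for i in range(len(clusters))
                      for j in range(i + 1, len(clusters)))
        if d > threshold:
            break
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters


# Four toy fingerprints forming two obvious groups.
fps = [[1, 1, 0, 0], [1, 1, 1, 0], [0, 0, 1, 1], [0, 0, 0, 1]]
groups = cluster(fps, threshold=0.6)
print(groups)
```

A full dendrogram, as shown in the figures, would record every merge and its distance rather than stopping at a cutoff.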

Traditionally, various scoring functions have been used to rank poses from docking studies. Scoring function scores provide an estimate of the binding strength of the compounds in order to identify the potential “good binders” from a large pool of poses, such that a selection of top scoring compounds derived from a rank ordered list of docked ligands will be enriched with active compounds. It was therefore examined whether scoring functions can be useful in discriminating the poses in the different SIFt clusters (i.e., different binding modes). In FIG. 3A, the first SIFt cluster, which is the closest to the true binding conformation, showed a wide range in PMF scores, spanning from the best score (−70) to the worst (−4). In fact, the majority of the poses in this cluster were no better in their PMF scores than those in other SIFt clusters. In addition, the PMF scores for SIFt cluster 2 were just as good as those for cluster 1, even though the poses in cluster 2 adopt different, crystallographically unobserved, interactions with the receptor.

The other clusters also overlap with each other in their docking scores. Clearly, the PMF score is a poor scoring function for discriminating poses with the true binding mode from irrelevant poses in this experiment. In an attempt to broaden the analysis of scoring functions, a consensus scoring function that consists of five commonly used scoring functions was also examined (see FIG. 3B). Many of the poses in clusters 1-3 had high Cscores (3-5), while clusters 3-7 overlapped significantly in the score range 0-2. This example further demonstrates that, across a range of scoring functions, the energy-based approaches alone were insufficient for distinguishing different binding modes and for isolating those poses corresponding to the observed binding mode.

The application of the SIFt-based method was extended to other ensembles of structures involving different proteins and a diverse set of small molecules. In Example 3, 89 known crystal structures of the protein kinase family that had been deposited in the Protein Data Bank were chosen. As mentioned above, they represent 14 different protein kinase subfamilies and 54 unique kinase small molecule ligands/inhibitors. The structure and sequence homology among protein kinases enabled us to analyze these structures using the SIFt-based approach.

A total of 56 residues were identified as the ligand binding site (see FIG. 4A).

The heat-map and the results from hierarchical clustering are shown in FIG. 4B. These interaction fingerprints were diverse, reflecting a high degree of variability in their binding interactions. Nevertheless, three major clusters can be identified from the dendrogram (see FIG. 4B). Although the results indicate that within each cluster there existed considerable variation in their interaction patterns, these three groups represented three distinct binding modes, as confirmed by careful inspection of their structures (see FIG. 4C). The first cluster had 4 members, containing structures of human p38 in complex with four different pyridinyl imidazole inhibitors: SB203580, SB216995, SB220025 and SB218655. The second cluster had 16 members, mostly human CDK2 in complex with different compounds with diverse chemical properties. The third cluster, which did not have a clear-cut boundary, was comprised of approximately 36 structures, almost all of them structures of different kinases in complex with ATP or ATP-analog inhibitors (GTP, AMPPNP, AMPPCP, AMP, ADP, etc.). Besides these three major clusters, about one-third of the 89 structures were either singletons or formed tiny clusters. Interestingly, the three major clusters represent different grouping examples of protein-ligand complexes: the first is made up of the same protein and chemically similar compounds; the second contains the same protein but with a variety of ligands; the third contains different proteins in complex with chemically similar ligands.

Comparison of these fingerprints also revealed interactions that are conserved or highly variable among the structures. For instance, contact interactions with residue 57 (in PKA numbering, within the Gly-rich loop) and residue 70 (also in PKA numbering) are strictly conserved among all of the 89 protein kinase-ligand structures. Other highly conserved interactions include contacts with residues 49, 72, 120, 121, 123, 173, 184, etc. (see FIG. 4B). In contrast, many other interactions are not conserved or are conserved only within a particular group. Detailed and systematic comparison of these structural profiles of the ATP binding sites of protein kinases will be presented elsewhere (Deng et al. manuscript in preparation).

The SIFt-based method provides a new and powerful tool for lead discovery and lead optimization, enabling the search for molecules in a chemical database on the basis of expected interaction patterns to a target molecule. This application was specifically tested in Example 2, where a virtual screen for a set of 16 known p38 inhibitors spiked into a diverse library of 1,000 commercially available compounds was performed. These p38 inhibitors were all ATP-competitive inhibitors, and despite representing varied chemical templates had similarities to the pyridinylimidazole series (i.e., SB203580-like) for which the crystal structure of the complex was known (1a9u).

These inhibitors and the random collection of chemical compounds were docked using FlexX onto the crystal structure of p38 (1a9u), and how well these known inhibitors could be enriched using commonly used scoring functions was assessed. These were then compared with the results from a SIFt-based enrichment involving filtering of the compounds based on their similarities in interaction patterns (measured by Tanimoto coefficient) to SB203580, a known pyridinylimidazole inhibitor of p38 for which the X-ray crystal structure was known. The rationale for SIFt-based enrichment is that these 16 known inhibitors, being analogs of the pyridinylimidazole series, are expected to bind to p38 with similar overall binding modes.

FIGS. 5A and 5B and Table 2 show the comparison of the database enrichment performances of the scoring functions with SIFt. ChemScore gave a modest enrichment factor of 5.4, and 166 compounds were harvested in order to identify 14 of the 16 known p38 inhibitors. PMF performed worse than ChemScore, with an enrichment factor of 2.0. In addition, an analysis of the binding modes of the poses of the enriched p38 inhibitors identified using these scoring functions showed that some of them differed substantially from the known crystal structure of SB203580, despite similarities in functionalities, suggesting that their binding modes obtained by ChemScore or PMF score were incorrect. This implies that the scoring functions were probably performing worse than the enrichment factors were indicating. In comparison, SIFt scored quite well, having to harvest only 24 compounds to be able to identify 14 of the 16 inhibitors, giving an enrichment factor of 37.0. Reassuringly, the highest scoring compound recovered by SIFt was SB203580, upon which the interaction fingerprint used to probe the database was based. Visual inspection of the binding modes of the p38 inhibitors identified using SIFt showed that all of their binding modes were similar to that of SB203580. A combination of SIFt and ChemScore led to a modest increase in enrichment (EF=42.3).

TABLE 2. Comparison of the database enrichment performances of SIFt with ChemScore and PMF Score

Filtering Method | Enrichment Factor (EF)*
PMF Score | 2.0
ChemScore | 5.4
SIFt | 37.0
SIFt + ChemScore | 42.3

*EF is defined as: $EF = \frac{Hits_{sampled}/N_{sampled}}{Hits_{total}/N_{total}}$, where Hits_sampled is the number of known inhibitors recovered in the sampled fraction of N_sampled poses, and Hits_total is the number of known inhibitors present in the whole library of N_total compounds. Here each EF was calculated based on the ability to recover 14 out of 16 known p38 inhibitors spiked into a random library of 1,000 compounds.
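As a worked check of the enrichment factor definition, the short sketch below recomputes two of the reported values from the harvest sizes given in the text (24 compounds for SIFt, 166 for ChemScore). It assumes N_total is the full docked library of 1,016 compounds (1,000 plus the 16 spiked inhibitors); the function name is hypothetical.

```python
def enrichment_factor(hits_sampled: int, n_sampled: int,
                      hits_total: int, n_total: int) -> float:
    """EF = (Hits_sampled / N_sampled) / (Hits_total / N_total)."""
    return (hits_sampled / n_sampled) / (hits_total / n_total)


# 14 of the 16 known inhibitors recovered; library size assumed to be 1,016.
ef_sift = enrichment_factor(14, 24, 16, 1016)
ef_chemscore = enrichment_factor(14, 166, 16, 1016)
print(round(ef_sift, 1), round(ef_chemscore, 1))
```

Under this assumption the rounded results agree with the SIFt and ChemScore entries of the table.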

Examples 4 and 5

These two examples illustrate two other embodiments of SIFt implementation that include the chemical information about the ligands into their SIFt patterns. In Example 4, the information about core and variable groups (R-groups) of a compound is embedded into the SIFts (e.g., r-SIFts); in Example 5, the pharmacophoric features of the compound are used.

In Example 4, the same set of 100 docking poses of SB203580 docked onto p38 used in Examples 1 and 2 was also used. The SB203580 molecule was decomposed into core, R-1, R-2 and R-3 groups as shown in FIG. 7A. Each non-hydrogen atom was assigned to one of these four groups. Four binary bits were used for each binding site residue, representing the core, R-1, R-2 and R-3, respectively. If the residue was in contact with (i.e., at a distance of 4.0 Angstroms or less from) a non-hydrogen atom belonging to a particular group, then the corresponding bit was turned ON (1); otherwise the bit remained OFF (0). The final SIFt pattern was constructed by concatenating the bit strings of all the binding site residues, in the same ascending residue number order as used in Example 1.
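The r-SIFt construction can be sketched as follows, under the assumption that atomic coordinates of the binding site residues and group-labelled ligand heavy atoms are already available. The data layout, the function name build_r_sift and the toy coordinates are hypothetical; only the 4.0 Angstrom contact criterion and the bit order come from the text.

```python
import math
from typing import Dict, List, Tuple

Coord = Tuple[float, float, float]

# Bit order per residue, as stated in the text.
GROUPS = ["core", "R-1", "R-2", "R-3"]


def build_r_sift(residue_atoms: Dict[int, List[Coord]],
                 ligand_atoms: List[Tuple[str, Coord]],
                 cutoff: float = 4.0) -> List[int]:
    """Four bits per residue: ON if any residue atom lies within `cutoff`
    Angstroms of a non-hydrogen ligand atom belonging to that group."""
    bits: List[int] = []
    for res in sorted(residue_atoms):
        in_contact = set()
        for group, lig_xyz in ligand_atoms:
            if any(math.dist(lig_xyz, xyz) <= cutoff
                   for xyz in residue_atoms[res]):
                in_contact.add(group)
        bits.extend(1 if g in in_contact else 0 for g in GROUPS)
    return bits


# Toy geometry: residue 51 touches the core, residue 109 touches R-1.
residues = {51: [(0.0, 0.0, 0.0)], 109: [(10.0, 0.0, 0.0)]}
ligand = [("core", (3.0, 0.0, 0.0)),
          ("R-1", (8.0, 0.0, 0.0)),
          ("R-2", (20.0, 0.0, 0.0))]
r_sift = build_r_sift(residues, ligand)
print(r_sift)
```

Reading the ON bits per residue shows directly which part of the ligand contacts which part of the binding site.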

Grouping of the SIFt patterns was carried out using the same hierarchical clustering method as described in Example 1.

FIG. 7A is the decomposition of molecule SB203580 into core (1) and three different R-groups, R-1 (2), R-2 (3) and R-3 (4).

FIG. 7B is a hierarchical clustering of the SIFts of 100 SB203580 docking poses. The SIFts were constructed to represent different R-groups and the core of the molecule. Each selected position of the target molecule is made up of four binary bits, representing core, R-1, R-2, and R-3, respectively. Each SIFt is shown as one line in the heat map in the left of the figure, and only ON-bits are shown. The shades of gray, or colors, of the heat map blocks indicate different R-groups: red for core, blue for R-1, yellow for R-2, and green for R-3. The right side of the figure shows the hierarchical clustering results on the fingerprints, including the dendrogram and the reorganized distance matrix. SIFts in the heat map were reorganized according to the order given by the hierarchical clustering. The shaded, or colored, bar on top of the SIFt heat map represents five kinase sub-regions in the fingerprints. These sub-regions, each shaded or colored differently, include the Gly-rich loop (G-loop), the region spanning from β3 to β4 (β3 to β4), β5 and the hinge region, catalytic loop and magnesium loop.

FIGS. 7C and 7D show the structures of the poses in cluster 1 (7C) and cluster 2 (7D), respectively, as identified by the hierarchical clustering of their r-SIFts (FIG. 7B), in the context of the p38 crystal structure (1a9u). The poses are shown in gray or cyan, and the co-crystal structure of SB203580 is shaded or colored according to atom types. The five kinase sub-regions that are in contact with the poses within the group are shaded or colored using the same shading or coloring scheme as described in FIG. 2B and FIG. 7B. Compared to Example 1, the 7 r-SIFt groups are more tightly clustered, indicating that r-SIFt is more sensitive to differences in binding mode than the original SIFt, comprised of 7 interaction bits per residue, that was used in Example 1. In addition, since different bits in the r-SIFt correspond to different segments of the molecule, it is very straightforward to tell from the r-SIFt which part of the molecule interacts with which part of the target molecule. Therefore, r-SIFt can be used in virtual screening as a convenient tool to separate poses of different binding modes.

In Example 5, the same set of SB203580 docking poses was used. This time, however, each atom of the molecule was assigned to one or more of seven different chemical features: hydrogen bond acceptor, hydrogen bond donor, hydrophobic, polar, negatively charged, positively charged, or aromatic ring atom. Some atoms fell into more than one of these categories. When constructing the new SIFt patterns, seven binary bits were used to represent a binding site residue, each indicating one of the above seven chemical features. If the residue was within 4.0 Angstroms of any atom belonging to a particular chemical feature category, then the corresponding bit was turned ON (1); otherwise it remained OFF (0). The final SIFt was constructed by concatenating the binary strings for all binding site residues, in the same order as used in Examples 1 and 4.

FIG. 8 is the hierarchical clustering of the SIFts of the same 100 docking poses of SB203580. Here the SIFt patterns contained 7 bits per selected position, each representing one of the seven chemical features of the molecule: red for hydrogen bond acceptor, blue for hydrogen bond donor, yellow for hydrophobic, green for polar, cyan for negatively charged, orange for positively charged, and black for aromatic ring. These colors are represented in shades of gray in FIG. 8. The hierarchical clustering was based on the new SIFt patterns incorporating the chemical features of the molecules.

In both Examples 4 and 5, the alternative constructions of the SIFt pattern provided richer information about the chemical environment around the binding site. Hierarchical clustering of these two sets of new SIFts gave similar performance in terms of separating different binding modes of the poses, and the results were comparable with those given by the previous construction of SIFt described in Example 1. This indicates that SIFt patterns incorporating information about the R-groups or the chemical features are both useful ways of representing the structural information, complementary to the previous construction of SIFt.

Example 6

This example demonstrates one of many potential applications of the p-SIFt. A p-SIFt represents the degree of similarity for an interaction occurring at a particular binding site among a group of structures. In this example, the value at each position is the average of all the interaction bit values occurring at this particular position within a group of SIFts.
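The averaging described above might be sketched as follows. The function name build_p_sift and the toy fingerprints are hypothetical; the only operation taken from the text is the per-position average of bit values over a group of equal-length SIFts.

```python
from typing import List


def build_p_sift(sifts: List[List[int]]) -> List[float]:
    """Average the bit values at each position over a group of SIFts."""
    if not sifts:
        raise ValueError("at least one SIFt is required")
    length = len(sifts[0])
    if any(len(s) != length for s in sifts):
        raise ValueError("all SIFts must have the same length")
    n = len(sifts)
    return [sum(column) / n for column in zip(*sifts)]


# Four toy fingerprints; position 0 is fully conserved, position 1 is rare.
group = [[1, 0, 1, 1],
         [1, 0, 0, 1],
         [1, 1, 0, 1],
         [1, 0, 0, 0]]
profile = build_p_sift(group)
print(profile)
```

A value of 1.0 marks an interaction present in every structure of the group, while values near 0 mark interactions that are rarely made.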

FIG. 9A shows the p-SIFt generated from the SIFt patterns of four p38 crystal structures (1a9u, 1b16, 1b17, and 1bmk), each of which contains a different potent p38 inhibitor. The X-axis represents the p38 residue numbers of the interaction bits; the Y-axis represents the conservation scores of the interaction bits. The more conserved an interaction, the higher the value at this position.

The above p-SIFt was used to enrich p38 inhibitors from a large library. The idea behind the approach is that if a compound adopts an interaction pattern similar to that of previously known inhibitors (i.e., a p-SIFt), then it is more likely to be a true inhibitor. The statistical Z score was used to measure how significant the similarity between a SIFt and a target profile is above a certain background. Z score is defined as
$Z = \frac{x - \langle x_{b} \rangle}{\sigma_{b}}$

where x is the Tanimoto coefficient of the SIFt against the target profile, and $\langle x_{b} \rangle$ and $\sigma_{b}$ are the mean and standard deviation, respectively, of the Tanimoto coefficients of all the SIFts in the background set against the same target profile. The background set was used to construct a reference distribution upon which the comparisons were based.
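The Z score can be sketched as follows. Two choices here are assumptions rather than details given in this example: the Tanimoto coefficient between a binary SIFt and a real-valued profile is computed in its generalized (continuous) form, and the population standard deviation is used for the background.

```python
from statistics import mean, pstdev


def tanimoto(a, b):
    """Generalized Tanimoto coefficient: a.b / (|a|^2 + |b|^2 - a.b)."""
    dot = sum(x * y for x, y in zip(a, b))
    denom = sum(x * x for x in a) + sum(y * y for y in b) - dot
    return dot / denom if denom else 0.0


def z_score(sift, profile, background):
    """Z = (x - <x_b>) / sigma_b for one SIFt against a target profile."""
    x = tanimoto(sift, profile)
    bg = [tanimoto(s, profile) for s in background]
    sigma_b = pstdev(bg)  # population standard deviation (an assumption)
    if sigma_b == 0:
        raise ValueError("background has zero variance")
    return (x - mean(bg)) / sigma_b


# Made-up profile, background set, and two candidate SIFts.
profile = [1.0, 0.5, 0.0, 0.25]
background = [[1, 1, 0, 0], [0, 0, 1, 1], [1, 0, 0, 0], [0, 0, 1, 0]]
z_good = z_score([1, 1, 0, 0], profile, background)
z_bad = z_score([0, 0, 1, 0], profile, background)
```

A SIFt that reproduces the conserved bits of the profile scores higher than one that misses them, which is the property the ranking relies on.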

A library of sixteen known p38 inhibitors and 1000 random compounds was docked onto the p38 target molecule. For each compound, 10 poses were retained for subsequent analysis. Poses were ranked according to their SIFt Z scores against the p38 p-SIFt generated from the four co-crystal structures. The background set used in the Z score calculation included all of the docking poses. For each compound, the pose with the highest Tanimoto coefficient against the p38 profile was selected, and then all 1016 best poses were ranked according to their Z scores. The database enrichment curves are shown in FIG. 9B. The X-axis is the percentage of the whole library collected, and the Y-axis is the percentage of active compounds harvested. For comparison, the enrichment performance of two conventional scoring functions (ChemScore and PMF Score) is also shown.

From FIG. 9B it is clear that applying the SIFt-based Z score to select the best pose for each compound provided markedly better enrichment than standard scoring with ChemScore and PMF Score.

Example 7

A panel of 93 X-ray crystal structures of protein kinase-ligand complexes was selected from the PDB. The selection criteria included the following: (i) the structures were complexed with small molecules (either ATP, ATP-analogs or inhibitors) present in their ATP binding pockets; and (ii) most of the ATP binding site residues were visible and present in the crystal structures.

The crystal structures of p38 in complex with the pyridinyl imidazole inhibitor SB203580 (PDB code 1a9u) and of CDK2 complexed with 4-[3-Hydroxyanilino]-6,7-Dimethoxyquinazoline (PDB code 1di8) were used for docking studies. In each case the ligand-binding site was defined from the bound ligand using a cut-off of 10 Å. Bound waters were removed from the binding sites and the receptors were protonated at pH 7.4.

The set of known inhibitors of p38 was chosen to span several major p38 inhibitor chemotypes (see, for example, Adams, J.; and Lee, D. Curr. Opin. Drug Discovery Dev. 1999, 2, 96-109, which is incorporated by reference in its entirety). The inhibitors of CDK2 were 54 active compounds collected from the literature (see, for example, Claussen, H.; et al. Current Drug Discovery Technologies 2004, 1, 49-60, which is incorporated by reference in its entirety). These known inhibitors of p38 and CDK2 were combined with 1000 small molecules compiled internally. To ensure diversity, the decoy set was selected on the basis of structural and property diversity using the extended connectivity fingerprints (ECFP), molecular weight, and LogP in PipelinePilot. A 3D version of the ligand database was generated with the program Corina, with options set to generate flexible ring conformers and stereoisomers.

The docking program FlexX in Sybyl was used to dock the ligands onto the crystal structures of p38 and CDK2 (see, for example, Rarey, M.; et al. J. Mol. Biol. 1996, 261, 470-489; and Kramer, B.; et al. Proteins 1999, 37, 228-241, each of which is incorporated by reference in its entirety). In each study, 30 ligand poses generated by FlexX were retained for subsequent analyses. The FlexX scoring function was used to score the docked poses.

A background set of SIFt patterns was used to define a reference distribution upon which the comparisons were based. For the analysis of the kinase crystal structures, a background set of dummy SIFts around an all-kinase p-SIFt was generated. The p-SIFt from all 93 kinase crystal structures was first calculated (see FIG. 10). A background set of 2,000 bit-strings of the same length was then generated such that each position within these bit-strings was randomly assigned either 1 or 0, with the probability of assigning a value of 1 equal to the value at that position in the all-kinase p-SIFt. For the database enrichment experiment, all the docking poses were used as the background set.
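The random background described above amounts to independent Bernoulli sampling at each bit position. The sketch below illustrates this with the Python standard library; the function name `random_background`, the fixed seed, and the toy profile are assumptions introduced for illustration.

```python
import random


def random_background(profile, n_strings=2000, seed=0):
    """Generate dummy bit-strings around a p-SIFt.

    Each position is set to 1 with probability equal to the p-SIFt value
    at that position, independently of all other positions.
    """
    rng = random.Random(seed)  # fixed seed only to make the sketch reproducible
    return [[1 if rng.random() < p else 0 for p in profile]
            for _ in range(n_strings)]


# Made-up 4-position profile standing in for the all-kinase p-SIFt.
profile = [1.0, 0.5, 0.0, 0.25]
background = random_background(profile)
```

Positions with a profile value of 1.0 are always ON in the background and positions with a value of 0.0 are always OFF, so the background varies only where the profile itself is variable.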

For the 93 structures, 56 ligand binding site residues were used to construct SIFts. Those playing a significant role in interactions with ligands are listed in Table 3, along with their uniform PKA residue numbering.

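TABLE 3
Raw Interaction Frequency

| PKA # | All | ATP | CDK2 | p38 | non-ATP | 2-Structure | Annotation |
|---|---|---|---|---|---|---|---|
| 49 | 0.9 | 0.9 | 0.9 | 0.4 | 0.9 | Gly-Rich Lp | ATP; Hydrophobic contact with Adenine |
| 50 | 0.6 | 0.9 | 0.3 | 0.2 | 0.5 | Gly-Rich Lp | ATP; Ribose |
| 51 | 0.5 | 0.7 | 0.3 | 0.1 | 0.4 | Gly-Rich Lp | ATP; Ribose |
| 52 | 0.5 | 0.9 | 0.4 | 0.0 | 0.3 | Gly-Rich Lp | ATP; Phosphate |
| 53 | 0.4 | 0.7 | 0.2 | 0.1 | 0.1 | Gly-Rich Lp | ATP; Phosphate |
| 54 | 0.3 | 0.5 | 0.2 | 0.9 | 0.2 | Gly-Rich Lp | ATP; Phosphate |
| 55 | 0.2 | 0.5 | 0.1 | 0.0 | 0.1 | Gly-Rich Lp | ATP; Phosphate |
| 57 | 1.0 | 1.0 | 0.7 | 0.8 | 1.0 | Gly-Rich Lp | *ATP; Hydrophobic contact with Adenine, Ribose, Phosphate |
| 70 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | b3 | *ATP; Hydrophobic contact with Adenine |
| 72 | 0.8 | 0.9 | 0.7 | 1.0 | 0.8 | b3 | *ATP; Phosphate |
| 95 | 0.1 | 0.0 | 0.0 | 0.6 | 0.2 | ac | Hydrophobic pocket |
| 104 | 0.7 | 0.7 | 0.7 | 0.8 | 0.8 | Lp-ac-a4 | ATP; Hydrophobic contact with Adenine |
| 106 | 0.1 | 0.0 | 0.0 | 0.4 | 0.1 | Lp-ac-a4 | Hydrophobic pocket |
| 118 | 0.2 | 0.0 | 0.0 | 1.0 | 0.3 | b5 | Hydrophobic pocket |
| 119 | 0.0 | 0.0 | 0.0 | 0.7 | 0.1 | b5 | Hydrophobic pocket |
| 120 | 0.9 | 0.9 | 0.9 | 1.0 | 1.0 | b5 | *Gatekeeper |
| 121 | 0.8 | 0.9 | 1.0 | 1.0 | 0.8 | b5 | *ATP; Hydrogen bond with Adenine |
| 122 | 0.7 | 0.6 | 1.0 | 1.0 | 0.8 | b5 | ATP; Hydrophobic contact Adenine |
| 123 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | hinge | *ATP; Hydrogen bond Adenine |
| 124 | 0.3 | 0.0 | 0.6 | 0.4 | 0.5 | hinge | ATP; Adenine water mediated interaction |
| 125 | 0.2 | 0.0 | 0.5 | 0.4 | 0.4 | hinge | |
| 127 | 0.7 | 0.8 | 0.9 | 0.3 | 0.6 | hinge | ATP; Ribose |
| 130 | 0.3 | 0.2 | 0.5 | 0.0 | 0.4 | hinge | ATP; Ribose water mediated interaction |
| 168 | 0.2 | 0.5 | 0.2 | 0.0 | 0.1 | Lp-b6-b7 | |
| 170 | 0.6 | 0.8 | 0.4 | 0.2 | 0.4 | Lp-b6-b7 | ATP; Ribose |
| 171 | 0.3 | 0.4 | 0.4 | 0.0 | 0.3 | Lp-b6-b7 | |
| 173 | 0.9 | 0.9 | 1.0 | 0.3 | 0.9 | Lp-b6-b7 | |
| 182 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | b8 | ATP; contact with Mg-Loop region |
| 183 | 0.6 | 0.5 | 0.3 | 0.4 | 0.7 | b8 | ATP; Hydrophobic contact with Mg-Loop region |
| 184 | 0.8 | 0.9 | 0.8 | 0.7 | 0.8 | b8 | ATP; contact with Mg-Loop region |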

Table 3 presents a summary of the raw frequencies observed for contact interactions. Only those residues having a frequency greater than 0.4 for any subgroup are listed. Residues having an interaction frequency of ≧0.7 are considered to be conserved; those less than 0.7 but greater than or equal to 0.4 are considered to be intermediate; and those less than 0.4 are considered to be variable. Entries in the annotation column marked with * indicate that the frequency was defined as conserved (≧0.7) for all subgroups independently. Wherever possible, information on the context of the interaction in binding ATP or inhibitors is included as an annotation.
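The binning of interaction frequencies into the three categories can be written as a small helper. This is a sketch only; the function name `classify` is an assumption, and the example frequencies are the p38 values listed in Table 3 for residues 118, 95, and 127.

```python
def classify(freq, conserved=0.7, intermediate=0.4):
    """Bin an interaction frequency using the cut-offs stated in the text."""
    if freq >= conserved:
        return "conserved"
    if freq >= intermediate:
        return "intermediate"
    return "variable"


# p38 frequencies for three residues, taken from Table 3.
p38_freq = {118: 1.0, 95: 0.6, 127: 0.3}
labels = {residue: classify(f) for residue, f in p38_freq.items()}
```

With these cut-offs, residue 118 is conserved, residue 95 is intermediate, and residue 127 is variable for the p38 subgroup.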

Hierarchical clustering of the SIFts computed for the 93 kinase structures, carried out as described above, revealed three major clusters representing three dominant interaction patterns present in the ligand-kinase complexes. Cluster 1 is composed of 9 structures of small molecule inhibitors interacting with p38 kinase (herein referred to as the p38 cluster).

Similarly, Cluster 2 is composed of 20 structures for complexes involving inhibitors of CDK2 kinase (denoted as the CDK2 cluster). The largest distinct group, Cluster 3, is made up of 9 ATP and 16 ATP-analogs complexed with different kinases, which will be termed the ATP-group (ATPg) cluster. The remaining roughly one third of the structures do not belong to any particular cluster. It is noteworthy that the hierarchical clustering procedure, based solely on ligand-receptor interaction features, is able to group structures into meaningful clusters where variable ligands have similar interactions with a fixed receptor (p38 and CDK2 clusters) and where very similar ligands interact in a highly conserved way with a diverse set of kinase receptors (ATPg cluster).

The p-SIFts may be derived using a reduced set of interaction features to represent each interaction. Thus, while a SIFt can utilize 7 bits to characterize the interaction at each residue, a simplified p-SIFt can be derived from only the interaction frequencies of the contact bit at each residue. In order to simplify the analyses, the results presented in Examples 7-10 were based on contact-only p-SIFts.

As an initial application, p-SIFts provided a useful tool for surveying the interaction patterns observed between ligands and protein kinases. For this purpose, it can be convenient to define categories from the contact-only p-SIFts to characterize the observed interactions, e.g., conserved ≧0.7, 0.4 ≦ intermediate <0.7, variable <0.4, as denoted by dashed lines on the plot in FIG. 10. The p-SIFt generated from the 93 kinase structures using all 7 bits to compute the SIFts is shown in FIG. 10. The p-SIFt is annotated with a topmost bar delineating the general kinase structural features for that portion of the fingerprint; the bar below consists of alternating blocks corresponding to each residue (site in the uniform PKA numbering scheme) in the kinase used to construct the fingerprint; and the third bar consists of blocks for each bit representing the interaction features at that site. It should be noted that the p-SIFts themselves were not particularly sensitive to minor variations in the cut-offs used for binning the interaction frequencies. The overall distribution of conserved, intermediate, and variable interactions observed for all structures and for the ATPg, p38, and CDK2 clusters is summarized in Table 3.

The 25 members of the ATPg cluster consisted of 9 structures of ATP complexed with 3 different kinases and 16 structures of ATP analogs complexed with 6 kinases. The ATPg p-SIFt computed from the ATPg cluster SIFts is shown in the top panel of FIG. 11. For comparison, the p-SIFt derived using only the 9 ATP structures in the ATPg cluster is also plotted. FIG. 11 shows the contact-only p-SIFts for ATPg (top panel), p38 (middle panel), and CDK2 (bottom panel), plotted as a function of PKA residue numbering. The unshaded outline shown in the ATPg panel corresponds to the p-SIFt derived from the 9 ATP-only structures. The increase in variability when ATP analogs are introduced is clearly visible.

The green blocks below the p38 p-SIFt denote residues making up the hydrophobic pocket of the kinase. For the 9 ATP complexes, 18 out of 23 contacts between the kinases and the ribose, triphosphate, and adenine moieties were classified as conserved. Moreover, there were no completely variable positions. Interestingly, even for these ATP-only structures, four interactions fell in the intermediate conservation range. Interactions between the γ-phosphate and residues 54 and 55, making up the tip of the glycine-rich loop in the kinases, were dependent on the conformation of this flexible region of the binding site and were observed in only approximately half of the structures. Contact between the β-phosphate of ATP and residue 171 was primarily determined by the conformation of the ATP phosphate groups. In approximately 60% of the structures, the α-β-phosphate pyrophosphate bond was rotated such that the β-phosphate was oriented away from residue 171 and towards the glycine-rich loop (see FIG. 12; PDB code 1atp). It is noteworthy that in several of these structures a water molecule was observed to take the place of the rotated β-phosphate and formed a water-mediated interaction between ATP and residue 171 (see the structures of PDB entries 1atp, 1phk, 2phk, and 1q16 for examples). Finally, contact between the adenine ring of ATP and residue 183 was largely a function of the side-chain identity. No contact was observed for the ATP structures that have Ala at this position (50%), whereas Thr and Val side-chains were able to contact the adenine ring either directly or via a water-mediated interaction.

When the ATP-analogs were considered in addition to the ATP complexes, the degree of variability increased. In particular, interactions with residues 104, 122, and 168 shifted from conserved to variable. The extent of variability is clear when the ATPg p-SIFt is compared to the ATP-only p-SIFt, as shown in FIG. 11. The contacts that were not fully conserved for the ATPg cluster are colored yellow in FIG. 12(a), and fall into the intermediate (~21%) and variable (~33%) ranges. In FIGS. 12(a)-12(c), the binned contact-only p-SIFts for ATPg, CDK2, and p38, respectively, are mapped onto the structure of the complex between ATP and PKA (PDB code 1atp) using the values in Table 3. Conserved interactions are colored green, intermediate interactions are colored yellow, and variable interactions are colored red. FIG. 12(d) highlights key areas of difference in the interaction patterns observed for ATP, CDK2, and p38 identified from the difference profiles. The ATPg p-SIFt reveals a high degree of interaction conservation, as annotated in Table 3 and colored green in FIG. 12(a). Of the 33 contacts observed across the ATPg cluster, ~46% were classified as conserved (see Table 4). The patterns of conserved interactions for ATPg ligands defined an ATP-like binding signature and provided a baseline for comparison when analyzing non-ATP small molecule inhibitors.

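TABLE 4

| Cluster | ALL | ATP | P38 | CDK2 |
|---|---|---|---|---|
| Contact Residues | 56 | 33 | 26 | 29 |
| Conserved | 11 (19.6%) | 15 (45.5%) | 10 (38.5%) | 12 (41.4%) |
| Intermediate | 6 (10.7%) | 7 (21.2%) | 8 (30.8%) | 5 (17.2%) |
| Variable | 39 (69.6%) | 11 (33.3%) | 8 (30.8%) | 12 (41.4%) |
| Unique Conserved | 5 (8.9%) | 9 (27.3%) | 4 (15.4%) | 6 (20.7%) |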

Table 4 shows a summary of conserved, intermediate, and variable interactions observed across all of the 93 kinase structures and for each of the ATPg, p38, and CDK2 structure clusters. The total number of residues interacting with ligands in each group is denoted as “Contact Residues”. The number of conserved interactions beyond the canonical set observed for all ligands appears in the row labeled “Unique Conserved”.

The contact p-SIFts derived for the ATPg, CDK2, and p38 clusters plotted in FIG. 11 measured the degree of interaction conservation for each group of structures. From the p-SIFts, it was evident that CDK2 and p38 inhibitors shared some of the binding interactions observed between ATP and some regions of the kinase domain while displaying marked differences in others. The difference profiles provided insight into how the interaction patterns observed for known kinase inhibitors differed from those detailed above for ATP. Contact-only difference profiles are plotted in FIG. 13 for p38-ATPg (top panel), p38-CDK2 (middle panel), and CDK2-ATPg (bottom panel). The difference plots range from −1 to 1, where a value of 0 indicates that the interaction is conserved to the same degree in the two sets of structures, whereas a value of −1 or 1 denotes that a conserved interaction in one set of structures is not conserved in the other.
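A difference profile of this kind can be sketched as a position-by-position subtraction of two contact-only p-SIFts. The function name `difference_profile` is an assumption; the example values are the p38 and ATP frequencies listed in Table 3 for residues 118, 123, and 52.

```python
def difference_profile(profile_a, profile_b):
    """Position-wise difference A - B between two p-SIFts of equal length."""
    if len(profile_a) != len(profile_b):
        raise ValueError("profiles must have the same length")
    return [a - b for a, b in zip(profile_a, profile_b)]


# p38 and ATP contact frequencies from Table 3 for residues 118, 123, and 52.
p38 = [1.0, 1.0, 0.0]
atp = [0.0, 1.0, 0.9]
diff = difference_profile(p38, atp)
```

Residue 118 (hydrophobic pocket) gives a value of 1, residue 123 (hinge) gives 0, and residue 52 (glycine-rich loop) gives a negative value, matching the interpretation of the plots given above.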

For the p38-ATPg and p38-CDK2 difference profiles, the key distinctions were determined in part by the identity of the residue at position 120. Referred to as the “gatekeeper” residue, it controlled the relative access to the hydrophobic pocket of the ATP site, a region not occupied by ATP. Bulky residues at position 120, such as the Phe in CDK2, restricted access to the hydrophobic pocket, limiting the contacts available to a putative inhibitor. The small Thr “gatekeeper” in p38 rendered the residues making up the hydrophobic pocket accessible to small molecule inhibitors. That small molecule inhibitors of p38 exploit these interactions was clearly evident from the p38 p-SIFt (FIG. 11), which indicated a set of intermediate and conserved interactions corresponding to hydrophobic pocket residues colored magenta in FIG. 12(c). The contrast in interaction with the hydrophobic pocket observed between p38, ATPg, and CDK2 was clearly delineated by the distinct positive differences visible in the p38-ATP and p38-CDK2 difference profiles.

In contrast, the CDK2 p-SIFt was more similar to the ATPg p-SIFt as can be observed in the CDK2-ATP difference profile. Unlike p38, in CDK2 the Phe “gatekeeper” residue blocked access to the hydrophobic pocket. As a result, many of the residues accessible to CDK2 inhibitors were those that also interact with ATP. In fact, all of the conserved residues observed in the CDK2 p-SIFt were also conserved in the ATPg p-SIFt.

The main positive difference regions of the CDK2-ATP difference profile, corresponding to intermediate level conserved interactions in the CDK2 p-SIFt that occur with low frequency in the ATP p-SIFt, are colored white in FIG. 12(d).

Unlike contacts with the hydrophobic pocket, several interactions conserved in the p38 cluster were common to CDK2, as well as other non-ATP inhibitors, and are colored red in FIG. 12(d). Finally, several interactions were conserved for ATPg and were observed with relatively low frequency for CDK2 and p38. These ATPg specific contacts are colored yellow in FIG. 12(d), and involved residues at positions 50-55, which interacted with the ribose and phosphate moieties of ATP, and with residues at positions 168, 170, and 171, in the vicinity of the catalytic loop.

Approximately 20% of the contact interactions were conserved in each of the ATPg, CDK2, and p38 p-SIFts as well as over the 93 structures as a whole. These are denoted in Table 3 by the highlighted annotations and form a canonical set of interactions that were evidently fundamental for kinase binding at the ATP site. Further analysis of the full-length SIFts revealed that this set includes interactions with residues at positions 121 and 123, which were involved in hydrogen bonding to the adenine moiety of ATP; the “gatekeeper” residue; position 57 in the glycine-rich loop; position 70, which for ATP involved hydrophobic interactions between adenine and β3; and position 72, which involved the ATP phosphates interacting with β3. The residues involved in the canonical set of interactions are colored green in FIG. 12(d).

The canonical interactions comprise an essential kinase-binding signature for compounds targeting the ATP binding site. Although as noted in Table 3, additional conserved interactions existed for the ATPg, p38, and CDK2 clusters, the canonical interactions were common to all inhibitors and may be used as a basic kinase-like binding filter in virtual screening.

Example 8

Hierarchical clustering of the SIFts computed from the 93 kinase X-ray structures resulted in the identification of the p38 and CDK2 clusters because they represent two fundamentally different sets of small molecule inhibitors in terms of interactions with the ATP binding site. However, the SIFts within each cluster were not homogeneous. In particular, the p38 cluster revealed interesting details about the relationship between interaction patterns and inhibitor selectivity.

Clustering of the nine structures of the p38 cluster identified three distinct SIFt sub-clusters representing two distinct classes of inhibitors (shown in FIG. 14) displaying overall similar yet unique binding signatures at the ATP binding site. FIG. 14 is a difference profile plot derived from the clustering of p38 inhibitors. Sub-cluster 1 (SC1) corresponds to the well-known pyridinyl imidazole class of inhibitors, whereas sub-cluster 2 (SC2) contains several more recently reported inhibitors. Residues showing key interaction differences between the two classes of inhibitors are labeled on the plot. The contact-only difference profile clearly showed that while the two structurally different classes of inhibitors shared a common set of interactions with the kinase, each class had marked regions where the p-SIFts differed. In particular, sub-cluster 2 had additional contacts with β5, the hinge region, the Mg loop, and the catalytic regions of the kinase, clearly visible in the difference profile. Of particular interest are members of sub-cluster 2 that were more potent inhibitors of p38 and have been reported to exhibit improved selectivity against the kinases Erk and Jnk (see, for example, Fitzgerald, C. E.; et al. Nat. Struct. Biol. 2003, 10, 764-769; and Scapin, G. Drug Discov. Today 2002, 7, 601-611, each of which is incorporated by reference in its entirety). The reason for the improved selectivity of the sub-cluster 2 inhibitors has been proposed to be a peptide bond flip between Met109 and Gly110 in the hinge, induced by the inhibitors and accommodated by the small side chains in p38 relative to Erk and Jnk. The difference profile clearly showed the resulting additional interactions at positions 110, 111, and 112 (p38 numbering) exploited by the sub-cluster 2 inhibitors. In addition to the interactions reported previously as the structural basis for improved sub-cluster 2 inhibitor potency and selectivity, the p-SIFts also revealed additional contacts with the Mg-loop and catalytic loop regions of p38.

Results from the analysis of the p38 cluster illustrated the power of the p-SIFt approach, namely, the ability to quantify the similarities and differences in the interaction patterns of inhibitors to a given target. Moreover, the ability to derive p-SIFts and difference profiles that quantify key conserved interactions can aid in inferring the structural basis for inhibitor potency and selectivity. The detailed binding signature information encoded in the p-SIFts makes them ideal filters for screening virtual libraries, as discussed below.

Example 9

In the preceding examples, clear conservation patterns of interactions for ATPg, p38, and CDK2 clusters have been identified. A canonical set of conserved interactions common to all ligands bound to kinases at the ATP binding site was also identified. These binding signatures can be applied to virtual screening for protein kinase inhibitors.

The success of virtual screening (VS) methodologies is typically cast in terms of enrichment studies designed to measure the percentage of known actives identified as a function of the fraction of the database screened. Often, the results of these studies indicate that the performance of scoring functions is target specific, for example, leading to significant enrichment of actives for docking against the estrogen receptor but performing poorly against kinase targets (see, for example, Halgren, T. A.; et al. J. Med. Chem. 2004, 47, 1750-1759, which is incorporated by reference in its entirety). Unfortunately, knowledge of the optimal scoring function to apply in a virtual screen against a novel target is not available a priori. As a result, it is often necessary to undertake lengthy validation studies to select a suitable scoring function or, alternatively, to construct a customized scoring scheme optimized for the target of interest. These problems are compounded when VS is carried out against multiple targets.

Some of these difficulties can be addressed by applying p-SIFts to the ranking and filtering of VS results. p-SIFts can be used as target-specific molecular filters encoding binding signatures that are consistent with a particular target specific group of known active inhibitors. Moreover, by comparing the SIFt for each docked solution with a kinase-specific, or binding mode specific, p-SIFt, each p-SIFt is in effect a target specific scoring function. A p-SIFt can be applied in a VS workflow tailored to a specific target without having to rely on the ambiguities of energy based scoring.

To this end, the performance of p-SIFt based scoring in a typical database enrichment application using p38 and CDK2 as targets was tested. In addition, the degree to which the ATP, CDK2, and p38 p-SIFts were selective toward observed kinase inhibitor binding modes was also assessed. Finally, for the generation of enrichment curves and selectivity assessment tests, full-length p-SIFts derived from 7-bit SIFts were used.

Three strategies for virtual screening post-processing were explored. For all three strategies a list of docking poses was generated using the program FlexX, and the top 30 poses were retained using the FlexX scoring function. The output obtained from docking N ligands is then an N×M matrix consisting of M poses (here, M=30) for each docked ligand.

The aim of post-processing the docking results was to arrive at a rank ordered list of N ligands, consisting of a single pose per ligand, which is enriched with actives. In general, the degree of enrichment depends on the success of the post-processing strategy.

All three strategies required the selection of a single pose per ligand and the subsequent ranking of those ligands. One approach utilized energy-based scoring functions both for selecting the top pose per ligand and for ordering the ligand list. This strategy is referred to as Traditional scoring. The second approach involved using a p-SIFt (instead of an energy function) both to select the single best pose per ligand and to order the ligand list, an approach referred to as p-SIFt scoring. The final approach was a hybrid of the two, in which the p-SIFt was used to filter out undesirable poses, and then an energy-based scoring function was used to select the best pose per ligand and to create an ordered list of ligands. This strategy is called Hybrid scoring. Other strategies that make use of a p-SIFt can be used.

In all three cases, the overall post-processing scheme used to select a single pose per ligand from the poses generated by docking consisted of four general steps, namely,

(a) re-scoring: each pose generated (N×30) is scored using standard scoring functions and p-SIFts;

(b) filtering: unrealistic poses are removed;

(c) final pose selection: a single pose is selected for each ligand; and

(d) ranking: the N ligands are rank ordered.

The three scoring and post-processing strategies carried out steps (a)-(d) in different ways, as summarized in Table 5. For the Traditional scoring and Hybrid scoring schemes, docked poses were re-scored using several widely applied scoring functions computed using the Cscore utility in Sybyl. For the p-SIFt scoring and Hybrid scoring protocols, a value of Z between the SIFt for the pose and the target p-SIFt, Z_target (where target is CDK2 or p38), was also computed.

TABLE 5

| Post-processing Method | Traditional Scoring | p-SIFt Scoring | Hybrid Scoring |
|---|---|---|---|
| A. Re-scoring | ChemScore, Gscore, PMF Score, Dscore, and Consensus Score | Z_target | ChemScore, Gscore, PMF Score, Dscore, and Consensus Score; Z_target |
| B. Filtering | none | none | Z_CDK2 ≧ 4.5; Z_p38 ≧ 5.0; Canonical Interactions |
| C. Final Pose Selection | scoring function | Z_target | scoring function |
| D. Ligand Ranking | scoring function | Z_target | scoring function |

Table 5 summarizes the post-processing schemes applied to score the ligand poses generated from the docking experiments. In Table 5, scoring function refers to one of ChemScore, Gscore, PMF Score, Dscore, or Consensus Score. See, for example, Eldridge, M.; et al. J. Comput. Aided Mol. Des. 1997, 11, 425-445; Jones, G.; et al. J. Mol. Biol. 1997, 267, 727-748; Muegge, I.; and Martin, Y. C. J. Med. Chem. 1999, 42, 791-804; Meng, C.; et al. J. Comp. Chem. 1992, 13, 505-524; and Charifson, P. S.; et al. J. Med. Chem. 1999, 42, 5100-5109, each of which is incorporated by reference in its entirety. For both the Traditional scoring and Hybrid scoring schemes, the same scoring function was used for final pose selection (step (c)) and ligand ranking (step (d)).

The filtering step (b) applied in the Hybrid scoring scheme retained only poses having Z_CDK2 ≧ 4.5 or Z_p38 ≧ 5.0, for VS against CDK2 or p38, respectively. The Z_target cutoffs were chosen to be at the lowest value of the Z_target distribution observed for the CDK2 and p38 X-ray structures. In addition, a canonical interaction filter was applied to each pose such that SIFts not satisfying the subset of interactions having an interaction frequency of 1 in the all-kinase p-SIFt were removed.
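The two filters can be sketched as below. The sketch assumes that poses with a Z_target value below the cutoff are the ones discarded, which is how the cutoff choice is read here, and that a pose must have every canonical bit ON to pass. The function names and toy data are illustrative assumptions.

```python
def canonical_positions(all_kinase_profile):
    """Indices whose interaction frequency is exactly 1 in the all-kinase p-SIFt."""
    return [i for i, f in enumerate(all_kinase_profile) if f == 1.0]


def passes_filters(sift, z_target, z_cutoff, canonical):
    """Return True if a pose survives both the Z cutoff and the canonical filter."""
    if z_target < z_cutoff:
        return False  # assumed direction: low-Z poses are discarded
    # Every canonical interaction must be present in the pose's SIFt.
    return all(sift[i] == 1 for i in canonical)


# Made-up all-kinase profile and three poses screened against p38 (cutoff 5.0).
all_kinase = [1.0, 0.9, 1.0, 0.2]
canonical = canonical_positions(all_kinase)
keep_a = passes_filters([1, 0, 1, 0], z_target=5.2, z_cutoff=5.0, canonical=canonical)
keep_b = passes_filters([1, 1, 0, 0], z_target=6.0, z_cutoff=5.0, canonical=canonical)
keep_c = passes_filters([1, 0, 1, 1], z_target=4.0, z_cutoff=5.0, canonical=canonical)
```

Only the first pose passes: the second misses a canonical interaction and the third falls below the Z cutoff.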

Incorrect ligand poses can be eliminated during the filtering step from the pool of poses that will be considered for final selection. The aim is to reduce the number of false positive poses while retaining all plausible true positive poses. The filtering step is optional and, for comparison purposes, was omitted in order to generate results based only on scoring functions (Traditional scoring) and only on p-SIFts (p-SIFt scoring).

Step (c) involves selecting a single pose per ligand from the set of poses that have passed all of the filters, if any, applied in step (b). Enrichment curves and factors were computed by rank ordering (step (d)) the final set of ligand poses using the schemes outlined in Table 5.
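The three post-processing strategies can be sketched together as follows. The sketch assumes that each pose has already been re-scored (step (a)) and carries one energy-type score and one Z_target value, that higher values are better for both, and that the Hybrid filter is reduced to the Z cutoff alone (the canonical interaction filter is omitted for brevity). All names and data are illustrative.

```python
def best_pose_per_ligand(poses, key):
    """Step (c): keep the highest-scoring pose for each ligand."""
    best = {}
    for pose in poses:
        ligand = pose["ligand"]
        if ligand not in best or key(pose) > key(best[ligand]):
            best[ligand] = pose
    return best


def rank_ligands(poses, strategy, z_cutoff=None):
    """Steps (b)-(d) for the Traditional, p-SIFt, and Hybrid schemes."""
    if strategy == "traditional":
        pool, key = poses, (lambda p: p["score"])
    elif strategy == "psift":
        pool, key = poses, (lambda p: p["z"])
    elif strategy == "hybrid":
        # Step (b): filter on Z_target, then fall back to the energy score.
        pool = [p for p in poses if p["z"] >= z_cutoff]
        key = lambda p: p["score"]
    else:
        raise ValueError("unknown strategy: " + strategy)
    best = best_pose_per_ligand(pool, key)
    # Step (d): order ligands by the score of their selected pose.
    return sorted(best, key=lambda ligand: key(best[ligand]), reverse=True)


# Made-up re-scored poses for three ligands.
poses = [
    {"ligand": "A", "score": 9.0, "z": 1.0},
    {"ligand": "A", "score": 5.0, "z": 6.0},
    {"ligand": "B", "score": 7.0, "z": 5.5},
    {"ligand": "C", "score": 8.0, "z": 2.0},
]
traditional = rank_ligands(poses, "traditional")
psift = rank_ligands(poses, "psift")
hybrid = rank_ligands(poses, "hybrid", z_cutoff=5.0)
```

In this toy case ligand C drops out of the Hybrid list entirely because none of its poses passes the filter, and ligand A is represented by a different pose under each scheme.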

A database containing known inhibitors of both p38 and CDK2 and a background of 1000 diverse commercially available compounds was docked against the X-ray structures of CDK2 (PDB code 1di8) and p38 (PDB code 1a9u). The ability of the p-SIFt VS protocol to identify known actives was quantified by computing enrichment curves. The enrichment curves plot the percentage of actives recovered as a function of the percentage of the database screened. Enrichment curves and cumulative enrichment factors for p38 are presented in FIG. 15(a), comparing the Traditional and p-SIFt scoring approaches. The ChemScore and PMF curves were obtained using the Traditional scoring scheme, using the ChemScore and PMF scoring functions, respectively, for both final pose selection and ligand ranking. Other scoring functions performed similarly under the Traditional scoring scheme.
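An enrichment curve of the kind described can be computed from a ranked ligand list as sketched below. The enrichment factor shown uses a common definition (hit rate in the top fraction divided by the hit rate in the whole database), which is an assumption since the exact definition is not given here. Function names and data are illustrative.

```python
def enrichment_curve(ranked_ligands, actives):
    """Return (percent of database screened, percent of actives recovered) pairs."""
    actives = set(actives)
    n_total = len(ranked_ligands)
    n_active = len(actives)  # assumes every active appears in the ranked list
    found = 0
    curve = []
    for i, ligand in enumerate(ranked_ligands, start=1):
        if ligand in actives:
            found += 1
        curve.append((100.0 * i / n_total, 100.0 * found / n_active))
    return curve


def enrichment_factor(ranked_ligands, actives, fraction):
    """Hit rate in the top fraction divided by the hit rate in the whole list."""
    actives = set(actives)
    n_top = max(1, round(fraction * len(ranked_ligands)))
    hits = sum(1 for ligand in ranked_ligands[:n_top] if ligand in actives)
    return (hits / n_top) / (len(actives) / len(ranked_ligands))


# Made-up ranking of ten ligands containing two actives (a1, a2).
ranked = ["a1", "d1", "a2", "d2", "d3", "d4", "d5", "d6", "d7", "d8"]
actives = {"a1", "a2"}
curve = enrichment_curve(ranked, actives)
ef_20 = enrichment_factor(ranked, actives, 0.2)
```

Here both actives are recovered after screening 30% of the list, and the top 20% of the list is enriched 2.5-fold over random selection.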

From FIG. 15(a) it was clear that p-SIFt scoring provided markedly better enrichment than Traditional scoring using the ChemScore and PMF functions. Moreover, there was little difference between these two functions over the first 15% of the database. In contrast, p-SIFt scoring performed close to the ideal enrichment curve over the first 2% of the database. In other words, 14 of the 16 known p38 actives were in the top 20 ranked ligands. Upon examination of the docking poses it was discovered that for two inhibitors correct poses were never generated in the initial pose pool. The p-SIFt scoring method required a pose having a correctly docked binding mode to generate a high Z_p38 value, unlike Traditional scoring, which can generate high scores even for poses that bind incorrectly. Generating enrichments for the right reasons is an advantage of the p-SIFt scoring approach. For p38, the Hybrid scoring scheme offered no improvement in enrichment over that obtained from using p-SIFt scoring.

Enrichment curves were derived using Traditional scoring, p-SIFt scoring, and Hybrid scoring, and are presented in FIG. 15(b) for docking against CDK2. The Traditional PMF, ChemScore, GScore, and DScore curves were obtained using the Traditional scoring scheme, using the indicated scoring function for both final pose selection and ligand ranking. The Hybrid PMF, ChemScore, GScore, and DScore curves were obtained by applying the Hybrid scoring scheme using the indicated function for both final pose selection and ligand ranking. Strikingly, all of the Hybrid scoring scheme variants performed better than the Traditional and p-SIFt schemes irrespective of which scoring function was used for pose selection and ranking. It appeared that once the majority of the incorrect poses that contribute to false positive scores were filtered out, the differences between scoring functions visible in the results using these functions alone (Traditional scoring) were factored out. Enrichments obtained using p-SIFt scoring were comparable to Traditional scoring up to 6% of the database screened and significantly better at higher levels.

Attaining database enrichments for CDK2 comparable to those obtained for p38 was a considerably more challenging task for VS. The large gatekeeper residue in CDK2 restricted the number of residues accessible in the ATP binding site. The p-SIFt for CDK2 sampled fewer residues compared to the p-SIFt for p38 and conserved interactions were distributed over a relatively small spatial region. As a result, for CDK2 there were fewer constraints to generate ligand placements and it was therefore easier to generate poses that satisfied conserved interactions in CDK2. In effect, the CDK2 p-SIFt was less selective against false poses as evidenced by the poorer performance of p-SIFt scoring for CDK2 versus p38.
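The enrichment curves discussed above plot the fraction of known actives recovered against the fraction of the ranked database screened. The following is a minimal sketch of that calculation, not the procedure actually used to produce FIG. 15; the function name `enrichment_curve` and the toy ranking are hypothetical and serve only as illustration.

```python
def enrichment_curve(ranked_is_active):
    """Return (fraction_screened, fraction_of_actives_found) for each rank.

    ranked_is_active: list of booleans ordered from best-scored to
    worst-scored ligand; True marks a known active.
    """
    total = len(ranked_is_active)
    n_actives = sum(1 for flag in ranked_is_active if flag)
    if total == 0 or n_actives == 0:
        raise ValueError("need a non-empty ranking with at least one active")

    points = []
    found = 0
    for rank, is_active in enumerate(ranked_is_active, start=1):
        if is_active:
            found += 1
        points.append((rank / total, found / n_actives))
    return points


# Hypothetical toy ranking: 10 ligands, known actives at ranks 1 and 4.
toy_ranking = [True, False, False, True, False,
               False, False, False, False, False]
curve = enrichment_curve(toy_ranking)
for fraction_screened, fraction_found in curve:
    print(f"{fraction_screened:.1f} {fraction_found:.1f}")
```

An ideal ranking places every active ahead of every inactive, so its curve reaches 1.0 after screening only the fraction of the database that the actives occupy.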

Example 10

The difference profiles presented in FIG. 13 revealed clear regions where ATPg, CDK2, and p38 inhibitors bound to kinases in unique ways. These observations suggested that p-SIFts can be used to model the selectivity of inhibitors based on the types of interactions they are able to satisfy when binding to the kinase. In order to validate the use of p-SIFts as selectivity filters, a self-recognition experiment using the set of 93 X-ray structures as a test data set was carried out. For this purpose, p-SIFts were derived for p38, CDK2 and ATP where ~50% of the structures for each group were set aside and not used to derive the p-SIFt. For each p-SIFt, Z_target values were then computed against all 93 kinase structures to assess the ability of p-SIFts to recognize members of their own group. For the p-SIFts to serve as effective molecular filters, the p38 p-SIFt needed to generate statistically significantly higher Z_p38 values against the X-ray structures of the p38 cluster relative to the remaining structures, whereas the CDK2 and ATPg p-SIFts should perform similarly against the CDK2 and ATPg structures, respectively.

The results of the self-recognition experiment are shown in FIG. 16. FIG. 16 presents box plots of Z_target distributions obtained for the ATPg, p38, and CDK2 cluster subsets generated against all kinases in the 93 X-ray structure set in panels (a)-(c), respectively. The right and left arrows indicate the mean and the median, respectively, of the distribution; the vertical error bars delineate the upper and lower bounds of the data; the horizontal bars represent individual data points. The box outlines the upper and lower quartiles of the distribution.

It was clear from FIG. 16 that for each p-SIFt, the distribution for the corresponding set of target structures was shifted towards higher Z-scores. Considering ATPg first, the top scoring 26% of the total 93 structures were ATPg cluster members, making up approximately 65% of all of the ATPg. The remaining ATPg structures fell into a region of the distribution that overlapped with the distribution for CDK2, p38, and the remaining structures. Interestingly, the overlap in the distributions can be rationalized in terms of the p-SIFt similarities discussed above. Because of the similarity between the ATPg and CDK2 p-SIFts, 90% of the CDK2 structures overlapped in Z with the lowest scoring 35% of the ATPg. This overlap existed primarily because the ATPg p-SIFt was in essence derived from a subset of the interactions sampled by CDK2. However, the differences between the ATP and CDK2 interaction patterns were captured in the CDK2 p-SIFt. Consequently, the highest segment in the distribution shown in FIG. 16(c) contained 19 of 20 CDK2 structures and overlapped with only 2 ATPg structures.

The greatest separation in Z-score distributions was obtained for p38 (FIG. 16(b)), primarily due to p-SIFt features reflecting conserved residues in the hydrophobic pocket of the ATP binding site. The p38 structures fell into two groups in the distribution, dividing neatly between the highest scoring sub-cluster 1 structures, used to derive the p-SIFt, and slightly lower scoring sub-cluster 2 examples. The latter set overlapped in Z_p38 with the X-ray structures 1qpe, 1pme, and 3erk. Both 1pme and 3erk are structures of pyridinyl imidazole compounds complexed with variants of the kinase ERK2, whereas 1qpe is a structure of PP2 complexed with the kinase Lck. As in p38, the “gatekeeper residue” in Lck was also Thr, and the two kinases had relatively similar ATP binding sites. These examples highlight the fact that the p-SIFts were able to capture similarities in interaction patterns arising from ligand similarity (e.g., 1pme and 3erk), and from binding site similarity. Finally, using a multi-structure p-SIFt rather than a single structure p-SIFt (derived from the 1a9u structure) yielded improved separation between Z-score distributions.

Example 11

Several chemical libraries and ensembles of docking poses were generated for analysis. The crystal structure of MAP kinase p38 (PDB accession code: 1ouk) was used as the target molecule in all of the virtual screening experiments (see, for example, Fitzgerald, C.E.; et al., Nat. Struct. Biol., 2003, 10, 764-769, which is incorporated by reference in its entirety). The first library of docking poses was used to demonstrate the ability of r-SIFt to efficiently organize and visualize various binding modes. The pyridinyl imidazole inhibitor co-crystallized with p38 in the 1ouk structure, “1ouk-inh”, which has been identified as a very selective and potent p38 inhibitor, was docked with p38. The 150 poses with the highest CScores were retained for subsequent analysis. The docking experiment was carried out with FlexX in Sybyl. The ligand binding site was defined using a cutoff radius of 10 Å from the 1ouk-inh ligand (i.e., the conformation in the crystal structure) combined with a core sub-pocket cutoff distance of 4 Å. The FlexX scoring function was used during docking. Five different scoring functions, including Fscore, ChemScore, Gscore, PMF Score, Dscore, and Consensus Score were used as voting scores in the Cscore utility in Sybyl. FIG. 17a shows these 150 1ouk-inh poses generated from the docking experiment. The poses displayed a variety of binding modes in the active site of p38.

In order to compare and contrast the r-SIFt patterns of different compound structures, docking experiments were performed for five chemically distinct compounds (FIG. 18), using the same FlexX docking procedure. These compounds were: 1) 1ouk-inh, whose co-crystal structure is available (PDB code: 1ouk); 2) SB203580, a well-known pyridinyl imidazole p38 inhibitor (see also Examples 1-3 above); 3) SKF-86002, a compound discovered by SmithKline Beecham; 4) Amgen-10, an Amgen compound with a 2-methyl-6-carboxy-pyrimidine ring as the core; 5) Cmp-59076 (2-[1,3]Dithietan-2-pyridin-4-yl-1-(4-trifluoromethoxy-phenyl)-ethanone), a molecule that exhibits no p38 inhibition activity (see, e.g., Adams, J. L.; and Lee, D. Curr. Opin. Drug Disc. Dev. 1999, 2, 96-109; and WO 00/31063A1, each of which is incorporated by reference in its entirety). Except for Cmp-59076, the compounds are known to be potent p38 inhibitors. FIG. 18 shows the 2D chemical structures of these compounds, including our definition of their cores and the variable R-groups. For each docking experiment, the 10 poses of each compound with the highest Cscores were retained for subsequent analysis.

In addition, to test the r-SIFt based library filtering strategy, four different combinatorial libraries were enumerated using three distinct p38 inhibitors as template scaffolds: 1ouk-inh, SKF-86002, and Amgen-10. In order to simplify the analysis and to make the results more interpretable, only one R-group in each library was varied. A common monomer library containing 10,000 aryl bromides supplied the reactants for the enumeration of these libraries (see, for example, ACD: Available Chemical Directory (version 2004.2), MDL Information Systems: San Leandro, CA). Three libraries were generated by varying the R-1 group of the respective templates. The fourth library was enumerated by varying the R-2 group of 1ouk-inh. Based on the co-crystal structures of 1ouk and other similar inhibitors (1a9u, 1b16, 1b17, 1bmk, 1ouk, etc.), in the “native binding mode”, the R-1 groups were expected to interact with the hydrophobic pocket of p38 (see, for example, Radzio-Andzelm, E.; and Taylor, S. S. Structure, 1994, 2, 345-355, which is incorporated by reference in its entirety). The R-2 portion of 1ouk-inh, on the other hand, was positioned in the vicinity of the adenine binding site in the hinge region. These four libraries were named 1ouk-inh-R1, SKF-86002-R1, amgen-10-R1 and 1ouk-inh-R2, respectively. Library enumeration processes were carried out using Pipeline Pilot (Pipeline Pilot™ (version 3.0), Scitegic Inc., San Diego, Calif., U.S.A.). All the reaction products were pre-filtered by removing salts, inorganic compounds as well as molecules with molecular weight less than 400. From the remaining library, a subset of molecules (maximum number 2,000) with maximal chemical diversity was sampled for further analysis. The total numbers of selected compounds in each library were: 1ouk-inh-R1, 2208; 1ouk-inh-R2, 2450; SKF-86002-R1, 2442; amgen-10-R1, 1750.

These four libraries were docked onto the p38 target molecule (1ouk), using the same docking procedure. The docking experiments were able to reproduce the native co-crystal structure of 1ouk-inh, with an RMSD less than 0.4 Å, confirming the validity of the docking procedure.

Calculation of 2-D Descriptors of the Ligands

Molecular descriptors of the R-group monomers (after substituting the bromide with a hydrogen atom) were calculated using Pipeline Pilot™. In order to make the method more amenable to large libraries, the time-consuming calculation of 3D descriptors was omitted. A total of 37 2D descriptors were generated.

The molecular descriptor set was further processed by removing variables with little or no variance across the whole library. In addition, descriptors with high redundancy and multicollinearity were removed. This cleaning step was carried out using the unsupervised forward selection (UFS) algorithm with the stopping criteria of R_max² (i.e., the squared multiple correlation coefficient, SMCC) cutoff = 0.95 and a minimum standard deviation of variables = 0.05 (see, for example, Whitley, D. C.; et al., J. Chem. Inf. Comput. Sci., 2000, 40, 1160-1168, which is incorporated by reference in its entirety). The final non-redundant set of descriptors contains 26 descriptors, including: F_COUNT, P_COUNT, S_COUNT, CL_COUNT, BR_COUNT, ALOGP, MOLECULAR_POLARSURFACEAREA, NUM_H_ACCEPTORS, NUM_H_DONORS, NUM_ATOMS, NUM_HYDROGENS, NUM_POSITIVEATOMS, NUM_ROTATABLEBONDS, NUM_BRIDGEBONDS, NUM_RINGS, NUM_AROMATICRINGS, NUM_RINGASSEMBLIES, NUM_CHAINS, NUM_CHAINASSEMBLIES, NUM_STEREOBONDS, NUM_UNKNOWNSTEREOBONDS, NUM_ATOMCLASSES, LOGD, and MOLECULAR_WEIGHT.
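The effect of this cleaning step can be illustrated with a simplified stand-in. The sketch below is not the UFS algorithm: UFS tests each candidate against the squared multiple correlation with all previously selected descriptors, whereas this version only checks pairwise squared correlation. The function names and the toy descriptor values are hypothetical; the two thresholds (0.05 and 0.95) are taken from the text above.

```python
import math


def std_dev(values):
    """Population standard deviation of a list of numbers."""
    mean = sum(values) / len(values)
    return math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))


def r_squared(x, y):
    """Squared Pearson correlation between two equal-length lists."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return (sxy * sxy) / (sxx * syy)


def filter_descriptors(table, min_std=0.05, max_r2=0.95):
    """Keep descriptors that vary enough and are not redundant.

    table: dict mapping descriptor name -> list of values (one per compound).
    Pairwise simplification of the redundancy test, NOT the full UFS method.
    """
    kept = []
    for name, values in table.items():
        if std_dev(values) < min_std:
            continue  # little or no variance across the library
        if any(r_squared(values, table[other]) > max_r2 for other in kept):
            continue  # nearly collinear with a descriptor already kept
        kept.append(name)
    return kept


# Hypothetical toy data for four R-group monomers.
toy_table = {
    "NUM_ATOMS": [10, 12, 15, 20],
    "MOLECULAR_WEIGHT": [140.0, 168.0, 210.0, 280.0],  # exactly 14 x NUM_ATOMS
    "ALOGP": [1.2, 0.4, 2.9, 1.1],
    "P_COUNT": [0, 0, 0, 0],  # constant, so it carries no information
}
selected = filter_descriptors(toy_table)
print(selected)
```

In this toy case the constant descriptor and the one that is an exact multiple of another are dropped, leaving two descriptors.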

Generation of r-SIFts

The panel of 56 residues of p38 previously identified as the kinase ligand binding site was used as the reference frame for r-SIFt construction. These residues are located in the vicinity of the ATP binding pocket in the cleft between the N-terminal and C-terminal domains, as well as at the substrate-binding site. See above.

The implementation of r-SIFt used was based on contact distance between heavy atoms of a residue and different fragments of the ligands. A four-bit-long binary bit string was used to represent the interactions involved in each binding site residue, each bit representing whether or not a particular fragment (core, R-1, R-2 or R-3) is within a certain distance cutoff of the particular residue. In the case of SKF-86002 and Cmp-59076, three bits were used, as these compounds do not have an R-3. The distance cutoff was set to 3.5 Å. If any heavy atom of a particular fragment was within 3.5 Å of any heavy atom of the residue, then this particular bit was turned on (1); otherwise this bit remained off (0). The final fingerprints were constructed by concatenating all 56 of these small bit strings together in ascending residue number order. The total length of each r-SIFt pattern was 56×4=224 bits, except for SKF-86002-R1, in which R-3 was absent. The length of the r-SIFts in SKF-86002-R1 was 56×3=168 bits.
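The contact-based construction described above can be sketched in a few lines. This is an illustrative sketch only, assuming heavy-atom coordinates are already available as (x, y, z) tuples in Å; the function names, the fragment labels, and the two-residue toy data are hypothetical and are not taken from the actual implementation.

```python
import math

CUTOFF = 3.5  # contact distance cutoff in angstroms, as stated in the text
FRAGMENTS = ("core", "R1", "R2", "R3")


def in_contact(atoms_a, atoms_b, cutoff=CUTOFF):
    """True if any heavy atom in atoms_a lies within cutoff of any in atoms_b."""
    return any(math.dist(a, b) <= cutoff for a in atoms_a for b in atoms_b)


def r_sift(residues, ligand_fragments, fragment_order=FRAGMENTS):
    """Build an r-SIFt-style bit list.

    residues: dict mapping residue number -> list of heavy-atom coordinates.
    ligand_fragments: dict mapping fragment name -> list of heavy-atom
    coordinates. A fragment missing from the ligand always yields 0.
    Bits are concatenated in ascending residue number order, one bit per
    fragment per residue.
    """
    bits = []
    for residue_number in sorted(residues):
        for fragment in fragment_order:
            fragment_atoms = ligand_fragments.get(fragment, [])
            contact = in_contact(fragment_atoms, residues[residue_number])
            bits.append(1 if contact else 0)
    return bits


# Hypothetical toy system: two residues with one heavy atom each.
toy_residues = {53: [(0.0, 0.0, 0.0)], 30: [(10.0, 0.0, 0.0)]}
toy_ligand = {
    "core": [(1.0, 0.0, 0.0)],
    "R1": [(9.0, 0.0, 0.0), (20.0, 0.0, 0.0)],
    "R2": [(30.0, 0.0, 0.0)],
    "R3": [(0.0, 3.0, 0.0)],
}
fingerprint = r_sift(toy_residues, toy_ligand)
print(fingerprint)  # residue 30 bits first, then residue 53
```

With 56 residues and four fragments this scheme gives 56×4=224 bits; passing a three-fragment order such as ("core", "R1", "R2") gives 56×3=168 bits.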

Analysis and Clustering of r-SIFts

The Tanimoto coefficient was used as the similarity measurement between two r-SIFts. For the ensemble of 150 1ouk-inh poses, 1ouk-inh-R1 and 1ouk-inh-R2, the co-crystal structure of the inhibitor was used as the reference structure. For SKF-86002 and Amgen-10, no co-crystal structure was available. The best docking poses (i.e., with top FlexX scores) for these compounds were examined, and a best pose was selected for each. These two best poses were consistent with the expected binding modes as observed in the co-crystal structures of similar inhibitors (1ouk, 1a9u, 1b16, 1b17, 1bmk, 1ove, etc.) and made all the conserved interactions with the target that were observed in other p38 structures. These were used as the reference structures. An agglomerative hierarchical clustering was applied to analyze and reorganize each library of poses, using Tanimoto coefficients as the similarity measurement. Clusters of protein-ligand complex structures were selected based on the dendrogram of their r-SIFts.
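For two bit strings, the Tanimoto coefficient is the number of bits set in both divided by the number of bits set in either. The sketch below shows that calculation together with a naive agglomerative clustering on the distance 1 − Tanimoto. The linkage rule is not specified in the text, so average linkage is an assumption here, as are the function names, the stopping rule (a fixed number of clusters), and the toy fingerprints.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto coefficient of two equal-length 0/1 bit lists."""
    if len(fp_a) != len(fp_b):
        raise ValueError("fingerprints must have the same length")
    both = sum(1 for a, b in zip(fp_a, fp_b) if a and b)
    either = sum(1 for a, b in zip(fp_a, fp_b) if a or b)
    # Convention assumed here: two all-zero fingerprints score 0.0.
    return both / either if either else 0.0


def agglomerate(fingerprints, n_clusters):
    """Naive average-linkage agglomerative clustering (assumed linkage).

    Returns a list of clusters, each a list of indices into fingerprints.
    """
    clusters = [[i] for i in range(len(fingerprints))]

    def average_distance(c1, c2):
        total = sum(1.0 - tanimoto(fingerprints[i], fingerprints[j])
                    for i in c1 for j in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > n_clusters:
        # Find the closest pair of clusters and merge them.
        candidates = [
            (average_distance(clusters[i], clusters[j]), i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        ]
        _, i, j = min(candidates)
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters


# Hypothetical toy fingerprints representing two binding modes.
toy_fps = [
    [1, 1, 0, 0],
    [1, 1, 1, 0],
    [0, 0, 1, 1],
    [0, 0, 0, 1],
]
toy_clusters = agglomerate(toy_fps, n_clusters=2)
print(toy_clusters)
```

A production analysis would normally keep the full merge history so that a dendrogram can be drawn, rather than stopping at a fixed cluster count as this sketch does.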

Combining SIFt-based approaches and conventional scoring functions can yield better results in reproducing the true binding modes of the compounds and better library enrichment performance. When docking known p38 inhibitors, the best pose given by a conventional scoring function may not adopt the native binding mode; however, a good placement with the correct binding mode usually can be found among the top 10 poses. For p38 inhibitors, retaining the top 10 poses and then selecting the poses with the best binding mode based on SIFt similarities gave much better enrichment performance than using the conventional scoring function alone. Here, a similar strategy was applied to process the docking results of the combinatorial libraries. The r-SIFt patterns were calculated for the best poses (i.e., with best FlexX scores) of each compound. Tanimoto coefficients were calculated against the r-SIFt of the reference structure (either the co-crystal structure or the best predicted pose as described above). The pose with the highest Tanimoto coefficient was selected as the best pose for this compound and used in subsequent ranking or hierarchical clustering. All hierarchical clustering calculations of the r-SIFts were carried out using Spotfire™.
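The two-stage pose selection described above (keep the top-scoring poses, then pick the one most similar to the reference fingerprint) can be sketched as follows. The function names and the toy data are hypothetical, and the sketch assumes that a larger docking score means a better pose; scores from a function where lower is better would need to be negated first.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto coefficient of two equal-length 0/1 bit lists."""
    both = sum(1 for a, b in zip(fp_a, fp_b) if a and b)
    either = sum(1 for a, b in zip(fp_a, fp_b) if a or b)
    return both / either if either else 0.0


def select_pose(poses, reference_fp, top_n=10):
    """Pick the pose most similar to the reference among the top_n by score.

    poses: list of (score, fingerprint) tuples for one compound.
    Assumption: higher score = better pose.
    """
    if not poses:
        raise ValueError("no poses supplied")
    # Stage 1: retain the top_n poses by conventional docking score.
    shortlist = sorted(poses, key=lambda pose: pose[0], reverse=True)[:top_n]
    # Stage 2: choose the shortlisted pose closest to the reference r-SIFt.
    return max(shortlist, key=lambda pose: tanimoto(pose[1], reference_fp))


# Hypothetical toy data: the best-scored pose has the wrong binding mode.
reference = [1, 1, 0, 1]
toy_poses = [
    (9.5, [0, 0, 1, 0]),
    (8.0, [1, 1, 0, 1]),
    (7.0, [1, 0, 0, 1]),
    (1.0, [1, 1, 0, 1]),
]
chosen = select_pose(toy_poses, reference, top_n=3)
print(chosen)
```

Setting `top_n=1` reduces the scheme to conventional scoring alone, which in this toy case returns the mis-docked pose.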

Construction of Decision Tree Classification Models

Hierarchical clustering grouped poses into different clusters according to their binding modes. By visual inspection, the cluster in which compounds adopt the native binding mode was identified. These compounds were classified as native; that is, they were “dockable” hits, because they were predicted by the docking program to be able to interact with the target molecule in a way similar to known active inhibitor(s). Compounds having a predicted binding mode different from the native structure were classified as non-native.

After classifying the compounds, decision tree models were generated using CART™ (version 5, Salford Systems; see, for example, Steinberg, D.; and Colla, P. CART: tree-structured non-parametric data analysis. San Diego, CA: Salford Systems, 1995). The non-redundant set of 2-D descriptors was used as predictive variables, and the binding mode class (native or non-native) as the target variable. The decision trees were formed from a set of nodes and leaves (end nodes). Each node contains a bifurcation of the path based on the value of a particular descriptor. The trees were generated using tenfold cross-validation, randomly assigning 90% of the data points as the training set and 10% as the testing set. Equal weights were applied to both native and non-native classes. The performance of the model was measured by prediction accuracies for both classes in the training set as well as in the test set.
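The building blocks of such a model can be illustrated without the CART™ software. The sketch below is not the CART™ implementation: it shows a single Gini-impurity split on one descriptor (the kind of bifurcation each tree node holds), a random tenfold partition, and per-class accuracy. All function names and the toy surface-area values are hypothetical.

```python
import random


def gini(labels):
    """Gini impurity of a two-class label list (native vs. non-native)."""
    if not labels:
        return 0.0
    p_native = sum(1 for label in labels if label == "native") / len(labels)
    return 2.0 * p_native * (1.0 - p_native)


def best_split(values, labels):
    """Threshold t on one descriptor minimizing the weighted Gini impurity.

    Compounds with value <= t go left, the rest go right. Returns None when
    all values are identical and no split is possible.
    """
    best_score, best_threshold = float("inf"), None
    for threshold in sorted(set(values))[:-1]:
        left = [l for v, l in zip(values, labels) if v <= threshold]
        right = [l for v, l in zip(values, labels) if v > threshold]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if score < best_score:
            best_score, best_threshold = score, threshold
    return best_threshold


def tenfold_indices(n_items, seed=0):
    """Randomly partition item indices into 10 folds (each ~10% of the data)."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    return [indices[k::10] for k in range(10)]


def class_accuracy(truth, predicted, label):
    """Fraction of compounds of the given true class predicted correctly."""
    hits = [p == label for t, p in zip(truth, predicted) if t == label]
    return sum(hits) / len(hits) if hits else float("nan")


# Hypothetical toy descriptor: R-1 surface area, smaller for native binders.
areas = [50, 60, 70, 80, 150, 160, 170, 180]
classes = ["native"] * 4 + ["non-native"] * 4
threshold = best_split(areas, classes)
predictions = ["native" if a <= threshold else "non-native" for a in areas]
print(threshold, class_accuracy(classes, predictions, "native"))
```

Reporting accuracy separately for each class, as done here, avoids a misleadingly high overall figure when one class dominates a library.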

Organization of the 1ouk-inh Docking Poses Ensemble

150 poses of 1ouk-inh docked to p38 were generated for r-SIFt analysis. FIG. 17a shows the placements of these poses, which vary considerably in their binding modes. A hierarchical clustering of the r-SIFt patterns is shown in FIG. 17b. The dendrogram clearly reveals four major clusters (clusters 1-4), each of which represents a distinct binding pattern, as shown in FIGS. 17c-17f.

In addition to its sensitivity to binding mode variations, r-SIFt provided a method for easy visualization and interpretation of how molecules bind to an active site. FIG. 17b displays the re-organized r-SIFt patterns as a heat map. Different types of interaction bits pertinent to different fragments of the compounds (core, R-1, R-2 and R-3) are colored differently in the heat map. Since the bits in the fingerprints were arranged in the same ascending residue number order, one can use this r-SIFt heat map to visually reconstruct the overall orientation and position of the molecule at the active site, that is, which fragment of the molecule interacts with which region of the target molecule. Cluster 2 is the native cluster (FIG. 17d). Within this cluster, the R-1 groups (blue bits in the heat map) occupied and interacted extensively with the hydrophobic pocket of p38, which is located at the back of the ATP binding site and comprises some residues in a sequence region spanning from β3 to β5 (including αC). This binding information was revealed in the fingerprint heat map, as blue bits (representing the R-1 fragment) appeared in the region pertinent to the hydrophobic pocket. The R-2 groups, on the other hand, interacted with the adenine binding site in the hinge region; therefore the majority of the purple bits (representing R-2) appeared in the hinge region. The R-3 group (green bits) in this cluster touched the catalytic loop and the Mg-loop regions. Similarly, using this heat map as a guide, one can reconstruct the binding modes of other clusters and easily appreciate the differences among various groups, even without looking at their 3D structures.

Comparison of r-SIFts of Different p38 Inhibitors

Docking experiments were performed using four known p38 inhibitors (1ouk-inh, SB203580, SKF-86002 and Amgen-10) and a compound with no p38 inhibition activity (Cmp-59076). These compounds exhibit different chemical scaffolds (FIG. 18). r-SIFt patterns were calculated for all the poses, and for each compound, the three poses that displayed the best similarity scores against either the co-crystal structure or the respective best pose (i.e., the native binding mode) were selected. For Cmp-59076, the three poses with the highest Tanimoto coefficients against the 1ouk-inh co-crystal structure were selected, as it was difficult to predict the true binding mode of this non-inhibitor. Hierarchical clustering results of these r-SIFt patterns are shown as a heat map in FIG. 19a. The r-SIFt generated from the co-crystal structure of 1ouk is also displayed for comparison. FIGS. 19b-19g show the 3D structures of the poses of each compound, within the same reference structure frame.

Not surprisingly, the r-SIFt patterns are first clustered together by compound. Furthermore, the distance between two clusters in the dendrogram reflects the degree of similarity in the binding mode. In all four p38 inhibitors (1ouk-inh, SB203580, SKF-86002, Amgen-10), the overall positions of the molecular fragments within their r-SIFts were consistent. In most of the cases, the R-2 group (purple bits) was in contact with the hinge region, whereas the R-1 group (blue bits) was highly concentrated in the hydrophobic pocket region (made up of residues from β3-β4 and some residues in β5 immediately preceding the hinge region). This shows that different p38 inhibitors bound to the target molecule with a very consistent overall interaction pattern. Cmp-59076, which displayed a completely different binding mode, was the most distant from other inhibitors in the dendrogram.

A more detailed investigation of the r-SIFt patterns revealed some degree of variation between different known inhibitors. For example, the R-2 group of 1ouk-inh (purple bits in FIG. 19a) showed more extensive interactions in the second half of the hinge region (around residue 100) than other inhibitors. Such extensive contact between 1ouk-inh and the hinge has been previously observed and rationalized. 1ouk-inh also displayed more interaction points than other compounds in the hydrophobic region. This can be rationalized by the fact that it has a bulkier tri-fluoro phenyl R-1 group as opposed to the smaller 3-fluorophenol R-1 of the others. In addition, the interactions between the R-2 of Amgen-10 and the hinge region were sparser than for the other molecules. As seen from the structures, the Amgen-10 poses predicted by the docking experiments moved slightly away from the hinge so that the carbonyl at the core can hydrogen bond with Lys-53. The relative distance between different compounds correlated well with their chemical similarity, with SKF-86002 and SB203580 being very close to each other (their R-1 and R-2 groups are identical, and their cores are similar), while 1ouk-inh and Amgen-10 (chemically more dissimilar) were farther apart in the dendrogram.

Analysis of Combinatorial Libraries

To search for the rules governing the behaviors of the compounds within a target molecule, four combinatorial libraries were enumerated. r-SIFt was then used to help investigate their “dockability” or hit potential, that is, whether or not they were able to dock onto the target with the expected binding mode. After generating r-SIFts, a hierarchical clustering analysis was carried out to separate different binding modes. FIG. 20a shows the organization of the r-SIFt patterns of library 1ouk-inh-R1. The first major cluster (illustrated as green) was the native cluster, in which the relative positions and orientations of the molecules were the same as observed in the 1ouk co-crystal structure. Examples of the compounds in this native cluster are shown in FIG. 20b. The rest of the library was labeled as non-native (shown as red). FIG. 20c shows some examples of the molecules in both native and non-native clusters. Because all of these examples had high PMF docking scores, a conventional docking score alone would not have been able to separate those with a native binding mode from those with a non-native binding mode. Several molecular descriptors showed some modest correlation with the r-SIFt classification. For example, in the native cluster, molecular surface areas of the R-1 groups in general tended to be smaller than in the non-native cluster. This can be rationalized by the fact that the size of the p38 hydrophobic pocket precluded very large R-1 groups from occupying the pocket.

However, no single descriptor alone was able to successfully explain the classification variance. A more complex predictive model that involves a combination of different descriptors was required.

The CART™ decision tree method was used to build classification models. A decision tree model was generated for each of the four combinatorial libraries, using a non-redundant set of 2D molecular descriptors as predictive variables. FIG. 21 shows the optimal decision tree model for library 1ouk-inh-R1. The CART method also produced a sorted list of descriptors based on their levels of importance. Descriptors that were pertinent to the size, shape and hydrophobicity of the R-1 group, such as total number of atoms, total surface area, polar surface area, molecular weight and LogD, were among the most important factors. This was consistent with the restrictions on size and hydrophobicity of R-1 imposed by the nature of the hydrophobic pocket, such that only those compounds with the right size, shape and hydrophobicity were able to bind to the target molecule with the desired binding mode.

The performances of these decision tree models were evaluated by the prediction accuracies for both native and non-native classes. The results are summarized in Table 6.

TABLE 6. Distribution of Native and Non-native Compounds

Library	Total	Native	Non-native
1ouk-inh-R1	2208	1428 (64.7%)	780 (35.3%)
skf86002-R1	2442	266 (10.9%)	2176 (89.1%)
amgen10-R1	1745	478 (27.4%)	1267 (72.6%)
1ouk-inh-R2	2450	352 (21.8%)	1917 (78.2%)

Interestingly, all cross-library predictions involving the Amgen-10 library (see Table 8) showed different accuracies for native and non-native classes: the native classes were predicted more accurately (81-90%) than the non-native molecules (only 50-62%).

To test the expandability of these predictive models, decision trees were regenerated by randomly setting aside 25% of the original library as the evaluation set.

Models were built using the remaining 75% of the data, with exactly the same parameter settings and 10-fold cross validation. Each model was then applied to test its respective evaluation set that was never used in the model building process. The prediction accuracies were all comparable to those shown in Table 7, indicating that the models are fully expandable, and can be applied to filter very large combinatorial libraries.

FIG. 22 illustrates an r-SIFt based strategy for the design of target-focused chemical libraries. In particular, r-SIFt is used for compound classification and library filtering. This method takes advantage of the ability of SIFt (including r-SIFt) to quickly analyze and organize large amounts of structural data and to efficiently identify compounds consistent with known binding modes from a large set of docking data. The strategy involves the following steps: selecting a small subset library with maximized diversity; calculating 2D descriptors of the whole library of compounds; docking the small library onto the target molecule structure; calculating SIFt or r-SIFt for the docking poses; analyzing and clustering the poses based on their SIFt or r-SIFt patterns; classifying compounds into native and non-native groups based on the SIFt analysis, according to whether or not they are able to bind to the target molecule with the desired binding mode, or satisfy some pre-defined interactions; building predictive models based on the above classifications, using the 2D descriptors of the molecules as predictive variables; and applying this predictive model to filter the original large combinatorial library.

r-SIFt is a variation of SIFt designed for dealing with compound libraries. r-SIFt embeds the binding information of various R-groups of a combinatorial library into a fingerprint. r-SIFt has several desirable features. First, it is extremely sensitive to subtle variations in the placement of ligands within the active site; second, when represented as a heat map, the r-SIFt patterns provide a convenient way to directly visualize how the ligand molecules interact with various regions of the target molecule; third, calculation of r-SIFt patterns is less time consuming since it only involves simple contact distances between heavy atom pairs.

Tenfold cross-validation was used during the construction process: 90% of the data points (randomly selected) were used each time to build the model while 10% of the data was set aside as a test set for validation. The accuracies for the test data sets set aside during the decision tree construction were a better performance indicator. All four models gave reasonably good and balanced performances, with accuracies (for test sets) in the range of 70-80%, for both native and non-native classes of molecules (see Table 7 below).

TABLE 7. Prediction Accuracies (% Correct) of Decision Tree Model

Library	Training Set native	Training Set non-native	Test Set native	Test Set non-native
1ouk-inh-R1	80	84	80	77
skf86002-R1	78	80	77	79
amgen10-R1	83	74	73	70
1ouk-inh-R2	71	72	70	71

The three R-1 libraries were derived from different scaffolds. Since the variable R-1 groups in these libraries all target the same hydrophobic binding pocket, it was reasonable to expect that the rules derived from these libraries should be closely related to each other. To test this hypothesis, each decision tree model was used to predict the other two R-1 libraries. The cross-library prediction results are summarized in Table 8.

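TABLE 8. Cross-library (R1 only) Prediction Accuracies (% Correct)

Columns give the target library and class; rows give the library used to build the model. A hyphen marks a model applied to its own library (not a cross-library prediction).

Model Library	1ouk-inh-R1 native	1ouk-inh-R1 non-native	skf-86002-R1 native	skf-86002-R1 non-native	amgen10-R1 native	amgen10-R1 non-native
1ouk-inh-R1	-	-	78	71	90	50
skf-86002-R1	74	74	-	-	83	56
amgen10-R1	81	61	82	62	-	-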

1ouk-inh-R1 and SKF-86002-R1 were interchangeable, with their cross-library prediction accuracies remaining at 71-78% for both classes of molecules, a performance comparable to their self-prediction accuracies (Table 7). The r-SIFt based method offers two advantages that are useful for library design. A modified docking pose triage scheme that combines both a traditional scoring function and the SIFt ranking provided much better confidence in generating the true binding placement of the compounds. It therefore gave a superior database enrichment performance over traditional schemes. As one application of the r-SIFt, this method was used to analyze several ensembles of docking results, and to accurately differentiate compounds in a library based on their abilities to bind to the target with the expected binding mode (virtual hits and non-hits). Based on r-SIFt classification, 2D descriptors of the compounds were used to build general predictive models that can filter large libraries.

SIFt, p-SIFt and r-SIFt can enforce different layers of target molecule-ligand constraints that are valuable in designing chemical libraries and mining virtual screening results. r-SIFt, as a variation of SIFt, incorporates the binding information of different fragments of a compound into the fingerprints, thus allowing analysis of the 3-D structures of the compounds in the context of the target native site based on how different fragments of these compounds interact with the target molecule. r-SIFt provides flexibility to a user, who can select various types of binding information to incorporate in the fingerprints, depending on the specific needs of the analysis.

The r-SIFt based approach provides a method complementary to other conventional filtering methods, such as the 3-D pharmacophore model. A fundamental difference between these two methods is that in the SIFt-based method, compounds are actually docked onto the target molecule to see how they behave, that is, whether or not they are able to make the predicted interactions, as expected by the pharmacophore model or known SIFt patterns/profiles.

The r-SIFt pattern, as implemented above, provided information about the overall orientation and position of a ligand molecule relative to the binding site of the target molecule. It did not provide, however, more detailed information about what kinds of interaction (hydrophobic, polar, hydrogen bonds, etc.) are involved. Such detailed binding information can often be highly valuable and can be used as effective constraints in designing a library. The SIFt and p-SIFt contain such details and other perspectives. One can combine SIFt and r-SIFt to construct more constraints and to carry out a more careful and in-depth analysis of the pilot library in order to classify native and non-native compounds. One can also apply more than one type of SIFt in the library design process. For example, r-SIFt can be used to search for molecules such that a particular R-group occupies a special region of the target molecule, and SIFt can be applied to further search for molecules making specific interactions (e.g., hydrogen bonds, hydrophobic interaction) with particular residues/sub-regions. Such double constraints would generate a pool of native molecules that are more specific and selective.

r-SIFt can offer a sensitive and efficient method to discriminate different binding modes of the ligands and therefore can be used as a powerful filter, especially during the initial filtering steps, to effectively remove compounds with undesirable interaction patterns with the target. SIFt-based and r-SIFt-based approaches have proven to be effective tools for organizing, visualizing, and analyzing large libraries of structures such as docking poses. The SIFt-based library design and pruning method described here provides a new strategy that is complementary to other conventional methods.

A number of embodiments have been described. Nevertheless, it will be understood that various modifications may be made. Accordingly, other embodiments are within the scope of the following claims.

Number	Date	Country
60672018	Apr 2005	US
60602852	Aug 2004	US
60484308	Jul 2003	US
60524083	Nov 2003	US

	Number	Date	Country
Parent	PCT/US04/20992	Jul 2004	US
Child	11206034	Aug 2005	US
