1. Field of the Invention
The present invention relates generally to molecular analysis, and more specifically, to characterizing a molecule.
2. Related Art
Characterizing or distinguishing molecules has many practical benefits. For example, some molecules are known to react with a protein in a certain way. Being able to identify those molecules, researchers and practitioners can influence the migration of proteins within a living organism as well as develop new medications or treatments for diseases.
For instance, if a particular molecule is known to bind to specific residue sites on a protein, the protein may fold or enter a dormant or harmless state. As a result, the folded or dormant protein will be unable to bind to areas of a human heart or other organs, and therefore unable to damage them.
Therefore, a need exists to develop a technology that can quickly and conveniently characterize, distinguish, and/or cluster molecules based on their interaction with a protein or similar structure.
The present invention provides a method, system and computer program product for developing a residue fingerprint for a molecular structure (such as a ligand). Based on the residues of a reference structure (such as a protein), a residue fingerprint defines a set of residues that interacts with the molecular structure. Residue fingerprints can be used to compare different poses of the molecular structure with a reference pose on the same molecular structure, poses of different molecular structures, and/or a different reference three-dimensional structure.
In an embodiment, a list of molecular structures is generated and stored for characterization. Each molecular structure is compared to a reference structure to characterize its binding mode with the reference structure.
In an embodiment, the binding mode is determined by measuring the inter-atomic distance between the molecular structure and residues on the reference structure. Interacting residues are identified as those having an inter-atomic distance that does not exceed an inter-atomic threshold. In an embodiment, the inter-atomic threshold is based on the van der Waals radii of the two atoms.
A residue fingerprint for the molecular structure is produced from interacting residues. In an embodiment, the residue fingerprint is expressed as a list of interacting residues. In another embodiment, the residue fingerprint is represented as a bit string whose length is the number of residues in the reference structure. The bit string can be a binary representation with a “1” designating positions corresponding to interacting residues and a “0” designating positions corresponding to non-interacting residues.
According to embodiments of the present invention, residue fingerprints are used to define the similarity of molecular structures in terms of binding mode, identify molecules with similar binding modes, and/or select a subset of molecules that represent the full diversity of binding modes in a larger set. In an embodiment, a Tanimoto score is computed to measure the similarity.
The accompanying drawings, which are incorporated herein and form part of the specification, illustrate the present invention and, together with the description, further serve to explain the principles of the invention and to enable one skilled in the pertinent art(s) to make and use the invention. In the drawings, generally, like reference numbers indicate identical or functionally or structurally similar elements. Additionally, generally, the leftmost digit(s) of a reference number identifies the drawing in which the reference number first appears.
According to embodiments of the present invention, a residue fingerprint is developed to characterize, distinguish, and cluster large numbers of three-dimensional molecular structures (such as a ligand), based on their binding mode with a reference structure. A binding mode represents the three-dimensional interactions that a molecular structure makes with the reference structure. The reference structure can be a protein or any other type of macromolecule.
Based on the residues of the reference structure, a residue fingerprint defines a set of residues that interacts with the molecular structure. As discussed below, residue fingerprints can be used to define the similarity of structures in terms of binding mode, identify molecular structures with similar binding modes, or select a subset of molecular structures that represent the full diversity of binding modes in a larger set.
Referring to
The control flow of flowchart 100 begins at step 101 and passes immediately to step 103. At step 103, a molecular structure is accessed for characterization. In an embodiment, the molecular structure is selected from a list of molecular structures, which are stored on a storage medium. In an embodiment, a software application is used to build the list of molecular structures. For example, a software application can be used to design a group of molecular structures, which are based on a caspase protein structure. The molecular structures would be stored and selected individually to be characterized in accordance with the present invention.
At step 106, a reference structure is accessed. As discussed in greater detail below, the molecular structure selected at step 103 is compared to the reference structure to characterize its binding mode. As discussed above, the reference structure can be a protein or another macromolecule. If the selected molecular structure is generated by a software application from a caspase protein structure, as discussed at step 103, the caspase protein structure can be selected as the reference structure.
At step 109, a residue is selected from the reference structure. The reference structure typically includes a plurality of residues, and one of the residues is selected for further examination. Each residue is processed in turn.
At step 112, the binding mode for the molecular structure is characterized for the selected residue. In other words, the selected residue is examined to determine whether it is an interacting residue. A residue is denoted as being an interacting residue if the residue has at least one atom that is close to an atom in the molecular structure. An interacting threshold determines the requisite degree of closeness for denoting an interacting residue. If the inter-atomic distance is less than the interacting threshold, the residue is denoted as being an interacting residue. The interacting threshold can be based on the van der Waals radii of the atoms being used to measure the inter-atomic distance. In an embodiment, the interacting threshold is the product of a scaling factor and the sum of the van der Waals radii of the two atoms. In an embodiment, the value 1.2 is chosen to be the scaling factor.
In an embodiment, a C++ program is executed to calculate the interacting threshold and determine whether the selected residue is an interacting residue. If an interacting residue is detected, the residue is marked or added to a list of interacting residues. In addition to the C++ programming language, other programming languages can be used to code the software for detecting interacting residues.
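The interacting-residue test of step 112 can be sketched as follows. This is a minimal illustration in Python rather than the C++ program mentioned above. The radius table (Bondi values in angstroms), the function names, and the `(element, (x, y, z))` atom representation are assumptions made for the example and are not taken from the specification.

```python
import math

# Illustrative van der Waals radii in angstroms (Bondi values). The text does
# not say which radius table is used, so this table is an assumption.
VDW_RADII = {"H": 1.20, "C": 1.70, "N": 1.55, "O": 1.52, "S": 1.80}

SCALING_FACTOR = 1.2  # scaling factor named in the text for one embodiment


def interacting_threshold(elem_a, elem_b, scale=SCALING_FACTOR):
    """Scaling factor times the sum of the two van der Waals radii."""
    return scale * (VDW_RADII[elem_a] + VDW_RADII[elem_b])


def is_interacting_residue(residue_atoms, ligand_atoms, scale=SCALING_FACTOR):
    """Return True if any residue atom is closer than the threshold
    to any atom of the molecular structure.

    Each atom is a hypothetical (element, (x, y, z)) tuple.
    """
    for r_elem, r_xyz in residue_atoms:
        for l_elem, l_xyz in ligand_atoms:
            limit = interacting_threshold(r_elem, l_elem, scale)
            if math.dist(r_xyz, l_xyz) < limit:
                return True
    return False


# A carbon and an oxygen 3.5 angstroms apart fall inside the
# 1.2 * (1.70 + 1.52) = 3.864 angstrom threshold.
residue = [("C", (0.0, 0.0, 0.0))]
ligand = [("O", (3.5, 0.0, 0.0))]
print(is_interacting_residue(residue, ligand))  # True
```

The strict "less than" comparison follows the wording of step 112; an embodiment that treats a distance equal to the threshold as interacting would use `<=` instead.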
At step 115, the reference structure is examined to detect any additional residues that are to be characterized. If another residue is detected, the control flow returns to step 109 and the detected residue is examined. If no other residues are detected, the control flow passes to step 118 because all residues have been examined and measured for interactivity with the molecular structure.
At step 118, a residue fingerprint for the molecular structure is produced from the interacting residues. Therefore, a residue fingerprint identifies and/or characterizes a molecular structure by identifying all residues on a reference structure that interact with the molecular structure. In an embodiment, the residue fingerprint is expressed as a list of interacting residues. In another embodiment, the residue fingerprint is represented as a bit string whose length is the number of residues in the reference structure. Positions corresponding to interacting residues receive a “1”, and positions corresponding to non-interacting residues receive a “0” value.
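The two representations described at step 118 can be related with a short sketch. The function name and the residue labels below are hypothetical; the only behavior taken from the text is that the bit string has one position per residue of the reference structure, set to "1" for interacting residues and "0" otherwise.

```python
def fingerprint_bits(reference_residues, interacting):
    """Return a bit string with one position per reference residue."""
    hits = set(interacting)
    return "".join("1" if res in hits else "0" for res in reference_residues)


# Hypothetical five-residue reference structure with two interacting residues.
reference = ["A61", "A62", "A63", "A64", "A65"]
bits = fingerprint_bits(reference, ["A62", "A64"])
print(bits)  # 01010
```

The list form is compact when few residues interact, while the bit string has a fixed length for a given reference structure, which makes position-by-position comparison straightforward.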
After the residue fingerprint has been produced, the fingerprint is outputted to a storage medium or a display. The residue fingerprint can also be provided as input to another process, computation, or the like. Afterwards, the control flow ends as indicated at step 195.
In another embodiment of the present invention, the nature of atom-to-atom interactions is taken into consideration to provide finer granularity to the computation of a residue fingerprint. This can be described with reference to flowchart 112 in
The control flow of flowchart 112 begins at step 201 and passes immediately to step 203. At step 203, the atoms of the molecular structure are examined to detect the different types of atoms that are present. The different types can include an H-bond donor, H-bond acceptor, pi, hydrophobic-aromatic, hydrophobic-aliphatic, or the like.
At step 206, the types of atoms are detected at the selected residue for the reference structure. As discussed, the atoms can be an H-bond donor, H-bond acceptor, pi, hydrophobic-aromatic, hydrophobic-aliphatic, or the like.
At step 209, one of the atom types detected at step 206 is selected for the reference structure. At step 212, one of the atom types detected at step 203 for the molecular structure is selected.
At step 215, the atoms corresponding to the selected atom types are examined to determine if the atom from the molecular structure is an interacting atom with respect to the reference structure. As discussed above with reference to step 112, in an embodiment, the inter-atomic distance is measured to determine if the inter-atomic distance is less than an interacting threshold.
At step 218, the molecular structure is examined to detect any additional atom types that have not been examined. If another atom type is detected, the control flow returns to step 212 and the detected atom type is selected. If no other atom types are detected, the control flow passes to step 221 since all detected atom types have been measured for interactivity with the reference structure.
At step 221, the reference structure is examined to detect any additional atom types that have not been examined. If another atom type is detected, the control flow returns to step 209 and the detected atom type is selected. If no other atom types are detected, the control flow passes to step 295 since all detected atom types have been measured for interactivity with the molecular structure. As a result, if five atom types are detectable for both structures, a five-by-five matrix of possible interaction types is defined, and/or a bit can be marked for each interaction that exists between the molecular structure and the reference structure. Afterwards, the control flow ends as indicated at step 295.
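The five-by-five matrix of interaction types can be sketched as below. The type labels, the `(atom_type, (x, y, z))` atom representation, and the fixed 4.0 angstrom cutoff are assumptions made for brevity; the text bases the threshold on van der Waals radii. The sketch iterates over atoms and records the type pair of each close contact, which yields the same matrix as looping over atom types as in steps 209 through 221.

```python
import math

# Hypothetical labels for the five atom types named in the text.
ATOM_TYPES = ["donor", "acceptor", "pi", "aromatic", "aliphatic"]


def interaction_matrix(residue_atoms, ligand_atoms, cutoff=4.0):
    """Return a 5x5 matrix of bits: rows are reference-structure atom types,
    columns are molecular-structure atom types.

    The fixed cutoff is a placeholder for the interacting threshold.
    """
    index = {t: i for i, t in enumerate(ATOM_TYPES)}
    matrix = [[0] * len(ATOM_TYPES) for _ in ATOM_TYPES]
    for r_type, r_xyz in residue_atoms:
        for l_type, l_xyz in ligand_atoms:
            if math.dist(r_xyz, l_xyz) < cutoff:
                matrix[index[r_type]][index[l_type]] = 1
    return matrix


residue = [("donor", (0.0, 0.0, 0.0)), ("aliphatic", (10.0, 0.0, 0.0))]
ligand = [("acceptor", (3.0, 0.0, 0.0))]
m = interaction_matrix(residue, ligand)
print(m[0][1])  # 1: a donor on the residue is close to an acceptor on the ligand
```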
In another embodiment of the present invention, only the types of atoms for the reference structure are taken into consideration to provide finer granularity to the computation of a residue fingerprint. This can be described with reference to flowchart 112 in
The control flow of flowchart 112 begins at step 301 and passes immediately to step 303. At step 303, the types of atoms are detected at the selected residue for the reference structure. As discussed above with reference to flowchart 200, the atoms can be an H-bond donor, H-bond acceptor, pi, hydrophobic-aromatic, hydrophobic-aliphatic, or the like.
At step 306, one of the atom types is selected. At step 309, the atoms corresponding to the selected atom type and the atoms from the molecular structure are examined to determine if any atom from the molecular structure is an interacting atom. As discussed above with reference to step 112, in an embodiment, the inter-atomic distance is measured to determine if the inter-atomic distance is less than an interacting threshold.
At step 312, the reference structure is examined to detect any additional atom types that have not been examined. If another atom type is detected, the control flow returns to step 306 and the detected atom type is selected. If no other atom types are detected, the control flow passes to step 395 since all detected atom types have been measured for interactivity with the molecular structure. Afterwards, the control flow ends as indicated at step 395.
In another embodiment of the present invention, the quantity of each type of interaction with each residue is taken into consideration to increase the granularity for a residue fingerprint. This can be described with reference to flowchart 112 in
The control flow of flowchart 112 begins at step 401 and passes immediately to steps 303-312, as described above with reference to
In another embodiment of the present invention, finer granularity to a residue fingerprint is provided to distinguish specific atoms on a residue. This can be described with reference to flowchart 112 in
The control flow of flowchart 112 begins at step 501 and passes immediately to steps 303-312, as described above with reference to
As discussed, the control flows depicted in
As discussed with reference to step 112 in
The present invention also includes methodologies and/or techniques for quantifying the similarity of two molecular structures and selecting a subset of maximally dissimilar (i.e., representative) molecular structures. This can be described with reference to
The control flow of flowchart 600 begins at step 601 and passes immediately to step 603. At step 603, the residue fingerprints for two molecular structures are accessed. The residue fingerprints can be calculated by one or more of the control flows described above with reference to
At step 606, one of the residue fingerprints is selected and the number of items in the selected fingerprint is computed. This number is denoted by the variable “N1.” At step 609, the other residue fingerprint is selected and the number of items is computed. This number is denoted by the variable “N2.”
At step 612, the number of items shared by both fingerprints is computed. This number is denoted by the variable “NS.”
At step 615, a Tanimoto score is computed from the information computed from steps 606-612. In an embodiment, the Tanimoto score is computed by summing the number of items from the first and second fingerprints and subtracting the number of shared items from this value. Afterwards, the reciprocal of this value is multiplied by the number of shared items. In other words, “Tanimoto Score=NS/(N1+N2−NS).” After the Tanimoto score is computed, the control flow ends as indicated at step 695.
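Steps 606 through 615 translate directly into a few lines of code. The sketch below treats each fingerprint as a list of residue labels; the function name and the choice to return 0.0 when both fingerprints are empty are assumptions, since the text does not address that case.

```python
def tanimoto(fp1, fp2):
    """Tanimoto score NS / (N1 + N2 - NS) for two residue fingerprints."""
    a, b = set(fp1), set(fp2)
    n1 = len(a)          # step 606: items in the first fingerprint
    n2 = len(b)          # step 609: items in the second fingerprint
    ns = len(a & b)      # step 612: items shared by both fingerprints
    denom = n1 + n2 - ns
    # Assumed convention: two empty fingerprints score 0.0.
    return ns / denom if denom else 0.0


# Two hypothetical fingerprints sharing two of four distinct residues.
print(tanimoto(["A1", "A2", "A3"], ["A2", "A3", "A4"]))  # 0.5
```

The score ranges from 0 (no shared residues) to 1 (identical fingerprints).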
Computing the Tanimoto score between two residue fingerprints gives a measure of the similarity of the three-dimensional binding modes of the two molecular structures, without regard to their chemical compositions. This similarity measure between two fingerprints forms the basis for various clustering methods.
Thus, in an embodiment, the present invention enables molecular structures to be clustered by binding mode. The Tanimoto score is used to classify a large set of molecular structures into a set of clusters. Molecules within each cluster of molecular structures have a high Tanimoto score to each other and, therefore, a similar binding mode. A representative molecular structure is selected from each cluster. Thus, a small subset of molecular structures can be selected to represent the full diversity of binding modes in a larger set of molecular structures.
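The text does not prescribe a particular clustering algorithm at this point. As one possible illustration, the sketch below uses a simple leader-style pass: each structure joins the first cluster whose representative it resembles, or else starts a new cluster. The algorithm choice, the 0.6 similarity threshold, and all names are assumptions made for the example.

```python
def tanimoto(fp1, fp2):
    """Tanimoto score NS / (N1 + N2 - NS) for two residue fingerprints."""
    a, b = set(fp1), set(fp2)
    denom = len(a) + len(b) - len(a & b)
    return len(a & b) / denom if denom else 0.0


def leader_cluster(fingerprints, threshold=0.6):
    """Group structures by binding mode.

    fingerprints maps a structure name to its list of interacting residues.
    Returns a list of clusters; the first name in each cluster serves as its
    representative.
    """
    clusters = []
    for name, fp in fingerprints.items():
        for cluster in clusters:
            representative = fingerprints[cluster[0]]
            if tanimoto(representative, fp) >= threshold:
                cluster.append(name)
                break
        else:
            clusters.append([name])  # no similar representative found
    return clusters


# Hypothetical fingerprints: m1 and m2 bind similarly, m3 binds elsewhere.
fps = {
    "m1": ["A1", "A2", "A3"],
    "m2": ["A1", "A2", "A3", "A4"],
    "m3": ["B7", "B8"],
}
clusters = leader_cluster(fps)
print(clusters)  # [['m1', 'm2'], ['m3']]
```

Taking the first member of each cluster gives a small representative subset; the result of a leader pass depends on input order, which a production method would need to address.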
In an embodiment, a software application is used to select representative subsets of molecular structures based on their diversity of binding modes. The software application can be the SUBSET program written by Bruno Bienfait and described in the article written by Reynolds et al., entitled “Lead Discovery Using Stochastic Cluster Analysis (SCA): A New Method for Clustering Structurally Similar Compounds,” Journal of Chemical Information and Computer Sciences (1998), vol. 38(2), pp. 305-312, which is incorporated herein by reference in its entirety. The software application can present a small number (e.g., a dozen) of representative molecular structures that reflect the binding modes of the larger set. Then, another software application would select molecular structures similar in binding mode to interesting-looking molecular structures. Another software application can also select molecular structures that have interactions with at least a specified set of residues.
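Selecting structures that interact with at least a specified set of residues amounts to a subset test on each fingerprint. The sketch below is one way to express that filter; the function name and the sample data are hypothetical and do not describe any of the software applications named above.

```python
def select_by_required_residues(fingerprints, required):
    """Return the names of structures whose fingerprint contains every
    residue in the required set."""
    required = set(required)
    return [name for name, fp in fingerprints.items() if required <= set(fp)]


# Hypothetical fingerprints keyed by structure name.
fps = {
    "m1": ["A62", "A64", "E204"],
    "m2": ["A62", "E205"],
    "m3": ["A62", "A64"],
}
selected = select_by_required_residues(fps, ["A62", "A64"])
print(selected)  # ['m1', 'm3']
```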
The residue fingerprints of the present invention enable comparisons to be made among the binding modes in symmetrical sites in the same protein complex, or across different but related proteins.
First, a software application, as discussed above, is used to generate two sets of molecules, one set for each of the two binding sites 702 and 704. Next, residue fingerprints are produced to compare the molecules designed for each site 702 and 704. For each of the two sets of molecules, a molecule is selected having three-dimensional coordinates that are different from the three-dimensional coordinates of the molecule selected from the other set. Afterwards, a list of interacting residues is assembled for the two molecules from their respective residue fingerprints. For the first molecule, the list of interacting residues includes “A121, A161, A163, A62, A63, A64, A65, E204, E205, E206, E207, E209, and E256.” For the second molecule, the list of interacting residues includes “B121, B161, B162, B163, B62, B64, F204, F205, F206, and F207.” The Tanimoto score for these two molecules is zero, which suggests that the molecules are dissimilar.
However, by discarding the first character (e.g., A, E, B, F) at each site in the residue fingerprint, a list of interacting residues can be prepared that is independent of chain. The Tanimoto score for the independent list is 0.64, which indicates that the molecules are similar despite having different three-dimensional coordinates. The molecules bind in the same way even though they were designed for, and are bound at, different sites. Thus, their similarities can be detected despite being bound at different sites. Accordingly, the residue fingerprints of the present invention enable molecules to be compared across different, yet theoretically equivalent, sites within the same protein complex.
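The chain-independent comparison can be reproduced from the two residue lists given above. In this sketch the lists are copied from the text, while the function names are hypothetical. Dropping the chain letter could merge distinct residues that share a number on different chains; that does not occur in this example.

```python
def tanimoto(fp1, fp2):
    """Tanimoto score NS / (N1 + N2 - NS) for two residue fingerprints."""
    a, b = set(fp1), set(fp2)
    denom = len(a) + len(b) - len(a & b)
    return len(a & b) / denom if denom else 0.0


def strip_chain(fingerprint):
    """Discard the first character (the chain letter) of each residue label."""
    return {res[1:] for res in fingerprint}


first = ["A121", "A161", "A163", "A62", "A63", "A64", "A65",
         "E204", "E205", "E206", "E207", "E209", "E256"]
second = ["B121", "B161", "B162", "B163", "B62", "B64",
          "F204", "F205", "F206", "F207"]

print(tanimoto(first, second))  # 0.0
# Chain-independent: 9 shared residues out of 13 + 10 - 9 = 14.
print(round(tanimoto(strip_chain(first), strip_chain(second)), 2))  # 0.64
```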
By mapping the coordinates of protein structure 802 onto protein structure 804, or vice versa, a merged protein structure can be created to indicate the structural correspondence of the residues between protein structure 802 and protein structure 804. The residue fingerprints for the two molecules would, likewise, indicate the structural correspondence of the residues. For instance, the residue fingerprint for the molecule selected for binding site 806 includes interacting residues “120_316, 121_317, 161_358, 162_359, 163_360, B61, B62, 64_260, 205_411, and 207_413.” The underscores in the residue fingerprint identify the residue sites that are structurally equivalent in the two protein structures 802 and 804. For example, residue site “B120” in protein structure 802 and residue site “C316” in protein structure 804 are structural equivalents, and are, therefore, expressed as a “merged” residue site “120_316” in the residue fingerprint for the merged protein. Residue site “B61” in protein structure 802 does not have a corresponding site in protein structure 804, and therefore, is listed as residue site “B61” in the merged protein.
As for the molecule selected for binding site 808, the residue fingerprint for this molecule includes interacting residues “C258, 64_260, 120_316, 121_317, 161_358, 162_359, 163_360, 205_411, and 207_413.” Once again, the underscores in the residue fingerprint identify the residue sites that are structurally equivalent in the two protein structures 802 and 804. For example, residue site “B64” in protein structure 802 structurally corresponds to residue site “C260” in protein structure 804. However, residue site “C258” in protein structure 804 has no corresponding residue site in protein structure 802.
A Tanimoto score of “0.73” is computed from the “merged” residue fingerprints. The merged Tanimoto score indicates that the two molecules are similar despite having different three-dimensional coordinates and despite being bound to different, but related, protein structures 802 and 804. Therefore, residue fingerprinting, produced in accordance with the present invention, can be extended to allow a comparison to be made among the binding modes of molecules against different, but related, protein structures. By mapping the protein structures to a common location, as discussed above, a protein-neutral list of interacting residues can be generated to compare the binding modes of the molecules designed for different protein structures. The results from the comparison reveal the degree of similarity even though the molecules have different three-dimensional coordinates and bind to different protein structures.
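The merged comparison can be checked against the two fingerprints listed above, with merged residue sites written using an underscore as the text describes. In the sketch below, the merged lists are copied from the text. The `merge_labels` helper and its lookup table are hypothetical, and the table contains only the two correspondences the text spells out (B120 with C316, and B64 with C260).

```python
def tanimoto(fp1, fp2):
    """Tanimoto score NS / (N1 + N2 - NS) for two residue fingerprints."""
    a, b = set(fp1), set(fp2)
    denom = len(a) + len(b) - len(a & b)
    return len(a & b) / denom if denom else 0.0


def merge_labels(fingerprint, equivalents):
    """Replace each residue that has a structural equivalent with its merged
    label; residues without an equivalent keep their original label."""
    return {equivalents.get(res, res) for res in fingerprint}


# Only the correspondences stated explicitly in the text.
equivalents = {"B120": "120_316", "C316": "120_316",
               "B64": "64_260", "C260": "64_260"}
print(sorted(merge_labels(["B120", "B61"], equivalents)))  # ['120_316', 'B61']

site_806 = ["120_316", "121_317", "161_358", "162_359", "163_360",
            "B61", "B62", "64_260", "205_411", "207_413"]
site_808 = ["C258", "64_260", "120_316", "121_317", "161_358",
            "162_359", "163_360", "205_411", "207_413"]

# 8 shared merged sites out of 10 + 9 - 8 = 11.
print(round(tanimoto(site_806, site_808), 2))  # 0.73
```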
The present invention can be implemented in one or more computer systems capable of carrying out the functionality described herein. Referring to
The computer system 900 includes one or more processors, such as processor 904. The processor 904 is connected to a communication infrastructure 906 (e.g., a communications bus, crossover bar, or network).
Computer system 900 can include a display interface 902 that forwards graphics, text, and other data from the communication infrastructure 906 (or from a frame buffer not shown) for display on the display unit 930.
Computer system 900 also includes a main memory 908, preferably random access memory (RAM), and can also include a secondary memory 910. The secondary memory 910 can include, for example, a hard disk drive 912 and/or a removable storage drive 914, representing a floppy disk drive, a magnetic tape drive, an optical disk drive, etc. The removable storage drive 914 reads from and/or writes to a removable storage unit 918 in a well-known manner. Removable storage unit 918 represents a floppy disk, magnetic tape, optical disk, etc., which is read by and written to by removable storage drive 914. As will be appreciated, the removable storage unit 918 includes a computer usable storage medium having stored therein computer software (e.g., programs or other instructions) and/or data.
In alternative embodiments, secondary memory 910 can include other similar means for allowing computer software and/or data to be loaded into computer system 900. Such means can include, for example, a removable storage unit 922 and an interface 920. Examples of such can include a program cartridge and cartridge interface (such as that found in video game devices), a removable memory chip (such as an EPROM, or PROM) and associated socket, and other removable storage units 922 and interfaces 920 which allow software and data to be transferred from the removable storage unit 922 to computer system 900.
Computer system 900 can also include a communications interface 924. Communications interface 924 allows software and data to be transferred between computer system 900 and external devices. Examples of communications interface 924 can include a modem, a network interface (such as an Ethernet card), a communications port, a PCMCIA slot and card, etc. Software and data transferred via communications interface 924 are in the form of signals 928 which can be electronic, electromagnetic, optical, or other signals capable of being received by communications interface 924. These signals 928 are provided to communications interface 924 via a communications path (i.e., channel) 926. Communications path 926 carries signals 928 and can be implemented using wire or cable, fiber optics, a phone line, a cellular phone link, an RF link, free-space optics, and/or other communications channels.
In this document, the terms “computer program medium” and “computer usable medium” are used to generally refer to media such as removable storage unit 918, removable storage unit 922, a hard disk installed in hard disk drive 912, and signals 928. These computer program products are means for providing software to computer system 900. The invention is directed to such computer program products.
Computer programs (also called computer control logic or computer readable program code) are stored in main memory 908 and/or secondary memory 910. Computer programs can also be received via communications interface 924. Such computer programs, when executed, enable the computer system 900 to implement the present invention as discussed herein. In particular, the computer programs, when executed, enable the processor 904 to implement the processes of the present invention, such as the various steps of methods 100 and 600, for example, described above. Accordingly, such computer programs represent controllers of the computer system 900.
In an embodiment where the invention is implemented using software, the software can be stored in a computer program product and loaded into computer system 900 using removable storage drive 914, hard drive 912, interface 920, or communications interface 924. The control logic (software), when executed by the processor 904, causes the processor 904 to perform the functions of the invention as described herein.
In another embodiment, the invention is implemented primarily in hardware using, for example, hardware components such as application specific integrated circuits (ASICs). Implementation of the hardware state machine so as to perform the functions described herein will be apparent to one skilled in the relevant art(s).
In yet another embodiment, the invention is implemented using a combination of both hardware and software.
The foregoing description of the specific embodiments will so fully reveal the general nature of the invention that others can, by applying knowledge within the skill of the art (including the contents of the documents cited and incorporated by reference herein), readily modify and/or adapt for various applications such specific embodiments, without undue experimentation, without departing from the general concept of the present invention. Therefore, such adaptations and modifications are intended to be within the meaning and range of equivalents of the disclosed embodiments, based on the teaching and guidance presented herein. It is to be understood that the phraseology or terminology herein is for the purpose of description and not of limitation, such that the terminology or phraseology of the present specification is to be interpreted by the skilled artisan in light of the teachings and guidance presented herein, in combination with the knowledge of one skilled in the art.
While various embodiments of the present invention have been described above, it should be understood that they have been presented by way of example, and not limitation. It will be apparent to one skilled in the relevant art(s) that various changes in form and detail can be made therein without departing from the spirit and scope of the invention. Thus, the present invention should not be limited by any of the above-described exemplary embodiments, but should be defined only in accordance with the following claims and their equivalents.
This application claims the benefit of U.S. Provisional Application No. 60/514,008, filed Oct. 27, 2003, by Mosenkis et al., entitled “Computing a Residue Fingerprint for a Molecular Structure,” incorporated herein by reference in its entirety.
Number | Date | Country
---|---|---
60514008 | Oct 2003 | US