The present invention relates to the field of drug development, and more particularly to systems and methods for identifying structurally or functionally significant amino acid sequences.
Pathogenic bacteria are bacteria which may infect a host organism and thereby cause disease or illness. Infection with pathogenic bacteria may be treated with antibiotic drugs designed to target and kill certain pathogenic bacteria. Recent years have seen an increasing number of antibiotic-resistant pathogenic bacterial strains appear in the public domain. In this same time frame, the introduction of new antibiotic drugs has declined. Therefore, there is a need for new antibiotic drugs to target the increasing number of pathogenic bacteria, and consequently a need for new research strategies for developing such drugs.
Aspects of the present invention are embodied in systems, methods, and computer readable storage mediums for identifying structurally or functionally significant amino acid sequences encoded by a genome. At least one structurally or functionally significant amino acid sequence encoded by a genome may be identified by compiling an observed frequency for each of a plurality of amino acid words encoded by the genome, calculating with a computer an expected frequency for each of the plurality of amino acid words encoded by the genome, and identifying at least one structurally or functionally significant amino acid sequence encoded by the genome based at least in part on the observed and expected frequencies for each of the plurality of amino acid words encoded by the genome.
In accordance with another aspect of the present invention, a structurally or functionally significant amino acid sequence in the protein of a pathogen may be targeted by compiling an observed frequency for each of a plurality of amino acid words encoded by the genome of the pathogen, calculating with a computer an expected frequency for each of the plurality of amino acid words encoded by the genome of the pathogen, identifying at least one structurally or functionally significant amino acid sequence encoded by the genome of the pathogen based at least in part on the observed and expected frequencies for each of the plurality of amino acid words encoded by the genome of the pathogen, and developing a drug configured to interact with the at least one structurally or functionally significant amino acid sequence encoded by the genome of the pathogen.
The invention is best understood from the following detailed description when read in connection with the accompanying drawings. Included in the drawings are the following figures:
As used herein, a genome for bacteria refers to the complete genetic sequence of the bacteria. Each genome includes multiple genes that encode various polypeptide sequences. Some of the polypeptide sequences encoded by the genome include protein sequences. Each protein sequence encoded by the genome is comprised of a sequence of amino acids.
As a general overview, system 100 includes one or more input device(s) 102, a data processor 104, a data storage device 106, and one or more output device(s) 108. System 100 may optionally include an external processing system 110. Additional details of system 100 are provided below.
Input device(s) 102 is/are coupled to data processor 104 and may be used to provide electronic data from a user or electronic device to data processor 104. In one exemplary embodiment, the electronic data may include data relating to one or more genomes. In another exemplary embodiment, the electronic data may include the observed frequency of each amino acid word in the protein sequences encoded by the genome. Additionally, an input device 102 may be used to provide user instructions to data processor 104. Input device(s) 102 may include a server, database, keyboard and/or other computer peripheral devices capable of providing electronic data to a data processor.
Data processor 104 receives electronic data from input device 102 and processes the electronic data. Data processor 104 may store received electronic data or processed electronic data in data storage device 106 (described below). In one exemplary embodiment, data processor 104 receives electronic data including data relating to one or more genomes. In another exemplary embodiment, data processor 104 receives electronic data including an observed frequency of each amino acid word in the protein sequences encoded by a genome.
Data processor 104 is configured to process electronic data. Data processor 104 may transform the electronic data into another format. In one exemplary embodiment, the transformed electronic data may include an amino acid word dictionary for a genome. In another exemplary embodiment, the transformed electronic data may include one or more selection scores (described below) for a genome. The transformed electronic data may be stored in data storage device 106 (described below), or transmitted to output device 108 (described below).
Data storage device 106 stores electronic data received from data processor 104. In one exemplary embodiment, data processor 104 may store electronic data including data relating to one or more genomes on data storage device 106. In another exemplary embodiment, data processor 104 may store electronic data including one or more amino acid word dictionaries for one or more genomes on data storage device 106. In yet another exemplary embodiment, data processor 104 may store electronic data including one or more selection scores for one or more genomes on data storage device 106. Data processor 104 may access the electronic data stored on data storage device 106. A suitable data storage device for use with the present invention will be understood by one of skill in the art from the description herein.
An exemplary system including suitable processors and data storage devices for use with the present invention includes a Sun Microsystems SunFire V60x cluster, featuring 128 dual processor 2.8 GHz Xeon CPUs, 7 quad-processor Sunfire X4100M2 nodes, a 48 node Myrinet Switch, 160 GB of memory, and over a terabyte of disk storage. Other suitable data processors and data storage devices will be understood by one skilled in the art from the description herein.
Output device(s) 108 is/are coupled to data processor 104 and may be used to present electronic data received from data processor 104 to a user. In one exemplary embodiment, the electronic data may include one or more amino acid word dictionaries for one or more genomes. In another exemplary embodiment, the electronic data may include one or more selection scores for one or more genomes. Output device(s) 108 may include a computer display, printer, or other computer peripheral device capable of generating output to a user from received electronic data.
An optional external processing system 110 is configured to exchange electronic data with data processor 104 and may perform one or more of the functions performed by data processor 104. Additionally, external processing system 110 may provide electronic data to data processor 104 for further processing. A suitable external processing system for use with the present invention will be understood by one skilled in the art from the description herein.
In step 202, an observed frequency of amino acid words in the protein sequences encoded by a genome is compiled. In an exemplary embodiment, data processor 104 receives data relating to a genome from input device(s) 102. Data processor 104 may then count the number of times each amino acid word occurs in each protein sequence encoded by the genome, and compile a list of the observed frequencies for each amino acid word. The list of the observed frequencies of amino acid words may be stored in data storage device 106.
In step 204, an expected frequency of amino acid words in each protein sequence encoded by a genome is calculated, e.g., with a general or specific purpose computer. The expected frequency of each amino acid word may be calculated based at least in part on the observed amino acid word frequency list compiled in step 202. In an exemplary embodiment, data processor 104 calculates an expected frequency of an amino acid word based on the observed frequencies of two or more amino acid subwords that make up the amino acid word. As used herein, an amino acid subword is an amino acid word occurring within another amino acid word. Data processor 104 may then compile a list of the expected frequencies for each amino acid word. The list of the expected frequencies of amino acid words may then be stored in data storage device 106.
In step 206, a structurally or functionally significant amino acid sequence is identified. The structurally or functionally significant amino acid sequence may be identified based at least in part on the observed and expected amino acid word frequencies compiled in steps 202 and 204. In an exemplary embodiment, data processor 104 generates a selection score for each amino acid sequence in each protein sequence encoded by the genome based on the difference between the expected and observed word frequencies for each amino acid in the sequence. The maximum selection scores correspond to amino acid sequences occurring more frequently in all of the protein sequences encoded by the genome than would be expected from their expected frequencies, which indicates that these sequences are structurally or functionally significant to the bacteria.
The identification of the structurally or functionally significant amino acid sequence may be additionally based on a comparison of the amino acid word frequencies in the protein sequences encoded by the genome (e.g., a genome of a pathogenic bacterium) to the amino acid word frequencies in protein sequences encoded by a related genome (e.g., a genome of a non-pathogenic bacterium related to the pathogenic bacterium). In accordance with this embodiment, differences between the amino acid word frequencies of the pathogenic genome and the non-pathogenic genome may be used to identify amino acid words that are significant to the pathogenic bacteria but not to the non-pathogenic bacteria, e.g., amino acid words having a greater frequency in the pathogenic bacteria than in the non-pathogenic bacteria. This may provide further information on the different effects of natural selection on the genome of a pathogen as opposed to the effects of natural selection on the genome of a non-pathogen.
In step 208, the structurally or functionally significant amino acid sequence is stored and/or presented. In one exemplary embodiment, the selection scores for one or more structurally or functionally significant amino acid sequences may be stored in data storage device 106. In another exemplary embodiment, data processor 104 may transmit electronic data to output device(s) 108. The electronic data may include the selection scores for one or more structurally or functionally significant amino acid sequences in the genome. Output device(s) 108 may then present the selection scores to a user by, for example, a chart or graph indicating the comparative height of the selection scores for the one or more structurally or functionally significant amino acid sequences presented on a monitor or printed on paper. Electronic data transmitted to output device(s) 108 may be at least temporarily stored, e.g., in a video buffer (not shown).
Identifying one or more structurally or functionally significant amino acid sequences of a pathogen may be useful for designing drugs to target structurally or functionally significant parts of the pathogen. However, identifying structurally or functionally significant amino acid sequences may have other uses. Such uses may include identifying patterns of gene structure and organization, identifying critical genes/pathways in a pathogen, identifying latent pathogen genes in environmental genomes, identifying potential new or emergent pathogen diseases, or identifying patterns of emergent pathogen evolution. It will be understood by one skilled in the art that in these applications, the following step 210 may be omitted.
In step 210, an antibiotic drug is developed to interact with the structurally or functionally significant amino acid sequence. The antibiotic drug may be configured to target one or more structurally or functionally significant amino acid sequences of a pathogen. In an exemplary embodiment, an antibiotic drug is designed to target an amino acid sequence having a high selection score in a pathogen. In a further exemplary embodiment, an antibiotic drug is designed to target an amino acid sequence having a high selection score in multiple pathogens, to increase the effectiveness of the drug. The development of a drug to target a selected amino acid sequence will be known to one of skill in the art.
In step 302, a genome target list is read. In an exemplary embodiment, data processor 104 receives a genome target list from input device(s) 102. The genome target list may include one or more genomes identified by a user for which amino acid word dictionaries are desired to be created. For example, a user doing research on human pathogenic bacteria may identify particularly virulent pathogens for inclusion in the genome target list.
In step 304, the protein sequences in each genome on the genome target list are read. As noted above, each genome encodes multiple polypeptide sequences, of which a number of sequences are protein sequences. In an exemplary embodiment, data processor 104 may read a genome to determine what protein sequences it encodes in order to separately analyze each protein sequence.
In step 306, word lists are written for each protein sequence. In an exemplary embodiment, data processor 104 splits each protein sequence into amino acid words having a length of between one and twelve amino acids, although other lengths are contemplated. For example, the invention has been applied to pathogens having relatively large genomes such as eukaryotic pathogens (e.g., protozoa like Trypanosoma (Chagas disease) and Plasmodia (malaria)). For these large genomes, the amino acid word dictionaries can be extended to 24 amino acids or more, while having enough depth to provide relevant information. Data processor 104 may write a list containing each amino acid word occurring in the protein sequence, e.g., to data storage device 106.
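By way of non-limiting illustration, the word-splitting of step 306 might be sketched as follows. The sketch assumes that words are taken as overlapping windows starting at every position of the protein sequence; the function name `amino_acid_words` and its parameters are hypothetical and do not appear in the embodiments described herein.

```python
def amino_acid_words(protein, min_len=1, max_len=12):
    """Yield every contiguous word of min_len..max_len residues.

    Assumption: words overlap, i.e. a word is started at every position.
    """
    for start in range(len(protein)):
        for length in range(min_len, max_len + 1):
            if start + length > len(protein):
                break  # longer words would run past the end of the protein
            yield protein[start:start + length]


# Usage: a short hypothetical sequence, with word length capped at 3 for brevity.
words = list(amino_acid_words("MVALK", max_len=3))
print(words[:3])   # ['M', 'MV', 'MVA']
print(len(words))  # 12
```

Raising `max_len` to 24 would correspond to the extended dictionaries mentioned for larger genomes.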
In step 308, the list of the words occurring in each protein sequence is compiled. In an exemplary embodiment, data processor 104 may compile the list of each amino acid word occurring more than once in the protein sequences encoded by a genome. The compiled amino acid word list may be stored in data storage device 106.
In step 310, the observed frequency of each amino acid word in the protein sequence is counted and written to a count list. In an exemplary embodiment, data processor 104 may count the observed occurrences of each amino acid word in the compiled list. Data processor 104 may calculate the frequency of each amino acid word in each protein sequence encoded by the genome by dividing the observed number of occurrences for each amino acid word by the number of amino acids in the protein sequence or genome. Data processor 104 may then write a list including the frequency for each amino acid word in the protein sequences. The list containing the observed amino acid word frequency may be stored in data storage device 106.
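A minimal sketch of the counting in step 310 is given below, assuming overlapping words and assuming that the count is divided by the number of amino acids in the protein sequence (rather than in the genome). The function name `observed_frequencies` is hypothetical.

```python
from collections import Counter


def observed_frequencies(protein, min_len=1, max_len=12):
    """Map each word to (number of occurrences) / (number of amino acids)."""
    counts = Counter(
        protein[i:i + n]
        for n in range(min_len, max_len + 1)
        for i in range(len(protein) - n + 1)
    )
    total = len(protein)
    return {word: count / total for word, count in counts.items()}


# Usage: a hypothetical six-residue sequence with words of length 1 and 2.
freqs = observed_frequencies("AALAAL", max_len=2)
print(freqs["A"])   # 4 occurrences / 6 residues
print(freqs["AL"])  # 2 occurrences / 6 residues
```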
In step 312, the expected frequency of each amino acid word in each protein sequence is calculated. In an exemplary embodiment, the expected frequency of each amino acid word in a protein sequence may be derived from the probability of each amino acid word in the protein sequence occurring. Data processor 104 may calculate the probability of an amino acid word based on the probability of the occurrence of two or more amino acid subwords making up the amino acid word.
An exemplary algorithm for determining the probability of the occurrence of an amino acid word in the protein sequence may involve calculating the probability from the observed frequency of each amino acid word in the protein sequence. The probability of a 1-long amino acid word (i.e. a single amino acid) occurring within the protein sequence is equal to the frequency of the amino acid, i.e. the number of occurrences of that amino acid in a protein divided by the total number of amino acids in the protein. For example, if the amino acid “A” (for alanine) occurs 11 times in a protein of 100 amino acids, then the probability of the 1-long amino acid word p(A) is 11%. For a 2-long amino acid word, the probability may be determined to be one half of the probability of the first 1-long amino acid subword multiplied by the probability of the second 1-long amino acid subword. For example, if p(A) is 11%, and p(L) (for the 1-long amino acid word for leucine “L”) is 8%, then p(AL) (for the 2-long amino acid word “AL”) would be equal to one half of 0.11*0.08, or 0.44% (with the same probability existing for p(LA)). For N-long amino acid words (where N>2), the probability may be determined based on the probability of a 1-long amino acid subword and a (N−1)-long amino acid subword. For example, the probability of the amino acid word “VALK” occurring may be equal to the average of p(VAL)*p(K) and p(V)*p(ALK).
Using this algorithm, data processor 104 may calculate the probability of any amino acid word occurring based on the probability of two or more subwords of the amino acid word, which may be obtained using the list of observed frequencies of amino acid words in each protein. Data processor 104 may calculate the expected frequency of an amino acid word in a protein by multiplying the probability of the amino acid word occurring with the total number of amino acids in the protein. The expected amino acid word frequency for each amino acid word in each protein sequence encoded by the genome may be stored in data storage device 106.
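The exemplary algorithm may be sketched as follows. The sketch assumes that the subword probabilities are looked up directly in the list of observed frequencies, and that a subword missing from that list has probability zero; the function names and the numerical values used for "VALK" are hypothetical.

```python
def word_probability(word, observed):
    """Probability of a word from the observed frequencies of its subwords.

    observed maps each word to its observed frequency
    (occurrences divided by the number of amino acids in the protein).
    """
    if len(word) == 1:
        return observed.get(word, 0.0)
    if len(word) == 2:
        # One half of the product of the two 1-long subword probabilities.
        return 0.5 * observed.get(word[0], 0.0) * observed.get(word[1], 0.0)
    # N > 2: average of p(first N-1) * p(last) and p(first) * p(last N-1).
    head = observed.get(word[:-1], 0.0) * observed.get(word[-1], 0.0)
    tail = observed.get(word[0], 0.0) * observed.get(word[1:], 0.0)
    return (head + tail) / 2


def expected_occurrences(word, observed, protein_length):
    """Probability multiplied by the number of amino acids in the protein."""
    return word_probability(word, observed) * protein_length


# The "AL" example from the text: p(A) = 11%, p(L) = 8%.
p_al = word_probability("AL", {"A": 0.11, "L": 0.08})
print(round(p_al, 6))  # 0.0044, i.e. 0.44%

# A 4-long word with hypothetical subword frequencies.
p_valk = word_probability(
    "VALK", {"V": 0.05, "K": 0.06, "VAL": 0.002, "ALK": 0.001}
)
```

For a protein of 100 amino acids, `expected_occurrences("AL", {"A": 0.11, "L": 0.08}, 100)` evaluates to approximately 0.44.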
In step 314, a genome word dictionary is output, e.g., stored to data storage device 106 and/or transmitted to output device 108. In an exemplary embodiment, data processor 104 generates an amino acid word dictionary for each genome. The amino acid word dictionary may contain an entry for each amino acid word in each protein sequence encoded by the genome. Each entry for the amino acid word may include the word's observed frequency, expected frequency, and/or the difference between the observed and expected frequencies. After generating the amino acid word dictionary for each genome, data processor 104 may then store the amino acid word dictionary on data storage device 106 for later access. Additionally, data processor 104 may transmit electronic data including amino acid word dictionaries for each amino acid word in the genome to output device(s) 108. Output device(s) 108 may then present the amino acid word dictionaries to a user via a chart or graph, for example.
In step 316, a genome target list is read. Data processor 104 may receive the genome target list from input device(s) 102. The genome target list may be generated by a user. In an exemplary embodiment, the genome target list may be the same list of genomes read in step 302. In an alternative exemplary embodiment, the genome target list may be a list including genomes for which amino acid word dictionaries have been created, as described above in steps 304-314.
In step 318, the amino acid word dictionaries for each genome on the genome target list are read. In an exemplary embodiment, data processor 104 accesses amino acid word dictionaries stored by data storage device 106. Data processor 104 then reads the amino acid word dictionaries for each genome on the genome target list.
In step 320, the protein sequences for each genome in the genome target list are read. In an exemplary embodiment, data processor 104 may read each genome on the genome target list to determine what protein sequences it encodes in order to separately analyze each protein sequence.
In step 322, an amino acid sequence selection score is determined for the amino acid sequences in each protein sequence. In an exemplary embodiment, data processor 104 calculates an amino acid sequence selection score based on the amino acid word dictionaries for each amino acid word in the protein sequence. Data processor 104 may assign an amino acid selection score to each amino acid occurring in the protein sequence. The amino acid selection score may be calculated by summing the distances between the observed and expected frequencies for each 4-long, 5-long, and 6-long word containing the amino acid. Data processor 104 may then examine all 13-long amino acid sequences in each protein. Data processor 104 may determine an amino acid sequence selection score for each 13-long amino acid sequence in each protein sequence encoded by the genome by summing the amino acid selection scores for each amino acid contained in the amino acid sequence. The amino acid sequence selection score may be stored in data storage device 106.
In step 324, a protein selection score is determined. In an exemplary embodiment, data processor 104 may calculate a protein selection score for each protein encoded by a genome by summing the amino acid sequence selection scores for each 13-long amino acid sequence in the protein. The protein selection score may be stored in data storage device 106.
In step 326, a genome selection score is determined. In an exemplary embodiment, data processor 104 may calculate a genome selection score for the genome by summing the protein selection scores for each protein sequence encoded by the genome. The genome selection score may be stored in data storage device 106.
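The roll-up of steps 324 and 326 is a pair of summations, which might be sketched as below. The sketch assumes that the amino acid sequence selection scores for each protein are already available as a list; the function names, protein identifiers, and score values are hypothetical.

```python
def protein_selection_score(sequence_scores):
    """Sum of the selection scores of every 13-long sequence in one protein."""
    return sum(sequence_scores)


def genome_selection_score(sequence_scores_by_protein):
    """Sum of the protein selection scores over every protein in the genome."""
    return sum(
        protein_selection_score(scores)
        for scores in sequence_scores_by_protein.values()
    )


# Usage with made-up scores for two proteins.
genome = {"protein_1": [4.0, 3.0], "protein_2": [1.5]}
print(genome_selection_score(genome))  # 8.5
```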
In step 328, a genome selection score database is output. In one exemplary embodiment, the amino acid sequence selection score, the protein selection score, and the genome selection score are stored to data storage device 106. In another exemplary embodiment, data processor 104 transmits electronic data to output device 108. The electronic data may include the amino acid sequence selection score, the protein selection score, and the genome selection score. Output device 108 may then present the selection scores to a user via, for example, a chart or graph indicating the comparative height of the selection scores for the one or more structurally or functionally significant amino acid sequences.
In step 402, a distance between the observed and expected frequencies of each amino acid word is calculated. In an exemplary embodiment, data processor 104 compares the observed frequency for each amino acid word in each protein encoded by the genome with the expected frequency for each amino acid word in each protein encoded by the genome. Data processor 104 may plot a point in a two-dimensional space corresponding to the observed and expected frequencies of an amino acid word, and may then utilize a standard Euclidean distance calculation. The two dimensions may be the observed frequency and the expected frequency for amino acid words, with each plotted point corresponding to those frequencies for an amino acid word. The two dimensions may be scaled linearly or logarithmically. Data processor 104 may then compute a linear distance between the plotted point and a hypothetical 1:1 reference line in the two-dimensional space. The 1:1 reference line may correspond to points on the graph where the observed frequency is equal to the expected frequency for an amino acid word. The calculated distance may be the perpendicular distance between the observed vs. expected frequency point for an amino acid word and the 1:1 reference line, and may be calculated using Euclidean geometry.
In an alternative exemplary embodiment, data processor 104 may calculate a distance between the observed and expected frequencies for each amino acid word by determining the difference between the two frequencies through subtraction. The calculated distance between the observed and expected frequencies may be stored in data storage device 106.
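Both distance calculations may be sketched as follows. By elementary geometry, the perpendicular distance from a point (observed, expected) to the 1:1 reference line is the absolute difference of the two coordinates divided by the square root of two. The function names, the choice of base-10 logarithms for the logarithmic case, and the sign convention of the subtraction (observed minus expected) are assumptions made for illustration only.

```python
import math


def perpendicular_distance(observed, expected, log_scale=False):
    """Perpendicular distance from (observed, expected) to the line y = x."""
    if log_scale:
        # Assumption: base-10 logarithmic axes; both values must be positive.
        observed, expected = math.log10(observed), math.log10(expected)
    return abs(observed - expected) / math.sqrt(2)


def signed_difference(observed, expected):
    """Alternative embodiment: plain subtraction (sign convention assumed)."""
    return observed - expected


# Usage with hypothetical frequencies.
print(perpendicular_distance(2.0, 2.0))  # 0.0, the point lies on the 1:1 line
print(signed_difference(0.5, 0.25))      # 0.25
```

Note that the perpendicular distance is unsigned, whereas the subtraction retains whether a word is observed more or less often than expected.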
In step 404, an amino acid word dictionary is compiled for each genome. In an exemplary embodiment, data processor 104 compiles an amino acid word dictionary for each amino acid word in each protein sequence encoded by the genome. The amino acid word dictionary may include an entry for each amino acid word in each protein sequence encoded by the genome. Each entry may include the observed frequency, expected frequency, and calculated distance between the two frequencies for the amino acid word.
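One possible in-memory layout for such a dictionary is sketched below. The `WordEntry` type, the `build_dictionary` function, and the use of simple subtraction for the distance are hypothetical choices made for illustration; a word with no expected frequency on file is assumed to have an expected frequency of zero.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WordEntry:
    observed: float
    expected: float
    distance: float  # here: observed minus expected (assumed convention)


def build_dictionary(observed, expected):
    """Combine observed and expected frequency lists into one dictionary."""
    entries = {}
    for word, obs in observed.items():
        exp = expected.get(word, 0.0)
        entries[word] = WordEntry(observed=obs, expected=exp, distance=obs - exp)
    return entries


# Usage with hypothetical frequencies for a single word.
dictionary = build_dictionary({"AL": 0.02}, {"AL": 0.0044})
print(dictionary["AL"].expected)  # 0.0044
```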
In step 406, the amino acid word dictionary for each genome is stored and/or presented. In one exemplary embodiment, the amino acid word dictionary for each genome may be stored in data storage device 106. In another exemplary embodiment, data processor 104 may transmit electronic data to output device(s) 108. The electronic data may include the amino acid word dictionary for each genome. Output device(s) 108 may then present the amino acid word dictionary to a user by, for example, a chart or graph depicting the calculated distance between observed and expected frequencies for each amino acid word in each protein sequence encoded by a genome presented on a monitor or printed on paper. Electronic data transmitted to output device(s) 108 may be at least temporarily stored, e.g., in a video buffer (not shown).
The selection score for an amino acid sequence in a protein sequence may be determined based on the selection score for each amino acid in the sequence. Illustration 500 depicts a sample sequence of amino acids 502a-502l in a protein sequence. In an exemplary embodiment, data processor 104 examines every 4-long, 5-long, and 6-long amino acid word in each protein sequence. Illustration 500 depicts a series of 4-long amino acid words 504a-504e. For example, amino acid word 504a includes amino acids 502a-502d; amino acid word 504b includes amino acids 502b-502e; and so on.
Each amino acid word 504a-504e has a corresponding calculated distance between the word's observed and expected frequency, as contained in the amino acid word dictionary generated in step 314. For each examined word 504a-504e, the calculated distance for the amino acid word is added to each amino acid in the amino acid word to generate a selection score for each amino acid. For example, assume amino acid word 504a has a calculated distance of 5; word 504b has a calculated distance of 6; word 504c has a calculated distance of 4; word 504d has a calculated distance of 6; and word 504e has a calculated distance of 7. In this example, the selection score for amino acid 502d would be the sum of the calculated distances for amino acid words 504a-504d, or 21 (5+6+4+6); the selection score for amino acid 502e would be the sum of the calculated distances for amino acid words 504b-504e, or 23 (6+4+6+7).
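The arithmetic of this example can be reproduced with a short sketch. The function name `residue_scores` is hypothetical; positions are zero-indexed, so amino acids 502a-502h correspond to indices 0-7, and only the five 4-long words 504a-504e are considered.

```python
def residue_scores(sequence_length, word_distances, word_length):
    """Add each word's distance to every amino acid that the word covers.

    word_distances[i] is the distance for the word starting at residue i.
    """
    scores = [0.0] * sequence_length
    for start, distance in enumerate(word_distances):
        for position in range(start, start + word_length):
            scores[position] += distance
    return scores


# Distances for words 504a-504e taken from the example above.
scores = residue_scores(8, [5, 6, 4, 6, 7], word_length=4)
print(scores[3])  # 21.0, amino acid 502d (5+6+4+6)
print(scores[4])  # 23.0, amino acid 502e (6+4+6+7)
```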
In an exemplary embodiment, data processor 104 performs this summation for each amino acid in the protein sequence using all 4-long amino acid words (e.g. 504a-504e), 5-long amino acid words (not shown), and 6-long amino acid words (not shown). Data processor 104 may then examine all 13-long amino acid sequences in the protein. Data processor 104 may determine a selection score for each 13-long amino acid sequence in each protein sequence encoded by the genome by summing the selection scores for each amino acid contained in the amino acid sequence. For example, the selection score for 13-long amino acid sequence 506 would be the sum of the selection scores for amino acids 502a-502k. Data processor 104 may store the selection score for the amino acid sequence in data storage device 106.
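A minimal sketch of the complete scoring pass is given below. It assumes that the calculated distances are available as a mapping from word to distance and that a word absent from that mapping contributes zero; the function names, the sample sequence, and the distance value are hypothetical.

```python
def residue_selection_scores(protein, distances, word_lengths=(4, 5, 6)):
    """Sum, for each amino acid, the distances of all words that contain it."""
    scores = [0.0] * len(protein)
    for n in word_lengths:
        for start in range(len(protein) - n + 1):
            distance = distances.get(protein[start:start + n], 0.0)
            for position in range(start, start + n):
                scores[position] += distance
    return scores


def window_selection_scores(scores, window=13):
    """Selection score of every 13-long sequence: sum of its residue scores."""
    return [sum(scores[i:i + window]) for i in range(len(scores) - window + 1)]


# Usage: a hypothetical 14-residue protein yields two 13-long sequences.
per_residue = residue_selection_scores("MVALKTGHEDSRPQ", {"MVAL": 1.0})
print(window_selection_scores(per_residue))  # [4.0, 3.0]
```

A protein shorter than the window length produces an empty list of sequence scores in this sketch.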
Each graph further includes a line 606 corresponding to points where the observed and expected frequencies of each amino acid word in the protein sequences encoded by the genome are equal. For example, points falling to the right of line 606 correspond to amino acid words having an observed frequency greater than their expected frequency; points falling to the left of line 606 correspond to amino acid words having an observed frequency less than their expected frequency.
Region 608 on both graphs represents an exemplary location on each graph where amino acid words have substantially higher observed frequencies than would be expected. Amino acid sequences containing the amino acid words falling within region 608 may be sequences having high selection scores, as described above. Accordingly, amino acid sequences containing amino acid words falling within region 608 of graph 602 may be structurally or functionally significant to E. coli str. K12 bacteria, and amino acid sequences containing amino acid words falling within region 608 of graph 604 may be structurally or functionally significant to E. coli str. O157 bacteria.
Further, comparison of graphs 602 and 604 may demonstrate the differences in the genomes of non-pathogenic E. coli str. K12 and pathogenic E. coli str. O157. For example, if an amino acid word falls within region 608 of graph 604, but not within region 608 of graph 602, this may indicate that amino acid sequences containing the amino acid word are structurally or functionally significant to the pathogenic bacteria but not to the non-pathogenic bacteria. This comparison may provide further information on the different effects of natural selection on the genome of a pathogen as opposed to the effects of natural selection on the genome of a non-pathogen.
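Such a comparison might be sketched as a simple filter over two amino acid word dictionaries. The sketch assumes that each dictionary has been reduced to a mapping from word to the excess of observed over expected frequency, and that membership in a region such as region 608 can be approximated by a single threshold on that excess; the function name, the threshold, and all numerical values are hypothetical.

```python
def pathogen_specific_words(pathogen, non_pathogen, threshold):
    """Words over-represented in the pathogen but not in the non-pathogen.

    Each mapping gives, per word, observed frequency minus expected frequency.
    """
    return sorted(
        word
        for word, excess in pathogen.items()
        if excess > threshold and non_pathogen.get(word, 0.0) <= threshold
    )


# Usage with made-up excess values for three words.
pathogen = {"VALK": 0.004, "GGSG": 0.003, "AAAA": 0.0001}
non_pathogen = {"VALK": 0.0038, "GGSG": 0.0002}
print(pathogen_specific_words(pathogen, non_pathogen, threshold=0.001))
# ['GGSG']
```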
One or more of the steps described above may be embodied in computer-executable instructions stored on a computer readable storage medium. The computer readable storage medium may be essentially any tangible storage medium capable of storing instructions for performance by a general or specific purpose computer such as an optical disc, magnetic disk, or solid state device, for example.
Although the invention is illustrated and described herein with reference to specific embodiments, the invention is not intended to be limited to the details shown. Rather, various modifications may be made in the details within the scope and range of equivalents of the claims and without departing from the invention.
This application is related to and claims the benefit of U.S. Provisional Application No. 61/208,513 entitled Systems and Methods for Identifying Structurally or Functionally Significant Amino Acid Sequences filed on Feb. 25, 2009, the contents of which are incorporated fully herein by reference.
Number | Date | Country |
---|---|---|
61208513 | Feb 2009 | US |

Relation | Number | Date | Country |
---|---|---|---|
Parent | 12546285 | Aug 2009 | US |
Child | 13591743 | | US |

Relation | Number | Date | Country |
---|---|---|---|
Parent | 13591743 | Aug 2012 | US |
Child | 15076454 | | US |