This invention relates to a method of annotation of genome sequences.
Many genomes, including the human genome, have now been sequenced. A genome sequence provides a list of bases (A, T, G, C) in the order in which they appear in a length of DNA. However, the sequence per se tells one very little about the genome that is useful and easily or immediately comprehensible. For example, in the study of a disease-causing bacterium it would be useful, in searching for a cure for the disease, to determine the location of that part of the bacterium's genome which expresses a particular protein. However, it can be difficult to predict where proteins of interest may be located in a genome sequence. It cannot always be done simply by looking at the sequence per se.
There are a number of known processes for attempting to determine the location of proteins in genome sequence data. The most widely used methods for annotation are pattern searching and sequence comparison techniques. One other known method uses computer programs to locate recognisable regions such as start codons and stop codons in a DNA sequence. Other programs attempt to locate proteins by locating regions of high complexity within a DNA sequence, which typically indicate the location of a protein.
However, these approaches are far from perfect because, in order to implement these programs, various assumptions and hypotheses have to be made about the location of a protein of interest in the DNA sequence, in particular the potential start and stop positions of the protein. A detection method that requires such assumptions or hypotheses may produce incorrect results if the assumptions/hypotheses are incorrect. For example, these procedures are unlikely to locate non-typical sequences, which ironically may be of more interest than other proteins having more typical sequences identified using existing techniques.
Thus, it is one object of the present invention to provide a method for annotating genome sequences, which is hypothesis independent and does not make assumptions for the detection of a protein from nucleic acid sequences.
Any discussion of documents, acts, materials, devices, articles or the like which has been included in the present specification is solely for the purpose of providing a context for the present invention. It is not to be taken as an admission that any or all of these matters form part of the prior art base or were common general knowledge in the field relevant to the present invention as it existed in Australia before the priority date of each claim of this application.
A first broad aspect of the present invention provides a method of identifying one or more proteins in an unannotated DNA sequence, the method comprising:
(a) dividing the DNA sequence into a plurality of sequence fragments each fragment being of substantially the same length and from about 300 to 5000 bases long;
(b) performing a six frame translation of each of the DNA sequence fragments to obtain six translated amino acid sequence fragments for each DNA sequence fragment;
(c) subjecting each of the translated sequence fragments to theoretical digestion to obtain a plurality of cleaved peptide sequences;
(d) comparing experimental empirical data for peptide fragments from a protein digested in the same manner as the theoretical digestion at step (c) with the theoretical data generated in step (c) for each of the translated sequence fragments to identify one or more translated sequence fragments which include a significant number of peptides present in the digested protein.
Thus the present invention identifies a region of a genome that encodes a protein and optimally defines the open reading frame and therefore the sequence of the protein from the genome. An advantage of the present invention is that no assumptions need to be made about the location of proteins in the DNA sequence data. DNA sequences with non-typical stop and/or start codons may be located. The results are hypothesis independent.
Typically the theoretically generated peptide masses are compared to the masses of the peptides experimentally generated by the digested protein and the sequence fragment which has the greatest number of theoretical peptide masses correlating to the empirical data indicates the likely location of the protein of interest in the DNA sequence. The masses of the peptides experimentally generated from the digested protein will typically be determined by mass spectrometry.
It is preferred that the DNA sequence is duplicated and the original and duplicate are split in such a manner that the sequence fragments from the duplicate overlap the cuts in the original genome sequence.
It is important that the sequence fragments are approximately the same length as one another and are sized to equate to the length of a typical protein. Hence, each fragment is, as discussed above, about 300-5000 bases long. Proteins vary in size, most proteins being 10 to 100 kDa i.e. about 300-3000 bases long. Most preferably, the sequence fragments will be around 1000 or 1050 bases long, the latter translating to 350 amino acids which is approximately equivalent to a 33 to 37 kDa protein, which is a common size for a protein.
Using DNA sequences of approximately that length produces about 12 to 20 peptide matches for sequences which contain a protein, against a background number of matches of commonly around 1 or 2, and up to around 4, for sequences which do not contain a protein.
In a related aspect of the present invention, the step of dividing the DNA sequence and the step of performing the six frame translation can be reversed. Hence, a second broad aspect of the present invention provides a method of identifying one or more proteins in unannotated DNA sequence, the method comprising:
(a) performing a six frame translation of a DNA sequence to provide six translated amino acid sequences;
(b) dividing the six translated amino acid sequences into a plurality of fragments, each fragment comprising 100-1666 amino acids;
(c) subjecting each of the fragments to theoretical digestion to obtain a plurality of cleaved peptide sequences;
(d) comparing experimental empirical data for peptide fragments from a protein digested in the same manner as the theoretical digestion at step (c) with theoretical data generated in step (c) for each of the fragments to identify one or more fragments which include a significant number of peptides present in the empirically digested protein.
A specific embodiment of the present invention will now be described by way of example with reference to the accompanying drawings.
Referring to the drawings,
The segments are overlapped to facilitate the process of identifying the region of the genome coding for the protein of interest. In some cases, the peptide masses from the protein of interest could be distributed across two adjacent segments, with a portion of the peptide masses at the end of one segment and a second portion at the start of the next segment. This means the number of peptide masses on each of the two segments will be closer to the background number of random, “noise” matches found on the remaining segments, making it harder to identify the hit. However, by using overlapping segments, the peptides at the end of one segment and the start of the next will all be located on the common, overlapping segment. This means the number of peptides on the common, overlapping segment will be further from the background number of random, “noise” matches, making it easier to identify this segment as the correct location of the protein-coding region in the genome.
In principle, the overlap is not absolutely necessary for the method to work but it is significant in distinguishing a hit from background “noise”, particularly in the case of relatively small proteins. For example if overlapping were not used and a relatively small protein fell equally between two adjacent segments, only three or four hits might be obtained for each segment. This would not be distinguishable over the background “noise” of typically about 4 hits, so it would not identify the protein. Using overlapping segments, there is a good chance the smaller protein would fall in a single fragment, and the number of hits would be maximised and so facilitate the identification.
Typically, the genome will be cut into sequence fragments which are 1050 bases long. This approximates to 350 amino acids which will be found in a protein of around 33 to 37 kDa which is a common protein size. A bacterium such as Mycobacterium tuberculosis (Tb) will have around 4.4 million bases in its genome. Duplicating and cutting that genome will result in approximately 8400 sequence fragments.
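The duplication and cutting described above can be sketched as follows. This is a minimal illustration only: the function name `segment_genome` is hypothetical, and the half-segment offset between the original and the duplicate is an assumption, since the description states only that the two sets of fragments overlap.

```python
def segment_genome(sequence, length=1050):
    """Cut a sequence into fixed-length fragments using two offset tilings.

    Returns a list of (start, fragment) pairs sorted by start position.
    The second tiling stands in for the "duplicate" genome; its offset of
    half a segment is an assumption made for this sketch.
    """
    offset = length // 2
    segments = []
    for shift in (0, offset):
        for start in range(shift, len(sequence), length):
            # The last fragment of each tiling may be shorter than `length`.
            segments.append((start, sequence[start:start + length]))
    return sorted(segments)


if __name__ == "__main__":
    # A placeholder sequence of 4.4 million bases, not real genome data.
    genome = "A" * 4_400_000
    print(len(segment_genome(genome)))  # 8381 fragments
```

Under these assumptions a sequence of 4.4 million bases gives 8381 fragments, consistent with the figure of approximately 8400 stated above.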
The genome is segmented to enable easier identification of the protein-coding region of the genome. The genome is segmented into fixed sections, regardless of the length or possible location of the protein coding regions. Hence, the number of background or random matches to the peptide masses is reasonably constant and this then helps to identify the protein coding regions. When the number of matches against a region exceeds the number of random matches on other segments, a protein-coding region is indicated.
If the genome were not segmented, it would be difficult to determine when a concentration of hits was indicative of a protein-coding region. It would be necessary to look for a certain number of hits in a certain length region, but the exact value of these parameters would need to be pre-determined and may affect the results.
Each segment of the genome simulates a protein (the translation of a certain region of a genome). By segmenting, the peptide mass analysis is analogous to peptide mass fingerprinting. This allows the use of a number of existing PMF search engines to do the analysis. Most advantageously, the present invention addresses a very complex problem of mining of genomes with proteomic data but presents the results of this in a way which is completely familiar and highly understandable to the proteomics researcher which does not require the researcher to relearn a new tool or paradigm.
Further, segmenting the genome has advantages in terms of computational performance. In particular, working with a whole genome at once is likely to be demanding in terms of computer memory. Smaller segments can be analysed sequentially and thus require less memory at any particular point in the calculation.
A six frame translation is then carried out on each of the sequence fragments.
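A six frame translation can be sketched as below, assuming the standard genetic code. The function names are illustrative only; stop codons are written as `*` and any codon containing an unrecognised base is written as `X`, both of which are conventions chosen for this sketch rather than requirements of the method.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def translate(dna):
    """Translate one reading frame, ignoring a trailing partial codon."""
    codons = (dna[i:i + 3] for i in range(0, len(dna) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def six_frame_translation(dna):
    """Return three forward-strand and three reverse-strand translations."""
    dna = dna.upper()
    reverse = dna.translate(COMPLEMENT)[::-1]
    return [translate(strand[frame:])
            for strand in (dna, reverse)
            for frame in range(3)]


if __name__ == "__main__":
    print(six_frame_translation("ATGGCC"))
```

Each sequence fragment therefore yields six candidate amino acid sequences, which are kept separate in the later matching step.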
The masses of the various empirically derived peptides are then compared with the theoretical peptide masses produced by theoretical cleavage of the sequence fragments. This is done in a stepwise manner and frame by frame whereby all the empirical peptide masses are matched against all peptides from the first virtual protein and the number of matching peptides (matches or “hits”) is recorded. For each virtual protein, this process is carried out six times, once for each of the amino acid translations. However, the number of matches for each frame is calculated separately and the matches are not summed together. This process is then repeated for the second virtual protein and so on, until it has been carried out for all the virtual proteins. This step is illustrated in
Clearly the relevant part of the genome sequence may have been cut in the original division of the genome sequence, however the overlapping of the original and duplicate genome sequences reduces the risk of this. Even if the protein is split it may still be possible to identify the relevant part of the genome sequence if there are a reasonable number of hits, e.g. 6 to 10, in two adjacent overlapping fragments. The part of the sequence which carries the most peptide masses which match the peptide masses produced by the empirical digestion and has a number of hits which is clearly above the background (noise) level is likely to be that part of the genome which carries the protein of interest. By knowing where the part of the sequence came from this identifies the location of the protein in the genome sequence (
FIGS. 5 to 10 illustrate the results of carrying out the method of the present invention,
A culture of Mycobacterium tuberculosis was used as the source of proteins for experimental analysis. The sample was prepared and the proteins separated using 2D gel electrophoresis. A number of spots were cut from the gel, digested with trypsin, and the peptides resulting from the digestion were analysed with MALDI mass spectrometry. These peaks were analysed using standard peptide mass fingerprinting to identify the proteins contained in each spot,
The genome of M. tuberculosis was segmented into 1050 base pair segments, translated, and theoretically digested using the process described above. The peaks were searched against the genome using the method of the present invention as described above.
The peaks from a first spot were searched with 0.1 Da error tolerance, allowing for cysteines to be modified by iodoacetamide and for methionine sulfoxide modifications, and a minimum to match of four hits.
A second spot from the gel was then searched.
Both these proteins were found in this spot using standard peptide mass fingerprinting. These proteins did not stand out as clearly as in the previous spot, but were still identifiable. This demonstrates the process described in the patent application can also work when multiple proteins are located in the one spot and when the proteins being searched for are relatively small.
An incorrect hit is shown in
The method can be applied to higher order genomes including the human genome. To demonstrate this the genome sequence of chromosome 22 of Homo sapiens was prepared and searched using the method described above. A theoretical peak list was generated using the sequence of Q9BWW9 (Apolipoprotein L5) known to be located on chromosome 22. This peak list was searched against the genome using the method described in the patent application using an error tolerance of 0.1 Da and a minimum to match of 10.
A series of computational simulations were run in order to demonstrate the method and determine the optimum parameters for the method. The simplest simulation involved taking the set of known proteins for Pseudomonas aeruginosa. The set of 773 known proteins was taken from SWISS-PROT. Each protein was theoretically digested according to the cleavage rules of trypsin. Tryptic peptides whose mass was less than 400 Da were discarded, as these masses are not usually seen on a typical MALDI mass spectrum. The remaining tryptic peptides of each protein in turn were searched against the raw genome using the method described in the patent application. The region of the genome coding for the protein was determined by finding the segment with the highest number of matching peptides. The nearest incorrect hit was determined by finding the segment with the next highest number of peptides, excluding those segments connected to the segment with the highest number of peptides through a chain of overlapping segments. This is illustrated in
In order to summarise this information, the proteins were binned according to the number of tryptic peptides with mass greater than 400 Da generated from them in a theoretical digestion. The first bin contained all protein with 1 to 10 peptides, the second all proteins with 11 to 20 peptides, etc. The number of matching peptides in the best hit for each of the proteins in the bin was averaged, as was the number of matching peptides in the nearest incorrect hit. These two numbers were plotted as in
The results showed a distinct difference between the best hit and the best of the incorrect hits. The average second best hit has about four to five matching peptides for small query proteins, increasing to around nine to ten matching peptides for larger proteins. For a set of peptides to clearly be identified with a particular region of the genome, they must match more than this number of peptides. This is shown in the figure where the average number of matching peptides in the best hit is significantly higher than the second best hit. For large proteins, the average number of peptide matches approaches 25. This number is limited by the size of the segment as only a certain number of peptides can be expected to fit in the 1050 base pair segment. For smaller proteins, the difference between the first and second hits decreases as there are fewer peptides in the query sequence, but it can be seen that for all but the smallest proteins, a difference between the two hits is maintained with the average number of matches in the best hit around six to seven.
Several variations on the simulation were done to estimate the effect of different parameters involved in using the method.
1) Increasing the minimum to match increases the difference between the two curves.
In an application of the method described, the minimum to match should take a value between four and nine, as this is the range for background hits determined in the experiment outlined above. Generally, a high value would be used first to screen out as much background noise as possible. This value would be gradually lowered, if necessary, until a region with a significant matching number of peptides is found.
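The gradual lowering of the minimum to match can be sketched as follows. The function name, the dictionary layout and the stopping rule (stop at the first threshold at which any segment qualifies) are assumptions made for illustration; in practice, whether a region is significant remains a judgment made against the background level.

```python
def find_candidates(hits_by_segment, high=9, low=4):
    """Lower the minimum to match from `high` to `low` until a segment qualifies.

    `hits_by_segment` maps a segment identifier to its number of matching
    peptides. Returns (threshold, qualifying segments), or (None, {}) if no
    segment reaches even the lowest threshold.
    """
    for threshold in range(high, low - 1, -1):
        candidates = {segment: hits
                      for segment, hits in hits_by_segment.items()
                      if hits >= threshold}
        if candidates:
            return threshold, candidates
    return None, {}


if __name__ == "__main__":
    # Hypothetical hit counts keyed by segment start position.
    print(find_candidates({0: 2, 1050: 7, 2100: 3}))
```

The default range of nine down to four follows the range of background hits given above.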
2) Increasing the size of the segments increases the difference between the two curves. The number of random matches in the second best hit increases slightly, but the number of matches on the best hit increases significantly. A very long segment length is not used because, once all query proteins are smaller than the size of the segment, no further improvement is obtained, and the bigger the segment is, the harder it is to locate smaller proteins. In an application of the method described, 1050 base segments are used because this represents a good balance between the two.
3) Changing the composition of the query peak list by adding random peptides has almost no effect on the curves.
In an application of the method described, the peak list is determined by the data extracted from the mass spectrometer. The amount of real peaks and noise peaks is not known in advance.
4) Decreasing the error tolerance for the match between the query masses and the genome masses increases the difference between the two curves. This is because the query masses are less likely to match another mass in the genome through random chance, as the difference in mass tolerated when accepting a match is much smaller.
In an application of the method described, the error tolerance is usually taken in the range of 0.01 to 0.2 Da for experimental masses derived from MALDI mass spectrometry. The value is usually chosen to reflect the accuracy of the technique used to acquire the experimental masses. A typical value is 0.1 Da.
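The role of the error tolerance in counting matches can be illustrated with a short sketch. The function name is hypothetical, and counting each experimental mass at most once is an assumption of this sketch; the description does not specify how multiple theoretical peptides within tolerance of one experimental mass are treated.

```python
from bisect import bisect_left


def count_matches(experimental, theoretical, tolerance=0.1):
    """Count experimental masses lying within `tolerance` Da of any theoretical mass."""
    sorted_theoretical = sorted(theoretical)
    hits = 0
    for mass in experimental:
        # Index of the first theoretical mass not below the lower bound.
        i = bisect_left(sorted_theoretical, mass - tolerance)
        if i < len(sorted_theoretical) and sorted_theoretical[i] <= mass + tolerance:
            hits += 1
    return hits


if __name__ == "__main__":
    # Invented masses, for illustration only.
    observed = [1000.05, 1500.0, 2000.3]
    predicted = [1000.0, 2000.0]
    print(count_matches(observed, predicted, tolerance=0.1))  # 1
    print(count_matches(observed, predicted, tolerance=0.5))  # 2
```

Widening the tolerance from 0.1 Da to 0.5 Da in this example admits an extra match, which mirrors the observation that a looser tolerance allows more chance matches.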
In an application of the method, the peak list used as input consists of the masses of the proteolytic peptides determined by mass spectrometry. The raw spectrum acquired from the mass spectrometer contains many “noise” peaks. Most of these are removed by using a peak-picking algorithm such as the one outlined in Breen et al. (2000, in press) [Breen, E. J., Hopwood, F. G., Williams, K. L., Wilkins, M. R. (2000) Automatic Poisson peak harvesting for high throughput protein identification, Electrophoresis, 21, 2243-2251; Breen, E. J., Holstein, W. L., Hopwood, F. G., Smith, P. E., Thomas, M. L., Wilkins, M. R. (2003) Automated peak harvesting of MALDI-MS spectra for high throughput proteomics. Spectroscopy. In press.].
In the simulated testing described above, the peaks used were the masses calculated from the sequence of theoretically cleaved peptides. Masses under 400 Da were excluded because a MALDI mass spectrometer cannot generally measure peptide masses in this range.
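A peptide mass calculation with the 400 Da cut-off might look like the sketch below. It assumes monoisotopic residue masses and the neutral, unmodified peptide mass (residues plus one water); the description does not state whether monoisotopic or average masses, or protonated masses, were used, so these choices and the function names are illustrative only.

```python
# Monoisotopic residue masses in Da (standard values for the 20 common amino acids).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056


def peptide_mass(peptide):
    """Neutral monoisotopic mass of an unmodified peptide."""
    return sum(RESIDUE_MASS[residue] for residue in peptide) + WATER


def filter_by_mass(peptides, minimum=400.0):
    """Keep (peptide, mass) pairs whose mass is at least `minimum` Da."""
    masses = ((peptide, peptide_mass(peptide)) for peptide in peptides)
    return [(peptide, mass) for peptide, mass in masses if mass >= minimum]


if __name__ == "__main__":
    print(filter_by_mass(["GG", "WWW"]))
```

Peptides containing characters outside the table, such as a stop symbol from a translated segment, raise a `KeyError` here and would need to be handled before this step.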
The implementation of the methods described in the above examples assumes the enzyme used to digest the gel spots is trypsin. This is the most common enzyme used experimentally. Thus the theoretical digestion of the segments is also done using the cleavage rules of trypsin.
The method can use any appropriate enzyme to digest the experimental proteins. In this case the theoretical digestion of the genome segments needs to use the cleavage rules for the enzyme to be used in the experimental analysis.
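A theoretical digestion with a configurable cleavage rule can be sketched as follows. The defaults encode the commonly used trypsin rule (cleave after K or R unless the next residue is P); the description does not spell out the rule it applied, so this rule, the function name and the parameter names are assumptions.

```python
def digest(protein, cleave_after="KR", blocked_by="P"):
    """Split a sequence after each residue in `cleave_after`.

    A site is skipped when the following residue is in `blocked_by`.
    Other enzymes can be approximated by passing different residue sets.
    """
    peptides = []
    start = 0
    for i, residue in enumerate(protein):
        at_end = i + 1 == len(protein)
        if residue in cleave_after and (at_end or protein[i + 1] not in blocked_by):
            peptides.append(protein[start:i + 1])
            start = i + 1
    if start < len(protein):
        # Trailing residues after the last cleavage site form the final peptide.
        peptides.append(protein[start:])
    return peptides


if __name__ == "__main__":
    print(digest("MKRPAKG"))  # ['MK', 'RPAK', 'G']
```

This sketch does not treat stop codons specially; how a translated segment is handled at a stop is left open here.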
If the experimental analysis is done with multiple enzymes it is possible to use the findings from multiple searches with each of the enzymes to confirm the identification of the region of the genome. If both analyses identify a certain region of the genome as a possible protein-coding region, then the region is more likely to be correctly identified as such. It is possible that each analysis may not have enough hits to be clearly distinguished from the background but, because multiple analyses indicate the same region, it can still be identified as the protein-coding region.
In a particular application, a combined search could be implemented where a search is carried out with trypsin and the hits are tallied to each segment, then a search is carried out with other enzymes and hits are tallied to each segment. Finally, the hits to each segment from the two searches are summed to give a composite score per segment. Only hits that are in the same frame are summed. This combined approach would dramatically increase the sensitivity of identification.
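The composite score can be sketched by keying each tally on both segment and frame, so that only hits in the same frame are summed. The data layout and function name are assumptions made for illustration, and the hit counts in the usage example are invented.

```python
from collections import Counter


def combine_searches(*searches):
    """Sum hit counts across searches, keyed by (segment, frame).

    Each search is a mapping from (segment_start, frame) to a hit count,
    so hits in different frames of the same segment are never added together.
    """
    combined = Counter()
    for hits in searches:
        combined.update(hits)  # Counter.update adds counts rather than replacing them
    return dict(combined)


if __name__ == "__main__":
    trypsin_hits = {(0, 0): 3, (1050, 2): 5}
    second_enzyme_hits = {(0, 1): 4, (1050, 2): 6}
    print(combine_searches(trypsin_hits, second_enzyme_hits))
```

In this example only segment 1050, frame 2 is supported by both searches, and its composite score of 11 stands out more clearly than either individual count.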
It is also possible to take missed cleaved peptides and modified peptides into account. When the cleavage rules are used to determine the theoretical peptides, the sequence of peptides resulting from a missed cleavage can also be calculated. This allows the mass of these peptides to also be determined. During the application of the method of the present invention these masses can also be compared to experimental masses. Similarly, one can calculate the mass of a modified form of each of the peptides and check these masses also when comparing against the experimental masses.
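Missed cleavage peptides can be derived from the ordered list of fully cleaved peptides by joining neighbours, as in the sketch below. The function name and the `max_missed` parameter are hypothetical. Modified peptides are not shown; they could be handled by adding the mass shift of the modification to the calculated peptide mass.

```python
def with_missed_cleavages(peptides, max_missed=1):
    """Return all peptides spanning up to `max_missed` uncleaved sites.

    `peptides` must be the fully cleaved peptides in sequence order.
    """
    result = []
    for i in range(len(peptides)):
        for missed in range(max_missed + 1):
            if i + missed < len(peptides):
                # Joining `missed + 1` adjacent peptides skips `missed` sites.
                result.append("".join(peptides[i:i + missed + 1]))
    return result


if __name__ == "__main__":
    print(with_missed_cleavages(["MK", "RPAK", "G"], max_missed=1))
    # ['MK', 'MKRPAK', 'RPAK', 'RPAKG', 'G']
```

Allowing missed cleavages enlarges the set of theoretical masses per segment, so it can be expected to raise the number of chance matches as well as true ones.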
The method can be automated by writing an application or script to take a series of peak lists and submit each in turn to a search against the genome. The results of this search can be databased and reviewed at a later time to determine the correct hit.
The present invention works particularly well with small genomes such as bacterial and yeast genomes or other eukaryote genomes that have few introns and small amounts of non-coding DNA.
The method can also be used for the detection of pseudogenes, which are versions of genes which have become defunct, and for identifying “protein families” of similar proteins. When a protein from a family of proteins is detected, a number of regions having a large number of matches may be identified. This indicates that the proteins may be members of the same protein family which may, for example, be expressed in different tissues.
It will be appreciated by persons skilled in the art that numerous variations and/or modifications may be made to the invention as shown in the specific embodiments without departing from the spirit or scope of the invention as broadly described. The present embodiments are, therefore, to be considered in all respects as illustrative and not restrictive.
Number | Date | Country | Kind |
---|---|---|---|
PS 1118 | Mar 2002 | AU | national |
Filing Document | Filing Date | Country | Kind | 371c Date |
---|---|---|---|---|
PCT/AU03/00300 | 3/13/2003 | WO | | 4/27/2005 |