METHOD FOR COMPARISON OF DNA BASE SEQUENCES

Information

  • Patent Application
  • Publication Number
    20010010903
  • Date Filed
    March 30, 1998
  • Date Published
    August 02, 2001
Abstract
A method for comparing DNA base sequences by comparing similarities between two amino acid sequences translated from two DNA base sequences, respectively, comprises (1) a step of dividing each of the first DNA base sequence and the second DNA base sequence into groups of successive three nucleotides each, translating each of these groups of nucleotides into an amino acid, and thereby obtaining a first amino acid sequence and a second amino acid sequence, (2) a step of determining similarities between each amino acid of the first translated amino acid sequence and each amino acid of the second translated amino acid sequence in view of nucleotide insertions or deletions in the first and second DNA base sequences and amino acid insertions or deletions in the first and second translated amino acid sequences, accumulating the thus determined similarities, and thereby determining a combination of each amino acid of the first amino acid sequence and a corresponding amino acid of the second amino acid sequence which gives the maximum accumulated similarity, and (3) a step of outputting the maximum accumulated similarity, the alignment of the first and second translated amino acid sequences, the alignment of the first translated amino acid sequence and the first DNA base sequence, and the alignment of the second translated amino acid sequence and the second DNA base sequence.
Description


BACKGROUND OF THE INVENTION

[0001] 1. Field of the Invention


[0002] The present invention relates to a method for comparison of DNA base sequences and a method for search for DNA base sequences. In particular, it relates to a method for high-sensitivity detection of similarities between DNA base sequences and a method for estimation of an amino acid sequence coded for by a DNA base sequence.


[0003] 2. Description of the Related Art


[0004] In recent years, it has become increasingly common to determine the DNA base sequences of various organisms and to analyze the function of the protein coded for by each DNA base sequence. A DNA base sequence is a sequence of four kinds of bases, A, C, G and T, and portions of the DNA base sequence code for biofunctional proteins. Of these proteins, those having an important function can be utilized, for example, for the design and development of drugs, and a technique for accurately estimating the function of the protein coded for by a DNA base sequence is therefore desired. In general, determining a DNA base sequence is technically easier than experimental protein sequencing.


[0005] The function of a protein coded for by a newly determined DNA base sequence is estimated as follows: the DNA base sequence is translated into an amino acid sequence (which permits protein sequencing) by using the well-known codon table (each of the starting point of translation into amino acids, the terminating point of translation into amino acids and the kinds of amino acids are prescribed in terms of a triplet nucleotide unit (a codon unit)), and the result of the protein sequencing is compared with data on a protein having a known function, to judge whether the proteins are similar or not.


[0006] In a DNA base sequence, the exon region coding for protein information is the region to be translated into amino acids. The codons are unequivocally translated into the amino acids. When the direction of translation of the DNA base sequence and the translation starting point are known, the DNA base sequence can be translated into an amino acid sequence, i.e., a protein, by picking out triplets of successive nucleotides from the DNA base sequence in succession. However, if there is an error due to a nucleotide insertion or deletion in the DNA base sequence, the reading frame of the exon region is shifted from that position onward. Since the DNA base sequence is translated into amino acids in codon units, the portion after the nucleotide insertion or deletion is translated into utterly different amino acids.
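The effect of a single inserted nucleotide on codon-by-codon translation can be illustrated with a minimal Python sketch. The example sequence and the names CODON_TABLE and translate are hypothetical, introduced only for illustration; the table is the standard genetic code written in T, C, A, G order with stop codons shown as "*", and the input is assumed to contain only the upper-case letters A, C, G and T.

```python
from itertools import product

# Standard genetic code in T, C, A, G order; "*" marks a stop codon.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def translate(dna):
    """Translate successive, non-overlapping triplets starting at the first base."""
    usable = len(dna) - len(dna) % 3  # ignore an incomplete trailing codon
    return "".join(CODON_TABLE[dna[p:p + 3]] for p in range(0, usable, 3))


# Hypothetical example: one extra base "T" inserted after the first codon.
original = "ATGGCCAAAGGG"
with_insertion = original[:3] + "T" + original[3:]

print(translate(original))        # MAKG
print(translate(with_insertion))  # MCQR
```

Only the first amino acid survives the insertion; every codon after the inserted base is read in a different frame.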


[0007] For comparing two DNA base sequences by translating them into amino acid sequences, respectively, and comparing these translated amino acid sequences, the translated amino acid sequences should be determined from the respective DNA base sequences.


[0008] FIG. 1 is a diagram illustrating 6 kinds of reading frames in a DNA base sequence in the translation of the DNA base sequence into an amino acid sequence [(first prior art): for example, reference 1: Biotechnology textbook series 11 “Introduction of Computer in Biotechnology” written by Haruki Nakamura and Kenta Nakai, pp. 66-67 (1995), CORONA PUBLISHING CO., LTD., Tokyo, Japan].


[0009] The 6 kinds of the reading frames are as follows:


[0010] Frame (1): a frame according to which a DNA base sequence is translated into an amino acid sequence as codon units from the 5′-terminal of the DNA base sequence.


[0011] Frame (2): a frame according to which the DNA base sequence is translated into an amino acid sequence as codon units while shifting the starting position of each codon by one base from that in frame (1).


[0012] Frame (3): a frame according to which the DNA base sequence is translated into an amino acid sequence as codon units while shifting the starting position of each codon by two bases from that in frame (1).


[0013] Frame (4): a frame according to which the translation of a sequence complementary to the DNA base sequence into an amino acid sequence as codon units is initiated from the 5′-terminal of the complementary sequence.


[0014] Frame (5): a frame according to which the complementary sequence is translated into an amino acid sequence as codon units while shifting the translation starting position by one base from that in frame (4).


[0015] Frame (6): a frame according to which the complementary sequence is translated into an amino acid sequence as codon units while shifting the translation starting position by two bases from that in frame (4).


[0016] From frame (1) to frame (3), the translation starting position is shifted base by base from the 5′-terminal. From frame (4) to frame (6), the translation starting position is shifted base by base from the 5′-terminal of the sequence complementary to the original DNA base sequence (the 3′-terminal of the original DNA base sequence). Therefore, there are the six kinds of the reading frames (1) to (6). A DNA base sequence is translated into an amino acid sequence by employing each of frames (1) to (6). Amino acid sequences translated from two DNA base sequences, respectively, by employing the same frame are compared. Thus, 6 kinds, in all, of amino acid sequences translated from one of the DNA base sequences are compared with those translated from the other DNA base sequence.
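The six reading frames (1) to (6) can be sketched in Python as follows. This is only an illustration under stated assumptions: the function names and the example sequence are hypothetical, the standard genetic code is used, the complementary strand is read from its own 5′-terminal (i.e., as the reverse complement of the original sequence), and the input contains only A, C, G and T.

```python
from itertools import product

# Standard genetic code in T, C, A, G order; "*" marks a stop codon.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Complementary strand written from its own 5'-terminal."""
    return dna.translate(COMPLEMENT)[::-1]


def translate_frame(dna, offset):
    """Translate non-overlapping codons starting `offset` bases from the 5'-terminal."""
    codons = (dna[p:p + 3] for p in range(offset, len(dna) - 2, 3))
    return "".join(CODON_TABLE[c] for c in codons)


def six_frames(dna):
    """Return the translations for frames (1)-(3), then frames (4)-(6)."""
    strands = (dna, reverse_complement(dna))
    return [translate_frame(s, off) for s in strands for off in range(3)]


print(six_frames("ATGGCCAAA"))  # ['MAK', 'WP', 'GQ', 'FGH', 'LA', 'WP']
```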


[0017] As a typical program for searching similar sequences, there is widely known BLAST developed by Altschul et al. of NCBI, a branch of U.S. NIH, the source program of which has been disclosed [(second prior art): see, for example, the first reference, pages 141 to 143]. The BLAST family includes BLASTN for comparing DNA base sequences, BLASTP for comparing amino acid sequences, BLASTX for searching for each of 6 kinds of amino acid sequences mechanically translated from a DNA base sequence according to each of the above-mentioned 6 kinds of the frames, by using an amino acid sequence data base, and TBLASTX for mechanically translating each of a query DNA base sequence as a first DNA base sequence and a DNA base sequence read out of a DNA base sequence data base (a target DNA base sequence) as a second DNA base sequence according to each of the above-mentioned 6 kinds of the frames, and comparing 36 combinations of 6 kinds of amino acid sequences translated from the first DNA base sequence and 6 kinds of amino acid sequences translated from the second DNA base sequence. In the case of the BLAST family, high-speed pattern matching of a base sequence having a definite length in a query DNA base sequence with a target DNA base sequence is carried out first, and a region similar to the query DNA base sequence is then detected on the basis of the position of the base sequence with the definite length detected in the target DNA base sequence.


[0018] In the Smith-Waterman method, each base of a query DNA base sequence is compared with each base of a target DNA base sequence, a score (a similarity) suitable for the combination of the two bases is given, the scores (similarities) thus given are accumulated, and there is sought a path (an alignment) in which the accumulated score (similarity) becomes maximum [(third prior art): for example, reference 2: “Identification of Common Molecular Subsequences”, J. Mol. Biol., 147 (1981), pp. 195-197].


[0019] In the third prior art, the combinations of two bases of two DNA base sequences, respectively, are compared by a dynamic programming method, and scores between the two DNA base sequences are determined. When a DNA base sequence similar to a specific noted DNA base sequence (hereinafter referred to as “query DNA base sequence” or “first DNA base sequence”) is searched for in a DNA base sequence data base, a matrix is formed by aligning the bases of the query DNA base sequence (number of bases: M) in regular order from the 5′-terminal along a first axis (for example, x-axis) and the bases of a DNA base sequence (number of bases: N) read out of the DNA base sequence data base (hereinafter referred to as “target DNA base sequence” or “second DNA base sequence”) in regular order from the 5′-terminal along a second axis (for example, y-axis) (in the present specification, such a matrix is hereinafter referred to as “score matrix”) (FIG. 2).


[0020] FIG. 2 is a diagram illustrating accumulation paths of scores for comparing the first and second DNA base sequences. Each combination of the two bases of the first and second DNA base sequences, respectively, is expressed as the position of a score matrix element (i, j) (i=1, 2, ..., M; j=1, 2, ..., N).


[0021] In the dynamic programming method, shift paths (search paths) in three directions, the diagonal direction, the vertical direction and the horizontal direction (the directions a, b and c, respectively, shown in FIG. 2) to a score matrix element (i, j) are considered, and the position of (i, j) is shifted toward a score matrix element (M, N) at the lower right corner from the score matrix element (1, 1) at the upper left corner shown in FIG. 2, by changing the number i from 1 to M and the number j from 1 to N, whereby there is determined the optimum path (the optimum alignment) which shows the optimum combinations for similarities of the bases of the first DNA base sequence and the bases of the second DNA base sequence.


[0022] The value H(i, j) of a score matrix element (i, j) indicates an accumulated similarity (score) between a base sequence from the first base to the i-th base in the first DNA base sequence and a base sequence from the first base to the j-th base in the second DNA base sequence. For the shift paths in the directions a, b and c shown in FIG. 2, the accumulated similarities (scores), Ha(i, j), Hb(i, j) and Hc(i, j), respectively, are defined by the (equation 1), (equation 2) and (equation 3) shown below, by using a score s(i, j) indicating the similarity between the i-th base of the first DNA base sequence and the j-th base of the second DNA base sequence, a gap penalty score p and accumulated similarities (scores) H(i−1, j−1), H(i−1, j) and H(i, j−1) at score matrix elements (i−1, j−1), (i−1, j) and (i, j−1), respectively, at the original points before shift to the point (i, j). The maximum among Ha(i, j), Hb(i, j) and Hc(i, j) [(equation 4)] is selected as H(i, j). The above-mentioned score s(i, j) can be determined using a previously stored score table. For example, a score of 4 is given to a combination of the same bases, a score of −8n-4 is given when the number of inserted or deleted nucleotides is n, and a score of −3 is given to a combination of two different bases.




Ha(i, j) = H(i−1, j−1) + s(i, j)   (equation 1)

Hb(i, j) = H(i, j−1) + p   (equation 2)

Hc(i, j) = H(i−1, j) + p   (equation 3)

H(i, j) = max{Ha(i, j), Hb(i, j), Hc(i, j)}   (equation 4)



[0023] The gap penalty score p added in the shift path b corresponds to the presence of a nucleotide deletion after the i-th base of the first DNA base sequence, and the gap penalty score p added in the shift path c corresponds to the presence of a nucleotide deletion after the j-th base of the second DNA base sequence.


[0024] The first and second DNA base sequences are compared by varying the number i from 1 to M and the number j from 1 to N in shift paths from the score matrix element (1, 1) to the score matrix element (M, N), and scores or gap penalty scores are added up in each shift path, whereby there is determined H*=H(M, N), the maximum accumulated similarity (score) between the whole first DNA base sequence and the whole second DNA base sequence. Consequently, it is possible to determine an alignment which gives the greatest similarity between the first and second DNA base sequences, namely, the optimum alignment showing the optimum combinations of the bases of the first DNA base sequence and the bases of the second DNA base sequence.
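The recurrence of (equation 1) to (equation 4) can be sketched in Python as follows. This is a minimal illustration, not a definitive implementation: the function name is hypothetical, the scores of 4 and −3 are the example values given above, the gap penalty p = −12 is an assumption (the example value −8n−4 with n = 1), and the boundary values H(i, 0) = i·p and H(0, j) = j·p are an assumption, since the text starts the calculation at the element (1, 1) without stating them.

```python
def accumulated_score(first, second, match=4, mismatch=-3, gap=-12):
    """Return H(M, N), the accumulated score over both whole sequences."""
    M, N = len(first), len(second)
    # H[i][j]: accumulated score of first[:i] against second[:j].
    H = [[0] * (N + 1) for _ in range(M + 1)]
    for i in range(1, M + 1):
        H[i][0] = i * gap  # assumed boundary: leading gaps only
    for j in range(1, N + 1):
        H[0][j] = j * gap  # assumed boundary: leading gaps only
    for i in range(1, M + 1):
        for j in range(1, N + 1):
            s = match if first[i - 1] == second[j - 1] else mismatch
            Ha = H[i - 1][j - 1] + s   # equation 1
            Hb = H[i][j - 1] + gap     # equation 2
            Hc = H[i - 1][j] + gap     # equation 3
            H[i][j] = max(Ha, Hb, Hc)  # equation 4
    return H[M][N]


print(accumulated_score("ACGT", "ACGT"))  # 16
print(accumulated_score("ACGT", "AGGT"))  # 9
print(accumulated_score("ACGT", "ACT"))   # 0
```

Filling the whole (M+1) by (N+1) table makes the cost proportional to M×N, which is the long calculation time attributed to this method.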


[0025] The third prior art is applicable not only to the investigation of similarities between two DNA base sequences but also to the investigation of similarities between two amino acid sequences.



SUMMARY OF THE INVENTION

[0026] The above-mentioned first prior art involves the following problem. When a nucleotide insertion or deletion is present in a DNA base sequence, a frame shift occurs at the position of the nucleotide insertion or deletion, and an amino acid sequence coded for by the portion of the base sequence after the frame shift position does not have any similarity which would be given if there were no nucleotide insertion or deletion. Therefore, an amino acid sequence cannot be found which would be obtainable if there were no nucleotide insertion or deletion. Thus, a miss or omission occurs in the search.


[0027] Even if an amino acid sequence very similar to an amino acid sequence obtained by translation using, for example, the frame (1) among the 6 kinds of the frames in a DNA base sequence is present in an amino acid sequence translated from another DNA base sequence, the following problem is caused when a nucleotide insertion or deletion is present in the DNA base sequence: the position of the frame is changed to that of the frame (2) or the frame (3) in the portion of the base sequence after the position of the nucleotide insertion or deletion. In the prior art, there has been disclosed neither a method for comparison of DNA base sequences nor a method for search for DNA base sequences, which has been developed in view of a change of reading frame caused by a nucleotide insertion or deletion in the DNA base sequence.


[0028] The BLAST family including TBLASTX in the above-mentioned second prior art is disadvantageous in that a miss or omission occurs in the search because, in order to assure high-speed calculation, gaps due to nucleotide insertions or deletions in a DNA base sequence or amino acid insertions or deletions in an amino acid sequence are not considered.


[0029] The above-mentioned third prior art is an accurate search method but is disadvantageous in that it requires a long period of time because each base of a DNA base sequence is compared with each base of another DNA base sequence. When the third prior art is combined with the first prior art, namely, when each of two DNA base sequences, a query DNA base sequence and a target DNA base sequence, is translated into an amino acid sequence and the translated amino acid sequences are compared, a longer search time is required because it is necessary to compare 36 combinations of 6 kinds of amino acid sequences translated from the first DNA base sequence according to the 6 kinds of the frames, respectively, explained in the first prior art and 6 kinds of amino acid sequences translated from the second DNA base sequence according to the 6 kinds of the frames, respectively.


[0030] Moreover, when the Smith-Waterman method as the third prior art is combined with the first prior art, the insertion or deletion of amino acids or the insertion or deletion of nucleotides in codon units in a DNA base sequence can be considered, but the insertion or deletion of nucleotides in a number other than multiples of 3 (i.e. the number of nucleotides constituting a codon unit) in a DNA base sequence cannot be considered. Therefore, the change of the position of the frame cannot be considered.


[0031] The prior arts do not consider preventing erroneous results caused by nucleotide insertions or deletions in a DNA base sequence. That is, they do not consider translating the DNA base sequence into an amino acid sequence in view of the presence of the nucleotide insertions or deletions.


[0032] Japanese Patent Application No. 7-265157 [reference 3: application date in Japan: Oct. 13, 1995 (JP-A-09-105748 (laid-open date in Japan: Apr. 22, 1997))] which is not a known reference discloses a method for comparison of DNA base sequences which comprises dividing each of first and second DNA base sequences into triplets of successive nucleotides, to form first and second, respectively, intermediate DNA base sequences, translating each of the first and second intermediate DNA base sequences into amino acids to form first and second, respectively, translated amino acid sequences, determining a first similarity between the first DNA base sequence and the first intermediate DNA base sequence, a second similarity between the second DNA base sequence and the second intermediate DNA base sequence, and a third similarity between the first translated amino acid sequence and the second translated amino acid sequence, and choosing the first and second intermediate DNA base sequences and the first and second translated amino acid sequences so that a parameter obtained from the first, second and third similarities by the use of a predetermined function may be maximum.


[0033] Japanese Patent Application No. 8-167770 (reference 4: application date in Japan: Jun. 27, 1996) which is not a known reference discloses a method for comparison of sequences which comprises translating a query DNA base sequence into amino acids in view of nucleotide insertions or deletions, comparing the resulting translated amino acid sequence with a target amino acid sequence read out of an amino acid data base, according to the Smith-Waterman method, determining the score (similarity) between the i-th amino acid of the translated amino acid sequence and the j-th amino acid of the target amino acid sequence in view of 7 kinds of paths, and thereby aligning the translated amino acid sequence with the target amino acid sequence.


[0034] The reference 3, however, does not disclose a technique concerning a specific example of path in calculation according to the dynamic programming method. The reference 4 discloses a method comprising picking out successive codons each having a starting position one or two bases after that of the preceding codon, in the translation of a query DNA base sequence into an amino acid sequence (which corresponds to the first translation method employed in the present invention), but does not disclose the second and third translation methods employed in the present invention which are explained hereinafter in detail. The reference 4 does not disclose a technique for comparing an amino acid sequence translated from a query DNA base sequence with an amino acid sequence translated from a DNA base sequence read out of a DNA base sequence data base.


[0035] An object of the present invention is to provide a method for comparison of DNA base sequences which hardly causes a miss or omission in search and comprises translating each of a query DNA base sequence and a DNA base sequence read out of a DNA base sequence data base (a target DNA base sequence) into an amino acid sequence, and thereby comparing the two DNA base sequences through the translated amino acid sequences, in particular, a method for high-sensitivity detection of similarities between DNA base sequences and a method for estimation of an amino acid sequence coded for by a query DNA base sequence.


[0036] In the method for comparison of DNA base sequences of the present invention, when similarities between first and second DNA base sequences are investigated, each DNA base sequence is first divided into triplets of successive nucleotides which may involve a nucleotide insertion or deletion. Each of the triplets is translated into an amino acid according to the codon table. Similarities between each amino acid of the thus obtained first translated amino acid sequence and each amino acid of the thus obtained second translated amino acid sequence are accumulated in view of amino acid insertions or deletions in each amino acid sequence to obtain an accumulated score (similarity). There are determined combinations of amino acids of the first translated amino acid sequence and those of the second translated amino acid sequence which give the maximum accumulated similarity (the maximum accumulated score). Thus, there are attained the maximum accumulated score, the alignment of the first and second translated amino acid sequences, and the alignment of the DNA base sequence corresponding to the first translated amino acid sequence with the DNA base sequence corresponding to the second translated amino acid sequence. A specific noted DNA base sequence (a query DNA base sequence) is used as the above first DNA base sequence, and a known DNA base sequence read out of any of various DNA base sequence data bases (a target DNA base sequence) is used as the above second DNA base sequence.


[0037] As a method for translating each DNA base sequence into an amino acid sequence which is adopted in the method for comparison of DNA base sequences of the present invention, the following first, second and third translation methods are employed in combination.


[0038] In the first translation method, the DNA base sequence is translated into an amino acid sequence according to a predetermined translation rule by codon table while shifting a reading frame for the DNA base sequence at every triplet of successive nucleotides base by base from the end of the DNA base sequence.
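One possible reading of the first translation method is that a triplet is taken at every base position, so that each starting position yields one amino acid. The following Python sketch illustrates that reading only; the function name and example are hypothetical, the standard genetic code is assumed as the predetermined translation rule, and the input is assumed to contain only A, C, G and T.

```python
from itertools import product

# Standard genetic code in T, C, A, G order; "*" marks a stop codon.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def translate_method1(dna):
    """Translate the triplet starting at every base position, shifting base by base."""
    return "".join(CODON_TABLE[dna[p:p + 3]] for p in range(len(dna) - 2))


# Triplets ATG, TGG, GGC, GCC are read at positions 1, 2, 3 and 4.
print(translate_method1("ATGGCC"))  # MWGA
```

Under this reading, the amino acids of all three forward reading frames are interleaved in a single sequence, which is why the score matrix paths step by three positions at a time.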


[0039] In the second translation method, a reading frame for the DNA base sequence is shifted at every quartet of successive nucleotides base by base from the end of the DNA base sequence, the second of the four nucleotides of each quartet is taken as an inserted nucleotide, and the DNA base sequence is translated into an amino acid sequence according to a predetermined translation rule by codon table by using the remaining three of the four nucleotides.


[0040] In the third translation method, a reading frame for the DNA base sequence is shifted at every quartet of successive nucleotides base by base from the end of the DNA base sequence, the third of the four nucleotides of each quartet is taken as an inserted nucleotide, and the DNA base sequence is translated into an amino acid sequence according to a predetermined translation rule by codon table by using the remaining three of the four nucleotides.
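The second and third translation methods can be sketched in the same way: a quartet is taken at every base position, one nucleotide of the quartet is treated as inserted and skipped, and the remaining three are translated. This is an illustrative Python sketch under the same assumptions (hypothetical function names, standard genetic code, input limited to A, C, G and T).

```python
from itertools import product

# Standard genetic code in T, C, A, G order; "*" marks a stop codon.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def translate_with_skip(dna, skipped):
    """Translate each quartet after dropping the nucleotide at 0-based index `skipped`."""
    result = []
    for p in range(len(dna) - 3):
        quartet = dna[p:p + 4]
        codon = quartet[:skipped] + quartet[skipped + 1:]
        result.append(CODON_TABLE[codon])
    return "".join(result)


def translate_method2(dna):
    """Second method: the second nucleotide of each quartet is taken as inserted."""
    return translate_with_skip(dna, 1)


def translate_method3(dna):
    """Third method: the third nucleotide of each quartet is taken as inserted."""
    return translate_with_skip(dna, 2)


print(translate_method2("ATGGCC"))  # RCA  (AGG, TGC, GCC)
print(translate_method3("ATGGCC"))  # MCG  (ATG, TGC, GGC)
```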


[0041] In the method for comparison of DNA base sequences of the present invention, a dynamic programming method is employed for calculating the accumulated score in the comparison of the first and second amino acid sequences translated from the first and second, respectively, DNA base sequences. In the calculation according to the dynamic programming method, when there are accumulated scores (similarities) between the i-th amino acid of the first translated amino acid sequence and the j-th amino acid of the second translated amino acid sequence which is represented by a score matrix element (i, j), there are considered seven paths from score matrix elements (i−3, j−3), (i, j−3k), (i−3k, j), (i−3n+1, j−3n), (i−3n, j−3n+1), (i−3m, j−3m−1) and (i−3m−1, j−3m), respectively, wherein k is an integer in a range of k≧1, m is an integer in a range of m≧1, and n is an integer in a range of n≧2. When k=1, m=1 and n=2, there are considered paths from score matrix elements (i−3, j−3), (i, j−3), (i−3, j), (i−5, j−6), (i−6, j−5), (i−3, j−4) and (i−4, j−3), respectively. The elements in the parentheses are positive numbers. The symbol i is an integer in a range of i≦M wherein M is the number of amino acids in the first translated amino acid sequence, and the symbol j is an integer in a range of j≦N wherein N is the number of amino acids in the second translated amino acid sequence.
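The seven originating score matrix elements can be enumerated with a short Python sketch. The function name is hypothetical; the defaults k=1, m=1 and n=2 are the example values given above, and elements whose indices are not positive are discarded, following the statement that the elements in the parentheses are positive numbers.

```python
def path_origins(i, j, k=1, m=1, n=2):
    """List the score matrix elements from which the seven paths reach (i, j)."""
    candidates = [
        (i - 3, j - 3),
        (i, j - 3 * k),
        (i - 3 * k, j),
        (i - 3 * n + 1, j - 3 * n),
        (i - 3 * n, j - 3 * n + 1),
        (i - 3 * m, j - 3 * m - 1),
        (i - 3 * m - 1, j - 3 * m),
    ]
    # Keep only elements that lie inside the score matrix (positive indices).
    return [(a, b) for a, b in candidates if a >= 1 and b >= 1]


print(path_origins(10, 10))
# [(7, 7), (10, 7), (7, 10), (5, 4), (4, 5), (7, 6), (6, 7)]
print(path_origins(4, 4))
# [(1, 1), (4, 1), (1, 4)]
```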


[0042] According to the present invention, similarities between the DNA base sequences can be compared through the translated amino acid sequences. Therefore, the comparison can be carried out in detail by listing scores reflecting not only the agreement or disagreement of amino acids but also chemical characteristics (e.g. the hydrophilicity or hydrophobicity of amino acids) and physical characteristics (e.g. the size of amino acids) in a score table used for the comparison for the similarities. Thus, the sensitivity of search for the similarities between the DNA base sequences is enhanced.


[0043] Furthermore, misses or omissions in the search can be reduced because the comparison can be carried out in view of nucleotide insertions or deletions in the DNA base sequences and amino acid insertions or deletions in the translated amino acid sequences.


[0044] The method for comparison of DNA base sequences of the present invention is summarized as follows with reference to FIG. 3. Each of a query DNA base sequence and a DNA base sequence read out of a data base is translated into an amino acid sequence (304, 306). Similarities between the translated amino acid sequences are calculated in view of nucleotide insertions or deletions and amino acid insertions or deletions, and the scores are accumulated by a dynamic programming method (307). For the pairs of translated amino acid sequences that gave the top accumulated scores in the similarity search, the accumulated scores and paths are calculated by the dynamic programming method (312), the path giving the maximum accumulated score is traced (313), and the result of alignment of the translated amino acid sequences is displayed together with that of alignment of the DNA base sequences. Even if a nucleotide insertion or deletion is present in the two DNA base sequences to be compared, it becomes possible to determine similarities between the DNA base sequences through the translated amino acid sequences. Therefore, the sensitivity of search is enhanced.







BRIEF DESCRIPTION OF THE DRAWINGS

[0045] FIG. 1 is a diagram illustrating six kinds of reading frames of prior art in a DNA base sequence in the translation of the DNA base sequence into an amino acid sequence.


[0046] FIG. 2 is a diagram illustrating accumulation paths of scores for comparing DNA base sequences by the Smith-Waterman method, a prior art.


[0047] FIG. 3 is a flow chart illustrating an example of the processing in an embodiment of the present invention.


[0048] FIG. 4 shows an example of a prior-art table prescribing scores to be given to combinations of two amino acids, which is used in the embodiment of the present invention.


[0049] FIG. 5 shows a codon table of prior art prescribing the termination of translation into amino acids and the kinds of amino acids so that they may correspond to the triplet nucleotide units (codon units), respectively, in the codon table.


[0050] FIG. 6 is a diagram illustrating the first translation method for translating a DNA base sequence into an amino acid sequence in the embodiment of the present invention.


[0051] FIG. 7 is a diagram illustrating the second and third translation methods for translating a DNA base sequence into an amino acid sequence in the embodiment of the present invention.


[0052] FIG. 8 is a diagram illustrating score accumulation paths for comparison of translated amino acid sequences in the embodiment of the present invention.


[0053] Each of FIG. 9 and FIG. 10 is a diagram showing a point (i−3, j−4) at which scores S2(i−3, j−4) and S3(i−3, j−4) are determined in the embodiment of the present invention.


[0054] Each of FIG. 11 and FIG. 12 is a diagram showing a point (i−4, j−3) at which scores S4(i−4, j−3) and S5(i−4, j−3) are determined in the embodiment of the present invention.


[0055] FIG. 13 is a diagram showing general examples of alignment results corresponding to shift paths, respectively, in 9 directions in calculation by a dynamic programming method in the embodiment of the present invention.


[0056] FIG. 14 is a diagram showing specific examples of alignment results corresponding to the shift paths, respectively, in the 9 directions in calculation by the dynamic programming method in the embodiment of the present invention.


[0057] Each of FIG. 15, FIG. 16 and FIG. 17 is a diagram showing an example of alignment result obtained by similarity search in the embodiment of the present invention.


[0058] FIG. 18 is a diagram showing the structure of an apparatus for practicing the method for comparison of DNA base sequences of the present invention.







DESCRIPTION OF THE PREFERRED EMBODIMENTS

[0059] An explanation is given below by taking the case of searching a DNA base sequence data base for sequences similar to a query DNA base sequence.


[0060] FIG. 3 is a flow chart illustrating an example of the processing in the embodiment of the present invention. The outline of a method for comparison of DNA base sequences as the embodiment of the present invention is explained below with reference to FIG. 3. First, (step 301) to (step 304) are carried out.


[0061] (Step 301): input of a score table prescribing the similarity of each combination of two amino acids.


[0062] (Step 302): input of the number of search output, i.e., the number of DNA base sequences with top accumulated scores to be displayed on an output device as a result of the search in a DNA base sequence data base.


[0063] (Step 303): input of a query DNA base sequence.


[0064] (Step 304): translated amino acid sequences A1, A2, A3, A4, A5 and A6 are obtained by translating each of the query DNA base sequence and a sequence complementary to the query DNA base sequence by each of the first, second and third translation methods explained hereinafter.


[0065] The translated amino acid sequence A1 is obtained by translation from the query DNA base sequence by the first translation method. The translated amino acid sequence A2 is obtained by translation from a sequence complementary to the query DNA base sequence by the first translation method. The translated amino acid sequence A3 is obtained by translation from the query DNA base sequence by the second translation method. The translated amino acid sequence A4 is obtained by translation from the query DNA base sequence by the third translation method. The translated amino acid sequence A5 is obtained by translation from the sequence complementary to the query DNA base sequence by the second translation method. The translated amino acid sequence A6 is obtained by translation from the sequence complementary to the query DNA base sequence by the third translation method.
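The assignment of the labels A1 to A6 to a strand and a translation method can be written down as a small Python sketch. The function names are hypothetical; the sequence complementary to the query is assumed to be read from its own 5′-terminal, i.e., as the reverse complement of the query, and the input is assumed to contain only A, C, G and T. The sketch only records which strand and which translation method each label refers to; it does not perform the translation itself.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Complementary strand written from its own 5'-terminal."""
    return dna.translate(COMPLEMENT)[::-1]


def translation_plan(query):
    """Map each label A1..A6 to (strand to translate, translation method number)."""
    comp = reverse_complement(query)
    return {
        "A1": (query, 1),  # query, first translation method
        "A2": (comp, 1),   # complementary sequence, first translation method
        "A3": (query, 2),  # query, second translation method
        "A4": (query, 3),  # query, third translation method
        "A5": (comp, 2),   # complementary sequence, second translation method
        "A6": (comp, 3),   # complementary sequence, third translation method
    }


plan = translation_plan("ATGC")
print(plan["A2"])  # ('GCAT', 1)
print(plan["A4"])  # ('ATGC', 3)
```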


[0066] Then, all target DNA base sequences to be read out of the DNA base sequence data base are subjected to the following steps (step 305) to (step 308).


[0067] (Step 305): the target DNA base sequences are read out of the DNA base sequence data base.


[0068] (Step 306): translated amino acid sequences B1, B2, B3, B4, B5 and B6 are obtained by translating each of the read-out target DNA base sequences and their complementary base sequences by each of the first, second and third translation methods explained hereinafter in detail.


[0069] The translated amino acid sequence B1 is obtained by translation from the target DNA base sequence by the first translation method. The translated amino acid sequence B2 is obtained by translation from a sequence complementary to the target DNA base sequence by the first translation method. The translated amino acid sequence B3 is obtained by translation from the target DNA base sequence by the second translation method. The translated amino acid sequence B4 is obtained by translation from the target DNA base sequence by the third translation method. The translated amino acid sequence B5 is obtained by translation from the sequence complementary to the target DNA base sequence by the second translation method. The translated amino acid sequence B6 is obtained by translation from the sequence complementary to the target DNA base sequence by the third translation method.


[0070] (Step 307): accumulated similarities between the translated amino acid sequences in each of the following 4 combinations of the 4 kinds of the translated amino acid sequences obtained in (step 304) and (step 306) are calculated by a dynamic programming method:


[0071] (a) the combination of the translated amino acid sequences A1 and B1,


[0072] (b) the combination of the translated amino acid sequences A1 and B2,


[0073] (c) the combination of the translated amino acid sequences A2 and B1, and


[0074] (d) the combination of the translated amino acid sequences A2 and B2.


[0075] (Step 308): DNA base sequences with top accumulated scores up to the number of search output are selected, and information on the DNA base sequences with top accumulated scores is read out of the DNA base sequence data base and stored.


[0076] Next, all the DNA base sequences read out of the DNA base sequence data base are subjected to the following (step 309) to (step 311).


[0077] (Step 309): the accumulated similarities (scores) are lined up in order of decreasing value, and top accumulated scores corresponding to the number of search output are sorted.


[0078] (Step 310): the DNA base sequences with top accumulated scores are displayed in a display (403 in FIG. 18). In this case, the DNA base sequences with top accumulated scores may be output in an outer memory (404 in FIG. 18) such as a hard disc.


[0079] (Step 311): there is input the number of similarity search results (the number of output of alignment results) at which display of the alignment results is considered preferable, judging from the top accumulated scores displayed in (step 310).
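
For illustration only, the selection of the sequences with top accumulated scores in (step 308) and their ordering in (step 309) can be sketched in Python as follows. The function name `top_hits` and the representation of each search result as a (name, accumulated score) pair are assumptions made for this sketch and are not part of the embodiment.

```python
import heapq


def top_hits(scored_targets, n_output):
    """Return the n_output results with the highest accumulated scores,
    ordered by decreasing score.

    scored_targets: iterable of (target_name, accumulated_score) pairs
                    (a hypothetical representation of the search results).
    n_output:       the "number of search output" entered in step 302.
    """
    return heapq.nlargest(n_output, scored_targets, key=lambda hit: hit[1])


# Hypothetical scores for three target sequences.
hits = [("target_1", 50), ("target_2", 120), ("target_3", 75)]
print(top_hits(hits, 2))
```

Keeping only the best entries rather than sorting every score is one possible design choice; a full sort of all accumulated scores would give the same displayed list.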


[0080] Subsequently, all the target DNA base sequences for which the alignment results are displayed are subjected to the following (step 312) to (step 314).


[0081] (Step 312): accumulated scores and paths are calculated by the dynamic programming method.


[0082] (Step 313): the path giving the maximum accumulated score is traced back to obtain an alignment result of the two amino acid sequences translated from the query DNA base sequence and the target DNA base sequence read out of the DNA base sequence data base, respectively, and an alignment result of the DNA base sequences corresponding to the translated amino acid sequences, respectively.


[0083] (Step 314): the alignment result obtained in (step 313) is displayed in a display (403 in FIG. 18). At the same time, the alignment result may be output in an outer memory (404 in FIG. 18) such as a hard disc.


[0084]
FIG. 4 shows Blosum 62, an example of a prior-art table prescribing scores to be given to combinations of two amino acids, which is used in the embodiment of the present invention. The symbols A, R, N, - - - , W, Y and V on the axis of abscissa and the axis of ordinate in FIG. 4 are abbreviations of amino acids. The symbol B (As*) denotes either Asn or Asp, the symbol Z (Gl*) denotes either Gln or Glu, the symbol X (***) denotes either incapability of translation or an unknown amino acid, and the symbol O (Stp) denotes a termination codon.
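
As a minimal sketch of how such a score table might be held and consulted, the following Python fragment stores scores for unordered pairs of amino acid symbols. Only four entries of the published BLOSUM62 matrix are reproduced here; the complete table is the one shown in FIG. 4. The names `BLOSUM62_EXCERPT` and `pair_score` are hypothetical.

```python
# Excerpt only: four entries of the published BLOSUM62 matrix.
# The full table (FIG. 4) also covers the symbols B, Z, X and O.
BLOSUM62_EXCERPT = {
    ("A", "A"): 4,
    ("A", "R"): -1,
    ("R", "R"): 5,
    ("W", "W"): 11,
}


def pair_score(table, first, second):
    """Look up the score of a pair of amino acids.

    The table is symmetric, so only one ordering of each pair is stored
    and the other ordering is tried when the first lookup fails.
    A KeyError is raised for pairs missing from the table.
    """
    if (first, second) in table:
        return table[(first, second)]
    return table[(second, first)]


print(pair_score(BLOSUM62_EXCERPT, "R", "A"))
```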


[0085] There is explained below a method for translating each of the query DNA base sequence and the target DNA base sequence read out of the DNA base sequence data base into an amino acid sequence in view of nucleotide insertions or deletions [(step 304) and (step 306)].


[0086]
FIG. 5 shows a prior-art codon table which assigns to each triplet nucleotide unit (codon unit) either a kind of amino acid or the termination of translation. In each of the DNA base sequences, each triplet nucleotide unit (codon) codes for an amino acid according to FIG. 5. In FIG. 5, the symbols in the parentheses are one-letter abbreviations of amino acids.


[0087]
FIG. 6 is a diagram illustrating the first translation method for translating each DNA base sequence into an amino acid sequence in the embodiment of the present invention. In the first translation method, a codon (a triplet of nucleotides) is picked out from the 5′-terminal of each DNA base sequence and translated into an amino acid according to FIG. 5. Then, the next codon having a starting point one base after that of the first codon is picked out and translated into an amino acid according to FIG. 5. Thereafter, subsequent codons each having a starting point one base after that of the preceding codon are continuously translated in the same manner as above, until the last nucleotide of a codon picked out agrees with the nucleotide at the 3′-terminal of the DNA base sequence, whereby the DNA base sequence is translated into an amino acid sequence. Thus, the translated amino acid sequence A1 or B1 is obtained. A sequence complementary to the DNA base sequence is also translated into an amino acid sequence according to FIG. 5 in the same manner as above. Thus, the translated amino acid sequence A2 or B2 is obtained. Consequently, two kinds in all of the translated amino acid sequences (A1 and A2, or B1 and B2) can be obtained by the first translation method.


[0088] In FIG. 6, ATGCC, - - - , CGAT is chosen as an example of DNA base sequence. A codon ATG is picked out from the 5′-terminal and translated into an amino acid according to FIG. 5, and the next codon TGC having a starting point one base after that of the first codon is picked out and then translated into an amino acid according to FIG. 5. Thereafter, subsequent codons GCC, - - - , CGA and GAT each having a starting point one base after that of the preceding codon are picked out and then translated into amino acids A, - - - , R and D, respectively. The resulting translated amino acid sequence is MCA, - - - , RD. As shown in FIG. 6, a sequence complementary to the DNA base sequence ATCG, - - - , GGCAT is also translated into an amino acid sequence according to FIG. 5 in the same manner as above, whereby the translated amino acid sequence IS, - - - , GAH is obtained.
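
The first translation method can be sketched in Python as follows, for illustration only. The sketch assumes the standard genetic code for the codon table of FIG. 5, writes a termination codon as O and an untranslatable codon as X following the symbols of FIG. 4, and takes the complementary sequence to be the reverse complement read from its own 5′-terminal, as in the example ATGCC / GGCAT. The names `CODON_TABLE`, `translate_sliding` and `reverse_complement` are hypothetical.

```python
from itertools import product

# Standard genetic code (assumed to correspond to FIG. 5), enumerated in
# the order TTT, TTC, TTA, TTG, TCT, ...; "O" marks a termination codon.
BASES = "TCAG"
AMINO = "FFLLSSSSYYOOCCOWLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def translate_sliding(dna):
    """First translation method: translate the codon starting at every
    nucleotide position, moving one base at a time from the 5'-terminal.
    Codons containing an unknown base are written as "X"."""
    dna = dna.upper()
    return "".join(CODON_TABLE.get(dna[k:k + 3], "X") for k in range(len(dna) - 2))


def reverse_complement(dna):
    """Complementary strand, written from its own 5'-terminal."""
    return dna.upper().translate(str.maketrans("ACGT", "TGCA"))[::-1]


# The two ends of the example sequence ATGCC, - - - , CGAT of FIG. 6.
print(translate_sliding("ATGCC"), translate_sliding("CGAT"))
print(translate_sliding(reverse_complement("CGAT")),
      translate_sliding(reverse_complement("ATGCC")))
```

The two ends of the example reproduce the fragments MCA and RD of the translated sequence and the fragments IS and GAH of the translation of the complementary sequence.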


[0089]
FIG. 7 is a diagram illustrating the second and third translation methods for translating each DNA base sequence into an amino acid sequence in the embodiment of the present invention.


[0090] In the second translation method, four nucleotides are picked out from the 5′-terminal of each DNA base sequence, and the second of the four nucleotides is taken as an inserted nucleotide. The remaining three nucleotides (a first revised DNA base sequence) are translated into an amino acid according to FIG. 5. Thereafter, the same translation is repeated according to FIG. 5 except for picking out subsequent quartets of successive nucleotides each of which has a starting point one base after that of the preceding quartet of successive nucleotides, until the last nucleotide of four successive nucleotides picked out agrees with the nucleotide at the 3′-terminal of the DNA base sequence, whereby the DNA base sequence is translated into an amino acid sequence. Thus, the translated amino acid sequence A3 or B3 is obtained.


[0091] In the third translation method, four nucleotides are picked out from the 5′-terminal of each DNA base sequence, and the third of the four nucleotides is taken as an inserted nucleotide. The remaining three nucleotides (a second revised DNA base sequence) are translated into an amino acid according to FIG. 5. Thereafter, the same translation is repeated according to FIG. 5 except for picking out subsequent quartets of successive nucleotides each of which has a starting point one base after that of the preceding quartet of successive nucleotides, until the last nucleotide of four successive nucleotides picked out agrees with the nucleotide at the 3′-terminal of the DNA base sequence, whereby the DNA base sequence is translated into an amino acid sequence. Thus, the translated amino acid sequence A4 or B4 is obtained.


[0092] A sequence complementary to each DNA base sequence is translated by each of the second and third translation methods in the same manner as above, whereby a translated amino acid sequence A5 or B5 and a translated amino acid sequence A6 or B6 are obtained which are not shown. Consequently, two kinds in all of the translated amino acid sequences (A3 and A5, or B3 and B5) can be obtained by the second translation method, and two kinds in all of the translated amino acid sequences (A4 and A6, or B4 and B6) can be obtained by the third translation method.


[0093] In the example shown in FIG. 7, the DNA base sequence is ATGCC, - - - , CGAT. Therefore, when the DNA base sequence is translated into an amino acid sequence by each of the second and third translation methods, four nucleotides corresponding to ATGC are picked out from the 5′-terminal at first, and AGC, a base sequence in the case of taking the second nucleotide (T) as an inserted nucleotide (a first revised DNA base sequence), and ATC, a base sequence in the case of taking the third nucleotide (G) as an inserted nucleotide (a second revised DNA base sequence), are translated into amino acids S and I, respectively. Next, TCC (a first revised DNA base sequence) and TGC (a second revised DNA base sequence) which have been obtained from TGCC (i.e. a quartet of nucleotides having a starting point one base after that of the first quartet of nucleotides) are translated into amino acids S and C, respectively. Thereafter, such translation is continued in the same manner as above except for picking out subsequent quartets of nucleotides each of which has a starting point one base after that of the preceding quartet of nucleotides, whereby amino acid sequences are translated from the DNA base sequence. Consequently, the translated amino acid sequences are SS, - - - , H and IC, - - - , R. In addition, ATCG, - - - , GGCAT, a sequence complementary to the DNA base sequence shown in FIG. 7, is translated into an amino acid sequence by each of the second and third translation methods in the same manner as above to obtain the translated amino acid sequences not shown.
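
For illustration only, the second and third translation methods can be sketched in Python as a single function that drops one nucleotide from each quartet. The sketch assumes the standard genetic code for the codon table of FIG. 5, with O for a termination codon and X for an untranslatable codon; the names `CODON_TABLE` and `translate_skipping` and the parameter `skipped` are hypothetical.

```python
from itertools import product

# Standard genetic code (assumed to correspond to FIG. 5); "O" = termination.
BASES = "TCAG"
AMINO = "FFLLSSSSYYOOCCOWLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def translate_skipping(dna, skipped):
    """Translate every quartet of successive nucleotides after removing one
    nucleotide that is regarded as inserted.

    skipped=1 removes the second nucleotide (second translation method);
    skipped=2 removes the third nucleotide (third translation method).
    """
    dna = dna.upper()
    amino_acids = []
    for k in range(len(dna) - 3):
        quartet = dna[k:k + 4]
        revised = quartet[:skipped] + quartet[skipped + 1:]  # 3 nucleotides
        amino_acids.append(CODON_TABLE.get(revised, "X"))
    return "".join(amino_acids)


# The two ends of the example sequence ATGCC, - - - , CGAT of FIG. 7.
print(translate_skipping("ATGCC", 1), translate_skipping("CGAT", 1))
print(translate_skipping("ATGCC", 2), translate_skipping("CGAT", 2))
```

The two ends of the example reproduce the fragments SS and H obtained by the second translation method and the fragments IC and R obtained by the third translation method.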


[0094] There is explained below in detail (step 307) in which accumulated scores between translated amino acid sequences are calculated by the dynamic programming method for calculating accumulated similarities.


[0095] In the present invention, a score matrix for comparing amino acid sequences is obtained by modifying the score matrix for comparing DNA base sequences according to the Smith-Waterman method which is shown in FIG. 2. Using the score table prescribing scores to be given to combinations of two amino acids which is shown in FIG. 4, similarities between two amino acids of the respective translated amino acid sequences to be compared are determined and then accumulated. Accumulated similarities between the translated amino acid sequences are calculated by the dynamic programming method by using the translated amino acid sequences A1, A2, A3, A4, A5 and A6 obtained in (step 304) and the translated amino acid sequences B1, B2, B3, B4, B5 and B6 obtained in (step 306).


[0096] The amino acids of a first translated amino acid sequence (A1 or A2) are aligned in regular order along a first axis (for example, x-axis) from the 5′-terminal of the DNA base sequence corresponding to the first translated amino acid sequence, and the amino acids of a second translated amino acid sequence (B1 or B2) are aligned in regular order along a second axis (for example, y-axis) from the 5′-terminal of the DNA base sequence corresponding to the second translated amino acid sequence. Thus, there is formed a score matrix H in which the value H(i, j) of a score matrix element (i, j) indicates an accumulated similarity between an amino acid sequence from the first amino acid to the i-th amino acid in the first translated amino acid sequence and an amino acid sequence from the first amino acid to the j-th amino acid in the second translated amino acid sequence. The amino acids of a 1st, 3rd, 5th, 7th or 9th translated amino acid sequence (any of A1, A2, A3, A4, A5 and A6) are aligned in regular order along a first axis (for example, x-axis) from the 5′-terminal of the DNA base sequence corresponding to the 1st, 3rd, 5th, 7th or 9th translated amino acid sequence, and the amino acids of a 2nd, 4th, 6th, 8th or 10th translated amino acid sequence (any of B1, B2, B3, B4, B5 and B6) are aligned in regular order along a second axis (for example, y-axis) from the 5′-terminal of the DNA base sequence corresponding to the 2nd, 4th, 6th, 8th or 10th translated amino acid sequence. Thus, there are formed 1st, 2nd, 3rd, 4th and 5th matrices s1(i, j) to s5(i, j) which indicate the score (similarity) of each combination of two amino acids. First to fourth groups of 5 matrices each are formed by combination of the translated amino acid sequence A1, A2, A3, A4, A5 or A6 and the translated amino acid sequence B1, B2, B3, B4, B5 or B6. In each of the 5 matrices, the translated amino acid sequence along the first axis and that along the second axis are referred to as Ai and Bj, respectively, and for simplification, the translated amino acid sequences along the first axis and the second axis, respectively, in each matrix are represented by (Ai, Bj).


[0097] The 1st group of matrices is composed of


[0098] a score matrix H having sequences (A1, B1),


[0099] a 1st matrix s1 having sequences (A1, B1),


[0100] a 2nd matrix s2 having sequences (A1, B3),


[0101] a 3rd matrix s3 having sequences (A1, B4),


[0102] a 4th matrix s4 having sequences (A3, B1), and


[0103] a 5th matrix s5 having sequences (A4, B1),


[0104] wherein A1 is used as the 1st, 3rd and 5th translated amino acid sequences, A3 as the 7th translated amino acid sequence, A4 as the 9th translated amino acid sequence, B1 as the 2nd, 8th and 10th translated amino acid sequences, B3 as the 4th translated amino acid sequence, and B4 as the 6th translated amino acid sequence.


[0105] The 2nd group of matrices is composed of


[0106] a score matrix H having sequences (A1, B2),


[0107] a 1st matrix s1 having sequences (A1, B2),


[0108] a 2nd matrix s2 having sequences (A1, B5),


[0109] a 3rd matrix s3 having sequences (A1, B6),


[0110] a 4th matrix s4 having sequences (A3, B2), and


[0111] a 5th matrix s5 having sequences (A4, B2),


[0112] wherein A1 is used as the 1st, 3rd and 5th translated amino acid sequences, A3 as the 7th translated amino acid sequence, A4 as the 9th translated amino acid sequence, B2 as the 2nd, 8th and 10th translated amino acid sequences, B5 as the 4th translated amino acid sequence, and B6 as the 6th translated amino acid sequence.


[0113] The 3rd group of matrices is composed of


[0114] a score matrix H having sequences (A2, B1),


[0115] a 1st matrix s1 having sequences (A2, B1),


[0116] a 2nd matrix s2 having sequences (A2, B3),


[0117] a 3rd matrix s3 having sequences (A2, B4),


[0118] a 4th matrix s4 having sequences (A5, B1), and


[0119] a 5th matrix s5 having sequences (A6, B1),


[0120] wherein A2 is used as the 1st, 3rd and 5th translated amino acid sequences, A5 as the 7th translated amino acid sequence, A6 as the 9th translated amino acid sequence, B1 as the 2nd, 8th and 10th translated amino acid sequences, B3 as the 4th translated amino acid sequence, and B4 as the 6th translated amino acid sequence.


[0121] The 4th group of matrices is composed of


[0122] a score matrix H having sequences (A2, B2),


[0123] a 1st matrix s1 having sequences (A2, B2),


[0124] a 2nd matrix s2 having sequences (A2, B5),


[0125] a 3rd matrix s3 having sequences (A2, B6),


[0126] a 4th matrix s4 having sequences (A5, B2), and


[0127] a 5th matrix s5 having sequences (A6, B2),


[0128] wherein A2 is used as the 1st, 3rd and 5th translated amino acid sequences, A5 as the 7th translated amino acid sequence, A6 as the 9th translated amino acid sequence, B2 as the 2nd, 8th and 10th translated amino acid sequences, B5 as the 4th translated amino acid sequence, and B6 as the 6th translated amino acid sequence.
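
The composition of the four groups of matrices follows one pattern, which the following Python sketch makes explicit for illustration only: within a group, s2 and s3 pair the first-method translation on the first axis with the second-method and third-method translations of the same strand on the second axis, and s4 and s5 do the converse. The names `STRAND_INDEX`, `matrix_group` and `GROUPS` and the strand labels are hypothetical.

```python
# Number of the translated sequence obtained by each translation method,
# for a DNA base sequence itself and for its complementary sequence.
STRAND_INDEX = {
    "direct": {"first": 1, "second": 3, "third": 4},
    "complementary": {"first": 2, "second": 5, "third": 6},
}


def matrix_group(a_strand, b_strand):
    """Return the (first-axis, second-axis) sequences of each matrix of a group."""
    a, b = STRAND_INDEX[a_strand], STRAND_INDEX[b_strand]

    def seq_a(method):
        return "A%d" % a[method]

    def seq_b(method):
        return "B%d" % b[method]

    return {
        "H": (seq_a("first"), seq_b("first")),
        "s1": (seq_a("first"), seq_b("first")),
        "s2": (seq_a("first"), seq_b("second")),
        "s3": (seq_a("first"), seq_b("third")),
        "s4": (seq_a("second"), seq_b("first")),
        "s5": (seq_a("third"), seq_b("first")),
    }


# 1st to 4th groups, in the order in which they are listed above.
GROUPS = [matrix_group(x, y)
          for x in ("direct", "complementary")
          for y in ("direct", "complementary")]
print(GROUPS[2])
```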


[0129]
FIG. 8 is a diagram illustrating score accumulation paths for comparison of DNA base sequences in the embodiment of the present invention.


[0130] The 1st to 4th groups of matrices are independently used. The shift paths (search paths) ① to ⑨ in 9 directions to a score matrix element (i, j) shown in FIG. 8 are considered for each group of matrices by the dynamic programming method. The position of (i, j) is shifted toward the score matrix element (M, N) at the lower right corner in FIG. 8 from the score matrix element (1, 1) at the upper left corner by changing the number i from 1 to M (the number of amino acids constituting the amino acid sequence on the first axis of each score matrix) and the number j from 1 to N (the number of amino acids constituting the amino acid sequence on the second axis of the score matrix), whereby there is determined the optimum path (the optimum alignment) showing the optimum combination for similarity of each amino acid of the first translated amino acid sequence and a corresponding amino acid of the second translated amino acid sequence.


[0131] The value H(i, j) of a score matrix element (i, j) indicates an accumulated similarity between an amino acid sequence from the first amino acid to the i-th amino acid in the first translated amino acid sequence and an amino acid sequence from the first amino acid to the j-th amino acid in the second translated amino acid sequence.


[0132] In the case of the shift paths ① to ⑨ in the 9 directions to a point (i, j) from points (1) to (11) shown in FIG. 8, the maximum among H1(i, j) to H11(i, j) [(equation 16)] is selected as the accumulated similarity (score) H(i, j). For determining the scores s1(i, j) to s5(i, j), the score table shown in FIG. 4 is used. H1(i, j) to H11(i, j) are defined by (equation 5) to (equation 15), respectively, by using the scores s1(i, j) to s5(i, j) which indicate similarities between the i-th amino acid in the amino acid sequence on the first axis and the j-th amino acid in the amino acid sequence on the second axis, the gap penalty scores wa and wn, and the values of score matrix elements at the original points before shift to the point (i, j), namely H(i−3, j−3), H(i−3, j), H(i, j−3), H(i−5, j−6), H(i−6, j−5), H(i−3, j−4), H(i−4, j−3), H(i−6, j−7) and H(i−7, j−6).


[0133] Each of FIG. 9 and FIG. 10 shows the relationship between the first term (i−6, j−7) of H8(i, j) or H9(i, j), respectively, and (i, j). Each of FIG. 11 and FIG. 12 shows the relationship between the first term (i−7, j−6) of H10(i, j) or H11(i, j), respectively, and (i, j). The point (i−3, j−4) in each of FIG. 9 and FIG. 10 is a point at which scores s2 and s3 are determined. The point (i−4, j−3) in each of FIG. 11 and FIG. 12 is a point at which scores s4 and s5 are determined.




$$H_1(i, j) = H(i-3, j-3) + s_1(i, j) = H(i-3, j-3) + s_1^{*}(A^{*}_{i}, B^{*}_{j}) \qquad \text{(equation 5)}$$



[0134] wherein H1(i, j) corresponds to a shift path from a point (i−3, j−3) to a point (i, j).




$$H_2(i, j) = H(i, j-3) + w_a \qquad \text{(equation 6)}$$



[0135] wherein H2(i, j) corresponds to a shift path from a point (i, j−3) to a point (i, j).




$$H_3(i, j) = H(i-3, j) + w_a \qquad \text{(equation 7)}$$



[0136] wherein H3(i, j) corresponds to a shift path from a point (i−3, j) to a point (i, j).




$$H_4(i, j) = H(i-5, j-6) + w_n + s_1(i, j) = H(i-5, j-6) + w_n + s_1^{*}(A^{*}_{i}, B^{*}_{j}) \qquad \text{(equation 8)}$$



[0137] wherein H4(i, j) corresponds to a shift path from a point (i−5, j−6) to a point (i, j).




$$H_5(i, j) = H(i-6, j-5) + w_n + s_1(i, j) = H(i-6, j-5) + w_n + s_1^{*}(A^{*}_{i}, B^{*}_{j}) \qquad \text{(equation 9)}$$



[0138] wherein H5(i, j) corresponds to a shift path from a point (i−6, j−5) to a point (i, j).




$$H_6(i, j) = H(i-3, j-4) + w_n + s_1(i, j) = H(i-3, j-4) + w_n + s_1^{*}(A^{*}_{i}, B^{*}_{j}) \qquad \text{(equation 10)}$$



[0139] wherein H6(i, j) corresponds to a shift path from a point (i−3, j−4) to a point (i, j).




$$H_7(i, j) = H(i-4, j-3) + w_n + s_1(i, j) = H(i-4, j-3) + w_n + s_1^{*}(A^{*}_{i}, B^{*}_{j}) \qquad \text{(equation 11)}$$



[0140] wherein H7(i, j) corresponds to a shift path from a point (i−4, j−3) to a point (i, j).




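$$H_8(i, j) = H(i-6, j-7) + w_n + s_2(i-3, j-4) + s_1(i, j) = H(i-6, j-7) + w_n + s_2^{*}(A^{*}_{i-3}, \{b_{j-4} b_{j-3} b_{j-1}\}) + s_1^{*}(A^{*}_{i}, B^{*}_{j}) \qquad \text{(equation 12)}$$


$$H_9(i, j) = H(i-6, j-7) + w_n + s_3(i-3, j-4) + s_1(i, j) = H(i-6, j-7) + w_n + s_3^{*}(A^{*}_{i-3}, \{b_{j-4} b_{j-2} b_{j-1}\}) + s_1^{*}(A^{*}_{i}, B^{*}_{j}) \qquad \text{(equation 13)}$$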



[0141] wherein each of H8(i, j) and H9(i, j) involves a shift path from a point (i−6, j−7) to a point (i, j).




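$$H_{10}(i, j) = H(i-7, j-6) + w_n + s_4(i-4, j-3) + s_1(i, j) = H(i-7, j-6) + w_n + s_4^{*}(\{a_{i-4} a_{i-3} a_{i-1}\}, B^{*}_{j-3}) + s_1^{*}(A^{*}_{i}, B^{*}_{j}) \qquad \text{(equation 14)}$$


$$H_{11}(i, j) = H(i-7, j-6) + w_n + s_5(i-4, j-3) + s_1(i, j) = H(i-7, j-6) + w_n + s_5^{*}(\{a_{i-4} a_{i-2} a_{i-1}\}, B^{*}_{j-3}) + s_1^{*}(A^{*}_{i}, B^{*}_{j}) \qquad \text{(equation 15)}$$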



[0142] wherein each of H10(i, j) and H11(i, j) involves a shift path from a point (i−7, j−6) to a point (i, j).




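$$H(i, j) = \max\{H_1(i, j), H_2(i, j), H_3(i, j), H_4(i, j), H_5(i, j), H_6(i, j), H_7(i, j), H_8(i, j), H_9(i, j), H_{10}(i, j), H_{11}(i, j)\} \qquad \text{(equation 16)}$$


$$s_1(i, j) = s_1^{*}(A^{*}_{i}, B^{*}_{j}) \qquad \text{(equation 17)}$$


$$s_2(i-3, j-4) = s_2^{*}(A^{*}_{i-3}, \{b_{j-4} b_{j-3} b_{j-1}\}) \qquad \text{(equation 18)}$$


$$s_3(i-3, j-4) = s_3^{*}(A^{*}_{i-3}, \{b_{j-4} b_{j-2} b_{j-1}\}) \qquad \text{(equation 19)}$$


$$s_4(i-4, j-3) = s_4^{*}(\{a_{i-4} a_{i-3} a_{i-1}\}, B^{*}_{j-3}) \qquad \text{(equation 20)}$$


$$s_5(i-4, j-3) = s_5^{*}(\{a_{i-4} a_{i-2} a_{i-1}\}, B^{*}_{j-3}) \qquad \text{(equation 21)}$$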



[0143] In the above equations, A*i is the i-th codon (triplet of nucleotides) of the first DNA base sequence [the query DNA base sequence (hereinafter referred to as A*)], B*j is the j-th codon (triplet of nucleotides) of the second DNA base sequence [the target DNA base sequence (hereinafter referred to as B*)], ai is the i-th nucleotide of A*, and bj is the j-th nucleotide of B*. The right-hand side of each of (equation 17) to (equation 21) indicates a score between codons and hence can be determined according to the score table shown in FIG. 4, by translating each codon into an amino acid.
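
The recurrence of (equation 5) to (equation 16) can be sketched in Python as follows for one group of matrices. This is an illustration under stated assumptions, not the implementation of the embodiment: positions are counted from 0; a predecessor lying outside the matrix is given the value 0 and a floor of 0 is applied as in the Smith-Waterman method, neither of which is stated in the equations; wa is kept constant; the standard genetic code stands in for FIG. 5; and a toy score (+5 for identical amino acids, −4 otherwise) stands in for FIG. 4. All function names are hypothetical.

```python
from itertools import product

BASES = "TCAG"  # standard genetic code, "O" = termination (assumed FIG. 5)
AMINO = "FFLLSSSSYYOOCCOWLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def translate_sliding(dna):  # first translation method
    return "".join(CODON_TABLE.get(dna[k:k + 3], "X") for k in range(len(dna) - 2))


def translate_skipping(dna, skipped):  # second (1) / third (2) method
    return "".join(
        CODON_TABLE.get(dna[k:k + skipped] + dna[k + skipped + 1:k + 4], "X")
        for k in range(len(dna) - 3))


def toy_score(x, y):  # stand-in for the score table of FIG. 4
    return 5 if x == y else -4


def accumulate_scores(a_dna, b_dna, score, wa=-12, wn=-12):
    """Fill the score matrix H and return (maximum accumulated score, H)."""
    A1, A3, A4 = (translate_sliding(a_dna), translate_skipping(a_dna, 1),
                  translate_skipping(a_dna, 2))
    B1, B3, B4 = (translate_sliding(b_dna), translate_skipping(b_dna, 1),
                  translate_skipping(b_dna, 2))
    M, N = len(A1), len(B1)
    H = [[0] * N for _ in range(M)]

    def h(i, j):  # assumption: value 0 outside the matrix
        return H[i][j] if i >= 0 and j >= 0 else 0

    best = 0
    for i in range(M):
        for j in range(N):
            s1 = score(A1[i], B1[j])
            candidates = [
                0,                             # assumed Smith-Waterman floor
                h(i - 3, j - 3) + s1,          # equation 5
                h(i, j - 3) + wa,              # equation 6
                h(i - 3, j) + wa,              # equation 7
                h(i - 5, j - 6) + wn + s1,     # equation 8
                h(i - 6, j - 5) + wn + s1,     # equation 9
                h(i - 3, j - 4) + wn + s1,     # equation 10
                h(i - 4, j - 3) + wn + s1,     # equation 11
            ]
            if i >= 3 and j >= 4:              # equations 12 and 13
                base = h(i - 6, j - 7) + wn + s1
                candidates.append(base + score(A1[i - 3], B3[j - 4]))
                candidates.append(base + score(A1[i - 3], B4[j - 4]))
            if i >= 4 and j >= 3:              # equations 14 and 15
                base = h(i - 7, j - 6) + wn + s1
                candidates.append(base + score(A3[i - 4], B1[j - 3]))
                candidates.append(base + score(A4[i - 4], B1[j - 3]))
            H[i][j] = max(candidates)          # equation 16
            best = max(best, H[i][j])
    return best, H


best, H = accumulate_scores("ATGGCCAAATTT", "ATGGCCAAATTT", toy_score)
print(best)
```

Because (equation 12) and (equation 13) enter the maximum with the same penalty wn, as do (equation 14) and (equation 15), the value H(i, j) does not depend on which member of each pair is labeled s2 or s3, or s4 or s5.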


[0144] In the manner described above, the optimum path (the optimum alignment) showing the optimum combination for similarity of each amino acid of the first translated amino acid sequence and a corresponding amino acid of the second translated amino acid sequence is determined for each of the first to fourth groups of matrices by the dynamic programming method by using these groups of matrices independently.


[0145] In the above equations, wa denotes a gap penalty due to an amino acid insertion or deletion, and wn denotes a gap penalty due to a nucleotide insertion or deletion in the DNA base sequence. In the present embodiment, it was assumed that wa=wn=−12. When successive amino acid insertions or deletions are present, wa was taken as −12 at the first insertion or deletion and as −4 at the second and subsequent insertions or deletions.
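
For illustration only, the penalty for a run of successive amino acid insertions or deletions described above can be written as a small Python function. The function name `amino_gap_penalty` and its parameter names are hypothetical; the values −12 and −4 are those assumed in the present embodiment.

```python
def amino_gap_penalty(run_length, first=-12, subsequent=-4):
    """Total penalty for run_length successive amino acid insertions or
    deletions: `first` for the first one, `subsequent` for each later one."""
    if run_length <= 0:
        return 0
    return first + subsequent * (run_length - 1)


for n in (1, 2, 3):
    print(n, amino_gap_penalty(n))
```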


[0146] There is given below a detailed explanation of (step 312) in which accumulated scores and paths are calculated by the dynamic programming method for obtaining alignment results, and (step 313) in which tracing of a path giving the maximum accumulated score is calculated.


[0147] In (step 312), the accumulated scores are determined by the dynamic programming method by carrying out the same calculation as described in (step 307), for the two amino acid sequences giving top accumulated scores which have been obtained from the query DNA base sequence and the target DNA base sequence read out of the DNA base sequence data base, respectively. In this case, in addition to the accumulated similarities (scores), information on the kind of calculation path selected from those represented by (equation 5) to (equation 16) is stored for each score matrix element, and the position (i, j) of the final score matrix element of the shift path giving the maximum accumulated similarity (score) is also stored.


[0148] In (step 313), the calculation path stored for each score matrix element is traced back from the position (i, j) of the final score matrix element giving the maximum accumulated similarity (score), which has been stored in (step 312), whereby the alignment result of the translated amino acid sequences which gives the maximum accumulated similarity (score) is obtained.
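
The tracing of (step 313) can be sketched in Python as follows, for illustration only. The sketch assumes that (step 312) has stored, for each score matrix element on the path, the original point before the shift and the number of the equation that was selected; this dictionary layout and the names `trace_back` and `stored` are hypothetical.

```python
def trace_back(stored, final_position):
    """Follow the stored calculation paths from the final score matrix
    element back to the start of the alignment.

    stored maps a position (i, j) to (previous position, equation number);
    tracing stops at a position that has no stored entry.
    Returns the visited (position, equation number) pairs from start to end.
    """
    path = []
    position = final_position
    while position in stored:
        previous, equation = stored[position]
        path.append((position, equation))
        position = previous
    path.reverse()
    return path


# Hypothetical stored paths: three shifts by (equation 5) followed by one
# shift by (equation 10), i.e. a nucleotide insertion in the second sequence.
stored = {
    (0, 0): (None, 5),
    (3, 3): ((0, 0), 5),
    (6, 6): ((3, 3), 5),
    (9, 10): ((6, 6), 10),
}
print(trace_back(stored, (9, 10)))
```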


[0149]
FIG. 13 is a diagram showing general examples of alignment result which correspond to the shift paths, respectively, in the 9 directions in calculation by the dynamic programming method in the embodiment of the present invention.


[0150]
FIG. 14 is a diagram showing specific examples of alignment result which correspond to shift paths, respectively, in the 9 directions in calculation by the dynamic programming method in the embodiment of the present invention.


[0151] In FIG. 14, the first line in each example of alignment represents a first DNA base sequence, the second line one or two amino acids translated from this first DNA base sequence, the third line one or two amino acids translated from a second DNA base sequence, and the fourth line this second DNA base sequence. The symbol “-” represents a nucleotide or amino acid deletion, and the symbol “*” represents an amino acid which cannot be determined by translation because of a nucleotide deletion or the presence of an unknown base n which has not been determined to be any of a, c, g and t.


[0152] Next, examples of practical application of the present embodiment are explained below. A query sequence of Arabidopsis thaliana registered in the EST data base of GenBank, a public data base of DNA base sequences, was chosen, and all sequences derived from rice (Oryza sativa) which had been registered in the EST data base were used as target sequences for similarity search. Each DNA base sequence registered in the EST data base involves a certain amount of sequence errors because the output result from a DNA sequencer has been registered as such. Therefore, such DNA base sequences are suitable for confirming the effectiveness of the present invention in which two DNA base sequences are compared through amino acid sequences in view of nucleotide insertions or deletions present in the DNA base sequences.


[0153] Each of FIG. 15, FIG. 16 and FIG. 17 is a diagram showing an example of alignment result obtained by similarity search in the embodiment of the present invention. In each of FIG. 15, FIG. 16 and FIG. 17, the “Query sequence” section shows a name given to a query DNA base sequence and a brief explanation of this sequence, and the “Target sequence” section shows a name given to a target DNA base sequence read out of the EST data base and selected by similarity search, and a brief explanation of this sequence. The “Score” section shows the accumulated similarity (score), the lengths of the query sequence and the target sequence, and the alignment regions of the query sequence and the target sequence.


[0154] In each “Query” section showing the alignment result, the query sequence and an amino acid sequence translated from the query sequence are shown in the upper row and the lower row, respectively. In each “Target” section showing the alignment result, the target DNA base sequence selected by similarity search and an amino acid sequence translated from this DNA base sequence are shown in the lower row and the upper row, respectively.


[0155] The DNA base sequence and the translated amino acid sequence in the “Query” section showing the alignment result in FIG. 15 are represented by sequence numbers 1 and 2, respectively, and the translated amino acid sequence and the DNA base sequence in the “Target” section showing the alignment result are represented by sequence numbers 3 and 4, respectively. The DNA base sequence and the translated amino acid sequence in the “Query” section showing the alignment result in FIG. 16 are represented by sequence numbers 5 and 6, respectively, and the translated amino acid sequence and the DNA base sequence in the “Target” section showing the alignment result are represented by sequence numbers 7 and 8, respectively. The DNA base sequence and the translated amino acid sequence in the “Query” section showing the alignment result in FIG. 17 are represented by sequence numbers 9 and 10, respectively, and the translated amino acid sequence and the DNA base sequence in the “Target” section showing the alignment result are represented by sequence numbers 11 and 12, respectively.


[0156] In FIG. 15, FIG. 16 and FIG. 17, the symbol “:” between the upper and lower translated amino acid sequences indicates that the amino acids corresponding to each other are the same. The symbol “.” between the sequences indicates that the value of a score matrix corresponding to the amino acids is positive. The absence of any symbol between the sequences indicates that the value of a score matrix corresponding to the amino acids is zero or negative. The symbol “-” represents a nucleotide or amino acid deletion. The symbol “n” denotes an unknown base which has not been determined to be any of a, c, g and t. The symbol “*” denotes an amino acid which cannot be determined by translation because of a nucleotide deletion or the presence of the unknown base.
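
The line of symbols between the two translated amino acid sequences can be produced as in the following Python sketch, given for illustration only. The treatment of positions containing “-” or “*” as blank is an assumption of the sketch, as are the function name `match_line` and the toy score used in the example, which stands in for the score table of FIG. 4.

```python
def match_line(query_aa, target_aa, score):
    """Build the symbol line shown between two aligned amino acid sequences:
    ":" for identical amino acids, "." for a positive score, and a blank
    for a zero or negative score.  Positions holding "-" or "*" are left
    blank (an assumption of this sketch)."""
    symbols = []
    for q, t in zip(query_aa, target_aa):
        if q in "-*" or t in "-*":
            symbols.append(" ")
        elif q == t:
            symbols.append(":")
        elif score(q, t) > 0:
            symbols.append(".")
        else:
            symbols.append(" ")
    return "".join(symbols)


def toy_score(x, y):
    """Toy stand-in for FIG. 4: only the pair I/V scores positively."""
    return 1 if {x, y} == {"I", "V"} else -1


print(repr(match_line("MIK-", "MVRA", toy_score)))
```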


[0157] The regions b, b′, c, d and e shown by the quadrangles in FIG. 15 are explained below. Each of the regions b and b′ indicates that the optimum path involves one or two amino acid insertions or deletions, i.e., a result corresponding to (equation 6) or (equation 7). The region c indicates that the optimum path involves a nucleotide deletion, i.e., a result corresponding to (equation 8) or (equation 9). Each of the regions d and e corresponds to a nucleotide insertion: the region d indicates that the optimum path involves a result corresponding to (equation 10) or (equation 11), and the region e indicates that the optimum path involves a result corresponding to any of (equation 12) to (equation 15).


[0158] Only the regions enclosed with the quadrangles in FIG. 16 and FIG. 17 are regions obtained by applying TBLASTX of prior art. In the method of the present invention, information on similarities between two base sequences can be obtained through translated amino acid sequences, in regions unobtainable by application of TBLASTX of prior art. Particularly when the result shown in FIG. 16 is compared with that obtained by the use of TBLASTX of prior art, it can be seen that the result obtained according to the present invention is information on the similarities in a continuous wider region. Particularly in the case of the example shown in FIG. 17, the method of the present invention gives information on the similarities in a region three times as wide as that obtained by TBLASTX of prior art.


[0159] In the present invention, since all of amino acid insertions or deletions and nucleotide insertions or deletions in each DNA base sequence are taken into consideration, similarity search can be carried out in a wide region of the base sequence to attain higher similarities (higher accumulated scores), so that an alignment result in the wide region of the base sequence can be obtained. Consequently, it becomes possible to obtain a more complete sequence as an amino acid sequence coded for by the DNA base sequence. Knowing the amino acid sequence of a protein coded for by a DNA base sequence is the first step in the analysis of the biological functions of genes. At present, the number of data in an available amino acid sequence data base is much smaller than that in an available DNA base sequence data base. Obtaining information on the amino acid sequence by the method of the present invention from the DNA base sequence obtained by measurement gives information useful for analyzing the function of the protein.


[0160] FIG. 18 is a diagram showing the structure of an apparatus for practicing the method for comparison of DNA base sequences of the present invention. The apparatus comprises a device 401 for input of the above-mentioned first and second DNA base sequences; a calculation processing device 402 holding the following programs: a translation program for translating each DNA base sequence into an amino acid sequence, a sequence comparison program for comparing the above-mentioned first and second translated amino acid sequences, and a program for aligning the first and second translated amino acid sequences and aligning the DNA base sequences corresponding to the first and second translated amino acid sequences, respectively; an output device for output of the maximum accumulated similarity, the alignment result of the first and second translated amino acid sequences, and the alignment result of the DNA base sequences corresponding to the first and second translated amino acid sequences, respectively; and an external memory which stores various DNA base sequence data bases, various amino acid sequence data bases, a score table, the codon table, etc.


[0161] A summary of the present invention is given below. The present invention is characterized by (A) a method for comparing DNA base sequences by comparing similarities between a first DNA base sequence and a second DNA base sequence, which comprises (1) a step of dividing each of the first DNA base sequence and the second DNA base sequence into groups of successive three nucleotides each, translating each of these groups into an amino acid, and thereby obtaining a first amino acid sequence and a second amino acid sequence, respectively, (2) a step of determining similarities between each amino acid of the first translated amino acid sequence and each amino acid of the second translated amino acid sequence in view of nucleotide insertions or deletions in the first and second DNA base sequences and amino acid insertions or deletions in the first and second translated amino acid sequences, accumulating the thus determined similarities, and thereby determining a combination of each amino acid of the first translated amino acid sequence and a corresponding amino acid of the second translated amino acid sequence which gives the maximum accumulated similarity, (3) a step of outputting the maximum accumulated similarity, the alignment of the first and second translated amino acid sequences, the alignment of the first translated amino acid sequence and the first DNA base sequence, and the alignment of the second translated amino acid sequence and the second DNA base sequence, wherein the step (1) comprises translating each of the first and second DNA base sequences by each of the following methods: (i) a method of translating each DNA base sequence into an amino acid sequence while shifting a reading frame for the base sequence at every triplet of successive nucleotides base by base from the 5′-terminal of the base sequence, (ii) a method of shifting a reading frame for each DNA base sequence at every quartet of successive nucleotides base by base from the 5′-terminal of the base sequence, translating the three nucleotides other than the second nucleotide of each quartet into an amino acid, and thus translating the base sequence into an amino acid sequence, and (iii) a method of shifting a reading frame for each DNA base sequence at every quartet of successive nucleotides base by base from the 5′-terminal of the base sequence, translating the three nucleotides other than the third nucleotide of each quartet into an amino acid, and thus translating the base sequence into an amino acid sequence.
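The three translation methods (i) to (iii) can be illustrated by the following minimal Python sketch. It assumes the standard genetic code and an upper-case input containing only A, C, G and T; the names CODON_TABLE, translate_triplets and translate_quartets are hypothetical and are not taken from the specification.

```python
from itertools import product

# Standard genetic code with bases ordered T, C, A, G ('*' marks a stop codon).
# Assumption: the codon table used by the apparatus is the standard one.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def translate_triplets(dna):
    """Method (i): translate the triplet starting at every base position."""
    return "".join(CODON_TABLE[dna[p:p + 3]] for p in range(len(dna) - 2))


def translate_quartets(dna, skip):
    """Methods (ii) and (iii): take the quartet starting at every base
    position, drop its nucleotide at 1-based position `skip` (2 or 3),
    and translate the remaining three nucleotides."""
    if skip not in (2, 3):
        raise ValueError("skip must be 2 (method ii) or 3 (method iii)")
    residues = []
    for p in range(len(dna) - 3):
        quartet = dna[p:p + 4]
        codon = quartet[:skip - 1] + quartet[skip:]
        residues.append(CODON_TABLE[codon])
    return "".join(residues)


if __name__ == "__main__":
    seq = "ATGGCC"
    print(translate_triplets(seq))        # MWGA
    print(translate_quartets(seq, 2))     # RCA
    print(translate_quartets(seq, 3))     # MCG
```

In this sketch the i-th residue of each output corresponds to the triplet or quartet starting at the i-th nucleotide, so a step of three positions stays in the same reading frame.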


[0162] In the method (A), the present invention is characterized in that in the step (2), when a matrix is formed by aligning the amino acids of the first translated amino acid sequence in regular order in the direction of a first axis and the amino acids of the second translated amino acid sequence in regular order in the direction of a second axis, and an accumulated similarity at a matrix element (i, j) indicating the position of combination of the i-th amino acid of the first translated amino acid sequence and the j-th amino acid of the second translated amino acid sequence is determined, any path is selected from seven paths to the matrix element (i, j) from matrix elements (i−3, j−3), (i, j−3k), (i−3k, j), (i−3n+1, j−3n), (i−3n, j−3n+1), (i−3m, j−3m−1) and (i−3m−1, j−3m) [wherein k is an integer in a range of k≥1, m is an integer in a range of m≥1, n is an integer in a range of n≥2, i is an integer in a range of i≤M (M is the number of amino acids in the first translated amino acid sequence), and j is an integer in a range of j≤N (N is the number of amino acids in the second translated amino acid sequence)] so that the accumulated similarity may be maximum.
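The selection among the seven paths can be sketched as follows for the case k=1, n=2 and m=1, i.e. the matrix elements listed in claim 3, using the form of H1(i, j) to H7(i, j) given in claim 7. This is only an illustration under stated assumptions: the score function identity_score, the gap values wa=−4 and wn=−3, and the treatment of a predecessor outside the matrix as 0 are hypothetical choices, not values from the specification.

```python
def identity_score(x, y, match=5, mismatch=-1):
    """Hypothetical stand-in for a score-table lookup s1(i, j)."""
    return match if x == y else mismatch


# Offsets (di, dj) of the four nucleotide insertion/deletion predecessors
# (i-5, j-6), (i-6, j-5), (i-3, j-4) and (i-4, j-3), i.e. n = 2 and m = 1.
NUCLEOTIDE_INDEL_OFFSETS = [(5, 6), (6, 5), (3, 4), (4, 3)]


def accumulate(a, b, score=identity_score, wa=-4, wn=-3):
    """Fill the accumulated-similarity matrix H (1-based indices).

    a, b : translated amino acid sequences with one residue per base
           position (frame shifted base by base).
    wa   : value for an amino acid insertion or deletion (assumed).
    wn   : value for a nucleotide insertion or deletion (assumed).
    """
    M, N = len(a), len(b)
    H = [[0] * (N + 1) for _ in range(M + 1)]  # row 0 and column 0 are padding

    def h(i, j):
        # Assumption: a predecessor outside the matrix contributes 0.
        return H[i][j] if i >= 1 and j >= 1 else 0

    for i in range(1, M + 1):
        for j in range(1, N + 1):
            s = score(a[i - 1], b[j - 1])
            candidates = [
                h(i - 3, j - 3) + s,   # H1: next codon in both sequences
                h(i, j - 3) + wa,      # H2: amino acid gap
                h(i - 3, j) + wa,      # H3: amino acid gap
            ]
            # H4 to H7: paths involving a nucleotide insertion or deletion.
            candidates += [h(i - di, j - dj) + wn + s
                           for di, dj in NUCLEOTIDE_INDEL_OFFSETS]
            H[i][j] = max(candidates)
    return H


if __name__ == "__main__":
    # Matching residues sit at (1, 1), (4, 4) and (7, 8): the last step
    # needs the (i-3, j-4) path, which costs one wn.
    H = accumulate("ADEBFGC", "AHIBKLMC")
    print(H[7][8])  # 12 = 5 + 5 + (-3) + 5
```

Because every predecessor lies in an earlier row or an earlier column of the same row, a simple row-by-row fill is sufficient in this sketch.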


[0163] In addition, the present invention is characterized by (B) a method for comparing DNA base sequences by comparing similarities between a first DNA base sequence and a second DNA base sequence, which comprises (1) a step of dividing each of the first and second DNA base sequences into groups of successive three nucleotides each, translating each of these groups into an amino acid, and thereby obtaining a first amino acid sequence and a second amino acid sequence, respectively, (2) a step of determining similarities between each amino acid of the first translated amino acid sequence and each amino acid of the second translated amino acid sequence in view of nucleotide insertions or deletions in the first and second DNA base sequences and amino acid insertions or deletions in the first and second translated amino acid sequences, accumulating the thus determined similarities, and thereby determining a combination of each amino acid of the first translated amino acid sequence and a corresponding amino acid of the second translated amino acid sequence which gives the maximum accumulated similarity, (3) a step of outputting the maximum accumulated similarity, the alignment of the first and second translated amino acid sequences, the alignment of the first translated amino acid sequence and the first DNA base sequence, and the alignment of the second translated amino acid sequence and the second DNA base sequence.


[0164] Furthermore, the present invention is characterized in that each of the methods (A) and (B) comprises the same steps (1), (2) and (3) as above except for using a base sequence complementary to the first DNA base sequence in place of the first DNA base sequence and a base sequence complementary to the second DNA base sequence in place of the second DNA base sequence.
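Obtaining the complementary base sequence can be sketched as below. The sketch assumes that the complementary sequence is read from its own 5′-terminal, i.e. that it is the reverse complement of the input, and that the input contains only upper-case A, C, G and T; the function name complementary_strand is hypothetical.

```python
# Watson-Crick pairing for upper-case DNA bases.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def complementary_strand(dna):
    """Return the complementary strand written from its 5' terminal
    (assumption: this is the reverse complement of the input)."""
    return dna.translate(COMPLEMENT)[::-1]


if __name__ == "__main__":
    print(complementary_strand("ATGGCC"))  # GGCCAT
```

The resulting sequence can then be passed through the same steps (1), (2) and (3) in place of the original base sequence.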


Claims
  • 1. A method for comparing DNA base sequences by comparing similarities between a first DNA base sequence and a second DNA base sequence, which comprises (1) a step of dividing each of said first DNA base sequence and said second DNA base sequence into groups of successive three nucleotides each, translating each of these groups of nucleotides into an amino acid, and thereby obtaining a first amino acid sequence and a second amino acid sequence, respectively, (2) a step of determining similarities between each amino acid of said first translated amino acid sequence and each amino acid of said second translated amino acid sequence in view of nucleotide insertions or deletions in said first and second DNA base sequences and amino acid insertions or deletions in said first and second translated amino acid sequences, accumulating the thus determined similarities, and thereby determining a combination of each amino acid of said first translated amino acid sequence and a corresponding amino acid of said second translated amino acid sequence which makes the accumulated similarity maximum, (3) a step of outputting said maximum accumulated similarity, the alignment of said first and second translated amino acid sequences, the alignment of said first translated amino acid sequence and said first DNA base sequence, and the alignment of said second translated amino acid sequence and said second DNA base sequence, wherein said step (1) comprises translating each of said first and second DNA base sequences by each of the following methods: (i) a method of translating each DNA base sequence into an amino acid sequence while shifting a reading frame for the base sequence at every triplet of successive nucleotides base by base from the 5′-terminal of the base sequence, (ii) a method of shifting a reading frame for each DNA base sequence at every quartet of successive nucleotides base by base from the 5′-terminal of the base sequence, translating the three nucleotides other than the second nucleotide of each quartet into an amino acid, and thus translating the base sequence into an amino acid sequence, and (iii) a method of shifting a reading frame for each DNA base sequence at every quartet of successive nucleotides base by base from the 5′-terminal of the base sequence, translating the three nucleotides other than the third nucleotide of each quartet into an amino acid, and thus translating the base sequence into an amino acid sequence.
  • 2. A method for comparing DNA base sequences according to claim 1, wherein in said step (2), when a matrix is formed by aligning the amino acids of said first translated amino acid sequence in regular order in the direction of a first axis and the amino acids of said second translated amino acid sequence in regular order in the direction of a second axis, and said accumulated similarity at an element (i, j) of said matrix which indicates the position of combination of the i-th amino acid of said first translated amino acid sequence and the j-th amino acid of said second translated amino acid sequence is determined, any path is selected from paths to the element (i, j) of said matrix from elements of said matrix (i−3, j−3), (i, j−3k), (i−3k, j), (i−3n+1, j−3n), (i−3n, j−3n+1), (i−3m, j−3m−1) and (i−3m−1, j−3m) so that said accumulated similarity may be maximum.
  • 3. A method for comparing DNA base sequences according to claim 1, wherein in said step (2), when a matrix is formed by aligning the amino acids of said first translated amino acid sequence in regular order in the direction of a first axis and the amino acids of said second translated amino acid sequence in regular order in the direction of a second axis, and said accumulated similarity at an element (i, j) of said matrix which indicates the position of combination of the i-th amino acid of said first translated amino acid sequence and the j-th amino acid of said second translated amino acid sequence is determined, any path is selected from paths to the element (i, j) of said matrix from elements of said matrix (i−3, j−3), (i, j−3), (i−3, j), (i−5, j−6), (i−6, j−5), (i−3, j−4) and (i−4, j−3) so that said accumulated similarity may be maximum.
  • 4. A method for comparing DNA base sequences according to claim 1, which comprises the same steps (1), (2) and (3) as above except for using a base sequence complementary to said first DNA base sequence in place of said first DNA base sequence and a base sequence complementary to said second DNA base sequence in place of said second DNA base sequence.
  • 5. A method for comparing DNA base sequences by comparing similarities between a first DNA base sequence and a second DNA base sequence, which comprises (1) a step of dividing each of said first DNA base sequence and said second DNA base sequence into groups of successive three nucleotides each, translating each of these groups of nucleotides into an amino acid, and thereby obtaining a first amino acid sequence and a second amino acid sequence, respectively, (2) a step of determining similarities between each amino acid of said first translated amino acid sequence and each amino acid of said second translated amino acid sequence in view of nucleotide insertions or deletions in said first and second DNA base sequences and amino acid insertions or deletions in said first and second translated amino acid sequences, accumulating the thus determined similarities, and thereby determining a combination of each amino acid of said first translated amino acid sequence and a corresponding amino acid of said second translated amino acid sequence which gives the maximum accumulated similarity, (3) a step of outputting said maximum accumulated similarity, the alignment of said first and second translated amino acid sequences, the alignment of said first translated amino acid sequence and said first DNA base sequence, and the alignment of said second translated amino acid sequence and said second DNA base sequence.
  • 6. A method for comparing DNA base sequences according to claim 5, which comprises the same steps (1), (2) and (3) as above except for using a base sequence complementary to said first DNA base sequence in place of said first DNA base sequence and a base sequence complementary to said second DNA base sequence in place of said second DNA base sequence.
  • 7. A method for comparing DNA base sequences which comprises
    (1) a step of picking out triplets of successive nucleotides from each of a first DNA base sequence and a second DNA base sequence while shifting the starting position of the triplet base by base from the 5′-terminal, and translating the triplets into amino acids in regular order to obtain a translated amino acid sequence A1 or B1, respectively,
    (2) a step of picking out triplets of successive nucleotides from a sequence complementary to each of said first and second DNA base sequences while shifting the starting position of the triplet base by base from the 5′-terminal, and translating the triplets into amino acids in regular order to obtain a translated amino acid sequence A2 or B2, respectively,
    (3) a step of picking out quartets of successive nucleotides from each of said first and second DNA base sequences while shifting the starting position of the quartet base by base from the 5′-terminal, translating the three nucleotides other than the second nucleotide of each quartet into an amino acid, and thus translating the quartets into amino acids in regular order to obtain a translated amino acid sequence A3 or B3, respectively,
    (4) a step of picking out quartets of successive nucleotides from each of said first and second DNA base sequences while shifting the starting position of the quartet base by base from the 5′-terminal, translating the three nucleotides other than the third nucleotide of each quartet into an amino acid, and thus translating the quartets into amino acids in regular order to obtain a translated amino acid sequence A4 or B4, respectively,
    (5) a step of picking out quartets of successive nucleotides from a sequence complementary to each of said first and second DNA base sequences while shifting the starting position of the quartet base by base from the 5′-terminal, translating the three nucleotides other than the second nucleotide of each quartet into an amino acid, and thus translating the quartets into amino acids in regular order to obtain a translated amino acid sequence A5 or B5, respectively,
    (6) a step of picking out quartets of successive nucleotides from a sequence complementary to each of said first and second DNA base sequences while shifting the starting position of the quartet base by base from the 5′-terminal, translating the three nucleotides other than the third nucleotide of each quartet into an amino acid, and thus translating the quartets into amino acids in regular order to obtain a translated amino acid sequence A6 or B6, respectively,
    (7) a step in which the amino acids of said translated amino acid sequence A1 or A2 as a first translated amino acid sequence are aligned in regular order along a first axis from the 5′-terminal of the base sequence corresponding to said first translated amino acid sequence and the amino acids of said translated amino acid sequence B1 or B2 as a second translated amino acid sequence are aligned in regular order along a second axis from the 5′-terminal of the base sequence corresponding to said second translated amino acid sequence, whereby there is formed a score matrix H wherein the value H(i, j) of a matrix element (i, j) indicates an accumulated similarity between an amino acid sequence from the first amino acid to the i-th amino acid in said first translated amino acid sequence and an amino acid sequence from the first amino acid to the j-th amino acid in said second translated amino acid sequence; the amino acids of said translated amino acid sequence selected from said translated amino acid sequences A1, A2, A3, A4, A5 and A6 are aligned in regular order along a first axis as a 1st, 3rd, 5th, 7th or 9th translated amino acid sequence from the 5′-terminal of the base sequence corresponding to said selected translated amino acid sequence and the amino acids of said translated amino acid sequence selected from said translated amino acid sequences B1, B2, B3, B4, B5 and B6 are aligned in regular order along a second axis as a 2nd, 4th, 6th, 8th or 10th translated amino acid sequence from the 5′-terminal of the base sequence corresponding to the latter selected translated amino acid sequence, whereby there are formed five matrices, namely 1st, 2nd, 3rd, 4th and 5th matrices s1(i, j), s2(i, j), s3(i, j), s4(i, j) and s5(i, j), which indicate similarities between the i-th amino acid of the former selected translated amino acid sequence and the j-th amino acid of the latter selected translated amino acid sequence; and the following 1st, 2nd, 3rd and 4th groups of matrices are formed:
      said 1st group of matrices being composed of a score matrix H formed from said translated amino acid sequence A1 and said translated amino acid sequence B1, a 1st matrix s1 formed from said translated amino acid sequence A1 and said translated amino acid sequence B1, a 2nd matrix s2 formed from said translated amino acid sequence A1 and said translated amino acid sequence B3, a 3rd matrix s3 formed from said translated amino acid sequence A1 and said translated amino acid sequence B4, a 4th matrix s4 formed from said translated amino acid sequence A3 and said translated amino acid sequence B1, and a 5th matrix s5 formed from said translated amino acid sequence A4 and said translated amino acid sequence B1,
      said 2nd group of matrices being composed of a score matrix H formed from said translated amino acid sequence A1 and said translated amino acid sequence B2, a 1st matrix s1 formed from said translated amino acid sequence A1 and said translated amino acid sequence B2, a 2nd matrix s2 formed from said translated amino acid sequence A1 and said translated amino acid sequence B5, a 3rd matrix s3 formed from said translated amino acid sequence A1 and said translated amino acid sequence B6, a 4th matrix s4 formed from said translated amino acid sequence A3 and said translated amino acid sequence B2, and a 5th matrix s5 formed from said translated amino acid sequence A4 and said translated amino acid sequence B2,
      said 3rd group of matrices being composed of a score matrix H formed from said translated amino acid sequence A2 and said translated amino acid sequence B1, a 1st matrix s1 formed from said translated amino acid sequence A2 and said translated amino acid sequence B1, a 2nd matrix s2 formed from said translated amino acid sequence A2 and said translated amino acid sequence B3, a 3rd matrix s3 formed from said translated amino acid sequence A2 and said translated amino acid sequence B4, a 4th matrix s4 formed from said translated amino acid sequence A5 and said translated amino acid sequence B1, and a 5th matrix s5 formed from said translated amino acid sequence A6 and said translated amino acid sequence B1, and
      said 4th group of matrices being composed of a score matrix H formed from said translated amino acid sequence A2 and said translated amino acid sequence B2, a 1st matrix s1 formed from said translated amino acid sequence A2 and said translated amino acid sequence B2, a 2nd matrix s2 formed from said translated amino acid sequence A2 and said translated amino acid sequence B5, a 3rd matrix s3 formed from said translated amino acid sequence A2 and said translated amino acid sequence B6, a 4th matrix s4 formed from said translated amino acid sequence A5 and said translated amino acid sequence B2, and a 5th matrix s5 formed from said translated amino acid sequence A6 and said translated amino acid sequence B2,
    (8) a step in which for each of said 1st to 4th groups of matrices, the value H(i, j) of an element (i, j) of said score matrix which indicates an accumulated similarity between an amino acid sequence from the first amino acid to the i-th amino acid in said first translated amino acid sequence and an amino acid sequence from the first amino acid to the j-th amino acid in said second translated amino acid sequence, is determined as follows:
      H(i, j)=max{H1(i, j), H2(i, j), H3(i, j), H4(i, j), H5(i, j), H6(i, j), H7(i, j), H8(i, j), H9(i, j), H10(i, j), H11(i, j)}
    by selecting any path from paths to an element (i, j) of said score matrix from elements of said score matrix (i−3, j−3), (i−3, j), (i, j−3), (i−5, j−6), (i−6, j−5), (i−3, j−4), (i−4, j−3), (i−6, j−7) and (i−7, j−6) so that said accumulated similarity may be maximum, namely, the maximum among the following H1(i, j) to H11(i, j) is selected:
      an accumulated similarity H1(i, j) corresponding to a path from the element (i−3, j−3) of said matrix to the element (i, j) of said matrix:
        H1(i, j)=H(i−3, j−3)+s1(i, j)
      an accumulated similarity H2(i, j) corresponding to a path from the element (i, j−3) of said matrix to the element (i, j) of said matrix:
        H2(i, j)=H(i, j−3)+wa
      an accumulated similarity H3(i, j) corresponding to a path from the element (i−3, j) of said matrix to the element (i, j) of said matrix:
        H3(i, j)=H(i−3, j)+wa
      an accumulated similarity H4(i, j) corresponding to a path from the element (i−5, j−6) of said matrix to the element (i, j) of said matrix:
        H4(i, j)=H(i−5, j−6)+wn+s1(i, j)
      an accumulated similarity H5(i, j) corresponding to a path from the element (i−6, j−5) of said matrix to the element (i, j) of said matrix:
        H5(i, j)=H(i−6, j−5)+wn+s1(i, j)
      an accumulated similarity H6(i, j) corresponding to a path from the element (i−3, j−4) of said matrix to the element (i, j) of said matrix:
        H6(i, j)=H(i−3, j−4)+wn+s1(i, j)
      an accumulated similarity H7(i, j) corresponding to a path from the element (i−4, j−3) of said matrix to the element (i, j) of said matrix:
        H7(i, j)=H(i−4, j−3)+wn+s1(i, j)
      accumulated similarities H8(i, j) and H9(i, j) which correspond to a path from the element (i−6, j−7) of said matrix to the element (i, j) of said matrix:
        H8(i, j)=H(i−6, j−7)+wn+s2(i−3, j−4)+s1(i, j)
        H9(i, j)=H(i−6, j−7)+wn+s3(i−3, j−4)+s1(i, j)
      accumulated similarities H10(i, j) and H11(i, j) which correspond to a path from the element (i−7, j−6) of said matrix to the element (i, j) of said matrix:
        H10(i, j)=H(i−7, j−6)+wn+s4(i−4, j−3)+s1(i, j)
        H11(i, j)=H(i−7, j−6)+wn+s5(i−4, j−3)+s1(i, j)
    wherein wa is a numerical value indicating an amino acid insertion or deletion in each amino acid sequence and wn is a numerical value indicating a nucleotide insertion or deletion in each DNA base sequence,
    (9) a step in which the optimum alignment showing the optimum combination for similarity of each amino acid of said first translated amino acid sequence and a corresponding amino acid of said second translated amino acid sequence is determined from a plurality of said score matrices H in said 1st to 4th groups of matrices.
Priority Claims (1)
Number Date Country Kind
09-079586 Mar 1997 JP