The present application claims priority from Japanese application JP 2004-177319 filed on Jun. 15, 2004, the content of which is hereby incorporated by reference into this application.
1. Field of the Invention
The present invention relates to a method for preparing a correlation diagram or a multiple alignment among nucleic acid sequences by conducting a correlation analysis among a plurality of nucleic acid sequences.
2. Background Art
In general, nucleic acid has two polynucleotide strands arranged in parallel via hydrogen bonding between bases and the polynucleotide strands twist with respect to each other to form a double helix structure. The bonding between the bases is based on hydrogen bonding between adenine (A) and thymine (T), and guanine (G) and cytosine (C) in a complementary manner, so that no other combination takes place. A polynucleotide strand bonded to a certain polynucleotide strand in a complementary manner is referred to as a complementary strand of the polynucleotide strand.
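The pairing rule above can be sketched in a few lines of Python. This is only an illustration, not part of the claimed method: the function names `complement` and `reverse_complement` are hypothetical, and the sketch assumes sequences are plain strings over the letters A, C, G, and T. Because the two strands of a double helix run in opposite directions, the complementary strand read in the 5′→3′ direction is the reversed complement.

```python
# Watson-Crick pairing table: A<->T and G<->C (upper and lower case).
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def complement(seq: str) -> str:
    """Return the base-by-base complement, in the same left-to-right order."""
    return seq.translate(_COMPLEMENT)


def reverse_complement(seq: str) -> str:
    """Return the complementary strand written in the 5'->3' direction."""
    return complement(seq)[::-1]


if __name__ == "__main__":
    print(complement("ATGC"))          # TACG
    print(reverse_complement("ATGC"))  # GCAT
```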
Conventionally, ClustalW (1994-), a program made by J. Thompson and T. Gibson, has been used as a method for conducting correlation analysis among biopolymers, including nucleic acids. The calculation method used in the program is described in Thompson JD, Higgins DG, Gibson TJ (Nucleic Acids Res. 1994 Nov: 4673-80). ClustalW analyzes genealogical relationships in evolution among different biopolymers and prepares a multiple alignment thereof.
Non-patent Document: Nucleic Acids Res. 1994 Nov: 4673-80
The conventional correlation analysis, however, has the following problems.
1. When the direction (5′→3′ (+ direction) or 3′→5′ (− direction)) of a nucleic acid sequence that is a calculation object is uncertain, an analysis often fails to yield significant results (the problem of the accuracy of analysis results).
2. One way to resolve problem 1 is to prepare the complementary strand sequences of all the nucleic acid sequences that are calculation objects and to add these sequences to the calculation objects. In this case, however, the number of nucleic acid sequences to be calculated is doubled and the calculation time is approximately quadrupled (the problem of calculation time).
3. Further, with the method of item 2, half of the sequences in the analysis results are not significant for the results, so that the result display becomes confusing (the problem of result display).
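The "approximately quadrupled" figure in item 2 can be checked with a short calculation. The sketch below assumes that the calculation time is dominated by all-against-all pairwise comparisons, which is an assumption made here for illustration rather than a statement from this specification; the function name `pairwise_comparisons` is hypothetical.

```python
def pairwise_comparisons(n: int) -> int:
    """Number of distinct sequence pairs among n sequences: n(n-1)/2."""
    return n * (n - 1) // 2


if __name__ == "__main__":
    for n in (10, 100, 1000):
        original = pairwise_comparisons(n)
        doubled = pairwise_comparisons(2 * n)  # complementary strands added
        print(n, original, doubled, round(doubled / original, 2))
```

For 10 sequences the pair count rises from 45 to 190, and the ratio approaches 4 as the number of sequences grows.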
It is an object of the present invention to provide a method for conducting correlation analysis among a plurality of nucleic acid sequences at high speed while taking into consideration the complementary strand of each analysis object sequence, and for deriving results of high accuracy.
In order to achieve the aforementioned object, in the present invention, when correlation analysis is conducted among a plurality of nucleic acid sequences, either the original sequence or its complementary strand sequence is selected as the input for each sequence, whichever gives more significant results, and a correlation diagram or a multiple alignment among the nucleic acid sequences is prepared. Specifically, a homology search is conducted between one particular sequence (hereafter referred to as a query), selected arbitrarily from the nucleic acid sequences that are analysis objects, and all the remaining sequences of the analysis objects. On the basis of the search results, it is determined for each sequence whether the original sequence or the complementary strand sequence will give more significant analysis results, and that sequence is selected as the analysis object. Then, correlation analysis is conducted among the sequences selected as the analysis objects. The method of the present invention can be performed by loading a program into a computer.
By selecting the direction of each analysis object sequence, the accuracy of analysis results can be improved, and the problem of calculation time can also be resolved, since the number of object sequences is not increased. Further, only sequences that are significant for the results are displayed in the analysis results.
According to the present invention, by determining the directions of input sequences, correlation analysis among nucleic acid sequences, which has conventionally required a huge amount of time and resulted in low accuracy, can be conducted at high speed and with high accuracy.
In the following, embodiments of the present invention are described concretely with reference to the drawings.
A user inputs an arbitrary nucleic acid sequence into the central processing unit 101 using the keyboard 104 or the mouse 105. The central processing unit 101 selects the directions of input sequences that make analysis results more significant, using the inputted nucleic acid sequence. Then, the central processing unit 101 conducts correlation analysis among these nucleic acid sequences and draws a correlation diagram or a multiple alignment among the nucleic acid sequences on the display device 103 on the basis of results thereof.
A user inputs an arbitrary nucleic acid sequence into the data input and output processing device 204 using the keyboard 207 or the mouse 208. The data input and output processing device 204 transmits the inputted sequence to the device 201 for preparing a correlation diagram or a multiple alignment among nucleic acid sequences through the communication channel 203. The device 201 for preparing a correlation diagram or a multiple alignment among nucleic acid sequences conducts correlation analysis among nucleic acid sequences using the transmitted nucleic acid sequence, and transmits results thereof to the data input and output processing device 204 through the communication channel 203. The data input and output processing device 204 draws a correlation diagram or a multiple alignment among nucleic acid sequences on the display device 206 on the basis of the transmitted analysis results.
When the process is initiated (501), inputted sequences are read (502). Among the input sequences, one arbitrary sequence is handled as a query sequence 505, and the other sequences are handled as target sequences 504 (503). The target sequences 504 are stored in a database 506 for homology search.
Next, a homology search is conducted (507) between the query sequence 505 and the sequences in the database 506 for homology search. Search results 508 are sorted (509) in descending order of search score value for each target sequence. For each target sequence, the direction that gives the highest score value in the results is handled as the direction of that sequence (510).
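Steps 509 and 510 can be sketched as follows. This is a minimal illustration under stated assumptions: the `Hit` record, its fields, and the function `best_direction_per_target` are hypothetical names, and the sketch assumes that the homology search reports each hit as a target identifier, a strand ("+" or "-"), and a score. The homology search itself (507) is not implemented here.

```python
from typing import Dict, Iterable, NamedTuple


class Hit(NamedTuple):
    """One homology-search hit (hypothetical record layout)."""
    target_id: str
    strand: str   # "+" or "-"
    score: float


def best_direction_per_target(hits: Iterable[Hit]) -> Dict[str, str]:
    """Sort hits by descending score and keep the strand of the top hit per target."""
    directions: Dict[str, str] = {}
    for hit in sorted(hits, key=lambda h: h.score, reverse=True):
        # The first hit seen for a target is its highest-scoring one.
        directions.setdefault(hit.target_id, hit.strand)
    return directions


if __name__ == "__main__":
    hits = [Hit("t1", "+", 50.0), Hit("t1", "-", 120.0), Hit("t2", "+", 80.0)]
    print(best_direction_per_target(hits))
```

A target with no hit at all does not appear in the returned mapping; how such a target should be treated is not stated in the description above.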
After the directions of the target sequences are determined, the number of sequences having “+” directions is counted (511). In a case where the sequences of “+” directions form a majority, the query sequence is handled without change as an input sequence (513) for correlation analysis among sequences, the target sequences of “+” directions are handled without change as input sequences for correlation analysis among sequences, and complementary strands of the target sequences of “−” directions are prepared and handled as input sequences (515) for correlation analysis among sequences. In a case where the sequences of “+” directions do not form a majority, a complementary strand of the query sequence is prepared and handled as an input sequence (514) for correlation analysis among sequences, the target sequences of “−” directions are handled without change as input sequences for correlation analysis among sequences, and complementary strands of the target sequences of “+” directions are prepared and handled as input sequences (516) for correlation analysis among sequences.
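The branch at step 511 can be sketched as below. The function names and the dictionary-based data layout are hypothetical. The sketch makes three assumptions: a complementary strand is prepared as a reverse complement, a tie is treated as "+" not forming a majority, and a target with no recorded direction is left unchanged. The last point is not specified in the description above.

```python
from typing import Dict, Tuple

_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq: str) -> str:
    """Complementary strand written 5'->3' (assumed meaning of 'complementary strand')."""
    return seq.translate(_COMPLEMENT)[::-1]


def orient_inputs(
    query: str,
    targets: Dict[str, str],
    directions: Dict[str, str],
) -> Tuple[str, Dict[str, str]]:
    """Choose the orientation of the query and of each target for the correlation analysis."""
    plus_count = sum(1 for d in directions.values() if d == "+")
    # "+" is kept only when it is a strict majority; otherwise "-" is kept.
    keep = "+" if plus_count * 2 > len(directions) else "-"
    oriented_query = query if keep == "+" else reverse_complement(query)
    oriented_targets = {
        name: seq if directions.get(name, keep) == keep else reverse_complement(seq)
        for name, seq in targets.items()
    }
    return oriented_query, oriented_targets


if __name__ == "__main__":
    q, t = orient_inputs(
        "AAGC",
        {"t1": "GCTT", "t2": "AAGC", "t3": "AAGG"},
        {"t1": "-", "t2": "+", "t3": "+"},
    )
    print(q, t)
```

In either branch the number of input sequences stays the same as the number of original sequences, which is what keeps the calculation time from growing.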
After the input sequences for correlation analysis among sequences are decided in this manner, the correlation analysis among sequences is conducted (517) and analysis results 518 are outputted. When the analysis results are outputted, information for drawing a correlation diagram or a multiple alignment among sequences is prepared (519), and the correlation diagram or the multiple alignment among sequences is drawn on a display device (520).
When the process is initiated (801), a sequence file input by the user through drag and drop is received (802). After the file input is completed, when the “display of a multiple alignment” button or the “display of a correlation diagram among sequences” button is pressed (803), correlation analysis among sequences is conducted (804). When the analysis is completed, the type of the button pressed by the user is determined (805). If the “display of a multiple alignment” button has been pressed, a multiple alignment is displayed (807), and if the “display of a correlation diagram among sequences” button has been pressed, a genealogical tree is displayed (806).
Number | Date | Country | Kind |
---|---|---|---|
177319/2004 | Jun 2004 | JP | national |