The field of the present invention is in the area of high-speed and high-throughput computing in the area of biotechnology. Specifically the invention is related to computing assemblies for sets of DNA or RNA sequence reads.
1. DNA/RNA Sequence Reads
Reads are the scientific results of DNA or RNA materials that are run on gels or some other means to determine the genetic material's nucleotide sequence. Each read possesses at least two different types of data. The first part is the base call of the read. The base call may be a best guess of the base (adenine, guanine, cytosine, or tyrosine) in a particular position in the genetic material being sequenced. The second part may be a generated probability that the particular base call is correct (sometimes referred to as a Phred scored).
2. Assemblies
Assemblies are reads that have been put together in a manner similar to a jigsaw puzzle. Each read may be a relatively small part of a larger portion of genetic material. The reads may be overlapping and these overlapping portions of the reads may be used to splice the reads together. Further, mismatches are allowed when overlapping these reads due to the probabilities assigned to each base in each read. If a base call contains a low probability read it may be likely that it will be removed or changed when spliced together with other reads. In this way assemblies are created by putting reads together. Phrap is the current industry standard method of creating assemblies from reads. Other assembly programs include SEQMAN II (from DNA Star) and GCG (from Accelrys) as well as assemblers created by TIGR and STADEN.
3. Phrap
Phrap is a computer program developed at the University of Washington to create assemblies from sets of DNA or RNA sequence reads. Any other program designed to create assemblies from reads may also be incorporated with the present invention. Phrap was originally created as a tool for the scientists of the Human Genome Project who needed to join reads together into assemblies. The algorithm as taught be the creators of Phrap used in the current method of Phrap is as follows (http://www.phrap.org/phrap.docs/phrap.html):
Phrap and other assembly software are currently employable on Unix and PC based computers. Currently Phrap's limitations are based on the computer's ability to perform the algorithm. Specifically, as the number of reads goes up by an order of N, the numbers of computations needed by Phrap goes up by an order of N2. Currently to run Phrap with high numbers of reads companies and scientists have tried to use bigger computers to solve the problem. However, there has been little work done to alter the Phrap algorithm itself.
5. Divide and Conquer Computer Methods—as Described by Cormen, Leiserson, and Rivest in “Introduction to Algorithms”.
Many useful algorithms are recursive in structure: to solve a given problem, they call themselves recursively one or more times to deal with closely related subproblems. These algorithms typically follow a divide-and-conquer approach: they break the problem into several subproblems that are similar to the original problem but smaller in size, solve the subproblems recursively, and then combine these solutions to create a solution to the original problem.
The divide and conquer paradigm involves three steps at each level of the recursion:
Myers and Miller is one of many sequence alignment algorithms. The algorithm contains many attractive computational characteristics. In particular, the Myers and Miller algorithm is dramatically more memory efficient than traditional methods such as the Needleman-Wunsch and Smith-Waterman.
To align two sequences of length N and M, both the Needleman and Wunsch and Smith Waterman require a memory resource proportional to N×M while Myers and Miller requires a memory section proportional to N. This becomes very important when N and M grow large.
Using graph theory, Myers and Miller were also able to improve the time scaling characteristics, a full exposition of which is beyond the scope of this text. See Myers and Miller CABIOS (1989) and “Bioinformatics and Genome Analysis” by David. W. Mount (in preparation). The Myers and Miller algorithms is well know and popular and is found in numerous applications such as: EMBOSS Stretcher, Fasta Align, ClustalW and GNU diff.
7. Compression Ratio
For a sequence A, the compression ratio is defined as the length of the compressed sequence of A divided by the length of the original sequence A. The compression ratio may be computed after using any string/sequence compression method such as the Lempel-Ziv algorithm or the Burros-Wheeler algorithm to compress the sequence A. String compression is well known in the art of computer science.
It may be an object of the present invention to create and implement a method for creating an assembly. The method may possess several steps. These steps may include obtaining a set of DNA or RNA sequence reads, grouping the DNA or RNA sequence reads into categories, or running an assembly program on each of the separate categories of DNA or RNA sequence reads. The grouping and running steps may also be executed recursively as necessary.
It may also be an object of the present invention to construct an apparatus for creating assemblies from DNA or RNA sequence reads. The apparatus may have several functions performed by various elements. For instance, there may be means for obtaining a set of DNA or RNA sequence reads, means for grouping DNA or RNA sequence reads into categories, and means for creating assemblies from the DNA or RNA sequence reads.
A further object of the present invention may be a computer system employing a graphical user interface or command line interface including a display device and a selection device, a method of displaying information on the display device in a menu form, accepting menu selection input from a user, and a method of processing data. The method of processing data may include retrieving a set of menu entries for the menu, each of the menu entries representing a method to perform upon DNA or RNA sequence reads, displaying the set of menu entries on the display device or displaying a set of parameters on the display device. The method may also include providing the user an opportunity to modify the set of parameters, receiving an indication of a menu entry selection from the user via the selection device, or in response to the indication of a menu entry selection, performing a method on the DNA or RNA sequence reads to create assemblies based on the set of parameters and said set of menu entries.
Another object of the present invention may be a set of application program interfaces embodied on a computer-readable medium for execution on a computer in conjunction with an application program that determines assemblies of DNA or RNA sequence reads. The application program interface may include a first interface that receives function selections for a method for assembling DNA or RNA sequence reads, a second interface that may receive parameters for the functions, a third interface that may receives DNA or RNA sequence reads, and an interface that may return an assembly of the DNA or RNA sequence reads.
It may also be an object of the present invention to create a divide and conquer algorithm to improve Phrap and other assembly methods. This method may divide DNA or RNA sequence reads into various groups by a variety of functions. These groups may then have assemblies created by Phrap or another assembler. These assemblies may then be used to create assemblies for all of the reads. The method may perform more than one level of recursion.
The present invention can be embodied as a software application resident with, in, or on any of the following: a database, a Web-server, a separate programmable device that communicates with a Web-sever through a communication means, a software device, a tangible computer-usable medium, or otherwise. Embodiments comprising software applications resident on a programmable device are preferred. Alternatively, the present invention can be embodied as hardware with specific circuits, although these circuits are not now preferred because of their cost, lack of flexibility, and expense of modification.
The present invention may be a computer program used in conjunction with Phrap or any other sequence assembly method. The computer program may be written in Perl, C, or any other language. A computer program may be joined with Phrap or any other sequence assembly program or run on top of it. A computer program may also be written to replace Phrap and determine sequence assemblies on its own. The present invention may be used to separate the reads into categories. These groups may then individually be used as input to Phrap, to another program, or to a separate module of the present invention. The resulting assemblies of Phrap, the other program, or separate module of the present invention are then combined into new sets of “reads” to be possibly used again as input to a recursion of Phrap, the other program, or separate module of the present invention.
The present invention may provide several ways to break the original sets of reads into separate categories. These sets may be created heuristically or by specific functions. The following are several functions that may be uses to create category groupings among the reads.
1. By Size
The reads may be broken up by size in a variety of ways. First, the size groupings may be on relative size. For instance the reads can be broken up into equal grouping where bins all contain the same number of reads (
2. By Entropy
Entropy is defined as −(Σp_k*log(p_k)). p_k, the probability of a given base occurring is defined as: (number of occurrences of base k)/(sequence length). Other methods of creating bins based on the entropy of the read may include other concepts such as deviation from mean or any other measurement based upon the entropy or relative entropy of the read.
3. By GC Percentage
GC percentage is defined as (Σguanine+Σcytosine)/(Σtotal bases). The algorithm may cluster the reads by GC percentage. The size of the bins may be based on the number of reads in each bin or the GC percentage. Other methods of creating bins based on the GC percentage of the read may include other concepts such as deviation from mean or any other measurement based upon the GC percentage or relative GC percentage of the read.
4. By Nature of Longest Repeat
In each read there may be some longest repeat. The repeat may be of a single base or a repetition of multiple bases. The algorithm may cluster each of the reads based on the longest repeat of a single base or the longest repeat of multiple bases. A repeating region is well know in the art. As an example, the sequence “TAGAGAGAGAGAGATCATCGAT” contains a GA repeat from bases three through sixteen, which is this sequence's longest repeat. The categories of the present invention may be based on the GC percentage or any other percentage of the repeat as well as similar methods.
5. By Nature of Longest High or Low Entropy Area
Each read may have a long high or low entropy area within the read's sequence. This area will have some GC content. An example of a sequence with high entropy and high GC content is “GAGTGTATCTGCCCGCCGGCGTGCCCGGCTAC”. The entropy is high because there is not an even distribution of A, G, C, and T. The GC percentage is high because over 70% percent of the sequence is a G or a C.
Using this method to divide reads, the reads may be placed into bins based on the GC content of the area of the longest high or low entropy read. Also an average of some number of the highest or lowest entropy areas may be used to categorize the data.
6. Information Carrying Capacity (Compression Ratio)
Each read may carry some amount of information. In a random ordering each strand may exhibit base probabilities close to ¼ for each base. Compression algorithms exploit information within a sequence to compress the sequence. The more information in the sequence, the more the sequence may be able to be compressed. Therefore sequences with more information will tend to be compressed further, as their compression ratio may exhibit. Further if sequences carry similar amounts of information then they may contain a similar structure.
The compression ratio may then be used to determine possible sequence similarity. The compression ratio may provide information about the information carrying content of reads. Reads with similar compression ratios may then be assumed to have a higher probability of possessing overlapping regions. Further, the information content may be able to take into account reads of different lengths by providing a metric that may accurately describe the complexity of each read. Therefore, compression ratios may be used to divide the reads into subgroups.
7. Compression of Appended Sequences
A different scheme to create the sequence grouping may be with a method that uses the compression ration of appended reads to create groupings. For instance two reads A and B may be tested to see if they should be placed into the same grouping. The test is conducted by compressing two sequences and comparing their results. The two sequences A and B are appended to form the sequence AB. The sequence AB is then compressed. If the AB compression ratio is high relative to the compression ratios of sequences A and B then sequences A and B may be placed into the same grouping. If the AB compression ratio is low relative to the compression ratios of A and B then the sequences A and B may not be placed into the same category.
A second scheme to create sequence grouping with the compression of appended sequences may be with a method that determines if compressions are high or low by using the metric of the compression ratio of AB subtracted by the compression ratio of AA. If this metric is below some threshold value then the sequences A and B may be placed into the same grouping. If the metric is above some threshold value then the sequences A and B may be placed into a separate grouping. In this way sequences can be compared to one another and placed into grouping. Whenever, a sequence is found to not fit into any of the current groupings, it is placed into a new grouping.
A second method may also be used to create new groupings for the first or second schemes to create sequence groupings through the compression of appended sequences. A single sequence is used to create the first grouping. After the best sequences based on the compression ratios, numbering some pre-determined (by-user or automatic) number, are placed into a single group based on the selected first sequence, a new sequence is selected from the group of sequences that are not yet in a group. Then the best sequences relative to this newly selected sequence are placed into a group. This process may continue until no sequence is not within a group.
8. Hybrid
Two or more of any of the above functions may be combined to create groupings. In this way an even better refinement of groupings may be possible.
The present invention may be executed in a fashion described in
The algorithm used for Phrap contains approximately N2 computations, where N is the number or reads to be assembled. However, the algorithm of the present invention contains approximately C*I*M2 computations. In the present invention, C is defined as the number of categories assembled individually, I is the number of iterations or recursions in the algorithm, and M is approximately the average number of sequence per category. Since M will tend to be much less than N (unless most of the reads are placed in the same category) the algorithm of the present invention may tend to run faster and need less computations as the number of reads increases.
The algorithm may be executed through a web-page and web-server. An exemplary display of such a web-page is
The preferred embodiment is for the present invention to be executed by a computer as software stored in a storage medium. The present invention may be executed as an application resident on the hard disk of a PC computer with an Intel Pentium or other microprocessor and displayed with a monitor. The processing device may also be a multiprocessor. The computer may be connected to a mouse or any other equivalent manipulation device. The computer may also be connected to a view screen or any other equivalent display device.
Referring to
In the current embodiment, a user may execute a plurality of functions, some of which are shown in
The embodiments described herein are merely illustrative of the principles of this invention. Other arrangements and advantages may be devised by one skilled in the art without departing from the spirit or scope of the invention. Accordingly, the invention should be deemed not to be limited to the above detailed description. Various other embodiments and modifications to the embodiments disclosed herein may be made by those skilled in the art without departing from the scope of the following claims.
Number | Date | Country | |
---|---|---|---|
Parent | 10043377 | Jan 2002 | US |
Child | 11139463 | May 2005 | US |