The present teachings relate to the field of sequencing genetic material.
Traditional shotgun sequencing forms scaffolds by examining the overlap between the sequenced ends of fragments. First, genetic sequence material is sheared into fragments. These fragments are size selected to isolate fragments of specific lengths, typically 2 kbp, 10 kbp, and 150 kbp. Selected fragments are inserted into cloning vectors and cloned. After removal from clones, the first several hundred bases of each end of the insert sequence are determined. Next, algorithms determine the orientation of the fragments and their relationship to each other utilizing fragment overlap and length information. Overlapping fragments are collapsed into a scaffold.
Generally, a significant number of bases between fragments must agree before it can be stated with a degree of certainty that the fragments do in fact overlap. The number of fragments, and hence clones, required for sequencing is directly proportional to the amount of overlap required. For example, statistical calculations show that 5× sequencing coverage (50× clone coverage) is required for a “good” assembly (~90% of all bases established). Clone coverage is defined as the average number of clones that cover any particular base, and sequencing coverage is defined as the average number of independently sequenced bases that are used to determine the consensus base. Thus, for 5× coverage, on average 5 independently sequenced bases cover any base of the consensus sequence.
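The two coverage definitions above reduce to simple arithmetic. The following is a minimal sketch that assumes idealized uniform sampling; the function names, parameter names, and numeric values are illustrative assumptions and are not taken from the present teachings.

```python
def clone_coverage(num_clones: int, mean_insert_len: float, genome_len: float) -> float:
    """Average number of clones covering any particular base."""
    return num_clones * mean_insert_len / genome_len


def sequencing_coverage(num_reads: int, mean_read_len: float, genome_len: float) -> float:
    """Average number of independently sequenced bases covering a consensus base."""
    return num_reads * mean_read_len / genome_len


# Illustrative values (assumptions): a 1 Mbp target, 10 kbp inserts,
# and two end reads of 500 bases per clone.
genome = 1_000_000
clones = 5_000
print(clone_coverage(clones, 10_000, genome))        # 50.0
print(sequencing_coverage(2 * clones, 500, genome))  # 5.0
```

With these assumed values, sequencing only the insert ends yields a clone coverage ten times the sequencing coverage, which is the relationship between the 50× and 5× figures quoted above.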
The present teachings employ a technique called Restriction Site Shotgun Sequencing (RSSS). It can reduce the amount of overlap required between fragment ends while still producing a good assembly. The decrease in overlap can be achieved by using additional information in the fragments to assist in determining that two fragments overlap.
To be added after claims are finalized.
The skilled artisan will understand that the drawings, described below, are for illustration purposes only. The drawings are not intended to limit the scope of the present teachings in any way.
a illustrates a clone comprising the vector, insert, and restriction sites.
b illustrates the digestion products of the clone in
a illustrates common fragment sizes for three clones starting at nonconcurrent positions.
b illustrates common fragment sizes for three clones starting at concurrent positions.
The section headings used herein are for organizational purposes only and are not to be construed as limiting the subject matter described in any way.
While the present teachings are described in conjunction with various embodiments, it is not intended that the present teachings be limited to such embodiments. On the contrary, the present teachings encompass various alternatives, modifications, and equivalents, as will be appreciated by those of skill in the art.
Figure one illustrates the traditional Shotgun Sequencing technique. First, a genetic sequence (102) is sheared to generate fragments (104). These fragments are then size selected to choose fragments that can be conveniently cloned (106). Selected fragments are inserted into vectors and cloned (108). Via sequencing, the first several hundred bases of each end of the insert sequence are determined (110). Next, via sequence assembly techniques, the insert end sequence information and the approximate length of the inserts can be used to determine overlapping fragments (112). These overlapping fragments can then be collapsed into a scaffold. For a more complete description of the process, the reader is referred to U.S. Pat. No. 6,714,874, incorporated by reference in its entirety.
Some embodiments use the process illustrated in
In some embodiments, fragments resulting from the digestion next undergo a labeling process to produce Labeled Digest Fragments (LDF). Some embodiments use a single-base extension reaction where the ddNTPs used in the reaction have a dye distinguishable from the dye on the other ddNTPs used in the sequencing reaction. The product of the single-base extension reaction is illustrated in
Looking back at
Some embodiments build a restriction map as indicated in 330 and further detailed in
Some embodiments group clones for tiling by examining the Total Length of Shared Sizes (TLSS) between clones. The TLSS is the sum of the fragment sizes shared between two clones. Thus two clones whose TLSS exceeds a threshold are designated as overlapping. If a clone is a candidate for joining a clone family and it does not meet the TLSS threshold, it is rejected. One method of determining a suitable TLSS threshold involves using a complete mammalian genome and determining, via simulation, a value of the TLSS for which there is a high probability that the clones overlap. To accomplish this, some embodiments shear the genome in silico into fragments of the length expected for the cutter that will be used for digestion. For example, a test genome can be generated using a mammalian C4 (created by Celera Genomics for customer use, designated as Release 26) genome sequence with any gaps filled with random scaffold sequences from a pool of mammalian DNA repeats. This results in a 2,861,601,159 base pair sequence. If a cutter that statistically would result in 10 kb fragments will be used, then the sequence can be sheared in silico into clone fragments with a mean length of 10 kb and a standard deviation of 1 kb. These inserts can be circularly annealed to a cloning vector such as pBR194c and digested in silico. The pBR194c vector is illustrated in
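The in silico shearing and digestion described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the simulation actually used: the recognition site "GATC", the random stand-in sequences, and the function names `shear` and `digest_circular` are all hypothetical, and each cut is placed at the first base of the recognition site for simplicity.

```python
import random


def shear(genome: str, mean_len: int, sd_len: int, n: int,
          rng: random.Random) -> list[tuple[int, str]]:
    """Draw n inserts at random positions with normally distributed lengths."""
    inserts = []
    for _ in range(n):
        length = max(1, int(rng.gauss(mean_len, sd_len)))
        length = min(length, len(genome))
        start = rng.randrange(0, len(genome) - length + 1)
        inserts.append((start, genome[start:start + length]))
    return inserts


def digest_circular(seq: str, site: str) -> list[int]:
    """Fragment sizes from cutting a circular molecule at every occurrence of site."""
    n = len(seq)
    doubled = seq + seq[:len(site) - 1]  # lets a site span the origin
    cuts = [i for i in range(n) if doubled.startswith(site, i)]
    if not cuts:
        return [n]  # uncut circle
    sizes = [b - a for a, b in zip(cuts, cuts[1:])]
    sizes.append(n - cuts[-1] + cuts[0])  # fragment that wraps around the origin
    return sorted(sizes)


# Usage with random stand-in sequences (assumptions, not real genome or vector data).
rng = random.Random(0)
genome = "".join(rng.choice("ACGT") for _ in range(50_000))
vector = "".join(rng.choice("ACGT") for _ in range(3_000))
for start, insert in shear(genome, 10_000, 1_000, 3, rng):
    print(start, digest_circular(insert + vector, "GATC"))
```

Because the clone is treated as a circle, the fragment sizes always sum to the combined length of insert and vector, which gives a quick sanity check on any simulated digest.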
a shows a few of the clone digests from the simulated genome. The clone position is shown in the first column. Between the first pair and the last pair of clones, which have no effective overlap, very few shared fragments (underlined) are found. The total number of overlapping base pairs is 509 in the first pair and 387 in the second. The inner pair of clones has 2.8 kb of overlap, and the shared fragments (bolded) are much more frequent. Their sum is 5,638 bp.
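The sum of shared fragment sizes between two clones can be computed as a multiset intersection. The sketch below assumes exact integer fragment sizes; measured sizes would carry sizing error and would likely need a matching tolerance, which is omitted here. The function names and the example sizes are hypothetical.

```python
from collections import Counter


def tlss(sizes_a: list[int], sizes_b: list[int]) -> int:
    """Total Length of Shared Sizes: sum of fragment sizes common to both clones."""
    shared = Counter(sizes_a) & Counter(sizes_b)  # multiset intersection
    return sum(size * count for size, count in shared.items())


def overlaps(sizes_a: list[int], sizes_b: list[int], threshold: int) -> bool:
    """Designate two clones as overlapping when their TLSS exceeds the threshold."""
    return tlss(sizes_a, sizes_b) > threshold


# Illustrative digests (assumed values).
clone_a = [120, 450, 450, 900, 1300]
clone_b = [450, 450, 900, 2000]
clone_c = [120, 700]
print(tlss(clone_a, clone_b))            # 1800
print(tlss(clone_a, clone_c))            # 120
print(overlaps(clone_a, clone_b, 1500))  # True
```

Using a multiset rather than a plain set means that a size occurring twice in both clones contributes twice to the total.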
In order to determine a suitable threshold for the total length of shared sizes that can be used to group clones together, some embodiments determine the threshold graphically as a TLSS value that non-overlapping clones are unlikely to reach. For example, in
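As a numerical counterpart to the graphical approach, a threshold can be read from the simulated TLSS values of clone pairs known not to overlap. The sketch below substitutes an empirical quantile for the graphical determination; the function name `tlss_threshold`, the `false_join_rate` parameter, and the sample values are assumptions made for illustration.

```python
def tlss_threshold(non_overlap_tlss: list[int], false_join_rate: float = 0.01) -> int:
    """Observed TLSS value exceeded by at most false_join_rate of non-overlapping pairs."""
    if not non_overlap_tlss:
        raise ValueError("need at least one simulated non-overlapping pair")
    ordered = sorted(non_overlap_tlss)
    # Number of non-overlapping pairs allowed to exceed the threshold.
    allowed = int(false_join_rate * len(ordered))
    return ordered[len(ordered) - 1 - allowed]


# Illustrative simulated TLSS values for non-overlapping pairs (assumed).
simulated = [0, 100, 150, 200, 300, 350, 400, 600, 800, 2500]
print(tlss_threshold(simulated, 0.0))  # 2500
print(tlss_threshold(simulated, 0.5))  # 300
```

A lower `false_join_rate` raises the threshold and rejects more true overlaps, so the value is a trade-off rather than a fixed constant.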
Once a set of overlapping clones is identified, the clones can be aligned into a restriction fragment map.
One skilled in the art will appreciate that a variety of tiling algorithms exist that can form the basis for the tiling process described herein. For example, the method of Durand (“An efficient program to construct restriction maps from experimental data with realistic error levels”, Nucleic Acids Research v 12:1, 703-716, 1984) can serve as the basis.
Some embodiments employ logic that considers the length of the insert-end fragments in conjunction with the tiling path to place the insert-end fragments. If a frequent cutter is used in the digestion, it is likely that the sequenced portion of the clone will be cut. This will result in at most two fragments that do not fit the tiling path. If there are more than two fragments, some embodiments can flag the clone as a false join. In
Some embodiments detect the location of fragments greater than the accepted maximum size by taking notice of multiple end fragments that cannot be placed without overlap. For example, in
Some embodiments recover insert digest fragments (type 1 or type 3) that are masked by the subtracted-out type 2 digest fragments. Short insert digest fragments are likely covered by a sequencing read. Longer fragments can be detected by adding the marked vector fragment sizes one by one to a clone to see if the tiling with its neighbors improves.
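The one-by-one test for masked fragments can be sketched as follows. The sketch assumes that "improved tiling" can be approximated by a higher summed shared length with the neighboring clones; that scoring choice, the function names, and the example sizes are assumptions rather than part of the described embodiments.

```python
from collections import Counter


def shared_length(a: list[int], b: list[int]) -> int:
    """Sum of fragment sizes common to two digests (multiset intersection)."""
    shared = Counter(a) & Counter(b)
    return sum(size * n for size, n in shared.items())


def recover_masked(clone: list[int], neighbors: list[list[int]],
                   vector_sizes: list[int]) -> list[int]:
    """Re-add vector-sized fragments that improve agreement with neighboring clones."""
    def score(sizes: list[int]) -> int:
        return sum(shared_length(sizes, nb) for nb in neighbors)

    current = list(clone)
    for size in vector_sizes:
        trial = current + [size]
        if score(trial) > score(current):
            current = trial  # an insert fragment of this size was likely masked
    return current


# Illustrative values (assumed): a 1200 bp insert fragment was subtracted
# because it matched a vector fragment size; 750 bp was a true vector fragment.
print(recover_masked([300, 900], [[300, 900, 1200], [1200, 2000]], [1200, 750]))
# [300, 900, 1200]
```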
Some embodiments account for polymorphisms that either remove a cutting site or add a new one. Logic can be used that recognizes that two non-conforming fragments can be joined to form a fragment pair that would be the same length as another fragment pair from a clone that does fit the tiling path. Such pairs can be created combinatorially from the non-conforming fragments to see if a pair matches a conforming fragment of an adjacent clone. This is illustrated in
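The combinatorial pairing of non-conforming fragments can be sketched as follows. The function name, the optional `tolerance` parameter, and the example sizes are hypothetical; the sketch covers only the case in which two fragments of one clone sum to a single conforming fragment of an adjacent clone.

```python
from itertools import combinations


def explain_by_polymorphism(non_conforming: list[int], neighbor_sizes: list[int],
                            tolerance: int = 0) -> list[tuple[int, int, int]]:
    """Return (a, b, target) where two non-conforming fragments sum to a neighbor fragment."""
    matches = []
    for a, b in combinations(non_conforming, 2):
        for target in neighbor_sizes:
            if abs(a + b - target) <= tolerance:
                matches.append((a, b, target))
    return matches


# Illustrative values (assumed): an extra cutting site splits a 1050 bp fragment.
print(explain_by_polymorphism([400, 650, 1000], [1050, 2200]))
# [(400, 650, 1050)]
```

A removed cutting site is the mirror case and can be checked by swapping the roles of the two clones.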
Some embodiments consider the effects of a clone having multiple same-sized fragments. For example,
Once information in addition to the fragment sizes is used to check the tiling and the orientations of as many fragments as possible, the tiling can be collapsed into a scaffold as indicated in
Computer system 500 may be coupled via bus 502 to a display 512, such as a cathode ray tube (CRT) or liquid crystal display (LCD), for displaying information to a computer user. An input device 514, including alphanumeric and other keys, is coupled to bus 502 for communicating information and command selections to processor 504. Another type of user input device is cursor control 516, such as a mouse, a trackball or cursor direction keys for communicating direction information and command selections to processor 504 and for controlling cursor movement on display 512. This input device typically has two degrees of freedom in two axes, a first axis (e.g., x) and a second axis (e.g., y), that allows the device to specify positions in a plane.
A computer system 500 can perform the methods described in the present teachings. Consistent with certain implementations of the invention, a consensus sequence or scaffold can be provided by computer system 500 in response to processor 504 executing one or more sequences of one or more instructions contained in memory 506. Such instructions may be read into memory 506 from another computer-readable medium, such as storage device 510. Execution of the sequences of instructions contained in memory 506 causes processor 504 to perform the process described herein. Alternatively, hard-wired circuitry may be used in place of or in combination with software instructions to implement the invention. Thus, implementations of the invention are not limited to any specific combination of hardware circuitry and software.
The term “computer-readable medium” as used herein refers to any media that participates in providing instructions to processor 504 for execution. Such a medium may take many forms, including but not limited to, non-volatile media, volatile media, and transmission media. Non-volatile media includes, for example, optical or magnetic disks, such as storage device 510. Volatile media includes dynamic memory, such as memory 506. Transmission media includes coaxial cables, copper wire, and fiber optics, including the wires that comprise bus 502. Transmission media can also take the form of acoustic or light waves, such as those generated during radio-wave and infra-red data communications.
Common forms of computer-readable media include, for example, a floppy disk, a flexible disk, a hard disk, magnetic tape, or any other magnetic medium, a CD-ROM, any other optical medium, punch cards, paper tape, any other physical medium with patterns of holes, a RAM, a PROM, an EPROM, a FLASH-EPROM, any other memory chip or cartridge, a carrier wave as described hereinafter, or any other medium from which a computer can read.
Various forms of computer readable media may be involved in carrying one or more sequences of one or more instructions to processor 504 for execution. For example, the instructions may initially be carried on the magnetic disk of a remote computer. The remote computer can load the instructions into its dynamic memory and send the instructions over a telephone line using a modem. A modem local to computer system 500 can receive the data on the telephone line and use an infra-red transmitter to convert the data to an infra-red signal. An infra-red detector coupled to bus 502 can receive the data carried in the infra-red signal and place the data on bus 502. Bus 502 carries the data to memory 506, from which processor 504 retrieves and executes the instructions. The instructions received by memory 506 may optionally be stored on storage device 510 either before or after execution by processor 504.
The foregoing description of an implementation of the invention has been presented for purposes of illustration and description. It is not exhaustive and does not limit the invention to the precise form disclosed. Modifications and variations are possible in light of the above teachings or may be acquired from practicing the invention. Additionally, the described implementation includes software, but the present invention may be implemented as a combination of hardware and software or in hardware alone. The invention may be implemented with both object-oriented and non-object-oriented programming systems.
All literature and similar materials cited in this application, including but not limited to, patents, patent applications, articles, books, treatises, and internet web pages are expressly incorporated by reference in their entirety for any purpose.
While the present teachings have been described in terms of these exemplary embodiments, the skilled artisan will readily understand that numerous variations and modifications of these exemplary embodiments are possible without undue experimentation. All such variations and modifications are within the scope of the current teachings.
This application claims a priority benefit under 35 U.S.C. § 119(e) from U.S. Patent Application No. 60/579,742, filed Jun. 15, 2004, which is incorporated herein by reference.
Number | Date | Country
---|---|---
60579742 | Jun 2004 | US