The present disclosure generally relates to sequencing data samples and, more specifically, to sequencing data samples to detect and identify non-host nucleic acid sequences.
With the advent of nucleic acid sequencing, it has become possible to identify the presence of an organism based on the presence of its nucleic acids, without relying on the growth of the organism, or presence of non-nucleic acid macromolecules. Sequencing has also been used to identify the presence of previously unknown bacteria. These bacteria have been discovered in environmental sites (ocean, Antarctic, deep sea vents) and on the human body (oral, elbow crease, gut). In many examples, this discovery process is based on (1) “broad range” amplification with primers from highly conserved regions in the 16S ribosomal subunit, (2) obtaining sequence information for the variable region of the amplicon(s) that is between the primers, (3) comparing the sequence information to a database of the 16S sequences for known bacteria, (4) analyzing those sequences that are not in the database and determining which (if any) of the known bacteria are close relatives and (5) based on this “relatedness,” assigning the bacteria associated with the new 16S sequence to a likely taxon, genus, species, etc. In one approach the conserved sequences are in the 16S/23S genes and produce an amplicon for sequencing in the variable internal transcribed spacer (ITS) region that is between them. In fungi, the approach is similar. The 18S/5.8S/28S genes are highly conserved and have the ITS1 and ITS2 between them, respectively, which are the variable regions that are sequenced and used for comparison.
However, this strategy is based on a single or a limited number of sites that have conserved regions. Conventional strategies rely on highly conserved regions. Such approaches provide a very narrow scope for comparison and determination of a new species. Whole genome approaches to finding new sequences are needed. Currently the sequencing capacity has been developed to generate the required data. However, there are no tools to effectively analyze this amount of data. These needs and others are the subject of the present disclosure.
Generally, the disclosure is directed to identifying known and unknown non-reference nucleic acid sequences (i.e., nucleic acid sequences that are not typically found in a reference, or source of nucleic acids) using sequence data. This can be achieved by comparing one or more sample sequences with reference sequences in a data structure and excluding one or more sample sequences that are associated with the reference sequences in the data structure, or by excluding all sequences that are associated with the nucleic acid sequence source (reference genome or genomes) and also excluding all sequences that can be associated with any known genome or gene. The disclosure is also directed to data structures that can be employed to identify non-reference nucleic acid sequences using sequence data. Files containing sequence data and other information can be loaded into the data structures, where the sequences can be used as the searchable key in the data structures. The disclosure also includes mapping a sample sequence to a reference sequence that includes any known genome with any number of mismatches. These and other advantages and features of the present disclosure will become apparent to those of ordinary skill in the art upon reading this disclosure in its entirety.
Generally, the disclosure addresses identifying known and unknown non-reference nucleic acid sequences (i.e., nucleic acid sequences that are not typically found in a reference genome, or source of nucleic acids) using sequence data. This can be achieved by comparing one or more sequences of the sample with reference sequences in a data structure and excluding one or more sequences of the sample that are associated with the reference sequences in the data structure. Further, this can be achieved by excluding all sequences of the sample that are associated with the nucleic acid sequence source and also excluding all sequences of the sample that can be associated with any known genome or gene. The remaining sequences of the sample are unknown non-reference sequences because they do not correspond to any of the reference sequences detected. In various embodiments, the remaining sequences of the sample can be used as seeds for the de novo assembly of unknown nucleic acid sequences not from the nucleic acid sequence source.
The disclosure includes data structures that can be employed to identify unknown non-reference nucleic acid sequences using sequence data. Files containing sequence data and other information can be loaded into the data structure, where the sequences can be used as the searchable key in the data structure. Further, the data structure can allow all sequences contained in the data structure to be simultaneously considered when mapping a sequence of the sample to a reference genome sequence and/or any known genome or gene. Additionally, the methodologies described herein can exhaustively map a sample sequence to reference sequences and also need not rely upon incomplete matching heuristics.
The disclosure also includes mapping a sequence of the sample to a reference sequence that includes any known genome with any number of mismatches (i.e. insertions, deletions, or substitutions of nucleic acids). For example, the sequence of the sample can be mapped to the reference sequence with no mismatches. If the sequence of the sample does not map to the reference sequence with no mismatches, the sequence of the sample can be mapped to the reference sequence with one mismatch. If the sequence of the sample does not map to the reference sequence with one mismatch either, the sequence of the sample can be mapped to the reference sequence with any number of mismatches until the sequence of the sample maps to the reference sequence. In general, once the sequence of the sample maps to the reference sequence with k number of mismatches, the sequence of the sample need not be mapped to the reference sequence with k+1 number of mismatches. In another example, if the sequence of the sample exactly maps to the reference sequence, then the sequence of the sample need not be mapped to the reference sequence with one mismatch.
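By way of a non-limiting illustration, the stepwise mapping described above can be sketched as follows. The function names `hamming` and `first_mapping_k` and the parameter `max_k` are hypothetical, and the sketch assumes substitution-only mismatches; insertions and deletions are not handled here.

```python
def hamming(a, b):
    """Count positions at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))


def first_mapping_k(read, reference, max_k):
    """Return the smallest k (0..max_k) at which the read maps, else None.

    Once the read maps with k mismatches, larger values are never tried.
    """
    n = len(read)
    windows = [reference[i:i + n] for i in range(len(reference) - n + 1)]
    for k in range(max_k + 1):
        if any(hamming(read, w) <= k for w in windows):
            return k  # stop here: k + 1 need not be attempted
    return None
```

A read that returns `None` for the chosen `max_k` would remain a candidate non-reference sequence.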
Furthermore, in one example, all sequence data reads may be inserted into a lookup table by reducing the sequence data reads into addresses in the lookup table. Next, every subsequence of size N may be determined across the reference sequence, such as the host genome, and all possible variants of each subsequence with 0 to k mismatches can be generated. It can then be determined whether any of the possible mismatch variants match any of the addresses already occupied by sequences from the sample. This approach may be exhaustive and, moreover, may proceed without any sequence alignment. All possible variants may be generated with a given number of mismatches; however, these need not be stored and instead the variants may be iteratively processed.
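As one possible sketch of this lookup-table approach, the following uses a Python dictionary as the lookup table and a generator so that variants are processed one at a time rather than stored. The names `variants` and `exclude_reference_reads` are hypothetical, reads are assumed to share one length N, and only substitution mismatches are enumerated.

```python
from itertools import combinations, product

ALPHABET = "ACGT"


def variants(kmer, k):
    """Lazily yield kmer and every substitution variant with 1..k mismatches."""
    yield kmer
    for n_mis in range(1, k + 1):
        for positions in combinations(range(len(kmer)), n_mis):
            # At each chosen position, allow only bases that differ.
            choices = [[b for b in ALPHABET if b != kmer[p]] for p in positions]
            for replacement in product(*choices):
                chars = list(kmer)
                for p, b in zip(positions, replacement):
                    chars[p] = b
                yield "".join(chars)


def exclude_reference_reads(reads, reference, k):
    """Return {read: copies} for reads matching no reference window within k."""
    table = {}
    for r in reads:  # the read itself is the key; dict hashing gives the address
        table[r] = table.get(r, 0) + 1
    if not table:
        return {}
    n = len(next(iter(table)))  # assumption: all reads have the same length
    matched = set()
    for i in range(len(reference) - n + 1):
        for v in variants(reference[i:i + n], k):
            if v in table:
                matched.add(v)
    return {r: c for r, c in table.items() if r not in matched}
```

The number of variants per window grows quickly with k, which is why the sketch iterates over them instead of materializing the full set.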
Another embodiment can take the form of a one-sample sequencing approach. In such an approach, a determination can be made for every nucleic acid sequence of the sample as to whether the sequence can be mapped to a reference genome exactly (i.e., with no mismatches). Sequences of the sample that can be exactly mapped to a reference genome are excluded from the list of potential non-reference nucleic acid sequence members. A determination can then be made as to whether any of the remaining sequences of the sample can be mapped to the reference genome with one, two, three and so on mismatches as appropriate or desired. The remaining sequences of the sample that can be mapped to the reference genome with k mismatches can be excluded from the list of potential non-reference nucleic acid sequences. Additionally, the number of mismatches, k, may be a user-chosen parameter. For example, N may be the length of the nucleic acid sequence. Thus, as long as k/N is higher than the sequencing error rate, k may be a sufficient choice by the user. Further, the number of mismatches may depend on a number of factors such as the mutation rate in the organism, genomic variability of the organism, the sequencing error rate and so on.
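As a simple illustration of the k/N criterion only, the following hypothetical helper returns the smallest k for which k/N exceeds a given sequencing error rate. It deliberately ignores the other factors mentioned above (mutation rate, genomic variability), which a user may also weigh.

```python
def smallest_sufficient_k(read_length, error_rate):
    """Smallest integer k such that k / read_length > error_rate."""
    k = 0
    while k / read_length <= error_rate:
        k += 1
    return k
```

For instance, with 36-base reads and an assumed 1% error rate, a single mismatch already satisfies the criterion.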
Yet another embodiment can take the form of a two-sample sequencing approach. In such an approach, for example a sample from tissue affected by a disease or disorder may be sequenced, and then a sample from (apparently) healthy tissue of the same organism may be sequenced. Next, all sequences that are common to both samples can be excluded. Optionally, all sequences associated with any known genome or gene also can be excluded. Optionally, the remaining sequences of the sample can be used as seed sequences for the de novo assembly of potential unknown non-reference nucleic acid sequence.
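A minimal sketch of the two-sample exclusion, assuming exact matching only and treating the optional collection of known genome or gene sequences as a set of read-length strings (a simplification), might look as follows. The name `two_sample_filter` is hypothetical.

```python
from collections import Counter


def two_sample_filter(comparison_reads, control_reads, known_sequences=()):
    """Keep comparison reads absent from the control sample and from known sequences.

    Returns {read: copy_number}; the survivors are candidate seed sequences.
    """
    control = set(control_reads)
    known = set(known_sequences)  # optional second exclusion step
    counts = Counter(comparison_reads)
    return {seq: n for seq, n in counts.items()
            if seq not in control and seq not in known}
```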
It should be noted that embodiments of the present disclosure can be used for any type of sequencing data or in any method used to identify non-reference nucleic acid sequence. The embodiment can include or work with a variety of nucleic acid sequence data, including DNA data, RNA data, methylated DNA data, data sequencing systems, data sequencing computations and methodologies, and the like. Aspects of the present disclosure can be used with practically any apparatus related to data sequencing and data sequencing devices or any apparatus that can relate to any type of data system, or can be used with any system in the identification of non-reference nucleic acid sequence. Accordingly, embodiments of the present disclosure can be employed in computers, data processing systems and devices used in data sequencing, and the like.
Before explaining the disclosed embodiments in detail, it is to be understood that the disclosure is not limited in its application to the details of the particular arrangements shown, and is capable of being realized in still other embodiments. Moreover, aspects of the disclosure can be set forth in different combinations and arrangements to define disclosures unique in their own right. Also, the terminology used herein is for the purpose of description and not of limitation.
The sequencing system 120 can connect to the computing environment by any methods, such as through proprietary, local or wide area network, and the like. The sequencing system 120 can connect to a server 130 and to a central processing unit 140 (“CPU”) via a communication bus 150. The CPU 140 can include a processor 142 and a main memory 144. The main memory 144 is a computer readable storage medium that is operable to store applications and/or other computer executable code which runs on the processor 142. The memory 144 may be volatile or non-volatile memory implemented using any suitable technique or technology such as, for example, random access memory (RAM), disk storage, flash memory, solid state and so on. There can be one CPU or multiple CPUs for the system 100. It is also possible for the server 130 and the CPU 140 to be one system or separate systems in the computing environment.
In one example, various devices in the system 100 can also communicate with each other through the communication bus 150. Although only one communication bus is illustrated, this is done for explanatory purposes and not to place limitations on the system 100. Generally, multiple communication buses can be included in any computing environment. As shown in
A user input interface 160 and a data storage interface 170 can also be connected to the communication bus 150. The user input interface 160 can allow a user to input information and/or to receive information through one or multiple input devices such as input device 162, within the hosted development environment or to the client systems 110. The user inputs can include various elements such as a keyboard, a touchpad, a mouse, any type of monitor including CRT monitors and LCD displays, speakers, a microphone, and the like. The data storage interface 170 can include data storage devices such as data storage device 172 (including databases, hard drives, tape drives, floppy drives, and the like).
Further, when sequencing the sample, either or both of the DNA and/or RNA nucleic acid sequences can be simultaneously considered. For example, a virus can appear in the sample as DNA, single-stranded RNA, or double stranded RNA. Even though the virus can appear as any one of these forms, the virus can still be detected in the sample because either or both of the DNA and/or RNA can be simultaneously considered. Further, the methodologies described herein can exhaustively map a sample sequence to reference sequences and also need not rely upon incomplete matching heuristics.
The nucleic acid sequence can be identified by any sequencing methods known in the art. In
In one non-limiting embodiment, the nucleic acids in the sample can be sequenced by Maxam Gilbert sequencing. Maxam Gilbert sequencing is “chemical sequencing” based on chemical modification of DNA and subsequent cleavage at specific bases. Classically, nucleic acids are radioactively labeled at one end and the DNA fragment to be sequenced is purified. Chemical treatment generates breaks at a small proportion of one or two of the four nucleotide bases in each of four reactions (G, A+G, C, C+T). Thus a series of labeled fragments is generated, from the radiolabelled end to the first ‘cut’ site in each molecule. The fragments in the reactions are then size-separated by gel electrophoresis and the order of the bands indicates the sequence.
In another non-limiting embodiment, the nucleic acids in the sample can be sequenced by Sanger sequencing. The Sanger method is based on termination of DNA synthesis in a small portion of molecules. The label can be radioactively or fluorescently labeled nucleotides or primers. The DNA sample is divided into four separate sequencing reactions, containing the four standard deoxynucleotides and the DNA polymerase. To each reaction is added a small concentration of only one of the four dideoxynucleotides (ddATP, ddGTP, ddCTP, or ddTTP). Incorporation of a dideoxynucleotide into the elongating DNA strand terminates extension, resulting in various DNA fragments of varying length. The reactions are then size-separated by gel electrophoresis and the order of the bands indicates the sequence.
In another non-limiting embodiment, the nucleic acids in the sample can be sequenced by dye-terminator sequencing. Dye-terminator sequencing is an alternative to the chain-termination in that the four ddNTPs each have a separate fluorescent label. This allows for a single reaction mixture and single lane on the gel.
In another non-limiting embodiment, the nucleic acids in the sample can be sequenced by sequencing by synthesis. The incorporation of the next base is observed, instead of observing the termination of synthesis. In another non-limiting embodiment, the nucleic acids in the sample can be sequenced by pyrosequencing. Pyrosequencing is based on detecting the activity of DNA polymerase with a chemiluminescent enzyme. The template DNA is immobilized, and solutions of A, C, G, and T nucleotides are added sequentially. Light is produced only when the nucleotide solution complements the first unpaired base of the template. The sequence of solutions which produce chemiluminescent signals allows the determination of the sequence of the template. The light can be detected in low throughput or in high throughput on an array (e.g., 454 (Roche) with the GS FLX).
In another non-limiting embodiment, the nucleic acids in the sample can be sequenced by reversible terminator sequencing (also a sequencing by synthesis method). This method is similar to dye-terminator sequencing, but differs in that reversible versions of dye-terminators are used. One nucleotide at a time is added by the polymerase. The fluorescence corresponding to that position is detected. The blocking group of the terminator NTP is removed. This allows the polymerization of another nucleotide (Illumina and Helicos).
In another non-limiting embodiment, the nucleic acids in the sample can be sequenced by sequencing by ligation. This method uses a DNA ligase enzyme to identify the target sequence and is used in the polony method and in the SOLiD technology (Applied Biosystems, now Invitrogen). There is a pool of all possible oligonucleotides of a fixed length, labeled according to the sequenced position. Oligonucleotides are annealed and ligated; the preferential ligation by DNA ligase for matching sequences results in a signal corresponding to the complementary sequence at that position.
In another non-limiting embodiment, the nucleic acids in the sample can be sequenced by “sequencing by hybridization.” This method uses a microarray. A single pool of DNA is fluorescently labeled and hybridized to an array of known sequences. If the DNA hybridizes strongly to a given spot on the array, causing it to “light up”, then that sequence is inferred to exist within the DNA being sequenced.
Other sequencing methods currently under development may include nanopore sequencing, sequencing by labeling the DNA polymerase and sequencing by electron microscope.
Next, in the operation of block 220, the nucleic acid sequences of the sample that can be associated with the reference genome can be excluded from the list of potential unknown non-reference sequences. The operation of block 220 will be discussed in more detail with respect to
In the operation of block 240, the remaining sequences of the sample can be used as seed sequences for the de novo assembly of potential unknown non-reference nucleic acid sequences. The remaining sequences of the sample can be the sequences of the sample after all sequences associated with the reference genome have been identified and excluded and all sequences of the sample associated with any known genomes or genes have been identified and excluded. Excluding sequences of the sample associated with the sample genome and with any reference sequence (e.g. known genome or gene) will be discussed in further detail below with respect to
Additionally, the de novo assembly process can be similar for both the one-sample and the two-sample sequencing approaches (the two-sample sequencing approach will be discussed in further detail below). The de novo assembly process, for either approach, can use the sequences of the sample that have passed all of the mapping filtration steps as seed sequences. Both the sequences that passed the filters and the sequences that did not pass the filters can be used in the assembly process. For example, the assembled sequences can have an associated “score” or quality of the assembled sequences that can indicate the number of sequences from the excluded categories and the number of sequences from the passed categories. The score can allow the identification of which sequences can be newly identified non-reference nucleic acid sequence and which sequences can be from known genomes or genes.
For example, category A can have sequences of the sample that can be mapped to the reference genome (with or without mismatches) and category B can have sequences of the sample that can be mapped to the known non-reference genomes or genes (non-host but still genomic material, with or without mismatches). Category C can have sequences of the sample that passed all the filters from either the one-sample or two-sample sequencing approach. Continuing this example, for de novo assembly from a sequence from category C, the number of sequences from each category that were used to extend the assembly can be monitored. Then, for each assembled contig, the ratio can be calculated between the number of subsequences that are in categories A, B or C. If the majority of sequences used to assemble the contig came from category A, it can be concluded that the contig is more likely to be associated with the reference genome than with non-reference genomic material. Alternatively, if the category B sequences are predominant in the assembly, it can be concluded that the contig is associated with non-reference but known genomic material (and possibly, the exact genome source can be identified). However, the contig may also consist of predominantly category C sequences and this may contribute to evidence of the identification of a new unknown non-reference organism.
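The category-ratio scoring described above can be illustrated with the following sketch. The function name `classify_contig` and the label strings are hypothetical, and the sketch assumes "predominant" means the most frequent category, with ties resolved in the order A, B, C.

```python
from collections import Counter

LABELS = {
    "A": "likely reference genome",
    "B": "likely known non-reference genome or gene",
    "C": "candidate unknown non-reference sequence",
}


def classify_contig(read_categories):
    """Score a contig from the categories of the reads used to assemble it.

    read_categories: iterable of "A", "B" or "C", one entry per read used.
    Returns (ratios, label) where ratios maps each category to its share.
    """
    counts = Counter(read_categories)
    total = sum(counts[c] for c in "ABC")
    if total == 0:
        raise ValueError("contig has no categorized reads")
    ratios = {c: counts[c] / total for c in "ABC"}
    # max() keeps the first category on ties, i.e. A before B before C
    dominant = max("ABC", key=lambda c: counts[c])
    return ratios, LABELS[dominant]
```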
Also in
The operation of block 310 can continue until the nucleic acid sequences of the sample cannot be exactly mapped to the reference genome sequence. At the point when the nucleic acid sequences of the sample cannot be exactly mapped to the reference genome, the method proceeds to the operation of block 320. In the operation of block 320, the nucleic acid sequences of the sample that can be mapped to the reference genome with one or any combination of one, two, three or more mismatches, can be excluded from potential non-reference nucleic acid sequence members. As mentioned previously, mismatches can include one or any combination of insertions, deletions and/or substitutions. For example, a mismatch of two can include an insertion and a deletion and a mismatch of three can include an insertion, a deletion and another insertion in the nucleic acid sequence data. When the nucleic acid sequences of the sample cannot be mapped to the reference genome with mismatches, as deemed necessary, the process is complete for excluding sequences that can be associated with the reference genome, as depicted in the operation of block 340.
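To illustrate how insertions, deletions and substitutions can each count as one mismatch, the following sketch computes the fewest such edits needed to align a read to any substring of a reference. It uses a standard dynamic-programming formulation of approximate substring matching, which is an assumption of this example and not necessarily the technique used in the described embodiments; the function name is hypothetical.

```python
def min_edits_to_reference(read, reference):
    """Fewest insertions, deletions and substitutions aligning read to any
    substring of reference (the alignment may start and end anywhere)."""
    # prev[j]: best cost for the read prefix processed so far, ending at
    # reference position j. Row 0 is all zeros: starting anywhere is free.
    prev = [0] * (len(reference) + 1)
    for i, r in enumerate(read, 1):
        cur = [i]  # read prefix aligned to nothing costs i insertions
        for j, g in enumerate(reference, 1):
            cur.append(min(
                prev[j] + 1,             # extra base in the read (insertion)
                cur[j - 1] + 1,          # reference base missing from the read (deletion)
                prev[j - 1] + (r != g),  # match (0) or substitution (1)
            ))
        prev = cur
    return min(prev)  # ending anywhere in the reference is free
```

A read would then be excluded in block 320 if this value is at most the chosen number of mismatches.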
The method 400 can continue to the operation of block 420 once the nucleic acid sequences of the sample cannot be mapped exactly to any known genome or genes in the database. In operation of block 420, all the nucleic acid sequences of the sample that can be mapped to the collection of known genomes and genes with mismatches can be determined. As discussed with respect to
It should be noted that the one-sample sequencing approach for identification of unknown non-reference nucleic acid sequences can also be attempted using a brute force de novo sequencing method. This approach does not exclude sequences associated with the host genome, but rather uses all sequencing sample reads as seeds for de novo assembly. The assembled sequences can be mapped to all available/known gene and genome reference sequences, such as those publicly available through GenBank sequences, to obtain positive or suggestive identification of the source of genomic sequences.
The operation of block 510 of method 500 can sequence the sample from the affected tissue (the comparison sample). Similarly, the operation of block 520 can sequence the sample from the apparently healthy tissue of the same organism as the comparison sample tissue. The operations of blocks 510 and 520 can also be performed in the opposite order and the samples can be sequenced, as discussed previously. Sequencing the comparison sample first and the healthy sample (control) second, is for explanatory purposes only, and generally the sequencing can be performed in either order or at the same time (at different locations/lanes, etc).
Next, in the operation of block 530, all sequences that are common to both the control sample and the comparison sample can be excluded.
Next in the operation of block 620, the comparison sample set of sequences can be mapped to the control sample set of sequences with any combination of one, two, three or more insertions, deletions and/or substitutions (mismatches). The comparison sample set of sequences that can be mapped to the control sample set of sequences with mismatches can be excluded from the comparison sample sequence, i.e., from potential non-reference nucleic acid sequence members in the comparison sample sequence in the operation of block 630. Once the comparison sample set of sequences cannot be mapped to the control sample set of sequences with mismatches, the method 600 completes in the operation of block 640.
Returning to
Various data structures can be used to implement the methodologies discussed above. Following is a detailed discussion of data structures and the implementation of data structures, mapping, assembly and various applications of the one-sample sequencing approach and the two-sample sequencing approach. The various data structures can include and/or employ sequences that can be organized in the data structure, where the sequences can be available in two formats, base-only and base-and-quality. Each base of a sequence in the base-only format belongs to an alphabet such as {A, T, G, C, N} for DNA, or {A, U, G, C, N} for RNA, where N means a given base has not been determined by the sequencing method. Each base of a sequence in the base-and-quality format is a pair (b, qi), where b is in an alphabet {A, T, G, C, N} for DNA, or {A, U, G, C, N} for RNA, and qi, for i = 1 to the sequence length, is the probability of error for base i (using next generation sequencing, the probability that a given base is determined incorrectly).
In various aspects, the number of reference sequences (e.g. host sequences) is several orders of magnitude greater than the number of unidentified sequences. In various non-limiting examples, the reference sequence can be present in a proportion on the order of 10^5, 10^6, 10^7, 10^8, 10^9, or 10^10 greater than the non-reference sequence. The number of sequences of reference sequence and non-reference sequences in the data structure can thus be chosen to have multiple non-reference sequences in a given sample.
In one non-limiting example, in the case of a virus infecting a host, a virus with a 10 kb genome is integrated entirely into a single chromosome location in all cells in the affected human tissue (sample one). The human haploid genome is 3.2 Gb, so each human cell has approximately 6.4 Gb genomic material. If DNA is obtained from these cells, the virus DNA is approximately six orders of magnitude less abundant than the DNA obtained from the human. If short sequences are randomly generated from the sample, then approximately 1 of every 1 million reads should be the virus DNA. Thus in this scenario the theoretical minimum of sequencing information that is required is 1 million sequences.
In various aspects, the size of the obtained sequences can determine the total amount of sequencing data. If the sequences are each on average 50 bases in length, then 10^6 sequences represents 50 Mb of sequence information. If the length of the sequences is 36 bases, then 10^6 sequences represents 36 Mb of sequencing data. If this single detected sequence is different from all sequences (in this case host sequences) in the reference sequences in a second data set (e.g. partial or entire human genome) by 1 or more bases (mismatches include substitutions, insertions or deletions in any position and in any combination), then the described method would identify the sequence as characteristic of sample one (i.e., a non-host nucleic acid sequence) and use the sequence in conjunction with a search algorithm to find a known homologous sequence and a potential identity of the non-reference DNA. In most cases, selecting the average number of non-reference sequences to be one is not preferred, so the number of non-reference sequences likely to be identified can be increased by increasing the number of reference sequences that are entered into the data structure.
In another non-limiting example, a bacterium with a 5 Mb genome is associated with all of the cells in the affected tissue (sample one). The human haploid genome is 3.2 Gb, so each human cell has approximately 6.4 Gb genomic material. The bacterial DNA represents approximately three orders of magnitude less than the DNA obtained from the sample. If 50 base sequences are randomly generated from the sample, then approximately 1 of every 4 thousand reads should be bacterial DNA. Thus, in this scenario, the theoretical minimum sequencing information that is required is 4 thousand sequences.
If the sequences are each on average 50 bases in length, then 4000 sequences represents 0.2 Mb of sequence information. If the length of the sequences is 36 bases, then 4000 sequences represents approximately 0.15 Mb of sequencing data. If this single detected sequence is different from all sequences (in this case host sequences) in the reference sequences in a second data set (e.g. partial or entire human genome) by 1 or more bases (mismatches include substitutions, insertions or deletions in any position and in any combination), then the described method would identify the sequence as characteristic of sample one (i.e., a non-host nucleic acid sequence) and use the sequence in conjunction with a search algorithm to find a known homologous sequence and a potential identity of the non-reference DNA. In most cases, selecting the average number of non-reference sequences to be one is not preferred, so the number of non-reference sequences likely to be identified can be increased by increasing the number of reference sequences that are entered into the data structure.
In another non-limiting example, a virus with a 10 kb genome is associated with 10% of the cells in an affected tissue (sample one). The human haploid genome is 3.2 Gb so each human cell has approximately 6.4 Gb genomic material (rounded here to 10 Gb to simplify the arithmetic). If DNA is obtained from these cells, the viral DNA represents approximately 1/10,000,000 of the total DNA obtained. If 50 b sequences are randomly generated from the sample, then 1 of every 10 million reads on average is viral DNA. Thus in this scenario the theoretical minimum of sequencing information to obtain a single viral sequence is 10 million reads. (As above, the size of the reads can determine the total amount of sequencing data required). If the sequences are 50 bases in length, then this is 500 Mb of sequencing information. If the sequences are 36 bases in length, then this is 360 Mb of sequencing data. If this single read is different from any 50 b stretch (if 50 b reads are used or from any 36 b stretch if 36 b reads are used) of sequence information in the human genome by 1 or more (depending on the set criteria) bases (substitutions, insertions or deletions in any position and in any combination), then the described method would identify it as unique to sample one (or non-host) and use it in conjunction with a search algorithm to find a known homologous sequence and a potential identity of the non-reference DNA.
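The arithmetic in this example can be reproduced with the short sketch below. The function names are hypothetical, and the calculation follows the approximation used in the text, in which the non-host share of the DNA is taken as the non-host genome size (scaled by the fraction of cells affected) divided by the host genomic material per cell.

```python
from fractions import Fraction


def expected_reads_per_hit(target_genome_bases, host_bases_per_cell,
                           affected_fraction=Fraction(1)):
    """Average number of random reads needed to obtain one non-host read."""
    share = (Fraction(target_genome_bases) * Fraction(affected_fraction)
             / Fraction(host_bases_per_cell))
    return 1 / share


def sequencing_megabases(n_reads, read_length):
    """Total sequencing data, in Mb, for n_reads reads of read_length bases."""
    return n_reads * read_length / Fraction(1_000_000)


# 10 kb virus, 10 Gb host material per cell (rounded), 10% of cells affected
reads = expected_reads_per_hit(10_000, 10_000_000_000, Fraction(1, 10))
```

Here `reads` evaluates to 10 million, which corresponds to 500 Mb at 50 bases per read and 360 Mb at 36 bases per read. With the unrounded 6.4 Gb figure and all cells affected, the same function gives 640,000 reads, consistent with the "approximately six orders of magnitude" estimate in the earlier viral example.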
Many types of data structures that provide efficient sequence lookup can be used, such as sorted arrays, suffix arrays, suffix trees, hash tables, any variation of the aforementioned structures, and so on. Additionally, combinations of these data structures, such as a combination of sorted arrays, hash tables, and suffix trees within a single conglomerated data structure, can be used. In one embodiment of a combined data structure, a hash table and a suffix tree can be used together. In this example, the prefix (the first m bases of the sequence) is stored in a hash table while the suffix is stored in a suffix array. Such a data structure allows for compact representation of sequencing reads, thereby increasing lookup speeds. The data structures can use sequences or subsequences of a sequence as the searchable keys and need only organize the searchable keys to allow for searching. Even though there are two sequence formats, the per-base qualities are not used as a searchable key and instead can be considered as data associated with the searchable keys.
In one example, the data structure can be a hash table that can be used in conjunction with genomic sequence data. The hash table can provide a way to determine the presence of a given sequence and, when a sequence is present, to retrieve the number of copies of that sequence detected in the sample and the associated per-base qualities of the sequence.
In one embodiment, the procedures “LOAD-SEQUENCES” and “LOAD-SEQUENCES-WITH-QUALITY” can load the nucleic acid sequences of the sample from a file into any of the data structures previously mentioned. The procedure LOAD-SEQUENCES can load the sequences of the sample without the per-base quality which can save memory space in the processor (which can also result in faster mapping and assembly than using the procedure LOAD-SEQUENCES-WITH-QUALITY), but can result in the loss of the ability to distinguish bad bases from good bases. The procedure LOAD-SEQUENCES-WITH-QUALITY can load sequences of the sample with their per-base qualities. Both procedures can store identical sequences only once, together with a copy number.
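One way to picture the combined structure is the following sketch, in which a Python dictionary plays the role of the hash table keyed on the first m bases and each entry holds a sorted list of the remaining suffixes, searched by bisection. The class name `PrefixSuffixIndex` is hypothetical, and the per-prefix sorted list is a simplification standing in for the suffix array mentioned above.

```python
from bisect import bisect_left


class PrefixSuffixIndex:
    """Hash table on the first m bases; sorted suffix list under each prefix."""

    def __init__(self, m):
        self.m = m
        self.buckets = {}  # prefix -> sorted list of distinct suffixes

    def add(self, seq):
        prefix, suffix = seq[:self.m], seq[self.m:]
        bucket = self.buckets.setdefault(prefix, [])
        i = bisect_left(bucket, suffix)
        if i == len(bucket) or bucket[i] != suffix:
            bucket.insert(i, suffix)  # keep the list sorted and duplicate-free

    def __contains__(self, seq):
        bucket = self.buckets.get(seq[:self.m])
        if bucket is None:
            return False
        suffix = seq[self.m:]
        i = bisect_left(bucket, suffix)
        return i < len(bucket) and bucket[i] == suffix
```

Because reads sharing a prefix store that prefix only once, the representation is more compact than storing every read in full.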
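A minimal sketch of the two loading procedures follows. The input file layout is not specified above, so the sketch assumes one read per line for the base-only format and a tab-separated read and quality string per line for the base-and-quality format; the function names and record fields are likewise hypothetical.

```python
def load_sequences(lines):
    """Base-only load: map each distinct read to its copy number."""
    table = {}
    for line in lines:
        seq = line.strip().upper()
        if seq:
            table[seq] = table.get(seq, 0) + 1
    return table


def load_sequences_with_quality(lines):
    """Base-and-quality load: also keep the quality string of every copy."""
    table = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        seq, qual = line.split("\t")  # assumed layout: READ<TAB>QUALITIES
        record = table.setdefault(seq.upper(), {"copies": 0, "qualities": []})
        record["copies"] += 1
        record["qualities"].append(qual)
    return table
```

Either function accepts an open text file or any iterable of lines. In both cases the read itself is the searchable key, and the qualities are carried only as associated data.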
In one embodiment, hashing can be used to implement a lookup table.
Another embodiment can implement a lookup table using sorted arrays. The array can have at least as many elements as there are keys, and each array element can contain a record similar to the example shown in
In yet another embodiment, a lookup table can be implemented using a binary-search-tree (“BST”). The BST can have nodes, where each node in the tree can contain a record such as the example of
In still another embodiment, a lookup table can be implemented using a suffix-tree or any of its variations such as a suffix-array. A suffix-tree can use a collection of edges on the path from the root node to one of the leaf nodes to represent a nucleic acid sequence. Other fields, such as copy number and per-base qualities that are associated with a sequence can be stored in the corresponding leaf node. A search can be performed in a suffix-tree, after the suffix-tree is built, by traversing the tree from the root to one of its leaf nodes.
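The root-to-leaf search described above can be illustrated with a simplified, uncompressed tree in which each edge carries one base. This is a sketch under stated assumptions rather than a full suffix-tree: the Node class and the insert and lookup functions are hypothetical, and only the copy number is stored at the node where a read ends.

```python
class Node:
    def __init__(self):
        self.children = {}    # base -> child Node (one edge per base)
        self.copy_number = 0  # non-zero only where a stored read ends


def insert(root, read):
    """Add one copy of read, creating edges along the path as needed."""
    node = root
    for base in read:
        node = node.children.setdefault(base, Node())
    node.copy_number += 1


def lookup(root, read):
    """Walk from the root along the read; return its copy number, or 0."""
    node = root
    for base in read:
        node = node.children.get(base)
        if node is None:
            return 0
    return node.copy_number


root = Node()
for read in ["ACGT", "ACGT", "ACGA"]:
    insert(root, read)
print(lookup(root, "ACGT"))  # 2
print(lookup(root, "ACGG"))  # 0
```

Per-base qualities could be attached to the same node in the way the copy number is here.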
As discussed previously, mapping can be used in the identification of non-reference nucleic acid sequence in both the one-sample sequencing approach and the two-sample sequencing approach. Mapping can be used for determining how much of the reference genome or gene can be present in sequences of the sample. In the process of mapping, the assumption can be made that a sequence of the sample with high similarity to a subsequence in the reference genome or gene, is expected to be the result of sequencing that part of the reference genome or gene. Mapping involves finding a set of sequences of the sample that have high sequence similarity to the subsequences in the reference genome or gene and depending on the number of sequences of the sample found for a subsequence, the presence of the subsequence in the sample can be confirmed or rejected. Different levels of confidence for subsequences of the reference genome or gene can be determined using the number of sequences of the sample found and the per-base quality of the sequences of the sample. Thus, a higher confidence can be associated with a higher number of sequences of the sample or a higher per-base quality of the sequences of the sample.
The mapping step can map all sequences of the sample or subsequences of sequences of the sample to a reference sequence set ref_set using up to k (0 to k) mismatches. Procedure A
As previously discussed with respect to
The method 800 can proceed to the operation of block 820 when there are unprocessed sequences of the sample and can obtain the next sequence of the sample. Once the sequence of the sample is obtained, in the operation of block 830, the sequence of the sample can be checked for a perfect match against either the reference genome sequence or any reference sequences or subsequences of known genomes or genes. If the sequence of the sample is a perfect match, it can be excluded in the operation of block 860 and need not be checked with one mismatch. In the event that the sequence of the sample is not a perfect match, it can be checked again in the operation of block 840 against the reference genome sequence or any reference sequences or subsequences of known genomes or genes, this time allowing one mismatch. If the sequence of the sample matches with one mismatch, it is excluded in the operation of block 860 and need not be checked for two mismatches. However, if the sequence of the sample does not match with one mismatch, it can be checked with two mismatches in the operation of block 850. As shown in
Table 2 provides pseudo code for performing the operation of excluding sequences of the sample from a given set of reference sequences. The procedure of Table 2 is similar to the method 800 of
The procedure of Table 2 can iterate k from 0 to K mismatches. After each iteration, any sequence of the sample that is mapped to a position in the reference sequence can be removed from the set of sequences of the sample R. The set of sequences of the sample can be mapped to the reference sequence using any number of mismatches: zero mismatches (an exact match), one mismatch, two mismatches, and so on. A set of all k-mismatch variants (V) can be created by taking a subsequence of size N+k from each position of the reference sequence. In this context, ref_{p,p+N−1+k} can refer to taking the subsequence of ref from position p to position p+N−1+k. N+k can be used instead of N to provide extra “suffix” bases for deletions, as all k-mismatch variants can have the same size N. Each k-mismatch variant, as well as its complement, can be checked in R. If a sequence matches in R, then the sequence of the sample can be mapped and can be removed from R.
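The iteration over k can be sketched as below. This is a simplified illustration and not the procedure of Table 2: it assumes that mismatches are substitutions only, so windows of size N are used rather than N+k (the extra bases are only needed for deletions), and it interprets the complement check as a reverse-complement check. All function and parameter names are hypothetical.

```python
from itertools import combinations, product

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]


def substitution_variants(window, k):
    """Yield every sequence differing from window at exactly k positions."""
    for positions in combinations(range(len(window)), k):
        choices = [[b for b in "ACGT" if b != window[p]] for p in positions]
        for bases in product(*choices):
            variant = list(window)
            for p, b in zip(positions, bases):
                variant[p] = b
            yield "".join(variant)


def exclude_mapped(ref, reads, n, max_k):
    """Return the reads that do not map to ref with 0..max_k substitutions."""
    remaining = set(reads)
    for k in range(max_k + 1):
        for p in range(len(ref) - n + 1):
            for variant in substitution_variants(ref[p:p + n], k):
                # A hit on either strand counts as mapped.
                remaining.discard(variant)
                remaining.discard(reverse_complement(variant))
    return remaining


ref = "ACGTACGTTG"
reads = {"ACGTA", "ACGAA", "CAACG", "GGGGG"}
print(sorted(exclude_mapped(ref, reads, n=5, max_k=0)))  # ['ACGAA', 'GGGGG']
print(sorted(exclude_mapped(ref, reads, n=5, max_k=1)))  # ['GGGGG']
```

Enumerating variants of the reference and probing the read set keeps each lookup a constant-time set operation, at the cost of a variant count that grows quickly with k.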
Table 3 provides pseudo code for performing the operation of removing sequences of the sample from R that are mapped to the set of reference sequences with up to K mismatches and returning the remaining sequences of the sample. The procedure of Table 3 can include at least the inputs of a set of reference nucleic acid sequences (ref_set), the set of sequences of the sample (R), the sequence size (N), the maximum number of mismatches (K), and so on. The two-sample sequencing approach can employ the method of replacing the set of reference nucleic acid sequences ref_set with the set of sequences from the control sample. The two-sample sequencing approach can be similar to the one-sample sequencing approach with respect to the rest of the mapping methodology. However, the analysis of the two-sample sequencing approach can be performed by using subsequences of sequences, where the subsequences can be of a size shorter than N.
Assembly, as previously discussed with respect to
A user can specify the minimum size of the overlapping subsequence (a value less than or equal to N−1). This value is specified through Pmax_shearing in Table 5, where the difference between N and Pmax_shearing is the minimum size of the overlapping subsequence. Thus, the process will first count the votes from sequences of the sample overlapping with N−1 bases and their 1, 2, 3, . . . K mismatch variants, then count the votes from sequences of the sample overlapping with N−2 bases and their 1, 2, 3, . . . K mismatch variants, and so on until N−Pmax_shearing. Next, a decision rule can be applied to determine the most likely next base.
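The vote counting can be illustrated for extension at the 3′ end of a template. This is a minimal sketch and not the procedure of Table 5: it counts exact overlaps only (mismatch variants are left out), weights each vote by the copy number of the read, and scans the reads directly instead of using a lookup table. The function name count_votes and its parameters are hypothetical.

```python
from collections import Counter


def count_votes(template, reads, n, max_shearing):
    """Tally next-base votes at the 3' end of template.

    reads maps each read of length n to its copy number. Overlaps shrink
    from n-1 bases down to n-max_shearing bases.
    """
    if not 1 <= max_shearing < n:
        raise ValueError("max_shearing must be between 1 and n-1")
    if len(template) < n - 1:
        raise ValueError("template is shorter than the largest overlap")
    votes = Counter()
    for shear in range(1, max_shearing + 1):
        overlap = n - shear
        tail = template[-overlap:]
        for read, copies in reads.items():
            if read.startswith(tail):
                # The base just past the overlap is this read's vote.
                votes[read[overlap]] += copies
    return votes


reads = {"GTACG": 3, "GTACT": 1, "TACGT": 1, "CCCCC": 4}
votes = count_votes("ACGTAC", reads, n=5, max_shearing=2)
print(dict(votes))  # {'G': 4, 'T': 1}
```

Here two reads overlap the template by N−1 = 4 bases and one read overlaps by N−2 = 3 bases, so the base G collects four votes and the base T collects one.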
In the methodology described in Table 5, we use a majority rule to determine the next base. In addition, we require that the winning base have at least a given share of the votes (Pmin_%: the minimum percentage of votes that the base with the majority of votes must have in order to be the next base). If none of the bases meets this requirement, then the extension process stops. Table 6 and Table 7 show our methodologies for finding the next base at the 5′ end of the template and at the 3′ end of the template. The procedure Complementary returns the complementary base of a given base (the complement of A is T and vice versa; the complement of G is C and vice versa). The procedure Reverse-Complementary returns the reverse complementary sequence of a given sequence (e.g., the reverse complement of ATGC is GCAT).
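The decision rule and the two helper procedures can be sketched as follows. The names complementary and reverse_complementary follow the procedure names above; next_base and min_percent are hypothetical, and the handling of ties (the first base with the highest count wins) is an assumption, since the text does not address ties.

```python
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def complementary(base):
    """Return the complementary base (A<->T, G<->C)."""
    return COMPLEMENT[base]


def reverse_complementary(seq):
    """Return the reverse complementary sequence of seq."""
    return "".join(complementary(b) for b in reversed(seq))


def next_base(votes, min_percent):
    """Return the base with the most votes if it holds at least
    min_percent of all votes; otherwise None (the extension stops)."""
    total = sum(votes.values())
    if total == 0:
        return None
    base, count = max(votes.items(), key=lambda item: item[1])
    if 100.0 * count / total >= min_percent:
        return base
    return None


print(reverse_complementary("ATGC"))                     # GCAT
print(next_base({"G": 4, "T": 1}, min_percent=60))       # G
print(next_base({"G": 2, "T": 2, "A": 1}, min_percent=60))  # None
```

In the last call the leading base holds only 40% of the votes, so no base qualifies and the extension would stop.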
Embodiments within the scope of the invention include computer systems configured to perform the methods disclosed herein.
Computer systems are generally well-known in the art. Those skilled in the art will appreciate that aspects of the invention can be practiced in computing environments or network computing environments with many types of computer system configurations, including personal computers, hand-held devices, multi-processor systems, microprocessor based or programmable consumer electronics, network PCs, minicomputers, mainframe computers, and the like. Various embodiments discussed herein, including embodiments involving a satellite or cable signal delivered to a set-top box, television system processor, or the like, as well as digital data signals delivered to some form of multimedia processing configuration, such as employed for IPTV, or other similar configurations can be considered as within a network computing environment. Further, wirelessly connected cell phones, a type of hand-held device, are considered as within a network computing environment. For example, cell phones include a processor, memory, display, and some form of wireless connection, whether digital or analog, and some form of input medium, such as keyboards, touch screens, etc.
Hand-held computing platforms can also include a video-on-demand type of selection ability. Examples of wireless connection technologies applicable in various mobile embodiments include, but are not limited to, radio frequency, AM, FM, cellular, television, satellite, microwave, WiFi, Bluetooth, infrared, and the like. Hand-held computing platforms do not necessarily require a wireless connection. For example, a hand-held device can access multimedia from some form of memory, which can include both integrated memory (e.g., RAM, Flash, etc.) as well as removable memory (e.g., optical storage media, memory sticks, flash memory cards, etc.) for playback on the device. Aspects of the invention can also be practiced in distributed computing environments where tasks are performed by local and remote processing devices that are linked (either by hardwired links, wireless links, or by a combination of hardwired or wireless links) through a communications network. In a distributed computing environment, program modules can be located in both local and remote memory storage devices.
In certain embodiments, computer systems include a processor configured to perform the methods disclosed herein. The computer system can be configured to identify sequences as described herein. The processor can be further configured to identify the sequences, compare sequences, process sequences, exclude sequences, and so on. In other variations, the computer systems described herein can comprise a processor configured to implement the processes using various data structures as described herein.
Embodiments within the scope of the present invention also include computer readable media for carrying or having computer-executable instructions or data structures stored thereon. Such computer-readable media can be any available media that can be accessed by a general purpose or special purpose computer. By way of example, and not limitation, such computer-readable media can comprise RAM, ROM, EEPROM, DVD, CD ROM or other optical disk storage, magnetic disk storage or other magnetic storage devices, or any other medium which can be used to carry or store desired program code means in the form of computer-executable instructions or data structures and which can be accessed by a general purpose or special purpose computer. When information is transferred or provided over a network or another communications link or connection (either hardwired, wireless, or a combination of hardwired or wireless) to a computer, the source of the information can be properly viewed as a computer-readable medium, such as a server, a storage medium, a processor, and the like.
Combinations of the above should also be included within the scope of computer-readable media. Computer-executable instructions comprise, for example, instructions and data which cause a general purpose computer, special purpose computer, or special purpose processing device to perform a certain function or group of functions.
The methods described herein can be used to identify sequences from a particular source. The sequences are identified rapidly by eliminating sequences that are already known, as described above. In various embodiments, sequences that correspond to a particular environment, organism, disease state, or condition can be identified. The presence of nucleic acid sequences can be identified in an individual organism or a group of organisms. The methods described herein can be used to identify any unknown sequences that are from a particular source. The presence of unknown genomic material in any sample, including environmental samples and suspicious samples, can be identified.
In one embodiment, the sequence can correspond to non-host nucleic acid sequences (i.e., sequences not normally associated with a host organism). The term “host” can refer to an animal or plant that has been or is infected by another organism. Examples of animals can include mammals, non-mammals, and invertebrates. Examples of mammals include humans, non-human primates, farm animals, sport animals, mice, and rats. Examples of plants include, but are not limited to, agricultural crops. In general, host organisms have been infected by a microorganism, such as a virus, bacterium, fungus, parasite, or protozoan. By eliminating sequences corresponding to the host organism, and then eliminating sequences corresponding to known microorganism sequences, new nucleic acid sequences that correspond to a previously unknown microorganism can be identified.
Alternatively, the methods described herein can be used to identify nucleic acid sequences associated with a disease or condition in an organism, such as an animal or plant. The condition may not even be infectious/contagious at the current time. Sequences from samples of healthy tissue and diseased tissue can both be obtained. Sequences corresponding to healthy tissue can then be identified and removed from the set of disease sequences as described herein. Alternatively, sequences corresponding to other known sequences can be eliminated. The remaining sequences correspond to diseased tissue. Exemplary embodiments of the disease or condition include cancer or a type of cancer, or an organ suitable for transplantation. Disease progression can also be monitored.
In various other embodiments, nucleic acid sequences associated with foreign genetic material in an ingestible product can be identified. For example, nucleic acid sequences that are indicators of foreign material in foodstuffs (e.g. food manufacturing and/or beverage processing), medicines, or vaccines can be determined. Foreign nucleic acid sequences in this context are identified as “non-reference” sequences, where the foodstuff, medicine, or vaccine is the “reference.” The suitability of material to leave quarantine or to be distributed can also be determined based on the amount of foreign nucleic acid material present in the sample. The presence of foreign genomic material in tissue culture used for the generation of therapeutics can be determined.
Other applications of the methodologies described herein can include, but are not limited to identifying genomic material associated with a particular type of cancer, such as the cause of the particular type of cancer, identifying genomic material associated with a chronic disease/condition where the condition may not be infectious and/or contagious at the current time, identifying genomic material associated with an “apparently” infectious disease, where there can be no cultivatable cause and so on. Other applications can include the determination of suitability of organs for transplant, the suitability of material to leave quarantine, suitability of material for distribution, the presence of novel nucleic acid sequence material in returning astronauts, the sterility of plant tissues. The methodologies described herein can be used to determine the presence of foreign genomic material in tissue culture used for generation of therapeutics, the presence of foreign genomic material in vaccine materials, the presence of foreign genomic material in food processing and the presence of foreign material in beverage manufacturing. Furthermore, the nature of an emerging disease can be determined using the methodologies described herein. Additionally, the methodologies described herein can be used to identify the presence of unknowns in any sample, the presence of unknowns in an environmental sample and the presence of unknowns in a suspicious sample.
It will be understood by those of skill in the art that the methods disclosed herein can be adapted to other types of biological sequences, such as protein sequences and carbohydrates. In the case of protein sequences, the alphabet includes 20 letters (corresponding to the naturally occurring amino acids) as opposed to the 4-letter alphabet of nucleic acid sequences. The algorithms and data structures disclosed herein can thus be adapted to the 20-letter alphabet for reference and non-reference sequences.
Although the present disclosure has been described with respect to particular apparatuses, configurations, components, systems and methods of operation, it will be appreciated by those of ordinary skill in the art upon reading this disclosure that certain changes or modifications to the embodiments and/or their operations, as described herein, can be made without departing from the spirit or scope of the disclosure. Accordingly, the proper scope of the disclosure is defined by the appended claims. The various embodiments, operations, components and configurations disclosed herein are generally exemplary rather than limiting in scope.
This patent application claims priority to U.S. provisional patent application No. 61/074,150, filed Jun. 20, 2008, and entitled “Method and Apparatus for Sequencing Data Samples”, the contents of which are incorporated herein in their entirety.
Number | Date | Country
---|---|---
61074150 | Jun 2008 | US