No government funding has been received for this work to date.
Compact Discs 1-7, listing “Signatures” [sequences] are attached to, and made a part of, this Application. The contents of each CD are listed below, just before Table A. The text Table of Contents is identical with the CD contents. The CDs are enclosed as a part of the application under 37 Code of Federal Regulations Section 1.58.
The computer programs and subroutines of the invention are set forth on the CD, which is enclosed as a part of the application under 37 Code of Federal Regulations Section 1.96.
Contained herein is material that is subject to international copyright protection. The copyright owner has no objection to the facsimile reproduction of the patent disclosure by any person as it appears in the Patent and Trademark Office patent files or records, but otherwise reserves all rights to the copyright whatsoever.
The detection and identification of microbes in complex samples from the environment, clinical specimens, food, etc. is an important and challenging problem. Naturally occurring causative agents include those responsible for malaria1, infectious and food-borne bacterial diarrhea2, and dengue fever3. The causative agents of such illnesses are extremely diverse in that they include bacteria, eukaryotic organisms and viruses4. Moreover, a considerable number of new or modified agents appear every year (SARS5, West Nile6, Monkeypox7, to name just a few). The threat of deliberate biological terrorism is also of concern8.
In addition to detecting known organisms, there is enormous value in the identification of previously unknown microorganisms responsible for environmentally and clinically important processes. The common process of identifying previously unknown human pathogens (microbes, viruses or their variants/modifications), for example, is uncertain and time consuming4. Often there are many weeks, months, or even years between the observation of initial infection and final identification of the causative agent. In addition to that, there are many microbes and viruses present in the human body, which do not represent an immediate danger to its physical condition, but may cause serious long-term consequences. Helicobacter pylori9, found to be the main cause of stomach ulcers, and papillomavirus10, many types of which cause benign skin tumors (warts) in their natural hosts, are examples.
Recent progress in genomics has led to new molecular approaches. Identification of bacteria, viruses, and rickettsia is increasingly based on the use of nucleic acid technologies4. In fact, the most detailed way to characterize a microbe or virus is sequencing and assembly of its complete genome. Several hundred complete genomes have become recently available and many more sequencing projects are in progress. Knowledge of complete genome sequences makes it possible to find unique “genetic signatures” of organisms. These signatures are employed with such approaches as polymerase chain reaction (PCR)11 and microarrays of cDNA12 and oligonucleotides13 for rapid determination of the presence of known organisms.
However, it is important to stress that many identification techniques require extraction of DNA of the organism of interest from complex samples (natural waters, rumen contents, food, blood, tissue, etc.), for analysis14,15. This is not a trivial task, especially for viruses. For nucleic acid based technologies, the discrimination of a single organism/pathogen's DNA in the presence of background (host) DNA is also extremely difficult, because the concentration of other DNA is usually orders of magnitude higher than that of the organism/pathogen. A possible solution to this problem is to use PCR primers recognizing sequences which are uniquely present in the organism of interest and absent in other genomes16-18. In this case, PCR will exponentially amplify and effectively purify DNA of the organism of interest but not the other DNA, which may be present. It is also clear, that only nucleic acid sequences present in the organism of interest and absent in the background DNA can be used as potential specific identification signatures. This implies either extensive experimental effort (the current approach) or the extremely difficult computational task of finding such organism-unique and “background-blind” subsequences in the genome of each organism to be detected.
So far only a small fraction of microbial/viral genomes have been sequenced; however, highly conserved genes, such as 16S ribosomal RNA (rRNA)18-23, can be employed to distinguish a microbial species or group of species. The nucleotide sequence of 16S rRNA is regarded as the most useful marker for genetic characterization of bacteria. This molecule has been characterized in many thousands of bacteria, as tabulated at the Ribosomal Database Project (RDP) site17-24,25, which now includes over 72,600 aligned and annotated bacterial 16S rRNA sequences. These sequences can be used to construct a phylogenetic tree which positions each strain relative to every other. Organisms that are neighbors on this tree are genetically most similar and likely to share many biochemical properties.
The straightforward sequence-based methods for pathogen discovery require partial sequencing and comparison of the results with known sequences present in existing databases (GenBank, DDBJ, EMBL, etc.). Unfortunately, such an approach cannot be used to identify new unknown organisms, and host/background DNA can cause considerable difficulty.
Another approach to identifying known organisms is to find unique subsequences, present in this organism only (“signature” sequences), and use them as microarray probes or PCR primers. However, this requires not only knowledge of the sequence of the genome of interest, but also the sequences of all other genomes which can be present in the sample. This method is restricted by the incompleteness of the list of sequenced genomes. Limitations of present technologies, such as melting temperature variations, possible hybridization with mismatches, primer-primer interactions, etc., also make it a nontrivial computational task. It becomes even more computationally difficult if the organism of interest has to be identified in the presence of much longer host DNA or in the presence of numerous other organisms (for example, in environmental samples).
More than a thousand complete genomes, including hundreds of non-viral ones, have become available in the last several years. Many additional sequencing projects are in progress. The number of completed genomes, however, is so small compared to the number of extant species that true comparative genomics is still in its infancy. A relevant question thus arises as to whether there is sufficient material to look at from a statistical viewpoint26.
We have developed methods of identifying highly-specific PCR primers and DNA probes for species of interest in the presence of large backgrounds of other (e.g. human) DNA. This technology is useful in developing nucleic acid-based diagnostics, forensic assays, veterinary and agricultural assays. It also has utility in rapidly identifying unknown pathogens or biowarfare agents or other microorganisms.
According to the invention, a process for identifying whether any parasite or other microorganism is present in a given host comprises: a. scanning for non-host signatures; b. scanning for one-error-removed non-host signatures; c. scanning for N-error-removed non-host signatures, where N is selected to give the desired statistical certainty of the presence or absence of any parasite in the host. Specific identifiers (“signatures”) are provided herein for many pathogens and other microorganisms.
More than a thousand complete genomes, including hundreds of non-viral ones, have become available in the last several years. Many additional sequencing projects are in progress. The number of completed genomes, however, is so small compared to the number of extant species that true comparative genomics is still in its infancy. A relevant question thus arises as to whether there is sufficient material to look at from a statistical viewpoint31.
The starting material can be a “host” suspected to comprise microorganisms to be detected. Humans, rats, mice, and other organisms are suitable hosts.
Using the “signatures”, techniques and algorithms provided herein, the presence of virtually any microorganism can be detected in the host.
A high-speed computer is required for the large computational steps of applying the algorithms of the invention.
The shape of the frequency distribution for certain short subsequences, 2-4-mers [4-8] and 8-9-mers [9, 10], has been proposed as a way to decide which microbial genome is being considered, based on a given random piece of the genome or on the entire genome. Algorithmically, this type of analysis employs a repeated search for short patterns in genomes, also known as the exact string matching problem.
Exact string matching is a well-developed area in computer science. The traditional definition of this problem is the following: given a string P of size n called the pattern and a longer string T of size m called the text, the exact matching problem is to find all occurrences, if any, of pattern P in text T. Many algorithms have been developed to solve this problem, often relying on precomputation of either the pattern or the text.
Some algorithms, such as Rabin-Karp [13], Boyer-Moore [14], and Knuth-Morris-Pratt [15], apply precomputation to the pattern. The memory usage of such approaches is modest and, if n<<m, the running time is proportional to the length of the text or sequence, O(m). Other approaches are based on the idea of precomputing the text [16-18]. Such algorithms are more memory-expensive, and the time required for precomputing is O(m); in return, string matching can then be done extremely fast, in time that depends only on the pattern size: O(n). It is necessary to note that there is some variation in the problem definition: in some cases one needs to find any occurrence instead of all occurrences of the pattern in the text. Nonetheless, the problem has the same name in most of the literature.
To make an optimal choice as to which algorithm is better, one has to take into account additional parameters of the problem under consideration. In fact, we rarely match just one pattern against one sequence. In many cases we have a finite set of k sequences and a finite set of l patterns and need to perform kl searches to find the occurrence of all patterns in all sequences.
Another important parameter, which has to be considered, is the average ratio of the pattern length to the text size r=n/m. Let us provide a simple example. Assume we need to compare the Knuth-Morris-Pratt and the Suffix Tree algorithms. Taking into account that the precomputation time will be O(nl) for the Knuth-Morris-Pratt algorithm and O(km) for the Suffix Tree algorithm, we find the running time estimations are O(nl+lkm) for Knuth-Morris-Pratt and O(km+knl) for the Suffix Tree. Now, if usage of memory (which in some cases can be critical for Suffix Trees) is not a concern, we can easily estimate which algorithm would be preferable based on the parameters of a given real problem. In typical situations, the estimates above are expected for these two classes of algorithms. In general, one needs linear time for precomputing, which allows performing the search operation in linear time.
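The comparison above can be made concrete with a small calculation. The following Python sketch simply evaluates the two estimates quoted in the text, O(nl+lkm) and O(km+knl), for given problem parameters; the function names are hypothetical, and constant factors and memory use are ignored, so the result is only a rough guide.

```python
def pattern_precompute_cost(n, m, k, l):
    """Estimate for Knuth-Morris-Pratt style search: n*l precomputation + l*k*m search."""
    return n * l + l * k * m


def text_precompute_cost(n, m, k, l):
    """Estimate for Suffix Tree style search: k*m precomputation + k*n*l search."""
    return k * m + k * n * l


def preferred_class(n, m, k, l):
    """Return which class of algorithm has the smaller estimated operation count.

    n: pattern length, m: text length, k: number of texts, l: number of patterns.
    Ties are resolved in favor of pattern precomputation (lower memory use).
    """
    if text_precompute_cost(n, m, k, l) < pattern_precompute_cost(n, m, k, l):
        return "suffix tree"
    return "Knuth-Morris-Pratt"


if __name__ == "__main__":
    # Many short patterns against a few long texts favors precomputing the text.
    print(preferred_class(n=20, m=4_000_000, k=10, l=1000))
    # A single pattern favors precomputing the pattern.
    print(preferred_class(n=5, m=100, k=3, l=1))
```

With many patterns (large l) the search term lkm dominates the first estimate, which is why precomputing the text becomes attractive when memory permits.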
2.2 Calculation of the Presence of all Possible n-Mers in a Given Text
When applying this problem to the area of our interest, biology, four characteristics emerge:
To take advantage of the specifics of our problem, in particular the fact that we can perform all calculations simultaneously for all n-mers for each given value of n, which is relatively small, we decided to employ an approach similar to the one used in the well-known counting-sort algorithm (for an example see [12]). The basic idea is to associate each of the 4^n n-mers with a particular element in a counting array, A, and to define a procedure that converts the n-mer character sequence to the index of that element. From this point of view, such an array can be treated as the extreme case of a hash table, and the procedure that converts the sequence to the index can be treated as the hash function. Let us illustrate with a simple example. Assume we need to calculate how many different n-mers are present in the text T.
In this algorithm, T(j1,j2) stands for a substring of the string T, starting in position j1 and ending in position j2. The function CONVERT_TO_INTEGER_VALUE(s) is needed to convert string s of length n, which in our case is created using only a 4-character alphabet, to a unique integer value corresponding to an index in the array A. In fact, if we assign to each character of our alphabet a value from 0 to 3, each string can be interpreted as an integer value in a base-4 number system. Using a naïve algorithm, this function can be implemented with run time O(n). We can, however, utilize the fact that only 2 bits are needed for each character and that we read the text T sequentially, so that each string T(j1,j2) already contains n−1 elements of the next one, T(j1+1, j2+1). The function CONVERT_TO_INTEGER_VALUE( ) can therefore be implemented using simple binary shift operations, reducing its execution time to O(1).
Thus the overall running time estimation for Algorithm 1 is O(4^n+m). Practically, if both text T and the counting array A can be placed in memory, it takes only seconds on a regular PC (1 GHz clock) to calculate, for example, how many 15-mers are present in the DNA sequence of the Mycobacterium tuberculosis H37Rv genome (NC_000962, which has a length of 4,411,529 bp) as well as in its complementary sequence.
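The counting-array idea described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the listing referred to as Algorithm 1: the function name, the coding A=0, C=1, G=2, T=3, and the decision to restart the window at any other character (for example N) are choices made for the example.

```python
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}  # assumed 2-bit coding of the alphabet


def count_distinct_nmers(text, n):
    """Count how many different n-mers occur in text using a 4**n-bit presence array."""
    mask = (1 << (2 * n)) - 1                 # keeps only the last n characters (2n bits)
    present = bytearray((4 ** n + 7) // 8)    # one bit per possible n-mer
    index = 0                                 # base-4 value of the current window
    valid = 0                                 # consecutive valid characters seen so far
    distinct = 0
    for ch in text:
        code = CODE.get(ch)
        if code is None:                      # assumption: ambiguous bases reset the window
            index = valid = 0
            continue
        # O(1) update: shift out the oldest character, shift in the new one.
        index = ((index << 2) | code) & mask
        valid += 1
        if valid >= n:
            byte, bit = divmod(index, 8)
            if not present[byte] & (1 << bit):
                present[byte] |= 1 << bit
                distinct += 1
    return distinct


if __name__ == "__main__":
    print(count_distinct_nmers("ACGTACGT", 4))  # ACGT, CGTA, GTAC, TACG
```

The shift-and-mask update is what replaces the O(n) conversion of each window with an O(1) step.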
The biggest concern with this algorithm is memory usage, in particular the size of the counting array. The size of A can be defined based on the value of n and the characteristics of the problem. If we are interested only in the presence/absence of n-mers in a text, we need only 1 bit per element, where A [ . . . ]=1 indicates that the n-mer is present and A [ . . . ]=0 indicates its absence. In the case of 17-mers, the necessary amount of memory would be 4^17=17,179,869,184 bits=2.15 GB. A PC or workstation can meet such a requirement. It is, however, limiting for larger n-mers, say up to 20 in length.
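The arithmetic behind this memory estimate is easy to reproduce. The helper below is a hypothetical illustration (its name and the `bits_per_element` parameter are not from the original); it returns the size in bytes of an array with one element per possible n-mer.

```python
def counting_array_bytes(n, bits_per_element=1):
    """Bytes needed for an array holding bits_per_element bits for each of the 4**n n-mers."""
    total_bits = (4 ** n) * bits_per_element
    return (total_bits + 7) // 8  # round up to whole bytes


if __name__ == "__main__":
    for n in (15, 17, 20):
        size = counting_array_bytes(n)
        print(f"n={n}: {size:,} bytes ({size / 1e9:.2f} GB)")
```

Each additional nucleotide multiplies the array size by four, which is why a presence array that is comfortable at n=17 becomes impractical at n=20.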
If the required memory is not available or its use is inconvenient, one can decompose the counting array A. The idea is a simple divide-and-conquer strategy: we divide our n-mer into two parts, a prefix of size n1 and a suffix of size n−n1. One then creates the array A to track the appearance of (n−n1)-mers in the suffixes and summarizes the results over all prefixes. Such an implementation is given in Algorithm 2.
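One way to realize this prefix/suffix decomposition is sketched below in Python. It is an illustration under assumptions, not the original Algorithm 2: the function name is hypothetical, one byte (rather than one bit) is used per suffix for readability, and the text is rescanned once for every prefix, so this sketch makes 4^n1 passes over the text in exchange for an array that is 4^n1 times smaller.

```python
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}  # assumed 2-bit coding of the alphabet


def count_distinct_nmers_split(text, n, n1):
    """Count distinct n-mers with one pass over text per n1-letter prefix."""
    if not 0 <= n1 < n:
        raise ValueError("n1 must satisfy 0 <= n1 < n")
    suffix_bits = 2 * (n - n1)
    suffix_mask = (1 << suffix_bits) - 1
    full_mask = (1 << (2 * n)) - 1
    total = 0
    for prefix in range(4 ** n1):
        seen = bytearray(4 ** (n - n1))       # small array: suffixes only
        index = valid = 0
        for ch in text:
            code = CODE.get(ch)
            if code is None:                  # assumption: ambiguous bases reset the window
                index = valid = 0
                continue
            index = ((index << 2) | code) & full_mask
            valid += 1
            # Record the suffix only when the window starts with the current prefix.
            if valid >= n and (index >> suffix_bits) == prefix:
                seen[index & suffix_mask] = 1
        total += sum(seen)                    # summarize results for this prefix
    return total


if __name__ == "__main__":
    print(count_distinct_nmers_split("ACGTACGT", 4, 1))
```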
The operation count of Algorithm 2 is O(4n
For our problem area, another interesting task was to find how many n-mers are present in only one of two given texts T1 and T2, and how large is the number of n-mers that belong to both texts. There are at least three different ways to handle such a problem which were implemented and tested in our calculations:
Our approach, similar to that described in Algorithms 1 and 2 can also be used to calculate the actual number of n-mers present in the text. A rough or naïve algorithm can be created by employing the counting array A to store the number of appearances of each n-mer:
As a result, Algorithm 3, with running time O(4^n+m), produces array A, which in contrast to the cases discussed earlier contains integers recording the number of occurrences of each n-mer in the text or sequence. The required memory in this case is much larger; instead of one bit we will need from 1 to 8 bytes to keep each integer value. For example, using 4 bytes for each integer value, the necessary memory for the case of 14-mers will be 1.07 GB.
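A minimal Python sketch of such an occurrence-counting array is given below. It is not the original Algorithm 3; the names `encode` and `count_nmers`, the A/C/G/T coding, and the use of the standard-library `array` type with unsigned integers are assumptions made for illustration.

```python
from array import array

CODE = {"A": 0, "C": 1, "G": 2, "T": 3}  # assumed 2-bit coding of the alphabet


def encode(nmer):
    """Convert an n-mer string to its base-4 integer index."""
    index = 0
    for ch in nmer:
        index = (index << 2) | CODE[ch]
    return index


def count_nmers(text, n):
    """Return an array A of length 4**n where A[i] is the number of occurrences of n-mer i."""
    counts = array("I", [0]) * (4 ** n)       # one unsigned integer per possible n-mer
    mask = (1 << (2 * n)) - 1
    index = valid = 0
    for ch in text:
        code = CODE.get(ch)
        if code is None:                      # assumption: ambiguous bases reset the window
            index = valid = 0
            continue
        index = ((index << 2) | code) & mask
        valid += 1
        if valid >= n:
            counts[index] += 1
    return counts


if __name__ == "__main__":
    A = count_nmers("ACGTACGT", 4)
    print(A[encode("ACGT")], sum(A))
```

The loop is identical to the presence/absence case; only the element type of the array changes, which is where the extra memory goes.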
These algorithms are quite efficient but are extremely memory intensive. The problem is complicated when we need to store the n-mer calculations in a file, or if we want to do a comparative analysis of the presence/absence of n-mers in two or more than two sequences (the arrays of all the sequences need to be stored in memory). These algorithms ignore the fact that the n-mer array for n>11 is sparse and therefore can be compressed to save memory and hard disk space. Therefore a more careful look at the structure of the original data and the produced results is required.
Before implementing our methods, we needed to gather a complete set of current data. The National Center for Biotechnology Information (NCBI) [25] is an excellent source for sequenced genomes. Complete genomic sequences of over 1000 viruses and over 100 microbes can be found in the NCBI database. The genomes represent both completely sequenced organisms and those for which sequencing is in progress. All three main domains of life—bacteria, archaea, and eukaryota—are represented, as well as many viruses and organelles.
With newly sequenced genomes being added frequently, the first problem we faced was how to download the genomic sequences in the format that we need and keep our database up to date. FTP programs cannot be used to download the sequences, for three reasons. First, tracking down newly added sequences is difficult, and downloading all the genomic sequences whenever a new sequence is added would be a waste of time. Second, the NCBI database provides more information about the genomic sequences than we need; we want to download only the information pertinent to us and store it in a format which can be easily read by the analysis programs. Third, there are different sub-categories in the classification hierarchy (for example, viruses can be classified as single-stranded DNA viruses, double-stranded DNA viruses, single-stranded RNA viruses, etc.). This information is very useful for our analysis but is not available in all FTP files; it is, however, available from the web site. To overcome these issues we developed a program in Java for data mining. It checks for newly sequenced genomes and downloads, in our format, the genomic sequences not present in our local database.
The program can efficiently download all viral genomes (>1000) in 5-6 hrs on a normal computer with a fast Internet connection.
To make the previous algorithms more efficient it is necessary to take a more careful look at the structure of the original data and the produced results. In Tables 3.1 and 3.2 we show how many n-mers appear only once, more than once, and never in the genome of Mycobacterium tuberculosis H37Rv and in the human genome. As one can see, if the number of possible n-mers, 4^n, is smaller than the text size m, practically all n-mers are found to be present. However, for 4^n>>m we face a very different situation: the majority of n-mers are simply absent from the text. Furthermore, the number of n-mers present in that text just once is much larger than the number of n-mers which are present more than once. It was already mentioned that a concern of our approach is the memory required for n>12. However, Tables 3.1 and 3.2 lead us to conclude that for such n the majority of array A will be occupied by zeros (that is, the array is sparse). In fact, the number of different n-mers cannot be larger than m−n. We also cannot expect the number of n-mers which appear more than once to be larger than (m−n)/2. Thus, to conveniently keep information about all present n-mers we need two arrays: one (R) to keep the “sequence” of each n-mer and another (Q) to keep the integer number of its appearances. Of course it would be even easier to place both the n-mer and the number of its appearances in one data structure. The estimation of the memory usage is straightforward. For the array Q it is r_integer·(m−n)/2 ≅ r_integer·m/2, where r_integer is the size reserved for the integer variable. For the array R it is r_n·(m−n)/2 ≅ r_n·m/2, where r_n is the size reserved for the n-mer. Given the parameters of our problem area such a size is manageable. For example, using 2 bytes for integers and 2 bits for all 16-mers in a sequence on the order of 3,000,000,000 (range of the human genome), we will need in the worst case only 4.5 GB of RAM.
In practice, as seen from Tables 3.1 and 3.2, the number of n-mers present in real genomes is much smaller than the hypothetical worst case. For larger genomes, such as those of wheat or amphibians, the above formulas provide a guide for calculating the memory required.
The following algorithm can be introduced to generate arrays Q and R:
Algorithm 4 produces two synchronized arrays Q and R of dynamically defined size sum. The total run time estimation of Algorithm 4 is O(4^n+m+4^n)=O(4^n+m). Because the array R of present n-mers is created in sorted order, checking the presence of any n-mer in such an array takes logarithmic time O(log(sum)), so that the worst case would be O(log(m)).
It is important to mention that because these two arrays (R and Q) represent the set of all n-mers present in the original sequence, it is reasonable to place them in one data structure in which to store such data and use it for future analysis. Traditional set operations such as union, intersection, and subtraction can be introduced and implemented for such objects with linear run time O(sum1+sum2), where sum1 and sum2 are the sizes of the arrays in the two data structures under consideration.
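The construction of the two synchronized arrays, and the logarithmic-time lookup that their sorted order allows, can be sketched in Python as follows. This is an illustration under assumptions rather than the original Algorithm 4: the function names are hypothetical, the dense counting array is built first and then swept once (matching the O(4^n+m+4^n) estimate), and n-mers are stored in R as their base-4 integer indices.

```python
from array import array
from bisect import bisect_left

CODE = {"A": 0, "C": 1, "G": 2, "T": 3}  # assumed 2-bit coding of the alphabet


def build_rq(text, n):
    """Return (R, Q): sorted indices of the n-mers present in text and their counts."""
    counts = array("I", [0]) * (4 ** n)       # dense counting array A
    mask = (1 << (2 * n)) - 1
    index = valid = 0
    for ch in text:                           # O(m) pass over the text
        code = CODE.get(ch)
        if code is None:                      # assumption: ambiguous bases reset the window
            index = valid = 0
            continue
        index = ((index << 2) | code) & mask
        valid += 1
        if valid >= n:
            counts[index] += 1
    R = array("Q")                            # n-mer indices, ascending by construction
    Q = array("I")                            # matching occurrence counts
    for idx, count in enumerate(counts):      # O(4**n) sweep keeps only nonzero entries
        if count:
            R.append(idx)
            Q.append(count)
    return R, Q


def occurrences(R, Q, idx):
    """Binary search for n-mer index idx in R; return its count, or 0 if absent."""
    pos = bisect_left(R, idx)
    if pos < len(R) and R[pos] == idx:
        return Q[pos]
    return 0


if __name__ == "__main__":
    R, Q = build_rq("ACGTACGT", 4)
    print(list(R), list(Q))
```

Because R is emitted in index order, no separate sorting step is needed before the binary search.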
In the new RQ set, many useful functions have been integrated for enhanced analysis of the subsequences. The new functions include Union, Subtraction and Intersection among others. Each of these algebraic operations has a linear running time taking less than an hour on a standard PC.
Union—This function calculates the union of the subsequences present in all the microbial sequences, which gives us the ability to analyze the presence and unanticipated absence of subsequences across all the microbial genomes. It was observed that more than 8000 subsequences are not present in any of the microbial genomes.
These 8000 subsequences showed distinctive similarity among themselves, which leads us to the conclusion that the absence of these subsequences is not random and that further analysis needs to be done.
(R1, Q1) and (R2, Q2) are the RQ sets of two organisms on which the Union operation is to be performed, and n1 and n2 are the numbers of different subsequences present in the two RQ sets, respectively. (UnionR, UnionQ) is the temporary RQ set used to store the results of the Union operation. In the RQ sets the sequences are ordered according to the unique index, which is calculated from the sequence. In the Union function both RQ sets are traversed, and the sequences that are present in either or both of the RQ sets are stored in the UnionRQ set. Using the proposed algorithm and the RQ sets, the complete Union of the ten largest microbial genomes, as shown in Table 3.3, can be obtained in a relatively short period. The operation count of the algorithm is O(n1+n2), where n1 and n2 are the lengths of the two RQ sets. For n-mer sizes less than 9, the calculations can be performed in a negligible amount of time, and for higher n-mers it takes a reasonable amount of time on a standard PC, as can be seen in Table 1.
[Table: Union operation on the ten largest microbial genomes. Rows: Bradyrhizobium japonicum strain; Streptomyces coelicolor; Mesorhizobium loti; Nostoc sp. PCC; Pseudomonas aeruginosa; Pseudomonas putida; Xanthomonas axonopodis; Shewanella oneidensis MR-1; Escherichia coli K-12; Leptospira interrogans.]
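The traversal described for the Union operation amounts to a merge of two sorted arrays. A minimal Python sketch is given below; the function name is hypothetical, and the choice to add the two counts when a subsequence is present in both sets is an assumption, since the text does not specify how the Q values are combined.

```python
def rq_union(r1, q1, r2, q2):
    """Merge two RQ sets (sorted index lists r, matching counts q) in O(n1 + n2)."""
    union_r, union_q = [], []
    i = j = 0
    while i < len(r1) and j < len(r2):
        if r1[i] < r2[j]:                     # present only in the first set so far
            union_r.append(r1[i])
            union_q.append(q1[i])
            i += 1
        elif r1[i] > r2[j]:                   # present only in the second set so far
            union_r.append(r2[j])
            union_q.append(q2[j])
            j += 1
        else:                                 # present in both: assumed rule, add counts
            union_r.append(r1[i])
            union_q.append(q1[i] + q2[j])
            i += 1
            j += 1
    # Copy whatever remains of either set.
    union_r.extend(r1[i:])
    union_q.extend(q1[i:])
    union_r.extend(r2[j:])
    union_q.extend(q2[j:])
    return union_r, union_q


if __name__ == "__main__":
    print(rq_union([1, 3, 5], [1, 1, 2], [3, 4], [5, 1]))
```

The union over many genomes can then be obtained by folding this pairwise operation over the list of RQ sets.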
Subtraction—The function that isolates the subsequences present in one particular genome and absent in all other genomes presents an excellent opportunity to find the subsequences unique to a particular organism and to analyze their significance.
The Subtraction function incorporated in the RQ set enables us to subtract, from the subsequences present in one genome, those present in other genomes.
Just as with the Union function, in this algorithm (R1, Q1) and (R2, Q2) are the two RQ sets that are passed to the function as inputs, and n1 and n2 are the numbers of different subsequences present in the two RQ sets, respectively. (SubtractionR, SubtractionQ) is the temporary RQ set used to store the result of (R2, Q2) being subtracted from (R1, Q1). The operation count of the Subtraction algorithm is O(n1), where n1 is the number of different subsequences in the first organism, that is, the length of the (R1, Q1) set. In the worst case the operation count would be O(4^n), where n is the length of the subsequence. In actuality, the number of subsequences present is much smaller than the total number of possible subsequences for large subsequence lengths (>13).
The program, when tested once again on the ten largest microbial genomes for complete Subtraction with different subsequence sizes (8-15), gave the following result:
[Table: Subtraction operation on the ten largest microbial genomes, subsequence sizes 8-15. Rows: Bradyrhizobium japonicum; Streptomyces coelicolor; Mesorhizobium loti; Nostoc sp. PCC 7120; Pseudomonas aeruginosa; Pseudomonas putida; Xanthomonas axonopodis; Shewanella oneidensis; Escherichia coli K-12; Leptospira interrogans.]
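A Subtraction over two sorted RQ sets can be sketched in Python as below. The function name is hypothetical, and this particular sketch advances a cursor through both arrays, so it touches at most n1+n2 elements; it is one possible realization, not necessarily the one used in the original program.

```python
def rq_subtract(r1, q1, r2, q2):
    """Return the RQ set of subsequences present in (r1, q1) but absent from (r2, q2).

    q2 is accepted for symmetry with the other set operations but is not needed.
    """
    sub_r, sub_q = [], []
    j = 0
    for i, key in enumerate(r1):
        # Advance the cursor in the second set until it reaches or passes key.
        while j < len(r2) and r2[j] < key:
            j += 1
        if j == len(r2) or r2[j] != key:      # key does not occur in the second set
            sub_r.append(key)
            sub_q.append(q1[i])
    return sub_r, sub_q


if __name__ == "__main__":
    print(rq_subtract([1, 3, 5], [1, 1, 2], [3, 4], [5, 1]))
```

Subsequences unique to one genome can be obtained by subtracting, in turn, the RQ set of every other genome (or their union) from the RQ set of the genome of interest.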
Intersection—Another extremely useful function incorporated in the RQ set operations is the Intersection function. It is of great importance to analyze the subsequences which are common to two genomes or to a set of genomes. The Intersection operation enables us to do exactly that.
The result of the Intersection operation performed on all of the available microbial genomes shows that the only subsequences that are universally present occur in the rRNA sequences. This result is in agreement with the fact that the ribosomal RNA genes are the most conserved regions of microbial genomes in the course of evolution.
Here (R1, Q1) and (R2, Q2) are the two RQ sets, and n1 and n2 are the numbers of different subsequences present in the two RQ sets, respectively. (IntersectionR, IntersectionQ) is a temporary RQ set used to store the intersection of (R1, Q1) and (R2, Q2). Like the other operations, this algorithm runs in linear time. Just as we did for the Union and Subtraction operations, we performed the Intersection operation on the ten largest microbial genomes. The running time for 8-mers was negligible (<1 sec) and the running time for 15-mers was less than a minute, as shown in Table 3.5, on a standard PC having 1 GB of RAM.
[Table: Intersection operation on the ten largest microbial genomes. Rows: Bradyrhizobium japonicum; Streptomyces coelicolor; Mesorhizobium loti; Nostoc sp. PCC 7120; Pseudomonas aeruginosa; Pseudomonas putida; Xanthomonas axonopodis; Shewanella oneidensis; Escherichia coli K-12; Leptospira interrogans.]
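The Intersection of two sorted RQ sets can be sketched in the same style. The function name is hypothetical, and keeping the smaller of the two counts for a shared subsequence is an assumption made for illustration; the text does not state how the Q values are combined.

```python
def rq_intersect(r1, q1, r2, q2):
    """Return the RQ set of subsequences present in both inputs, in O(n1 + n2)."""
    inter_r, inter_q = [], []
    i = j = 0
    while i < len(r1) and j < len(r2):
        if r1[i] < r2[j]:
            i += 1
        elif r1[i] > r2[j]:
            j += 1
        else:                                 # shared subsequence
            inter_r.append(r1[i])
            inter_q.append(min(q1[i], q2[j])) # assumed rule for combining counts
            i += 1
            j += 1
    return inter_r, inter_q


if __name__ == "__main__":
    print(rq_intersect([1, 3, 5], [1, 1, 2], [3, 5, 7], [5, 1, 1]))
```

Applying this pairwise operation repeatedly across a list of genomes leaves only the subsequences common to all of them.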
For our analysis we downloaded all genomes currently available from the NCBI site [25] including microbial (107), viral (954), and multicellular organisms (5) genomes, with sizes ranging from 0.32 Kb (Cereal yellow dwarf virus-RPV satellite RNA NC_003533) to 2.87 Gb (human).
For our computations with multicellular organisms, microbes, and viruses, we used both complementary strands for computational convenience, because that is how the sequence is observed with present technology (PCR, cDNA microarrays, etc.). This trivially increases the amount of analyzed material by a factor of two. This apparent redundancy does not affect the statistical outcome and allows us to simplify the analysis.
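Treating both complementary strands can be illustrated with a short Python sketch. The helper names are hypothetical, and the example uses a plain set of strings for clarity rather than the compact arrays discussed elsewhere in this description.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the complementary strand of seq, read in the 5' to 3' direction."""
    return seq.translate(_COMPLEMENT)[::-1]


def nmers_both_strands(seq, n):
    """Collect every n-mer occurring on either strand of seq."""
    found = set()
    for strand in (seq, reverse_complement(seq)):
        for i in range(len(strand) - n + 1):
            found.add(strand[i:i + n])
    return found


if __name__ == "__main__":
    print(reverse_complement("AACG"))
    print(sorted(nmers_both_strands("AACG", 2)))
```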
In our study, we examined the presence/absence of short subsequences in more than one genome simultaneously. We also compared the frequencies of presence/absence of each n-mer in each of the genomes for 5≦n≦20. To our knowledge, no such studies have been done for n>11 due to the rapid growth of computational complexity with traditional algorithms. We performed such analyses separately in four different sets of genomes: RNA based viruses (542 genomes), DNA based viruses (412 genomes), Microorganisms (107 genomes) and Human. In each group the number of simultaneously present 5-18-mers was calculated for each pair of genomes. The human genome contains 24 chromosomes, for which the numbers of simultaneously present 7-20-mers were calculated for each pair of chromosomes.
To be able to perform calculations for longer (n>11) n-mers, new algorithms and specific data structures (such as counting arrays [19] and incomplete search trees [20]) were utilized. The principal advantage of our approach is its time and memory efficiency, since only n-mers that are present in a genome under consideration (but not all possible 4^n n-mers) are taken into account. This approach also provides an efficient way to store sequences for later use.
Microbial/Viral Fingerprints Using Random Subsets of n-Mers
One may use relatively small sets of randomly picked n-mers for differentiating between different viruses and organisms.
This idea can be illustrated by continuing our example for three microbial genomes. Let n* be the n-mer size for which from 5% to 50% of all possible n-mers are present over the desired range of genome lengths. In accordance with Table 4.5, we may choose the value n*=12. Let us randomly pick L 12-mers (say, L=1000). For example, L can be the number of probes placed on a microarray. Given a genome G1 with the frequency of presence of n-mers p1, we expect that K=p1L n-mers present in G1 will also appear in our random set, forming a “fingerprint” of G1 (in our example, we expect 50<K<500). The probability, ε, that the fingerprint of G1 will exactly coincide with the fingerprint of some other genome G2 (with the frequency of presence of n-mers p2) is found in Appendix B. The result is
ε = (1 − p1 − p2 + 2·p12)^L.   (5.3)
Here p12 is the probability for the n-mer to be present in both genomes simultaneously. Let us consider the numeric example mentioned in Tables 4.3 and 4.4 of two species that are far from each other, Salmonella typhi vs. Mycobacterium tuberculosis H37Rv: p1=0.3465, p2=0.2600, p12=0.1160; with L=1000 a remarkable accuracy of ε=1.7×10^−204 can theoretically be achieved.
Given a desirable probability of error, ε, one can determine the appropriate size, L, of a random set of n-mers which can be used for reliable identification of genomes by solving Eq. (5.3) for L, which gives L = ln ε / ln(1 − p1 − p2 + 2·p12).
For related organisms, the genomes may contain large common parts. This means that p12 may be close to p1 and p2. To give a numeric example of close relatives, let us consider Staphylococcus aureus N315 vs. Staphylococcus aureus Mu50. Now p1=0.198, p2=0.203, p12=0.197 and an accuracy of ε=10^−10 can be achieved with L=4451. We would like to stress the logarithmic dependence of the sampling size, or the number of probes on the microarray, L, on the error probability, ε. This feature is of principal importance for the estimation procedure under discussion.
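Eq. (5.3) and the corresponding probe count are straightforward to evaluate numerically. The Python sketch below is illustrative only: the function names are hypothetical, and the probe count is obtained by solving Eq. (5.3) for L, that is L = ln ε / ln(1 − p1 − p2 + 2·p12), rounded up to a whole number of probes. Because this count is sensitive to the third decimal place of p1, p2 and p12 when the genomes are closely related, the usage example is restricted to the Salmonella typhi vs. Mycobacterium tuberculosis values quoted above.

```python
import math


def coincidence_probability(p1, p2, p12, L):
    """Eq. (5.3): probability that two fingerprints of L random n-mers coincide."""
    return (1.0 - p1 - p2 + 2.0 * p12) ** L


def required_probes(eps, p1, p2, p12):
    """Smallest whole L for which the coincidence probability is at most eps."""
    base = 1.0 - p1 - p2 + 2.0 * p12          # per-probe probability of agreement
    if not 0.0 < base < 1.0:
        raise ValueError("per-probe agreement probability must lie strictly between 0 and 1")
    return math.ceil(math.log(eps) / math.log(base))


if __name__ == "__main__":
    # Salmonella typhi vs. Mycobacterium tuberculosis H37Rv, values from the text.
    eps = coincidence_probability(0.3465, 0.2600, 0.1160, 1000)
    print(f"{eps:.1e}")                        # on the order of 1.7e-204
    print(required_probes(1e-10, 0.3465, 0.2600, 0.1160))
```

The logarithm in `required_probes` is what produces the logarithmic dependence of L on ε: each tenfold tightening of the error probability adds a fixed number of probes.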
Therefore, we can use practically any sufficiently random subset of n-mers of an appropriate size to construct a microarray to determine which organism a given DNA/RNA sample belongs to. Different sizes of n-mers must be employed for recognition of different organisms based on their genome lengths. Values of n that correspond to given intervals of genome lengths can be easily calculated using the above formulas. In fact, only 11 different n values, 7≦n≦17 can apply to a large variety of genome sizes from 1 Kb to 9 Gb.
The important advantage of such an approach is that it can be used without a priori knowledge of the sequence itself. This implies that there is no need to perform the expensive and time-consuming process of sequencing before array design. It is enough to obtain the purified DNA sample, hybridize it on a sufficiently random microarray chip and check which n-mers appear. Taking into account how accessible the DNA of thousands of microbes and viruses is, how easily each microarray can be produced, and the fact that we do not need to determine quantitative values of expression (we need just a yes/no answer), it should be possible to produce an essentially universal microbial/viral DNA chip.
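The yes/no readout described above can be mimicked in silico. The following sketch draws a random probe set and records, for each probe, whether it occurs in a sequence. The sequences are synthetic random strings, only one strand is examined, and all names are hypothetical; it is an illustration of the idea rather than the programs supplied on the CD.

```python
import random

def random_probes(n, L, seed=0):
    """Draw L distinct random n-mers to act as the probe set."""
    rng = random.Random(seed)
    probes = set()
    while len(probes) < L:
        probes.add("".join(rng.choice("ACGT") for _ in range(n)))
    return sorted(probes)

def fingerprint(sequence, probes):
    """Tuple of yes/no answers: is each probe present in the sequence?"""
    n = len(probes[0])
    present = {sequence[i:i + n] for i in range(len(sequence) - n + 1)}
    return tuple(probe in present for probe in probes)

# Two synthetic "genomes" standing in for purified DNA samples.
rng = random.Random(1)
genome_a = "".join(rng.choice("ACGT") for _ in range(2000))
genome_b = "".join(rng.choice("ACGT") for _ in range(2000))

probes = random_probes(n=6, L=200)
fp_a = fingerprint(genome_a, probes)
fp_b = fingerprint(genome_b, probes)
print(sum(fp_a), sum(fp_b), fp_a == fp_b)
```

The probe set is chosen without reference to either sequence, which mirrors the point that no sequencing is needed before array design.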
We next consider what happens when we try to compare closely related organisms using this approach (e.g. different types of influenza or different strains of the same microbe). We assume that two genomes G1 and G2 almost coincide and differ only in m randomly located nucleotides. This situation simulates the existence of point mutations or single nucleotide polymorphisms (SNPs). Let L be the size (number of probes) of the microarray and p be the frequency of presence of n-mers in a genome with a TSL value M. The value of L necessary to distinguish the fingerprints of these two genomes with the error probability ε can be estimated by the formula (see Appendix B):
Such an approach can provide the level of accuracy necessary for individual human fingerprints. Let us assume that the differences between individual human beings appear only because of SNPs, which have equal probability and are randomly located in the genome. According to literature estimates [23], the total number of SNPs in the human genome is expected to be around 3,000,000. Then, calculating the necessary size for the random microarray (m/M≈0.1%, ε=10^−10, n=17, p=0.284), we have L≈4769. This rough estimate is promising and indicates that this possibility deserves a proper experimental study. Recall that our theoretical estimates have been made for randomly picked sets of n-mers. A further possibility is to start with a larger than necessary random set of n-mers (say, L=10000) and then to decrease the microarray size by experimenting with the desired set of genomes (using, for instance, a simple optimization approach).
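The Appendix B formula is not reproduced in this section. As a rough stand-in, and strictly as an assumption, one may argue that a probe is present in G1 with probability p and, if present, overlaps one of the m mutated positions with probability of about n·m/M, so that a single probe distinguishes the two genomes with probability of about p·n·m/M. Requiring that none of L probes does so with probability ε gives L ≈ ln(1/ε)·M/(n·m·p). The sketch below evaluates this heuristic for the human example, taking ε=10^−10 (the same target used for the Staphylococcus example); the function name is hypothetical.

```python
import math

def probes_for_snp_discrimination(n, p, mutation_fraction, eps):
    """Heuristic L ~ ln(1/eps) / (p * n * m/M); an assumed form,
    not necessarily the formula derived in Appendix B."""
    per_probe = p * n * mutation_fraction  # chance one probe tells G1 from G2
    return math.log(1.0 / eps) / per_probe

# Human example from the text: m/M ~ 0.1%, n = 17, p = 0.284.
L = probes_for_snp_discrimination(n=17, p=0.284, mutation_fraction=0.001, eps=1e-10)
print(round(L))
```

Under these assumptions the heuristic yields a microarray size of about 4769 probes, the same order as the estimate quoted in the text.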
Research conducted at the University of Houston has focused on understanding the presence/absence of short subsequences in various genomes. The study of these short subsequences, also referred to as n-mers, where n is the length of the subsequence, has revealed some interesting results. Using existing algorithms, analysis has been performed for more than 250 microbial, viral and multicellular organisms. A remarkable similarity in the shape of the presence/absence distributions for different n-mers was found across all genomes. The same analysis applied analytically and numerically to random sequences also shows a similar shape of the distribution (the “random boundary”), with differences that correlate with biology. This led to the hypothesis that the presence/absence distribution of n-mers in all genomes considered (provided that the condition M<4^n holds, where M is the total genome sequence length) can be treated as nearly random. A correlation analysis of distributions of the presence/absence of short subsequences of different lengths (n-mers, n=5-20) in 250 published microbial and viral genomes was also performed. The results show that for organisms that are not close relatives of each other, the presence/absence of different 7- to 17-mers in their genomes is practically uncorrelated.
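The kind of correlation analysis described above can be sketched as follows. The text does not specify the correlation measure used, so the Pearson correlation of 0/1 presence vectors is an assumption, as are the function names and the use of synthetic random sequences in place of real genomes.

```python
import itertools
import math
import random

def presence_vector(sequence, n):
    """0/1 vector over all 4**n n-mers: 1 if the n-mer occurs in the sequence."""
    present = {sequence[i:i + n] for i in range(len(sequence) - n + 1)}
    return [1 if "".join(k) in present else 0
            for k in itertools.product("ACGT", repeat=n)]

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric vectors."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)

# Two unrelated synthetic sequences.
rng = random.Random(0)
g1 = "".join(rng.choice("ACGT") for _ in range(3000))
g2 = "".join(rng.choice("ACGT") for _ in range(3000))
v1, v2 = presence_vector(g1, 6), presence_vector(g2, 6)
print(round(pearson(v1, v1), 3), round(pearson(v1, v2), 3))
```

A sequence is perfectly correlated with itself, while two unrelated random sequences give a correlation close to zero, which is the behavior reported for organisms that are not close relatives.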
The low correlation among the n-mers present in different genomes suggests that it is possible to use random sets of n-mers (with appropriately chosen n) to discriminate between different microbial and viral genomes. Using a single appropriately chosen set of n-mers for each organism category (e.g. bacteria, viruses or humans), one would experimentally determine the presence or absence of every n-mer in the organism of interest. The important advantage of this approach is that it can be used without any a priori knowledge of the genomic sequence of the target organism.
This idea of using random sets of n-mers cannot be applied, as is, to samples of mixed organism categories, i.e. organisms with dramatically different genome lengths. A mixed sample of microbial, viral, and human DNA, all belonging to different categories, would require conflicting values of n. For instance, n-mers of length 16 or 17 can distinguish the human genome; however, very few 16- or 17-mers are present in smaller genomes such as a microbial genome. n-mers of length 12 can distinguish a microbial genome, but nearly all 12-mers are present in the human genome. In the context of probe design for a microarray assay, if 12-mers were used as probes with a mixed sample such as a microbial and human mix, nearly all probes would bind to targets. The image produced by the microarray would appear very dark, offering little information about the contents of the sample. On the other hand, if 16- or 17-mers were used, those probes specific to the larger human genome would bind; however, because very few probes would bind to the smaller microbial genome, the microarray image would appear very sparse if not completely blank. The situation is even worse for the detection of a virus, which has an even smaller genome.
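The conflict between values of n can be made concrete with the random-sequence approximation p ≈ 1 − exp(−M/4^n). Both this approximation and the round genome sizes below are assumptions chosen for illustration; real genomes deviate from the random model.

```python
import math

def presence_frequency(M, n):
    """Expected fraction of all n-mers present in a random sequence of length M."""
    return -math.expm1(-M / 4 ** n)

# Illustrative round genome sizes (assumed, not taken from the tables).
genomes = {"virus (~10 kb)": 1e4, "microbe (~5 Mb)": 5e6, "human (~3 Gb)": 3e9}

for n in (12, 17):
    for name, M in genomes.items():
        print(f"n={n:2d}  {name:16s} p ~ {presence_frequency(M, n):.2e}")
```

At n=12 essentially every probe is expected to bind human DNA, while at n=17 only a tiny fraction of probes is expected to bind microbial or viral DNA, which is the saturation and sparsity problem described above.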
To discover the presence of a smaller genome(s) in a mixed sample, we must first determine which n-mers are present in the smaller genome(s) but not in the larger. When designing probes, we want to include only those n-mers not found in the larger genome. There can, however, be many n-mers meeting this criterion. Thus a problem arises: we need to guarantee that the microarray is dense enough to include many n-mers not found in the larger genome. A microarray is normally spotted with one particular probe at each position on the slide. Following this traditional approach, a majority of the n-mers unique to the smaller genome(s) which we are trying to detect must be included in the probe set. Ensuring an appropriate density for the microarray therefore requires a large-scale and, more than likely, expensive experiment.
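Selecting the n-mers present in the smaller genome but absent from the larger one amounts to a set difference. The sketch below shows this on toy sequences; the function names are hypothetical, and reverse-complement strands are ignored for brevity.

```python
def nmers(sequence, n):
    """Set of all n-mers occurring in the sequence (one strand only)."""
    return {sequence[i:i + n] for i in range(len(sequence) - n + 1)}

def background_blind_nmers(small_genome, large_genome, n):
    """n-mers found in the small genome but not in the large (background) one."""
    return nmers(small_genome, n) - nmers(large_genome, n)

# Toy sequences standing in for a host (large) and a pathogen (small) genome.
host = "ACGTACGTTTGACCA"
pathogen = "ACGTGGGACC"
print(sorted(background_blind_nmers(pathogen, host, 4)))
```

For real genomes the background set is very large, so a memory-efficient representation (for example a bit array indexed by the n-mer encoded as an integer) would likely be needed in place of Python sets.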
A simple yet unique solution to this problem requires a new method of probe design. Instead of affixing one particular probe to each “spot” on a microarray chip, multiple probes, or as we call them, composite probes, can be used. By placing 10 to 100 different probes in each spot, the probability of locating a small genome in a mixed sample increases 10- to 100-fold without enlarging the chip size. Selection of the group of probes to be “spotted” together will be practically random; it will, however, necessitate some analysis of the group to verify that the probes will be able to bind to the target, i.e. that they will not bind to each other. Moreover, it is not of concern if more than one target binds to a probe site. The properties of composite probes will be the same as those of a single probe for a microarray experiment.
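A minimal sketch of composite-probe grouping is given below. Probes are assigned to spots in random order, and a probe is kept out of a spot if it could pair with a probe already there. The pairing test (a shared perfectly complementary stretch of w bases) is a crude assumption standing in for a proper hybridization analysis, and all names are hypothetical. The last function gives the chance that a spot of k probes is hit when each probe is hit independently with probability q.

```python
import random

COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]

def may_hybridize(a, b, w=6):
    """Crude check: does any w-mer of a pair perfectly with part of b?"""
    target = reverse_complement(b)
    return any(a[i:i + w] in target for i in range(len(a) - w + 1))

def group_into_spots(probes, k, w=6, seed=0):
    """Randomly group probes into spots of at most k mutually compatible probes."""
    rng = random.Random(seed)
    pool = list(probes)
    rng.shuffle(pool)
    spots = []
    for probe in pool:
        for spot in spots:
            if len(spot) < k and not any(may_hybridize(probe, other, w) for other in spot):
                spot.append(probe)
                break
        else:
            # No existing spot can take this probe, so start a new one.
            spots.append([probe])
    return spots

def spot_hit_probability(q, k):
    """Probability that at least one of k independent probes is hit."""
    return 1.0 - (1.0 - q) ** k

print(group_into_spots(["AAAAAAAAAA", "TTTTTTTTTT", "CCCCCCCCCC"], k=2))
print(spot_hit_probability(1e-4, 100) / 1e-4)
```

The gain from k probes per spot is close to k-fold only while k·q is small; as k·q approaches 1 the spots begin to saturate and the gain levels off.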
Microarray experiments with composite probes are ideal for diagnosis. Quasi-random probes can be selected, grouped, and spotted to identify the presence of smaller genomes in mixed and unknown samples. In order to successfully use composite probes, the following two assumptions must preferably hold true:
In many cases, an efficient way to make composite probes is to introduce mixed bases at selected positions during synthesis. A single position made to be 90% A and 10% G, for example, doubles the number of probes in a spot or on a particle. A single sequence far from the closest background sequence can be expanded into a large number (thousands) of probes by mixed synthesis at more than one position, taking into account the possibility of mixing at different locations. For example, there are only 20 locations at which to introduce a single mixed position into a 20-mer, but a far larger number of possible arrangements of three modification positions in a 20-mer.
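The effect of mixed synthesis can be sketched by enumerating the sequences produced when selected positions admit more than one base, and by counting the ways of choosing the mixed positions. The probe sequence, the chosen positions and the function name are illustrative assumptions; the relative proportions of the bases (such as 90%/10%) are not modeled.

```python
import itertools
import math

def expand_mixed_positions(probe, mixed):
    """Enumerate all probes produced by mixed synthesis.

    mixed maps a 0-based position to the string of bases allowed there.
    """
    choices = [mixed.get(i, base) for i, base in enumerate(probe)]
    return ["".join(variant) for variant in itertools.product(*choices)]

probe = "ACGTACGTACGTACGTACGT"  # an arbitrary 20-mer
variants = expand_mixed_positions(probe, {0: "AG", 7: "CT", 19: "GT"})

# Number of variants, and number of ways to place 1 or 3 mixed positions.
print(len(variants), math.comb(20, 1), math.comb(20, 3))
```

Three two-base positions give 2^3 = 8 probes from one synthesis, and there are 1140 ways to choose three positions in a 20-mer compared with 20 ways to choose one.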
Other applications of host- or background-blind probes and primers include the creation of labels known to have reduced probability of occurring in the background, detection of food (e.g. meat) adulteration, detection of human genetic rearrangements and mutations, monitoring of xenotransplant organs and patients, and enhancement of identification methods based on rRNA and intergenic regions.
N-mers specific for identifying specific organisms are on the accompanying CDs.
Aeropyrum pernix K1 complete genome.
Agrobacterium tumefaciens str. C58 linear
Agrobacterium tumefaciens str. C58 AT
Agrobacterium tumefaciens str. C58 Ti
Agrobacterium tumefaciens strain C58
Agrobacterium tumefaciens strain C58
Agrobacterium tumefaciens strain C58
Agrobacterium tumefaciens strain C58
Agrobacterium tumefaciens strain C58
Aquifex aeolicus complete genome.
Aquifex aeolicus plasmid ece1
Archaeoglobus fulgidus complete genome.
Bacillus halodurans C-125
Bacillus subtilis complete genome.
Bifidobacterium longum NCC2705
Borrelia burgdorferi.
Borrelia burgdorferi plasmid cp32-4
Borrelia burgdorferi plasmid cp32-6
Borrelia burgdorferi plasmid lp5
Borrelia burgdorferi plasmid lp28-2
Borrelia burgdorferi plasmid lp28-3
Borrelia burgdorferi plasmid lp28-4
Borrelia burgdorferi plasmid lp36
Borrelia burgdorferi plasmid lp54
Bradyrhizobium japonicum strain
Brucella melitensis strain 16M chromosome I
Brucella suis 1330 chromosome I complete
Buchnera aphidicola str. Sg (Schizaphis graminum) complete genome.
Buchnera sp. APS complete genome.
Buchnera sp. APS plasmid pTrp DNA
Campylobacter jejuni complete genome.
Caulobacter crescentus complete genome.
Chlamydia muridarum
Chlamydia muridarum plasmid pMoPn
Chlamydia trachomatis complete genome.
Chlamydophila pneumoniae AR39
Chlamydophila pneumoniae J138
Chlorobium tepidum TLS complete
Clostridium acetobutylicum ATCC824
Clostridium acetobutylicum megaplasmid
Clostridium perfringens 13 DNA
Corynebacterium efficiens YS-314
Corynebacterium glutamicum ATCC 13032
Deinococcus radiodurans R1 complete
Deinococcus radiodurans R1 megaplasmid
Deinococcus radiodurans R1 plasmid CP1
Escherichia coli CFT073 complete genome.
Escherichia coli K-12 MG1655 complete
Escherichia coli O157:H7
Escherichia coli O157:H7
Fusobacterium nucleatum subsp. nucleatum
Haemophilus influenzae Rd complete
Halobacterium sp. NRC-1 complete
Halobacterium sp. NRC-1 plasmid
Halobacterium sp. NRC-1 plasmid
Helicobacter pylori
Helicobacter pylori 26695 complete
Lactococcus lactis subsp. lactis IL1403
Leptospira interrogans serovar lai str. 56601
Listeria innocua Clip11262
Listeria innocua plasmid pL1100
Listeria monocytogenes strain EGD
Mesorhizobium loti complete genome
Mesorhizobium loti plasmid pMLa DNA
Mesorhizobium loti plasmid pMLb DNA
Methanobacterium thermoautotrophicum
Methanococcus jannaschii complete
Methanococcus jannaschii large extra-
Methanopyrus kandleri AV19
Methanosarcina acetivorans str. C2A
Methanosarcina mazei strain Goe1
Mycobacterium leprae strain TN complete
Mycobacterium tuberculosis CDC1551
Mycobacterium tuberculosis complete
Mycoplasma genitalium G37 complete
Mycoplasma penetrans
Mycoplasma pneumoniae M129
Mycoplasma pulmonis (strain UAB CTIP)
Neisseria meningitidis serogroup A strain
Neisseria meningitidis serogroup B strain
Nostoc sp. PCC 7120 complete genome.
Nostoc sp. PCC 7120 plasmid
Nostoc sp. PCC 7120 plasmid
Nostoc sp. PCC 7120 plasmid
Nostoc sp. PCC 7120 plasmid
Nostoc sp. PCC 7120 plasmid
Nostoc sp. PCC 7120 plasmid pCC7120zeta
Oceanobacillus iheyensis
Pasteurella multocida PM70 complete
Pseudomonas aeruginosa PA01
Pseudomonas putida KT2440
Pyrobaculum aerophilum strain IM2
Pyrococcus abyssi complete genome.
Pyrococcus furiosus DSM 3638
Pyrococcus horikoshii OT3 complete
Ralstonia solanacearum GMI1000
Ralstonia solanacearum GMI1000
Rickettsia conorii Malish 7
Rickettsia prowazekii strain Madrid E
Salmonella enterica serovar Typhi
Salmonella enterica serovar Typhi
Salmonella enterica serovar Typhi
Salmonella typhimurium LT2
Salmonella typhimurium LT2 strain
Shewanella oneidensis MR-1 complete
Shewanella oneidensis MR-1 megaplasmid
Shigella flexneri 2a str. 301 complete
Sinorhizobium meliloti 1021 complete
Sinorhizobium meliloti plasmid pSymA
Sinorhizobium meliloti plasmid pSymB
Staphylococcus aureus MW2
Staphylococcus aureus strain Mu50
Staphylococcus aureus plasmid VRSAp
Staphylococcus aureus strain N315
Staphylococcus aureus subsp. aureus N315
Streptococcus agalactiae
Streptococcus agalactiae NEM316
Streptococcus mutans UA159 complete
Streptococcus pneumoniae complete
Streptococcus pneumoniae R6 complete
Streptococcus pyogenes MGAS315
Streptococcus pyogenes strain MGAS8232
Streptococcus pyogenes strain SF370
Streptomyces coelicolor A3(2)
Streptomyces coelicolor plasmid SCP1.
Streptomyces coelicolor plasmid SCP2.
Sulfolobus solfataricus complete genome.
Sulfolobus tokodaii complete genome.
Synechocystis PCC6803 complete genome.
Thermoanaerobacter tengcongensis strain
Thermoplasma acidophilum
Thermoplasma volcanium
Thermosynechococcus elongatus BP-1
Thermotoga maritima complete genome.
Treponema pallidum complete genome.
Ureaplasma urealyticum complete genome.
Vibrio cholerae chromosome I
Vibrio vulnificus CMCP6 chromosome I
Vibrio parahaemolyticus RIMD 2210633,
Wigglesworthia brevipalpis
Xanthomonas axonopodis pv. citri str. 306
Xylella fastidiosa Temecula1
Xanthomonas campestris pv. campestris str.
Xylella fastidiosa
Xylella fastidiosa plasmid pXF1.3
Xylella fastidiosa plasmid pXF51
Xylella fastidiosa plasmid pXF868
Yersinia pestis KIM complete genome.
Yersinia pestis strain CO92
Yersinia pestis plasmid pCD1.
Yersinia pestis plasmid pPCP1.
Yersinia pestis plasmid pPMT1.
Chlamydophila pneumoniae CWL029
Nitrosomonas europaea
Prochlorococcus marinus MED4
Prochlorococcus marinus MIT9313
Rhodopseudomonas palustris
Staphylococcus epidermidis ATCC 12228,
Synechococcus
Clostridium tetani E88, complete genome.
Lactobacillus plantarum WCFS1, complete
Pseudomonas syringae pv. tomato str.
Streptococcus pyogenes phage 315.3
Tropheryma whipplei str. Twist
Chlamydia psittaci bacteriophage chp1,
Saccharomyces cerevisiae virus L-BC (La),
Zygosaccharomyces bailii virus Z,
Drosophila x virus segment A, complete
Saccharomyces cerevisiae virus L-A,
Chlamydia phage PhiCPG1, complete
Chlamydia phage phiCPAR39, complete
Drosophila C virus, complete genome.
Lactococcus lactis bacteriophage ul36,
Chlamydia phage 2 virion, complete
Pseudomonas phage Pf3, complete
Saccharomyces cerevisiae killer virus M1,
Streptococcus pyogenes phage 315.3
Lactococcus phage BK5-T, complete
Discula destructiva virus 2 segment 1,
Cymbidium ringspot virus, complete
Streptococcus pyogenes phage 315.1
Fringilla coelebs papillomavirus, complete
Discula destructiva virus 1 segment 1,
Trichomonas vaginalis virus, complete
Plantago asiatica mosaic virus, complete
Bacillus phage PZA, complete genome.
Staphylococcus aureus prophage phiPV83
Trichomonas vaginalis virus 3, complete
Staphylococcus aureus phage phi 12
Lactobacillus bacteriophage phi adh,
Staphylococcus aureus phage phi 13
Rhizoctonia solani virus RNA 1, complete
Ustilago maydis virus H1, complete
Gremmeniella abietina RNA virus L1,
Nilaparvata lugens reovirus segment 1,
Giardia lamblia virus, complete genome.
Streptococcus thermophilus bacteriophage
Planococcus citri densovirus, complete
Saccharomyces cerevisiae narnavirus 23S
Staphylococcus aureus temperate phage
Streptococcus thermophilus bacteriophage
Psittacus erithacus papillomavirus,
Eimeria brunetti RNA virus 1, complete
Staphylococcus aureus bacteriophage PVL
Streptococcus thermophilus temperate
Aconitum latent virus, complete genome.
Pseudomonas phage PP7, complete
Staphylococcus aureus phage phi 11
Streptococcus pyogenes phage 315.6
Streptococcus thermophilus bacteriophage
Streptococcus pyogenes phage 315.2
Sorghum chlorotic spot virus RNA 1,
Streptococcus thermophilus bacteriophage
Staphylococcus phage phiN315 provirus,
Streptococcus thermophilus bacteriophage
Sulfolobus islandicus filamentous virus,
Pseudomonas phage Pf1, complete
Streptococcus pyogenes phage 315.5
Saccharomyces cerevisiae narnavirus 20S
Rhopalosiphum padi virus, complete
Streptococcus pyogenes phage 315.4
Helicoverpa armigera stunt virus RNA 1,
Sphaeropsis sapinea RNA virus 1,
Haemophilus phage HP2, complete
Sphaeropsis sapinea RNA virus 2,
Xanthomonas phage Cf1c, complete
Vibrio harveyi bacteriophage VHML,
Haemophilus phage HP1, complete
Nudaurelia capensis beta virus, complete
Yersinia pestis phage phiA1122, complete
Lymantria dispar cypovirus 1 segment 1,
Pseudomonas aeruginosa phage PaP3,
Lactobacillus casei bacteriophage A2
Lymantria dispar cypovirus 14 segment 1,
Salmonella typhimurium bacteriophage
Rana tigrina ranavirus, complete genome.
Citrus tristeza virus, complete genome.
Pseudomonas phage gh-1, complete
Shigella flexneri bacteriophage V,
Acyrthosiphon pisum bacteriophage
Pseudomonas phage phi-6 segment L,
Pseudomonas bacteriophage phi-13
Pseudomonas phage phi KZ, complete
Salmonella typhimurium phage ST64B,
Adoxophyes honmai
Pseudomonas phage D3, complete genome.
Plutella xylostella granulovirus, complete
Sinorhizobium meliloti phage PBC5,
Rachiplusia ou multiple
Bombyx mori nuclear polyhedrosis virus,
Autographa californica
Culex nigripalpus baculovirus, complete
Paramecium bursaria Chlorella virus 1,
Xestia c-nigrum granulovirus, complete
Epiphyas postvittana
Helicoverpa zea single nucleocapsid
Heliocoverpa armigera
Helicoverpa armigera nuclear polyhedrosis
Mamestra configurata
Mamestra configurata
Burkholderia cepacia phage Bcep781,
Orgyia pseudotsugata multicapsid
Choristoneura fumiferana MNPV,
Spodoptera exigua nucleopolyhedrovirus,
Ectocarpus siliculosus virus, complete
Cymbidium ringspot virus satellite RNA, complete
Adoxophyes honmai nucleopolyhedrovirus, complete
Autographa californica nucleopolyhedrovirus, complete
Bombyx mori nuclear polyhedrosis virus, complete
Choristoneura fumiferana MNPV, complete genome.
Culex nigripalpus baculovirus, complete genome.
Epiphyas postvittana nucleopolyhedrovirus, complete
Helicoverpa armigera nuclear polyhedrosis virus,
Helicoverpa zea single nucleocapsid
Heliocoverpa armigera nucleopolyhedrovirus G4,
Mamestra configurata nucleopolyhedrovirus A, complete
Mamestra configurata nucleopolyhedrovirus B, complete
Orgyia pseudotsugata multicapsid nucleopolyhedrovirus,
Plutella xylostella granulovirus, complete genome.
Rachiplusia ou multiple nucleopolyhedrovirus, complete
Spodoptera exigua nucleopolyhedrovirus, complete
Xestia c-nigrum granulovirus, complete genome.
Acyrthosiphon pisum bacteriophage APSE-1, complete
Bacillus phage GA-1 virion, complete genome.
Bacillus phage PZA, complete genome.
Burkholderia cepacia phage Bcep781, complete genome.
Haemophilus phage HP1, complete genome.
Haemophilus phage HP2, complete genome.
Lactobacillus bacteriophage phi adh, complete genome.
Lactobacillus casei bacteriophage A2 virion, complete
Lactococcus lactis bacteriophage ul36, complete genome.
Lactococcus phage BK5-T, complete genome.
Lactococcus phage c2.
Lactococcus phage P335, complete genome.
Methanothermobacter wolfeii prophage psiM100.
Pseudomonas aeruginosa phage PaP3, complete genome.
Pseudomonas phage D3, complete genome.
Pseudomonas phage gh-1, complete genome.
Pseudomonas phage phiKZ, complete genome.
Salmonella typhimurium bacteriophage ST64T virion,
Salmonella typhimurium phage ST64B, complete
Shigella flexneri bacteriophage V, complete genome.
Staphylococcus aureus bacteriophage PVL provirus,
Staphylococcus aureus phage phiP68, complete genome.
Staphylococcus aureus prophage phiPV83 provirus,
Staphylococcus aureus temperate phage phiSLT virion,
Staphylococcus phage 44AHJD, complete genome.
Streptococcus phage Cp-1, complete genome.
Streptococcus thermophilus bacteriophage 7201, complete
Streptococcus thermophilus bacteriophage DT1, complete
Streptococcus thermophilus bacteriophage Sfi11,
Streptococcus thermophilus bacteriophage Sfi19,
Streptococcus thermophilus bacteriophage Sfi21,
Streptococcus thermophilus temperate bacteriophage
Vibrio harveyi bacteriophage VHML, complete genome.
Yersinia pestis phage phiA1122, complete genome.
Alteromonas phage PM2, complete genome.
Rana tigrina ranavirus, complete genome.
Sulfolobus islandicus filamentous virus, genome.
Equus caballus papillomavirus type 1, complete genome.
Fringilla coelebs papillomavirus, complete genome.
Phocoena spinipinnis papillomavirus, complete genome.
Psittacus erithacus papillomavirus, complete genome.
Ectocarpus siliculosus virus, complete genome.
Paramecium bursaria Chlorella virus 1, complete genome.
Amsacta moorei entomopoxvirus, complete genome.
Melanoplus sanguinipes entomopoxvirus, complete
Mycoplasma arthritidis bacteriophage MAV1, complete
Drosophila x virus segment A, complete sequence.
Pseudomonas bacteriophage phi-13 segment L, complete
Pseudomonas phage phi-6 segment L, complete sequence.
Atkinsonella hypoxylon partitivirus RNA 1, complete
Discula destructiva virus 1 segment 1, complete sequence.
Discula destructiva virus 2 segment 1, complete sequence.
Fusarium poae virus 1 RNA 1, complete sequence.
Gremmeniella abietina RNA virus MS1 RNA 1, complete
Rhizoctonia solani virus RNA1, complete sequence.
Bombyx mori cypovirus 1 segment 1, complete sequence.
Lymantria dispar cypovirus 1 segment 1, complete
Lymantria dispar cypovirus 14 segment 1, complete
Nilaparvata lugens reovirus segment 1, complete
Trichoplusia ni cytoplasmic polyhedrosis virus 15
Eimeria brunetti RNA virus 1, complete genome.
Giardia lamblia virus, complete genome.
Gremmeniella abietina RNA virus L1, complete genome.
Helminthosporium victoriae virus 190S, complete
Saccharomyces cerevisiae virus L-A, complete genome.
Saccharomyces cerevisiae virus L-BC (La), complete
Sphaeropsis sapinea RNA virus 1, complete genome.
Sphaeropsis sapinea RNA virus 2, complete genome.
Trichomonas vaginalis virus, complete genome.
Trichomonas vaginalis virus 3, complete genome.
Trichomonas vaginalis virus II, complete genome.
Ustilago maydis virus H1, complete genome.
Zygosaccharomyces bailii virus Z, complete genome.
Diaporthe ambigua RNA virus 1, complete genome.
Ageratum enation virus, complete genome.
Ageratum yellow vein China virus, complete genome.
Ageratum yellow vein Sri Lanka virus segment A,
Ageratum yellow vein Taiwan virus, complete genome.
Ageratum yellow vein virus, complete genome.
Chloris striate mosaic virus, complete genome.
Dicliptera yellow mottle virus DNA A, complete
Eupatorium yellow vein virus, complete genome.
Ipomoea yellow vein virus, complete genome.
Macroptilium mosaic Puerto Rico virus DNA A, complete
Macroptilium yellow mosaic Florida virus DNA A,
Macroptilium yellow mosaic virus, complete genome.
Malvastrum yellow vein virus, complete genome.
Miscanthus streak virus, complete genome.
Panicum streak virus, complete genome.
Papaya leaf curl virus, complete genome.
Sida golden mosaic Costa Rica virus DNA A, complete
Sida golden mosaic Florida virus, complete genome.
Sida golden mosaic Honduras virus DNA A, complete
Sida golden mosaic virus DNA-A, complete sequence.
Sida golden yellow vein virus, complete genome.
Sida mottle virus, complete genome.
Sida yellow mosaic virus, complete genome.
Sida yellow vein virus DNA A, complete sequence.
Stachytarpheta leaf curl virus, complete genome.
Acholeplasma phage MV-L1, complete genome.
Propionibacterium phage phiB5, complete genome.
Pseudomonas phage Pf1, complete genome.
Pseudomonas phage Pf3, complete genome.
Spiroplasma phage 1-C74, complete genome.
Spiroplasma phage 1-R8A2B, complete genome.
Vibrio cholerae filamentous bacteriophage fs-2, complete
Vibrio cholerae O139 fs1 phage, complete genome.
Vibrio cholerae phage VGJphi virion, complete genome.
Xanthomonas phage Cf1c, complete genome.
Chlamydia phage 2 virion, complete genome.
Chlamydia phage phiCPAR39, complete genome.
Chlamydia phage PhiCPG1, complete genome.
Chlamydia psittaci bacteriophage chp1, complete genome.
Spiroplasma phage 4, complete genome.
Pseudomonas phage PP7, complete genome.
Sinorhizobium meliloti phage PBC5, complete
Lactobacillus casei bacteriophage A2
Listeria phage 2389 virion, complete
Methanothermobacter wolfeii prophage
Pseudomonas phage gh-1, complete
Staphylococcus aureus bacteriophage PVL
Streptococcus thermophilus bacteriophage
Yersinia pestis phage phiA1122, complete
Propionibacterium phage phiB5, complete
Pseudomonas phage Pf1, complete
Vibrio cholerae filamentous bacteriophage
Chlamydia phage 2 virion, complete
Pseudomonas phage PP7, complete
Acyrthosiphon pisum
Pseudomonas aeruginosa phage
Streptococcus thermophilus
Vibrio harveyi bacteriophage
Alteromonas phage PM2,
Haemophilus phage HP1,
Shigella flexneri bacteriophage V,
Haemophilus phage HP2,
Pseudomonas phage phi-6
Xanthomonas phage Cf1c,
Salmonella typhimurium
Pseudomonas phage phiKZ,
Salmonella typhimurium phage
Pseudomonas phage D3, complete
Pseudomonas bacteriophage phi-
Sinorhizobium meliloti phage
Mamestra configurata
Choristoneura fumiferana MNPV,
Burkholderia cepacia phage
Spodoptera exigua
Choristoneura fumiferana MNPV,
Plutella xylostella granulovirus, complete
Acyrthosiphon pisum bacteriophage APSE-1, complete
Bacillus phage GA-1 virion, complete genome.
Bacillus phage PZA, complete genome.
Burkholderia cepacia phage Bcep781, complete genome.
Enterobacteria phage 186, complete genome.
Enterobacteria phage epsilon15, complete genome.
Enterobacteria phage HK022 virion, complete genome.
Enterobacteria phage Mu, complete genome.
Enterobacteria phage P2, complete genome.
Enterobacteria phage T4, complete genome.
Enterobacteria phage T7.
Haemophilus phage HP1, complete genome.
Haemophilus phage HP2, complete genome.
Lactobacillus bacteriophage phi adh, complete genome.
Lactobacillus casei bacteriophage A2 virion, complete
Lactococcus lactis bacteriophage ul36, complete genome.
Lactococcus phage BK5-T, complete genome.
Lactococcus phage c2.
Lactococcus phage P335, complete genome.
Listeria phage 2389 virion, complete genome.
Methanobacterium phage psiM2, complete genome.
Methanothermobacter wolfeii prophage psiM100.
Mycobacterium phage L5.
Mycoplasma virus P1, complete genome.
Pseudomonas aeruginosa phage PaP3, complete genome.
Pseudomonas phage D3, complete genome.
Pseudomonas phage gh-1, complete genome.
Pseudomonas phage phiKZ, complete genome.
Salmonella typhimurium bacteriophage ST64T virion,
Salmonella typhimurium phage ST64B, complete
Shigella flexneri bacteriophage V, complete genome.
Staphylococcus aureus bacteriophage PVL provirus,
Staphylococcus aureus phage phiP68, complete genome.
Staphylococcus aureus prophage phiPV83 provirus,
Staphylococcus aureus temperate phage phiSLT virion,
Staphylococcus phage 44AHJD, complete genome.
Streptococcus phage Cp-1, complete genome.
Streptococcus thermophilus bacteriophage 7201, complete
Streptococcus thermophilus bacteriophage DT1, complete
Streptococcus thermophilus bacteriophage Sfi11,
Streptococcus thermophilus bacteriophage Sfi19,
Streptococcus thermophilus bacteriophage Sfi21,
Streptococcus thermophilus temperate bacteriophage
Vibrio harveyi bacteriophage VHML, complete genome.
Yersinia pestis phage phiA1122, complete genome.
Alteromonas phage PM2, complete genome.
Sulfolobus virus 1, complete genome.
Lymphocystis disease virus 1, complete genome.
Rana tigrina ranavirus, complete genome.
Sulfolobus islandicus filamentous virus, genome.
Equus caballus papillomavirus type 1, complete genome.
Fringilla coelebs papillomavirus, complete genome.
Phocoena spinipinnis papillomavirus, complete genome.
Psittacus erithacus papillomavirus, complete genome.
Ectocarpus siliculosus virus, complete genome.
Paramecium bursaria Chlorella virus 1, complete genome.
Acholeplasma phage L2, complete genome.
Amsacta moorei entomopoxvirus, complete genome.
Melanoplus sanguinipes entomopoxvirus, complete
Sulfolobus virus SIRV-1, complete genome.
Sulfolobus virus SIRV-2, complete genome.
Enterobacteria phage PRD1, complete genome.
Mycoplasma arthritidis bacteriophage MAV1, complete
Drosophila x virus segment A, complete sequence.
Pseudomonas bacteriophage phi-13 segment L, complete
Pseudomonas phage phi-6 segment L, complete sequence.
Atkinsonella hypoxylon partitivirus RNA 1, complete
Discula destructiva virus 1 segment 1, complete sequence.
Discula destructiva virus 2 segment 1, complete sequence.
Fusarium poae virus 1 RNA 1, complete sequence.
Gremmeniella abietina RNA virus MS1 RNA 1, complete
Rhizoctonia solani virus RNA1, complete sequence.
Bombyx mori cypovirus 1 segment 1, complete sequence.
Kadipiro virus segment 1, complete sequence.
Lymantria dispar cypovirus 1 segment 1, complete
Lymantria dispar cypovirus 14 segment 1, complete
Nilaparvata lugens reovirus segment 1, complete
Trichoplusia ni cytoplasmic polyhedrosis virus 15
Eimeria brunetti RNA virus 1, complete genome.
Giardia lamblia virus, complete genome.
Gremmeniella abietina RNA virus L1, complete genome.
Helminthosporium victoriae virus 190S, complete
Leishmania RNA virus 1-1, complete genome.
Leishmania RNA virus 1-4, complete genome.
Leishmania RNA virus 2-1, complete genome.
Saccharomyces cerevisiae virus L-A, complete genome.
Saccharomyces cerevisiae virus L-BC (La), complete
Sphaeropsis sapinea RNA virus 1, complete genome.
Sphaeropsis sapinea RNA virus 2, complete genome.
Trichomonas vaginalis virus, complete genome.
Trichomonas vaginalis virus 3, complete genome.
Trichomonas vaginalis virus II, complete genome.
Ustilago maydis virus H1, complete genome.
Zygosaccharomyces bailii virus Z, complete genome.
Diaporthe ambigua RNA virus 1, complete genome.
Ageratum enation virus, complete genome.
Ageratum yellow vein China virus, complete genome.
Ageratum yellow vein Sri Lanka virus segment A,
Ageratum yellow vein Taiwan virus, complete genome.
Ageratum yellow vein virus, complete genome.
Dicliptera yellow mottle virus DNA A, complete
Eupatorium yellow vein virus, complete genome.
Ipomoea yellow vein virus, complete genome.
Macroptilium mosaic Puerto Rico virus DNA A, complete
Macroptilium yellow mosaic Florida virus DNA A,
Macroptilium yellow mosaic virus, complete genome.
Malvastrum yellow vein virus, complete genome.
Miscanthus streak virus, complete genome.
Panicum streak virus, complete genome.
Rhynchosia Golden Mosaic Virus chromosome DNA A,
Sida golden mosaic Costa Rica virus DNA A, complete
Sida golden mosaic Florida virus, complete genome.
Sida golden mosaic Honduras virus DNA A, complete
Sida golden mosaic virus DNA-A, complete sequence.
Sida golden yellow vein virus, complete genome.
Sida mottle virus, complete genome.
Sida yellow mosaic virus, complete genome.
Sida yellow vein virus DNA A, complete sequence.
Stachytarpheta leaf curl virus, complete genome.
Acholeplasma phage MV-L1, complete genome.
Enterobacteria phage I2-2.
Enterobacteria phage If1, complete genome.
Enterobacteria phage Ike.
Propionibacterium phage phiB5, complete genome.
Pseudomonas phage Pf1, complete genome.
Pseudomonas phage Pf3, complete genome.
Spiroplasma phage 1-C74, complete genome.
Spiroplasma phage 1-R8A2B, complete genome.
Vibrio cholerae filamentous bacteriophage fs-2, complete
Vibrio cholerae O139 fs1 phage, complete genome.
Vibrio cholerae phage VGJphi virion, complete genome.
Vibrio phage VSK, complete genome.
Xanthomonas phage Cf1c, complete genome.
Chlamydia phage 2 virion, complete genome.
Chlamydia phage phiCPAR39, complete genome.
Chlamydia phage PhiCPG1, complete genome.
Chlamydia psittaci bacteriophage chp1, complete genome.
Enterobacteria phage G4, complete genome.
Enterobacteria phage S13, complete genome.
Spiroplasma phage 4, complete genome.
Ageratum yellow vein virus-associated nanovirus DNA,
Acheta domesticus densovirus, complete genome.
Aedes albopictus densovirus, complete genome.
Bombyx mori densovirus 1, complete genome.
Bombyx mori densovirus 5, complete genome.
Casphalia extranea densovirus, complete genome.
Diatraea saccharalis densovirus, complete genome.
Galleria mellonella densovirus, complete genome.
Junonia coenia densovirus, complete genome.
Periplaneta fuliginosa densovirus, complete genome.
Planococcus citri densovirus, complete genome.
Lassa virus segment L, complete sequence.
Tacaribe virus segment L, complete sequence.
Tupaia paramyxovirus, complete genome.
Vesicular stomatitis Indiana virus, complete genome.
Perina nuda picorna-like virus, complete genome.
Enterobacteria phage fr, complete genome.
Enterobacteria phage GA, complete genome.
Enterobacteria phage KU1, complete genome.
Enterobacteria phage MX1.
Enterobacteria phage NL95.
Pseudomonas phage PP7, complete genome.
Cryphonectria parasitica mitovirus 1-NB631, complete
Saccharomyces cerevisiae narnavirus 20S RNA
Saccharomyces cerevisiae narnavirus 23S RNA
Epinephelus tauvina nervous necrosis virus RNA 1,
Sinorhizobium meliloti phage PBC5,
Staphylococcus aureus phage phi 11
Staphylococcus aureus phage phi 12
Staphylococcus aureus phage phi 13
Staphylococcus phage phiN315 provirus,
Streptococcus pyogenes phage 315.1
Streptococcus pyogenes phage 315.2
Streptococcus pyogenes phage 315.3
Streptococcus pyogenes phage 315.4
Streptococcus pyogenes phage 315.5
Streptococcus pyogenes phage 315.6
Saccharomyces cerevisiae killer virus M1,
Sclerophthora macrospora virus B,
Agrobacterium tumefaciens strain C58 circular
Bacillus subtilis complete genome.
Corynebacterium glutamicum ATCC 13032
Escherichia coli K-12 MG1655 complete genome.
Helicobacter pylori
Pseudomonas aeruginosa PAO1
Pseudomonas putida KT2440
Agrobacterium tumefaciens strain C58 circular
Bacillus subtilis complete genome.
Campylobacter jejuni complete genome.
Corynebacterium glutamicum ATCC 13032
Escherichia coli K-12 MG1655 complete genome.
Helicobacter pylori
Pseudomonas aeruginosa PAO1
Aeropyrum pernix K1 complete genome.
Aquifex aeolicus complete genome.
Nostoc sp. PCC 7120 plasmid
Nostoc sp. PCC 7120 plasmid
Nostoc sp. PCC 7120 plasmid pCC7120zeta
Oceanobacillus iheyensis
Pasteurella multocida PM70 complete
Pseudomonas aeruginosa PAO1
Pseudomonas putida KT2440
Pyrobaculum aerophilum strain IM2
Pyrococcus abyssi complete genome.
Pyrococcus furiosus DSM 3638
Pyrococcus horikoshii OT3 complete
Ralstonia solanacearum GMI1000
Ralstonia solanacearum GMI1000
Rickettsia conorii Malish 7
Rickettsia prowazekii strain Madrid E
Salmonella enterica serovar Typhi
Salmonella enterica serovar Typhi
Salmonella enterica serovar Typhi
Salmonella typhimurium LT2
Archaeoglobus fulgidus complete genome.
Salmonella typhimurium LT2 strain
Shewanella oneidensis MR-1 complete
Shewanella oneidensis MR-1 megaplasmid
Shigella flexneri 2a str. 301 complete
Sinorhizobium meliloti 1021 complete
Sinorhizobium meliloti plasmid pSymA
Sinorhizobium meliloti plasmid pSymB
Staphylococcus aureus MW2
Staphylococcus aureus strain Mu50
Bacillus halodurans C-125
Staphylococcus aureus strain N315
Staphylococcus aureus subsp. aureus N315
Streptococcus agalactiae
Streptococcus agalactiae NEM316
Streptococcus mutans UA159 complete
Streptococcus pneumoniae complete
Streptococcus pneumoniae R6
Streptococcus pyogenes MGAS315
Streptococcus pyogenes strain MGAS8232
Streptococcus pyogenes strain SF370
Bacillus subtilis complete genome.
Streptomyces coelicolor A3(2)
Streptomyces coelicolor plasmid SCP1.
Streptomyces coelicolor plasmid SCP2.
Sulfolobus solfataricus complete genome.
Sulfolobus tokodaii complete genome.
Synechocystis PCC6803 complete genome.
Thermoanaerobacter tengcongensis strain
Thermoplasma acidophilum
Thermoplasma volcanium
Thermosynechococcus elongatus BP-1
Bifidobacterium longum NCC2705
Thermotoga maritima complete genome.
Treponema pallidum complete genome.
Ureaplasma urealyticum complete genome.
Vibrio cholerae chromosome I
Vibrio vulnificus CMCP6 chromosome I
Wigglesworthia brevipalpis
Xanthomonas campestris pv. campestris str.
Xylella fastidiosa
Xylella fastidiosa plasmid pXF1.3
Borrelia burgdorferi.
Xylella fastidiosa plasmid pXF51
Yersinia pestis KIM complete genome.
Yersinia pestis strain CO92
Yersinia pestis plasmid pCD1.
Yersinia pestis plasmid pPCP1.
Yersinia pestis plasmid pPMT1.
Nitrosomonas europaea
Prochlorococcus marinus MED4
Prochlorococcus marinus MIT9313
Rhodopseudomonas palustris
Staphylococcus epidermidis
Synechococcus
Clostridium tetani E88, complete genome.
Lactobacillus plantarum WCFS1, complete
Pseudomonas syringae pv. tomato
Streptococcus pyogenes phage
Tropheryma whipplei str. Twist
Xylella fastidiosa Temecula1
Agrobacterium tumefaciens str. C58 linear
Agrobacterium tumefaciens str. C58 AT
Borrelia burgdorferi plasmid lp28-2
Bradyrhizobium japonicum strain
Brucella melitensis strain 16M chromosome I
Agrobacterium tumefaciens str. C58 Ti
Brucella suis 1330 chromosome I complete
Buchnera aphidicola str. Sg (Schizaphis
graminum) complete genome.
Buchnera sp. APS complete genome.
Campylobacter jejuni complete genome.
Caulobacter crescentus complete genome.
Chlamydia muridarum
Chlamydia trachomatis complete genome.
Agrobacterium tumefaciens strain C58
Chlamydophila pneumoniae AR39
Chlamydophila pneumoniae J138
Chlorobium tepidum TLS complete
Clostridium acetobutylicum ATCC824
Clostridium perfringens 13 DNA
Corynebacterium efficiens YS-314
Corynebacterium glutamicum ATCC 13032
Deinococcus radiodurans R1 complete
Agrobacterium tumefaciens strain C58
Deinococcus radiodurans R1 megaplasmid
Deinococcus radiodurans R1 plasmid CP1
Escherichia coli CFT073 complete genome.
Escherichia coli K-12 MG1655 complete
Escherichia coli O157:H7
Escherichia coli O157:H7
Fusobacterium nucleatum subsp. nucleatum
Haemophilus influenzae Rd complete
Halobacterium sp. NRC-1 complete
Halobacterium sp. NRC-1 plasmid
Agrobacterium tumefaciens strain C58
Halobacterium sp. NRC-1 plasmid
Helicobacter pylori
Helicobacter pylori 26695 complete
Lactococcus lactis subsp. lactis IL1403
Leptospira interrogans serovar lai str. 56601
Listeria innocua Clip11262
Listeria innocua plasmid pLI100
Listeria monocytogenes strain EGD
Mesorhizobium loti complete genome
Mesorhizobium loti plasmid pMLa DNA
Agrobacterium tumefaciens strain C58
Mesorhizobium loti plasmid pMLb DNA
Methanobacterium thermoautotrophicum
Methanococcus jannaschii complete
Methanopyrus kandleri AV19
Methanosarcina acetivorans str. C2A
Methanosarcina mazei strain Goe1
Mycobacterium leprae strain TN complete
Mycobacterium tuberculosis CDC1551
Agrobacterium tumefaciens strain C58
Mycobacterium tuberculosis complete
Mycoplasma genitalium G37 complete
Mycoplasma pneumoniae M129
Neisseria meningitidis serogroup A strain
Neisseria meningitidis serogroup B strain
Nostoc sp. PCC 7120 complete genome.
Nostoc sp. PCC 7120 plasmid
Nostoc sp. PCC 7120 plasmid
Aeropyrum pernix K1 complete genome.
Agrobacterium tumefaciens str. C58 linear
Agrobacterium tumefaciens str. C58 AT
Agrobacterium tumefaciens str. C58 Ti plasmid
Agrobacterium tumefaciens strain C58 circular
Agrobacterium tumefaciens strain C58 circular
Agrobacterium tumefaciens strain C58 linear
Agrobacterium tumefaciens strain C58 plasmid
Agrobacterium tumefaciens strain C58 plasmid
Aquifex aeolicus complete genome.
Aquifex aeolicus plasmid ece1
Archaeoglobus fulgidus complete genome.
Bacillus halodurans C-125
Bacillus subtilis complete genome.
Bifidobacterium longum NCC2705 complete
Borrelia burgdorferi.
Borrelia burgdorferi plasmid cp9
Borrelia burgdorferi plasmid cp26
Borrelia burgdorferi plasmid cp32-1
Borrelia burgdorferi plasmid cp32-3
Borrelia burgdorferi plasmid cp32-4
Borrelia burgdorferi plasmid cp32-6
Borrelia burgdorferi plasmid cp32-7
Borrelia burgdorferi plasmid cp32-8
Borrelia burgdorferi plasmid cp32-9
Borrelia burgdorferi plasmid lp5
Borrelia burgdorferi plasmid lp17
Borrelia burgdorferi plasmid lp21
Borrelia burgdorferi plasmid lp25
Borrelia burgdorferi plasmid lp28-1
Borrelia burgdorferi plasmid lp28-2
Borrelia burgdorferi plasmid lp28-3
Borrelia burgdorferi plasmid lp28-4
Borrelia burgdorferi plasmid lp36
Borrelia burgdorferi plasmid lp38
Borrelia burgdorferi plasmid lp54
Borrelia burgdorferi plasmid lp56
Bradyrhizobium japonicum strain USDA110
Brucella melitensis strain 16M chromosome I
Brucella suis 1330 chromosome I complete
Buchnera aphidicola str. Sg (Schizaphis
graminum) complete genome.
Buchnera sp. APS complete genome.
Buchnera sp. APS plasmid pLeu DNA
Buchnera sp. APS plasmid pTrp DNA
Campylobacter jejuni complete genome.
Caulobacter crescentus complete genome.
Chlamydia muridarum
Chlamydia muridarum plasmid pMoPn
Chlamydia trachomatis complete genome.
Chlamydophila pneumoniae AR39
Chlamydophila pneumoniae J138
Chlorobium tepidum TLS complete genome.
Clostridium acetobutylicum ATCC824
Clostridium acetobutylicum megaplasmid
Clostridium perfringens 13 DNA
Clostridium perfringens plasmid pCP13 DNA
Corynebacterium efficiens YS-314
Corynebacterium glutamicum ATCC 13032
Deinococcus radiodurans R1 complete
Deinococcus radiodurans R1 megaplasmid MP1
Deinococcus radiodurans R1 plasmid CP1
Escherichia coli CFT073 complete genome.
Escherichia coli K-12 MG1655 complete
Escherichia coli O157:H7
Escherichia coli O157:H7
Fusobacterium nucleatum subsp. nucleatum
Haemophilus influenzae Rd complete genome.
Halobacterium sp. NRC-1 complete genome.
Halobacterium sp. NRC-1 plasmid pNRC100
Halobacterium sp. NRC-1 plasmid pNRC200
Helicobacter pylori
Helicobacter pylori 26695 complete genome.
Lactococcus lactis subsp. lactis IL1403
Leptospira interrogans serovar lai str. 56601
Listeria innocua Clip11262
Listeria innocua plasmid pLI100
Listeria monocytogenes strain EGD
Mesorhizobium loti complete genome
Mesorhizobium loti plasmid pMLa DNA
Mesorhizobium loti plasmid pMLb DNA
Methanobacterium thermoautotrophicum delta
Methanococcus jannaschii complete genome.
Methanococcus jannaschii large extra-
Methanococcus jannaschii small extra-
Methanopyrus kandleri AV19
Methanosarcina acetivorans str. C2A
Methanosarcina mazei strain Goe1
Mycobacterium leprae strain TN complete
Mycobacterium tuberculosis CDC1551
Mycobacterium tuberculosis complete genome.
Mycoplasma genitalium G37 complete genome.
Mycoplasma penetrans
Mycoplasma pneumoniae M129
Mycoplasma pulmonis (strain UAB CTIP)
Neisseria meningitidis serogroup A strain
Neisseria meningitidis serogroup B strain
Nostoc sp. PCC 7120 complete genome.
Nostoc sp. PCC 7120 plasmid pCC7120alpha
Nostoc sp. PCC 7120 plasmid
Nostoc sp. PCC 7120 plasmid
Nostoc sp. PCC 7120 plasmid
Nostoc sp. PCC 7120 plasmid
Nostoc sp. PCC 7120 plasmid
Oceanobacillus iheyensis
Pasteurella multocida PM70 complete
Pseudomonas aeruginosa PAO1
Pseudomonas putida KT2440
Pyrobaculum aerophilum strain IM2
Pyrococcus abyssi complete genome.
Pyrococcus furiosus DSM 3638
Pyrococcus horikoshii OT3 complete
Ralstonia solanacearum GMI1000
Ralstonia solanacearum GMI1000
Rickettsia conorii Malish 7
Rickettsia prowazekii strain Madrid E
Salmonella enterica serovar Typhi
Salmonella enterica serovar Typhi
Salmonella enterica serovar Typhi
Salmonella typhimurium LT2
Salmonella typhimurium LT2 strain
Shewanella oneidensis MR-1 complete
Shewanella oneidensis MR-1
Shigella flexneri 2a str. 301 complete
Sinorhizobium meliloti 1021 complete
Sinorhizobium meliloti plasmid pSymA
Sinorhizobium meliloti plasmid pSymB
Staphylococcus aureus MW2
Staphylococcus aureus strain Mu50
Staphylococcus aureus plasmid VRSAp
Staphylococcus aureus strain N315
Staphylococcus aureus subsp. aureus
Streptococcus agalactiae
Streptococcus agalactiae NEM316
Streptococcus mutans UA159 complete
Streptococcus pneumoniae complete
Streptococcus pneumoniae R6 complete
Streptococcus pyogenes MGAS315
Streptococcus pyogenes strain
Streptococcus pyogenes strain SF370
Streptomyces coelicolor A3(2)
Streptomyces coelicolor plasmid SCP1.
Streptomyces coelicolor plasmid SCP2.
Sulfolobus solfataricus complete
Sulfolobus tokodaii complete genome.
Synechocystis PCC6803 complete
Thermoanaerobacter tengcongensis
Thermoplasma acidophilum
Thermoplasma volcanium
Thermosynechococcus elongatus BP-1
Thermotoga maritima complete
Treponema pallidum complete genome.
Ureaplasma urealyticum complete
Vibrio cholerae chromosome I
Vibrio vulnificus CMCP6 chromosome
Wigglesworthia brevipalpis
Xanthomonas axonopodis pv. citri str.
Xanthomonas campestris pv. campestris
Xylella fastidiosa
Xylella fastidiosa plasmid pXF1.3
Xylella fastidiosa plasmid pXF51
Xylella fastidiosa plasmid pXF868
Yersinia pestis KIM complete genome.
Yersinia pestis strain CO92
Yersinia pestis plasmid pCD1.
Yersinia pestis plasmid pPCP1.
Yersinia pestis plasmid pPMT1.
Chlamydophila pneumoniae CWL029
Nitrosomonas europaea
Prochlorococcus marinus MED4
Prochlorococcus marinus MIT9313
Rhodopseudomonas palustris
Staphylococcus epidermidis ATCC
Synechococcus
Clostridium tetani E88, complete
Lactobacillus plantarum WCFS1,
Pseudomonas syringae pv. tomato str.
Streptococcus pyogenes phage 315.3
Tropheryma whipplei str. Twist
Vibrio parahaemolyticus RIMD
Xylella fastidiosa Temecula1
Agrobacterium tumefaciens strain C58 circular chromosome
Bacillus subtilis complete genome.
Campylobacter jejuni complete genome.
Corynebacterium glutamicum ATCC 13032
Escherichia coli K-12 MG1655 complete genome.
Helicobacter pylori
Pseudomonas aeruginosa PAO1
Pseudomonas putida KT2440
Table A gives preferred values for some of the parameters of the invention.
Y. pestis, HIV1, B.
A feature of this invention is our newly developed ability to calculate the occurrence of all sub-sequences within DNA genomes. The present invention builds on this ability with the recognition that if we can calculate the occurrence of all sub-sequences in the human genome, then we can design diagnostics targeting only sequences known to be absent from that genome, thereby avoiding the enormous background of human DNA which plagues many clinical diagnostic activities.
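The idea of keeping only sub-sequences that occur in a target genome and are absent from a background genome can be illustrated with a small set-based sketch. This is not the algorithm of the invention, which is not reproduced in this text; the function names (`nmers_present`, `target_only_nmers`) and the toy sequences are hypothetical, and a plain Python set would not scale to a full human genome.

```python
# Illustrative sketch only: set difference of n-mers, counted on both strands.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an upper-case A/C/G/T string."""
    return seq.translate(COMPLEMENT)[::-1]


def nmers_present(seq, n):
    """Set of distinct n-mers occurring on either strand of seq.

    Windows containing anything other than A, C, G, T are skipped.
    """
    seq = seq.upper()
    found = set()
    for i in range(len(seq) - n + 1):
        word = seq[i:i + n]
        if set(word) <= set("ACGT"):
            found.add(word)
            found.add(reverse_complement(word))
    return found


def target_only_nmers(target, background, n):
    """n-mers present in the target and absent from the background."""
    return nmers_present(target, n) - nmers_present(background, n)


if __name__ == "__main__":
    background = "ACGTACGTAC"   # toy stand-in for a host genome
    target = "ACGTTTGA"         # toy stand-in for a pathogen genome
    print(sorted(target_only_nmers(target, background, 4)))
```

In this toy case the 4-mer ACGT is shared with the background and is therefore excluded, while the remaining target 4-mers are retained as candidates.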
There are many other applications of this kind, and they fall into two general categories. The first, which we will call the “rifle shot” application, consists of targeting a known organism or sequence of interest in the presence of (usually an excessive quantity of) some background DNA. For example, we could use the known sequences of the hepatitis C and human genomes to design primers which PCR-amplify hepatitis C nucleic acids while not amplifying human DNA. It bears emphasis that this design is relatively easy once the enormous computational achievement of identifying all the primers which can amplify human DNA has been accomplished. There are many extensions and modifications of this class of application. First, it should be noted that the human genome is not a single entity but includes a large range of variation known as single nucleotide polymorphisms (“SNPs”). These variations can be taken into account in the computations, and the computations will be updated as more are reported. For the moment, the computations will normally produce candidate probes and primers that will then be experimentally tested because of variations in hybridization efficiency and the occurrence of unknown genetic variations in humans. In addition to designing PCR primers or DNA probes for known pathogens, it is also possible to identify organisms of agricultural interest in animal-derived samples, detect pathogens in meat (e.g. avoiding cow sequences, once all or most of the cow genome is known), track organisms of environmental interest, distinguish a small quantity of human Y-chromosome-derived material from excess female genetic material in forensic cases, diagnose plant diseases, monitor malaria parasite DNA in mosquitoes, test for contamination in transplant tissues and pharmaceutical manufacturing samples, etc.
The other class of applications is those in which the background sequences are known, but the precise sequence to be looked for is not yet completely identified. These we term “butterfly net” applications. One recent example would be the outbreak of SARS. In principle we could design rather large sets of primers capable of randomly amplifying non-human DNA present in human blood samples. Application of this PCR primer set to blood samples taken from SARS patients could result in the amplification of previously unknown DNA sequences that would be the SARS genome, rapidly advancing the identification and the diagnosis of this disease. Biodefense and bioterrorism applications of this approach are also evident. Another butterfly net application of historical importance would be the identification of Helicobacter pylori as a causative agent of human gastric ulcers. In this case, we could have amplified Helicobacter genomic DNA from biopsy specimens (or possibly gastric, intestinal, or colon content samples). Currently interesting applications include identification of emerging diseases and biodefense agents, and causative agents of upper respiratory infections, vascular disease, cancers, etc.
It bears emphasis that specific primers and probes to be developed here can be used in a variety of manifestations. In general, these divide into amplification schemes and hybridization assays.
Amplification schemes will include especially PCR but also RT-PCR, NASBA, LCR, transcription, and a variety of other known amplification schemes. It also bears notice that RNA and DNA may be interconverted as an early step in the assay, and that isolation of nucleic acids may in many cases be quite useful to facilitate the hybridization or reactions used, by concentration and removal of inhibitors.
The second general class of applications is hybridization assays, in which sequences are detected directly. Hybridization may be preceded by several purification and/or amplification steps. Hybridization may be performed in a spotted or grown array of probes, and these probes may be oligonucleotides, PCR products or spotted natural nucleic acids. Arrays may commonly be in the usual planar or 3-D gel pad format, but they may also consist of arrays of beads scored either by imaging or cytometric technique. Both probes and primers may consist of non-natural nucleic acids or nucleic acid analogs. Modifications may include alterations of backbone, fluorescence reporters, beacon probes, aminopurine and other fluorescent reporter nucleotides, and perhaps most importantly variations such as LNAs and PNAs.
In most applications the computational techniques based on the Fofanov R-Q set calculations will yield a candidate set of primers of interest which will then be polished experimentally to avoid difficulties with unknown variations in background DNA or unexpected behavior in hybridization. Polishing of the candidate set will usually be experimental and will consist of hybridization or test amplification reactions. The calculations will produce an abundance of candidates, and it is possible and desirable to apply additional criteria, especially for predicted hybridization/melting temperature, mismatch location, etc. Constraining the range of predicted melting temperature of probes or primers facilitates their use in subsequent amplifications and assays, and this constraint is fairly easily applied by searching among the candidate probes or primers for those containing similar G+C content, or by more sophisticated melting temperature estimates which take into account neighbor-neighbor interactions (e.g., using nearest neighbor parameters published in Breslauer, Frank, Blöcker and Marky, Proc. Natl. Acad. Sci. USA, vol 83, pp 3746-3750) and mismatch (MM) location. In solution-based assays such as PCR, it will often be important to filter sequences by possible complementarity to each other, by separation distance of targets, or by chromosomal/episomal location of the target sequence.
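Two of the filters mentioned above, similar G+C content and complementarity of candidates to each other, can be sketched as follows. This is a simplified illustration under stated assumptions: the G+C window, the 3'-end length `k`, and all function names are hypothetical, exact base pairing stands in for real hybridization behavior, and the nearest-neighbor melting temperature estimate is not implemented here.

```python
# Illustrative sketch only: G+C window filter plus a crude 3'-end pairing check.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]


def gc_fraction(seq):
    """Fraction of G and C bases in the sequence."""
    return sum(base in "GC" for base in seq) / len(seq)


def filter_by_gc(candidates, low, high):
    """Keep candidates whose G+C fraction lies in [low, high]."""
    return [c for c in candidates if low <= gc_fraction(c) <= high]


def can_pair_at_3prime(a, b, k=4):
    """True if the last k bases of a are exactly complementary to a stretch of b."""
    return reverse_complement(a[-k:]) in b


def drop_cross_complementary(candidates, k=4):
    """Greedily keep candidates that do not pair with themselves or with kept ones."""
    kept = []
    for c in candidates:
        partners = kept + [c]  # include c itself to catch self-pairing
        if all(not can_pair_at_3prime(c, p, k) and not can_pair_at_3prime(p, c, k)
               for p in partners):
            kept.append(c)
    return kept


if __name__ == "__main__":
    pool = ["ACGTGGCA", "TTTTAAAA", "GGCCTTGA", "TCAAGGAC"]
    similar_gc = filter_by_gc(pool, 0.4, 0.7)
    print(drop_cross_complementary(similar_gc))
```

The greedy pass is order dependent; it is meant only to show where such a filter would sit between candidate generation and experimental polishing.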
It is worth noting that there are a variety of modifications and extensions of the technology, including miniaturization, parallelization, scoring by beacon fluorescence, scoring by radionuclide incorporation, and scoring of hybridization by fluorescence, by antibody, by chemiluminescence, or by electrical signals such as capacitance, impedance or conductivity. A huge variety of technologies are known in the literature for conducting amplification and/or hybridization-based assays, and nearly all of these are applicable with the present invention.
Another class is the compositions of matter of the primers of interest, especially primers known not to hybridize to or amplify known human sequences including the publicly available human genome sequence and its SNP-derived variants (or primers selected not to hybridize to other background(s) which may be present). A particularly interesting group of primers are those of selected length (e.g., 6-, 9-, 14-, 15-, 16-, 17-, or 24-mers) selected for similar melting temperatures to make a set of primers and/or probes which could be used under similar conditions and in a parallel or high-throughput manner. These sequences have known utility and are well defined in terms of composition; we submit them on CD to accommodate the large numbers which we will identify.
The database of subsequences of microbial and viral genomes (6-20+-mers) will have broad applications by a variety of communities. In addition to the area of microbial identification, other areas of application include evolution and phylogeny, genome stability, viral strain origins, environmental microbiology, infectious diseases, food safety, DNA-binding proteins, restriction and modification systems (and practical applications in molecular biology), siRNA, forensics, biodefense, and ribosome function.
The database of subsequences of microbial and viral genomes (6-25+-mers) that are absent in the human genome will provide convenient and accurate genomic signatures for a large and ever increasing number of microbial and viral genomes. One can quickly obtain these signatures and develop a set of primers or probes to target the microbial or viral species of interest. Furthermore, since the subset of subsequences that are present in the human genome is eliminated, one can develop probing methods not requiring removal of human tissue from the sample. Any research lab that is interested in identifying and investigating one or more microbial or viral species can benefit from this resource.
With our database of genome subsequences, one can develop a set of microarray probes which will provide a distinctive pattern in the presence of any known pathogenic microbial or viral species, even without isolation of pathogen DNA from infected human tissue. Similar applications in mRNA expression monitoring can be envisioned, where RNA processing gives rise to sequences not present in the genomic DNA, but which can be calculated using standard bioinformatics methods, or in many cases accessed from cDNA databases.
We develop a set of novel algorithms that make it possible to analyze the occurrence frequency of all short subsequences (“n-mers”) of length 5 to 25+ nucleotides in any genome within a reasonable time (minutes). These algorithms are used to perform a comparative statistical analysis of the presence of all possible “n-mers” in genomes of more than 250 microbial, viral and multicellular organisms (including humans)26,27. The results show a remarkable similarity of presence/absence distributions for different n-mers in all genomes. This suggests that the presence/absence distribution of n-mers in all genomes considered (provided that the condition M<<4^n holds, where M is the total genome sequence length) can be treated as nearly random. The massive computational analysis of the presence/absence of short subsequences in more than one genome simultaneously is performed for all published (before May, 2002) microbial and virus genomes28,29, and is repeated for the 1600+ genomes which were available by May 2003. Results show that for organisms that are not close relatives, the presence/absence of different 7-20-mers in their genomes are not correlated. For close biological relatives, some correlation of the presence of n-mers appears, but is not as strong as expected. The low level of correlation among the n-mers present in different genomes leads to the possibility of using random (quasi-random) sets of n-mers (with appropriately chosen n) to discriminate genomes of different organisms (and possibly individual genomes of the same species including humans) with a low probability of error29,30.
We work on the first generation of microbial/viral biosensor microarrays based on these special properties of genomic n-mers. The general long-term object is to develop laboratory technology to rapidly and conclusively identify any organism. As a part of this effort, considerable improvement of the n-mers computation algorithms is achieved. It is now possible to exclude in the array design step all subsequences which are present in the human genome and publicly available human SNP database. This improvement makes it possible to refocus the project on development of “human-blind” probes and primers, instead of just random primers. This new approach can readily be extended to develop assays which are insensitive to the background of a host such as a food animal or plant, host organism, or any environmental background for which genomic sequence information is available. It can be used to design and manufacture a DNA microarray which will allow one to distinguish pathogen (microbial or viral) DNA in a sample that contains host DNA (for instance, blood and tissue samples from an animal or human). Indeed, in the event of microbial infection, it would be of value to have a diagnostic system that could readily identify any bacterium that is present in the host organism, regardless of prior expectations of what might be found. It would be essential to determine, as closely as possible, the genetic identity of the organism (microbial or viral) that is causing the problem, as this will clarify where the organism came from, what treatments are likely to be effective, etc. Since more than 1600 genomes have been collected at the UH Bioinformatics lab and more than 72,600 16S rRNA sequences are available from the RDP, it appears possible to find all “human blind” subsequences of size 8 to 25+.
Critical to development of this technology is the ability to identify and use sequences which do not hybridize to human DNA, including known SNPs (there will be some false positives due to unknown SNPs, but these will be few and the number will decrease with SNP database growth). The basic idea is to use a set of human-blind signature n-mer sequences for microbial/viral identification. Even if each such sequence is not exclusive, and could be present in many genomes, the combination of sequences present is exclusive for each genome. As a consequence, it is possible to use a random subset of human-blind n-mers (with appropriately chosen n) to design a microarray to diagnose to which organism a given DNA/RNA sample belongs. As shown in refs. 29,30, the size of this microarray would be several thousand oligonucleotides, which is well within the range of current experimental methods.
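The point that a combination of non-exclusive n-mers can still be exclusive for a genome can be sketched as a presence/absence signature lookup. The following is a toy illustration, not the array design of the invention: the organism names, sequences and probes are made up, exact substring matching on one strand stands in for hybridization, and nearest-signature matching by Hamming distance is an assumed decision rule.

```python
# Illustrative sketch only: identify a sample by its probe presence/absence pattern.

def signature(sequence, probes):
    """Presence/absence pattern: 1 if the probe occurs in the sequence, else 0."""
    return tuple(int(p in sequence) for p in probes)


def hamming(a, b):
    """Number of positions at which two equal-length patterns differ."""
    return sum(x != y for x, y in zip(a, b))


def identify(sample, references, probes):
    """Return the reference name whose signature is closest to the sample's."""
    observed = signature(sample, probes)
    return min(
        references,
        key=lambda name: hamming(observed, signature(references[name], probes)),
    )


if __name__ == "__main__":
    # Hypothetical reference sequences and probe set.
    references = {
        "org_A": "ACGTACGGTTCA",
        "org_B": "TTGACCATGGCA",
    }
    probes = ["ACGG", "CCAT", "GGTT", "TGAC"]
    for name, seq in references.items():
        print(name, signature(seq, probes))
    print(identify("GTACGGTTC", references, probes))
```

Because only a yes/no call per probe is needed, the comparison reduces to matching short binary patterns, which is why a few thousand probes can in principle separate a large number of genomes.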
An important advantage of this approach is that it can be used without a priori knowledge of the genomic sequence of the target organism. In other words, one can identify organisms for which the total genome sequences have not yet been determined. It is sufficient to obtain the purified DNA, hybridize it on the microarray chip and check which n-mers show up (they provide a signature of an organism on our chip). Taking into account the availability of standard samples of microbes and viruses, how easily each microarray can be produced, and the fact that one does not need to determine quantitative values of expression (only a yes/no answer is required), a universal human-blind microbial/viral DNA identification chip can be readily produced.
The amount of computation necessary to find all the short subsequences present in any target genome, and absent in the human or another host genome, is substantial and cannot be readily performed whenever needed. Given that those interested in such data will typically be experimental laboratories rather than computational ones, the need to provide public access to the results is clear. Access to various pathogen genomes, the ability to find information regarding the location of each signature in each genome, and the ability to find sets of unique signatures and PCR primers that are expected to allow ultra-specific amplification of microbial genetic material will provide useful information to the scientific community. This technology may be useful in developing new nucleic acid-based diagnostics, forensic assays, and veterinary and agricultural assays. It will also have considerable utility in rapidly identifying unknown pathogens or bio-warfare agents.
Statistical analysis of the appearance of short subsequences of length n, called motifs or n-mers, in different DNA sequences from individual genes to full genomes, is of interest in terms of evolutionary biology, analysis of regulatory sites on DNA promoter sequences, etc. In addition, knowledge of the distribution of presence of n-mers is necessary for PCR primer32,33 and microarray probe design34. Several attempts35-41 have been made to employ the frequency distributions of n-mers to identify species with relatively short genome sizes (microbes). In such an approach, the shape of the frequency distribution for particular short subsequences (2-4-mers35-39,42 and 8-9-mers40,41) has been proposed as a criterion to determine which microbial genome is present, based on a given random piece of a genome or a whole genome. It has been reported that when genome size M is greater than 4^n, the appearances of n-mers in various genomes are not random35,36,38,39,43. The basic motivation of our recent analysis was to explore the statistical properties of the presence of longer n-mers if genome size M is much less than 4^n 26,27,29-31.
We examine the number of all distinct 5-20-mers present in more than 1600 viral and microbial genomes available26,27,29-31. We also analyze the genomes of several multicellular organisms, including human44. Tables 1 and 2 contain typical results for some of the analyzed genomes (microbial and viral), for n=8 and 12. The numbers of 7-20 mers present in the human genome can be found in Table 3.
Bacillus subtilis
Escherichia coli K12
Salmonella typhimurium
Staphylococcus aureus
Streptococcus pneumoniae
Thermoplasma volcanium
Ureaplasma urealyticum
When n increases, the total number of possible n-mers (4^n) strongly exceeds the total sequence length M and most of the possible n-mers do not appear at all because the maximum number of n-mers contained in this sequence is M−n+1≈M. Moreover, for a reasonably high ratio, 4^n/M, most of the n-mers which appear tend to appear only once, in accordance with the fact that the number of n-mers present becomes very close to M. This is why we have chosen to use the statistics for “presence/absence” (frequency of presence) in our analysis instead of the usual “frequency of appearance”, which is reasonable for short n-mers (total sequence length M>4^n).
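The distinction between “frequency of appearance” and “frequency of presence” can be made concrete with a short counting sketch. The function names are hypothetical, only one strand is counted, and the input is assumed to contain only A, C, G and T.

```python
# Illustrative sketch only: presence/absence statistics for n-mers on one strand.
from collections import Counter


def nmer_counts(seq, n):
    """Number of appearances of every n-mer in the M - n + 1 windows of seq."""
    seq = seq.upper()
    return Counter(seq[i:i + n] for i in range(len(seq) - n + 1))


def presence_statistics(seq, n):
    """Summarize how many n-mers are present and how many appear exactly once."""
    counts = nmer_counts(seq, n)
    distinct = len(counts)
    return {
        "windows": max(len(seq) - n + 1, 0),           # M - n + 1
        "distinct": distinct,                          # n-mers present at least once
        "singletons": sum(1 for c in counts.values() if c == 1),
        "frequency_of_presence": distinct / 4 ** n,    # fraction of all 4^n n-mers
    }


if __name__ == "__main__":
    toy = "ACGTACGTTT"
    for n in (2, 4):
        print(n, presence_statistics(toy, n))
```

As n grows relative to the sequence length, the number of distinct n-mers approaches the number of windows and most of them become singletons, which is the regime in which a yes/no presence statistic carries essentially all of the information.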
Based on data from Example 2,
In fact, the frequencies of presence of n-mers, f, in various genomes nearly belong to the same universal curve representing the “random boundary” (always being below it). Assuming equal probabilities of appearance of every nucleotide, one can analytically find (in full agreement with the Monte-Carlo simulations which we also performed as an independent test),
where f0 is M, x is the ratio of the total number of possible n-mers to the number of n-mers in the sequence in consideration, M−n+1≈M. This “random boundary” is shown in
The relative deviation
of real results from the random boundary can be used as a definition of “non-randomness”, or “self-similarity” of a given genome.
Corresponding data for several genomes are given in Tables 1-2 and in supplementary data to Refs.30,31. We observe that, in general, shorter genomes are more random (based on this definition) than longer ones.
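A rough numerical picture of a random boundary can be obtained with a small Monte-Carlo experiment. The closed-form expression used below, 1 − (1 − 4^−n)^(M−n+1), is a standard occupancy approximation for equiprobable nucleotides that neglects the overlap between neighboring windows; it is offered as an assumption for illustration and is not claimed to be the expression referred to in the text above. The function names and parameters are hypothetical.

```python
# Illustrative sketch only: expected vs. simulated fraction of n-mers present
# in a random sequence with equiprobable nucleotides (single strand).
import random


def expected_presence_fraction(M, n):
    """Approximate expected fraction of the 4^n n-mers present in M - n + 1 windows."""
    windows = M - n + 1
    return 1.0 - (1.0 - 4.0 ** -n) ** windows


def simulated_presence_fraction(M, n, seed=0):
    """Fraction of the 4^n n-mers present in one random sequence of length M."""
    rng = random.Random(seed)
    seq = "".join(rng.choice("ACGT") for _ in range(M))
    present = {seq[i:i + n] for i in range(M - n + 1)}
    return len(present) / 4 ** n


if __name__ == "__main__":
    M = 20000
    for n in (6, 8, 10):
        print(n, expected_presence_fraction(M, n), simulated_presence_fraction(M, n))
```

A real genome can be compared against such a curve by computing its own fraction of n-mers present at the same M and n; how the deviation is normalized is left open here.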
As shown in
The analysis of the presence/absence of short subsequences in several genomes simultaneously was performed for 250+ microbial, viral, and multicellular organism genomes26,27,29-31, and we further repeated our calculations for the 1600+ genomes. We examined all possible short nucleotide subsequences of various lengths (5<n<20). To the best of our knowledge, no such studies appear in the literature for n>11. This type of calculation is challenging for larger n because of the exponential growth of time/memory required by brute force algorithms. To be able to perform calculations for n>11, new algorithms and specialized data structures have been developed and implemented26,27. Our goal was to find out how independent/correlated the appearances of n-mers are in different genomes. To answer this question we used the multiplication property for the joint probability of the intersection of events, according to which two events A and B can be treated as independent if p(A∩B)=p(A)p(B).
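The memory growth mentioned above can be seen in a direct-addressing presence table, in which each of the 4^n possible n-mers owns one bit. The sketch below is a generic brute-force structure with a rolling 2-bit encoding, shown only to illustrate the scaling; it is not the specialized data structure of refs. 26,27, and the function names are hypothetical.

```python
# Illustrative sketch only: one presence bit per possible n-mer (single strand).
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}


def presence_bitmap(seq, n):
    """Return a bytearray with bit v set when the n-mer encoded as v occurs in seq."""
    bitmap = bytearray((4 ** n + 7) // 8)   # 4^n bits, rounded up to whole bytes
    mask = 4 ** n - 1                       # keeps only the last n bases (2 bits each)
    value = 0
    valid = 0                               # consecutive valid bases seen so far
    for base in seq.upper():
        if base not in CODE:                # ambiguous base: restart the window
            value, valid = 0, 0
            continue
        value = ((value << 2) | CODE[base]) & mask
        valid += 1
        if valid >= n:
            bitmap[value >> 3] |= 1 << (value & 7)
    return bitmap


def count_present(bitmap):
    """Number of set bits, i.e. distinct n-mers present."""
    return sum(bin(byte).count("1") for byte in bitmap)


if __name__ == "__main__":
    bits = presence_bitmap("ACGTACGTTT", 2)
    print(count_present(bits), "of", 4 ** 2, "possible 2-mers present")
    print("bytes needed for n=12:", (4 ** 12 + 7) // 8)
```

The table needs 4^n/8 bytes, about 2 MiB for n=12, and grows by a factor of four with each additional base, which is why plain bitmaps stop being practical well before n=20.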
Consider a simple example based on 3 different genomes: (1) Salmonella typhi (NC_003198), (2) Mycobacterium tuberculosis H37Rv (NC_000962), and (3) Bacillus subtilis (NC_000964). A complete set of n-mers would contain 4^n n-mers, which, for n=12, is 4^12=16,777,216. We use both strands of the complete genome sequences for our calculations, and N(n,G) stands for the number of different n-mers in genome G. In Table 4 we present the number of different 12-mers that occur in each of these three genomes.
Based on the data of Examples 1-4, to estimate the probability of finding randomly chosen 12-mers in each genome, we calculate the frequency of presence of 12-mers in each genome (p=N(12,G)/4^n, with n=12). These values are also presented in Table 4. In each genome, only a moderate fraction of the 4^n possible sequences is present.
The number N(n, G1, G2) of n-mers (n=12) that appear in each pair of genomes (G1, G2) was also computed (Table 5). Based on this we were able to compare the probabilities of finding randomly picked 12-mers in each pair of genomes with the probabilities calculated using the multiplication rule. As is shown in Table 5, the actual and calculated (expected) probabilities do not differ greatly from each other. This allows us to treat the presence/absence of randomly picked 12-mers in these 3 genomes as independent events.
Based on Example 5, we calculate the actual and expected probabilities for each pair of genomes in the three groups (microbes, DNA and RNA based viruses)29-31 for values of n from 7 to 20. For all organisms which are not extremely closely related to each other, we find that in a certain range of n the presence/absence of n-mers can be treated as almost independent events. Moreover, this property is valid for almost any subset of n-mers such as, for instance, n-mers with a given C+G content, a subset of n-mers chosen according to Tm (melting temperature), etc.
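The comparison of actual and expected pairwise probabilities can be sketched as below. The function name and the toy 2-mer sets are hypothetical and do not reproduce any values from Tables 4 or 5; the sketch only shows how the multiplication rule p(A∩B)=p(A)p(B) would be checked once the sets of n-mers present in each genome are known.

```python
def independence_check(nmers_a, nmers_b, n):
    """Compare observed and expected joint presence for two genomes.

    nmers_a, nmers_b: sets of distinct n-mers present in each genome.
    Returns (p_a, p_b, observed_joint, expected_joint), each expressed
    as a fraction of all 4**n possible n-mers.
    """
    total = 4 ** n
    p_a = len(nmers_a) / total
    p_b = len(nmers_b) / total
    observed = len(nmers_a & nmers_b) / total
    expected = p_a * p_b  # value predicted if presence is independent
    return p_a, p_b, observed, expected


# Hypothetical toy sets of 2-mers (16 possible), for illustration only.
genome_a = {"AA", "AC", "AG", "AT", "CA", "CC", "CG", "CT"}
genome_b = {"AA", "AC", "CA", "CC", "GA", "GC", "TA", "TC"}
p_a, p_b, observed, expected = independence_check(genome_a, genome_b, 2)
```

An observed value close to the expected one is consistent with independent presence; closely related genomes would show an observed value well above the product.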
Our calculations show that very similar results can be achieved using a “host blind” subset of n-mers chosen not to be present in the human genome. The only problem in this case occurs not because of the amount of host DNA, but because of the large difference between the lengths (sequence complexities) of human and microbial DNAs. For example, using 11-mers to detect microbes against a human DNA background is nearly impossible, because practically all 11-mers are present in the human genome. Nevertheless, it can be done by using 14- or 15-mers (only 84.5% of 14-mers and 61.7% of 15-mers are present in the human genome; see Table 3 for more details). In this case one needs to increase the number of probes on the microarray to 10,000-15,000, which is technically possible. Such an approach also can be used to detect unknown viruses, but it requires the use of a still larger (but technically feasible) array.
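A “host blind” selection of this kind can be sketched as a set difference. The helper names and the toy sequences are hypothetical; for a real host genome and n of 14 or 15, a memory-efficient structure would be needed in place of plain Python sets.

```python
def revcomp(seq):
    """Reverse complement of an upper-case DNA string."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def both_strand_nmers(seq, n):
    """Distinct n-mers on the given strand and its reverse complement."""
    nmers = set()
    for strand in (seq, revcomp(seq)):
        nmers.update(strand[i:i + n] for i in range(len(strand) - n + 1))
    return nmers


def host_blind_nmers(target, host, n):
    """n-mers present in the target but absent from the host (either strand)."""
    return sorted(both_strand_nmers(target, n) - both_strand_nmers(host, n))


# Toy stand-ins for a host sequence and a microbial target sequence.
host = "AAAAAA"
target = "AAACCC"
blind = host_blind_nmers(target, host, 3)
```

The longer n is chosen, the larger the fraction of n-mers that survive this filter, which is why 14- or 15-mers are workable against a human background where 11-mers are not.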
As described above, the steady accumulation of genomic sequence information can transform the utility of DNA probes for microbial identification. An array (or set of PCR primers) designed to be blind to the background DNA allows extremely efficient and powerful microbial identification. The important advantage of such an approach is that it can be used without a priori knowledge of the target organism's genomic sequence. This implies that there is no need to perform the expensive and time-consuming process of sequencing before array design. It is enough to obtain the purified DNA, hybridize it to a sufficiently random microarray chip and check which n-mers are present. Considering the rapid growth in genomic databases, the increasing availability of custom DNA probe arrays, and the fact that we do not need to determine quantitative values of expression (we need just a yes/no answer) we believe it is possible to produce an essentially universal microbial/viral DNA chip. The oligonucleotide calculations and database proposed here would radically advance microbial detection and identification, as well as serving a wide variety of applications in the analysis of genome structure, evolution, and the regulation of gene expression.
Subsequences of each of the known sequences of Ebola virus are determined, and sequences common to nearly all strains of the virus and not found in the human sequence or known SNPs are identified. Sequences absent from the human genome and from all but one known Ebola sequence are also identified. All-strain- and single-strain-characteristic complementary probe sequences are immobilized on a microscope slide, and hybridized with labeled human clinical samples. Positive hybridization to both types of probes identifies the presence of the Ebola virus, and the presence of the particular strain of interest.
Subsequences of each of the known sequences of Ebola virus are determined, and sequences common to nearly all strains of the virus and not found in the human sequence or known SNPs are identified. Sequences absent from the human genome and from all but one known Ebola sequence are also identified. All-strain- and single-strain-characteristic probe sequences selected for similar predicted melting temperature are immobilized on a microscope slide, and hybridized with labeled RNA amplified from human clinical samples. Negative hybridization to both types of probes confirms the absence of the Ebola virus.
Subsequences of each of the known sequences of Ebola virus are determined, and sequences common to nearly all strains of the virus and not found in the human sequence or known SNPs are identified. 19-mer sequences absent from the human genome and from all but one known Ebola sequence are also identified. Among these sequences, the subset not convertible into a perfect match for any human sequence or SNP-modified sequence by any combination of 1, 2 or 3 mismatches are selected and probes to them are immobilized on a microscope slide, and hybridized with labeled human clinical samples. Positive hybridization to both types of probes identifies the presence of the Ebola virus, and the presence of the particular strain of interest.
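The requirement that a candidate not be convertible into a host sequence by 1, 2 or 3 mismatches can be sketched by enumerating every substitution variant of the candidate and testing each against the host n-mer set. This is an illustrative brute-force approach under the assumption that the host n-mers fit in a set; the function names are hypothetical, and SNP-modified host sequences would have to be added to the host set separately.

```python
from itertools import combinations, product


def neighbors(seq, max_mismatches):
    """Yield seq and every variant within max_mismatches substitutions."""
    yield seq
    for k in range(1, max_mismatches + 1):
        for positions in combinations(range(len(seq)), k):
            # At each chosen position, try the three other bases.
            alternatives = [[b for b in "ACGT" if b != seq[p]] for p in positions]
            for replacement in product(*alternatives):
                chars = list(seq)
                for p, b in zip(positions, replacement):
                    chars[p] = b
                yield "".join(chars)


def is_mismatch_blind(candidate, host_nmers, max_mismatches):
    """True if no host n-mer lies within max_mismatches of the candidate."""
    return not any(v in host_nmers for v in neighbors(candidate, max_mismatches))


# Toy example: a single host 4-mer and one candidate two substitutions away.
toy_host = {"AAAA"}
blind_at_1 = is_mismatch_blind("AACC", toy_host, 1)
blind_at_2 = is_mismatch_blind("AACC", toy_host, 2)
```

For a 19-mer and up to 3 mismatches, the enumeration covers roughly 28,000 variants per candidate, which is manageable per probe but argues for a faster index when millions of candidates are screened.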
Sequences absent from the human genome and all known human SNPs, and not convertible into a perfect match for any human sequence or SNP-modified sequence by any combination of 1, 2 or 3 mismatches are identified, and a subset of these sequences with predicted hybridization melting temperature in the range 62-65 degrees C. and limited propensity to hybridize to each other is synthesized and blended in to a PCR mixture. Nucleic acids isolated from liver biopsies from patients showing hepatitis of unknown origin are added to the mixture and aliquots subjected to PCR under a range of conditions. Sequences from a previously unknown virus distantly related to the Hepatitis C virus are amplified.
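Selection by predicted melting temperature can be sketched as a simple filter. The text does not state which prediction method is used, so the Wallace rule (2 degrees C. per A/T and 4 degrees C. per G/C) is assumed here purely for illustration; it is a rough approximation for short oligonucleotides, and a nearest-neighbor model would normally be preferred. The function names and example sequences are hypothetical.

```python
def wallace_tm(seq):
    """Rough melting temperature in degrees C by the Wallace rule (assumed)."""
    at = seq.count("A") + seq.count("T")
    gc = seq.count("G") + seq.count("C")
    return 2 * at + 4 * gc


def select_by_tm(candidates, low, high):
    """Keep candidates whose estimated Tm lies in the closed range [low, high]."""
    return [s for s in candidates if low <= wallace_tm(s) <= high]


# Two hypothetical 19-mers: one GC-rich, one composed only of A and T.
candidates = ["GCGCGCGCGCGCATATATA", "ATATATATATATATATATA"]
selected = select_by_tm(candidates, 62, 65)
```

A check for the propensity of the selected sequences to hybridize to each other would be a separate step and is not covered by this sketch.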
Sequences absent from the human genome and all known human SNPs, and not convertible into a perfect match for any human sequence, SNP-modified sequence or the genes encoding production of a therapeutic antibody by any combination of 1, 2 or 3 mismatches are identified, and primers to a subset of these sequences with predicted hybridization melting temperature in the range 62-65 degrees C. are synthesized and blended into a PCR mixture. Nucleic acids isolated from human cell line cultures used in production of the therapeutic antibody are added to the mixture and aliquots subjected to PCR under a range of conditions. The absence of amplification (and successful amplification of spiked positive control mixtures) is used as a quality control measure before release of the production batch.
Sequences predicted to be present in human mRNA based on cDNA databases and genome annotation, but absent from the human genomic DNA sequence and all known human SNPs and not convertible into a perfect match for any human genomic DNA sequence or SNP-modified sequence by any combination of 1 or 2 mismatches are identified. Probes to a subset of these sequences with predicted DNA probe/mRNA hybridization melting temperature in the range 63-70 degrees C. are synthesized as DNA probes and spotted onto a microscope slide, and hybridized with labeled mRNA amplified from human leukemia samples. The pattern of gene expression inferred from the hybridization results is used to choose the course of treatment of the leukemia.
19-mer sequences in the genomic DNA of a pathogenic organism but absent from the human genome and all known human SNPs, and not convertible into a perfect match for any human sequence or SNP-modified sequence by any combination of 1 or 2 mismatches are identified. The process is repeated for each of 30 known human pathogens, and for each pathogen probes to a subset of these sequences with predicted hybridization melting temperature in a selected range are synthesized. An array of all these sequences is used in the analysis of human cerebrospinal fluid for diagnosis of meningitis.
Subsequences of each of the known sequences of Ebola virus are determined, and sequences common to nearly all strains of the virus and not found in the human sequence or known SNPs are identified. 19-mer sequences absent from the human genome and from all but one known Ebola sequence are also identified. Among these sequences, the subset not convertible into a perfect match for any human sequence or SNP-modified sequence by any combination of 1, 2 or 3 mismatches are selected and corresponding probes are immobilized on a microscope slide, and hybridized with labeled human clinical samples. Although not all Ebola-specific probes give positive signals, statistical analysis of the hybridization pattern identifies the presence of the Ebola virus, and the presence of the particular strain of interest.
Sequences present in some human Y chromosomes but otherwise absent from the human genome and all known human SNPs, and not convertible into a perfect match for any non-Y human sequence or SNP-modified sequence by any combination of 1, 2 or 3 mismatches are identified, and 8-mer LNA probes to a subset of these sequences are synthesized and immobilized on individual beads of a bead-based hybridization assay. The assay is applied to pooled human female DNA samples, and a few nominally Y-specific probes are found to give false positive results. Results obtained from these probes are subsequently ignored. Nucleic acids isolated from female-subject forensic samples are assayed using this probe set, and Y-derived sequences are found, serving as evidence in criminal prosecution.
Probes complementary to sequences absent from the known genome sequence of a pregnant human female are synthesized and arrayed on a 3-dimensional gel pad. Amplified nucleic acids derived from a sample of the subject's blood are hybridized to the array, and the probes giving hybridization are compared with the archival data for similarly-amplified nucleic acids derived from a sample of the subject's blood taken before her pregnancy. The differences observed are used in conjunction with data from male subjects for the establishment of paternity.
Probes complementary to sequences absent from the known genome sequence of a pregnant human female are synthesized and arrayed on a 3-dimensional gel pad. Amplified nucleic acids derived from a sample of the subject's blood are hybridized to the array, and the probes giving hybridization are compared with the archival data for similarly-amplified nucleic acids derived from a sample of the subject's blood taken before her pregnancy. The differences observed are used for prenatal diagnosis of the status of the fetus.
Sequences absent from the human genome and all known human SNPs, and not convertible into a perfect match for any human DNA or RNA sequence or SNP-modified sequence by any combination of 1, 2 or 3 mismatches are identified, and probes to a subset of these sequences with predicted hybridization melting temperature in the range 62-65 degrees C., complementarity to known HIV-2 sequences, few-hundred base spacing, facing orientation, and limited propensity to hybridize to each other are synthesized and blended into a PCR mixture. Nucleic acids reverse transcribed from blood samples of patients showing immunodeficiency of unknown origin are added to the mixture and aliquots subjected to PCR under a range of conditions. Positive amplification results are diagnostic of HIV-2.
Sequences absent from the bovine genome and all known bovine SNPs, and not convertible into a perfect match for any bovine sequence or SNP-modified sequence by any combination of 1, 2 or 3 mismatches, and also not present in any sequence found by 700 megabases of shotgun sequencing of total DNA from bovine rumen contents are identified, and probes to a subset of these sequences with predicted complement hybridization melting temperature in the range 62-65 degrees C. are synthesized and immobilized on an array. Hybridization of rumen contents DNA to these arrays is used for routine monitoring of the health and nutritional status of cows, and for determining responses to changes in diet.
Sequences absent from the bovine genome and all known bovine SNPs, and not convertible into a perfect match for any bovine sequence or SNP-modified sequence by any combination of 1, 2 or 3 mismatches, and also not present in any sequence found in known human food pathogens are identified, and probes to a subset of these sequences with predicted complement hybridization melting temperature in the range 62-65 degrees C. are synthesized and immobilized on an array. Hybridization of nucleic acids derived from meat samples to these arrays is used for food safety assurance.
Sequences absent from all but one gene of the human genome and all known human SNPs, and not convertible into a perfect match for any other human sequence or SNP-modified sequence by any combination of 1 or 2 mismatches are identified. An array of DNA probes complementary to these sequences and 1-base modifications of them is used for resequencing of a large portion of the gene in multiple individuals.
Sequences absent from all but the mitochondrial portion of the human genome and all known human SNPs, and not convertible into a perfect match for any other human sequence or SNP-modified sequence by any combination of 1 or 2 mismatches are identified. An array of DNA probes complementary to these sequences and 1-base modifications of them is used for sequencing of a large portion of the mitochondrial genome in multiple individuals.
Sequences complementary to portions of the RNA genome of a virus and absent from the human genome and all known human SNPs, and not convertible into a perfect match for any other human sequence or SNP-modified sequence by any combination of 1 or 2 mismatches are identified. Several of these sequences are synthesized as DNA oligonucleotides fused to T7 RNA polymerase promoter sequences. These fused oligonucleotides are added to total nucleic acids isolated from patients suspected to be infected with the virus, with each oligonucleotide being added to the clinical sample in a separate reaction microwell. Treatment with reverse transcriptase followed by T7 RNA polymerase along with cy5-labeled nucleotides produces a fluorescent RNA product in 5 of 7 wells; this is interpreted as positive evidence that the patient is infected with the virus.
19-mer sequences absent from the human genome and all known human SNPs, and not convertible into a perfect match for any human sequence or SNP-modified sequence by any combination of 1, 2 or 3 mismatches are identified, and primers recognizing a subset of these sequences, with very limited propensity to hybridize to each other, are synthesized and blended into a PCR mixture. Nucleic acids isolated from atherosclerotic vessel biopsies from human patients are added, along with a primer directed to a highly-conserved region of eubacterial 16S ribosomal RNA, and aliquots are subjected to PCR under a range of conditions in a gradient thermocycler, with 2 cycles of PCR under low stringency followed by 30 cycles at higher stringency. Sequences from a previously unknown bacterium are amplified; sequencing produces enough 16S rRNA sequence for approximate classification of the organism.
Subsequences of each of the known genomic DNA sequences of C. jejuni are determined, and sequences common to nearly all strains of the organism and not found in the human DNA sequence or known SNPs are identified. The masses of DNA fragments of 8-25 nucleotides predicted to result from simultaneous digestion with multiple restriction endonucleases of DNA containing these subsequences are calculated. Nucleic acids isolated from human clinical samples are subjected to digestion with these restriction endonucleases and the masses of fragments determined by MALDI-TOF mass spectrometry. Matching of a predicted fragment mass is used to identify the presence of C. jejuni.
Subsequences of each of the known genomic DNA sequences of C. jejuni are determined, and sequences common to nearly all strains of the organism and not found in the human DNA sequence or known SNPs are identified. The masses of DNA fragments of 8-25 nucleotides predicted to result from simultaneous digestion with multiple restriction endonucleases of DNA containing these subsequences are calculated. Nucleic acids isolated from human clinical samples are subjected to digestion with these restriction endonucleases and the masses of fragments determined by MALDI-TOF mass spectrometry. Matching of a set of predicted fragment masses is used to identify the presence of C. jejuni.
Sequences absent from the human genome and all known human SNPs, and not convertible into a perfect match for any human sequence or SNP-modified sequence by any combination of 1, 2 or 3 mismatches are identified, and a subset of these sequences with predicted hybridization melting temperature in the range 62-65 degrees C. is immobilized in a 10,000-element array. A library of the hybridization patterns produced on this array by each of 50 pathogens of concern is developed in control experiments such that the terrorist agent can be unequivocally identified by matching with the pattern library.
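Matching an observed hybridization pattern against such a library can be sketched as follows. Patterns are represented as tuples of 0/1 values, one per array element; the agreement measure, the 0.9 threshold, and the library entries are all hypothetical choices made for illustration, not parameters given in the text.

```python
def agreement(a, b):
    """Fraction of array elements on which two yes/no patterns agree."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def best_match(observed, library, min_agreement=0.9):
    """Return (name, score) of the closest library pattern.

    The name is None when even the best score falls below min_agreement,
    i.e. the observed pattern is not recognizably similar to any entry.
    """
    name, score = max(
        ((n, agreement(observed, p)) for n, p in library.items()),
        key=lambda item: item[1],
    )
    return (name, score) if score >= min_agreement else (None, score)


# Hypothetical 10-element patterns standing in for a 10,000-element array.
library = {
    "agent_A": (1, 0, 1, 1, 0, 0, 1, 0, 1, 0),
    "agent_B": (0, 1, 0, 0, 1, 1, 0, 1, 0, 1),
}
known = best_match((1, 0, 1, 1, 0, 0, 1, 0, 1, 1), library)
unknown = best_match((1, 1, 0, 0, 0, 0, 1, 1, 1, 0), library)
```

Allowing a small fraction of disagreeing elements accommodates probes that fail to give a signal, at the cost of reduced discrimination between very similar patterns.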
Sequences absent from the human genome and all known human SNPs, and not convertible into a perfect match for any human sequence or SNP-modified sequence by any combination of 1, 2 or 3 mismatches are identified, and a subset of these sequences with predicted hybridization melting temperature in the range 65-70 degrees C. is immobilized in 8,000 elements of a 10,000-element array. The remaining 2,000 elements of the array are devoted to not-human-complementary 16S rRNA signature sequences (Zhang, Z., Willson, R. C., and Fox, G. E. “Identification of Characteristic Oligonucleotides in the 16S Ribosomal RNA Sequence Dataset,” Bioinformatics, 18: 244-250 (2002)) chosen to be characteristic of various pathogenic subgroups. A library of the hybridization patterns produced on this array by each of 50 pathogens of concern is developed in control experiments such that a bioterrorism agent can in most cases be unequivocally identified by matching with the pattern library. However, a terrorist might seek to evade this highly sensitive system by transferring pathogenic genes to a closely related but normally non-pathogenic strain or by using a novel pathogenic strain. In either case, a pattern would be generated that would not be recognizably similar to any of the control patterns. In order to overcome this problem, the phylogenetically characteristic signature probes are used such that when a pattern is produced which is not in the library, it is still possible to determine the genetic affinity of the pathogen being used. This is done by determining which of the phylogenetically-informative hybridization probes gave a positive result. Hence we might learn that the pathogen is a close relative of B. anthracis.
Subsequences of each of the known 16S rRNA sequences of B. anthracis are determined, and sequences common to nearly all strains of the bacterium and not found in the human sequence, other bacterial sequences, or known SNPs are identified. Locked nucleic acid probes to these sequences are synthesized and are immobilized on a collection of optically coded Luminex xMAP beads. Total RNA isolated from unknown bacterial samples is fragmented, labeled, and hybridized to the bead array. The bead array is analyzed, and some samples are found to be B. anthracis.
19-mer sequences absent from the human genome and all known human SNPs, and not convertible into a perfect match for any known human, microbial or viral sequence or SNP-modified sequence by any combination of 1, 2 or 3 mismatches are identified, and probes to a subset of these sequences with predicted hybridization melting temperature lying in a range spanning 5 degrees C. are synthesized in an array of 300,000 probes arranged in 50,000 spots with 6 types of probes in each spot. Nucleic acids isolated from blood samples from patients showing fevers of unknown origin are labeled and hybridized to the array, as are samples from control subjects. Spots absent in the control samples and often found in the fever-associated samples are found, and separate analysis of the probe sequences found in these spots identifies probes useful in detecting the causative agent of the fever.
A set of 19-mer probes not complementary to any sequence in the mosquito or human sequences or any 1- or 2-base modifications of these sequences, and complementary to the genome sequences of known mosquito-borne viral pathogens, is immobilized on an array. Nucleic acids isolated from blood-feeding mosquitoes are hybridized to the array and the results used to identify the pathogenic viruses carried by mosquitoes in a study area irrespective of whether or not the mosquito has recently ingested human blood.
A human/mosquito blind array is constructed that has a set of random 19-mer probes not complementary to any sequence in the mosquito or human sequences or any 1- or 2-base modifications of these sequences. The array has 40,000 spots each containing 10 random sequences. Nucleic acids isolated from blood-feeding mosquitoes are hybridized to the array and the results used to identify the pathogenic viruses carried by mosquitoes in a study area.
A human/mosquito blind array is constructed that has a set of random 19-mer probes not complementary to any sequence in the mosquito or human sequences or any 1- or 2-base modifications of these sequences. The array has 40,000 spots each containing 10 random sequences. Characteristic hybridization patterns for a variety of known viruses are cataloged, and nucleic acids isolated from blood-feeding mosquitoes are hybridized to the array and the results used to identify the pathogenic viruses carried by mosquitoes in a study area.
Subsequences of each of the known sequences of Ebola virus are determined, and sequences common to nearly all strains of the virus and not found in the human sequence or known SNPs are identified. 19-mer sequences absent from the human genome and from all but one known Ebola sequence are also identified. Among these sequences, the subset not convertible into a perfect match for any human sequence or SNP-modified sequence by any combination of 1, 2 or 3 mismatches is selected, and corresponding probes are immobilized on a microscope slide, and hybridized with labeled human clinical samples. Positive hybridization to both types of probes identifies the presence of the Ebola virus, and the presence of the particular strain of interest.
Sequences absent from the human genome and all known human SNPs, and not convertible into a perfect match for any human sequence or SNP-modified sequence by any combination of 1, 2 or 3 mismatches are identified, and a subset of these sequences with predicted hybridization melting temperature in the range 62-65 degrees C. is synthesized and blended into a PCR mixture. Nucleic acids isolated from liver biopsies from patients showing hepatitis of unknown origin are added to the mixture and aliquots subjected to PCR under a range of conditions. Sequences from a previously unknown virus distantly related to the Hepatitis C virus are amplified.
Sequences absent from the human genome and all known human SNPs, and not convertible into a perfect match for any human sequence, SNP-modified sequence or the genes encoding production of a therapeutic antibody by any combination of 1, 2 or 3 mismatches are identified, and a subset of these sequences with predicted hybridization melting temperature in the range 62-65 degrees C. is synthesized and blended into a PCR mixture. Nucleic acids isolated from human cell line cultures used in production of the therapeutic antibody are added to the mixture and aliquots subjected to PCR under a range of conditions. The absence of amplification (and successful amplification of spiked positive control mixtures) is used as a quality control measure before release of the production batch.
Company X is using a novel strain of a bacterium to process an organic waste in an above-ground bioreactor. Nearby, several children become extremely sick from a bacterial infection that can be traced to apparent contamination of a private swimming pool. Unfortunately, the problem bacterium cannot be cultivated or isolated. A highly reputable private laboratory is hired to investigate further. Using PCR probes targeting 16S rRNA, the laboratory shows that organisms with the same 16S rRNA sequence as the strain used in the company's bioreactor are present in soil samples taken near the swimming pool. The parents are now suing the company for large sums of money. In order to determine whether or not it is really the company strain that is present in the soil samples, the genome of the company strain is sequenced. Next, sequences of length N which are blind to all known bacteria are identified. A hybridization array containing 1000 randomly chosen bacterial-blind sequences of length N is constructed. If the company strain is present, essentially every probe will give a positive result. However, because the genomes change significantly for even closely related strains, if the problem organism is not the company strain a large number of probes will not give a signal. Key controls will be soil samples known to contain the company strain, soil known to contain the company DNA and a third sample containing a noncompany strain of the problem organism.
It is of concern that any of a number of bacterial pathogens might be used as bioterrorist agents, and therefore a universal identification system is desirable. We therefore use the methodology above to develop an array of 1000 random human-blind N-mers of similar hybridization properties. For bacterial-size genomes, when these oligomers hybridize they produce a unique pattern that changes very rapidly so that even closely related strains can be distinguished. For example, all three E. coli strains whose genomes have been sequenced can be readily distinguished. A library of patterns produced by each pathogen of concern (e.g. NIAID A, B and C lists) is developed in control experiments such that matching with the pattern library can unequivocally identify the terrorist agent. However, a terrorist might seek to evade this highly sensitive system by transferring pathogenic genes to a closely related but normally non-pathogenic strain or by using a novel pathogenic strain. In either case, a pattern would be generated that would not be recognizably similar to any of the control patterns. In order to overcome this problem, a large number of human-blind 16S rRNA signature sequences are calculated. Human-blind signature sequences for various pathogenic subgroups with appropriate hybridization properties are then incorporated into the universal array. This adds a phylogenetic component to the array such that when a pattern is produced which is not in the library, it is still possible to determine the genetic affinity of the pathogen being used. This is done by determining which of the phylogenetically informative hybridization probes gave a positive result. Hence we might learn that the pathogen is a close relative of B. anthracis.
Environmental issues necessitate a better understanding of microbial ecosystems. In order to understand a complex ecosystem one needs to know what the major microbial components of the ecosystem are. One can identify many of the major components with culture studies or by sequencing 16S rDNA fragments produced by PCR. However, it is in either case difficult to know the extent to which the population diversity has actually been determined. This is addressed by constructing an array of probes which are blind to the genomes of all known bacteria in the ecosystem of interest. If a unique genome of n megabases is present it will be expected to react positively to a predictable number of random all bacterial blind probes. The number of probes that do react will allow statistical estimation of the number of major ecosystem components that remain unidentified.
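One way such an estimate might be formed is sketched below. It rests on assumptions that are not stated in the text: unidentified genomes are modeled as random sequences of equal length, both strands contribute (hence the factor of 2), and the genomes hit probes independently of one another. The function names and the numerical inputs are hypothetical.

```python
import math


def hit_probability(genome_length, probe_length):
    """Chance that a random genome contains a given probe-length sequence.

    Assumes equal nucleotide probabilities and counts both strands.
    """
    return 1.0 - math.exp(-2.0 * genome_length / 4 ** probe_length)


def estimate_unknown_genomes(positive, total_probes, genome_length, probe_length):
    """Estimate how many unidentified genomes explain the positive probes.

    Solves 1 - (1 - p)**m = positive / total_probes for m. The result is
    undefined when every probe is positive (the array is saturated).
    """
    p = hit_probability(genome_length, probe_length)
    observed = positive / total_probes
    return math.log(1.0 - observed) / math.log(1.0 - p)


# Hypothetical case: 4-megabase genomes probed with 12-mers on a 1000-probe array.
p_single = hit_probability(4_000_000, 12)
expected_positive = 1000 * (1.0 - (1.0 - p_single) ** 3)  # three unknown genomes
estimate = estimate_unknown_genomes(expected_positive, 1000, 4_000_000, 12)
```

In practice the probe length would be chosen so that a single genome reacts with only a modest fraction of the probes, which keeps the array far from saturation and the estimate well conditioned.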
Subsequences of each of the known sequences of the yellow fever virus are determined. Sequences common to nearly all strains of the virus and not found in the human sequence, known human SNPs, or the known genomic sequence of Aedes aegypti (the yellow fever mosquito) are identified. Molecular beacon fluorescent probes complementary to 100 all-strain- and single-strain-characteristic sequences are synthesized. All the molecular beacons will emit fluorescent light at the same wavelength upon hybridization with their target sequence. This allows amplification of the total signal while accounting for the presence of essentially any yellow fever strain in the mosquitoes being tested. In order to conduct a test, homogenized mosquitoes will simply be added to a premixed set of molecular beacon probes and fluorescence monitored. The test will be readily used in remote areas where sophisticated instrumentation is not readily available. It will provide a positive result for any yellow fever strain and will still work even if the mosquitoes have recently ingested human blood.
Environmental issues necessitate a better understanding of microbial ecosystems. In order to understand a complex ecosystem one needs to know what the major microbial components of the ecosystem are. One can identify many of the major components with culture studies or by sequencing 16S rDNA fragments produced by PCR. However, it is difficult to know the extent to which the population diversity has actually been determined. This is addressed by constructing an array of probes which are blind to the genomes of all known bacteria in the ecosystem of interest. If a unique genome of n megabases is present it will be expected to react positively to a predictable number of random all bacterial blind probes. The number of probes that do react will allow statistical estimation of the number of major ecosystem components that remain unidentified.
Sequences differing in one position from the most-closely matching DNA of the mitochondrial portion of the human genome and all known human SNPs, and not convertible into a perfect match for any other human sequence or SNP-modified sequence by any combination of 1 or 2 or 3 mismatches are identified. An array of DNA probes complementary to these sequences and 1-base modifications of them is used for sequencing of a large portion of the mitochondrial genome in multiple tissue samples and biopsy specimens from a single individual to monitor the appearance of mutations.
Sequences predicted to be present in virus-encoded mRNA based on cDNA databases and genome annotation, but absent from the mouse genomic DNA sequence and all known mouse SNPs and not convertible into a perfect match for any mouse genomic DNA sequence or SNP-modified sequence by any combination of 1 or 2 mismatches are identified. Probes to a subset of these sequences with predicted DNA probe/mRNA hybridization melting temperature in the range 63-70 degrees C. are synthesized as DNA probes and spotted onto a microscope slide, and hybridized with labeled mRNA amplified from virus-infected mouse samples. The pattern of gene expression inferred from the hybridization results is used to study the course of viral infection and the effects of candidate therapeutics.
Sequences absent from the human genome and all known human SNPs, and not convertible into a perfect match for any human sequence or SNP-modified sequence by any combination of 1, 2 or 3 mismatches, and also not present in any sequence found by 300 megabases of shotgun sequencing of total DNA from human oral biofilm contents are identified; among this set, sequences specific to a particular organism thought to act as a competitive inhibitor of human dental caries formation are identified. Probes to a subset of these sequences with predicted complement hybridization melting temperature in the range 62-65 degrees C. are synthesized and immobilized on an array. After administration of the organism, hybridization of oral rinse-derived DNA to these arrays is used for monitoring of the survival and spread of the caries-prevention organism in the mouth.
Sequences not found in 900 megabases of random sequencing of DNA isolated from a hazardous-waste spillage site and not convertible into a perfect match for any such sequence by any combination of 1, 2 or 3 mismatches are identified, and among this set sequences specific to a particular plasmid known to carry genes for waste degradation are identified. Probes to a subset of these sequences with predicted complement hybridization melting temperature in the range 60-70 degrees C. are synthesized and immobilized on an array. Hybridization of soil-derived DNA to these arrays is used for monitoring of the survival and spread of the plasmid in this ecosystem.
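The melting-temperature windows used in these examples can be applied as a simple filter over candidate probes. The specification does not state which prediction method is used; the sketch below assumes the Wallace rule of thumb (2 degrees C. per A or T, 4 degrees C. per G or C), which is only a rough approximation for short oligonucleotides. The function names are hypothetical.

```python
def wallace_tm(seq: str) -> int:
    """Rough melting temperature in degrees C by the Wallace rule
    (an assumed stand-in for whatever predictor is actually used)."""
    seq = seq.upper()
    at = seq.count("A") + seq.count("T")
    gc = seq.count("G") + seq.count("C")
    return 2 * at + 4 * gc

def filter_by_tm(candidates, low, high):
    """Keep only candidate probes whose predicted Tm lies in [low, high]."""
    return [s for s in candidates if low <= wallace_tm(s) <= high]
```

A nearest-neighbor thermodynamic model would normally give a more reliable prediction; the Wallace rule is used here only to keep the sketch self-contained.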
Subsequences of each of the known sequences of Ebola virus are determined, and 18-mer sequences common to nearly all strains of the virus and not found in the human sequence or known SNPs are identified. 22-mer DNA probes complementary to these 18-mer sequences, but extended by 2 bases (equimolar mixtures of the 4 natural DNA nucleotides) at the 3′ end, and by two nucleotides (inosines) at the 5′ end, are immobilized on a microscope slide, and hybridized with labeled human clinical samples. Positive hybridization to these probes identifies the presence of the Ebola virus.
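The probe design in this example can be sketched as follows. The sketch makes several simplifying assumptions: it requires an 18-mer to occur in every supplied strain (the text says "nearly all"), it treats the human reference as a precomputed set of 18-mers, and it writes inosine as "I" and an equimolar four-base mixture as "N". All names are hypothetical.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq: str) -> str:
    """Reverse complement of a DNA string written 5' to 3'."""
    return seq.translate(COMPLEMENT)[::-1]

def kmers(seq: str, k: int) -> set:
    """All distinct k-mers of a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def shared_signatures(strains, host_kmers, k=18):
    """k-mers present in every strain and absent from the host set."""
    common = set.intersection(*(kmers(s, k) for s in strains))
    return common - set(host_kmers)

def build_probe(signature: str) -> str:
    """Probe written 5' to 3': two inosines, the complement of the
    signature, then two mixed-base positions at the 3' end."""
    return "II" + reverse_complement(signature) + "NN"
```

For an 18-mer signature, `build_probe` returns a 22-character probe string, matching the 2 + 18 + 2 layout described in the text.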
Other objects of the inventions comprise the determination and identification of other microorganisms, valuable or deleterious, which are comprised in a host, or the confirmation of their absence.
Specific compositions, methods, or embodiments discussed are intended to be only illustrative of the invention disclosed by this specification. Variations on these compositions, methods, or embodiments are readily apparent to a person of skill in the art based upon the teachings of this specification and are therefore intended to be included as part of the inventions disclosed herein. For example, the growth of genome and SNP databases will readily allow the development of additional probe/primer sequences for detection of newly-sequenced organisms, or “blind” to these organisms, or blind to additional genetic variants by the methods disclosed here. Also it will be obvious to skilled persons that longer N-mers than those recited herein may be used to great advantage in specific applications of the invention.
There are many variations of the invention, including its processes, processes of using sequences listed in the Contents and in the CDs, the sequences (subsequences) themselves, the related compositions and probes, the arrays, and the algorithms set forth in the Specification.
For example, the invention comprises a process for identifying whether a specific target organism (including viruses) is present in a sample, by the process comprising in combination: a. identifying at least one nucleic acid sequence (n-mer) characteristic of the target organism; b. identifying at least one nucleic acid which may be present in the sample and is characteristic of one or more other (nontarget) organisms which may be present in the sample; c. identifying nucleic acid sequence(s) of the organism or virus differing in at least N positions (N mismatches, where N is 1-10) from the nucleic acid sequence(s) of one or more other (non target) organisms which may be present in the sample; and d. determining the presence in the sample of one or more sequence(s) from step a which do not meet the criteria of step b and do meet those of step c, whereby the presence of such sequences indicates the presence of the target organism.
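Steps a through d can be illustrated with a brute-force sketch. It assumes that sequences are plain strings, that all signatures share one length k, and that "differing in at least N positions" means Hamming distance; the function names are hypothetical and the pairwise comparison is practical only for small inputs.

```python
def hamming(a: str, b: str) -> int:
    """Number of mismatched positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))

def kmers(seq: str, k: int) -> set:
    """All distinct k-mers of a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def ultraspecific_signatures(target, nontargets, k, min_mismatches):
    """Target k-mers (step a) that differ from every nontarget k-mer
    (step b) in at least min_mismatches positions (step c)."""
    background = set().union(*(kmers(s, k) for s in nontargets))
    return {
        cand for cand in kmers(target, k)
        if all(hamming(cand, bg) >= min_mismatches for bg in background)
    }

def target_present(sample_kmers, signatures) -> bool:
    """Step d: any surviving signature found in the sample."""
    return any(sig in sample_kmers for sig in signatures)
```

With `min_mismatches` of 1 or more, exact matches to nontarget sequences are excluded automatically, so one filter covers the criteria of both step b and step c. At genome scale an indexed search would replace the pairwise loop.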
Preferred are such processes in which at least some of the nucleic acid sequences of steps a, b, and c are identified by selecting from a database; or in which at least some of the nucleic acid sequences of steps a, b, and c are identified by selecting from a database listed in the attached CDs; or in which at least some of the nucleic acid sequences of step d are not assayed to be present, and the target organism is determined to be present by the application of a mathematical decision formula; or in which at least some of the nucleic acid sequences of step b are assayed to be present, and the target organism is determined to be present by the application of a mathematical decision formula; or in which at least some of the nucleic acid sequences of step b are assayed to be present, and at least some of the nucleic acid sequences of step d are not assayed to be present, and the target organism is determined to be present by the application of a mathematical decision formula. Especially preferred are the use of 5, 10, 100 or 1000 or more sequences (subsequences).
The invention also comprises processes for identifying whether an organism or virus is present in a sample, the processes comprising in combination: a. identifying nucleic acid signature(s) of the organism or virus, b. identifying at least 100 nucleic acid signature(s) of one or more other organisms and/or viruses which may be present in the sample, c. identifying nucleic acid signature(s) of the organism or virus differing in at least N positions (N MM) from nucleic acid signature(s) of one or more other organisms and/or viruses which may be present in the sample, where N is selected to give a desired improbability of mistaken detection of the target organism, d. determining the presence in the sample of sequence(s) from step a which do not meet the criteria of step b and do meet those of step c, whereby the presence of such sequences indicates the presence of the target organism. Particularly preferred are such processes wherein the organism of steps b or c is an animal host, whereby there is created a set of host-blind sequences; or wherein N is about 1 to 5; or wherein said host-blind sequences comprise n-mers, wherein n is from 5 to 25.
Also the invention comprises processes using a specific human-blind sequence from [File] SEQ ID NO 1 to 17 to identify or detect the associated organism in the presence of human RNA or DNA; and each specific 11- to 18-mer human-blind sequence from SEQ ID NO 1-17; and each specific human-blind sequence 18-mer shown in folder microbial SEQ ID NO 1 [CD1]; and each specific host-blind sequence 11- to 13-mer shown in folder human mismatches SEQ ID NO 2 [CD1]; and each specific human-blind sequence 14-mer shown in folder human mismatches SEQ ID NO 3 [CD2]; and each specific human-blind sequence 18-mer shown in folder virus 18 mers 2MM SEQ ID NO 4 [CD3]; and each specific human-blind sequence 17-mer shown in folder virus 17 mers 2MM SEQ ID NO 6 [CD3]; and each specific human-blind sequence 17-mer shown in folder virus 17-mers 1MM SEQ ID NO 7 [CD3]; and each specific human-blind sequence 17-mer shown in folder selected microbe 17 mers SEQ ID NO 8 [CD3]; and each specific human-blind sequence 18-mer shown in folder selected microbe 18 mers SEQ ID NO 9 [CD3]; and each specific human-blind sequence 18-mer shown in folder Microbial 17-mers 2MM SEQ ID NO 10 [CD4]; and each specific human-blind sequence 17-mer shown in folder Microbial 17-mers 1MM SEQ ID NO 11 [CD4]; and each specific human-blind sequence 17-mer shown in folder Microbial 17 mers 1MM SEQ ID NO 12 [CD5]; and each specific human-blind sequence 17-mer shown in folder Microbial 17-mers 1MM SEQ ID NO 13 [CD6]; and each specific human-blind sequence 17-mer shown in folder Microchip 17-mers 1MM SEQ ID NO 14 [CD6]; and each specific human-blind sequence shown in CD7, Folder nmerseqsWNV, SEQ ID NO 15 [CD7] [contains 465 West Nile virus], Folder nmerseqsDV, SEQ ID NO 16 [CD7] [contains 1815 Dengue virus], and/or Folder nmerseqsAllFlavi, SEQ ID NO 17 [CD7] [contains 765 Flavivirus].
The invention also comprises collections of at least 10 sequences from SEQ ID NO 1 to 17; and, preferably, collections of probe sequences comprising at least 5, more preferably 10, and most preferably 100 or more of the sequences associated with a single organism in SEQ ID NO 1 to 17 [CDs 1-7]; and collections of probe sequences comprising at least 5% (more preferably 25% and most preferably 90%) of the sequences associated with a single organism in SEQ ID NO 1 to 17.
Still further, the invention comprises compositions comprising target organism nucleic acids (or nucleotides) hybridized with probe sequences selected from SEQ ID NO 1 to 17; preferably compositions comprising at least 20 probe sequences selected from SEQ ID NO 1 to 17 and more preferably located in a different position from each other; more preferably with detectable labels and/or polymerase.
Valuable embodiments of the invention comprise other chemical forms of probes besides ordinary DNA, and also cover longer and shorter sequences derived from the ones in the enclosed CDs. Examples include a composition comprising at least 10 DNA, RNA, LNA, PNA, modified backbone DNA, or modified base probe sequences selected from SEQ ID NO 1 to 17, and a detectable label; and a composition comprising at least 10 probe sequences selected from the sequences SEQ ID NO 1 to 17, extensions of those sequences, and truncated derivatives of those sequences of length at least 12.
The invention offers processes for monitoring gene expression of a target organism or virus in a sample, the process comprising in combination: a. identifying nucleic acid signature(s) of the organism or virus, b. identifying at least 100 nucleic acid signature(s) of one or more nontarget organisms and/or viruses which may be present in the sample, c. identifying nucleic acid signature(s) of the organism or virus differing in at least N positions from nucleic acid signature(s) of one or more nontarget organisms and/or viruses which may be present in the sample, where N is selected to give a desired certainty of non-hybridization to nucleic acids of the nontarget organism, d. determining the concentration in the sample of sequence(s) from step a which do not meet the criteria of step b and do meet those of step c, whereby the presence of such sequences indicates gene expression of the target organism.
Also encompassed by the invention are processes for detecting the presence of a virus or other non-human organism in human-derived samples, comprising in combination: a. preparing an array of oligonucleotide probe sequences selected not to hybridize to any known human sequence or any sequence derivable from any known human sequence by any combination of 3 or fewer changes, b. hybridizing labeled nucleic acids isolated from a human blood sample to the array, c. observing the resulting hybridization pattern. Also encompassed are processes for detecting the presence of a virus or other non-human host organism in human-derived samples, comprising in combination: a. preparing a mixture of 14- to 20-mer oligonucleotide primer sequences including sequences selected not to hybridize to any known human sequence or any sequence derivable from any known human sequence by any combination of 3 or fewer changes, but within five sequence changes (mismatches) of sequences complementary to known target sequences, with the primer sequences selected not to give amplification with flavivirus-free human nucleic acid samples, b. performing RT-PCR with these primers and nucleic acids isolated from a human blood sample, c. observing amplification with blood samples from humans infected with the target virus. Especially preferred in any of the above is the process wherein the target is flavivirus.
For scanning, the invention can employ processes for identifying whether any other organism/guest/parasite is present in a biologic host, the process comprising in combination: a. scanning for non-host signatures, and b. scanning for N-error-removed non-host signatures; where N is selected to give the desired statistical certainty of the presence or absence of any parasite in the host and preferably wherein N=1 to 10, more preferably 2-8 and most preferably 3-5.
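The two scanning steps can be sketched as follows. The sketch reads "N-error-removed non-host signatures" as non-host k-mers that remain after also discarding everything within N substitutions of any host k-mer; that reading, the fixed k, and all function names are assumptions made for illustration. Explicit neighborhood enumeration grows quickly with k and N, so this is suitable only for small examples.

```python
from itertools import combinations, product

def kmers(seq: str, k: int) -> set:
    """All distinct k-mers of a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def neighbors(kmer: str, max_mm: int, alphabet: str = "ACGT") -> set:
    """Every sequence within max_mm substitutions of kmer (inclusive)."""
    out = {kmer}
    for n in range(1, max_mm + 1):
        for positions in combinations(range(len(kmer)), n):
            for subs in product(alphabet, repeat=n):
                chars = list(kmer)
                for pos, base in zip(positions, subs):
                    chars[pos] = base
                out.add("".join(chars))
    return out

def scan(sample: str, host: str, k: int, n_errors: int):
    """Return (non-host signatures, N-error-removed non-host signatures)."""
    host_k = kmers(host, k)
    non_host = kmers(sample, k) - host_k             # step a
    masked = set().union(*(neighbors(h, n_errors) for h in host_k))
    return non_host, non_host - masked               # step b
```

A non-empty second set is stronger evidence of a non-host organism than the first, because those signatures cannot be explained by up to N sequencing errors or variants in host sequence.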
As to sequences, the invention comprises each specific human-blind motif [n-mer/signature/sequence/subsequence] shown in CD1 to CD7, preferably the 7- to 25-mers, and also each correlation of n-mer vs. genome (sequence/signature/n-mer/organism) shown in CD1 to CD7 [SEQ ID NO 1-17].
And also arrays [“chips”] of at least 10 [preferably 5 and more preferably 2] of the signatures shown in CD1 to CD7.
The mathematical aspects of the invention include each algorithm shown in this application, especially those for comparing yes/no testing to identify an ultraspecific microorganism.
Specific fragments can include an isolated polynucleotide fragment comprising a nucleic acid sequence selected from the group consisting of SEQ ID NOs 1 to 17, and an isolated polynucleotide complementary to the polynucleotide of any of the claims; preferably those isolated polynucleotides which comprise a heterologous nucleic acid sequence; more preferably an isolated polynucleotide of claim 3, where the heterologous nucleic acid sequence encodes a heterologous polypeptide. Also valuable are uses of the invention for making a recombinant vector comprising inserting the isolated polynucleotide of claim 1 into a vector.
Reference to documents made in the specification is intended to result in such patents or literature being expressly incorporated herein by reference. Four unpublished drafts and a list of references cited above are enclosed and incorporated by reference. Larger print copies are available on request to the Attorney.
The present application claims priority of pending U.S. Ser. No. 60/519,417 filed Nov. 12, 2003 and U.S. Ser. No. 60/532,210 filed Nov. 17, 2003 (Attorney Dockets 014APR/UH2003-48 and 014A′PR).
Number | Date | Country
---|---|---
60519417 | Nov 2003 | US
60532210 | Dec 2003 | US