The Sequence Listing written in file Sequence-Listing_ST25.txt created on May 12, 2020, is 784 bytes, machine format IBM-PC, MS-Windows operating system, in accordance with 37 C.F.R. §§ 1.821-1.825, is hereby incorporated by reference in its entirety for all purposes.
Infectious diseases affect the lives and health of millions of patients annually. Failure to obtain a laboratory-confirmed diagnosis for many acute infectious diseases directly contributes to poor patient outcomes and a high cost burden to the health care system. Key areas of unmet clinical need include neurological infections (encephalitis and meningitis), pulmonary infections (e.g., pneumonia), blood infections, and sepsis.
Traditional diagnostic methods, including culture, antigen detection, and nucleic acid amplification, are limited in scope in cases for which there is little clue regarding the identity of the causative agent. DNA sequencing is the process of determining the precise order of nucleotides within a DNA molecule. It includes any method or technology that is used to determine the order of the four bases—adenine, guanine, cytosine, and thymine—in a strand of DNA. DNA sequencing can be helpful in identifying potential causes of disease in patients. For example, alignment processes may be used to identify matching portions of a sample sequence with a reference database of classified reference sequences.
However, DNA sequencing of a sample and alignment of the resulting sequences to identify a potential source of disease typically involve long processing times, due to the large amount of information that must be compared and processed to match a sample sequence to a reference sequence. Additionally, because of the vast amount of sequencing data, sequence alignment can return large numbers of false positives, where portions of sequence reads appear to match portions of a reference genome that is not in fact present in the biological sample. As such, many results from a DNA sequence alignment process may not be accurate, and raw alignment results are not useful as-is in a clinical environment because an expert and/or clinician must analyze and manually interpret them.
Thus, challenges in the field include developing accurate pipelines (or processes) that can quickly analyze millions of reads that include millions of data points that come out of a DNA sequence system, as well as interpret the data so that it is clinically useful to laboratory scientists and/or a physician. Accordingly, there is a need for systems that are capable of quickly and efficiently identifying and interpreting next generation sequencing data for detection of potential causes of disease and/or any other potential applications of DNA sequence alignment information.
Embodiments of the present invention solve these and other problems individually and collectively.
Embodiments are directed to systems and methods for pathogen detection using next-generation sequencing (NGS) analysis of a sample. Embodiments may apply alignment algorithms (e.g., SNAP and/or RAPSearch alignment algorithms) to align individual sequence reads from a sample in a next-generation sequencing (NGS) dataset against reference genome entries in a classified reference genome database. Various embodiments can filter, classify, and display results to a clinician to identify a pathogen or other genetic material in a sample that is being tested. Embodiments can provide various systems that are configured to filter the results of a sequencing alignment and classify a sample quickly and accurately.
As an example, two alignment techniques (one being faster than the other) can be used together to speed up alignment without sacrificing accuracy. An initial alignment technique can identify which reference genomes in a database match to which sequence reads. For a matching reference genome, an optimally-aligning sequence read can be identified. For the optimally-aligning sequence read, a different alignment technique can be applied, and it can be determined whether any of the new alignment scores to other reference genomes exceed the optimal alignment score for the matching reference genome. If a new alignment score exceeds the optimal alignment score for the optimally-aligning sequence read, the matching reference genome can be removed from a set of matching reference genomes. The set of matching reference genomes can then be output.
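As a non-limiting illustration, the two-stage confirmation described above can be sketched in Python as follows. The function name `confirm_matches`, the dictionary layouts, and the `realign` callable (standing in for the second, different alignment technique) are hypothetical and are not taken from any particular aligner. The sketch also assumes that scores from the two techniques are directly comparable.

```python
from typing import Callable, Dict, Set


def confirm_matches(
    fast_hits: Dict[str, Dict[str, float]],
    realign: Callable[[str], Dict[str, float]],
) -> Set[str]:
    """Return the matching reference genomes that survive re-alignment.

    fast_hits: genome_id -> {read_id: alignment score} from the initial aligner.
    realign:   hypothetical second aligner; maps a read_id to
               {genome_id: alignment score} over the reference database.
    """
    kept: Set[str] = set()
    for genome, read_scores in fast_hits.items():
        if not read_scores:
            continue
        # Optimally-aligning read for this matching genome.
        best_read = max(read_scores, key=read_scores.get)
        best_score = read_scores[best_read]
        # Re-align only that read with the second technique.
        new_scores = realign(best_read)
        beaten = any(
            score > best_score
            for other, score in new_scores.items()
            if other != genome
        )
        # Drop the genome if another genome explains the read better.
        if not beaten:
            kept.add(genome)
    return kept
```

Because only one read per matching genome is re-aligned, the slower technique is applied to a small fraction of the dataset, which is the source of the speed-up in this sketch.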
As another example, sequence reads can be assigned to a particular classification level, so as to provide accurate identification of a particular pathogen. Sequence reads can be identified that match to two or more matching reference genomes of the classified reference genomes with at least the minimum alignment threshold. For such a sequence read, a taxonomy identifier can be assigned from classification information to each of the two or more classified reference genomes. The taxonomy identifier can include at least two levels of classification. The assigned taxonomy identifier of each of the two or more classified reference genomes can be compared at each of the at least two levels of classification, with levels that do not match being removed. The lowest level shared between the two or more reference genomes can be assigned to the sequence read. Updated alignment results can be provided and include a number of corresponding sequence reads for each of the plurality of taxonomy identifiers.
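As a non-limiting illustration, the level-by-level comparison described above can be sketched in Python as follows. The sketch assumes, for illustration only, that each taxonomy identifier is an ordered tuple of names from the highest level to the lowest (e.g., family, genus, species). The names `lowest_shared_level` and `classify_reads` are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Optional, Sequence

# Assumed layout: names ordered from highest level to lowest level.
Lineage = Sequence[str]


def lowest_shared_level(lineages: List[Lineage]) -> Optional[str]:
    """Return the lowest level at which all lineages agree, or None."""
    shared: Optional[str] = None
    for names in zip(*lineages):
        if len(set(names)) != 1:
            break  # this level and all lower levels do not match
        shared = names[0]
    return shared


def classify_reads(
    read_hits: Dict[str, List[str]],
    taxonomy: Dict[str, Lineage],
) -> Counter:
    """Count reads per assigned taxon.

    read_hits: read_id -> genome identifiers matched above the threshold.
    taxonomy:  genome identifier -> lineage of that reference genome.
    """
    counts: Counter = Counter()
    for genomes in read_hits.values():
        taxon = lowest_shared_level([taxonomy[g] for g in genomes])
        if taxon is not None:
            counts[taxon] += 1
    return counts
```

In this sketch, a read matching two species of one genus is counted at the genus level, and a read matching a single reference genome keeps its species-level assignment.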
As another example, background contaminants can be identified and removed from a set of potential (candidate) pathogens that are clinically relevant. A negative control sample can be used to identify sequence reads from potential contaminating organisms. A ratio can be taken of a first amount of sequence reads from a test sample that align to a matching genome and a second amount of sequence reads from the negative control sample that align to the matching genome. The ratio can be compared to a threshold to identify a set of one or more matching reference genomes for which the ratio exceeds the threshold. An output can identify the set of one or more matching reference genomes as potential pathogens in the test sample.
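As a non-limiting illustration, the ratio test described above can be sketched in Python as follows. The function name `candidate_pathogens`, the default threshold of 10, and the pseudocount (used so that a genome absent from the negative control does not cause division by zero) are assumptions made for this example.

```python
from typing import Dict, Set


def candidate_pathogens(
    sample_counts: Dict[str, int],
    control_counts: Dict[str, int],
    min_ratio: float = 10.0,    # illustrative threshold, an assumption
    pseudocount: float = 1.0,   # assumption: avoids division by zero
) -> Set[str]:
    """Return genomes whose test-to-control read ratio exceeds min_ratio."""
    candidates: Set[str] = set()
    for genome, n_sample in sample_counts.items():
        n_control = control_counts.get(genome, 0)
        ratio = n_sample / (n_control + pseudocount)
        if ratio > min_ratio:
            candidates.add(genome)
    return candidates
```

A genome with similar read counts in the test sample and in the negative control yields a ratio near one and is treated as background in this sketch.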
Other embodiments are directed to systems and computer readable media associated with methods described herein.
A better understanding of the nature and advantages of the present invention may be gained with reference to the following detailed description and the accompanying drawings. Reference to the remaining portions of the specification, including the drawings and claims, will realize other features and advantages of the present invention. Further features and advantages of the present invention, as well as the structure and operation of various embodiments of the present invention, are described in detail below with respect to the accompanying drawings. In the drawings, like reference numbers can indicate identical or functionally similar elements.
An Appendix includes Supplementary Tables 1-4.
Embodiments can provide processes for rapid analysis of next generation sequencing (NGS) data for pathogen detection. For example, embodiments may be used in a broad, comprehensive pathogen diagnostic for infectious diseases by analyzing sequencing results against many reference genomes, as a metagenomics analysis. The use of unbiased metagenomic next-generation sequencing (mNGS) can provide for detection of all potential pathogens in a single assay. An advantage is the ability to detect all viruses, bacteria, fungi, and parasites in a single, standardized universal test directly from diverse clinical sample types such as cerebrospinal fluid (CSF), bronchoalveolar lavage (BAL), and plasma, thereby maximizing the potential impact on patients with acute, life-threatening infections by early and more accurate diagnosis.
Embodiments may apply alignment algorithms (e.g., SNAP and/or RAPSearch alignment algorithms) to align individual sequence reads from a sample in a next-generation sequencing (NGS) dataset against reference genome entries in a classified reference genome database. For example, as a result of sequencing of a sample, the system may obtain a list of results of reference sequences that align with the sequencing reads of a sample. Individual sequences in the reference genome database (also referred to as a GenBank) are referred to as reference genome sequences and may be identified by genome identifiers (GI). The genome identifiers include reference identifiers used to identify reference sequence genomes stored in the GenBank database. A best GI match can be assigned to each read, and the taxonomy assigned to each GI according to the classified reference genome in the GenBank can be assigned to each read.
Such unbiased analysis (e.g., using the large number of reference genomes in the GenBank) can make accurate detection of specific pathogens difficult as many human samples include flora or background colonization organisms. For instance, a nasal swab contains many bacteria because bacteria colonize the respiratory tract. Thus, when the system makes a detection, it is difficult to know whether the system has detected a pathogen or a colonizer. Further, because the analysis does not bias or target any individual pathogen, the system could be detecting, for instance, a bacterial contamination of enzyme preps that are used in making the sequencing libraries. So effectively the system may identify laboratory reagent contamination instead of matching to a particular infectious disease.
Accordingly, while the sequencing analysis and matching is sensitive, the analysis may return false-positive matches (e.g., a human read will align to a viral GI due to database misannotations). And, reference genome databases may include many GI entries that are poorly annotated (e.g., “uncultured eukaryote” instead of a particular identification) or, even worse, incorrectly annotated (e.g., [http://][www].ncbi.nlm.nih.gov/nuccore/KC506764.1, annotated as GBV-B when this is actually GBV-C virus). These databases typically allow anyone to add reference genomes and there is no standard for annotating samples.
Further, many of the alignments are not accurate, and raw results alone are not useful as-is clinically because an expert and/or clinician must analyze and manually interpret the returned alignment results. Typically, such interpretation requires an expert genomicist or bioinformaticist, or someone who is well-experienced with infectious diseases or laboratory medicine and who also understands the bioinformatics. Accordingly, such raw results are not helpful or usable in a clinical laboratory where such processes can provide great value to clinicians caring for patients.
To address this issue (non-specific identification), embodiments can filter and classify sequencing reads to obtain more accurate results and avoid false-positives, maximizing the specificity of detection. Embodiments can filter, classify, and display results to a clinician that can then quickly and easily obtain the results of the sequencing to identify a pathogen or other material that is being tested. Filtering can be helpful because the genome databases may be rife with misannotations and false positive matches.
Further, typical sequencing methods compare sequences to reference databases and then identify hits based on what the sequence aligns to. However, the actual interpretation of those hits is more subtle because an expert has to analyze the data to determine whether some hits are not accurate and/or do not provide enough information to be classified as a particular pathogen. Embodiments can improve the classification of reads by applying a rapid taxonomic classification algorithm to alignment results to provide more accurate and clinically useful results. Additionally, embodiments can annotate the data within the genome reference database so that the system knows which reference genomes are a pathogen, which are a colonizer, which are considered likely contamination, etc. Embodiments can also provide a user-friendly visualization interface that can be used by clinicians to quickly and easily identify pathogens and other results.
Embodiments can provide a number of advantages including limiting the manual annotation and interpretation of sequencing analysis results to allow quick, efficient, and useful clinical and public health surveillance. For example, embodiments may be used where a patient is sick and a clinician may take one or more samples from the patient to determine what pathogens are present in the samples. The clinician may desire to know whether the patient has a viral infection, a bacterial infection, a fungal infection, a parasite, etc. Any type of biological sample that contains DNA material could be used to identify potential causes of the disease. For instance, blood, cerebrospinal fluid, respiratory secretion, tissue, stool, etc., may be used to obtain sequence reads of the samples. Additionally, embodiments may be used in blood bank testing, food and water quality testing, environmental testing, animal testing, animal health, or any other area that may be assisted by quickly and efficiently determining potential sequence matches within a sample. However, a sample will likely include a mixture of multiple organisms. Some of the genetic material may be human, viral, or bacterial, and many samples are mostly mixtures of different organisms.
The mNGS assay has been clinically validated for diagnosis of encephalitis and meningitis in cerebrospinal fluid. The assay can incorporate: (1) the analytic wet bench process including handling of patient samples, nucleic acid extraction, mNGS library preparation, and sequence generation on an Illumina HiSeq instrument, and (2) bioinformatics analysis of sequence data and clinical interpretation by trained microbiologists/pathologists. Results of diagnostic mNGS testing from cerebrospinal fluid are reportable in the medical chart and can be used for clinical management. The clinical implementation of the mNGS assay has a direct, positive impact on patient outcomes by increasing the number and proportion of patients with an accurate, clinically actionable infectious disease diagnosis that allows for timely management and treatment. The assay is also able to detect rare, unexpected, or slow growing or uncultivable microorganisms, for which diagnosis is often delayed or missed. The mNGS assay may have particular utility in identifying culture-negative pathogens due to prior treatment with antibiotics. The mNGS assay may also be useful as a “rule-out” test for infectious diseases, which may impact management by increasing clinical confidence in working up and treating non-infectious causes of encephalitis/meningitis such as autoimmune disease with steroids and/or immunosuppressive medications.
The mNGS data can also yield additional information besides whether a given microorganism or microorganism type is present or absent. The number and proportion of reads, typically expressed as a “reads per million (RPM)” metric normalized to a negative “no-template” control sample run in parallel, can provide some degree of quantitative or at least semi-quantitative information. In some cases, the pathogen genome coverage may be sufficient to facilitate (1) precise genotyping or strain identification, (2) analysis of single-nucleotide polymorphisms or mutations, and (3) the generation of predicted antibiotic/antiviral resistance profiles.
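As a non-limiting illustration, a reads per million value and its normalization to a no-template control can be computed as in the following Python sketch. The function names and the `floor` parameter (a minimum control RPM used when the control contains no reads for the taxon) are assumptions made for this example.

```python
def reads_per_million(taxon_reads: int, total_reads: int) -> float:
    """Reads assigned to a taxon, scaled to one million total reads."""
    if total_reads <= 0:
        raise ValueError("total_reads must be positive")
    return taxon_reads * 1_000_000 / total_reads


def normalized_rpm(sample_rpm: float, control_rpm: float, floor: float = 1.0) -> float:
    """Sample RPM relative to the no-template control RPM.

    The floor is an assumption: it keeps the ratio finite when the
    control has no reads for the taxon.
    """
    return sample_rpm / max(control_rpm, floor)
```

For example, 250 reads assigned to a taxon out of 5,000,000 total reads corresponds to 50 RPM.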
Once a clinical sample has been obtained (e.g., from a patient or from an environmental sample, such as a water sample), the sample may be sequenced, and the sequencing results can be analyzed. Various systems and processes can be used. For example, nucleic acid extraction can be performed. A cDNA/DNA library preparation may include adding adapters, although other preparation for other types of sequencing may be performed, e.g., for nanopore sequencing or other single molecule sequencing. The library of templates can be fed into a sequencing device to provide sequencing information, e.g., base calls or raw signals that are used to determine base calls, thereby obtaining sequence reads. The analysis of the sequence reads can include host subtraction; adapter, quality, and low-complexity trimming; and alignment against reference databases such as NCBI (National Center for Biotechnology Information) GenBank.
The sample analysis system 110 may be configured to receive a sample 150 (e.g., after library preparation) from a clinician or other operator, sequence the sample to obtain a plurality of sequence reads of genetic material for the sample, and submit the plurality of sequence reads to sequence identification computer 120. A sequencing device 115 can correspond to any sequencing device, such as those produced by Illumina, Pacific Biosciences, or Oxford Nanopore. A processor 111 (e.g., a CPU) can control aspects of sequencing device 115, such as one or more cameras for taking images of the nucleic acids or electrical components, both receiving sequencing signals corresponding to nucleotides of the nucleic acids being sequenced.
A memory 112 (e.g., flash memory, hard drive, DRAM, cache, etc.) can store software for controlling processor 111, which can control sequencing device 115. In some embodiments, a sample collection module 113 can control robotic processes for obtaining a sample (e.g., via a syringe connected to a robotic arm), and perform any automated preparation processes. A sample sequencing module 114 can instruct processor 111 to perform the sequencing, using sequencing device 115. Sample analysis system 110 can perform analysis of the raw sequencing signals (e.g., fluorescent or electrical signals) to identify basecalls of sequence reads, or send such raw sequencing signals to sequence identification computer 120.
Sequence identification computer 120 can process, align, and identify genetic material that is present in the sequence reads to identify pathogens and/or other genetic material that is present in the sample. An alignment module 123 in memory 122 can instruct processor 121 to align the sequence reads to a plurality of reference genomes, e.g., as stored in reference genome database 130. One or more alignment techniques (e.g., a local aligner, such as SNAP) can be used to obtain alignment results, e.g., initial alignment results, as well as subsequent alignment results. A classification module 124 can classify the alignment results to include an accurate taxonomic classification level. A filtering module can filter the alignment results to remove false positives. Sequence identification computer 120 can return results for identifying pathogens and/or other genetic material present in the sample.
Reference genome database 130 can contain a plurality of reference genomic sequences that have been identified and classified as being associated with a particular biological organism. For example, reference sequences of different viruses, bacteria, fungi, human, animal, and/or any other reference DNA sequences of any other biological material may be stored in the classified reference genome database. Sequence identification computer 120 may apply one or more alignment techniques to align the received sequence reads from the sample to the classified reference genome sequences stored within reference genome database 130.
Although
System 100 can be used to identify and detect pathogens in clinical samples, e.g., for the purposes of surveillance, such as epidemiologic surveillance. For example, it may be desired to look at cases of sequences generated from patients with acute diarrheal illness or fever to identify and detect pathogens. Surveillance can also be performed to detect novel pathogens, thereby allowing pathogen discovery, e.g., when no match is found in reference genome database 130. Another application is use as a diagnostic tool, e.g., by identifying a known pathogen for which treatment is known.
Metagenomic sequencing can include alignment to multiple reference genomes, as may be stored in reference genome database 130. One example of such a database is GenBank, which is more than 20 gigabases in size. With such a large size, the alignment of millions of sequence reads can provide many hits (matching alignments) to different genomes, which is why classification and/or filtering can be useful. There may actually be a pathogen in the sample that is identified by a hit, but there can also be false hits, as may occur due to problems in the databases. The databases themselves are not well curated, and there are errors in how they are constructed. For example, a sequence annotated as hepatitis C virus may actually be a human sequence, and vice versa. The database itself can be cleaned up as well, e.g., when misannotations are identified, as may be done using BLAST.
Additionally, a sample can include a large amount of background nucleic acids. Techniques can be used to filter out such background sequences, as may be done with a no template control (NTC) sample. NTC database 135 (also called a contaminant database) can store background sequences that have been identified in recent samples, e.g., samples within the last month. Levels (e.g., number of reads) at which a background sequence has been quantified can be stored, as is described below.
For the taxonomic classification, individual sequences can be classified according to where they fit on the taxonomic level. For instance, if you have sequences that are aligned to certain regions (e.g., viral or bacterial genome) those individual reads may not be specific for that species. That is, it may not be specific to that viral genome of a viral species. In such a case, the read can be classified using a least common ancestor algorithm to the next higher taxonomic level. Once the sequence reads are classified, the system can then point out what species are specific to the sample, thereby allowing a good prediction that an actual viral species is in the sample.
In some embodiments, the classification can use the results of an initial alignment technique, e.g., SNAP, thereby allowing the classification to start while alignment is being performed for other sequence reads. The efficient alignment and relatively concurrent classification can allow a taxonomic classification in one to two hours. Accordingly, if the alignment results provide 100 hits, the classification can narrow these to about 10 to 15. However, for clinical purposes, it is desirable to have only a few hits, e.g., two. The 10-15 hits may be real, but may not be clinically significant. For example, the hits may include microorganisms that are part of the laboratory background contaminants or part of the skin background introduced when drawing blood. Techniques to filter out such contaminants are described in more detail below, as well as alignment and classification techniques.
Once the system has obtained the raw sequence reads from the sequencing module, the raw sequence reads are analyzed to obtain an identification of matching reference sequences from one or more classified reference genome databases. Embodiments can provide data with respect to what sequences from different microbial pathogens are detected.
At step 210, the raw sequence reads are received from the sequencing module. The sequencing process may return a large number (e.g., 1, 2, 3, 5, 10, 20, 50, or 100 million reads, or more) of DNA sequence reads from a sample. As examples, the raw sequence reads may be received in fasta, fastq, sam, or bam files. The sequence reads can be obtained in various ways, e.g., from a sequencing of nucleic acids or probe-based methods, such as hybridization arrays. The use of random sequencing that essentially sequences all nucleic acids can be preferable so that a significant number of pathogens (e.g., as many as reference genomes are available) can be detected.
At step 220, the raw sequence reads are preprocessed to trim the reads to remove low quality sequences, remove low complexity sequences (e.g., as would not be very informative), remove adaptors that can be retained at the ends, etc. Thus, the reads may be cleaned-up to ensure that the remaining reads are of high quality. If adaptors are used, the adaptors (adapters) may be ligated to the end of the nucleic acids and used as primers for sequencing, and thus may not relate to the actual nucleic acids. The low complexity reads (e.g., with many repeats) can be difficult to align, and thus removed. The quality of a read can be determined based on the quality of the basecalls, which may be determined based on the sequencing signals.
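As a non-limiting illustration, the preprocessing of step 220 can be sketched in Python as follows. All function names, the thresholds, and the adapter handling are assumptions made for this example. In particular, the low-complexity test is a crude single-base-fraction stand-in for the entropy or repeat-based measures a production pipeline might use, and adapter removal is reduced to an exact substring match.

```python
from typing import List, Optional, Tuple

Read = Tuple[str, List[int]]  # (bases, per-base quality scores)


def trim_adapter(seq: str, quals: List[int], adapter: str) -> Read:
    """Cut the read at the first exact occurrence of the adapter."""
    idx = seq.find(adapter)
    if idx != -1:
        return seq[:idx], quals[:idx]
    return seq, quals


def trim_low_quality_tail(seq: str, quals: List[int], min_q: int = 20) -> Read:
    """Remove trailing bases whose quality is below min_q."""
    end = len(seq)
    while end > 0 and quals[end - 1] < min_q:
        end -= 1
    return seq[:end], quals[:end]


def is_low_complexity(seq: str, max_fraction: float = 0.8) -> bool:
    """Crude proxy: one base accounts for most of the read."""
    if not seq:
        return True
    most_common = max(seq.count(base) for base in "ACGT")
    return most_common / len(seq) > max_fraction


def preprocess(
    seq: str, quals: List[int], adapter: str, min_len: int = 20
) -> Optional[Read]:
    """Return the cleaned read, or None if it should be discarded."""
    seq, quals = trim_adapter(seq, quals, adapter)
    seq, quals = trim_low_quality_tail(seq, quals)
    if len(seq) < min_len or is_low_complexity(seq):
        return None
    return seq, quals
```

The order shown (adapter trimming, then quality trimming, then length and complexity filtering) is one possible choice and is not required by the process described above.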
At step 230, an alignment module (e.g., alignment module 123) performs a sequence alignment analysis for the sample sequence reads to the reference genomes. In some embodiments, the alignment analysis may be performed in reference to a host genome database first and any sample sequence reads that align to the host may be removed from the sequence. For example, in embodiments that are identifying pathogens associated with a human sample, the sequence reads may be aligned with a reference human genome database and any matches may be removed (since no pathogens should be present in the human reference genome database). Comparison can be made to multiple reference human genomes.
Thus, embodiments can use computational subtraction to speed up the identification processes to remove any sequence reads that align with the host (and thus are not helpful for identifying biological material from other entities (i.e., pathogens)). In embodiments where a single large reference database is used, any results that are identified as matching with a human reference genome could be removed. However, by applying the full 100% of the sequence reads and not subtracting them before aligning with the full reference database, the process may be slower than if those sequence reads were removed before comparing to the large database of reference genomes.
The host database (e.g., human genome database) may include fewer genomes than the entire reference genome database. Accordingly, alignment to the host database can be faster than comparing to all possible reference genomes. Thus, for relatively high quality sterile samples, applying such a host filtering method may remove most of the background reads quickly. For example, for a good sample, the removal of the host matches could take the analysis from potentially 100% of the reads down to less than 10% of the reads. Thus, 90% or more of the sample sequence reads can be removed by alignment to a host (e.g., human) database.
Additionally, in some embodiments, a host database could include similarly related genomes that are not specifically from the host. For example, for the human example provided above, the system could also include primates that have similar genomic references to humans. This provides a more comprehensive host analysis and provides even faster and more effective subtraction of host matches.
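As a non-limiting illustration, the computational subtraction of host reads can be sketched in Python as follows. The `aligns_to_host` callable is a hypothetical stand-in for an alignment against the host database, and the function name `subtract_host` is likewise an assumption.

```python
from typing import Callable, Dict, Tuple


def subtract_host(
    reads: Dict[str, str],
    aligns_to_host: Callable[[str], bool],
) -> Tuple[Dict[str, str], float]:
    """Remove reads that align to the host.

    reads:          read_id -> sequence.
    aligns_to_host: hypothetical predicate wrapping a host-database aligner.
    Returns the non-host reads and the fraction of reads removed.
    """
    kept = {rid: seq for rid, seq in reads.items() if not aligns_to_host(seq)}
    removed_fraction = 1 - len(kept) / len(reads) if reads else 0.0
    return kept, removed_fraction
```

Only the reads returned by this step would then be aligned against the larger reference genome database.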
Next, depending on the mode of the identification process, either a comprehensive mode or a fast mode can be performed to identify one or more matching reference genomes associated with the sample sequence reads. The fast analysis may only align the sample reads to a reference genome database of bacterial and viral reference genomes. Thus, the fast mode analysis may not identify all potential genomic matches in the sample sequence reads but will focus on the analysis on potential bacterial and viral matches due to the focus of the process on identifying pathogens within the samples. The comprehensive mode may align the sample reads to an entire nucleotide classified reference genome database that includes reference genomes from bacterial 240A, fungal 240B, parasitic 240C, viral 240D, and other 240E reference genomes.
At step 240 of the comprehensive mode, the alignment module performs a sequence alignment analysis for the sample sequence reads to the reference genome database, e.g., to reference genomes 240A-240E. Steps 241 and 242 of the fast mode may only perform alignment to the bacterial database 240A and viral database 240D, respectively. Any number of different alignment algorithms may be used in embodiments of the present invention. For example, the “Scalable Nucleotide Alignment Program (SNAP)” algorithm includes a nucleotide aligner that takes raw sequence data and aligns it to nucleotide reference databases. SNAP is extremely fast, and by using fast alignment algorithms, the analysis processes of the present invention may return results extremely fast. While other analysis methods (i.e., “pipelines”) may analyze sample sequence data within days, weeks, or months, embodiments of the present invention can analyze the sample sequence data in minutes to hours. Fast sequencing analysis is critical in a number of applications including, for example, infectious disease analysis that can be paired with a next generation sequencing assay that can diagnose infectious diseases and get results to physicians regarding patient care within 8-12 hours, or as soon as possible.
For example, in some embodiments, the system may align the sequence reads from the sample to all classified reference genomes in GenBank using a SNAP alignment algorithm. However, the GenBank is growing very fast (e.g., doubling every year) so it can be important to limit the number of reads being analyzed to improve the speed of the alignment analysis with the reference database (e.g., GenBank). For example, in some embodiments, a specific subset of the GenBank may be used to improve analysis speeds further. For example, if the clinician is not concerned about potential plant matches, such reference genomes may not be applied to the alignment analysis and/or a separate subset or different database may be used to avoid potential plant genetic matches. Accordingly, some embodiments may align to a tailored search for potential pathogens, bacteria, etc. and avoid aligning to reference sequences that are not a part of the tailored search.
At step 250, in the comprehensive mode, de novo contig assembly may be performed to obtain contigs associated with the results of the assignment from the classified reference genome database. Example assembly software include ABySS and Minimo.
At step 260, in the comprehensive mode, the reads and the contigs are translated, and the translated nucleotides are aligned using another alignment algorithm (e.g., RAPSearch) and compared to a viral protein reference database. To generate translated nucleotides, the sequencing reads are translated in all 6 reading frames and the resulting amino acid sequences are then compared to a protein reference database. The output of RAPSearch is similar to that of BLAST but lists alignments that exceed a predesignated E-value significance threshold.
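As a non-limiting illustration, translation of a read in all six reading frames can be sketched in Python as follows. The sketch uses the standard genetic code, marks stop codons as `*`, and marks any codon containing an ambiguous base as `X`. The function names are assumptions made for this example and do not correspond to the RAPSearch implementation.

```python
from itertools import product
from typing import List

# Standard genetic code, with codons enumerated in T, C, A, G order.
_BASES = "TCAG"
_AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(_BASES, repeat=3), _AMINO)
}

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    return seq.translate(_COMPLEMENT)[::-1]


def translate(seq: str) -> str:
    """Translate complete codons; unknown codons (e.g., with N) become X."""
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3)
    )


def six_frame_translations(seq: str) -> List[str]:
    """Three forward frames followed by three reverse-complement frames."""
    seq = seq.upper()
    frames = []
    for strand in (seq, reverse_complement(seq)):
        for offset in range(3):
            frames.append(translate(strand[offset:]))
    return frames
```

Each of the six amino acid sequences could then be compared against the protein reference database by the translated-sequence aligner.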
At step 270, the results of the alignment algorithms are obtained, and may include a summary table 270A that provides the matching viral, bacterial, fungal, parasitic, and other reference genomes that have aligned to the sample sequence reads. Thus, in response to the alignment, the system can provide a table that shows alignments to bacterial and viral reference genomes. Summary table 270A may list all the hits to the different species in the sample and the different genera in the different families of organisms found in the sample, e.g., according to taxonomic classification 270C. For example, the system may divide the results into viruses, bacteria, non-chordate eukaryotes (basically fungi and parasites that do not have a backbone), human, as well as an “other” category. Chordates are all higher level eukaryotes, eukaryotic organisms that have a backbone. Eukaryotic organisms that do not have a backbone are more often microorganisms. These categories are used in order for the system to identify fungi and parasites as well as other organisms that do not have a backbone (e.g., invertebrates like worms). The system may find matches that are not necessarily microorganisms; for instance, the system may be capable of diagnosing a tapeworm infection.
Thus, at the end of the alignment process and in an initial table, the system may provide matches for different bacteria and may include the number of hits for each type of bacteria, virus, etc. Furthermore, in some embodiments, de novo assembly may be applied to recreate the genome, and coverage maps 270B of the genome may be provided to indicate how well the reads cover the genome of one of the genomes in the list.
Accordingly, the results of the alignment process of
Accordingly, embodiments of the present invention provide (1) filtering, (2) taxonomic classification, and (3) best match algorithms for identifying and providing the most useful data to a clinician. Additionally, embodiments provide additional features including RNA sequence removal and on-the-fly annotation of database entries to further assist interpretation of alignment results.
At step 310, a system may receive a sample from a patient and/or other biological item and/or entity. The sample may be provided by a clinician or other operator of the sample analysis system. The sample may take the form of nucleic acids that have already been prepared into a library for sequencing.
At step 320, the system may obtain sequence reads of nucleic acids from the sample. Any suitable method for obtaining DNA sequence reads may be used, e.g., as described herein.
At step 330, the system may preprocess the sequence reads. The preprocessing can include trimming sequence reads and removing some reads, e.g., low quality or low complexity reads.
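A minimal sketch of such preprocessing is shown below, assuming each read is a (sequence, per-base Phred quality list) pair. The quality cutoff, minimum length, and the single-base-fraction test for low complexity are illustrative assumptions, not parameters stated in this description, and the function names are hypothetical.

```python
def trim_low_quality_tail(seq, quals, min_qual=20):
    """Trim trailing bases whose quality is below min_qual (assumed cutoff)."""
    end = len(seq)
    while end > 0 and quals[end - 1] < min_qual:
        end -= 1
    return seq[:end], quals[:end]


def is_low_complexity(seq, max_fraction=0.8):
    """Flag reads dominated by a single base (a simple stand-in complexity test)."""
    if not seq:
        return True
    most_common = max(seq.count(base) for base in set(seq))
    return most_common / len(seq) > max_fraction


def preprocess(reads, min_len=10, min_qual=20):
    """Trim each read, then drop reads that are too short or low complexity."""
    kept = []
    for seq, quals in reads:
        seq, quals = trim_low_quality_tail(seq, quals, min_qual)
        if len(seq) < min_len or is_low_complexity(seq):
            continue
        kept.append((seq, quals))
    return kept


reads = [
    ("ACGTACGTACGT", [30] * 10 + [5, 5]),  # low-quality tail gets trimmed
    ("AAAAAAAAAAAA", [30] * 12),           # low complexity, removed
    ("ACGT", [30] * 4),                    # too short, removed
]
print(preprocess(reads))
```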
At step 340, the system may apply an alignment algorithm to the preprocessed sequence reads and obtain alignment results for the sequence reads. Each read can be aligned to a classified reference genome database, e.g., including millions of reference genome sequences. The alignment may return millions of “hits” or alignments to the classified reference genome database.
Some of the matches may not be perfect and may include sequence reads that match a portion of a reference genome sequence. A quality value may be returned with the alignment results that indicates a measurement and/or magnitude of similarity between the sequence read and the reference genome sequence. Additionally, individual reads may align with more than one reference genome. Accordingly, alignment results may include an identifier of the reference genome that was matched, classification information associated with the annotated information provided when a reference genome was uploaded to the classified reference genome database, and a similarity measurement indicating the quality and/or closeness of the alignment between the sequence read and the reference genome.
At step 350, the system may provide a taxonomic re-classification of the alignment results for reads that have aligned to multiple reference genome sequences with different taxonomic classifications. For example, the system may compare levels of classification between multiple aligned reference genomes for a particular sequence read and may remove one or more classification levels that do not match between the reference genome sequences aligned with a particular sequence read. Additional details and steps associated with the rapid taxonomic classification process are provided in reference to
At step 360, the system may filter the alignment results to remove false positives and mis-annotated or mis-classified reference genome sequences that aligned to the sequence reads of the sample. For example, the system may select a best match sequence read for an identified reference genome and may apply a second alignment technique to the best matching sequence read to identify if the same reference genome aligned with the second independent alignment. If the alignment of the sequence read does not match the same reference genome, the read and the reference genome may be removed from the alignment results as the reference genome is likely a false positive hit. Additional details and steps associated with the filtering steps are provided in reference to
In some embodiments, the filtering may include removal of a reference genome as a viable pathogen when the taxonomic classification is listed in the results of a no template control (NTC) for the current experiment or a database of previous NTC results. Further details are provided in a section below.
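As a rough illustration of the NTC-based removal, the sketch below assumes sample results are held as read counts keyed by taxonomic name and that NTC results (current run plus any historical database) are available as sets of names. These representations and the function name `remove_ntc_taxa` are hypothetical; the actual criteria may be more elaborate than simple membership.

```python
def remove_ntc_taxa(sample_counts, current_ntc, historical_ntc=frozenset()):
    """Drop taxa that also appear in the current or historical NTC results."""
    background = set(current_ntc) | set(historical_ntc)
    return {
        taxon: count
        for taxon, count in sample_counts.items()
        if taxon not in background
    }


sample_counts = {"taxon_A": 120, "taxon_B": 8, "taxon_C": 3}
print(remove_ntc_taxa(sample_counts, {"taxon_B"}, {"taxon_C"}))
```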
At step 370, the system can use the updated alignment results including the filtered and taxonomically re-classified sequence reads to identify one or more best matches for the sequence reads from the sample. For example, the best match may be selected based on the reference genome sequence that aligned with the most sequence reads.
At step 380, the system provides the results of the best matching analysis to the analysis system being operated by the clinician. The results may be provided via a network connection. The results may include the most probable one or more pathogens that may be responsible for a patient's illness.
At step 390, the analysis system provides and/or displays the results to the clinician. The results can be provided in any variety of ways, e.g., visually or by audio. The results can indicate a treatment to be provided, thereby providing a therapeutic intervention.
An objective of embodiments can be to pick the most informative one or more genome identifiers (GI) out of the GIs matched by all the sample reads of a single taxonomic assignment, a particular species. As part of the analysis, a score can be determined for each of the matched reference sequences, specifically each GI. In some implementations, the scores are determined for only GIs at a given taxonomic level, e.g., species. In other implementations, the total scores are determined for GIs across various (e.g., all relevant) taxonomic levels, with a top score being picked across all the GIs. The total scores can be determined from respective scores of various properties.
In some embodiments, each of the matched reference sequences can be scored as to any or all of length, coverage, identity, and percent identity. Thus, four scores can be determined and used to determine a total score. Examples of how the scores can be determined are provided below.
Length is the length of the reference sequence. Thus, a length score is larger for longer reference sequences. Therefore, a species of microorganism that has a larger genome would have a larger length score. In some embodiments, matches can be ranked by whether or not complete versus partial genomes are available in the database, where use of only a partial genome would affect the length score.
For each reference sequence, a coverage score can be determined. In one embodiment, the coverage score can be determined as a cumulative score representing a sum of the counts of aligned reads at each genomic position (e.g., base position) along the reference sequence. In another embodiment, the coverage score can be determined using a sum of genomic positions that have at least one read aligned to that position, e.g., regardless of whether there is a match at that position. The coverage score may be a percent of the genome that is “covered” by at least one aligned read, and thus the sum of genomic positions that have at least one read aligned to that position can be divided by the total number of positions in the database for that reference genome.
For each reference sequence, the identity score can be determined as a cumulative score representing a sum of fractional scores at each genomic position. A fractional score can be calculated as the number of aligned reads with a nucleotide matching the reference sequence divided by the number of aligned reads whether or not matching the reference sequence (i.e., the total number of aligned reads covering the position). Thus, a genomic position will have a larger fractional score when there are fewer mismatches of reads at that position.
Percent identity can be calculated by dividing the identity score by the coverage score.
Once the individual scores are determined, an overall score can be calculated for each reference sequence (taxonomic identifier). In one embodiment, the overall score can be determined by adding ‘length’ and ‘identity’ and multiplying the sum by ‘percent identity’. The reference sequences can be ranked by the overall score, e.g., in descending order. And, a top score(s) can be selected, e.g., by choosing the first reference sequence in the list. In some embodiments, more than one reference sequence can be selected, e.g., all having a total score above a score threshold. In other embodiments, the top N or X % of taxonomic identifiers can be identified.
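The overall score and ranking described above can be sketched as follows. The input layout (a dictionary of per-reference metrics) and the names `overall_score` and `rank_references` are assumptions made for illustration; the formula follows the one embodiment stated above, (length + identity) multiplied by percent identity.

```python
def overall_score(length, identity, percent_identity):
    """Overall score per the embodiment above: (length + identity) * percent identity."""
    return (length + identity) * percent_identity


def rank_references(metrics, score_threshold=None, top_n=None):
    """Rank reference sequences by overall score in descending order.

    metrics maps a reference identifier to a dict with 'length', 'identity',
    and 'percent_identity' (hypothetical layout). Optionally keep only scores
    above score_threshold and/or the top_n entries.
    """
    totals = {
        gi: overall_score(m["length"], m["identity"], m["percent_identity"])
        for gi, m in metrics.items()
    }
    ranked = sorted(totals.items(), key=lambda item: item[1], reverse=True)
    if score_threshold is not None:
        ranked = [(gi, score) for gi, score in ranked if score > score_threshold]
    if top_n is not None:
        ranked = ranked[:top_n]
    return ranked


metrics = {
    "gi_A": {"length": 1000, "identity": 900, "percent_identity": 0.5},
    "gi_B": {"length": 500, "identity": 100, "percent_identity": 0.25},
}
print(rank_references(metrics, top_n=1))
```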
Accordingly, for each sample, a score can be determined for each relevant GI (e.g., a set of taxonomy identifiers) at one or more taxonomic levels (e.g., for each species, genus, family, etc.). The scores may be determined only at one taxonomic level, or at a given level and all GIs at lower levels. As an example for each GI, the reads aligned to that GI can be grouped. For each GI, the coverage and identity score can be determined in the following manner. For each alignment of a read, the following calculation can be performed for each alignment position: map the read to the genomic position of the GI, the position coverage score is increased, and the position identity score is increased if there is a match. Across all genomic positions for the GI, the total coverage score can be increased by one for each position that has a position coverage score greater than zero. The total identity score can be determined as the sum of the fractional scores, computed as the position identity score divided by the position coverage score. The percent identity score can be determined as the total identity score divided by the total coverage score.
In some embodiments, the following pseudo code can be used:
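The listing itself does not appear in this text. As an illustrative stand-in rather than the original pseudo code, the Python sketch below follows the per-position calculation described above. It assumes ungapped alignments given as (start position, read sequence) pairs against a reference string; the function and variable names are hypothetical.

```python
def score_reference(reference, alignments):
    """Compute coverage, identity, and percent identity for one reference (GI)."""
    n = len(reference)
    position_coverage = [0] * n  # reads covering each position
    position_identity = [0] * n  # reads matching the reference at each position

    for start, read in alignments:
        for offset, base in enumerate(read):
            pos = start + offset
            if not 0 <= pos < n:
                continue  # ignore bases that fall outside the reference
            position_coverage[pos] += 1
            if base == reference[pos]:
                position_identity[pos] += 1

    # Total coverage: one per position with at least one aligned read.
    total_coverage = sum(1 for c in position_coverage if c > 0)
    # Total identity: sum of per-position fractions (matches / coverage).
    total_identity = sum(
        i / c for i, c in zip(position_identity, position_coverage) if c > 0
    )
    percent_identity = total_identity / total_coverage if total_coverage else 0.0
    return {
        "coverage": total_coverage,
        "identity": total_identity,
        "percent_identity": percent_identity,
    }


print(score_reference("ACGTACGT", [(0, "ACGT"), (2, "GTTC")]))
```

In this toy input, six of the eight reference positions are covered and one covered position has a mismatch, so the percent identity is 5/6.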
In various embodiments, an output (e.g., a file) can contain the picked reference sequence (GI) and a consensus sequence of the reads mapped to it. The term “coverage” can be used in two contexts: (1) one of the sub-metrics that make up the overall score for picking a GI; and (2) an overall metric, inferred from a coverage map file, for how comprehensively a set of mapped reads cover the picked GI.
For picking a GI, a particular species can have multiple genomic identifiers for different reference sequences in a database. The database of reference sequences used can vary or be constructed from some combination of separate databases. One database is the NCBI ‘nt’ database. It is highly redundant, and often mis-annotated. A single species can have thousands of entries, or GIs. A GI can be picked using embodiments of the best match algorithm above.
After picking a GI for each species, scores can be determined for each of the selected GIs (e.g., one for each species). The scores can be determined using all reads assigned to the species level, and can include reads assigned to higher levels. The scores can be determined as described above, and GIs with scores exceeding a threshold can be selected, where the threshold may be, e.g., an absolute number, a value above which the top N scores lie (where N is an integer), or a value above which the top X % of scores lie.
Assigned reads at the subspecies, species, genus, or family level can also be mapped to the selected GI for visualization. Mapping of reads at higher taxonomic levels (genus or family) will increase the coverage of the GI (the percentage of nucleotides that have at least one read mapped to them), but can also incur a risk of erroneous mapping from different species or genera in the sample. Real-time visualization of the coverage maps and pairwise identity plots can facilitate expert interpretation for clinical results reporting.
Some embodiments may analyze clinical or environmental metagenomic data (e.g., metadata) to classify the origin of each of the millions of next-generation sequencing reads (e.g. bacterial, viral, human, etc.) that are returned by the alignment results. However, individual reads may be informative only at a given taxonomic level (species, genus, family, etc.). For instance, a read in the conserved matrix gene of influenza viruses may be identical (or nearly identical) in influenza A and influenza B, and thus this read cannot be used to distinguish between influenza A and B. This is problematic if one is interested in species or strain level identification, which is critically important in infectious disease diagnosis (examples: Bacillus anthracis/anthrax versus Bacillus cereus; enterovirus versus rhinovirus; Ebola Zaire versus Ebola Reston).
As such, some embodiments may provide a taxonomic classification algorithm that removes some of the hits from consideration through the metagenomic data provided through the classification of the reference genomes. The taxonomic classification algorithm may apply a least common ancestor (LCA) approach where taxonomic identifiers for each of the associated reference genomes are analyzed to identify the lowest level of shared ancestry between two or more reference genomes. For instance, if a sequence has hits to both a human and a virus, and the match is conserved between a human and a virus, then the algorithm would move the result up to the proper taxonomic level, which would in this case be above the kingdom level. In such embodiments, the system would assume that it mapped to the human and would remove the result since the taxonomic classification moved up to the kingdom level.
As another example, if the system returned hits for a single virus but different versions of the virus (e.g., influenza A and influenza B), and the hits are indistinguishable between influenza A and influenza B, then those sequences would be assigned to the influenza genus, and not to the species level (i.e., influenza A and B). Thus, the system would move the classification for this reference genome up to a higher level (e.g., the influenza genus instead of mapping to two different influenza species, influenza A and influenza B).
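The least common ancestor approach in the two examples above can be sketched in Python as follows. The sketch assumes each matching reference genome carries a lineage ordered from the highest level to the lowest; the lineages shown are simplified illustrations in the spirit of the influenza example, not authoritative taxonomy, and `lowest_shared_level` is a hypothetical name.

```python
def lowest_shared_level(lineages):
    """Return the lowest (rank, name) pair shared by all lineages, or None.

    Each lineage is a list of (rank, name) pairs ordered from the highest
    taxonomic level down to the lowest.
    """
    shared = None
    for levels in zip(*lineages):
        if all(level == levels[0] for level in levels):
            shared = levels[0]
        else:
            break  # the lineages diverge below the last shared level
    return shared


flu_a = [("family", "Orthomyxoviridae"), ("genus", "influenza"), ("species", "influenza A")]
flu_b = [("family", "Orthomyxoviridae"), ("genus", "influenza"), ("species", "influenza B")]
human = [("kingdom", "Animalia"), ("genus", "Homo"), ("species", "Homo sapiens")]

print(lowest_shared_level([flu_a, flu_b]))  # shared at the genus level
print(lowest_shared_level([flu_a, human]))  # nothing shared: read can be discarded
```

A read hitting both influenza species is assigned to the genus, while a read hitting both a human and a viral reference shares no level and can be removed, consistent with the human/virus example above.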
Such classification allows a clinician to look at the results at different levels and provide potential options for treatment. Using the influenza A or influenza B example above, the result may indicate that the system matched either influenza A or influenza B instead of influenza A. For example, if the system is analyzing an influenza A sample (i.e., a sample taken from a patient infected with influenza A), the sequence results will likely show that 90% of the sample sequencing reads align to influenza A and 10% align to influenza B because of the indistinguishable portions of the genetic material between influenza A and B. Reads may be from regions that are highly conserved between these two species of influenza. Thus, a clinician may not know, looking at the raw data, whether this patient is infected with both influenza A and B, or is infected with influenza A alone with some reads misaligned to influenza B, or vice-versa. However, by applying the taxonomic classifier and properly classifying those reads that were initially assigned to influenza B, the system moves the classification of those reads up to the genus level, and the remaining reads assigned to influenza A would allow a clinician to determine that the patient has an influenza A infection instead of making the erroneous call of a dual infection with influenza A and B.
Thus, after shared reference genomes have been classified at the appropriate level, the remaining reads in the alignment results should be specific at each taxonomic level and thus, the remaining hits that are species-specific are known to be particular to that species. By removing the alignment results that are shared between multiple reference genomes sharing a genus, family, etc., the results will be more accurate and usable by the clinician. Reads shared between reference genomes of the same genus are reassigned to the genus, as opposed to a particular species associated with the aligned reference genome. Identifying species-specific reads can be extremely informative to clinicians where the clinicians know that the read is specific to the particular species, sub-species, strains, etc. It can also be useful to know the higher taxonomic level information including the number of reads at a particular genus, family, etc. Thus, embodiments provide more informative and accurate information for a clinician.
At step 410, the system receives the alignment results from applying an alignment algorithm to the sequence reads from the sample. In some embodiments, the alignment results may have been previously filtered using techniques described herein. The rapid taxonomic classification process may be performed concurrently with the alignment process and thus, before the filtering process has been completed. Thus, the system can receive a plurality of sequence reads obtained from a sequencing of DNA molecules from the sample of biological material where the sample includes DNA molecules from a plurality of organisms and aligns the sequence reads using an alignment technique to align the plurality of sequence reads to a plurality of classified reference genomes in a database.
The system may obtain initial alignment results that include, for each of at least a portion of the sequence reads, a matching reference genome to which the sequence read aligns. The initial alignment results may further include classification information for each of the matching reference genomes where the classification information includes a taxonomic identifier including multiple classification levels for each reference genome. For example, a reference genome (e.g., enterovirus D) may include a species (e.g., enterovirus D), genus (e.g., enterovirus), and family (e.g., picornaviridae) classification level within a taxonomic identifier. Many other classification levels may also be assigned and included in the taxonomic identifier and corresponding classification information with the initial alignment results for each aligned reference genome.
At step 420, the system identifies a set of the sequence reads that aligned to two or more of the classified reference genomes. For example, in
Steps 430-490 can be performed for each of the sequence reads identified in step 420.
At step 430, the system identifies two or more matching reference genomes of the classified reference genomes to which the sequence read aligns with at least the minimum alignment threshold. For example, using the example provided above in reference to
At step 440, the system selects a set of reference genomes for a sequence read that matches to multiple reference genomes. For example, using the example provided above, the system may select the human rhinovirus and the human enterovirus K1105 reference genomes that may have been associated with the same read. The selected set of matching reference genomes may correspond to those identified in step 430. In other embodiments, other factors may be used to determine the selection of the set of reference genomes to use. Such other factors can include the magnitude of the differences between the “best” and “second-best” alignments, the genomic sequence diversity corresponding to the detected microorganism (e.g. viral genomes are more diverse than bacterial genomes), and the completeness of the available reference databases, including the availability of “near-neighbor” reference genomes (i.e. genomes that are closely matched in the reference database).
At step 450, the system assigns a taxonomy identifier from the classification information to each of the two or more of the classified reference genomes. The taxonomy identifier may include at least two levels of classification (e.g., species, genus, family, etc.) and the levels may have a hierarchy such that there is a lower level (e.g., species) and at least one higher level (e.g., genus, family, etc.). For example, using the example of
At step 460, the system compares the assigned taxonomy identifier of each of the two or more classified reference genomes at each of the at least two levels of classification. For example, using the example of
At step 470, the system removes each level of the at least two levels from the assigned taxonomy identifier that does not match between the two or more classified reference genomes. For example, using the example of
At step 480, the system may assign to the sequence read the lowest level of the at least two levels of the assigned taxonomy identifiers that is shared between the two or more classified reference genomes. For example, using the example of
As shown in
At step 490, the system determines whether all of the reads matching multiple reference genome sequences have had taxonomic classification processing applied. If there are additional reads matching multiple reference genomes to be analyzed, the system may repeat the process for the next sequence read in the initial alignment results (and may return to step 430). This process may continue until all of the sequence reads matching to two or more reference genome sequences have been analyzed to ensure the reference genomes have the appropriate level of classification.
At step 495, the system provides an identification of one or more taxonomy identifiers corresponding to one or more candidate pathogens based on numbers of corresponding sequence reads assigned to each of the plurality of taxonomy identifiers. The identification can be the taxonomy identifiers with or without additional information. For example, the number of reads assigned to each of the identifiers may or may not be included. The fact that the identifier(s) are being provided can indicate that the identifier(s) are candidate pathogens, i.e., a sufficiently high likelihood of existing in the sample based on the number of reads assigned. Various criteria can be used to determine whether a pathogen is identified as a candidate pathogen, e.g., using the best match algorithm for picking one or more identifiers (as described herein) and criteria for determining whether the best matched taxonomy identifier provides a sufficient match to be identified as a candidate pathogen.
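One simple way to realize a read-count criterion at step 495 is sketched below. It assumes the classification step has produced a mapping from read identifier to assigned taxonomy identifier (or None when the read was discarded); the minimum read count and the name `candidate_taxa` are illustrative assumptions, and other criteria described herein could be substituted.

```python
from collections import Counter


def candidate_taxa(read_assignments, min_reads=2):
    """Count reads per taxonomy identifier and keep those with enough reads."""
    counts = Counter(
        taxon for taxon in read_assignments.values() if taxon is not None
    )
    return [(taxon, n) for taxon, n in counts.most_common() if n >= min_reads]


read_assignments = {
    "r1": "influenza A",
    "r2": "influenza A",
    "r3": "influenza A",
    "r4": "influenza (genus)",
    "r5": None,  # read removed during classification
    "r6": "influenza (genus)",
    "r7": "rhinovirus",
}
print(candidate_taxa(read_assignments))
```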
In some embodiments, sequence reads having an assigned taxonomy identifier can be used to identify the one or more taxonomy identifiers corresponding to one or more candidate pathogens. For each of a plurality of taxonomy identifiers, a total score can be determined based on at least coverage of a reference genome corresponding to the taxonomy identifier. The total scores can be ranked, and one or more taxonomy identifiers exceeding a threshold can be identified.
In one embodiment, updated alignment results can be provided to a clinician where the updated alignment results include a number of reads for each of the plurality of taxonomy identifiers. As shown in
In some embodiments, the classification of a read can begin once alignment results for that read are obtained. Thus, the alignment results for all reads are not required before classification can begin. Accordingly, the two processes may be performed in parallel (e.g., in separate threads) with the output of the alignment thread being used by the classification thread, and with the classification thread operating with a slight delay because it must wait for the alignment results for a first read.
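The producer/consumer arrangement described above might look like the following sketch, which uses Python's standard `threading` and `queue` modules. The `align` and `classify` callables are placeholders supplied by the caller (hypothetical stand-ins for the real aligner and classifier), and `run_pipeline` is an illustrative name.

```python
import queue
import threading


def run_pipeline(reads, align, classify):
    """Classify each read as soon as its alignment results are available."""
    results_queue = queue.Queue()
    done = object()  # sentinel marking the end of alignment output
    classified = {}

    def alignment_worker():
        for read_id, seq in reads:
            results_queue.put((read_id, align(seq)))
        results_queue.put(done)

    def classification_worker():
        while True:
            item = results_queue.get()  # blocks until the first result arrives
            if item is done:
                break
            read_id, hits = item
            classified[read_id] = classify(hits)

    workers = [
        threading.Thread(target=alignment_worker),
        threading.Thread(target=classification_worker),
    ]
    for worker in workers:
        worker.start()
    for worker in workers:
        worker.join()
    return classified


# Toy stand-ins: "align" returns the first base, "classify" picks the first hit.
print(run_pipeline([("r1", "ACGT"), ("r2", "TTTT")],
                   align=lambda seq: [seq[0]],
                   classify=lambda hits: hits[0]))
```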
Once the candidate pathogen(s) are identified, additional available clinical and laboratory data can be helpful in determining whether or not the detected organism is pathogenic (i.e. causing disease) in the host organism (e.g. a human patient). The detection of the presence of a potential pathogen in a clinical sample does not necessarily mean that it is causing disease; the potential pathogen could be a colonizer, for instance, or a bystander and have nothing to do with the host organism's illness. If the candidate pathogen is deemed to be pathogenic by clinical and other criteria, the detection can be used to guide clinical interventions, which can include (1) drug therapy (e.g. prescribing or administration of a targeted antimicrobial agent), (2) drug discontinuation (e.g. discontinuing a drug that was administered empirically in the absence of a definitive diagnosis), (3) vaccination, if a vaccine is available and efficacious after infection (e.g. rabies), and (4) medical procedures (e.g. valve replacement in cases of fungal endocarditis, for which antifungal therapy alone is ineffective). The failure to detect a candidate pathogen may also be clinically useful to “rule out” infection as the cause of illness, which can guide clinicians to work up and treat for non-infectious causes (e.g. administering intravenous immunoglobulin and steroids for autoimmune disease).
Accordingly, when the sample of biological material is from a host organism, a clinical intervention can be performed for the host organism based on the identification of the one or more taxonomy identifiers corresponding to the one or more candidate pathogens. The clinical intervention can include the examples above. The intervention can include actually performing/administering any of the above procedures or prescribing them.
The alignment results can provide a distance (e.g., an edit distance) between the sequence read and the matching region of a reference genome. The distance can provide a measure of how many nucleotide differences there are between the actual sequence read and the reference. An edit distance can be the minimum number of editing operations (e.g., insertion, deletion, and substitution) to transform the read into the reference genome, or vice versa. Using the edit distance can help to perform the classification quickly.
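For reference, the edit distance defined above (minimum number of insertions, deletions, and substitutions) can be computed with the classic dynamic-programming recurrence. The sketch below is a plain Levenshtein implementation between two strings; it is not the optimized routine used inside any particular aligner, and `edit_distance` is an illustrative name.

```python
def edit_distance(a, b):
    """Levenshtein distance between strings a and b."""
    previous = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                       # deletion from a
                current[j - 1] + 1,                    # insertion into a
                previous[j - 1] + (char_a != char_b),  # substitution or match
            ))
        previous = current
    return previous[-1]


print(edit_distance("ACGT", "AGGT"))  # one substitution
print(edit_distance("ACGT", "ACT"))   # one deletion
```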
Once a distance is obtained, the reference genomes can be ranked by the distance. The taxonomic classification can place the read at different species, genus, or family level based on edit distances to different matching reference genomes. Thus, the taxonomic locations for each of the reference sequences can be determined.
The desired edit distance for determining matched alignments will vary depending on the sequence diversity of the organism. Thus, the minimum alignment threshold can vary depending on which reference genomes are involved. For instance, viruses have more divergent genomes (i.e., greater sequence diversity). For viruses, a relatively large threshold can be used, e.g., up to 12 mismatches in the fixed read length. For bacteria and fungi, a smaller distance may be tolerated because bacterial genomes within the same species are nearly identical, and it is desirable not only to identify the species but also to be specific.
For instance, if there are two different subtypes of enteroviruses, 70 and 71, there can actually be up to 30% difference by sequence between those two viruses. On the other hand, two bacterial species, such as Staphylococcus aureus and Staphylococcus epidermidis, can be 99% identical across the genome. Edit distances can be used as a criterion for classifying these reads, and different taxonomic groups would require different edit distances. A suitable matching threshold distance can be determined empirically, meaning from clinical data or data generated from positive and negative controls where it is known what is in the sample, e.g., what pathogens are in the sample.
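A group-specific threshold could be expressed as a simple lookup, as in the sketch below. Only the viral value (up to 12 mismatches) comes from the description above; the bacterial and fungal values are placeholders chosen purely for illustration and would in practice be determined empirically as described. The names `MAX_EDIT_DISTANCE` and `passes_threshold` are hypothetical.

```python
# Maximum tolerated edit distance per taxonomic group.
# "virus": 12 follows the example in the text; the other values are
# hypothetical placeholders, to be replaced by empirically derived thresholds.
MAX_EDIT_DISTANCE = {"virus": 12, "bacteria": 3, "fungi": 3}


def passes_threshold(distance, group, thresholds=MAX_EDIT_DISTANCE):
    """Return True when an alignment's edit distance is within the group's limit."""
    return distance <= thresholds[group]


print(passes_threshold(10, "virus"))     # tolerated for a divergent viral genome
print(passes_threshold(10, "bacteria"))  # too distant under the assumed limit
```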
One embodiment of the present invention is directed to filtering out sequences that appear to be false positives and/or are incorrectly annotated and/or classified. For example, some reads may initially appear to align to a particular classified reference genome using a first alignment algorithm (e.g., a global alignment algorithm like, for example, SNAP or Needleman & Wunsch) but a more comprehensive alignment (e.g., using a local alignment algorithm like, for example, “Basic Local Alignment Search Tool” (BLASTn) or Smith & Waterman) shows that even a best read can actually align to a different reference sequence using different alignment algorithms.
Furthermore, some reference genomes within GenBank or another large database of classified reference genomes may be mis-annotated taxonomically (i.e., are classified incorrectly). Additionally, some reference genomes may include portions of sequences that can be assigned to multiple taxonomies. If a sequence in the dataset aligns to the “erroneous” portion of that reference genome sequence, it will be mis-assigned by an alignment process. For example, an HIV viral sequence with flanking human integration sites may be annotated as HIV and may cause erroneous matches and identification of potential pathogens. As such, embodiments are directed at filtering these “erroneous” reference genomes using a filtering algorithm.
Accordingly, embodiments of the present invention can be applied after a sequencing alignment analysis to help filter, categorize, and interpret the results of the sequencing analysis to reduce the list of potential pathogens to a more manageable and easily interpretable result. Thus, embodiments may be directed at filtering results once the raw reads, raw contigs, and assembled contigs are determined. Embodiments can take the reads and annotate them with what the hits were from the alignment analysis. For example, the system may take the results of the analysis and determine how accurate and/or good of a hit a result is. The system can take that annotated raw output and further filter it to remove mis-annotated results and false positive aligned reference genomes. Generated result tables showing the alignments and the coverage maps can be determined and displayed after filtering.
At step 510, alignment results are obtained in response to applying an initial alignment technique (e.g., using a global alignment algorithm) to a plurality of sequence reads from a sample to align the sequence reads to reference genomes in a global database. The alignment results include, for each of at least a portion of the sequence reads in the sample, a matching reference genome to which the sequence read aligns. For example, the alignment technique may include a global alignment algorithm that takes each sequence read and searches a database of classified reference genomes for aligned portions of the sequence reads to portions of the classified reference genomes. The results may include multiple aligned reference genomes for each of the reads, and each reference genome sequence may have many different reads that align to a portion of the reference genome.
The alignment results may include a plurality of sequence reads, sequence read identifiers, a plurality of aligned classified reference genomes, reference genome identifiers from the database of reference genomes, an alignment and/or similarity measurement for each of the alignments, a taxonomy identifier and/or other classification information associated with each of the reference genomes, and any other suitable information that may be used in alignment and identification processes. For example, the source and/or database identifier where the aligned reference genome was stored, the source where the reference genome was provided from (e.g., hospital, clinic, company, provider, etc.), and the date and/or time of the upload may also be provided.
Additionally,
Further, the results table may include a tag (740) for each reference genome that may identify the database that the reference genome was aligned with and the type of reference genome (e.g., human, plant, bacteria, etc.). Finally, the result table may include the number of reads that aligned to each reference genome for each sample. For example, for the first sample (column 750) shown in
At step 520, the system identifies all reference genomes that matched to a read from the sample. For example, for the alignment results of sample #1 shown in
At step 530, the system identifies an optimally-aligning sequence read that aligns to the matching reference genome with an optimal alignment score that exceeds or is equal to alignment scores of other sequence reads that align to the matching reference genome. The optimally-aligning sequence read and optimal alignment score can be identified based on the initial alignment results. The optimally-aligned sequence read may be determined by analyzing the reads that aligned with the reference genome and comparing their alignment value for the closest match. Any process may be used to identify the best match; for example, the read with the most significant match may be defined in SNAP by minimum edit distance and in RAPSearch by minimum E-value.
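Selecting one optimally-aligning read per reference genome can be sketched as below, assuming the initial alignment results are available as (read identifier, reference identifier, edit distance) tuples and that a smaller edit distance is a better alignment, as with SNAP. The tuple layout and the name `best_read_per_reference` are illustrative assumptions.

```python
def best_read_per_reference(alignments):
    """Map each reference identifier to its (read_id, distance) with minimum distance."""
    best = {}
    for read_id, ref_id, distance in alignments:
        # Keep the first read seen at the minimum distance for each reference.
        if ref_id not in best or distance < best[ref_id][1]:
            best[ref_id] = (read_id, distance)
    return best


alignments = [
    ("read1", "ref_A", 4),
    ("read2", "ref_A", 1),
    ("read3", "ref_B", 7),
    ("read2", "ref_B", 9),
]
print(best_read_per_reference(alignments))
```

For an aligner that reports E-values, the same selection applies with the E-value in place of the edit distance.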
At step 540, the system applies a second alignment technique for the optimally aligning sequence read to the plurality of classified reference genomes to obtain a plurality of new alignment scores. The second alignment technique may include a different alignment technique, a different reference database, and/or a combination thereof. For example, a local alignment algorithm may be applied for the second alignment technique that may take longer to process for each sequence read but because there are fewer sequence reads to process (e.g., only the identified optimally-aligning sequence reads from step 530), the processing time may not overly delay the identification process.
For example, the first alignment technique may include a global alignment algorithm (e.g., SNAP/RAPSearch) that provides faster analysis, but may be sensitive to poor quality sequence regions. The second alignment technique may include applying a local aligner (e.g., BLASTn, Bowtie2, etc.) that may be much slower (1,000×-10,000× slower), but that is less sensitive to poor reads and may provide a different alignment of the sequence read that can be compared to the results of the first alignment technique. Thus, the use of the two different alignment techniques allows the system to correct rare “false positives” that occur from the first alignment technique (e.g., SNAP alignment). By selecting only one “test hit” read per reference genome for the second alignment technique (e.g., BLASTn confirmatory alignment), the system can greatly speed up the turnaround time of the process.
At step 550, the system compares the re-alignment results obtained using the second alignment technique for the optimally-aligning read with the initial alignment results obtained using the first alignment technique. For example, the second alignment technique may align the sequence read with reference genomes that were not aligned using the first alignment technique, and/or the re-alignment may return different alignment scores than the original alignment.
At step 560, the system determines whether the alignment results using the first alignment technique match the alignment results using the second alignment technique. The comparison of the first alignment results and the second alignment results to determine a match may be accomplished through any suitable manner. For example, the system may determine whether any of the new alignment scores exceeds the optimal alignment score for the optimally aligning sequence read. In other embodiments, the system may determine whether the same reference genome was aligned with the sequence read. Further, other embodiments may determine whether the aligned reference genome sequences share the same taxonomic identifier (e.g., the two reference genome sequences share the same family, genus, and species classifications).
At step 570, if the alignment results for the sequence read are determined not to match, the system may remove the reference genome sequence from the results and all corresponding sequence reads that aligned with the reference genome sequence. In some embodiments, if a sequence read aligned with multiple different reference genomes, only the association of that read with the currently analyzed reference genome sequence may be removed and the read itself (along with the alignment result to the other reference genome sequence) may not be removed. Thus, for the exemplary comparison method of comparing the alignment scores discussed above, when one new alignment score exceeds the optimal alignment score for the optimally aligning sequence read, the system may remove the matching reference genome from the set of matching reference genomes and the corresponding sequence reads associated with the reference genome. In some embodiments, if the two reference genomes share a same taxonomic identifier, the matching reference is not removed.
For example,
At step 580, if the alignment results for the sequence read are determined to match, the system may consider the reference genome a real match that is not mis-annotated and may maintain the reference genome and aligned sequence reads in the alignment results.
At step 590, the system determines whether all of the reference genomes have had the filtering process applied. If there are additional reference genomes to be analyzed, the system may repeat the process for the next aligned reference genome in the initial alignment results (and may return to step 530). This process may continue until all of the aligned reference genomes have been analyzed for false-positive matches and are removed and/or confirmed as matches.
At step 595, once the system has applied the filtering process to all of the aligned reference genomes, the process may provide the updated alignment results (e.g., not including the selected reference genomes from step 570) to the sample alignment system for display to an operator/clinician. In some embodiments, just the set of matching reference genomes is output. In some embodiments, additional processing may be applied to the updated alignment results before returning the updated alignment results to the sample analysis computer, as described herein.
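The loop over steps 520-595 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation of the described system: the data layout, the function name `filter_false_positives`, and the `realign` callback standing in for the second alignment technique are all hypothetical. Scores are assumed to be "higher is better", and the match test uses the "same reference genome or same taxonomic identifier" criterion from step 560 rather than a direct score comparison.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical layout: first-pass results map each reference genome ID to the
# (read ID, score) pairs that aligned to it; higher score = better match.
FirstPass = Dict[str, List[Tuple[str, float]]]
# Hypothetical second aligner: takes a read ID, returns {genome ID: score}.
SecondAligner = Callable[[str], Dict[str, float]]


def filter_false_positives(first_pass: FirstPass,
                           realign: SecondAligner,
                           taxonomy: Dict[str, str]) -> FirstPass:
    """Keep only genomes whose best read is confirmed by the second aligner."""
    kept: FirstPass = {}
    for genome_id, hits in first_pass.items():            # steps 520 and 590
        if not hits:
            continue
        best_read, _ = max(hits, key=lambda h: h[1])      # step 530
        new_scores = realign(best_read)                    # step 540
        if not new_scores:
            continue  # assumption: no re-alignment at all counts as a mismatch
        top_genome = max(new_scores, key=new_scores.get)   # step 550
        confirmed_taxon = taxonomy.get(top_genome)
        same_genome = top_genome == genome_id
        same_taxon = (confirmed_taxon is not None
                      and confirmed_taxon == taxonomy.get(genome_id))
        if same_genome or same_taxon:                      # steps 560 and 580
            kept[genome_id] = hits
        # otherwise step 570: the genome and its read associations are dropped
    return kept                                            # step 595
```

Because only one read per reference genome is passed to `realign`, the cost of the slower second aligner grows with the number of matched genomes rather than with the number of reads.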
Accordingly, embodiments may apply a filtering algorithm to the results that identifies all the false positive hits and systematically removes the false positive hits corresponding to a given genome. The system can identify the particular genomes that provided a sufficient quality alignment. The system can then further analyze those reads to remove further alignments that do not appear to be high quality. The filtering algorithm can apply even more stringent criteria to remove hits that are likely not real and instead are due to a misannotation of a reference genome database and/or a misclassification.
For example, a particular read might have two, five, or more hits. If a genome had a single hit then it is likely an accurate hit, but if a read hit ten different reference genomes, then it is not likely that the hit is of a high quality. Thus, the filtering could remove all those hits and generate a new summary table and/or coverage map for the genome. Further, if the coverage map for a particular genome is now drastically reduced due to removing the false positive hits, then the system may remove the genome as a potential pathogen or cause of the illness. Accordingly, embodiments of the present invention may filter what was fifty results (e.g., fifty different bacteria) down to the five bacteria that are the highest quality and most likely real results. This allows for much easier and faster clinical analysis and/or interpretation.
In some embodiments, methods 400 and 500 can be combined in a single pipeline, e.g., in a research pipeline. In other embodiments, method 500 may not be performed, e.g., a clinical pipeline might not include method 500. For example, the taxonomic classification may be sufficiently accurate that filtering may not be needed. And, such a pipeline can operate faster without the filtering step, particularly one that uses BLASTn. The turnaround time for a clinical pipeline may be desired within 4-6 hours, whereas more time is acceptable for a research pipeline. Thus, filtering, RAPSearch translated nucleotide alignment, and contig assembly may be retained in a research pipeline, but dropped for a clinical pipeline.
Additionally, in some embodiments, ribosomal RNA sequences may be removed from the alignment results. Sequences derived from ribosomal regions are difficult to speciate/classify accurately due to conservation of the ribosomal sequences. Accordingly, embodiments may remove ribosomal sequences from the alignment output by aligning sequence reads that had been assigned to bacteria against bacterial ribosomal 16S/23S reference genome sequences. Additionally, the aligned sequence reads assigned to eukaryotes (fungi and parasites) may be further aligned to 18S/28S GIs. Any sequence reads that align to these bacterial or eukaryotic ribosomal reference genomes may be removed from the alignment results.
Further, the number of reference genomes stored in classified reference genomic sequence databases is expanding rapidly, and such databases may double in size every 3 years or faster. However, there are very few controls over who provides reference genomic sequences or over the quality of those sequences. Accordingly, embodiments can include automated scripts to (1) download all sequences from GenBank (currently ~72 GB of data), (2) remove all entries that are poorly annotated by screening out key words in the description of a reference genome sequence entry (e.g., “uncultured”, “unclassified”, “environmental”), (3) remove all vector sequences, and (4) chop up the classified reference genome sequence database (e.g., GenBank) into “chunks” and convert them into reference databases to be used by local and global alignment algorithms (e.g., SNAP and RAPSearch). A database may be chopped up so as to fit into memory.
As mentioned above, a negative control can be used to identify contaminants or other microorganisms that are not clinically relevant. Such removal of background microorganisms is particularly relevant to clinical applications, where diagnosis and treatment of a patient are a goal. Embodiments can provide one or more criteria for discriminating between a pathogen and a contaminant.
In some embodiments, one or more control samples can be analyzed in parallel (simultaneously) with one or more patient samples (e.g., 5-8 patient samples). A positive control can consist of several different organisms spiked into negative matrix, e.g., spinal cord fluid. A negative control sample (also called a no-template control) has no spiked organisms. The negative control can be just the buffer that is used for a PCR reaction, which is part of the library preparation for sequencing.
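Curation steps (2) through (4) can be sketched in Python as shown below. This is only an illustration under stated assumptions: the `(description, sequence)` record layout, the function names, and the rule that a vector entry is recognized by the word "vector" in its description are hypothetical simplifications, and the download step and the conversion of chunks into aligner-specific indexes are omitted.

```python
from typing import Iterable, Iterator, List, Tuple

# Keywords taken from the text above; the record layout is hypothetical.
POOR_ANNOTATION_KEYWORDS = ("uncultured", "unclassified", "environmental")
Record = Tuple[str, str]  # (description, sequence)


def is_well_annotated(description: str) -> bool:
    """Step (2): reject entries whose description contains a screening keyword."""
    lowered = description.lower()
    return not any(keyword in lowered for keyword in POOR_ANNOTATION_KEYWORDS)


def is_vector(description: str) -> bool:
    """Step (3). Assumption: vectors are recognizable from the description text."""
    return "vector" in description.lower()


def chunk_records(records: Iterable[Record],
                  max_bases: int) -> Iterator[List[Record]]:
    """Step (4): group the retained records into chunks of bounded total length."""
    chunk: List[Record] = []
    size = 0
    for description, sequence in records:
        if not is_well_annotated(description) or is_vector(description):
            continue
        # Start a new chunk when adding this record would exceed the budget.
        if chunk and size + len(sequence) > max_bases:
            yield chunk
            chunk, size = [], 0
        chunk.append((description, sequence))
        size += len(sequence)
    if chunk:
        yield chunk
```

Here `max_bases` stands in for whatever memory budget the aligner index requires; a single record longer than the budget is still emitted as its own chunk.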
The positive control can be used to confirm that the spiked microorganisms are identified. The positive control can ensure that the procedure is robust. For instance, if the sample is a bloody fluid, the heme in blood may actually inhibit PCR. There can be a negative result, but it might be a false negative. Thus, if the positive control is negative, then the sample can be identified as a false negative. The external positive control can ensure that there are a certain number of reads for each of the spiked organisms (e.g., 7 different organisms), in order for the run to pass quality control.
With the no template control (e.g., a buffer for a PCR reaction or other library preparation step), any reads classified (i.e., hits) as aligning to a microorganism can be identified as corresponding to background. As examples, such false hits can correspond to: reagent contamination, contamination introduced in a laboratory, or contamination introduced from other samples (cross-contamination). In this manner, the number of hits can be reduced, and proper treatment for an actual pathogen is more likely to be performed.
An internal spiked control can include organisms spiked into a patient sample. For example, a DNA phage and/or an RNA phage can be spiked in specific concentrations in a clinical sample. Embodiments can check to make sure that the phages can be detected in the DNA library, e.g., a sufficient number of sequences from the DNA phage. Similarly, a check can make sure that the phages are detected in the RNA library, e.g., a sufficient number of sequences from the RNA phage. This control can be an internal control on every sample, in addition to external control.
After analyzing the data from the sequencing of the samples (e.g., aligning, classifying, filtering), a list of potential candidate pathogens can be obtained. The list can be a table showing a number of reads seen for each of a plurality of species or other taxonomic classification. The number of reads can be normalized. For instance, the number of reads can be determined per million reads. If there are 10 hits and 10 million reads, the RPM (reads per million) will be one. The total reads can be raw reads that were generated per clinical sample. But, there can be further normalization.
An RPM can also be determined in the negative control. An RPM ratio can be determined from the reads per million (i.e., of a particular taxonomic classification) in the sample divided by the reads per million in the negative control. For example, assume there are 10 normalized reads for a virus (e.g., HIV) in the sample, and two normalized reads in the no-template control (NTC). In that case, the RPM ratio would be five. If there are no reads in the negative control, then there would be a division by zero error. In such cases, a value of one can be used as the RPM from the NTC sample. In other words, the RPM ratio can then be the reads per million in the sample.
The RPM ratio allows for easier discrimination of what is really in the sample and what is background. A threshold (e.g., 10) can be used for the discrimination. Thus, pathogens with an RPM ratio of greater than 10 can be identified as clinically significant. Using this criterion, very good sensitivity and specificity can be obtained. The use of the RPM ratio reduces the number of false positives. In particular, very good test performance is obtained relative to using the Bayesian probability, which is based on just the number of reads in a sample. Such probabilistic methods can be used when determining an error model, and do not involve normalization from reads in a negative control.
Tables 1-4 below show revised accuracy data with and without discrepancy testing for results-based and sample-based testing. Discrepancy testing refers to the use of an orthogonal (i.e., different kind of) test to resolve discrepant results between the mNGS assay and conventional clinical microbiology laboratory testing. For instance, if a sample is found to be positive for Mycobacterium tuberculosis by mNGS but is negative by culture (because some Mycobacterium tuberculosis strains grow slowly or not at all in culture depending on pathogen titer and sample type), a Mycobacterium tuberculosis PCR test can be used as an orthogonal test for discrepancy testing, and the results of the orthogonal test are taken as correct for the purposes of determining accuracy.
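The normalization and ratio described above can be written out as a short Python sketch. The function names are hypothetical; the arithmetic follows the worked numbers in the text, including the substitution of one for an NTC value of zero.

```python
def reads_per_million(hits: int, total_reads: int) -> float:
    """Normalize a read count to reads per million (RPM)."""
    if total_reads <= 0:
        raise ValueError("total_reads must be positive")
    return hits * 1_000_000 / total_reads


def rpm_ratio(sample_rpm: float, ntc_rpm: float) -> float:
    """RPM in the sample divided by RPM in the no-template control (NTC).

    When the NTC has no reads for the taxon, one is used as the denominator
    to avoid dividing by zero, so the ratio equals the sample RPM.
    """
    denominator = ntc_rpm if ntc_rpm > 0 else 1.0
    return sample_rpm / denominator


# Worked values from the text:
print(reads_per_million(10, 10_000_000))  # 10 hits in 10 million reads
print(rpm_ratio(10, 2))                   # 10 normalized reads vs. 2 in the NTC
print(rpm_ratio(10, 0))                   # no reads in the NTC
```

The three calls print 1.0, 5.0, and 10.0 respectively, matching the examples given in the text.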
For sample-based testing, for each sample, performance is evaluated only as it pertains to detection of the original organism reported by the reference lab. For samples completely negative by reference lab testing, the ability to detect all 5 organism types is evaluated.
For result-based testing, for each sample, performance is evaluated as it pertains to detection of all 5 organism types (bacteria, fungi, DNA virus, RNA virus and parasites). Only results for acceptable samples (passed quality control metrics of >5,000,000 reads per library and >10 RPM of either an internally spiked DNA control, T1 bacteriophage, for DNA libraries, or an internally spiked RNA control, M2 bacteriophage, for RNA libraries; sufficient sample volume) are provided in the tables below.
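The acceptance criteria quoted above (more than 5,000,000 reads per library, more than 10 RPM of the internally spiked control, and sufficient sample volume) can be expressed as a small check. This is a sketch only; the function and parameter names are hypothetical, and how "sufficient sample volume" is decided is left to the caller.

```python
def passes_quality_control(total_reads: int,
                           internal_control_reads: int,
                           sufficient_volume: bool,
                           min_reads: int = 5_000_000,
                           min_control_rpm: float = 10.0) -> bool:
    """Return True when a library meets the acceptance criteria in the text."""
    if not sufficient_volume or total_reads <= min_reads:
        return False
    # RPM of the spiked phage control (DNA or RNA, depending on library type).
    control_rpm = internal_control_reads * 1_000_000 / total_reads
    return control_rpm > min_control_rpm
```

For instance, a library of 6,000,000 reads with 120 internal control reads has a control RPM of 20 and passes, whereas the same library with 30 control reads (5 RPM) does not.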
Table 1 shows sample-based accuracy (most clinically significant result per sample) without discrepancy testing.
Table 2 shows results-based accuracy (includes multiple positive/negative results per sample) without discrepancy testing.
Table 3 shows revised sample-based accuracy after discrepancy testing.
Table 4 shows revised results-based accuracy after discrepancy testing. Only results for acceptable samples are provided.
After discrepancy testing and including results passing IC with sufficient sample volume, overall sensitivity increases to 86.1% (sample-based) and 85.5% (results-based), while maintaining high specificity. Given the difficulty of making an etiologic diagnosis in many cases of encephalitis/meningitis, a test sensitivity of 80-90% with specificity >95% is very useful in patient management decisions. The above data show that the overall test accuracy is acceptable for clinical use.
The tables can be displayed in a visualization program. The tables can be provided in a user interface, in a normalized or non-normalized manner. A normalized table makes it easier to identify pathogens than a list of hundreds of hits does. For example, many hits can be removed after normalization to obtain the RPM ratio, and use of a threshold.
In addition to the use of the no-template control for a given experimental run, embodiments can use a database of no-template controls (NTC database). For every run, the database can be reviewed before a call is made about a pathogen being in the sample, e.g., to make sure that organism is not in the NTC database. The NTC database includes historical data about what pathogens were identified in NTC samples, e.g., within a specific amount of time, such as 30 days.
For example, a herpes virus might be identified in an NTC sample a month ago (e.g., due to contamination resulting from the same lab doing herpes virus testing). Herpes virus 1 is a common cause of meningitis, and thus would be a dangerous pathogen if it were actually present. The NTC database can include herpes virus 1, until it is not seen in an NTC sample for at least 30 days. In some embodiments, if the taxonomic classification corresponds to organisms found on the skin, like poly papilloma viruses, it can be excluded outright. But, such organisms can still be included as part of the NTC database.
Very often, background includes skin flora. For instance, papilloma viruses may be seen in clinical samples in the no-template control. If such a virus is seen in a no-template control, it is added to the database. Embodiments may not actually report organisms that were found in the no-template control unless they are present at a much higher level. For example, assume the RPM was 10 in the NTC for one run; then the taxonomic classification gets added to the database with an RPM of 10. For the next run, if there was a herpes virus detected at 10 RPM (i.e., an RPM ratio of 1), it would not be reported when a threshold of greater than 10 for the RPM ratio was used. It would have to be at an RPM of 100 to provide an RPM ratio of 10 times the highest level that was present in the NTC database.
It is not desirable to keep these pathogens in the database indefinitely, because the NTC database decreases the sensitivity of the test: the bar is set higher to make a positive call for the pathogens in the NTC database. Thus, the NTC database is an evolving database. For example, if a pathogen has not been present for longer than three months (or other specified time period), it then gets removed from the database. Steps can be performed to remove contaminants from NTC samples, e.g., replacing reagents. The amount (e.g., RPM) in the NTC database for a particular contaminant can be updated from the initial value to a new value if the new value is greater than the initial value. Accordingly, in one embodiment, the amount in the NTC database for a contaminant can be the maximum that is observed.
At block 1110, a plurality of sequence reads are obtained from a sequencing of DNA molecules from the test sample of biological material. The sequence reads can be received at the computer system in a variety of ways, e.g., via a network or a removable storage device. The computer system can also receive information about sequence reads from a negative control sample. For example, the computer system can receive an amount of sequence reads for a plurality of different reference genomes, e.g., as classified into various taxonomic levels.
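One way to picture the evolving NTC database is the following Python sketch. The class and method names are hypothetical, the 90-day default stands in for the "three months (or other specified time period)" mentioned above, and persistent storage is left out; the sketch only shows the two behaviors described: keeping the maximum observed RPM per taxon and dropping taxa not seen within the retention window.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Dict


@dataclass
class NtcEntry:
    max_rpm: float   # highest RPM observed for this taxon in any NTC run
    last_seen: date  # most recent NTC run in which the taxon appeared


class NtcDatabase:
    """Hypothetical in-memory store of taxa observed in no-template controls."""

    def __init__(self, retention_days: int = 90):
        self.retention = timedelta(days=retention_days)
        self.entries: Dict[str, NtcEntry] = {}

    def record(self, taxon: str, rpm: float, seen_on: date) -> None:
        """Add an NTC observation, keeping the maximum RPM seen so far."""
        entry = self.entries.get(taxon)
        if entry is None:
            self.entries[taxon] = NtcEntry(rpm, seen_on)
        else:
            entry.max_rpm = max(entry.max_rpm, rpm)
            entry.last_seen = max(entry.last_seen, seen_on)

    def prune(self, today: date) -> None:
        """Remove taxa not seen in an NTC within the retention window."""
        expired = [taxon for taxon, entry in self.entries.items()
                   if today - entry.last_seen > self.retention]
        for taxon in expired:
            del self.entries[taxon]

    def background_rpm(self, taxon: str) -> float:
        """RPM to use as the NTC denominator; 0.0 if the taxon is not stored."""
        entry = self.entries.get(taxon)
        return entry.max_rpm if entry else 0.0
```

With a stored maximum of 10 RPM for a taxon, a sample would need 100 RPM for that taxon to reach an RPM ratio of 10, which is the situation described in the herpes virus example.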
At block 1120, an alignment technique is used to align the plurality of sequence reads to a plurality of classified reference genomes in a database. The reference genomes may be considered classified in that the genomes are classified to at least one classification level in a taxonomy. Alignment results can include, for each of at least a portion of the sequence reads, at least one matching reference genome to which the sequence read aligns. The alignment results can include classification information for each of the matching reference genomes.
Blocks 1130-1160 may be performed for each of a group of one or more matching reference genomes.
At block 1130, a first amount of sequence reads from the test sample that align to the matching genome is determined. As an example, the first amount may be a total number of sequence reads aligned to the particular matching reference genome. The first amount may be normalized based on the number of sequence reads obtained from the test sample, or the total number of sequence reads aligned to at least one reference genome.
At block 1140, a second amount of sequence reads from a negative control sample that align to the matching genome is determined. In one embodiment, the second amount of sequence reads can be determined from a negative control database. In another embodiment, the second amount of sequence reads can be determined from a parallel sequencing of a negative control sample. Amounts from a parallel negative control sample and the negative control database can be combined to provide a single amount.
At block 1150, a ratio of the first amount and the second amount is determined. The ratio can take various forms, such as the first amount divided by the second amount, or the second amount divided by the first amount. Further, a numerator or denominator can include a sum of the two amounts.
At block 1160, the ratio is compared to a threshold to determine whether the ratio exceeds the threshold. A set of one or more matching reference genomes can have the ratio exceed the threshold. The threshold can be determined based on empirical data (e.g., values of samples with a known composition), as will be appreciated by one skilled in the art. The selection of the threshold can be based on a desired balance between sensitivity and specificity.
At block 1170, an output is provided that identifies the set of one or more matching reference genomes as potential pathogens in the test sample. As examples, the output can be a list of the set of matching reference genomes, e.g., presented in classification levels. Such a list could include more matching reference genomes, where the set is indicated, as may be done with a marker or the ratio. For instance, an RPM ratio can be provided.
Once the false-positives and inaccurate reads have been removed, the system may apply post-filtering analysis and provide the results to the sample analysis system for display to the clinician. In some embodiments, the filtered and classified results may be used to generate new coverage maps. The filtered and classified results may be analyzed to determine a best match and whether a sufficient match was obtained. For example, a best match score (or any one of the individual scores) can be required to be above a specified threshold. In some embodiments, if the coverage is below a certain level, then the system will dismiss the organism as being a possibility. In other embodiments, all possible identification results could be provided to allow a clinician to make an identification.
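Blocks 1130-1170 can be illustrated with the Python sketch below. It assumes that alignment (blocks 1110-1120) has already produced a per-genome read count for the test sample, that the second amount is supplied as an RPM value per genome, and that the ratio takes the "first amount divided by second amount" form with one substituted for a zero denominator. The function name, argument names, and output layout are hypothetical.

```python
from typing import Dict, List


def identify_potential_pathogens(sample_counts: Dict[str, int],
                                 sample_total_reads: int,
                                 control_rpm: Dict[str, float],
                                 threshold: float = 10.0) -> List[dict]:
    """Return matching reference genomes whose RPM ratio exceeds the threshold."""
    results = []
    for genome, count in sample_counts.items():
        # Block 1130: first amount, normalized to reads per million.
        first_amount = count * 1_000_000 / sample_total_reads
        # Block 1140: second amount from the negative control (0.0 if absent).
        second_amount = control_rpm.get(genome, 0.0)
        # Block 1150: ratio, using one when the control has no reads.
        ratio = first_amount / (second_amount if second_amount > 0 else 1.0)
        # Block 1160: compare the ratio to the threshold.
        if ratio > threshold:
            results.append({"genome": genome,
                            "rpm": first_amount,
                            "rpm_ratio": ratio})
    # Block 1170: output the set, highest ratio first.
    results.sort(key=lambda r: r["rpm_ratio"], reverse=True)
    return results
```

The default threshold of 10 is the example value given earlier in the text; in practice it would be chosen from empirical data to balance sensitivity and specificity.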
Further, in some embodiments, a best match algorithm may be applied to select, for instance, the best match among a variety of closely related matches. For instance, the results may have hits to multiple different strains of influenza. The best match algorithm may choose the most closely matched sequence in the reference database and select that classified element as the best match for the sample sequence.
Examples of a best match algorithm are described herein, e.g., in section II.C. In another example, for each species, the reference genome sequence with the greatest number of reads assigned to it can be used as a reference for mapping of all of the species-specific reads. A “consensus sequence” may be generated from the mapping. The global pairwise identity using the Needleman-Wunsch algorithm may then be calculated between the “consensus sequence” and each reference genome sequence. The reference genome sequence with the greatest global pairwise identity may be selected as the top reference genome sequence. The reference genome sequences under consideration can also be prioritized as follows: (1) complete genomes, (2) complete sequences, or (3) partial sequences/individual genes.
A positive pathogen identification may be reported where, for each top reference genome, a positive identification metric is evaluated and provided to the clinician in a return message to the sample analysis computer. For example, the positive identification metric may include coverage of at least 3 genes/open reading frames (ORFs), or where there are fewer than 3 ORFs in the virus, then coverage that exceeds 3 times the read length (e.g., for 100 bp sequencing, coverage of greater than 300 bp).
Additionally, systems may implement a tagging mechanism for the reference genomes that are matched to the sample sequence reads in order to make pathogen identification easier. For instance, the system may add the “host” to viral reference genome sequences (GIs); where the “host” is bacteria, the system knows that the virus is a phage (and the system may want to mask out sequences aligning to phages). Similarly, the system can apply arbitrary labels such as “pathogens”, “colonizers”, and “laboratory contaminants” to specific bacterial GIs that can be helpful in downstream visualization of results for clinical interpretation. These results may be provided with the final results of the alignment, filtering, and classification process.
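The selection of a top reference genome by global pairwise identity can be sketched as follows. This is a textbook Needleman-Wunsch implementation with a simple linear gap penalty; the scoring values (match +1, mismatch -1, gap -1), the definition of identity as matching columns divided by alignment length, and the function names are assumptions made for illustration. Generating the consensus sequence and applying the completeness-based prioritization are not shown, and a quadratic-memory implementation like this one is only practical for short sequences.

```python
from typing import Dict, Tuple


def needleman_wunsch_identity(a: str, b: str, match: int = 1,
                              mismatch: int = -1, gap: int = -1) -> float:
    """Globally align a and b; return matching columns / alignment length."""
    n, m = len(a), len(b)
    # score[i][j] holds the best score for aligning a[:i] with b[:j].
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
    # Trace back one optimal path, counting columns and identical positions.
    i, j, matches, columns = n, m, 0, 0
    while i > 0 or j > 0:
        columns += 1
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            matches += int(a[i - 1] == b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            i -= 1
        else:
            j -= 1
    return matches / columns if columns else 1.0


def select_top_reference(consensus: str,
                         references: Dict[str, str]) -> Tuple[str, float]:
    """Return the (reference ID, identity) pair with the greatest identity."""
    identities = {ref_id: needleman_wunsch_identity(consensus, seq)
                  for ref_id, seq in references.items()}
    best = max(identities, key=identities.get)
    return best, identities[best]
```

When several optimal alignments exist, the traceback prefers the diagonal move, so the reported identity corresponds to one of the optimal alignments rather than to all of them.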
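The example metric above reduces to a two-branch rule, sketched here in Python. The function and parameter names are hypothetical, and the counts of covered ORFs and covered bases are assumed to have been computed elsewhere from the coverage map.

```python
def is_positive_identification(orfs_covered: int,
                               total_orfs: int,
                               covered_bases: int,
                               read_length: int,
                               min_orfs: int = 3,
                               length_multiple: int = 3) -> bool:
    """Apply the example positive identification metric from the text."""
    if total_orfs >= min_orfs:
        # Genome has at least 3 ORFs: require coverage of at least 3 of them.
        return orfs_covered >= min_orfs
    # Fewer than 3 ORFs: require coverage greater than 3x the read length.
    return covered_bases > length_multiple * read_length
```

For 100 bp reads and a virus with two ORFs, 301 covered bases would satisfy the rule while exactly 300 would not, since the text calls for coverage greater than 300 bp.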
A diagnosis of brucellosis can be difficult because routine culture and serological methods exhibit variable sensitivity and specificity. At present, the standard laboratory diagnosis of brucellosis is based on isolation of the bacteria from clinical specimens and/or serological detection of Brucella antibodies.
Neurobrucellosis is a known complication of systemic Brucella infection. It remains a difficult diagnosis to make and can mimic other fastidious infections such as TB [1]. Manifestations of neurobrucellosis are widely variable and include meningoencephalitis, cerebrovascular disease, peripheral and cranial neuropathies, or myelitis. Neuroimaging studies of neurobrucellosis can also vary greatly among individuals [6]. A previous study from Turkey reported a high prevalence of neurobrucellosis at 37% among 128 patients with brucellosis. Although culture is the gold standard for diagnosis, Brucella species are relatively fastidious and slow-growing; cultures fail to recover the organism 30-90% of the time, and also present a risk of laboratory-acquired infection [7]. Serology is more sensitive for detection, but can lead to false-positive results and may not distinguish between active and prior infection.
Molecular methods based on detection of nucleic acid such as PCR [8] and now potentially metagenomic NGS [13] can offer increased sensitivity and specificity over conventional diagnostic testing. We present the use of a metagenomic next-generation sequencing assay to diagnose a case of neurobrucellosis from cerebrospinal fluid (CSF), resulting in the institution of appropriate antibiotic treatment and a favorable clinical outcome.
Initially, microbiological testing was positive for Epstein-Barr virus (EBV) and human herpesvirus 7 (HHV-7) from CSF. Testing for other pathogens, including Brucella by IgM antibody, was negative. The patient was treated with parenteral acyclovir, followed by oral ganciclovir to complete a 14-day course. Two weeks after hospital discharge, she developed back pain and worsening headache. Upon hospitalization, the patient's vital signs were normal without fever. Physical examination was remarkable for multifocal myoclonus throughout her body. There was hyperreflexia and decreased proprioceptive sensation of bilateral lower limbs. A complete blood count was notable for a white blood cell count of 4.82×10^9/L (44% neutrophils, 44% lymphocytes, 10% monocytes).
A repeat lumbar puncture was performed, with additional microbiological testing unrevealing. Cytologic examination of the CSF revealed no malignant cells. Despite a negative tuberculin skin test and QuantiFERON-TB test, she was started on 4-drug therapy with isoniazid, rifampin, pyrazinamide, and ethambutol given a high concern for TB disease based on the patient's suggestive CSF profile and risk factors for TB.
On day 8 of hospitalization, the result of CSF for Mycobacterium tuberculosis (MTB) by PCR was negative. Given the lack of response to empiric TB therapy, there was a concern for drug-resistant TB. Ethambutol was thus changed to ethionamide, and levofloxacin was added to the regimen. She improved substantially after changing her antibiotic regimen, and was discharged home on 5 anti-TB medications. At follow up 1 week and 1 month after discharge, her headache had resolved, but she continued to have fatigue, mild back pain, and intermittent episodes of shaking of her extremities. When the result of the mycobacterial culture was finalized as negative at 6 weeks, she returned to Mexico to continue her anti-TB therapy with INH and rifampin alone.
RNA and DNA mNGS libraries from 600 μL of the patient's CSF sample were constructed. From a total of 23,638,587 raw reads in the DNA library, there were 277 (0.0012%) reads corresponding to the Brucella genus in the DNA library, corresponding to an RPM ratio of 15.6, with all species-specific reads aligning to Brucella melitensis (see Tables 1200 of
Similarly, no viruses, fungi, or parasites met criteria for reporting. In contrast, all 7 microorganisms in the positive control were detected at levels above the established reporting threshold. Importantly, no Brucella isolates or positive clinical samples from suspected or confirmed cases had been present in the clinical laboratory prior to the mNGS testing, nor had Brucella sequences ever been detected in the NTC. The presence of Brucella in the patient's cerebrospinal fluid was confirmed by Brucella-specific PCR testing and Sanger sequencing of the amplicon. The PCR reaction for a 177 bp region of the Brucella IS711 gene was run on a 2% agarose gel. Other samples included: a no-template control; a 7-organism positive control mixture of CMV, HIV, Streptococcus agalactiae, Klebsiella pneumoniae, Cryptococcus neoformans; water; and 1 kb Plus DNA ladder (Invitrogen).
After detection of Brucella reads in her CSF sample, the patient was contacted and instructed to seek further medical evaluation. She returned to the hospital in Los Angeles for a second admission. Although she had completed INH and rifampin therapy one week prior, she reported persistent back pain, nausea, and fatigue. Repeat MRI of the brain and spine was remarkable, and CSF was normal. A confirmatory serum Brucella agglutinin titer was positive at 1:80. Because of persistent symptoms, mNGS and PCR testing showing Brucella, and positive confirmatory serology, she was diagnosed with chronic neurobrucellosis and started on targeted therapy with doxycycline and rifampin. Two weeks after starting therapy, she reported that her symptoms had fully resolved. Notably, Brucella spp. was not isolated from either CSF or blood culture, nor was Brucella DNA detected in CSF submitted for 16S rRNA gene sequencing.
This highlights the clinical impact of mNGS for diagnosis of infections such as neurobrucellosis that are challenging to accurately diagnose and treat. Although initially covered for TB meningitis using antibiotics that partially treat neurobrucellosis (rifampin+/−levofloxacin), the patient continued to be symptomatic and was not placed on targeted therapy for Brucella until comprehensive metagenomic sequencing revealed the presence of bacterial DNA in her CSF and neurobrucellosis was confirmed by Brucella agglutinin testing.
In table 1300, the reported viruses are cytomegalovirus (CMV) in the “DNA PC” column and human immunodeficiency virus 1 (HIV-1) in the “RNA PC” column. These are the viruses spiked into the PC sample. No viruses are reported from the patient CSF because human papillomaviruses, part of skin flora and presumed to be a contaminant, are not reported by the mNGS assay.
Supplementary Table 1 of the Appendix shows the number of reads from the DNA library aligning to bacterial sequences. Results are shown for the patient's sample, the NTC, and the PC sample. In Supplementary Table 1, “*” indicates re-classified to a higher taxonomic rank because reads aligned equally well to multiple different organisms that shared the same species, genus or family. The abbreviations are as follows: NTC, no template control; PC, positive control; and CSF, cerebrospinal fluid.
Supplementary Table 2 of the Appendix shows the number of reads from the DNA library aligning to fungal and parasite sequences. Results are shown for the patient's sample, the NTC, and the PC sample. In Supplementary Table 2, “*” indicates re-classified to a higher taxonomic rank because reads aligned equally well to multiple different organisms that shared the same species, genus or family.
Supplementary Table 3 of the Appendix shows bacterial taxa identified by embodiments using an RPM ratio metric for reads shown in Supplementary Table 1. In Supplementary Table 3, “*” indicates re-classified to a higher taxonomic rank because reads aligned equally well to multiple different organisms that shared the same species, genus or family; “#” indicates positive result according to an RPM (reads per million) ratio≥10, where the RPM ratio=RPM(sample)/RPM(NTC); and “&” indicates reads correspond to the Streptococcus agalactiae PC. The abbreviations are as follows: PC, positive control; CSF, cerebrospinal fluid; mNGS, metagenomic next-generation sequencing; and RPM, reads per million.
In Supplementary Table 3, cells that show a positive result by mNGS assay are highlighted in yellow. Under the column designated “DNA PC”, the two bacterial species reported at the pre-established threshold criterion of >10 RPM are Streptococcus agalactiae and Klebsiella pneumoniae, which were spiked into the PC sample. Other positive rows are not reported because they represent either (1) classification at a higher taxonomic rank (genus) or (2) another bacterial species in the same genus (i.e. Streptococcus suis) that does not meet the pre-established requirement of having at least 1/10 of the number of RPM corresponding to the predominant species (Streptococcus agalactiae) to call a co-infection from at least 2 different species. Under the column designated “DNA Patient CSF”, the only bacterial taxon with >10 RPM is the Brucella genus, which is the organism reported in the patient's CSF.
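The within-genus co-infection rule described above can be illustrated with a minimal sketch. It assumes the input holds only species of a single genus that have already passed the reporting threshold. The function name `species_to_report`, the constant name, and the RPM values in the usage example are hypothetical and are not taken from the Supplementary Tables.

```python
from typing import Dict, List

# A secondary species must reach at least 1/10 of the predominant species' RPM.
PREDOMINANT_DIVISOR = 10


def species_to_report(species_rpm: Dict[str, float]) -> List[str]:
    """Return the species of one genus that qualify for reporting.

    The predominant species (highest RPM) is always kept. Any other species
    is kept only if its RPM is at least 1/10 of the predominant RPM.
    The result is ordered from highest to lowest RPM.
    """
    if not species_rpm:
        return []
    predominant_rpm = max(species_rpm.values())
    kept = [
        name
        for name, rpm in species_rpm.items()
        if rpm * PREDOMINANT_DIVISOR >= predominant_rpm
    ]
    return sorted(kept, key=lambda name: species_rpm[name], reverse=True)


# Hypothetical RPM values, for illustration only.
example = {"Streptococcus agalactiae": 5000.0, "Streptococcus suis": 30.0}
print(species_to_report(example))
```

With these illustrative values, only the predominant species is returned, which corresponds to the case in which Streptococcus suis is not reported alongside Streptococcus agalactiae.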
Supplementary Table 4 of the Appendix shows fungal and parasitic taxa identified by embodiments using an RPM ratio metric for reads shown in Supplementary Table 2. In Supplementary Table 4, “*” indicates re-classified to a higher taxonomic rank because reads aligned equally well to multiple different organisms that shared the same species, genus or family; and “#” indicates positive result according to an RPM ratio≥10, where the RPM ratio=RPM(sample)/RPM(NTC). The abbreviations are: PC, positive control; CSF, cerebrospinal fluid; mNGS, metagenomic next-generation sequencing; RPM, reads per million.
In Supplementary Table 4, cells that show a positive result by mNGS assay are highlighted in yellow. Under the column “DNA PC”, the three fungal/parasitic species reported at the pre-established threshold criterion of >10 RPM are Cryptococcus neoformans, Aspergillus niger, and Toxoplasma gondii, which were spiked into the PC sample. Other positive rows are not reported because they represent classification at a higher taxonomic rank (genus or family).
Supplementary Tables 3 and 4, which are used to interpret and report mNGS results, show the effect of taxonomic classification and normalization using an “RPM ratio” metric in comparison to the NTC in simplifying the clinical interpretation, as compared to Supplementary Tables 1 and 2, respectively.
Of particular relevance for this case, two chronic granulomatous infections, tuberculosis and brucellosis, have not only overlapping clinical or radiographic features but also histologic characteristics in common [10]. As such, misdiagnosis of tuberculosis in patients with brucellosis has been reported in the literature [10]. This matter is further complicated by the fact that neither a negative CSF mycobacterial culture nor a negative tuberculosis PCR-based assay excludes the diagnosis of TB meningitis if the clinical suspicion is high. Additionally, false-positive Brucella seroreactivity (ELISA and agglutination titer) in patients with active TB has also been reported. In this patient's case, the negative Brucella IgM but positive IgG was incorrectly attributed to false-positive Brucella seroreactivity in the setting of TB and not to active Brucella infection. It is possible that the patient may not have mounted a detectable IgM antibody response, or that Brucella IgM levels had waned by the time of hospital admission. Confirmatory agglutinin testing may have been helpful in making the diagnosis of Brucella earlier.
This patient had temporary clinical improvement after initiation of anti-TB medications. Rifampin and levofloxacin are two anti-TB agents that are also active against Brucella spp. [1]. However, rifampin, while active against Brucella, should always be used in combination with other agents (e.g. doxycycline, trimethoprim-sulfamethoxazole, or quinolones) since monotherapy, as was inadvertently administered to this patient initially, has been associated with high relapse rates [1]. In hindsight, the patient's prolonged and indolent course was more likely to be associated with Brucella, since more rapid clinical deterioration would have been expected for patients with active TB who are inadequately treated [11]. However, Brucella was not high on the differential at that time. It was only after discharge, and because of the patient's persistent symptoms, that an alternative diagnosis was considered. Taken together, knowledge of these pitfalls is essential for clinicians to reduce diagnostic errors.
Metagenomic next-generation sequencing (mNGS) is an emerging approach in diagnostic microbiology with the ability to detect all microorganisms—viruses, bacteria, fungi, and parasites—in a single assay [2, 4, 9, 2, 3]. Here mNGS was used to provide an accurate diagnosis of neurobrucellosis and to guide the institution of targeted therapy, leading to complete resolution of the patient's illness.
DNA and RNA metagenomic libraries were constructed from the patient's CSF sample as previously described [2, 3]. After bead-beating using Lysis matrix B (MP Biomedicals, Santa Ana Calif.) at 6 m/s for 30 seconds, total nucleic acid was extracted using the Qiagen EZ1 Viral kit (Qiagen, Valencia Calif.). Half of the nucleic acid from CSF was treated with Turbo DNase (Ambion, Waltham Mass.), followed by reverse transcription of the RNA to cDNA using random hexamers and NGS library preparation using the Nextera XT DNA Library Prep Kit (Illumina, San Diego Calif.). The remaining half was treated with the NEBNext Microbiome DNA Enrichment Kit (New England Biolabs, Ipswich Mass.) to enrich for microbial DNA, followed by Nextera XT library preparation. Dual-indexed, barcoded NGS libraries were quantitated on the BioAnalyzer (Agilent, Santa Clara Calif.) and run on the Illumina HiSeq (1×160 bp run).
The patient's CSF sample was processed and sequenced as part of a recently developed standardized operating procedure (SOP) for clinical mNGS testing from patient samples in the University of California, San Francisco (UCSF) Clinical Microbiology Laboratory, a CLIA (Clinical Laboratory Improvement Amendments)-licensed laboratory (Naccache, et al., manuscript in preparation). For each sequencing run, the SOP includes running two external controls: (1) a negative “no-template” control (NTC) sample consisting of elution buffer, and (2) a positive control sample consisting of a quantified mixture of 7 representative pathogens (CMV, HIV, Streptococcus agalactiae, Klebsiella pneumoniae, Cryptococcus neoformans, Aspergillus niger, and Toxoplasma gondii). Each is spiked into negative CSF matrix at a concentration 1-2 log above the estimated limits of detection for that microorganism by probit analysis (Naccache, et al., manuscript in preparation).
Metagenomic NGS data was analyzed for pathogens using a modified version of the SURPI (“sequence-based ultra-rapid pathogen identification”) computational pipeline, which identifies pathogen sequences on the basis of nucleotide alignments to the National Center for Biotechnology Information (NCBI) nt reference database (March 2015 build) [4]. A clinical version of the SURPI pipeline, named SURPI+, which employs taxonomic classification for more accurate read assignments and establishes normalized metrics and thresholds for clinical results reporting, was used for automated interpretation. Briefly, for identification of viral sequences, mNGS data from both DNA and RNA sample libraries were used, whereas only mNGS data from DNA libraries were used for identification of sequences from bacteria, fungi, and parasites. An edit distance cutoff of 16, indicating the number of single nucleotide insertions, deletions, or mismatches allowed between the read and the reference sequence, was used for virus detection, whereas a more stringent edit distance cutoff of 1 was used for bacterial, fungal, and parasitic detection [2, 3]. Following alignment, a rapid taxonomic classification algorithm based on the lowest common ancestor algorithm was used to assign viral, bacterial, and non-chordate eukaryotic (fungal or parasitic) NGS reads to the species, genus, or family level, as previously described [2, 3].
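As an illustration of lowest-common-ancestor classification, the following minimal sketch assigns a read to the most specific taxon shared by all of its equally good reference hits. It is not the SURPI+ implementation. The `PARENT` map is a small hand-written toy taxonomy standing in for the NCBI taxonomy, and the function names are hypothetical.

```python
from typing import Dict, Iterable, List

# Toy child -> parent map (an assumption for illustration only).
# A real pipeline would load the full NCBI taxonomy instead.
PARENT: Dict[str, str] = {
    "Brucella melitensis": "Brucella",
    "Brucella abortus": "Brucella",
    "Brucella": "Brucellaceae",
    "Ochrobactrum anthropi": "Ochrobactrum",
    "Ochrobactrum": "Brucellaceae",
    "Brucellaceae": "root",
}


def lineage(taxon: str) -> List[str]:
    """Return the path from a taxon up to the root, starting with the taxon."""
    path = [taxon]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return path


def lowest_common_ancestor(taxa: Iterable[str]) -> str:
    """Return the most specific taxon shared by the lineages of all inputs."""
    lineages = [lineage(taxon) for taxon in taxa]
    if not lineages:
        raise ValueError("at least one taxon is required")
    shared = set(lineages[0])
    for path in lineages[1:]:
        shared &= set(path)
    # Walk the first lineage from most to least specific and stop at the
    # first node that every other lineage also contains.
    for node in lineages[0]:
        if node in shared:
            return node
    raise ValueError("taxa share no common ancestor")


# A read hitting two Brucella species equally well is assigned to the genus.
print(lowest_common_ancestor(["Brucella melitensis", "Brucella abortus"]))
```

In this sketch, a read whose best hits span two genera of the same family is assigned at the family level, which mirrors the re-classification to a higher taxonomic rank marked with “*” in the Supplementary Tables.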
As part of clinical validation of mNGS testing for pathogen detection, we have established threshold cutoffs for automated reporting of positive results (Naccache, et al., manuscript in preparation). Briefly, for reporting of bacteria, fungi, and parasites, the cutoff is defined as an RPM (reads per million) ratio of ≥10, where the RPM ratio is defined as RPM(sample)/RPM(NTC) for any given taxon (species, genus, or family). If the taxon is not present in the NTC, then RPM(NTC) is set to 1. For reporting of viruses, the criteria can include coverage of ≥2 non-contiguous/non-overlapping gene regions. Viruses that have non-vertebrate hosts, that are found in the NTC, or that constitute normal body flora (e.g. anelloviruses) are not reported. A “gene region” can be defined as a non-overlapping and non-contiguous region of length ≤ read length (e.g., 141 bp).
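The RPM ratio cutoff described above can be sketched as follows. This is a minimal illustration and not the validated reporting software. The function names and the RPM values in the usage example are hypothetical.

```python
from typing import Dict

RPM_RATIO_CUTOFF = 10.0  # reporting threshold stated in the text
NTC_DEFAULT_RPM = 1.0    # RPM used when a taxon is absent from the NTC


def reads_per_million(taxon_reads: int, total_reads: int) -> float:
    """Normalize a raw read count to reads per million (RPM)."""
    if total_reads <= 0:
        raise ValueError("total_reads must be positive")
    return taxon_reads * 1_000_000 / total_reads


def rpm_ratio(sample_rpm: float, ntc_rpm: float) -> float:
    """Compute RPM(sample) / RPM(NTC), using 1 when the NTC has no reads."""
    denominator = ntc_rpm if ntc_rpm > 0 else NTC_DEFAULT_RPM
    return sample_rpm / denominator


def reportable_taxa(
    sample_rpm: Dict[str, float], ntc_rpm: Dict[str, float]
) -> Dict[str, float]:
    """Return taxa whose RPM ratio meets the cutoff, mapped to that ratio."""
    result: Dict[str, float] = {}
    for taxon, rpm in sample_rpm.items():
        ratio = rpm_ratio(rpm, ntc_rpm.get(taxon, 0.0))
        if ratio >= RPM_RATIO_CUTOFF:
            result[taxon] = ratio
    return result


# Hypothetical RPM values, for illustration only.
sample = {"Brucella": 250.0, "Propionibacterium acnes": 40.0}
ntc = {"Propionibacterium acnes": 20.0}
print(reportable_taxa(sample, ntc))
```

In this illustrative case, the taxon absent from the NTC is divided by 1 and exceeds the cutoff, whereas the taxon that is also abundant in the NTC yields a ratio of 2 and is filtered out as likely background.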
Confirmation of Brucella detection was performed by PCR using the following primer set targeting the IS711 gene [5]: BrucellaGenus_F-2 GCCTTGGATCTGAGCCGTT (SEQ ID NO:1); BrucellaGenus_IS711_R GGCCTACCGCTGCGAAT (SEQ ID NO:2). The reaction was carried out using the Qiagen One-Step RT-PCR Kit in a 25 μL total reaction volume by addition of 4 μL Q solution, 4 μL 5× buffer, 1 μL dNTP, 1 μL enzyme, 1 μL of each primer at 12 μmol, 2 μL of template extracted DNA, and water. Cycling conditions were as follows: 40 cycles of 94° C., 30 s/53° C., 30 s/72° C., 30 s. The recovered sequence was confirmed to be Brucella by Sanger sequencing of the PCR amplicon.
Any of the computer systems mentioned herein may utilize any suitable number of subsystems. Examples of such subsystems are shown in
The subsystems shown in
A computer system can include a plurality of the same components or subsystems, e.g., connected together by external interface 81, by an internal interface, or via removable storage devices that can be connected and removed from one component to another component. In some embodiments, computer systems, subsystems, or apparatuses can communicate over a network. In such instances, one computer can be considered a client and another computer a server, where each can be part of a same computer system. A client and a server can each include multiple systems, subsystems, or components.
Aspects of embodiments can be implemented in the form of control logic using hardware (e.g. an application specific integrated circuit or field programmable gate array) and/or using computer software with a generally programmable processor in a modular or integrated manner. As used herein, a processor includes a single-core processor, multi-core processor on a same integrated chip, or multiple processing units on a single circuit board or networked. Based on the disclosure and teachings provided herein, a person of ordinary skill in the art will know and appreciate other ways and/or methods to implement embodiments of the present invention using hardware and a combination of hardware and software.
Any of the software components or functions described in this application may be implemented as software code to be executed by a processor using any suitable computer language such as, for example, Java, C, C++, C#, Objective-C, Go, Swift, or scripting language such as Perl or Python using, for example, conventional or object-oriented techniques. The software code may be stored as a series of instructions or commands on a computer readable medium for storage and/or transmission. A suitable non-transitory computer readable medium can include random access memory (RAM), a read only memory (ROM), a magnetic medium such as a hard-drive or a floppy disk, or an optical medium such as a compact disk (CD) or DVD (digital versatile disk), flash memory, and the like. The computer readable medium may be any combination of such storage or transmission devices.
Such programs may also be encoded and transmitted using carrier signals adapted for transmission via wired, optical, and/or wireless networks conforming to a variety of protocols, including the Internet. As such, a computer readable medium may be created using a data signal encoded with such programs. Computer readable media encoded with the program code may be packaged with a compatible device or provided separately from other devices (e.g., via Internet download). Any such computer readable medium may reside on or within a single computer product (e.g. a hard drive, a CD, or an entire computer system), and may be present on or within different computer products within a system or network. A computer system may include a monitor, printer, or other suitable display for providing any of the results mentioned herein to a user.
Any of the methods described herein may be totally or partially performed with a computer system including one or more processors, which can be configured to perform the steps. Thus, embodiments can be directed to computer systems configured to perform the steps of any of the methods described herein, potentially with different components performing a respective step or a respective group of steps. Although presented as numbered steps, steps of methods herein can be performed at a same time or in a different order. Additionally, portions of these steps may be used with portions of other steps from other methods. Also, all or portions of a step may be optional. Additionally, any of the steps of any of the methods can be performed with modules, units, circuits, or other means for performing these steps.
The specific details of particular embodiments may be combined in any suitable manner without departing from the spirit and scope of embodiments of the invention. However, other embodiments of the invention may be directed to specific embodiments relating to each individual aspect, or specific combinations of these individual aspects.
The above description of example embodiments of the invention has been presented for the purposes of illustration and description. It is not intended to be exhaustive or to limit the invention to the precise form described, and many modifications and variations are possible in light of the teaching above.
A recitation of “a”, “an” or “the” is intended to mean “one or more” unless specifically indicated to the contrary. The use of “or” is intended to mean an “inclusive or,” and not an “exclusive or” unless specifically indicated to the contrary. Reference to a “first” component does not necessarily require that a second component be provided. Moreover, reference to a “first” or a “second” component does not limit the referenced component to a particular location unless expressly stated.
All patents, patent applications, publications, and descriptions mentioned herein are herein incorporated by reference in their entirety for all purposes. None is admitted to be prior art. Such references include:
Cryptococcus neoformans
Filobasidiella
Aspergillus niger
Aspergillus
Toxoplasma gondii
Toxoplasma
*
Filobasidiella
*
*
Neospora caninum
Neospora
Hammondia triffittae
Hammondia
*
*
*
Hammondia
*
Aspergillus
Hammondia hammondi
Hammondia
Aspergillus fumigatus
Aspergillus
Aspergillus awamori
Aspergillus
Bartheletia paradoxa
Bartheletia
Cryptococcus gattii
Filobasidiella
Aspergillus kawachii
Aspergillus
Spirometra erinaceieuropaei
Spirometra
Mucor racemosus
Mucor
Aspergillus oryzae
Aspergillus
Setosphaeria turcica
Setosphaeria
Caenorhabditis remanei
Caenorhabditis
Aspergillus tubingensis
Aspergillus
Anisakis simplex
Anisakis
Penicillium solitum
Penicillium
Plectosphaerella sp. 93 OA-2013
Plectosphaerella
*
*
Malassezia globosa
Malassezia
Gongylonema pulchrum
Gongylonema
Wallemia sebi
Wallemia
*
*
Candida parapsilosis
Candida
Elaeophora elaphi
Elaeophora
Lichtheimia hongkongensis
Lichtheimia
Brugia timori
Brugia
Parastrongyloides trichosuri
Parastrongyloides
Sordaria macrospora
Sordaria
Bipolaris sorokiniana
Bipolaris
Malassezia restricta
Malassezia
Cladosporium oxysporum
Cladosporium
Strongylocentrotus purpuratus
Strongylocentrotus
Caenorhabditis elegans
Caenorhabditis
Penicillium griseoroseum
Penicillium
Fusarium graminearum
Fusarium
Leptosphaeria biglobosa
Leptosphaeria
Chaetomium globosum
Chaetomium
Myceliophthora thermophila
Myceliophthora
Alternaria tenuissima
Alternaria
*
Saccharomyces
Aphanomyces euteiches
Aphanomyces
Ophiostoma piliferum
Ophiostoma
Rhodotorula taiwanensis
Rhodotorula
Trichosporon domesticum
Trichosporon
Phoma herbarum
Phoma
Erysiphe alphitoides
Erysiphe
*
Leptosphaeria
Zymoseptoria tritici
Zymoseptoria
Debaryomyces hansenii
Debaryomyces
Alternaria alternata
Alternaria
*
Phoma
*
Umbilicaria
*
Penicillium
Wuchereria bancrofti
Wuchereria
Cladosporium cladosporioides
Cladosporium
Saccharomyces bayanus
Saccharomyces
Fusarium phaseoli
Fusarium
Dasybranchus sp. DH1
Dasybranchus
Neofusicoccum parvum
Neofusicoccum
*
Cladosporium
Penicillium rubens
Penicillium
Ustilago maydis
Ustilago
Mucor circinelloides
Mucor
Albugo laibachii
Albugo
*
Brucella
Brucella melitensis
Brucella
Streptococcus agalactiae
Streptococcus
Klebsiella pneumoniae
Klebsiella
*
*
Escherichia coli
Escherichia
Propionibacterium acnes
Propionibacterium
*
Streptococcus
*
Klebsiella
*
*
Streptococcus suis
Streptococcus
Streptococcus salivarius
Streptococcus
Streptococcus pyogenes
Streptococcus
Serratia marcescens
Serratia
Lactococcus lactis
Lactococcus
*
Escherichia
Pseudomonas sp. TKP
Pseudomonas
*
Pseudomonas
Streptococcus macedonicus
Streptococcus
Enterobacter cloacae
Enterobacter
Klebsiella oxytoca
Klebsiella
Thermoanaerobacterium thermosaccharolyticum
Thermoanaerobacterium
Streptococcus sp. VT 162
Streptococcus
Enterococcus faecium
Enterococcus
Pseudomonas protegens
Pseudomonas
Micrococcus luteus
Micrococcus
Streptococcus infantarius
Streptococcus
Staphylococcus epidermidis
Staphylococcus
Enterobacter asburiae
Enterobacter
Cupriavidus metallidurans
Cupriavidus
Pseudomonas putida
Pseudomonas
Enterococcus casseliflavus
Enterococcus
Streptococcus mitis
Streptococcus
Streptococcus equi
Streptococcus
Salmonella enterica
Salmonella
Streptococcus oralis
Streptococcus
Klebsiella variicola
Klebsiella
Burkholderia lata
Burkholderia
Pseudomonas stutzeri
Pseudomonas
Streptococcus pneumoniae
Streptococcus
Streptococcus dysgalactiae
Streptococcus
*
Burkholderia
Pseudomonas fluorescens
Pseudomonas
Acinetobacter guillouiae
Acinetobacter
Veillonella parvula
Veillonella
Xanthomonas campestris
Xanthomonas
Exiguobacterium sp. AT1b
Exiguobacterium
Pseudomonas pseudoalcaligenes
Pseudomonas
Streptococcus pasteurianus
Streptococcus
Rothia dentocariosa
Rothia
*
*
Delftia acidovorans
Delftia
*
Propionibacterium
Acinetobacter baumannii
Acinetobacter
Streptococcus parasanguinis
Streptococcus
Pseudomonas sp. WCS374
Pseudomonas
Enterobacter aerogenes
Enterobacter
Stenotrophomonas maltophilia
Stenotrophomonas
Alicyclobacillus acidocaldarius
Alicyclobacillus
Haemophilus influenzae
Haemophilus
Rothia mucilaginosa
Rothia
Staphylococcus xylosus
Staphylococcus
Acidovorax sp. JS42
Acidovorax
*
Rahnella
Streptococcus pseudopneumoniae
Streptococcus
Streptococcus thermophilus
Streptococcus
Pseudomonas aeruginosa
Pseudomonas
Corynebacterium kroppenstedtii
Corynebacterium
Staphylococcus haemolyticus
Staphylococcus
Serratia sp. SCBI
Serratia
*
Enterococcus
Staphylococcus
Burkholderia cepacia
Burkholderia
*
Corynebacterium
Comamonas testosteroni
Comamonas
Bifidobacterium thermophilum
Bifidobacterium
*
Lactobacillus
Thermoanaerobacterium xylanolyticum
Thermoanaerobacterium
Ralstonia pickettii
Ralstonia
Meiothermus ruber
Meiothermus
Acidovorax ebreus
Acidovorax
Micrococcus sp. V7
Micrococcus
Leuconostoc mesenteroides
Leuconostoc
Bacillus halodurans
Bacillus
Corynebacterium variabile
Corynebacterium
*
Acidovorax
Bifidobacterium bifidum
Bifidobacterium
Rhizobium sp. IRBG74
Rhizobium
*
Acinetobacter
Acinetobacter calcoaceticus
Acinetobacter
Chroococcidiopsis thermalis
Chroococcidiopsis
Streptococcus sanguinis
Streptococcus
Acidovorax sp. KKS102
Acidovorax
Raoultella ornithinolytica
Raoultella
Ochrobactrum anthropi
Ochrobactrum
Lactobacillus johnsonii
Lactobacillus
Methylobacterium populi
Methylobacterium
Rhodococcus equi
Rhodococcus
Lactobacillus helveticus
Lactobacillus
*
Serratia
Burkholderia cenocepacia
Burkholderia
*
Staphylococcus
Enterobacter sp. R4-368
Enterobacter
Propionibacterium propionicum
Propionibacterium
Streptococcus gordonii
Streptococcus
Corynebacterium singulare
Corynebacterium
Burkholderia ambifaria
Burkholderia
*
Micrococcus
Fervidobacterium nodosum
Fervidobacterium
Aeromonas media
Aeromonas
Cronobacter sakazakii
Cronobacter
Myroides profundi
Myroides
Methylobacterium oryzae
Methylobacterium
*
Xanthomonas
Thermoanaerobacterium saccharolyticum
Thermoanaerobacterium
Pseudomonas mendocina
Pseudomonas
Corynebacterium ureicelerivorans
Corynebacterium
Lactobacillus crispatus
Lactobacillus
Alicycliphilus denitrificans
Alicycliphilus
Gardnerella vaginalis
Gardnerella
*
Gemella
*
Ralstonia
Eggerthella lenta
Eggerthella
*
*
Prevotella denticola
Prevotella
Prevotella intermedia
Prevotella
Psychrobacter sp. PRwf-1
Psychrobacter
Azospira oryzae
Azospira
Acinetobacter haemolyticus
Acinetobacter
*
Delftia
Burkholderia contaminans
Burkholderia
Arthrobacter arilaitensis
Arthrobacter
Dermacoccus nishinomiyaensis
Dermacoccus
Pantoea ananatis
Pantoea
Staphylococcus saprophyticus
Staphylococcus
Staphylococcus pasteuri
Staphylococcus
Rahnella aquatilis
Rahnella
Rahnella sp. Y9602
Rahnella
Campylobacter concisus
Campylobacter
Geobacillus sp. WCH70
Geobacillus
*
Frankia
Lactobacillus casei
Lactobacillus
Thiomonas intermedia
Thiomonas
Streptococcus gallolyticus
Streptococcus
Thioalkalivibrio
Thioalkalivibrio
*
Bradyrhizobium
Bifidobacterium longum
Bifidobacterium
Corynebacterium falsenii
Corynebacterium
Delftia sp. Cs1-4
Delftia
Acinetobacter sp. M131
Acinetobacter
Prevotella melaninogenica
Prevotella
*
*
Leuconostoc carnosum
Leuconostoc
Pectobacterium carotovorum
Pectobacterium
*
Myroides
*
Erwinia
Gordonia sp. KTR9
Gordonia
Paenibacillus sp. FSL R7-0273
Paenibacillus
Paracoccus sp. N81106
Paracoccus
Sphingobium fuliginis
Sphingobium
*
*
*
Geobacillus
Pseudomonas sp. VLB120
Pseudomonas
Pelagibacterium halotolerans
Pelagibacterium
Streptococcus intermedius
Streptococcus
Propionibacterium freudenreichii
Propionibacterium
*
Enterobacter
Nakamurella multipartita
Nakamurella
Haemophilus parasuis
Haemophilus
Fusobacterium nucleatum
Fusobacterium
Citrobacter freundii
Citrobacter
Ruminococcus sp. SR1/5
Ruminococcus
Pseudoxanthomonas spadix
Pseudoxanthomonas
Lactococcus garvieae
Lactococcus
Neisseria elongata
Neisseria
Acidovorax citrulli
Acidovorax
Novosphingobium pentaromativorans
Novosphingobium
Citrobacter koseri
Citrobacter
Methylobacterium aquaticum
Methylobacterium
Pseudomonas denitrificans
Pseudomonas
Rhodococcus erythropolis
Rhodococcus
Lactobacillus reuteri
Lactobacillus
Bacteroides fragilis
Bacteroides
Lactobacillus plantarum
Lactobacillus
*
Bacillus
Pseudomonas rhizosphaerae
Pseudomonas
Achromobacter xylosoxidans
Achromobacter
Lactobacillus amylovorus
Lactobacillus
Propionibacterium acidipropionici
Propionibacterium
Leuconostoc gelidum
Leuconostoc
Weissella thailandensis
Weissella
Pandoraea sp. RB-44
Pandoraea
Escherichia vulneris
Escherichia
Yersinia intermedia
Yersinia
Flavobacteriaceae bacterium
Flavobacteriaceae
*
Rhodococcus
Streptococcus sp. (N1)
Streptococcus
*
Methylobacterium
Sphingomonas sp. MM-1
Sphingomonas
Rhizobium etli
Rhizobium
*
Agrobacterium
Thermus scotoductus
Thermus
Methylobacterium extorquens
Methylobacterium
Streptococcus sp. I-P16
Streptococcus
Pantoea sp. PSNIH1
Pantoea
Methylobacterium radiotolerans
Methylobacterium
Blautia
Lactobacillus sakei
Lactobacillus
*
*
Bacillus licheniformis
Bacillus
Corynebacterium accolens
Corynebacterium
Corynebacterium sp.
Corynebacterium
Serratia symbiotica
Serratia
Lactobacillus delbrueckii
Lactobacillus
Bacillus sp. YP1
Bacillus
Klebsiella sp. PG122E
Klebsiella
*
Rathayibacter
Pseudoalteromonas sp. P30
Pseudoalteromonas
Staphylococcus
Corynebacterium resistens
Corynebacterium
Shigella dysenteriae
Shigella
*
*
Agrobacterium fabrum
Agrobacterium
Gordonia polyisoprenivorans
Gordonia
Pseudomonas balearica
Pseudomonas
*
*
Ruminococcus bromii
Ruminococcus
Brachybacterium faecium
Brachybacterium
Acinetobacter johnsonii
Acinetobacter
Micrococcus sp. A1
Micrococcus
Filifactor alocis
Filifactor
Pantoea vagans
Pantoea
Haemophilus parainfluenzae
Haemophilus
Pantoea rwandensis
Pantoea
Corynebacterium vitaeruminis
Corynebacterium
Pseudomonas poae
Pseudomonas
*
*
Lactobacillus fermentum
Lactobacillus
Anabaena variabilis
Anabaena
Sphingobacterium sp. ML3W
Sphingobacterium
*
*
Megasphaera elsdenii
Megasphaera
Pseudoxanthomonas suwonensis
Pseudoxanthomonas
Corynebacterium glutamicum
Corynebacterium
Sphingomonas taxi
Sphingomonas
Pseudomonas graminis
Pseudomonas
Bradyrhizobium sp. BTAi1
Bradyrhizobium
Enterococcus hirae
Enterococcus
Corynebacterium sp.
Corynebacterium
Arthrobacter phenanthrenivorans
Arthrobacter
Corynebacterium maris
Corynebacterium
Gordonia bronchialis
Gordonia
Kytococcus sedentarius
Kytococcus
Kosakonia cowanii
Kosakonia
Xenorhabdus bovienii
Xenorhabdus
Paracoccus haeundaensis
Paracoccus
Methylobacterium sp. 238
Methylobacterium
Acinetobacter sp. BW3
Acinetobacter
Aeromonas sobria
Aeromonas
Bacillus lehensis
Bacillus
Ralstonia solanacearum
Ralstonia
Citrobacter sp. FPO3
Citrobacter
Citrobacter sp. I91-3
Citrobacter
Erwinia amylovora
Erwinia
Klebsiella milletis
Klebsiella
Salmonella bongori
Salmonella
Serratia grimesii
Serratia
Yersinia pestis
Yersinia
*
Lactobacillus brevis
Lactobacillus
Kocuria sp. starX
Kocuria
Acinetobacter sp. 26
Acinetobacter
Peptoclostridium difficile
Peptoclostridium
Sorangium cellulosum
Sorangium
Pseudomonas sp. NSi14
Pseudomonas
Escherichia albertii
Escherichia
Mobiluncus curtisii
Mobiluncus
*
Caulobacter
Methylotenera versatilis
Methylotenera
Propionibacterium sp.
Propionibacterium
Bacillus sp. Pc3
Bacillus
Acinetobacter sp. EVA14
Acinetobacter
Agrobacterium
Alistipes shahii
Alistipes
*
Thermus
*
Methylibium
Escherichia fergusonii
Escherichia
Enterobacter sp. Ni15
Enterobacter
Capnocytophaga ochracea
Capnocytophaga
Thauera sp. 6NLG
Thauera
Desulfovibrio alaskensis
Desulfovibrio
Variovorax sp. Alb14
Variovorax
*
Shigella
*
*
*
Micromonospora
Thermobifida fusca
Thermobifida
Turneriella parva
Turneriella
Peptoclostridium
Acinetobacter sp. Ooi24
Acinetobacter
Ochrobactrum sp. SJY1
Ochrobactrum
Carnobacterium sp. WN1359
Carnobacterium
Iamia majanohamensis
Iamia
Saccharomonospora viridis
Saccharomonospora
Rhizobium sp.
Rhizobium
Staphylococcus sp. CDC3
Staphylococcus
Shigella sonnei
Shigella
Pseudomonas syringae
Pseudomonas
Burkholderia vietnamiensis
Burkholderia
Shigella boydii
Shigella
Bacillus weihenstephanensis
Bacillus
Erythrobacter litoralis
Erythrobacter
Pseudoalteromonas haloplanktis
Pseudoalteromonas
Pseudomonas sp. FGI182
Pseudomonas
*
Rhizobium
*
Rickettsia
Sphingobium yanoikuyae
Sphingobium
Stenotrophomonas rhizophila
Stenotrophomonas
*
Leuconostoc
Aquincola tertiaricarbonis
Aquincola
Nocardiopsis dassonvillei
Nocardiopsis
Carnobacterium maltaromaticum
Carnobacterium
*
Haemophilus
Bordetella parapertussis
Bordetella
*
Dietzia
Shewanella sp. W3-18-1
Shewanella
Sphingomonas sp. NP5
Sphingomonas
Staphylococcus gallinarum
Staphylococcus
Micavibrio aeruginosavorus
Micavibrio
Paracoccus denitrificans
Paracoccus
Cellulomonas
Corynebacterium jeikeium
Corynebacterium
*
*
Meiothermus silvanus
Meiothermus
Asticcacaulis excentricus
Asticcacaulis
*
Atopobium
Streptococcus constellatus
Streptococcus
Microcystis aeruginosa
Microcystis
Blautia
Thermus thermophilus
Thermus
Shigella flexneri
Shigella
*
Mycobacterium
Pseudomonas savastanoi
Pseudomonas
Staphylococcus capitis
Staphylococcus
*
Cupriavidus
Dyadobacter fermentans
Dyadobacter
Dietzia sp. CQ4
Dietzia
*
*
*
Neisseria
Corynebacterium aurimucosum
Corynebacterium
*
Pseudomonas fulva
Pseudomonas
Chromohalobacter salexigens
Chromohalobacter
Brevundimonas diminuta
Brevundimonas
Streptococcus lutetiensis
Streptococcus
Bordetella
Erythrobacter sp. JP13.1
Erythrobacter
Methylobacillus glycogenes
Methylobacillus
Candidatus Rhodoluna lacicola
Candidatus Rhodoluna
Arthrobacter sp. JBH1
Arthrobacter
Aggregatibacter aphrophilus
Aggregatibacter
Thauera sp. B4
Thauera
Lysobacter dokdonensis
Lysobacter
Clostridiales genomosp.
Streptococcus sp. I-G2
Streptococcus
Pseudomonas mandelii
Pseudomonas
Bradyrhizobium sp. S23321
Bradyrhizobium
Phenylobacterium zucineum
Phenylobacterium
Pseudomonas mosselii
Pseudomonas
Staphylococcus lugdunensis
Staphylococcus
Proteus mirabilis
Proteus
*
*
Arthrobacter sp. J3-40
Arthrobacter
*
Pantoea
Corynebacterium efficiens
Corynebacterium
*
Halomonas
Trueperella pyogenes
Trueperella
Streptomyces coelicolor
Streptomyces
Kocuria rhizophila
Kocuria
Bacillus cereus
Bacillus
Tannerella forsythia
Tannerella
*
Alkalibacterium
Atopobium parvulum
Atopobium
Serinicoccus profundi
Serinicoccus
*
Leptotrichia
*
*
Planomicrobium okeanokoites
Planomicrobium
Jannaschia sp. CCS1
Jannaschia
Paracoccus aestuarii
Paracoccus
Rhodobacter blasticus
Rhodobacter
Agrobacterium tumefaciens
Agrobacterium
Shewanella sp. ANA-3
Shewanella
Pseudomonas cichorii
Pseudomonas
Halomonas sp. A3H3
Halomonas
Serratia liquefaciens
Serratia
Sphingopyxis alaskensis
Sphingopyxis
*
Brevundimonas
Deinococcus deserti
Deinococcus
Desulfovibrio vulgaris
Desulfovibrio
Propionibacterium sp.
Propionibacterium
Paracoccus alcaliphilus
Paracoccus
Vibrio parahaemolyticus
Vibrio
Candidatus Saccharibacteria
Geitlerinema sp. PCC 7407
Geitlerinema
*
Actinomyces
Brevundimonas vesicularis
Brevundimonas
Acinetobacter sp. YS0810
Acinetobacter
*
Prevotella
Methyloceanibacter caenitepidi
Methyloceanibacter
Leuconostoc citreum
Leuconostoc
*
Bacteroides
Pseudomonas alcaligenes
Pseudomonas
Methylibium petroleiphilum
Methylibium
Moraxella catarrhalis
Moraxella
Sphingopyxis sp. Kp5.2
Sphingopyxis
Pandoraea apista
Pandoraea
*
Cellulomonas
Olsenella uli
Olsenella
Acinetobacter oleivorans
Acinetobacter
Sphingomonas sp. 133
Sphingomonas
Meiothermus taiwanensis
Meiothermus
Deinococcus geothermalis
Deinococcus
Lactobacillus sanfranciscensis
Lactobacillus
Acinetobacter sp. LUH5605
Acinetobacter
Staphylococcus simulans
Staphylococcus
Arsenophonus nasoniae
Arsenophonus
Buchnera aphidicola
Buchnera
Weissella koreensis
Weissella
Psychrobacter sp. G
Psychrobacter
Amycolicicoccus subflavus
Amycolicicoccus
Staphylococcus
Morganella morganii
Morganella
Kyrpidia tusciae
Kyrpidia
Ramlibacter tataouinensis
Ramlibacter
Weeksella virosa
Weeksella
Acinetobacter junii
Acinetobacter
Acinetobacter sp. 26132
Acinetobacter
Mycobacterium abscessus
Mycobacterium
Neisseria gonorrhoeae
Neisseria
Sphingomonas wittichii
Sphingomonas
Bacteroides dorei
Bacteroides
Corynebacterium halotolerans
Corynebacterium
Pelomonas aquatica
Pelomonas
Janibacter sp. TYM3221
Janibacter
*
Arthrobacter
Mycobacterium gordonae
Mycobacterium
Pimelobacter simplex
Pimelobacter
Pseudomonas sp. OM2164
Pseudomonas
Streptomyces albus
Streptomyces
Halomonas halocynthiae
Halomonas
Nitrosomonas sp. AL212
Nitrosomonas
Sphingobacterium mizutaii
Sphingobacterium
Vibrio cholerae
Vibrio
Rickettsia felis
Rickettsia
*
*
Bacteroidetes bacterium
Corynebacterium casei
Corynebacterium
Corynebacterium marinum
Corynebacterium
*
*
Caulobacter segnis
Caulobacter
Lactobacillus gasseri
Lactobacillus
*
Meiothermus
Rhizobium sp. NT-26
Rhizobium
Bacillus coagulans
Bacillus
*
Sphingomonas
*
Brevibacterium
Nitrosomonas europaea
Nitrosomonas
Pseudomonas alkylphenolia
Pseudomonas
Terrabacter sp. DBF63
Terrabacter
beta proteobacterium CB
Moraxella ovis
Moraxella
Shewanella baltica
Shewanella
Mycobacterium gilvum
Mycobacterium
*
Exiguobacterium
*
Ochrobactrum
Geodermatophilus obscurus
Geodermatophilus
*
Devosia
Moraxella osloensis
Moraxella
Exiguobacterium sp. 11-28
Exiguobacterium
Nocardioides sp. JS614
Nocardioides
Nocardioides sp. USM2
Nocardioides
Burkholderia gladioli
Burkholderia
Renibacterium salmoninarum
Renibacterium
Pseudomonas syringae
Pseudomonas
Bifidobacterium pseudolongum
Bifidobacterium
Corynebacterium imitans
Corynebacterium
Corynebacterium callunae
Corynebacterium
Bosea sp. WAO
Bosea
Xanthobacter autotrophicus
Xanthobacter
Corynebacterium diphtheriae
Corynebacterium
Bacteroides thetaiotaomicron
Bacteroides
Caulobacter vibrioides
Caulobacter
*
*
Finegoldia magna
Finegoldia
Anaerococcus prevotii
Anaerococcus
Azorhizobium caulinodans
Azorhizobium
Erysipelothrix rhusiopathiae
Erysipelothrix
Porphyromonas asaccharolytica
Porphyromonas
*
Sphingopyxis
Eubacterium rectale
Eubacterium
Acinetobacter venetianus
Acinetobacter
Variovorax paradoxus
Variovorax
Acinetobacter sp. ED45-25
Acinetobacter
Bradyrhizobium diazoefficiens
Bradyrhizobium
Bradyrhizobium japonicum
Bradyrhizobium
Megamonas
*
Methylobacillus
Nocardiopsis alba
Nocardiopsis
Modestobacter marinus
Modestobacter
Corynebacterium doosanense
Corynebacterium
Blastococcus saxobsidens
Blastococcus
Anoxybacillus flavithermus
Anoxybacillus
Aeromonas caviae
Aeromonas
Bacillus subtilis
Bacillus
Elizabethkingia anophelis
Elizabethkingia
Staphylococcus hominis
Staphylococcus
Ruminococcus bicirculans
Ruminococcus
Paracoccus marcusii
Paracoccus
*
Psychrobacter
Sphingomonas sanxanigenens
Sphingomonas
*
*
*
*
Bacteroides vulgatus
Bacteroides
Rhodopseudomonas palustris
Rhodopseudomonas
Pantoea agglomerans
Pantoea
*
*
Mycobacterium kansasii
Mycobacterium
*
Streptomyces
Enterococcus faecalis
Enterococcus
Acinetobacter sp. NFM2
Acinetobacter
Shewanella putrefaciens
Shewanella
Bifidobacterium adolescentis
Bifidobacterium
*
Bifidobacterium
Porphyromonas gingivalis
Porphyromonas
Neisseria meningitidis
Neisseria
Rhodococcus pyridinivorans
Rhodococcus
Aeromonas salmonicida
Aeromonas
Planococcus sp. PAMC
Planococcus
Pseudomonas simiae
Pseudomonas
Faecalibacterium prausnitzii
Faecalibacterium
Acinetobacter lwoffii
Acinetobacter
Exiguobacterium sp. N139
Exiguobacterium
Streptococcus anginosus
Streptococcus
Thauera sp. MZ1T
Thauera
*
Shewanella
*
Aeromonas
Staphylococcus aureus
Staphylococcus
Aeromonas hydrophila
Aeromonas
Aeromonas veronii
Aeromonas
Bifidobacterium breve
Bifidobacterium
Bacillus megaterium
Bacillus
Filobasidiella
Aspergillus
Toxoplasma
Filobasidiella
*
Neospora
Hammondia
*
Hammondia
Aspergillus
Hammondia
Aspergillus
Aspergillus
Bartheletia
Filobasidiella
Aspergillus
Spirometra
Mucor
Aspergillus
Setosphaeria
Caenorhabditis
Aspergillus
Anisakis
Penicillium
Plectosphaerella
*
Malassezia
Gongylonema
Wallemia
*
Candida
Elaeophora
Brugia
Parastrongyloides
Sordaria
Bipolaris
Malassezia
Cladosporium
Strongylocentrotus
Caenorhabditis
Penicillium
Fusarium
Leptosphaeria
Chaetomium
Myceliophthora
Alternaria
Saccharomyces
Aphanomyces
Ophiostoma
Rhodotorula
Trichosporon
Phoma
Erysiphe
Leptosphaeria
Zymoseptoria
Debaryomyces
Alternaria
Phoma
Umbilicaria
Penicillium
Wuchereria
Cladosporium
Saccharomyces
Fusarium
Dasybranchus
Neofusicoccum
Cladosporium
Penicillium
Ustilago
Mucor
Albugo
The present application is a continuation of U.S. patent application Ser. No. 15/917,286, filed Mar. 9, 2018, which is a continuation of International Patent Application No. PCT/US2016/052912, filed Sep. 21, 2016, which claims the benefit of and priority to U.S. Provisional Application No. 62/221,574, filed Sep. 21, 2015, the entire contents of which are incorporated herein by reference for all purposes.
This invention was made with government support under grant no. HL105704 awarded by the National Institutes of Health. The government has certain rights in the invention.
Number | Date | Country
---|---|---
62221574 | Sep 2015 | US
Relation | Number | Date | Country
---|---|---|---
Parent | 15917286 | Mar 2018 | US
Child | 15931487 | | US
Parent | PCT/US2016/052912 | Sep 2016 | US
Child | 15917286 | | US