The inventive subject matter generally relates at least to the fields of molecular biology, molecular diagnostics, infectious disease, and medicine. The inventive subject matter relates broadly to novel techniques for identification and comparative analysis of sequence features in metagenomic whole-genome shotgun (WGS) sequence data associated with particular disease states in a subject. More particularly, the inventive subject matter relates to diagnostic methods for distinguishing between different types of inflammatory bowel disease in a subject based on the microbial community signature of the subject.
The human body and its associated microbiota, the “human microbiome,” represent a complex superorganism with thousands of microbial species distributed biogeographically among niche body sites [1]. The fundamental role of the microbiome for human health is generally accepted and a growing body of literature reports associations between specific diseases or disorders and the microbiome [2-4]. However, in many cases it has remained difficult to correlate a defined microbiome characteristic with a specific disease, disorder, condition, or symptom thereof. Furthermore, recent studies have demonstrated significant diversity in microbial communities between healthy humans, most notably in the digestive tract [2,4-8]. Additionally, the host-microbiome relationship is highly dynamic; host-induced changes to diet or other perturbations, such as antibiotic treatment, can significantly alter microbial composition thereby potentially inducing secondary effects on human health [7,9-11]. Despite the known within- and between-subject taxonomic variation, Applicants have determined that by characterizing certain complex systems, specific microbiome signatures can be identified that are predictive of health and/or disease state(s) of the human host.
The rapid advancement of next-generation DNA sequencing technologies has allowed for deep sampling of human-associated microbial communities either in the form of whole genome shotgun sequencing of metagenomic DNA or multiplexed PCR amplicons of the 16S rRNA gene. The Human Microbiome Project (HMP) has already sequenced over ~3.5 TB of microbial DNA from ~750 human samples, revealing a vast phylogenetic diversity and functional capacity of commensals throughout the human population and across the human body [1,12,13]. There is extensive information to learn from the HMP data, including what constitutes a healthy or “normal” microbiome.
Assembly and taxonomic assignment of metagenomic data remain active areas of research for computational scientists. Although next-generation platforms produce tremendous amounts of sequence, many analysis protocols that were developed for the longer sequence reads from Sanger or Roche/454 technologies are difficult to apply because of the short read lengths. While there has been substantial effort to develop computational methods that assign single metagenomic reads and assembled contig fragments to specific taxonomic lineages or metabolic functions [14-20], identification of a group of specific nucleic acid sequences (a “signature”) that defines a particular microbiome within samples, as a way to associate sample composition with sample background, has received less attention.
While recent applications of machine learning algorithms have demonstrated that metagenomic sample classification with some accuracy at some level is achievable [21-25], these studies were largely limited to analyses of targeted sequence data, i.e. products of polymerase chain reaction (PCR) amplifications of the universal bacterial marker gene, the small ribosomal RNA subunit or 16S rRNA. Though easy to obtain, the prior art 16S rRNA datasets suffer from several deficiencies, providing limited phylogenetic information which is typically heavily biased by primer design, amplicon region, and gene copy number variability. Moreover, although phylogenetically-related bacteria tend to show similar phenotypic presentations, 16S rRNA-based phylogenetic profiles and the composition of health-relevant functional microbiomes are not necessarily correlated. For example, similar functions could be performed by phylogenetically distinct organisms within the microbiome. According to the understanding prior to Applicants' work, in the extreme case, this could in theory result in a situation where two perfectly healthy human subjects have completely different, non-overlapping phylogenetic microbiome compositions. In contrast, WGS metagenomic datasets, which do not include a PCR amplification step, have fewer technical caveats relative to 16S rRNA surveys, which in turn provides novel opportunities to find distinctive features of different microbiomes and associations with phenotypic microbiota representations.
In this application, Applicants demonstrate that the elucidation of distinctive features of different microbiomes and phenotypic microbiota representations are useful for diagnosing and monitoring the state of a disease, disorder, or condition in a subject. The inventive subject matter provides solutions to this and other important problems associated with inflammatory bowel disease as an exemplary disease state.
In one aspect, the inventive subject matter is directed to a method of diagnosing inflammatory bowel disease in a subject in need thereof, comprising the steps of:
In another aspect, the inventive subject matter is directed to a method of differentially diagnosing Crohn's disease and ulcerative colitis in a subject in need thereof, comprising the steps of:
In one aspect, the inventive subject matter is directed to methods for classifying patients with inflammatory bowel disease (IBD) from patients without IBD. In another aspect, the inventive subject matter is directed to methods for distinguishing between inflammatory bowel diseases in a subject, for example distinguishing between ulcerative colitis (UC) and Crohn's disease (CD) in a subject. The inventive subject matter is based on Applicants' discovery that the characterization and analysis of the particular microbiome associated with a subject, such as a human subject, can be used to classify the subject as having a disease, disorder, or condition such as having IBD or not, or having UC or having CD. The microbiome of a subject can thus be screened, and the health and/or medical condition of the subject can be determined.
Because microbiomes are composed of bacterial populations that differ between individuals, even healthy subjects will have taxonomically different microbiomes. Therefore, a single sequence feature of the microbiome, considered alone, would generally not be expected to serve as a functional diagnostic. However, Applicants have further determined that groups ranging from at least about 10 to 50 or more different sequence features can be screened in a microbiome sample from a subject, and statistically significant diagnoses about the relative health of the subject can be made using this information. In one aspect, some features of interest of the inventive subject matter include particular nucleic acid molecules produced by the population of bacteria that comprise a subject's microbiome. The inventive subject matter uses short nucleic acid oligomers, interchangeably termed “kmers” herein, to survey the entire composite metagenome of all individual microbial genomes obtained from a microbiome sample from a subject. Applicants have determined that by surveying the entire metagenome for the relative amount of pre-selected, relevant kmer sequences, also called significant features herein, one may determine the signature of the microbiome and classify the subject as having or not having a particular disease, disorder, or condition. Thus, the inventive subject matter takes advantage of Applicants' discovery that the composition and characteristics of a subject's microbiome are, in some cases, directly correlated with a particular disease, disorder, or condition.
In an aspect, the inventive subject matter is directed to a method of classifying a subject as having inflammatory bowel disease (IBD), comprising: (a) providing (i) a sample obtained from the subject, the sample comprising nucleic acid from the subject's microbiome, and (ii) a signature(s) that distinguishes subjects having IBD from a subject not having IBD; (b) determining in the sample the abundance of the features comprising the signature (“the dataset”); and (c) classifying the dataset from (b) as being either obtained from a subject having IBD or a subject not having IBD.
In certain aspects, the signature(s) that distinguishes a subject having IBD from a subject not having IBD is obtained by determining a set of kmers with statistically significant differential abundance between a microbiome sample from a subject having IBD and a microbiome sample from a subject not having IBD, and performing an algorithm to identify one or more groups of between 1 and 50 kmers from within the set of kmers that distinguish a subject having IBD from a subject not having IBD. In certain aspects, the kmers are oligomers comprising between 2 and 10 nucleotides, including 2, 3, 4, 5, 6, 7, 8, 9, or 10 nucleotides. In some aspects, the kmers are octamers comprising 8 nucleotides. In some aspects, the signature comprises at least one kmer selected from those listed in Tables 3, 7, and 10. In other aspects, the signature comprises at least one of the 21 kmers listed in Table 7.
In certain aspects, the data set is classified as being obtained from a subject having IBD or a subject not having IBD by performing a nearest neighbor analysis on the data set.
In another aspect, the inventive subject matter is directed to a method for recommending a subject for treatment, comprising: (a) providing (i) a sample obtained from the subject, the sample comprising nucleic acid from the subject's microbiome, and (ii) a signature that distinguishes subjects having IBD from a subject not having IBD; (b) determining in the sample the abundance of the features comprising the signature (“the dataset”); (c) classifying the dataset from (b) as being either obtained from a subject having IBD or a subject not having IBD; and (d) recommending the subject for a treatment when the dataset from (c) is classified as being from a subject with IBD.
In certain aspects, the signature(s) that distinguishes a subject having IBD from a subject not having IBD is obtained by determining a set of kmers with statistically significant differential abundance between a microbiome sample from a subject having IBD and a microbiome sample from a subject not having IBD, and performing an algorithm to identify one or more groups of between 1 and 50 kmers that distinguish a subject having IBD from a subject not having IBD. In certain aspects, the kmers are oligomers comprising between 2 and 10 nucleotides, including 2, 3, 4, 5, 6, 7, 8, 9, or 10 nucleotides. In some aspects, the kmers are octamers comprising 8 nucleotides. In some aspects, the signature comprises at least one kmer selected from those listed in Tables 3, 7, and 10. In other aspects, the signature comprises at least one of the 21 kmers listed in Table 7.
In certain aspects, the data set is classified as being obtained from a subject having IBD or a subject not having IBD by performing a nearest neighbor analysis on the data set.
In some aspects, the treatment is an endoscopic procedure. The endoscopic procedure may be selected from colonoscopy, sigmoidoscopy, capsule endoscopy, and endoscopic ultrasound. In an aspect, the endoscopic procedure is a colonoscopy.
In another aspect, the inventive subject matter is directed to a method of classifying a subject as having ulcerative colitis (UC) or Crohn's disease (CD) comprising: (a) providing (i) a sample obtained from the subject, the sample comprising nucleic acid from the subject's microbiome, and (ii) a signature that distinguishes subjects having UC from a subject having CD; (b) determining in the sample the abundance of the features comprising the signature (“the dataset”); and (c) classifying the dataset from (b) as being either obtained from a subject having UC or a subject having CD.
In certain aspects, the signature(s) that distinguishes a subject having UC from a subject having CD is obtained by determining a set of kmers with statistically significant differential abundance between a microbiome sample from a subject having UC and a microbiome sample from a subject having CD, and performing an algorithm to identify one or more groups of between 1 and 50 kmers that distinguish a subject having UC from a subject having CD. In certain aspects, the kmers are oligomers comprising between 2 and 10 nucleotides, including 2, 3, 4, 5, 6, 7, 8, 9, or 10 nucleotides. In some aspects, the kmers are octamers comprising 8 nucleotides. In another aspect, the signature comprises one or more of those listed in Table 4. In another aspect, the signature comprises one or more of those listed in Table 5. In another aspect, the signature comprises one or more kmers listed in Table 3.
In a further aspect, the inventive subject matter is directed to a method of distinguishing between a subject having ulcerative colitis and a subject having Crohn's disease, comprising: (a) determining a set of kmers with statistically significant differential abundance between a microbiome sample from a subject having ulcerative colitis (UC) and a microbiome sample from a subject having Crohn's disease (CD); (b) performing an algorithm to identify one or more groups of between 1 and 50 kmers that distinguish a subject having UC from a subject having CD, wherein such groups are termed signatures; (c) obtaining a microbiome sample from a subject having an inflammatory bowel disease; (d) isolating and sequencing microbiome nucleic acid of the sample; (e) determining the abundance of nucleic acid in the sample corresponding to kmers comprising at least one of the signatures of (b) to produce a data set; (f) normalizing the data set; and (g) classifying the data set as being obtained from a subject having UC or a subject having CD, thereby distinguishing between a subject having UC and a subject having CD.
In certain aspects, the microbiome sample is a stool sample. In certain aspects, the nucleic acid is DNA, cDNA or RNA. In certain aspects, the signature is at least one signature selected from among the 1087 signatures of Table 4. In certain aspects, the signature is at least one signature selected from among the 17 signatures of Table 5.
In certain aspects, the data set is classified as being obtained from a subject having UC or a subject having CD by performing a nearest neighbor analysis on the data set.
The foregoing has outlined rather broadly the aspects and technical advantages of the present invention in order that the detailed description of the invention that follows may be better understood. Additional aspects and advantages of the invention will be described herein, which form the subject of the claims of the invention. It should be appreciated by those skilled in the art that any conception and specific aspect disclosed herein may be readily utilized as a basis for modifying or designing other structures for carrying out the same purposes of the present invention. It should also be realized by those skilled in the art that such equivalent constructions do not depart from the spirit and scope of the invention as set forth in the appended claims. The novel aspects which are believed to be characteristic of the invention, both as to its organization and method of operation, together with further objects and advantages will be better understood from the following description. It is to be expressly understood, however, that any description, table, example, etc. is provided for the purpose of illustration and description only and is by no means intended to define the limits of the invention.
Unless otherwise noted, technical terms are used according to conventional usage. Definitions of common terms in molecular biology may be found, for example, in Benjamin Lewin, Genes VII, published by Oxford University Press, 2000 (ISBN 019879276X); Kendrew et al. (eds.), The Encyclopedia of Molecular Biology, published by Blackwell Publishers, 1994 (ISBN 0632021829); Robert A. Meyers (ed.), Molecular Biology and Biotechnology: a Comprehensive Desk Reference, published by Wiley, John & Sons, Inc., 1995 (ISBN 0471186341); and other similar technical references.
As used herein, “a” or “an” may mean one or more. As used herein when used in conjunction with the word “comprising,” the words “a” or “an” may mean one or more than one. As used herein “another” may mean at least a second or more. Furthermore, unless otherwise required by context, singular terms include pluralities and plural terms include the singular.
As used herein, “about” refers to a numeric value, including, for example, whole numbers, fractions, and percentages, whether or not explicitly indicated. The term “about” generally refers to a range of numerical values (e.g., +/−5-10% of the recited value) that one of ordinary skill in the art would consider equivalent to the recited value (e.g., having the same function or result). In some instances, the term “about” may include numerical values that are rounded to the nearest significant figure.
As used herein, the term “drug” is as defined under 21 U.S.C. §321(g)(1), and means “(A) articles recognized in the official United States Pharmacopoeia, official Homeopathic Pharmacopoeia of the United States, or official National Formulary, or any supplement to any of them; and (B) articles intended for use in the diagnosis, cure, mitigation, treatment, or prevention of disease in man or other animals; and (C) articles (other than food) intended to affect the structure or any function of the body of man or other animals . . . .”
As used herein, the terms “patient” and “subject” refer to a mammal, for example, a human.
As used herein, the terms “treating” or “treatment” of any disease refers to reversing, alleviating, arresting, or ameliorating a disease or at least one of the clinical symptoms of a disease, reducing the risk of acquiring at least one of the clinical symptoms of a disease, inhibiting the progress of a disease or at least one of the clinical symptoms of the disease or reducing the risk of developing at least one of the clinical symptoms of a disease. “Treating” or “treatment” also refers to inhibiting the disease, either physically, (e.g., stabilization of a discernible symptom), physiologically, (e.g., stabilization of a physical parameter), or both, and to inhibiting at least one physical parameter that may or may not be discernible to the patient. In certain aspects, “treating” or “treatment” refers to protecting against or delaying the onset of at least one or more symptoms of a disease in a patient.
The inventive subject matter is directed to a method of diagnosing inflammatory bowel disease in a subject in need thereof, comprising the steps of:
In one aspect of the inventive subject matter, said microbiome sample is isolated from a stool sample.
In another aspect of the inventive subject matter, the presence of said nucleic acid oligomer is determined by whole-genome shotgun sequencing of total DNA isolated from said microbiome sample.
In a further aspect of the inventive subject matter, said nucleic acid oligomer comprises between 2 and 10 nucleotides.
In a preferred embodiment, said oligomer comprises an octamer.
In a more preferred embodiment, said octamer is selected from the group consisting of the octamers listed in Tables 3 and 10.
In a most preferred embodiment, said octamer is selected from the group consisting of the octamers listed in Table 7.
In an alternate aspect of the inventive subject matter, the method comprises the additional step of reporting said diagnosis to a medical professional, to the subject or a representative of the subject, or to a combination thereof.
The inventive subject matter is further directed to a method of differentially diagnosing Crohn's disease and ulcerative colitis in a subject in need thereof, comprising the steps of:
In one aspect of the inventive subject matter, said microbiome sample is isolated from a stool sample.
In another aspect of the inventive subject matter, the presence of each said nucleic acid oligomer is determined by whole-genome shotgun sequencing of total DNA isolated from said microbiome sample.
In a further aspect of the inventive subject matter, each said nucleic acid oligomer comprises between 2 and 10 nucleotides.
In a preferred embodiment, each said nucleic acid oligomer comprises an octamer.
In an alternate aspect of the inventive subject matter, said signature is selected from the group consisting of the signatures listed in Table 4.
In a preferred embodiment, said signature is selected from the group consisting of the signatures listed in Table 5.
In yet another aspect of the inventive subject matter, said method comprises the additional step of reporting said diagnosis to a medical professional, to the subject or a representative of the subject, or to a combination thereof.
In an aspect, the inventive subject matter provides methods to identify feature sets that can be used to classify an individual into one of the sets. Microbial communities associated with the human body are collectively summarized as the “human microbiome.” Differences in human microbiome compositions are generally believed to be associated with different body sites as well as with specific health and disease states. These complex communities of microbial organisms can be studied on the systems level using whole-genome shotgun (WGS) sequencing of total DNA isolated from microbiome samples (termed “metagenomics”). Using bioinformatic tools, microbiome-specific signatures can be identified and used to provide valuable information for basic research, forensics, and clinical diagnostics. However, which features of the microbiome provide the best signature to distinguish and classify microbiome samples remains yet to be determined, as many commonly used parameters in microbial ecology (e.g., 16S rRNA-based phylogenetic community compositions) show large variations across apparently related sample populations (e.g. healthy human stool samples).
Using a computationally efficient, but statistically rigorous, method developed by the inventors and described herein, predictive diagnostic markers (termed “features” herein) have been identified within metagenomic datasets that allow classification of samples as having particular signatures that correspond to selected gastrointestinal disease backgrounds. In particular, a method for identifying microbial community signatures that distinguish a priori-selected sample groups of interest has been developed. These signatures are based on sequence data compositions from metagenomic samples, i.e. total sequenced DNA isolates from entire microbiome samples. Classifiers based on these signatures were assessed for robust sensitivity and specificity using a cross-validation sub-sampling procedure in which random sets of samples were selected and subsequently re-classified individually using the remaining data. Additional testing could be performed using classifiers to assign samples that were not part of the original training sets. This approach was applied to data from the Human Microbiome Project as well as other gut microbiome datasets, including data from IBD patients, and IBD patients with Crohn's disease (CD) and/or ulcerative colitis (UC).
Using this method of statistical analysis, sets of features were identified that allow for patients with Crohn's disease to be distinguished from patients having ulcerative colitis at a statistically-significant rate of accuracy. The method was also applied to identify sets of features that allow for patients with IBD to be distinguished from patients without IBD. Thus, the method of statistical analysis has been applied to metagenomic datasets to create classifiers that accurately diagnose patients as having IBD, as well as distinguish between patients with ulcerative colitis and Crohn's disease. These feature sets can be used to assign additional samples to these two groups. The inventive subject matter may be used to (i) identify signatures characteristic for each of the two classes, (ii) assess the statistical significance associated with these signatures, and (iii) perform a classification of additional samples of unknown patient backgrounds to either of the two groups.
In particular, the inventive subject matter relates to the identification and comparative analysis of sequence-features in metagenomic whole-genome shotgun (WGS) sequence data derived from human clinical specimens. In combination with the statistical analysis method developed by the inventors, a group of sequence-features that correspond to particular human microbiomes, termed significant features, are identified, and together form microbiome signatures. Short nucleic acid sequences or oligomers (termed “kmers” herein) of generally 8 nucleotides are then identified that serve as classifiers for use in screening for signatures that will place a sample into a particular class (e.g., a patient with CD or UC, or a patient with IBD or without IBD). A set of terms that define specific relationships in the method are as follows:
The general manner in which the statistical methodology detailed in the present application is performed can be summarized as follows:
While the Examples provided herein are directed to the analysis of samples from human subjects, it will be apparent to one of ordinary skill in the art that the methods described herein can be conducted in conjunction with microbiome samples from a wide variety of animals, including, but not limited to, humans and non-human animals, e.g., a non-human primate, bird, horse, cow, goat, sheep, a companion animal, such as a dog, cat or rodent, or other mammal.
Microbiome samples analyzed using the inventive methods are not limited in the source or location on the subject from which they are obtained or the means used to obtain them. Example sources include dental plaques, such as supragingival plaque, saliva, and stool. Example locations include all areas of the skin, including on and around the anus, vagina, urethra, and the interior of the mouth. Additional example locations include the anterior nares, buccal mucosa, posterior fornix, and tongue dorsum. Exemplary techniques for obtaining samples are well known in the art, and include swabs.
When the source of the microbiome sample is stool, extraction of total metagenomic DNA can be conducted using known techniques (Gajer et al., Sci Transl Med 4:132-152 (2012); Sellitto et al., PLoS One 7: e33387 (2012)), including community-approved standard operating procedures (SOPs) established and published by the Human Microbiome Project (http://hmpdacc.org/tools_protocols/tools_protocols.php). In particular, the MoBio UltraClean® Fecal DNA Isolation Kit (MO BIO Laboratories, Inc., Carlsbad, Calif.) can be used, which contains a humic acid inhibitor removal step.
Metagenomic DNA can be multiplexed and sequenced, for example, on a single channel of an Illumina HiSeq 2000 platform, following the manufacturer's recommendations as amended by the Genomics Resource Center at the Institute for Genome Sciences. For example, combining 100 samples on a single HiSeq 2000 channel will generate about 3 million 100 bp paired-end sequence reads per sample.
Working with stool samples provides advantages with respect to potential clinical integration as a high-throughput diagnostic tool: (i) sample amounts: typically 0.1 g of stool are sufficient to isolate genomic DNA for several rounds of sequencing; (ii) automation: robotic platforms and corresponding sample processing kits from several manufacturers (e.g. MoBio, Qiagen, Zymo) allow for automated processing of stool samples for DNA isolation. Together with automated workflow systems for sequencing library generation and sequencing on the Illumina HiSeq 2000 or MiSeq platform, fast, efficient and cost-effective sample processing can be achieved.
A method of statistical analysis has been developed whereby metagenomic WGS sequence data of particular microbiomes is analyzed to identify particular signatures that can be used to distinguish samples as belonging to one of two (or more) defined classes. Short oligomers of 8 nucleotides (“kmers”), for example, are then identified that serve as features for use in screening for signatures to develop a classifier that will place a sample into a particular class (e.g., a patient with IBD, CD or UC). Kmers of other lengths, such as from 1 to 50 nucleotides in length, may also be used, and the lengths include each integer from 1 to 50.
The described procedure is comprised of three major phases: (1) significant feature selection, (2) signature selection and classifier development, and (3) classifier validation and testing.
The first phase of the analysis begins by obtaining a set of n metagenomic dataset samples S = [S1, S2, S3, . . . , Sn] associated with a set of m mutually-exclusive classes, where m ≥ 2. The metagenomic datasets can be obtained from existing databases, such as the Human Microbiome Project (HMP) body site database or from the MetaHIT project [4], which contains WGS metagenomic samples from stool of European individuals.
The number of times short DNA oligomers (kmers; e.g., oligonucleotide sequences of 8 nucleotides, i.e., 8mers) appear in each metagenomic sample S is counted using the Jellyfish program from the open source wgs-assembler software package (http://www.cbcb.umd.edu/software/jellyfish/). Overlapping kmers are counted on both DNA strands, i.e. in the original and reverse complement sequence of each read, resulting in sequences being counted twice. To reduce redundancy, only canonical kmers are stored, i.e., a specific kmer w and its reverse complement w_rc are not distinguished; whichever of the two appears first in alphabetical order is recorded. Of the 4^8 = 65,536 possible 8mers, 256 are their own reverse complement and the remaining 65,280 collapse into 32,640 pairs. Therefore, the feature space consists of 32,896 canonical kmers (4^8 × (0.5 × 255/256 + 1/256) = 32,896).
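The canonical counting described above can be illustrated with a minimal Python sketch. It is not the Jellyfish implementation; the function names are hypothetical, and the sketch assumes that each overlapping window is counted once under its canonical form (counting both strands would simply double every count and leave relative abundances unchanged) and that windows containing bases other than A, C, G, or T are skipped.

```python
from collections import Counter

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(kmer: str) -> str:
    """Return the reverse complement of a DNA string over A, C, G, T."""
    return kmer.translate(_COMPLEMENT)[::-1]


def canonical(kmer: str) -> str:
    """Return whichever of the kmer and its reverse complement sorts first."""
    rc = reverse_complement(kmer)
    return kmer if kmer <= rc else rc


def count_canonical_kmers(reads, k=8):
    """Count overlapping canonical kmers across reads (hypothetical helper).

    Assumption: windows containing non-ACGT characters (e.g. N) are skipped.
    """
    counts = Counter()
    for read in reads:
        read = read.upper()
        for i in range(len(read) - k + 1):
            window = read[i:i + k]
            if set(window) <= set("ACGT"):
                counts[canonical(window)] += 1
    return counts


def canonical_space_size(k: int) -> int:
    """Number of distinct canonical kmers of length k."""
    if k % 2:
        # Odd-length kmers can never equal their own reverse complement.
        return 4 ** k // 2
    palindromes = 4 ** (k // 2)  # kmers equal to their own reverse complement
    return (4 ** k - palindromes) // 2 + palindromes
```

For k = 8, `canonical_space_size(8)` returns 32896, matching the feature space size stated above.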
Kmers are next assessed for differential mean abundance between the two classes, as defined a priori, using the Metastats program [26] (-b set to 5000 permutations). Due to the sizeable feature space of 32,896 kmers, accounting for multiple hypothesis tests is critical to assess false positives. This means that between any two groups, a certain number of kmers will show differential abundance by chance. Therefore, a p-value threshold is selected that controls the false discovery rate [27]. Given a p-value threshold, this results in a set of kmers, designated as corresponding to significant features, to be considered in the next phase. Lastly, kmer counts are normalized to relative abundances within each sample (i.e. given a value between 0 [0%] and 1 [100%]).
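The following Python sketch illustrates the three ideas in this step: normalization to relative abundance, a permutation test for differential mean abundance, and selection of a p-value threshold that controls the false discovery rate. It is a simplified stand-in and does not reproduce the Metastats program. The function names are hypothetical, the test statistic (absolute difference of means) is an assumption, and the Benjamini-Hochberg procedure is assumed here as one common way to control the false discovery rate; the specific procedure of reference [27] is not restated in this section.

```python
import random
from statistics import mean


def relative_abundance(counts):
    """Convert raw kmer counts for one sample into fractions that sum to 1."""
    total = sum(counts.values())
    return {kmer: n / total for kmer, n in counts.items()}


def permutation_pvalue(group_a, group_b, n_perm=5000, seed=0):
    """Two-sided permutation p-value for a difference in group means.

    Assumption: the statistic is the absolute difference of means, which is
    simpler than the statistic used by Metastats.
    """
    rng = random.Random(seed)
    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        if abs(mean(pooled[:n_a]) - mean(pooled[n_a:])) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)  # add-one correction avoids p = 0


def bh_threshold(pvalues, fdr=0.05):
    """Largest p-value accepted by the Benjamini-Hochberg step-up rule.

    Returns 0.0 when no p-value passes at the requested FDR level.
    """
    ranked = sorted(pvalues)
    m = len(ranked)
    cutoff = 0.0
    for rank, p in enumerate(ranked, start=1):
        if p <= fdr * rank / m:
            cutoff = p
    return cutoff
```

Kmers whose p-values fall at or below the returned threshold would be carried forward as significant features under these assumptions.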
To prevent biased weighting of significant features during the classification process, the distribution of each feature is centered to zero mean and scaled to unit variance. Specifically, given the distribution of values for a feature F across all datasets, S_F = [S_{1F}, S_{2F}, S_{3F}, . . . , S_{nF}], the sample mean and variance of this vector (μ_F and σ_F^2, respectively) are computed, and the vector is normalized by subtracting μ_F from each value in the feature vector and subsequently dividing by the square root of the variance:

$$Z_{iF} = \frac{S_{iF} - \mu_F}{\sqrt{\sigma_F^2}}, \qquad i = 1, \ldots, n$$

where Z_{iF} denotes the normalized value of feature F in sample i.
The associated scaling factors (μ_F and σ_F^2) from this normalization are also applied to all downstream test sets.
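A minimal Python sketch of this normalization follows. The function names are hypothetical, and the sketch assumes that “sample variance” means the unbiased estimator with an n − 1 denominator, which is what `statistics.variance` computes.

```python
from math import sqrt
from statistics import mean, variance


def fit_scaling(values):
    """Return (mean, sample variance) of one feature across the training samples."""
    return mean(values), variance(values)


def apply_scaling(values, mu, var):
    """Center to zero mean and scale to unit variance using stored factors.

    Assumption: var > 0; a feature that is constant across the training
    samples cannot be scaled this way and would need to be dropped first.
    """
    sd = sqrt(var)
    return [(v - mu) / sd for v in values]
```

The factors are fitted once on the training samples and then reused unchanged, for example `mu, var = fit_scaling(train_values)` followed by `apply_scaling(test_values, mu, var)` for any later test set.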
For each class b, the data containing all samples and corresponding normalized values of significant features are processed according to the following algorithm. For each selection iteration i, a subset of Ni significant features is first selected from the set of differentially abundant features determined in 1.2 (Nmin ≤ Ni ≤ Nmax). This set of Ni features comprises a signature. Under this signature:
At the end of this procedure, prediction results for each signature selection iteration i relative to class b are obtained. Signatures are ranked by overall classification accuracy.
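The selection procedure can be sketched in Python as a random search over feature subsets. The per-signature steps are not reproduced in this passage, so the sketch assumes that each sample is re-classified by leave-one-out nearest-neighbor assignment with a Euclidean distance metric and that signatures are ranked by the fraction of correct assignments. All names are hypothetical.

```python
import random
from math import dist


def nearest_label(query, refs, ref_labels, features):
    """Label of the reference sample closest to the query on the given features."""
    q = [query[f] for f in features]
    best = min(range(len(refs)),
               key=lambda j: dist(q, [refs[j][f] for f in features]))
    return ref_labels[best]


def loo_accuracy(samples, labels, features):
    """Fraction of samples assigned correctly when each is left out in turn."""
    correct = 0
    for i, sample in enumerate(samples):
        refs = samples[:i] + samples[i + 1:]
        ref_labels = labels[:i] + labels[i + 1:]
        if nearest_label(sample, refs, ref_labels, features) == labels[i]:
            correct += 1
    return correct / len(samples)


def search_signatures(samples, labels, n_features, n_min, n_max,
                      iterations, seed=0):
    """Draw random signatures of n_min..n_max features and rank by accuracy.

    labels[i] is True if sample i belongs to class b, False otherwise.
    """
    rng = random.Random(seed)
    results = []
    for _ in range(iterations):
        size = rng.randint(n_min, n_max)
        features = sorted(rng.sample(range(n_features), size))
        results.append((loo_accuracy(samples, labels, features), features))
    results.sort(key=lambda r: r[0], reverse=True)
    return results
```

Each entry of the returned list pairs an accuracy with the feature indices of one signature, so the head of the list holds the top-ranking signatures for class b.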
The nearest neighbor algorithm is a mainstay of machine learning and has been successfully used for many classification problems ranging from secondary structures of proteins to face recognition. For high dimensional datasets, it is not possible to evaluate every unique combination of parameters (e.g., [100 choose 8] = $1.861 \times 10^{11}$); therefore, selection of Nmin and Nmax is important. For each application of the method of statistical analysis, broad searches are empirically performed to identify practical values for Nmin and Nmax.
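The size of the search space can be checked with a short calculation. The helper name below is hypothetical; it simply sums the binomial coefficients for every signature size between Nmin and Nmax.

```python
from math import comb


def signature_space_size(n_features, n_min, n_max):
    """Number of distinct signatures with n_min..n_max features."""
    return sum(comb(n_features, size) for size in range(n_min, n_max + 1))
```

With 100 candidate features and signatures of exactly 8 features, `signature_space_size(100, 8, 8)` gives 186,087,894,300, which is the 1.861×10^11 figure quoted above.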
By performing this classification training separately for each class, the possibility that some signatures may distinguish a particular class from all others, but may not effectively distinguish all classes universally from one another, is allowed for. This flexibility leads to the application of multiple binary classifiers for class-specific assignment instead of a single universal classifier.
The robustness of top ranking classifiers is assessed using the following cross-validation (CV) procedure. For each repeated CV iteration, a random subset of samples is removed from the full dataset (e.g. 15%) and used for cross-validation consisting of re-classification of each removed sample individually on the basis of the remaining samples and the same nearest-neighbor algorithm as described above. The average sensitivity and specificity across all CV iterations are recorded for each classifier. Choices for iteration and size of validation set vary depending on the characteristics of the dataset.
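The repeated subsampling procedure can be sketched as follows. The sketch assumes that the samples have already been reduced to the normalized features of one signature and that iterations in which no held-out sample belongs to a class are skipped for that class's statistic. Function and parameter names are hypothetical.

```python
import random
from math import dist


def classify(query, refs, ref_labels):
    """Assign the label of the nearest reference sample (Euclidean distance)."""
    j = min(range(len(refs)), key=lambda i: dist(query, refs[i]))
    return ref_labels[j]


def repeated_holdout(samples, labels, fraction=0.15, iterations=100, seed=0):
    """Return mean sensitivity and specificity over repeated hold-out rounds.

    labels[i] is True for class b and False otherwise.
    """
    rng = random.Random(seed)
    n_hold = max(1, round(fraction * len(samples)))
    sens, spec = [], []
    for _ in range(iterations):
        held = set(rng.sample(range(len(samples)), n_hold))
        refs = [s for i, s in enumerate(samples) if i not in held]
        ref_labels = [l for i, l in enumerate(labels) if i not in held]
        tp = fn = tn = fp = 0
        for i in held:
            predicted = classify(samples[i], refs, ref_labels)
            if labels[i]:
                tp += 1 if predicted else 0
                fn += 0 if predicted else 1
            else:
                tn += 0 if predicted else 1
                fp += 1 if predicted else 0
        if tp + fn:
            sens.append(tp / (tp + fn))
        if tn + fp:
            spec.append(tn / (tn + fp))
    mean_sn = sum(sens) / len(sens) if sens else None
    mean_sp = sum(spec) / len(spec) if spec else None
    return mean_sn, mean_sp
```

The `fraction` and `iterations` defaults mirror the 15% example above and are placeholders to be adjusted to the characteristics of the dataset.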
To test the performance of a classifier on new samples, the significant features from the initial assessment are evaluated for these samples, i.e. the counts of those kmers that correspond to the significant features comprised in the signature are determined and normalized according to the scaling factors described in section 2.1. The sample is then assigned to either class b or not b using the nearest neighbor algorithm with a Euclidean distance metric as described in 2.2 A.
Another aspect of the inventive subject matter provides a signature comprising one or more features selected from Tables 3, 7, and 10. For example, a signature may comprise about 1 feature to about 10 features, about 5 features to about 15 features, about 20 features to about 30 features, about 25 features to about 35 features, about 30 features to about 40 features, about 35 features to about 45 features, or about 40 features to about 50 features selected from Tables 3, 7, and 10. A signature may also comprise about 10, about 15, about 25, about 50, about 75, or about 100 features selected from Table 3. Alternatively, a signature may comprise about 250, about 500, about 1000, about 1500, about 2000, about 2500, about 3000, about 3500, about 4000, about 4500, or about 5000 features selected from Tables 3, 7, and 10. Preferably a signature comprises less than about 500 features, more preferably less than about 100 features.
In some aspects, a signature comprises between about 1 and about 50 features selected from Tables 3, 7, and 10. In an aspect, a signature is selected from Table 4. In another aspect, a signature is selected from Table 5. In another aspect, a signature comprises the features listed in Table 7.
In other aspects, a signature consists of the features listed in Table 7.
Another aspect of the inventive subject matter provides a panel comprising at least one probe. Each probe of the panel is capable of specifically hybridizing to a unique feature selected from Tables 3, 7, and 10 via Watson-Crick base pairing. A panel may comprise about 1 probe to about 10 probes, about 5 probes to about 15 probes, about 20 to about 30 probes, about 25 probes to about 35 probes, about 30 probes to about 40 probes, about 35 probes to about 45 probes, or about 40 probes to about 50 probes that each specifically hybridize to a unique feature selected from Table 3. A panel may also comprise about 10, about 15, about 25, about 50, about 75, or about 100 probes that each specifically hybridize to a unique feature selected from Tables 3, 7, and 10. Alternatively, a panel may comprise about 250, about 500, about 1000, about 1500, about 2000, about 2500, about 3000, about 3500, about 4000, about 4500, or about 5000 probes that each specifically hybridize to a unique feature selected from Tables 3, 7, and 10. Preferably a panel comprises less than about 500 probes, more preferably less than about 100 probes. In some aspects, a panel comprises between about 1 and about 50 probes that each specifically hybridize to a unique feature selected from Tables 3, 7, and 10. In an aspect, a panel comprises a plurality of probes that each specifically hybridize to a unique feature comprising a signature selected from Table 4. In another aspect, a panel comprises a plurality of probes that each specifically hybridize to a unique feature comprising a signature selected from Table 5. In another aspect, a panel comprises a plurality of probes that each specifically hybridize to a unique feature listed in Table 7.
A further aspect of the inventive subject matter is an array comprising at least one address. At least one address of the array has disposed thereon a probe that specifically hybridizes to a unique feature selected from Table 3, Table 7, or Table 10. In certain aspects, an array has disposed thereon a plurality of probes, wherein each probe specifically hybridizes to a unique feature selected from Table 3, Table 7, or Table 10, and is located at a unique address of the array. Suitable panels of probes are described in the preceding paragraph.
Several substrates suitable for the construction of arrays are known in the art, and one skilled in the art will appreciate that other substrates may become available as the art progresses. The substrate may be a material that may be modified to contain discrete individual sites appropriate for the attachment or association of a probe and is amenable to at least one detection method. Non-limiting examples of substrate materials include glass, modified or functionalized glass, plastics (including acrylics, polystyrene and copolymers of styrene and other materials, polypropylene, polyethylene, polybutylene, polyurethanes, Teflon™, etc.), nylon or nitrocellulose, polysaccharides, resins, silica or silica-based materials including silicon and modified silicon, carbon, metals, inorganic glasses and plastics. In an aspect, the substrate may allow optical detection without appreciably fluorescing.
A substrate may be planar, a substrate may be a well, e.g. a well of a 6, 24, 96, 384 or 1536 well plate, or alternatively, a substrate may be a bead. Additionally, the substrate may be the inner surface of a tube for flow-through sample analysis to minimize sample volume. Similarly, the substrate may be flexible, such as a flexible foam, including closed cell foams made of particular plastics.
A probe may be attached to the substrate in a wide variety of ways, as will be appreciated by those in the art. The probe may either be synthesized first, with subsequent attachment to the substrate, or may be directly synthesized on the substrate. The substrate and the probe may be derivatized with chemical functional groups for subsequent attachment of the two. For example, the substrate may be derivatized with a chemical functional group including, but not limited to, amino groups, carboxyl groups, oxo groups or thiol groups. Using these functional groups, the probe may be attached either directly or indirectly using linkers.
The probe may also be attached to the substrate non-covalently. For example, a biotinylated probe may be prepared, which may bind to surfaces covalently coated with streptavidin, resulting in attachment. Alternatively, a probe may be synthesized on the surface using techniques such as photopolymerization and photolithography. Additional methods of attaching probes to arrays and methods of synthesizing biomolecules on substrates are well known in the art.
A probe may be represented more than once on a given array. In other words, more than one address of an array may be comprised of the probe. In some aspects, two, three, or more than three addresses of the array may be comprised of the probe. In certain aspects, the array may comprise control probes and/or control addresses. The controls may be internal controls, positive controls, negative controls, or background controls.
In another aspect, the inventive subject matter provides a method for classifying a subject as having IBD or not. The method comprises (a) providing (i) a sample obtained from the subject, the sample comprising nucleic acid from the subject's microbiome, and (ii) a signature that distinguishes subjects having IBD from a subject not having IBD, (b) determining in the sample the abundance of the features comprising the signature (“the dataset”), and (c) classifying the dataset from (b) as being either obtained from a subject having IBD or a subject not having IBD, thereby classifying the subject as having IBD or not having IBD.
Another aspect of the inventive subject matter provides a method for classifying a subject as having UC or CD. In some aspects, the subject is diagnosed as having IBD. In other aspects, a diagnosis of IBD is suspected for the subject but not confirmed. The method comprises (a) providing (i) a sample obtained from the subject, the sample comprising nucleic acid from the subject's microbiome, and (ii) a signature that distinguishes subjects having UC from a subject having CD, (b) determining in the sample the abundance of the features comprising the signature (“the dataset”), and (c) classifying the dataset from (b) as being either obtained from a subject having UC or a subject having CD, thereby classifying the subject as having UC or CD.
Suitable sources of microbiome samples and methods for isolating nucleic acids from the microbiome samples are described above in Section B. Suitable signatures are described in Section C above, and further described in the Examples. The abundance of the features in the signatures can be determined by any method known in the art including, but not limited to, sequencing or hybridization to an array. Suitable arrays are described in Section D above.
A preferred nucleic acid sample may be a nucleic acid sample obtained from a suitable fecal sample. Fecal samples are commonly used in the art to sample gut microbiota. Methods for obtaining a fecal sample from a subject are known in the art and include, but are not limited to, rectal swab and stool collection. Suitable fecal samples may be freshly obtained or may have been stored under appropriate temperatures and conditions known in the art. Methods for extracting nucleic acids from a fecal sample are also well known in the art and described herein. The nucleic acids comprising the nucleic acid sample may or may not be amplified prior to being used in step (b) depending upon the type and sensitivity of the data acquisition component. When amplification is desired, nucleic acids may be amplified via polymerase chain reaction (PCR) from a nucleic acid sample. Methods for performing PCR are well known in the art. The nucleic acids comprising the nucleic acid sample may also be fluorescently or chemically labeled, fragmented, or otherwise modified prior to sequencing or hybridization to an array as is routinely performed in the art.
In some aspects, the abundance of the signature's features in the sample is determined by sequencing. In other aspects, the abundance of the signature's features in the sample is determined by using an array. The inventive subject matter is not limited to any particular sequencing or array platform. Suitable sequencing platforms are capable of single-molecule sequencing, ion semiconductor sequencing, pyrosequencing, sequencing by synthesis, sequencing by ligation, nanopore sequencing, or tunneling currents sequencing. Non-limiting examples of suitable sequencers include an Illumina sequencer, a MiSeq Desktop Sequencer, a NextSeq Sequencer, an Ion PGM™ sequencer, a MinION™ device, a GridION™ device, an Ion Proton™ device, and/or the like. Suitable array platforms are capable of generating array data that captures the intensity of each position on the array for use in producing a microbial nucleic acid feature. Data from the sequencers or array scanner may be used by a program to qualitatively or quantitatively determine the abundance of the features comprised in the signature. The abundance values may be normalized by methods known in the art, or as described in Section B.2.1. The sample is then assigned to either class b or not b using the nearest neighbor algorithm with a Euclidean distance metric as described in Section B.2.2 A.
Diagnosis of IBD is a difficult and lengthy process involving extensive testing and the systematic elimination of other possible diagnoses. The lengthy nature of this process causes those subjects with IBD to suffer long waits for proper diagnosis and treatment. Thus, there is a need for a method of rapidly classifying whether a subject has IBD. Such a method is presently disclosed. Classifying subjects as having IBD, CD, or UC allows the physician to take the most appropriate next steps in diagnosing and treating the subject. These steps may include prescribing or administering drugs, prescribing dietary and nutritional therapies, prescribing or performing medical and surgical treatments, prescribing or performing additional diagnostic methods, enrolling patients in clinical trials, and other treatments.
Drugs that may be prescribed or administered for treatment include, but are not limited to, anti-inflammatory drugs (including, but not limited to, aminosalicylates and corticosteroids), immune system suppressors, antibiotics, anti-diarrheal medications, and pain relievers. Dietary and nutritional therapies may include, but are not limited to, vitamin supplements, enteral nutrition, parenteral nutrition, probiotics, prebiotics, special diets (e.g., low-residue diet), and nutritional counseling. Additional diagnostic methods may include, but are not limited to, blood tests, genetic testing, stool tests, biopsy, radiology scans or diagnostic imaging, and endoscopy. Blood tests may include, but are not limited to, complete blood count, erythrocyte sedimentation rate (ESR), C-reactive protein (CRP), liver enzymes, electrolytes, vitamin B12, vitamin D, calprotectin, lactoferrin, thiopurine methyltransferase (TPMT), perinuclear anti-neutrophil cytoplasmic antibody (pANCA), anti-Saccharomyces cerevisiae antibody (ASCA), anti-flagellin antibody (CBir1), and anti-OMPC antibody (OmpC). Radiology scans and diagnostic imaging may include, but are not limited to, barium enema, CT scan and CT enterography (CTE), leukocyte scintigraphy, magnetic resonance imaging (MRI) and MR enterography (MRE), small bowel follow-through and small bowel enteroclysis, ultrasounds, and X-rays. Endoscopic tests may include, but are not limited to, colonoscopy, sigmoidoscopy, capsule endoscopy (CE), and endoscopic ultrasound. A preferred endoscopy test for a subject classified as having IBD is a colonoscopy.
Accordingly, the inventive subject matter provides a method for recommending a subject for a colonoscopy. The method comprises (a) providing (i) a sample obtained from the subject, the sample comprising nucleic acid from the subject's microbiome, and (ii) a signature that distinguishes subjects having IBD from a subject not having IBD, (b) determining in the sample the abundance of the features comprising the signature (“the dataset”), (c) classifying the dataset from (b) as being either obtained from a subject having IBD or a subject not having IBD, and (d) recommending the subject for a colonoscopy when the dataset from (c) is classified as being from a subject with IBD. The primary advantage of such a method is that a patient can be accelerated to colonoscopy for confirmation of IBD by a specialist rather than endure long-term dietary evaluation by a general practitioner.
The following examples are included to demonstrate aspects of the invention. It should be appreciated by those of skill in the art that the techniques disclosed in the examples that follow represent techniques discovered by the inventors to function well in the practice of the invention. Those of skill in the art should, however, in light of the inventive subject matter as disclosed herein, appreciate that changes may be made in the specific aspects that are disclosed and still obtain a like or similar result without departing from the spirit and scope of the invention. Therefore, all matter set forth or shown in the accompanying drawings is to be interpreted as illustrative and not in a limiting sense.
Metagenomic data sets analyzed using the methods of statistical analysis described above include the following.
Human Microbiome Project (HMP) body site data. A total of 690 HMP WGS metagenomic samples from six distinct body sites (anterior nares, buccal mucosa, posterior fornix, stool, supragingival plaque, and tongue dorsum) were analyzed. The minimum number of samples associated with a particular body site was 51. Given the large scale of the HMP WGS Illumina sequence datasets (over 15.3 terabytes of raw sequence), kmer counting was performed on random sequence subsets of each sample (˜20% of raw reads). This resulted in an average kmer frequency of 4,139 observations per kmer and sample.
MetaHIT data, healthy individuals, and inflammatory bowel disease (IBD) patients. Raw Illumina GA II read data was acquired from the MetaHIT project [4] which contained WGS metagenomic samples from stool of 124 European individuals. Of the 124 individuals, 21 had ulcerative colitis (UC) and four had Crohn's disease (CD). The remaining 99 were considered healthy.
Obese and lean twin gut microbiomes. The classifiers were additionally tested on WGS metagenomic datasets representing the gut microbiomes of 15 subjects who were either obese (BMI>30, n=9) or lean (BMI=18.5-24.9, n=6) [2]. This raw sequence data was generated using the Roche/454 platform, either with the FLX or Titanium generation of sequencers.
Japanese gut microbiomes. Nine gut microbiome samples were obtained from a study of Japanese infants and adults [5]. All subjects in this case were >1.5 years in age (mean: 26.6 yrs). Datasets consisted of assembled contigs and singleton reads generated using Sanger sequencing technology.
Healthy individual and Crohn's disease patient fecal microbiomes. Raw sequence data from two recent studies on inflammatory bowel disease was included. 1) Roche/454 Titanium reads corresponding to eight fecal samples from patients diagnosed with CD and four samples from healthy controls [29], which contained technical replicates, i.e. independently generated sequence data from the same sample, which were treated as independent samples in the analysis of this data. 2) Illumina MiSeq reads corresponding to four fecal samples from CD patients and seven samples from healthy controls [30].
A proof-of-concept study focused on assigning metagenomic samples from the Human Microbiome Project to their associated body sites. The analysis was restricted to 690 samples from six well-represented body regions (anterior nares, buccal mucosa, posterior fornix, supragingival plaque, stool, tongue dorsum).
In the initial differential abundance detection of kmers (see section 1., above), a universal p-value cutoff (p<5e-04) was heuristically chosen such that the resulting expected feature false discovery rate (FDR) was below 0.08% for each pairwise comparison of body sites. Additionally, only kmers that were differentially abundant between a particular body site and all other sites were included. For example, if the kmer “ACGTTACG” was identified as differentially abundant in all pairwise comparisons involving buccal mucosa, it was designated as a significant feature to consider for buccal mucosa signature selection. Thus, significant feature sets differed by body site. After performing a broad scan of signature set sizes where Nmin=Nmax (range 10-1000), Nmin=50 and Nmax=100 was selected as the best balance between computational complexity and overall accuracy.
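The requirement that a kmer be differentially abundant in every pairwise comparison involving a body site amounts to a set intersection, sketched below. The data layout (a dictionary keyed by unordered pairs of site names) and the function name are assumptions made for illustration.

```python
def site_specific_features(pairwise_hits, site):
    """Kmers significant in every pairwise comparison that involves `site`.

    pairwise_hits maps frozenset({site_a, site_b}) to the set of kmers
    found differentially abundant between those two sites.
    """
    hit_sets = [hits for pair, hits in pairwise_hits.items() if site in pair]
    if not hit_sets:
        return set()
    return set.intersection(*hit_sets)
```

In this sketch, a kmer such as “ACGTTACG” is retained for buccal mucosa only if it appears in the hit set of every comparison that includes buccal mucosa, which is why the resulting feature sets differ by body site.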
For each body site and its associated set of priority kmers, the feature selection algorithm (see section 2., above) was run for 10,000 iterations. The 10 signatures with the best classification accuracy were selected for cross-validation; if more than 10 signatures showed 100% accuracy (e.g. for stool samples), 10 of these were chosen at random. The single best performing classifier was then chosen based on average sensitivity/specificity (SN/SP) results. In some cases, multiple classifiers had optimal performance (e.g. stool classifiers), and in these cases a single classifier was randomly selected to report.
Table 1 displays the cross-validation results of the top selected classifier for each body site. For each classifier, sensitivity and specificity values were calculated for each subsampling step of the cross-validation procedure. Fisher's p-values were determined to assess the statistical significance of the obtained accuracy of the classifier during cross-validation. All classifiers demonstrated high accuracy with a mean sensitivity of >97% and mean specificity of >99%. Furthermore, the corresponding range of p-values from Fisher's exact test showed that the classification results from cross-validation are significantly better than random.
The cross-validation procedure used a subsampling approach in which 15% of the samples were repeatedly selected as a validation set for re-classification using the remaining 85% of the data.
Further investigation was undertaken into how the body site classifiers would perform on datasets from metagenomic studies outside the HMP, including those employing different sequencing technologies. To do so, three datasets associated with previously published distal gut microbiome projects [2,4,5] were collected. The first dataset represented the gut microbiota of 85 healthy subjects from the MetaHIT project. Similarly to the HMP, the MetaHIT project utilized the Illumina sequencing platform. The second dataset was comprised of 15 samples generated by the Roche/454 sequencer from the study on obese and lean twins [2], while the third set of nine samples from the Japanese study [5] used Sanger sequencing and only provided assembled contigs and singleton reads. Kmer counting was performed on all samples, which were then assigned using each body site classifier from Table 1.
Table 2 displays the results of the HMP body site classifiers on the external gut microbiome datasets. For the stool classifier, the overall sensitivity was computed for each gut dataset, that is, the percentage of tested samples that were successfully assigned to stool. In contrast, for each of the other body site classifiers, the specificity was calculated, that is, the percentage of gut samples that were correctly classified as not belonging to that body site. It was observed that all classifiers performed perfectly on these test sets, either with 100% sensitivity or 100% specificity, respectively.
In spite of the highly variable characteristics of these datasets (e.g. different geographic sample origins, sequencing technologies, or read lengths, assembled vs. unassembled reads), the classifiers assigned all gut samples perfectly. This evidence supports the robustness of the microbial signatures to alterations in sequencing technology or data processing.
MetaHIT data consists of raw Illumina GA II sequences. The twin gut microbiomes consist of raw reads from 454/Roche FLX or Titanium platforms. The Japanese gut microbiome data consists of assembled contigs and singleton reads from a Sanger sequencing platform.
IBD vs. Non-IBD
The method of statistical analysis was applied to a setting of clinical importance. Clinicians often find it difficult to easily determine whether a patient has inflammatory bowel disease. Therefore, the method was performed to determine signatures that accurately distinguished patients with IBD from patients without IBD.
The training sets included shotgun metagenomic sequences from the following: (i) 147 non-IBD control samples from Karlsson et al. (“Gut metagenome in European women with normal, impaired and diabetic glucose control.” Nature 498.7452 (2013): 99-103), (ii) 12 IBD/non-IBD control samples from Erickson et al. (“Integrated metagenomics/metaproteomics reveals human host-microbiota signatures of Crohn's disease.” PloS one 7.11 (2012): e49138), (iii) 43 IBD/non-IBD control samples from Gevers et al. (“The treatment-naive microbiome in new-onset Crohn's disease.” Cell host & microbe 15.3 (2014): 382-392), (iv) 342 IBD/non-IBD control samples from Lewis et al. (“Inflammation, Antibiotics, and Diet as Environmental Stressors of the Gut Microbiome in Pediatric Crohn's Disease.” Cell host & microbe 18.4 (2015): 489-500), (v) 11 IBD/non-IBD control samples from Morgan et al. (“Dysfunction of the intestinal microbiome in inflammatory bowel disease and treatment.” Genome Biol 13.9 (2012): R79), (vi) 124 non-IBD control samples from Qin et al. (“A human gut microbial gene catalogue established by metagenomic sequencing.” Nature 464.7285 (2010): 59-65), and (vii) 18 non-IBD control samples from Turnbaugh et al. (“A core gut microbiome in obese and lean twins.” Nature 457.7228 (2009): 480-484). In total, there were 389 IBD samples and 308 non-IBD control samples.
Kmers (8 mers) with significant differential abundance between the two sample groups were determined with the Mann-Whitney test. After differential abundance analysis, a p-value threshold was selected to maintain a FDR <0.01%, resulting in a set of 18,680 significant features.
The signature selection algorithm was run on the 18,680 significant features for 800,000 iterations using signatures consisting of between 1 and 50 different 8 mers. Each sample was screened using the different signature iterations, and then classified as belonging to one of the two classes (i.e. IBD vs. non-IBD). This classification was performed with the nearest neighbor algorithm with a Euclidean distance metric (see section 2.2A above). Next, examining the 2×2 outcome table of the resulting assignments, the corresponding sensitivity (SN), specificity (SP), and positive/negative predictive values (PPV/NPV) were computed and recorded. Fisher's exact test was used to assess the significance of each classification relative to a randomized assignment. All 800,000 tested signatures, i.e. combinations of between 1 and 50 random 8 mers, were ranked based on 2×2 outcome tables.
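The statistics derived from each 2×2 outcome table can be sketched with the Python standard library. The function names are hypothetical, and the two-sided p-value is computed under the common convention of summing the probabilities of all tables that are no more likely than the observed one, which is an assumption about how the test was applied.

```python
from math import comb


def outcome_metrics(tp, fp, fn, tn):
    """Sensitivity, specificity, PPV, and NPV from a 2x2 outcome table."""
    def ratio(num, den):
        return num / den if den else float("nan")
    return {
        "SN": ratio(tp, tp + fn),
        "SP": ratio(tn, tn + fp),
        "PPV": ratio(tp, tp + fp),
        "NPV": ratio(tn, tn + fn),
    }


def fisher_exact_two_sided(tp, fp, fn, tn):
    """Two-sided Fisher's exact test p-value for a 2x2 outcome table."""
    row1, row2 = tp + fp, fn + tn      # predicted positive / negative
    col1 = tp + fn                     # actual positive
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of x true positives with fixed margins.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_observed = prob(tp)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    return sum(prob(x) for x in range(lo, hi + 1)
               if prob(x) <= p_observed * (1 + 1e-9))
```

Signatures can then be ranked by these values, for example by preferring high SN and SP and using the Fisher's exact test p-value to confirm that the assignment differs from a randomized one.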
Of these 800,000 iterations, a signature of 21 kmers was found that significantly distinguished IBD and non-IBD samples. This signature is provided in Table 7. The primary results of the signature are shown in Table 8.
The robustness of this top-performing classifier (=signature) was assessed using a re-sampling cross-validation procedure as described in section 3.1 above. This type of validation determines whether the accuracy of the classifiers is significantly reduced if only a subset of the original samples is used as a reference for the classification, i.e. it can help assess whether additional reference samples are needed to improve the classifier. For a particular classifier undergoing validation, a random subset of samples (20%) was removed from the full dataset and then individually re-classified. This process was repeated for 100 iterations for the classifier. The average SN, SP, PPV, NPV and Fisher's exact test p-values across all cross-validation iterations were recorded for the optimal signature (results shown in Table 9).
Crohn's Disease vs. Ulcerative Colitis
The method of statistical analysis was applied to a setting of clinical importance. Clinicians often find it difficult to easily distinguish between different forms of inflammatory bowel disease. Therefore, the method was performed to determine signatures that accurately distinguished patients with CD from patients with UC.
The training sets included the following: Crohn's Disease—the CD training set was comprised of metagenomic WGS data from stool samples from: 4 CD patients from the European MetaHIT study (Illumina HiSeq sequence data) [4] plus 12 CD patients (including several technical replicates) from two U.S. studies (Roche/454 sequence data, [29] and Illumina MiSeq data [30]), thus providing 16 CD samples; Ulcerative Colitis—the UC training set was comprised of gut samples from the MetaHIT project representing 21 patients with UC.
Kmers (8 mers) with significant differential abundance between the two sample groups were determined with the Metastats program [26]. After differential abundance analysis, a p-value threshold was selected to maintain a FDR <1%, resulting in a set of 5,287 significant features provided in Table 3.
The signature selection algorithm was run on the 5,287 significant features for 105,287 iterations using signatures consisting of between 1 and 50 different 8 mers. Each sample was screened using the different signature iterations, and then classified as belonging to one of the two classes (i.e. UC vs. CD). This classification was performed with the nearest neighbor algorithm with a Euclidean distance metric (see section 2.2A above). Next, examining the 2×2 outcome table of the resulting assignments, the corresponding sensitivity (SN), specificity (SP), and positive/negative predictive values (PPV/NPV) were computed and recorded. Fisher's exact test was used to assess the significance of each classification relative to a randomized assignment. All 105,287 tested signatures, i.e. combinations of between 1 and 50 random 8 mers, were ranked based on 2×2 outcome tables.
Of these 105,287 iterations, a total of 1,087 signatures were found that perfectly distinguished CD and UC samples. These signatures are provided in Table 4. Kmers listed in column A of Table 4 correspond to the kmers in Table 3, starting with position 0 (e.g. “AAAAAAAA” is number 0), running down a column, and then continuing to the next column to the right. Of the 5,287 significant 8-mers, 5,063 (95.8%) are included in at least one top-performing signature.
As Table 4 shows, using any of 1,087 signatures, the 37 samples were correctly classified as coming from the 16 patients with Crohn's disease or the 21 patients with ulcerative colitis, i.e. 100% specificity and sensitivity were achieved for this classification.
The robustness of the 1,087 top-performing classifiers (=signatures) was assessed using a re-sampling cross-validation procedure as described in section 3.1 above. This type of validation determines whether the accuracy of the classifiers is significantly reduced if only a subset of the original samples is used as a reference for the classification, i.e. it can help assess whether additional reference samples are needed to improve the classifier. For a particular classifier undergoing validation, a random subset of samples (20%) was removed from the full dataset and then individually re-classified. This process was repeated for 100 iterations for each classifier. The average SN, SP, PPV, NPV and Fisher's exact test p-values across all cross-validation iterations were recorded for each tested signature, i.e. for each set of 8 mers. Of all 1,087 top-performing classifiers, 17 also showed 100% mean SN, SP, PPV, and NPV values after cross-validation (Table 5). This indicates that the variation detected by these 17 signatures between samples from the CD and UC classes is stronger than variations between samples from the same class (i.e. CD or UC) and, thus, that classification of samples as UC or CD with these 17 signatures is most reliable. Classification with the remaining 1,070 signatures, on the other hand, is hindered by the detection of signals resulting from variations between samples from the same class (i.e., CD or UC). The accuracy of these 1,070 signatures is therefore expected to depend to a larger extent on the number of reference samples and to improve with larger numbers of reference samples.
To evaluate the importance of kmer length for sample classification, the signature identification method was also applied to kmer abundances of lengths of 2, 3, and 4 nucleotides and a training set of 21 UC and 12 CD samples (including the 4 CD patients from MetaHIT and 8 CD patients from one of the U.S. studies [29]). The results, including mean SN, SP, PPV, NPV, and Fisher's exact test P-values, are shown in Table 6 and indicate that classification of CD and UC samples is possible using signatures of kmer combinations with a length of 2 nucleotides and longer, although at lower accuracy (i.e., SN, SP <100%) compared to kmers of length 8 nucleotides.
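For orientation, kmer abundances for different lengths could be tabulated along the following lines. This is only a sketch: the function name `kmer_abundances` is hypothetical, and the choices to normalize counts to relative abundances, to skip windows containing ambiguous bases, and to count a single strand are assumptions that are not specified in this passage.

```python
from collections import Counter
from itertools import product

def kmer_abundances(sequences, k):
    """Relative abundance of every possible DNA kmer of length k,
    counted with a sliding window over each sequence (one strand only)."""
    counts = Counter()
    for seq in sequences:
        seq = seq.upper()
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            # Skip windows containing ambiguous bases such as N.
            if set(kmer) <= set("ACGT"):
                counts[kmer] += 1
    total = sum(counts.values())
    all_kmers = ("".join(p) for p in product("ACGT", repeat=k))
    return {kmer: (counts[kmer] / total if total else 0.0) for kmer in all_kmers}

# Made-up reads for illustration only.
reads = ["ACGTACGT", "TTNACG"]
for k in (2, 3, 4):
    table = kmer_abundances(reads, k)
    print(k, len(table))  # number of possible kmers is 4**k
```

The feature space grows as 4^k (16, 64, and 256 kmers for lengths 2, 3, and 4, versus 65,536 for 8-mers), which is one plausible reason shorter kmers carry less discriminating information.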
To validate the statistical significance of the findings, the same type of kmer-based signature identification analysis was performed using computer-generated random kmer counts for kmers of length 4 nucleotides. For this analysis, two groups with 100 samples each were compared. While the sensitivity of the top-performing signatures for distinguishing these random sample data is relatively high (73%), the specificity is low (52%) and the associated mean Fisher's exact test P-value of 0.25 is high, as would be expected. The results of this comparison are also shown in Table 6.
All patents and publications mentioned in this specification are indicative of the level of skill of those skilled in the art to which the inventive subject matter pertains. The following literature references are believed to be useful to an understanding of the inventive subject matter in the context of its place in the relevant art. Citation here is not to be construed as an assertion or admission that any reference cited is material to patentability of the inventive subject matter. Applicants will properly disclose information material to patentability in an Information Disclosure Statement. Each of the following documents is hereby incorporated by reference in its entirety in this application.
The inventive subject matter herein began with Applicants' essential observation that significant biases exist in the oligonucleotide compositions of metagenomic samples between different body sites and that some of these biases are remarkably stable across multiple individuals, in spite of an overall taxonomic variability of the microbiota, as determined by 16S rRNA phylogenetic analysis. Taking advantage of this information, robust signatures were discovered for accurately assigning metagenomic samples to different body sites or to a disease, disorder, condition, or symptom thereof, such as the two inflammatory bowel diseases, Crohn's disease and ulcerative colitis.
While the invention has been described with reference to certain particular aspects thereof, those skilled in the art will appreciate that various modifications may be made without departing from the spirit and scope of the invention. The scope of the appended claims is not to be limited to the specific aspects described.
The inventive subject matter being thus described, it will be obvious that the same may be modified or varied in many ways. Such modifications and variations are not to be regarded as a departure from the spirit and scope of the inventive subject matter, and all such modifications and variations are intended to be included within the scope of the following claims.
This application is a Continuation-In-Part which claims the benefit of U.S. Provisional Application No. 62/323,257, filed Apr. 15, 2016, and claims priority to U.S. patent application Ser. No. 15/028,253, filed Apr. 8, 2016, which claims priority from PCT/US2014/011321 filed Jan. 13, 2014, which claims the benefit of U.S. Provisional Application Ser. No. 61/888,288 filed Oct. 8, 2013, the disclosures of which are hereby incorporated by reference herein in their entirety.
Number | Date | Country |
---|---|---|
62323257 | Apr 2016 | US |
61888288 | Oct 2013 | US |

 | Number | Date | Country |
---|---|---|---|
Parent | 15028253 | Apr 2016 | US |
Child | 15489658 | | US |