The present invention generally relates at least to the fields of molecular biology, molecular diagnostics, infectious disease and medicine. The invention relates to the identification and comparative analysis of sequence features in metagenomic whole-genome shotgun (WGS) sequence data associated with particular disease states in a subject. In particular, the invention relates to diagnostic methods for distinguishing between different types of inflammatory bowel disease in a subject based on the microbial community signature of the subject.
The human body and its associated microbiota, the “human microbiome,” represent a complex superorganism with thousands of microbial species distributed biogeographically among niche body sites [1]. The fundamental role of the microbiome for human health is generally accepted, and a growing body of literature reports associations between specific diseases or disorders and the microbiome [2-4]. However, in many cases it has remained difficult to correlate a defined microbiome composition with a specific disease symptom. Furthermore, recent studies have demonstrated significant diversity in microbial communities between healthy humans, most notably in the digestive tract [2,4-8]. Additionally, the host-microbiome relationship is highly dynamic; host-induced changes to diet or other perturbations, such as antibiotic treatment, can significantly alter microbial composition, thereby potentially inducing secondary effects on human health [7,9-11]. Despite this known within- and between-subject taxonomic variation, it is hoped that by characterizing these complex systems, specific signatures of microbiomes can be identified that are predictive of health and disease states of the human host.
The rapid advancement of next-generation DNA sequencing technologies has allowed for deep sampling of human-associated microbial communities either in the form of whole genome shotgun (WGS) metagenomic DNA or multiplexed PCR amplicons of the 16S rRNA gene. The Human Microbiome Project (HMP) has already sequenced more than 3.5 TB of microbial DNA from approximately 750 human samples, revealing a vast phylogenetic diversity and functional capacity of commensals throughout the human population and across the human body [1,12,13]. There is extensive information to learn from the HMP data, including what constitutes a healthy or “normal” microbiome.
Assembly and taxonomic assignment of metagenomic data remain active areas of research for computational scientists, and although next-generation platforms produce tremendous amounts of sequence, many analysis protocols that were developed for larger sequence reads from Sanger or Roche/454 technologies are difficult to apply due to short read lengths. While there has been substantial effort to develop computational methods that assign single metagenomic reads and assembled contig fragments to specific taxonomic lineages or metabolic functions [14-20], identification of a group of specific nucleic acid sequences (a “signature”) that define a particular microbiome within samples as a way to associate sample composition with sample background has received less attention.
Recent applications of machine learning algorithms have demonstrated that metagenomic sample classification with some accuracy is achievable [21-25]. However, these studies were largely limited to analyses of targeted sequence data, i.e. products of polymerase chain reaction (PCR) amplifications of the universal bacterial marker gene, the small ribosomal RNA subunit or 16S rRNA. Though easy to obtain, 16S rRNA datasets only provide limited phylogenetic information, and are typically heavily biased by primer design, amplicon region and gene copy number variability. Moreover, although phylogenetically-related bacteria tend to show similar phenotypic presentations, 16S rRNA-based phylogenetic and health-relevant functional microbiome compositions are not necessarily correlated. For example, phylogenetically distinct organisms within the microbiome could perform similar functions. In the extreme case this could result in a situation where two perfectly healthy humans have completely different, non-overlapping phylogenetic microbiome compositions. In contrast, WGS metagenomic datasets, which do not include a PCR amplification step, have fewer technical caveats relative to 16S rRNA surveys, and provide novel opportunities that can be explored to find distinctive features of different microbiomes and to investigate potential associations with phenotypic microbiota representations.
The elucidation of distinctive features of different microbiomes and phenotypic microbiota representations may prove to be useful as a means for diagnosing and monitoring particular disease states in a subject. The present invention is directed to this and other important goals in association with inflammatory bowel disease as the particular disease state.
The present invention is directed to methods for distinguishing between inflammatory bowel diseases in a subject. The invention thus includes methods for distinguishing between ulcerative colitis (UC) and Crohn's disease (CD) in a subject.
The invention takes advantage of the discovery that the particular microbiome associated with a subject, such as a human, can be used to classify the subject as belonging to a particular group (e.g., having UC or having CD). The microbiome of a subject can thus be screened, and certain information about the health and medical condition of the subject can be acquired.
Because microbiomes are composed of bacterial populations that differ between individuals, even healthy subjects will have taxonomically different microbiomes. Therefore, a single sequence feature of the microbiome alone would generally not be expected to serve as a functional diagnostic. However, a group of between about 10 and about 50 different sequence features can be screened in a microbiome sample from a subject, and statistically significant diagnoses can be made about the relative health of the subject using this information. The features of interest in the present invention are particular nucleic acid molecules produced by the population of bacteria that comprise a subject's microbiome. The present invention uses short nucleic acid oligomers (termed “kmers” herein) to survey the entire composite metagenome (i.e., the sum of all individual microbial genomes) obtained from a microbiome sample from a subject. By surveying the entire metagenome for the relative amount of pre-selected, relevant sequences (i.e., significant features) using a group of kmers, one may determine the signature of the microbiome and classify the subject as belonging to a particular group. Thus, the present invention takes advantage of the discovery that the composition of a subject's microbiome is, in some cases, directly correlated with a particular disease.
In a first embodiment, the invention is directed to a method of distinguishing between a subject having ulcerative colitis and a subject having Crohn's disease, comprising:
In certain aspects of this embodiment, the microbiome sample is a stool sample.
In certain aspects of this embodiment, the nucleic acid is DNA, cDNA or RNA.
In certain aspects of this embodiment, the kmers are oligomers comprising between 2 and 10 nucleotides, including 2, 3, 4, 5, 6, 7, 8, 9, or 10 nucleotides. In a particular aspect, the kmers are oligomers comprising 8 nucleotides.
In certain aspects of this embodiment, the signature is at least one signature selected from among the 1087 signatures of Table 4.
In certain aspects of this embodiment, the signature is at least one signature selected from among the 17 signatures of Table 5.
In certain aspects of this embodiment, the data set is classified as being obtained from a subject having UC or a subject having CD by performing a nearest neighbor analysis on the data set.
The foregoing has outlined rather broadly the features and technical advantages of the present invention in order that the detailed description of the invention that follows may be better understood. Additional features and advantages of the invention will be described herein, which form the subject of the claims of the invention. It should be appreciated by those skilled in the art that any conception and specific embodiment disclosed herein may be readily utilized as a basis for modifying or designing other structures for carrying out the same purposes of the present invention. It should also be realized by those skilled in the art that such equivalent constructions do not depart from the spirit and scope of the invention as set forth in the appended claims. The novel features which are believed to be characteristic of the invention, both as to its organization and method of operation, together with further objects and advantages will be better understood from the following description. It is to be expressly understood, however, that any description, table, example, etc. is provided for the purpose of illustration and description only and is by no means intended to define the limits of the invention.
Unless otherwise noted, technical terms are used according to conventional usage. Definitions of common terms in molecular biology may be found, for example, in Benjamin Lewin, Genes VII, published by Oxford University Press, 2000 (ISBN 019879276X); Kendrew et al. (eds.); The Encyclopedia of Molecular Biology, published by Blackwell Publishers, 1994 (ISBN 0632021829); and Robert A. Meyers (ed.), Molecular Biology and Biotechnology: a Comprehensive Desk Reference, published by Wiley, John & Sons, Inc., 1995 (ISBN 0471186341); and other similar technical references.
As used herein, “a” or “an” may mean one or more. As used herein when used in conjunction with the word “comprising,” the words “a” or “an” may mean one or more than one. As used herein “another” may mean at least a second or more. Furthermore, unless otherwise required by context, singular terms include pluralities and plural terms include the singular.
As used herein, “about” refers to a numeric value, including, for example, whole numbers, fractions, and percentages, whether or not explicitly indicated. The term “about” generally refers to a range of numerical values (e.g., +/−5-10% of the recited value) that one of ordinary skill in the art would consider equivalent to the recited value (e.g., having the same function or result). In some instances, the term “about” may include numerical values that are rounded to the nearest significant figure.
Microbial communities associated with the human body are collectively summarized as the “human microbiome.” Differences in human microbiome compositions are generally believed to be associated with different body sites as well as with specific health and disease states. These complex communities of microbial organisms can be studied on the systems level using whole-genome shotgun (WGS) sequencing of total DNA isolated from microbiome samples (termed “metagenomics”). Using bioinformatic tools, microbiome-specific signatures can be identified and used to provide valuable information for basic research, forensics and clinical diagnostics. However, which features of the microbiome provide the best signature to distinguish and classify microbiome samples remains yet to be determined, as many commonly used parameters in microbial ecology (e.g., 16S rRNA-based phylogenetic community compositions) show large variations across apparently related sample populations (e.g. healthy human stool samples).
Using a computationally efficient, but statistically rigorous, method developed by the inventors and described herein, predictive diagnostic markers (termed “features” herein) have been identified within metagenomic datasets that allow classification of samples as having particular signatures that correspond to selected gastrointestinal disease backgrounds. In particular, a method for identifying microbial community signatures that distinguish a priori-selected sample groups of interest has been developed. These signatures are based on sequence data compositions from metagenomic samples, i.e. total sequenced DNA isolates from entire microbiome samples. Classifiers based on these signatures were assessed for robust sensitivity and specificity using a cross-validation sub-sampling procedure in which random sets of samples were selected and subsequently re-classified individually using the remaining data. Additional testing could be performed using classifiers to assign samples that were not part of the original training sets. This approach was applied to data from the Human Microbiome Project as well as other gut microbiome datasets, including data from inflammatory bowel disease (IBD) patients with Crohn's disease (CD) and/or ulcerative colitis (UC).
Using this method of statistical analysis, sets of features were identified that allow for patients with Crohn's disease to be distinguished from patients having ulcerative colitis at a statistically-significant rate of accuracy. Thus, the method of statistical analysis has been applied to metagenomic datasets to create classifiers that accurately diagnose patients as having IBD, as well as distinguish between patients with ulcerative colitis and Crohn's disease. These feature sets can be used to assign additional samples to these two groups. The present invention may be used to (i) identify signatures characteristic for each of the two classes, (ii) assess the statistical significance associated with these signatures, and (iii) perform a classification of additional samples of unknown patient backgrounds to either of the two groups.
In particular, the present invention relates to the identification and comparative analysis of sequence-features in metagenomic whole-genome shotgun (WGS) sequence data derived from human clinical specimens. In combination with the statistical analysis method developed by the inventors, a group of sequence-features that correspond to particular human microbiomes, termed significant features, are identified and together form microbiome signatures. Short oligomers of generally 8 nucleotides (termed “kmers” herein) are then identified that serve as classifiers for use in screening for signatures that will place a sample into a particular class (e.g., a patient with CD or UC). A set of terms that define specific relationships in the method are as follows:
The general manner in which the statistical methodology detailed in the present application is performed can be summarized as follows:
While the Examples provided herein are directed to the analysis of samples from human subjects, it will be apparent to one of ordinary skill in the art that the methods described herein can be conducted in conjunction with microbiome samples from a wide variety of animals, including, but not limited to, humans and non-human animals, e.g., a non-human primate, bird, horse, cow, goat, sheep, a companion animal, such as a dog, cat or rodent, or other mammal.
Microbiome samples analyzed using the methods of the present invention are not limited in the source or location on the subject from which they are obtained or the means used to obtain them. Exemplary sources include dental plaque (such as supragingival plaque), saliva, and stool. Exemplary locations include all areas of the skin, including on and around the anus, vagina, urethra, and the interior of the mouth. Specific exemplary locations include the anterior nares, buccal mucosa, posterior fornix, and tongue dorsum. Exemplary means for obtaining samples include swabs.
When the source of the microbiome sample is stool, extraction of total metagenomic DNA can be conducted using known techniques (Gajer et al., Sci Transl Med 4:132-152 (2012); Sellitto et al., PLoS One 7: e33387 (2012)), including community-approved standard operating procedures (SOPs) established and published by the Human Microbiome Project (http://hmpdacc.org/tools_protocols/tools_protocols.php). In particular, the MoBio UltraClean® Fecal DNA Isolation Kit (MO BIO Laboratories, Inc., Carlsbad, Calif.) can be used, which contains a humic acid inhibitor removal step.
Metagenomic DNA can be multiplexed and sequenced, for example, on a single channel of an Illumina HiSeq 2000 platform, following the manufacturer's recommendations as amended by the Genomics Resource Center at the Institute for Genome Sciences. For example, combining 100 samples on a single HiSeq 2000 channel will generate about 3 million 100 bp paired-end sequence reads per sample.
Working with stool samples provides advantages with respect to potential clinical integration as a high-throughput diagnostic tool: (i) sample amounts: typically 0.1 g of stool are sufficient to isolate genomic DNA for several rounds of sequencing; (ii) automation: robotic platforms and corresponding sample processing kits from several manufacturers (e.g. MoBio, Qiagen, Zymo) allow for automated processing of stool samples for DNA isolation. Together with automated workflow systems for sequencing library generation and sequencing on the Illumina HiSeq 2000 or MiSeq platform, fast, efficient and cost-effective sample processing can be achieved.
A method of statistical analysis has been developed whereby metagenomic WGS sequence data of particular microbiomes is analyzed to identify particular signatures that can be used to distinguish samples as belonging to one of two (or more) defined classes. Short oligomers of 8 nucleotides (“kmers”), for example, are then identified that serve as features for use in screening for signatures to develop a classifier that will place a sample into a particular class (e.g., a patient with CD or UC). Kmers of other lengths, such as from 1 to 50 nucleotides in length, may also be used, and the lengths include each integer from 1 to 50.
The described procedure is comprised of three major phases: (1) significant feature selection, (2) signature selection and classifier development, and (3) classifier validation and testing.
The first phase of the analysis begins by obtaining a set of n metagenomic dataset samples S=[S1, S2, S3, . . . , Sn] associated with a set of m mutually-exclusive classes where m≧2. The metagenomic datasets can be obtained from existing databases, such as the Human Microbiome Project (HMP) body site database or from the MetaHIT project [4] which contains WGS metagenomic samples from stool of European individuals.
1.1 Kmer frequency estimation
The number of times short DNA oligomers (kmers; e.g., oligonucleotide sequences of 8 nucleotides=an 8 mer) appear in each metagenomic sample S is counted using the Jellyfish program from the open source wgs-assembler software package (http://www.cbcb.umd.edu/software/jellyfish/). Overlapping kmers are counted on both DNA strands, i.e. in the original and reverse complement sequence of each read, resulting in sequences being counted twice. To reduce redundancy, only canonical kmers are stored, i.e., a specific kmer w and its reverse complement wrc are not distinguished. The kmer that appears first in alphabetical order is recorded. Therefore, the feature space consisted of 32,896 canonical 8 mers ($4^{8} \times (0.5 \times 255/256 + 1/256) = 32{,}896$).
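The canonical counting rule described above can be illustrated with a minimal Python sketch. It is not the Jellyfish implementation; the function names are hypothetical and chosen only for illustration. The sketch counts each read window once under its canonical form, whereas counting both strands as described would double every count without changing relative abundances. The size of the canonical feature space is computed as (4^k + P)/2, where P is the number of kmers that equal their own reverse complement (4^(k/2) for even k, none for odd k).

```python
from collections import Counter

# Translation table for complementing DNA bases.
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def canonical(kmer):
    """Return whichever of the kmer and its reverse complement sorts first."""
    rc = reverse_complement(kmer)
    return kmer if kmer <= rc else rc


def count_canonical_kmers(reads, k=8):
    """Count overlapping canonical kmers across an iterable of reads.

    Windows containing characters other than A, C, G, T (e.g. N) are
    skipped; that policy is an assumption of this sketch.
    """
    counts = Counter()
    for read in reads:
        read = read.upper()
        for i in range(len(read) - k + 1):
            window = read[i:i + k]
            if set(window) <= set("ACGT"):
                counts[canonical(window)] += 1
    return counts


def canonical_space_size(k):
    """Number of distinct canonical kmers of length k."""
    self_complementary = 4 ** (k // 2) if k % 2 == 0 else 0
    return (4 ** k + self_complementary) // 2
```

For k=8 the feature space computed by `canonical_space_size` agrees with the 32,896 canonical 8 mers stated above.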
Kmers are next assessed for differential mean abundance between the two classes, as defined a priori, using the Metastats program [26] (−b set to 5000 permutations). Due to the sizeable feature space of 32,896 kmers, accounting for multiple hypothesis tests is critical to assess false positives. This means that between any two groups, a certain number of kmers will show differential abundance by chance. Therefore, a p-value threshold is selected that controls the false discovery rate [27]. Given a p-value threshold, this results in a set of kmers, designated as corresponding to significant features, to be considered in the next phase. Lastly, kmer counts are normalized to relative abundances within each sample (i.e. given a value between 0 [0%] and 1 [100%]).
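The normalization and thresholding steps can be sketched as follows. This is an illustration only: the Metastats permutation test itself is not reproduced, and the Benjamini-Hochberg step-up rule is used here as an assumed stand-in for the false discovery rate procedure of reference [27], which may differ. All function names are hypothetical.

```python
def relative_abundance(counts):
    """Convert raw kmer counts for one sample to fractions between 0 and 1."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sample contains no counted kmers")
    return {kmer: c / total for kmer, c in counts.items()}


def bh_p_value_threshold(p_values, fdr):
    """Largest p-value passing the Benjamini-Hochberg step-up rule.

    Returns 0.0 when no p-value passes. (Assumed procedure; see lead-in.)
    """
    m = len(p_values)
    cutoff = 0.0
    for rank, p in enumerate(sorted(p_values), start=1):
        if p <= fdr * rank / m:
            cutoff = p
    return cutoff


def significant_features(p_by_kmer, fdr):
    """Return the kmers whose p-value is at or below the selected threshold."""
    cutoff = bh_p_value_threshold(list(p_by_kmer.values()), fdr)
    return sorted(k for k, p in p_by_kmer.items() if p <= cutoff)
```

In use, `p_by_kmer` would hold one p-value per canonical kmer from the differential abundance test, and the returned list would be the set of significant features carried into the next phase.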
To prevent biased weighting of significant features during the classification process, the distribution of each feature is centered to zero mean and scaled to unit variance. Specifically, given the distribution of values for a feature F across all datasets SF=[S1F, S2F, S3F, . . . , SnF], the sample mean and variance of this vector (μF and σ²F, respectively) are computed, and the vector is normalized by subtracting μF from each value in the feature vector and subsequently dividing by the square root of the variance:
$$S'_{iF} = \frac{S_{iF} - \mu_F}{\sqrt{\sigma^2_F}}, \qquad i = 1, \ldots, n$$
The associated scaling factors (μF and σ²F) from this normalization are also applied to all downstream test sets.
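A minimal Python sketch of this centering and scaling is given below. It assumes the sample variance is the unbiased (n−1) estimator, which the text does not specify, and the function names are hypothetical. The scaling factors are estimated once on the training values and then reused unchanged for test samples.

```python
from statistics import mean, variance


def fit_scaling(train_values):
    """Estimate the scaling factors (mean, variance) for one feature."""
    mu = mean(train_values)
    var = variance(train_values)  # unbiased (n-1) estimator; an assumption
    if var == 0:
        raise ValueError("feature has zero variance and cannot be scaled")
    return mu, var


def apply_scaling(values, mu, var):
    """Center by mu and divide by the square root of the variance."""
    sd = var ** 0.5
    return [(v - mu) / sd for v in values]


# Training values for one feature, then a new test value scaled with
# the factors learned from the training set.
mu, var = fit_scaling([2.0, 4.0, 6.0])
train_scaled = apply_scaling([2.0, 4.0, 6.0], mu, var)
test_scaled = apply_scaling([8.0], mu, var)
```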
For each class b, the data containing all samples and corresponding normalized values of significant features is processed according to the following algorithm. For each selection iteration i, a subset of Ni significant features is first selected from the set of differentially abundant features determined in 1.2 (Nmin≦Ni≦Nmax). This set of Ni features comprises a signature. Under this signature:
At the end of this procedure, prediction results for each signature selection iteration i relative to class b are obtained. Signatures are ranked by overall classification accuracy.
The nearest neighbor algorithm is a mainstay of machine learning and has been successfully used for many classification problems ranging from secondary structures of proteins to face recognition. For high dimensional datasets, it is not possible to evaluate every unique combination of parameters (e.g. [100 choose 8] = $1.861 \times 10^{11}$); therefore, selection of Nmin and Nmax is important. For each application of the method of statistical analysis, broad searches are empirically performed to identify practical values for Nmin and Nmax.
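The size of the search space, and the random sampling that stands in for exhaustive enumeration, can be sketched in Python as follows. The uniform choice of signature size between Nmin and Nmax is an assumption of this sketch, as are the function names; the text does not state how Ni is drawn.

```python
import math
import random


def signature_space_size(n_features, signature_size):
    """Number of distinct signatures of a given size (n choose k)."""
    return math.comb(n_features, signature_size)


def draw_signature(features, n_min, n_max, rng):
    """Draw one random signature containing between n_min and n_max features.

    The size is drawn uniformly from [n_min, n_max] (an assumption), and
    the features are sampled without replacement.
    """
    size = rng.randint(n_min, n_max)
    return rng.sample(features, size)


rng = random.Random(0)
features = [f"feature_{i}" for i in range(100)]  # hypothetical feature names
signature = draw_signature(features, 10, 50, rng)
```

Because `signature_space_size(100, 8)` already exceeds 10^11, each iteration evaluates only one randomly drawn signature rather than attempting to cover the full space.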
Performing this classification training separately for each class allows for the possibility that some signatures distinguish a particular class from all others without effectively distinguishing all classes universally from one another. This flexibility leads to the application of multiple binary classifiers for class-specific assignment instead of a single universal classifier.
The robustness of top ranking classifiers is assessed using the following cross-validation (CV) procedure. For each repeated CV iteration, a random subset of samples is removed from the full dataset (e.g. 15%) and used for cross-validation consisting of re-classification of each removed sample individually on the basis of the remaining samples and the same nearest-neighbor algorithm as described above. The average sensitivity and specificity across all CV iterations are recorded for each classifier. Choices for iteration and size of validation set vary depending on the characteristics of the dataset.
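The cross-validation procedure can be sketched in Python as shown below. This is a simplified illustration under stated assumptions: a single nearest neighbor with a Euclidean distance metric is used, "the remaining samples" is taken to mean the samples that were not removed in that iteration, and iterations in which no sample of a given class is held out are excluded from the corresponding average. The function and parameter names are hypothetical.

```python
import math
import random


def nearest_neighbor_label(query, ref_vectors, ref_labels):
    """Return the label of the reference vector closest to the query."""
    best = min(range(len(ref_vectors)),
               key=lambda i: math.dist(query, ref_vectors[i]))
    return ref_labels[best]


def cross_validate(vectors, labels, positive, holdout_fraction=0.15,
                   iterations=100, seed=0):
    """Repeated random-holdout validation; returns (mean SN, mean SP)."""
    rng = random.Random(seed)
    n = len(vectors)
    n_holdout = max(1, round(holdout_fraction * n))
    sensitivities, specificities = [], []
    for _ in range(iterations):
        held = set(rng.sample(range(n), n_holdout))
        ref_vectors = [vectors[i] for i in range(n) if i not in held]
        ref_labels = [labels[i] for i in range(n) if i not in held]
        tp = fn = tn = fp = 0
        for i in held:
            predicted = nearest_neighbor_label(vectors[i], ref_vectors, ref_labels)
            if labels[i] == positive:
                if predicted == positive:
                    tp += 1
                else:
                    fn += 1
            else:
                if predicted == positive:
                    fp += 1
                else:
                    tn += 1
        if tp + fn:
            sensitivities.append(tp / (tp + fn))
        if tn + fp:
            specificities.append(tn / (tn + fp))

    def average(values):
        return sum(values) / len(values) if values else float("nan")

    return average(sensitivities), average(specificities)
```

Here each row of `vectors` would hold the normalized values of the features in one signature for one sample, so a separate call evaluates each candidate classifier.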
To test the performance of a classifier on new samples, the significant features from the initial assessment are evaluated for these samples, i.e. the counts of those kmers that correspond to the significant features comprised in the signature are determined and normalized according to the scaling factors described in section 2.1. The sample is then assigned to either class b or not b using the nearest neighbor algorithm with a Euclidean distance metric as described in 2.2 A.
Metagenomic data sets analyzed using the methods of statistical analysis described above include the following.
Human Microbiome Project (HMP) body site data. A total of 690 HMP WGS metagenomic samples from six distinct body sites (anterior nares, buccal mucosa, posterior fornix, stool, supragingival plaque, and tongue dorsum) were analyzed. The minimum number of samples associated with a particular body site was 51. Given the large scale of the HMP WGS Illumina sequence datasets (over 15.3 terabytes of raw sequence), kmer counting was performed on random sequence subsets of each sample (˜20% of raw reads). This resulted in an average kmer frequency of 4,139 observations per kmer and sample.
MetaHIT data, healthy individuals, and inflammatory bowel disease (IBD) patients. Raw Illumina GA II read data was acquired from the MetaHIT project [4] which contained WGS metagenomic samples from stool of 124 European individuals. Of the 124 individuals, 21 had ulcerative colitis (UC) and four had Crohn's disease (CD). The remaining 99 were considered healthy.
Obese and lean twin gut microbiomes. The classifiers were additionally tested on WGS metagenomic datasets representing the gut microbiomes of 15 subjects who were either obese (BMI>30, n=9) or lean (BMI=18.5-24.9, n=6) [2]. This raw sequence data was generated using the Roche/454 platform, either with the FLX or Titanium generation of sequencers.
Japanese gut microbiomes. Nine gut microbiome samples were obtained from a study of Japanese infants and adults [5]. All subjects in this case were >1.5 years in age (mean: 26.6 yrs). Datasets consisted of assembled contigs and singleton reads generated using Sanger sequencing technology.
Healthy individual and Crohn's disease patient fecal microbiomes. Raw sequence data from two recent studies on inflammatory bowel disease was included. 1) Roche/454 Titanium reads corresponding to eight fecal samples from patients diagnosed with CD and four samples from healthy controls [29], which contained technical replicates, i.e. independently generated sequence data from the same sample, which were treated as independent samples in the analysis of this data. 2) Illumina MiSeq reads corresponding to four fecal samples from CD patients and seven samples from healthy controls [30].
A proof-of-concept study focused on assigning metagenomic samples from the Human Microbiome Project to their associated body sites. The analysis was restricted to 690 samples from six well-represented body regions (anterior nares, buccal mucosa, posterior fornix, supragingival plaque, stool, tongue dorsum).
In the initial differential abundance detection of kmers (see section 1., above), a universal p-value cutoff (p<5e-04) was heuristically chosen such that the resulting expected feature false discovery rate (FDR) was below 0.08% for each pairwise comparison of body sites. Additionally, only kmers that were differentially abundant between a particular body site and all other sites were included. For example, if the kmer “ACGTTACG” was identified as differentially abundant in all pairwise comparisons involving buccal mucosa, it was designated as a significant feature to consider for buccal mucosa signature selection. Thus, significant feature sets differed by body site. After performing a broad scan of signature set sizes where Nmin=Nmax (range 10-1000), Nmin=50 and Nmax=100 were selected as the best balance between computational complexity and overall accuracy.
For each body site and its associated set of priority kmers, the feature selection algorithm (see section 2., above) was run for 10,000 iterations. The 10 signatures with the best classification accuracy were selected for cross-validation; if more than 10 signatures showed 100% accuracy (e.g. for stool samples), 10 of them were selected at random. The single best performing classifier was then chosen based on average sensitivity/specificity (SN/SP) results. In some cases, multiple classifiers had optimal performance (e.g. stool classifiers), and in these cases a single classifier was randomly selected to report.
Table 1 displays the cross-validation results of the top selected classifier for each body site. For each classifier, sensitivity and specificity values were calculated for each subsampling step of the cross-validation procedure. Fisher's p-values were determined to assess the statistical significance of the obtained accuracy of the classifier during cross-validation. All classifiers demonstrated high accuracy with a mean sensitivity of >97% and mean specificity of >99%. Furthermore, the corresponding range of p-values from Fisher's exact test showed that the classification results from cross-validation are significantly better than random.
Further investigation into how the body site classifiers would perform on datasets from metagenomic studies outside the HMP, including those employing different sequencing technologies, was undertaken. To do so, three datasets associated with previously published distal gut microbiome projects [2,4,5] were collected. The first dataset represented the gut microbiota of 85 healthy subjects from the MetaHIT project. Like the HMP, the MetaHIT project utilized the Illumina sequencing platform. The second dataset was comprised of 15 samples generated by the Roche/454 sequencer from the study on obese and lean twins [2], while the third set of nine samples from the Japanese study [5] used Sanger sequencing and only provided assembled contigs and singleton reads. Kmer counting was performed on all samples, and they were assigned using each body site classifier from Table 1.
Table 2 displays the results of the HMP body site classifiers on the external gut microbiome datasets. For the stool classifier, the overall sensitivity was computed for each gut dataset, that is, the percentage of tested samples that were successfully assigned to stool. In contrast, for each of the other body site classifiers, the specificity was calculated, that is, the percentage of gut samples that were correctly classified as not belonging to that body site. It was observed that all classifiers performed perfectly on these test sets, either with 100% sensitivity or 100% specificity, respectively.
In spite of the highly variable characteristics of these datasets (e.g. different geographic sample origins, sequencing technologies, or read lengths, assembled vs. unassembled reads), the classifiers assigned all gut samples perfectly. This evidence supports the robustness of the microbial signatures to alterations in sequencing technology or data processing.
Crohn's Disease vs. Ulcerative Colitis
The method of statistical analysis was applied to a setting of clinical importance. Clinicians often find it difficult to easily distinguish between different forms of inflammatory bowel disease. Therefore, the method was performed to determine signatures that accurately distinguished patients with CD from patients with UC.
The training sets included the following: Crohn's Disease—the CD training set was comprised of metagenomic WGS data from stool samples from: 4 CD patients from the European MetaHIT study (Illumina HiSeq sequence data) [4] plus 12 CD patients (including several technical replicates) from two U.S. studies (Roche/454 sequence data, [29] and Illumina MiSeq data [30]), thus providing 16 CD samples; Ulcerative Colitis—the UC training set was comprised of gut samples from the MetaHIT project representing 21 patients with UC.
Kmers (8 mers) with significant differential abundance between the two sample groups were determined with the Metastats program [26]. After differential abundance analysis, a p-value threshold was selected to maintain a FDR <1%, resulting in a set of 5,287 significant features provided in Table 3.
The signature selection algorithm was run on the 5,287 significant features for 105,287 iterations using signatures consisting of between 1 and 50 different 8 mers. Each sample was screened using the different signature iterations, and then classified as belonging to one of the two classes (i.e. UC vs. CD). This classification was performed with the nearest neighbor algorithm with a Euclidean distance metric (see section 2.2A above). Next, examining the 2×2 outcome table of the resulting assignments, the corresponding sensitivity (SN), specificity (SP), and positive/negative predictive values (PPV/NPV) were computed and recorded. Fisher's exact test was used to assess the significance of each classification relative to a randomized assignment. All 105,287 tested signatures, i.e. combinations of between 1 and 50 random 8 mers, were ranked based on 2×2 outcome tables.
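The quantities computed from each 2×2 outcome table can be sketched in Python as follows. The sketch treats one class (for example CD) as the positive class, which is an assumption, and implements a two-sided Fisher's exact test directly from the hypergeometric distribution; the function names are hypothetical and the statistical software actually used is not specified in the text.

```python
import math


def outcome_metrics(tp, fp, fn, tn):
    """Sensitivity, specificity, PPV and NPV from a 2x2 outcome table."""
    def ratio(numerator, denominator):
        return numerator / denominator if denominator else float("nan")
    return {
        "SN": ratio(tp, tp + fn),
        "SP": ratio(tn, tn + fp),
        "PPV": ratio(tp, tp + fp),
        "NPV": ratio(tn, tn + fn),
    }


def fisher_exact_two_sided(tp, fp, fn, tn):
    """Two-sided Fisher's exact test p-value for a 2x2 table.

    Sums the probabilities of all tables with the same margins that are
    no more likely than the observed table.
    """
    predicted_pos = tp + fp
    predicted_neg = fn + tn
    actual_pos = tp + fn
    total = predicted_pos + predicted_neg

    def prob(a):
        return (math.comb(predicted_pos, a)
                * math.comb(predicted_neg, actual_pos - a)
                / math.comb(total, actual_pos))

    observed = prob(tp)
    lowest = max(0, actual_pos - predicted_neg)
    highest = min(predicted_pos, actual_pos)
    p_value = 0.0
    for a in range(lowest, highest + 1):
        p = prob(a)
        if p <= observed * (1 + 1e-9):  # tolerance for floating-point ties
            p_value += p
    return min(1.0, p_value)


# A perfect assignment of 16 CD and 21 UC samples, with CD as positive.
metrics = outcome_metrics(tp=16, fp=0, fn=0, tn=21)
p_value = fisher_exact_two_sided(tp=16, fp=0, fn=0, tn=21)
```

Ranking signatures by these values places those with SN, SP, PPV and NPV all equal to 1 and the smallest p-value at the top.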
Of these 105,287 iterations, a total of 1,087 signatures were found that perfectly distinguished CD and UC samples. These signatures are provided in Table 4. Kmers listed in column A of Table 4 correspond to the kmers in Table 3, starting with position 0 (e.g. “AAAAAAAA” is number 0), running down a column, and then continuing to the next column to the right. Of the 5,287 significant 8-mers, 5,063 (95.8%) are included in at least one top-performing signature.
As Table 4 shows, using any of the 1,087 signatures, the 37 samples were correctly classified as coming from the 16 patients with Crohn's disease or the 21 patients with ulcerative colitis, i.e. 100% specificity and sensitivity were achieved for this classification.
The robustness of the 1,087 top-performing classifiers (i.e., signatures) was assessed using a re-sampling cross-validation procedure as described in section 3.1 above. This type of validation determines whether the accuracy of the classifiers is significantly reduced if only a subset of the original samples is used as a reference for the classification, i.e. it can help assess whether additional reference samples are needed to improve the classifier. For a particular classifier undergoing validation, a random subset of samples (20%) was removed from the full dataset and then individually re-classified. This process was repeated for 100 iterations for each classifier. The average SN, SP, PPV, NPV and Fisher's exact test p-values across all cross-validation iterations were recorded for each tested signature, i.e. for each set of 8-mers. Of all 1,087 top-performing classifiers, 17 also showed 100% mean SN, SP, PPV, and NPV values after cross-validation (Table 5). This indicates that the variation detected by these 17 signatures between samples from the CD and UC classes is stronger than the variation between samples from the same class (i.e. CD or UC) and, thus, that classification of samples as UC or CD with these 17 signatures is most reliable. Classification with the remaining 1,070 signatures, on the other hand, is hindered by the detection of signals resulting from variation between samples from the same class. The accuracy of these 1,070 signatures is therefore expected to depend to a larger extent on the number of reference samples and to improve with larger numbers of reference samples.
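The re-sampling step can be sketched as below, under stated assumptions: the held-out 20% of samples is classified by nearest neighbor against the remaining 80%, CD is the positive class, and an iteration in which a metric is undefined (for example, no CD sample was held out) is skipped when averaging that metric. The function names are hypothetical, and the exact procedure of section 3.1 may differ.

```python
import math
import random


def nearest_label(query, references, signature):
    """Label of the reference profile closest to `query` (Euclidean distance)."""
    point = [query[k] for k in signature]
    best = min(references,
               key=lambda ref: math.dist(point, [ref[0][k] for k in signature]))
    return best[1]


def cross_validate(samples, signature, holdout=0.2, iterations=100,
                   positive="CD", seed=0):
    """Mean SN, SP, PPV and NPV over repeated random hold-out splits."""
    rng = random.Random(seed)
    n_out = max(1, round(holdout * len(samples)))
    sums = {"SN": 0.0, "SP": 0.0, "PPV": 0.0, "NPV": 0.0}
    counts = dict.fromkeys(sums, 0)
    for _ in range(iterations):
        held = set(rng.sample(range(len(samples)), n_out))
        refs = [s for i, s in enumerate(samples) if i not in held]
        tp = fp = fn = tn = 0
        for i in held:
            profile, label = samples[i]
            predicted = nearest_label(profile, refs, signature)
            if predicted == positive and label == positive:
                tp += 1
            elif predicted == positive:
                fp += 1
            elif label == positive:
                fn += 1
            else:
                tn += 1
        ratios = {"SN": (tp, tp + fn), "SP": (tn, tn + fp),
                  "PPV": (tp, tp + fp), "NPV": (tn, tn + fn)}
        for name, (num, den) in ratios.items():
            if den:  # assumption: undefined ratios are left out of the mean
                sums[name] += num / den
                counts[name] += 1
    return {name: (sums[name] / counts[name] if counts[name] else float("nan"))
            for name in sums}
```

A signature that retains mean values of 1.0 for all four metrics under this kind of re-sampling corresponds to the 17 classifiers singled out above.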
To evaluate the importance of kmer length for sample classification, the signature identification method was also applied to kmer abundances for kmers of length 2, 3, and 4 nucleotides and a training set of 21 UC and 12 CD samples (including the 4 CD patients from MetaHIT and 8 CD patients from one of the U.S. studies [29]). The results, including mean SN, SP, PPV, NPV, and Fisher's exact test P-values, are shown in Table 6 and indicate that classification of CD and UC samples is possible using signatures of kmer combinations with a length of 2 nucleotides or longer, although at lower accuracy (i.e., SN, SP < 100%) compared to kmers of length 8 nucleotides.
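For illustration, kmer abundances for an arbitrary length k can be computed as in the following sketch. It assumes reads are plain nucleotide strings, skips any window containing a character other than A, C, G or T, and normalizes counts to relative frequencies; whether the original analysis merged reverse complements or normalized differently is not stated here, and the function names are hypothetical.

```python
from collections import Counter


def kmer_frequencies(reads, k):
    """Relative frequency of each observed kmer of length `k` across all reads."""
    counts = Counter()
    for read in reads:
        seq = read.upper()
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            # Assumption: windows with ambiguous bases (e.g. N) are ignored.
            if set(kmer) <= set("ACGT"):
                counts[kmer] += 1
    total = sum(counts.values())
    return {kmer: n / total for kmer, n in counts.items()} if total else {}


def feature_space_size(k):
    """Number of distinct kmers of length `k` over the alphabet A, C, G, T."""
    return 4 ** k
```

The size of the feature space grows as 4^k: 16, 64 and 256 possible kmers for lengths 2, 3 and 4, compared with 65,536 possible 8-mers.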
To validate the statistical significance of the findings, the same type of kmer-based signature identification analysis was performed using computer-generated random kmer counts for kmers of length 4 nucleotides. For this analysis, two groups with 100 samples each were compared. Although the sensitivity of the top-performing signatures in distinguishing these random sample data is relatively high (73%), the specificity is low (52%) and the associated mean Fisher's exact test P-value is high (0.25), as would be expected. The results of this comparison are also shown in Table 6.
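A negative control of this kind could be generated along the following lines. The sketch assumes counts are drawn uniformly at random for every 4-mer, independently of group label, and scores a signature by leave-one-out nearest-neighbor accuracy; the sampling distribution, the count range and the function names are assumptions for illustration only.

```python
import itertools
import math
import random


def random_profiles(n_per_group=100, k=4, max_count=1000, seed=0):
    """Two groups of samples whose kmer counts carry no group signal."""
    rng = random.Random(seed)
    kmers = ["".join(p) for p in itertools.product("ACGT", repeat=k)]
    samples = []
    for label in ("group1", "group2"):
        for _ in range(n_per_group):
            profile = {kmer: rng.randint(0, max_count) for kmer in kmers}
            samples.append((profile, label))
    return kmers, samples


def leave_one_out_accuracy(samples, signature):
    """Fraction of samples whose nearest other sample shares their label."""
    correct = 0
    for i, (profile, label) in enumerate(samples):
        query = [profile[k] for k in signature]
        best_label, best_dist = None, math.inf
        for j, (other, other_label) in enumerate(samples):
            if j == i:
                continue
            d = math.dist(query, [other[k] for k in signature])
            if d < best_dist:
                best_label, best_dist = other_label, d
        correct += best_label == label
    return correct / len(samples)
```

Because the labels are unrelated to the counts, any single signature is expected to perform near chance on such data; apparently good values arise mainly from selecting the best of many random signatures, which is why the accompanying Fisher's exact test P-values remain high.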
The approach disclosed herein began with the essential observation that significant biases exist in the oligonucleotide compositions of metagenomic samples between different body sites and that some of these biases are remarkably stable across multiple individuals, in spite of an overall taxonomic variability of the microbiota, as determined by 16S rRNA phylogenetic analysis. Taking advantage of this information, robust signatures were discovered for accurately assigning metagenomic samples to different body sites or to particular disease states such as the two inflammatory bowel diseases, Crohn's disease and ulcerative colitis.
While the invention has been described with reference to certain particular embodiments thereof, those skilled in the art will appreciate that various modifications may be made without departing from the spirit and scope of the invention. The scope of the appended claims is not to be limited to the specific embodiments described.
All patents and publications mentioned in this specification are indicative of the level of skill of those skilled in the art to which the invention pertains. Each cited patent and publication is incorporated herein by reference in its entirety. All of the following references have been cited in this application:
| Filing Document | Filing Date | Country | Kind |
|---|---|---|---|
| PCT/US2014/011321 | 1/13/2014 | WO | 00 |

| Number | Date | Country |
|---|---|---|
| 61888288 | Oct 2013 | US |