New machine learning tools or artificial intelligence (AI) are able to analyze key biomarkers including those from the fecal metagenome and metabolome to discriminate risk factors for disease in a variety of conditions and in particular preterm infants at risk of necrotizing enterocolitis (NEC).
A major limitation in preventing or treating particular diseases is that they may be multifactorial, arising from a combination of genetics and environmental factors such as the composition and function of the host microbiomes (including but not limited to the gut microbiome), and may be difficult to treat due to underlying variability in the functional capacity contained within the metagenome that may alter risk.
Prevention of neonatal necrotizing enterocolitis (NEC), a specific condition known to affect the preterm infant gut, is hampered by the inability to predict which subset of premature infants is at risk for developing NEC. Recently, gut dysbiosis has emerged as a major trigger in NEC, a view particularly supported by the fact that NEC cannot be produced in germ-free animals.
Major limitations have been encountered when focusing solely on the taxonomic level. Composition of the microbiome (i.e., which microbial species are represented) is not enough to uncover microbial signatures for NEC. A greater depth of functional information is required to uncover the patterns needed for accurately diagnosing and altering the microbiome function to correct for the risk a premature infant has of developing NEC.
This invention provides a method of determining risk of necrotizing enterocolitis (NEC) in an infant, comprising the steps of: (a) obtaining a fecal sample of the infant's relevant microbiome; (b) sequencing genetic material in the sample to obtain sequence data for the relevant microbiome; (c) analyzing sequence data for the relevant microbiome to identify biomarkers in the infant's microbiome; and (d) categorizing the NEC risk of the infant using the biomarkers identified in the microbiome of the infant.
In a preferred mode, the categorizing according to step (d) is based on an artificial intelligence (AI) model developed by analyzing sequence data from the relevant microbiomes of N infants, the N infants comprising at least M infants diagnosed with NEC, and N−M infants not diagnosed with NEC, where the AI model is developed by processing the sequence data from the relevant microbiomes of the N infants by Machine Learning algorithms to identify at least X biomarkers which differ significantly between infants diagnosed with NEC and infants not diagnosed with NEC and associating the X biomarkers with infants having or at risk for having NEC. Generally, N is at least 10-fold higher than X and M is at least 2-fold higher than X. Preferably, N is between 400 and 10,000 infants, and M is between 200 and 1300 infants, and more preferably, X is at least 5, at least 10, at least 20, at least 30 or at least 40 biomarkers. Typically, the biomarkers identified in step (c) are proteins, mobile genetic elements, functional annotations, superpathways, taxonomic identifiers, and/or combinations thereof. Preferably, the biomarkers identified in step (c) are biomarkers found on Table 5 and/or 6.
In accordance with this invention, the infant may be a term infant or a preterm infant. The relevant microbiome for this invention may be an intestinal microbiome, fecal microbiome, a milk microbiome, a skin microbiome, an environmental microbiome, or a combination thereof. Further according to this invention, the infant's risk of NEC is likely to be categorized as high if intestinal ARG levels are low [add quantitiation], and/or the [insert quantifiable threshold for intestinal integrity]. This invention also provides for therapy of an infant having high risk of NEC categorized according to this invention, where such infants are treated by administering B. infantis and/or mammalian milk oligosaccharides (MMO).
Inventors have developed a process for characterizing microbiome samples which reveals a biomarker pattern associated with NEC. This process can be utilized with any human-associated microbiome, including but not limited to, fecal, skin, or milk, as well as environmental microbiome such as those found on non-living surfaces or in the air, to assess the likelihood of the presence of NEC in the individual or the likelihood of development of NEC. This process could further be utilized to assess the risk of development of NEC by patients exposed to environments shown to exhibit a NEC-associated biomarker pattern.
This process consists primarily of the collection of a microbiome sample, followed by analysis of said sample through genetic sequencing techniques; resulting sequence data is then annotated by labeling genes associated with microbial biomarkers and superpathways. Annotated sequence data is further analyzed through one or more machine learning algorithms which have been trained to detect biomarker and superpathway patterns associated with NEC.
Indifferent to host genetic background, AI or machine learning offers the potential to provide previously undiscovered associations that facilitate stratification of risk within a particular population to identify not only individuals most at risk, but also to provide alternative protocols and therapies that can be deployed to prevent and/or treat based on these different risk profiles.
The insights from machine learning can be used to provide a deeper, more complete understanding of interactions and critical influencers within the microbiome that are a signature of the underlying dysbiosis associated with NEC. Applications can include new drug discovery pipelines, environmental monitoring, and new treatment protocols for prevention and/or treatment options that focus on risk reduction.
Fecal samples provide an underexplored opportunity to non-invasively understand a number of systems simultaneously, including metabolic, immune activity, and intestinal integrity. Intestinal integrity includes proliferation or growth, wound healing, tight junctions, mucin production, and/or immune activity as a measure of competence against dysbiosis-associated disease conditions.
The invention described here goes beyond taxonomic classification to be agnostic on the precise composition of the gut microbiome but rather focuses on the functional capacity down to the individual gene level to predict with better accuracy the NEC risk and treatment options. The specific biomarker patterns and/or superpathways provide a more integrated, comprehensive, and holistic view of the gut microbiome and its function that can be monitored.
The algorithm can be used on unknown samples from infants in the NICU by taking a fecal sample and sequencing the fecal sample using shotgun metagenomics, which will allow taxonomic and functional characterization of the infant's microbiome. The sequencing data is then entered into the software assembled as part of this invention in which an algorithm is used to predict NEC risk.
Moreover, coupling metagenomics with metabolomics, observed as well as predicted via machine learning, will identify proteins that are signatures of NEC risk. This platform may be used to identify the biomarkers and then develop assays based on the knowledge of the bacteria present, the gene functions, gene expression, protein expression, and/or the output of one or more key metabolites in identified superpathways.
The protein biomarkers may be used to create a protein-based assay, which may be employed to indicate the level of NEC risk before proceeding with shotgun metagenomic sequencing and may also lead to small molecule drug discovery through a greater understanding of the metabolomics profile. The protein assay may provide a rapid diagnostic tool aiding doctors in deciding how to handle each case of prematurity and greatly reduce errors in communication or individual diagnosis.
These may also be used to develop new drug candidates to sort through the abundance of the gene products most often associated with NEC.
Necrotizing enterocolitis (NEC) mostly affects the intestine of premature infants, but may affect term infants with other conditions. The wall of the intestine is invaded by bacteria, which cause local infection and inflammation that can ultimately destroy the wall of the intestine. Portions of the intestine die. The disease has three stages:
NEC burst. A period where the incidence of NEC spikes in the NICU seasonally due to an unknown change in the environment, probably linked to change in the microbial community composition.
A preterm infant is defined as a baby born alive before 37 weeks of pregnancy are completed. There are sub-categories of preterm birth, based on gestational age: extremely preterm (less than 28 weeks), very preterm (28 to 32 weeks), and moderate to late preterm (32 to 37 weeks). These infants may also be classified according to birth weight. Infants born with a birth weight less than 1500 g are defined as very low birth weight (VLBW) infants. Low birth weight (LBW) is defined as a birth weight of less than 2500 g (up to and including 2499 g).
Metagenome or metagenomic profile is defined as the totality of the DNA recovered from a given biological sample that can include human, bacteria, viruses, mold and yeast DNA.
Skin microbiome is any microbiome that can be recovered from any skin surface.
Milk microbiome is collected by swabbing the breast and is considered the extension of the maternal skin and infant buccal microbiomes.
Environmental microbiome refers to a sample containing the collection of microorganisms retrieved from any environmental source, including but not limited to, non-living surfaces; air; food; and/or water.
Dysbiosis-associated disease condition (DADC). A DADC refers to any physiological condition associated with an unhealthy composition and/or function of the individual's gut microbiome.
Metabolomic profile is the sum of all metabolites measured at a given time to provide a snapshot of overall metabolic output. It may be relative between one group or the next or may be quantified.
Superpathways are groups of functionally related reactions and/or metabolic or biosynthetic pathways.
Biomarker is any genetic information or information obtained by analyzing a genome. Biomarkers include proteins, mobile genetic elements, functional annotations, superpathways, and taxonomic information, among others.
Oligosaccharide refers to polymeric carbohydrates that contain 3 to 20 monosaccharides covalently linked through glycosidic bonds. In some embodiments, the oligosaccharides are purified from human or bovine milk/whey/cheese/dairy products (e.g., purified away from oligosaccharide-degrading enzymes in bovine milk/whey/cheese/dairy products).
Mammalian milk oligosaccharides are oligosaccharide compounds found, but not necessarily exclusively found, in mammalian milk. Mammalian milk oligosaccharides may come from any source so long as they are analogous in structure and/or function to those found in mammalian milk.
Synthetic human milk products containing prebiotics are those that are processed for delivery to the premature infant. Processing may occur in a manner which serves to preserve the milk and/or alter the composition. Pasteurization (or other heating methods), freezing, fractionation, separation and reassembly may all be considered. A prebiotic product may be any product that has at least one mammalian milk oligosaccharide of any species (i.e., human, bovine, ovine) contained in infant formula, or as a standalone product that is then mixed with human milk or infant formula, water or other liquid suitable for the preterm infant. The mammalian milk oligosaccharide may be derived from a synthetic process in yeast or E. coli, or from other chemical synthesis, as long as it has a structure that matches the structure or function of human milk. Examples include, but are not limited to, Lacto-N-biose, Lacto-N-tetraose (LNT), Lacto-N-neotetraose (LNnT), Fucosyl lactose (2′FL or 3′FL), and Sialyl lactose (3′SL or 6′SL).
As described below, the input for the analysis may be metagenome DNA sequences pulled from other databases and properly curated before analysis.
Typically, the input starts with collection of microbiome samples, which may be fecal samples. Fecal samples are non-invasive and can be readily collected from vulnerable populations, including but not limited to preterm infants and other hospitalized groups. DNA sequencing of fecal samples for preterm patient populations who may or may not be at risk for NEC can be used to better stratify the population by identifying those individuals who are at risk for development of a DADC (such as NEC) to improve the effectiveness of protocols or therapies used to treat patients under physician care. This can be achieved by isolating the total DNA present in fecal samples, which includes all the human, bacterial, viral, yeast and mold DNA present in that sample. The DNA can be prepared for deep sequencing that allows for all of the different contributions to be detected. The inventors also utilized a tool (bowtie2) to scrub all human DNA from the analysis for HIPAA compliance, which renders de-identified samples for further population-based analysis, when required.
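As an illustration of the human-read scrubbing step, the following minimal sketch builds a bowtie2 command that keeps only read pairs failing to align concordantly to a human reference index. The function names, file names, index prefix and output naming are hypothetical, and the exact options the inventors used are not stated in this description; a stricter filter may be preferred in practice.

```python
import subprocess


def build_host_removal_cmd(index_prefix, reads_1, reads_2, out_prefix, threads=4):
    """Return a bowtie2 command that writes the read pairs which do NOT align
    concordantly to the human index (putative non-human reads)."""
    return [
        "bowtie2",
        "-p", str(threads),      # number of alignment threads
        "-x", index_prefix,      # prefix of a prebuilt human bowtie2 index (assumed)
        "-1", reads_1,           # mate 1 FASTQ
        "-2", reads_2,           # mate 2 FASTQ
        # '%' is replaced by 1 / 2 for the two mate files
        "--un-conc-gz", f"{out_prefix}_nonhuman_R%.fastq.gz",
        "-S", "/dev/null",       # discard the human alignments (POSIX path)
    ]


def remove_host_reads(index_prefix, reads_1, reads_2, out_prefix, threads=4):
    """Run the command; requires bowtie2 to be installed and on PATH."""
    cmd = build_host_removal_cmd(index_prefix, reads_1, reads_2, out_prefix, threads)
    subprocess.run(cmd, check=True)
    return cmd
```

Only the files written by `--un-conc-gz` would then be passed on to taxonomic and functional profiling.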
Metagenomics analysis of microbiome samples (e.g., fecal samples) can be used to understand key differences between certain groups. Certain embodiments of the invention provide a method of measuring the metagenome to identify differences between individuals in a given group. The group may consist of individuals within the same age group with unknown or known risk factors for a certain condition. In some embodiments, the metagenome is used in the method to help identify differences between individuals or to determine health status of an individual. It is also possible to take repeated measures from the same individual over time to assess pre-clinical differences between individuals who later went on to develop the condition. This metagenomic approach can be used to both better describe the condition, but also to look for earlier warning signs to be able to provide more effective treatment.
In some embodiments, the metagenome information is combined with other microbial data such as the fecal metabolomic data, which may be a combination of microbial and host metabolites. Other host information from fecal samples, such as cytokine data, may be added to the machine learning model to see additional interactions and determine what are the most significant influencers concerning either the presence or absence of NEC. Further, the host information may be used to determine if these most significant influencers change whether the sample is from an infant with stage 1, 2 or 3 NEC.
It is recognized that in some embodiments only a subset of the detected differences are clinically significant and that the data may be prioritized and/or limited based on a number of different markers; these markers may be part of key superpathways, and the superpathways may be defined as key metabolites, key enzyme activities and/or presence of key proteins to assess risk or by certain gene products.
It is also recognized that in some embodiments, the time frame for metagenomics may not be practical for the treatment of individuals but may be an effective strategy to evaluate specific population risk and also to evaluate the success of any risk mitigation strategy deployed in a healthcare setting. However, a subset of the metabolites, bacteria, or proteins identified by the metagenomic analysis as key risk factors can be developed into lab tests or, more preferably, point-of-care tests that provide information to evaluate the risk of a particular disease in a particular individual receiving treatment. The application of these tests provides a strategy for personalizing treatment protocols and therapies to suit individual needs.
It is also recognized that a subset of the metagenome and metabolomic analysis may be used to assess specific gut functions including but not limited to intestinal integrity. Intestinal integrity is a general term that may include factors such as tight junction integrity, wound healing capacity, mucus layer integrity, and/or bacterial translocation.
It may also be used to establish appropriate gut motility that may be measured as stooling patterns, number of stools per day and/or stool consistency.
In yet other embodiments, particular subsets may be used to control treatment of certain conditions or used to prevent certain conditions or symptoms in individuals. In some embodiments, the treatment of the individual first requires diagnostic and/or prognostic characterization.
A non-invasive approach that combined functional and taxonomical data from infant fecal samples was used to evaluate infant gut microbiomes and to develop an artificial intelligence (AI) model able to predict significant metagenomic biomarkers of NEC among a preterm infant population.
Cohort selection and data extraction. A total of eight studies were selected that performed shotgun metagenomic sequencing and matched the word “NEC” or “preterm” on the NCBI Sequence Read Archive (SRA). A summary of the studies and patient characteristics can be found in Table 1. For a sample to be included in the analysis, a minimum set of intrinsic metadata criteria had to be met with regard to reporting “day of life”, “NEC presence/absence”, “antibiotic treatment”, “country of origin”, “gestational age”, “delivery mode”, “feeding practice”, “sex” and “birth weight”. After applying filtering criteria based on metadata, a total of 1,647 shotgun metagenomic raw datasets were retained. These represent every shotgun metagenomics sequencing dataset from preterm babies available in the NCBI SRA.
Feature annotation. Samples were analyzed concurrently within the same pipeline. Taxonomic profiling of the metagenomic samples was performed using MetaPhlAn2 [Truong D T, Franzosa E A, Tickle T L, Scholz M, Weingart G, Pasolli E, Tett A, Huttenhower C, Segata N. 2015. MetaPhlAn2 for enhanced metagenomic taxonomic profiling. Nature methods 12:902] with default parameters, using the included library of clade-specific markers to provide panmicrobial (bacterial, archaeal, viral and eukaryotic) profiling. Functional gene characterization was performed using the HUMAnN2 [Franzosa E A, McIver L J, Rahnavard G, Thompson L R, Schirmer M, Weingart G, Lipson K S, Knight R, Caporaso J G, Segata N. 2018. Species-level functional profiling of metagenomes and metatranscriptomes. Nature methods 15:962] pipeline with default settings following the updated global profiling of the Human Microbiome Project analysis pipeline (2017) [Lloyd-Price J, Mahurkar A, Rahnavard G, Crabtree J, Orvis J, Hall A B, Brady A, Creasy H H, McCracken C, Giglio M G. 2017. Strains, functions and dynamics in the expanded Human Microbiome Project. Nature]. After running samples through the MetaPhlAn2 and HUMAnN2 pipelines, matrices were obtained containing taxonomic or functional annotations based on different classifications against the Uniref90 [Apweiler R, Bairoch A, Wu C H, Barker W C, Boeckmann B, Ferro S, Gasteiger E, Huang H, Lopez R, Magrane M. 2004. UniProt: the universal protein knowledgebase. Nucleic acids research 32:D115-D119], KEGG [Kanehisa M, Goto S. 2000. KEGG: kyoto encyclopedia of genes and genomes. Nucleic acids research 28:27-30] and MetaCyc [Caspi R, Foerster H, Fulcher C A, Kaipa P, Krummenacker M, Latendresse M, Paley S, Rhee S Y, Shearer A G, Tissier C. 2007. The MetaCyc Database of metabolic pathways and enzymes and the BioCyc collection of Pathway/Genome Databases. Nucleic acids research 36:D623-D631] databases.
Statistical analysis. Significantly different genes among treatments were estimated using the Kruskal-Wallis one-way analysis of variance, coupled with FDR or Bonferroni correction for multiple comparisons. A Bray-Curtis dissimilarity matrix was constructed to estimate global differences among samples and visualized via Principal Coordinate Analysis (PCoA). Permutational Multivariate Analysis of Variance Using Distance Matrices (adonis) was used to assess global microbiome differences between groups. The P-value for the PCoA panel was computed using F-tests based on sequential sums of squares from permutations of the raw data. P-values throughout this analysis are represented by asterisks (*, P<0.05; **, P<0.01; ***, P<0.001; ****, P<0.0001).
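The Bray-Curtis and PCoA steps can be sketched as follows. This is a minimal illustration that assumes a samples × features abundance matrix held as a NumPy array; the function name `pcoa` is hypothetical, the ordination is the classical (Gower) formulation rather than any particular package's implementation, and the permutational test (adonis) is not reproduced here.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform


def pcoa(abundance, n_axes=2):
    """Classical PCoA on a Bray-Curtis dissimilarity matrix.

    abundance: (n_samples, n_features) array of non-negative abundances,
    with no all-zero samples (Bray-Curtis is undefined for those).
    Returns the dissimilarity matrix, sample coordinates, and the fraction
    of positive-eigenvalue variance captured by each returned axis.
    """
    d = squareform(pdist(np.asarray(abundance, dtype=float), metric="braycurtis"))
    n = d.shape[0]
    centering = np.eye(n) - np.ones((n, n)) / n
    b = -0.5 * centering @ (d ** 2) @ centering      # Gower double-centering
    eigvals, eigvecs = np.linalg.eigh(b)
    order = np.argsort(eigvals)[::-1]                # largest eigenvalue first
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    top = np.clip(eigvals[:n_axes], 0, None)         # ignore negative eigenvalues
    coords = eigvecs[:, :n_axes] * np.sqrt(top)
    explained = top / eigvals[eigvals > 0].sum()
    return d, coords, explained
```

The first two columns of `coords` correspond to the axes typically drawn in a PCoA panel, with samples colored by NEC status.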
A total of 1,712 raw publicly available shotgun metagenomic datasets were collected (NEC=253; healthy preterm=1,459) and entered into a data analysis pipeline consisting of a number of processing steps, run concurrently within the same pipeline, that yield meaningful outputs from the metagenomic data set. Taxonomic profiling of the metagenomic samples was performed using MetaPhlAn2 with default parameters, using the included library of clade-specific markers to provide panmicrobial (bacterial, archaeal, viral and eukaryotic) profiling. Functional gene characterization was performed using the HUMAnN2 pipeline with default settings following the updated global profiling of the Human Microbiome Project analysis pipeline. After running the MetaPhlAn2 and HUMAnN2 pipelines, a plurality of different matrices were obtained containing taxonomic or functional annotations based on different classifications against the Uniref90, KEGG, and MetaCyc databases. After quality filtering of sequence datasets, a subset of the data (n=1,647) was selected for downstream analysis. The dataset was divided based on corrected gestational age (cGA) according to NEC occurrence. This dataset was the input for several artificial intelligence (AI)/machine learning models (Random Forest and Gradient Boosting classifiers). The different models were used to identify functional core biomarkers able to distinguish NEC from healthy preterm infant microbiomes.
Data preparation and feature engineering. An initial two datasets, an unstratified pathway abundance dataset and a pathway abundance dataset stratified by bacterial species, were divided into smaller datasets by corrected gestational age (cGA). Each dataset was divided into samples with cGA lower than 29 weeks and samples with cGA of 29 weeks or higher. Each of these four datasets was further divided into four smaller datasets: a training set with the original NEC distribution, a training set with an oversampled NEC distribution, a testing set (20%) of unique samples, and a validation set (20%) of unique samples.
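The data partitioning described above can be sketched as follows. This is an illustrative example only: the function names are hypothetical, the 60% training share is inferred from the two 20% holdout sets, and simple random oversampling of NEC samples (label 1) is an assumption, since the oversampling method is not specified here.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.utils import resample


def split_by_age(X, y, cga_weeks, cutoff=29):
    """Split samples into corrected-gestational-age strata (< cutoff, >= cutoff)."""
    X, y, cga_weeks = np.asarray(X), np.asarray(y), np.asarray(cga_weeks)
    young = cga_weeks < cutoff
    return {"lt_cutoff": (X[young], y[young]), "ge_cutoff": (X[~young], y[~young])}


def split_and_oversample(X, y, seed=310):
    """Return original and oversampled training sets plus test and validation sets."""
    X, y = np.asarray(X), np.asarray(y)
    # 60 / 20 / 20 split, stratified so each part keeps the NEC proportion
    X_train, X_hold, y_train, y_hold = train_test_split(
        X, y, test_size=0.4, stratify=y, random_state=seed)
    X_test, X_val, y_test, y_val = train_test_split(
        X_hold, y_hold, test_size=0.5, stratify=y_hold, random_state=seed)

    # Oversample NEC in the training set only, so holdout samples stay unique
    minority = np.flatnonzero(y_train == 1)
    n_extra = max(len(y_train) - 2 * len(minority), 0)
    keep = np.arange(len(y_train))
    if n_extra and len(minority):
        extra = resample(minority, replace=True, n_samples=n_extra, random_state=seed)
        keep = np.concatenate([keep, extra])

    return {
        "train": (X_train, y_train),
        "train_oversampled": (X_train[keep], y_train[keep]),
        "test": (X_test, y_test),
        "validation": (X_val, y_val),
    }
```

Oversampling is applied after the split so that duplicated NEC samples never leak into the testing or validation sets.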
Machine Learning. A decision tree is a common classification model where, to classify the target, the optimal split from the optimal feature is serially made to maximize accuracy (or some other metric). This results in a hierarchical model where each node is used as a filter until a sample is classified. Random forests are ensembles of individual decision trees where voting is implemented to determine the final prediction of the ensemble and only a subset of random features is considered for each optimal split in each tree. Thus, each composing tree is significantly different from all others in the model and captures a different signal from the data upon which it is trained.
A Gradient Boosting Classifier is similar to a random forest; however, it determines the criterion for splitting on a feature by creating and minimizing a differentiable loss function of the entire tree. It then tunes these values with successively smaller tweaks and aggregates all trees into an ensemble.
For each training dataset, a Random Forest Classifier and a Gradient Boosting Classifier from Python's scikit-learn library were trained. Models were trained to predict NEC occurrence from stratified and unstratified bacterial superpathways from each of the 8 datasets. Hyperparameters for a gradient boosting classifier and random forest classifier were grid-searched for each dataset, resulting in the final 16 models.
The Ideal Hyperparameters for the Random Forest Model Through Grid-Search
For each Random Forest model, the following hyperparameters were tested. Bootstrap was set to ‘True’. Max depth was grid-searched for each dataset between 1, 2, 3, 5, 8, 12, and ‘None’. The number of estimators was set to 500. Random state was set to 310 and all other hyperparameters were left at scikit-learn's default values. For each Gradient Boosting model, the learning rate was grid-searched for each dataset across 0.1, 0.15, 0.2, and 0.3, the max depth across 1, 2, 3, 4, 5, 6, and ‘None’, and the minimum number of samples per leaf across 1, 2, 3, and 4. The number of estimators was also set to 500. Random state was set to 310 and all other hyperparameters were left at scikit-learn's default values. Feature importances were calculated from the highest performing hyperparameters using Gini importance scores. Because Gini importance scores account for the impurity at each node, these scores were expected to change significantly between the balanced and unbalanced datasets. Thus, to confirm findings from feature importance scores permutation importances were also calculated on the test dataset and compared.
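The grid search and the comparison of Gini and permutation importances can be sketched with scikit-learn as follows. The grids, the 500 estimators and the random state of 310 are taken from the text above; the function names, the use of `GridSearchCV` with 3-fold cross-validation and accuracy scoring are assumptions made for illustration, and support for `max_depth=None` in the gradient boosting classifier may depend on the scikit-learn version.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import GridSearchCV

RF_GRID = {"max_depth": [1, 2, 3, 5, 8, 12, None]}
GB_GRID = {
    "learning_rate": [0.1, 0.15, 0.2, 0.3],
    "max_depth": [1, 2, 3, 4, 5, 6, None],
    "min_samples_leaf": [1, 2, 3, 4],
}


def fit_models(X_train, y_train, n_estimators=500, cv=3, rf_grid=None, gb_grid=None):
    """Grid-search a random forest and a gradient boosting classifier."""
    X_train, y_train = np.asarray(X_train), np.asarray(y_train)
    rf_search = GridSearchCV(
        RandomForestClassifier(n_estimators=n_estimators, bootstrap=True,
                               random_state=310),
        rf_grid or RF_GRID, cv=cv, scoring="accuracy")
    gb_search = GridSearchCV(
        GradientBoostingClassifier(n_estimators=n_estimators, random_state=310),
        gb_grid or GB_GRID, cv=cv, scoring="accuracy")
    rf_search.fit(X_train, y_train)
    gb_search.fit(X_train, y_train)
    return rf_search.best_estimator_, gb_search.best_estimator_


def compare_importances(model, X_test, y_test, n_repeats=10):
    """Return impurity-based (Gini) and permutation importances per feature."""
    gini = model.feature_importances_
    perm = permutation_importance(model, X_test, y_test,
                                  n_repeats=n_repeats, random_state=310)
    return gini, perm.importances_mean
```

Features that rank highly under both measures are less likely to be artifacts of class imbalance in the training set.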
Ranking. Approximately 6.1 million Uniref_90 proteins were filtered by conducting a Kruskal-Wallis test on each protein, retaining only the 3,420 statistically significant features. A feature ranking of these Uniref_90 proteins was then determined by conducting recursive feature elimination on a random forest classifier.
Scikit-learn's Recursive Feature Elimination algorithm was implemented where the hyperparameters for the most performant model identified through grid-search were utilized. A train, test, and validation accuracy score was calculated for each set of top ranked features. Thus, the minimum number of features required to obtain consistent maximal accuracy was determined. A model was then trained utilizing the ideal hyperparameters previously identified and was tested on two holdout datasets.
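The ranking step can be sketched with scikit-learn's `RFE` class as follows. This is a minimal illustration: the function names are hypothetical, eliminating one feature per iteration (`step=1`) is an assumption chosen so that every feature receives a unique rank, and the hyperparameters would in practice be those found by the grid search.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE


def rank_features(X, y, n_estimators=100, max_depth=None, step=1):
    """Rank features by recursive feature elimination (1 = most important)."""
    forest = RandomForestClassifier(n_estimators=n_estimators,
                                    max_depth=max_depth, random_state=310)
    rfe = RFE(forest, n_features_to_select=1, step=step)
    rfe.fit(np.asarray(X), np.asarray(y))
    return rfe.ranking_


def accuracy_curve(X_train, y_train, X_eval, y_eval, ranking, ks,
                   n_estimators=100, max_depth=None):
    """Accuracy on an evaluation set when only the top-k ranked features are used."""
    order = np.argsort(ranking)          # best-ranked feature first
    scores = {}
    for k in ks:
        cols = order[:k]
        model = RandomForestClassifier(n_estimators=n_estimators,
                                       max_depth=max_depth, random_state=310)
        model.fit(X_train[:, cols], y_train)
        scores[k] = model.score(X_eval[:, cols], y_eval)
    return scores
```

Calling `accuracy_curve` on the train, test and validation sets for increasing `k` gives the curves from which the smallest feature set with consistently maximal accuracy can be read off.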
As a comparison, a random forest model was trained on the full feature-set of the gene families dataset with a train:test:validation split of 60:20:20. A machine with 468 GB of RAM and 64 cores was utilized. The hyperparameters utilized were n_estimators=300, max_depth=None, random_state=310 and oob_score=True.
Globally, 928 different microbial species were identified (4 Archaea; 9 Eukaryota; 7 Viroids; 397 Bacteria; 511 Viruses).
Besides taxonomic profiling we were able to characterize the functional microbiome in terms of protein coding genes as well as superpathways. Gene family entries were converted into pathways. By default, HUMAnN2 uses MetaCyc pathway definitions and MinPath to identify a parsimonious set of pathways that explain observed reactions in the community. This led to a matrix of 1,605 (samples)×19,039 (pathway) or 30.5 million entries. First, Principal Component Analysis (PCA) was used to investigate our data set both across taxonomic and gene features. This revealed insights into the structure of the data from both a sample and a feature perspective.
Second, we divided the sampling size into different subsets based on corrected gestational age and applied random forest techniques to assess whether the NEC or healthy preterm status could be predicted based on microbiome signatures. Since there is no previous indication on which microbial feature should be over or under abundant in NEC vs. healthy preterm state, we used the Kruskal-Wallis test to determine the subset of gene families that are most statistically significant between NEC and healthy preterms. From the Kruskal-Wallis test we selected entries with an adjusted p<0.0001 (Bonferroni). The 3,420 significant gene families were then converted into KEGG functional orthologs (KO), resulting in 155 KO features (Table 3). The 3,420 gene families were further analyzed to look for redundant functions. For instance, if the same enzyme was identified from two different bacteria, this would give two different gene family entries from the UniProt database but converted in KEGG would result in one KO entry (namely an ortholog with same function independently from its taxonomic origin). Any KO might consist of multiple UniProt with the commonality of being related by vertical descent from a common ancestor and encoding proteins with the same function in different species. Therefore, we have determined the most statistically significant over and under abundant KEGGs in NEC state.
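The per-feature Kruskal-Wallis filter with Bonferroni adjustment can be sketched as follows. The function name and the samples × features array layout are assumptions made for illustration; the adjusted threshold of 0.0001 is the one stated above.

```python
import numpy as np
from scipy.stats import kruskal


def kruskal_bonferroni(X, y, alpha=1e-4):
    """Test each feature for a NEC vs. healthy-preterm difference.

    X: (n_samples, n_features) abundances; y: 1 for NEC, 0 for healthy preterm.
    Returns Bonferroni-adjusted p-values and the indices of features whose
    adjusted p-value is below alpha.
    """
    X, y = np.asarray(X, dtype=float), np.asarray(y)
    n_features = X.shape[1]
    pvals = np.ones(n_features)
    for j in range(n_features):
        col = X[:, j]
        if col.min() == col.max():
            continue                     # constant feature: test is undefined
        pvals[j] = kruskal(col[y == 1], col[y == 0]).pvalue
    adjusted = np.minimum(pvals * n_features, 1.0)   # Bonferroni correction
    return adjusted, np.flatnonzero(adjusted < alpha)
```

Because the correction multiplies each p-value by the number of features tested, only very strong differences survive when millions of gene families are screened.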
Bifidobacteriaceae were lower in infants with NEC and this was also true for Bifidobacterium longum (B. longum), which includes the subspecies B. longum subsp. infantis (B. infantis). In contrast, Enterobacteriaceae and in particular, Enterobacter cloacae (
The data set was further evaluated and here we report examples of significant proteins (Table 2) and KEGG gene orthologs (Table 3) identified among samples.
In a further analysis, the top 100 predictive stratified superpathways were identified from the Gini feature importances of trained models (Table 4). The index of each ranked feature was taken for each model and compared across models. This demonstrates the process for developing new biomarkers based on AI models.
Proteins and superpathways identified among samples. The largest dataset produced represented a matrix of 11,026,566 (Uniref90 hits)×1,605 (samples; 245 NEC positive) or 17.7 billion entries. Gene family entries were converted into pathways. By default, HUMAnN2 uses MetaCyc pathway definitions and MinPath to identify a parsimonious set of pathways that explain observed reactions in the community. This led to a matrix of 1,605 (samples)×595 (pathway) or ˜955 thousand entries. The stratified matrix had 18,442 features when considering the superpathway and the respective contributing bacterial species. First, we used Principal Component Analysis (PCA) to investigate our data set across both taxonomic and gene features. This revealed insights into the structure of the data from both a sample and a feature perspective. Second, we divided the samples into different subsets based on corrected gestational age and applied random forest techniques to assess whether the NEC or healthy preterm status could be predicted based on microbiome signatures. Since there is no previous indication on which microbial feature should be over or under abundant in NEC vs. healthy preterm state, we used the Kruskal-Wallis test coupled with Bonferroni correction to determine the subset of gene families that are most statistically significant between NEC and healthy preterms. From the Kruskal-Wallis test we selected entries with an adjusted p<0.0001 (Bonferroni). The 3,420 significant gene families were then converted into KEGG functional orthologs (KO), resulting in 155 KO features. Therefore, we have determined the most statistically significant over and under abundant KEGGs in NEC state.
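The conversion of gene families into KEGG orthologs, in which the same function found in different species yields a single KO entry, can be sketched with pandas as follows. The function name, the identifiers in the usage note and the gene-to-KO lookup are hypothetical; in practice the mapping would come from the annotation databases named above.

```python
import pandas as pd


def collapse_to_ko(gene_abundance: pd.DataFrame, gene_to_ko: dict) -> pd.DataFrame:
    """Sum gene-family abundances that map to the same KEGG ortholog.

    gene_abundance: rows = gene families, columns = samples.
    gene_to_ko: lookup from gene-family ID to KO ID (illustrative input).
    Gene families without a KO assignment are dropped.
    """
    ko_labels = gene_abundance.index.to_series().map(gene_to_ko)
    return gene_abundance.groupby(ko_labels).sum()
```

For example, two gene families from different bacteria that both map to one ortholog would be merged into a single row whose abundance is the sum of the two.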
Microbial-driven arginine depletion in the intestine is characteristic of NEC. A total of 2,732 biomarkers, each a combination of a KEGG ID with a specific bacterial species, presented the highest risk for NEC. When those biomarkers were grouped by the pathway in which they are involved, the microbiome-mediated arginine (Arg) metabolism pathway was identified as differing in the NEC cases compared to controls (
Table fragment (only the organism and gene-name entries are recoverable, listed in their original order; the remaining data is missing or illegible when filed): Enterobacter cloacae; Veillonella atypica; Klebsiella pneumoniae IS43; Klebsiella pneumoniae; Veillonella sp. ICM51a; XY_41090; Bacteroides xylanisolvens; Enterobacter asburiae; Bacteroides fragilis str.; ceJ; Enterobacter cloacae EcWSU1; Klebsiella pneumoniae; Bacteroides dorei 5_1_36/D4; Bacteroides fragilis str.; Klebsiella pneumoniae; Klebsiella pneumoniae; Klebsiella pneumoniae; Enterobacter cloacae P101; Finegoldia magna BVS033A4; Klebsiella pneumoniae; rcnA_2; Klebsiella pneumoniae; Bacteroides sp. 2_2_4; Klebsiella pneumoniae; Streptococcus salivarius; Bacteroides sp. 3_1_19; Enterobacter cloacae EcWSU1; Bacteroides sp. 2_2_4; Finegoldia magna BVS033A4; Bacteroides sp. 3_1_19; Streptococcus salivarius; Klebsiella pneumoniae.
Escherichia coli 541-15
Escherichia coli M718
Acetobacterium woodii
Enterococcus saccharolyticus
Bacteroides fragilis
Bacteroides fragilis
Bacteroides fragilis
Bacteroides xylanisolvens
Staphylococcus virus IPLA88
Bacteroides xylanisolvens
Staphylococcus aureus
Klebsiella pneumoniae
Enterobacter cloacae
Enterobacter asburiae
Staphylococcus phage
Staphylococcus virus 187
Staphylococcus phage
Staphylococcus phage
Staphylococcus phage
Staphylococcus phage
Enterobacter cloacae EcWSU1
ninGNTH1728_1
Haemophilus influenzae
Finegoldia magna BVS033A4
Bacteroides sp. 2_2_4
Staphylococcus phage
Staphylococcus phage
Enterobacter cloacae EcWSU1
Staphylococcus phage
indicates data missing or illegible when filed
Legend for Tables 5 and 6. The tables show the most important microbial genes identified by the model to discriminate between NEC and controls. ID=UniProt gene ID; Protein names=UniProt protein name; Gene names=UniProt gene name; Organism=The taxonomic affiliation of the gene; Length=The protein length in aa; ID_proc=Uniref_90 ID; Healthy preterm mean=Mean value of the gene in CPM (copies per million) in healthy preterm controls; NEC mean=Mean value of the gene in CPM (copies per million) in NEC samples; Log2 FC=The log2 fold change of CPM values between NEC and controls. Fold change is the mean value in NEC divided by the mean value in healthy preterm controls. If the genes reported in the tables are removed from the input, the predictive model collapses; that is, the model is no longer able to discriminate between NEC and controls with accuracy meaningfully better than random guessing. Therefore, the listed genes are the most influential genes, and they appear to be consistently higher in the NEC samples compared to controls. The genes are ranked by their importance in the model, in terms of predictiveness of NEC (Table 7).
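The Log2 FC column defined in the legend can be reproduced from the two mean columns. The following is a minimal sketch; the function name and the example CPM values are hypothetical, and rejecting non-positive means (rather than adding a pseudocount) is an assumption, since the handling of zero values is not stated.

```python
import math


def log2_fold_change(nec_cpm, control_cpm):
    """Log2 of (mean NEC CPM / mean healthy preterm CPM), per the legend."""
    nec_mean = sum(nec_cpm) / len(nec_cpm)
    control_mean = sum(control_cpm) / len(control_cpm)
    # Assumption: a zero mean is rejected instead of being padded with a
    # pseudocount, because the source does not say how zeros were handled.
    if nec_mean <= 0 or control_mean <= 0:
        raise ValueError("both group means must be positive")
    return math.log2(nec_mean / control_mean)


# Hypothetical CPM values for one gene: NEC mean 8, control mean 2.
example_fc = log2_fold_change([6.0, 10.0], [1.0, 3.0])
print(example_fc)
```

A positive value indicates a gene that is more abundant in NEC samples; a value of 1 corresponds to a twofold higher mean CPM in NEC.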
To determine the minimum number of samples required for training an informative model, a random forest classifier was trained on a random subset of features, and the mean accuracy was obtained for each sample size. With an even class distribution, a minimum of 30 samples begins to yield discriminatory power. It was determined that approximately 10,000 features would best eliminate overfitting; however, approximately 1,000 features would yield sufficient explanatory power for treatment purposes.
Each model was used to obtain the percent risk of each sample being classified as NEC positive. Treatment courses could then be taken to minimize the risk of developing NEC for samples whose predicted risk meets a high-risk threshold set between 20 and 50%.
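As an illustration of how a percent risk could be turned into a treatment flag, the sketch below assumes that the percent risk is the share of trees in the forest voting NEC positive. The source does not state how the percentage was computed, so that definition, the function names, and the sample identifiers are all assumptions.

```python
def percent_risk(tree_votes):
    """Percent of trees voting NEC positive (1 = NEC, 0 = healthy preterm)."""
    return 100.0 * sum(tree_votes) / len(tree_votes)


def flag_high_risk(sample_votes, threshold_pct=20.0):
    """Return {sample_id: percent risk} for samples at or above the threshold.

    The threshold is restricted to the 20-50% range described in the text.
    """
    if not 20.0 <= threshold_pct <= 50.0:
        raise ValueError("threshold_pct must be between 20 and 50")
    risks = {sid: percent_risk(votes) for sid, votes in sample_votes.items()}
    return {sid: risk for sid, risk in risks.items() if risk >= threshold_pct}


# Hypothetical votes from a small forest for two samples.
votes = {
    "infant_01": [1, 0, 0, 0],                     # 25% of trees vote NEC
    "infant_02": [0, 0, 0, 0, 0, 0, 0, 0, 0, 1],   # 10% of trees vote NEC
}
flagged = flag_high_risk(votes, threshold_pct=20.0)
print(flagged)
```

A lower threshold flags more infants for intervention at the cost of more false positives; the choice within the 20 to 50% range would depend on the clinical setting.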
In some embodiments of this invention the risk for NEC is determined by the detection and/or quantification of the biomarkers listed in Table 7 or any combination thereof. In preferred embodiments of this invention NEC risk is determined based on the detection and/or quantification of any combination of the UniRef90_G2SBG8, UniRef90_B5XVF2, UniRef90_Q8SDM3, UniRef90_D71XQ4, UniRef90_X8H364, UniRef90_B5XPQ3, UniRef90_G2S602, UniRef90_G810W8, UniRef90_W1GNF8, and UniRef90_Q64WL9 biomarkers, or homologues thereof. In more preferred embodiments of this invention determination of the risk of NEC can be made by the detection and/or quantification of the following biomarkers, or homologues thereof, and/or the presence of an organism associated with the detection of the relevant biomarker, as follows: UniRef90_G2SBG8, an integrase family protein associated with Enterobacter asburiae; UniRef90_B5XVF2, a PAP2 family protein associated with Klebsiella pneumoniae; UniRef90_Q8SDM3, a phi ETA irf 18-like protein associated with Staphylococcus phage phi13; UniRef90_D71XQ4, a ribose phosphate pyrophosphokinase associated with Bacteroides sp.; UniRef90_X8H364, an arylsulfatase associated with Veillonella sp.
In some embodiments of this invention the risk of NEC may be determined by the presence/absence and/or the quantification of any combination of microbial organisms enumerated in Table 5 and Table 6. In preferred embodiments of this invention determination of the risk for NEC can be made by the detection and/or quantification of Klebsiella spp., Veillonella spp., Bacteroides spp., Enterobacter spp., Bacteriophage phi-13, Bacteriophage phi-11, or any combination thereof. In preferred embodiments of this invention the risk of NEC may be determined by the presence/absence and/or quantification of Klebsiella pneumoniae, Enterobacter asburiae, Bacteroides fragilis, Veillonella sp. ICM51a, Bacteriophage phi-13, and/or Bacteriophage phi-11, or any combination thereof.
Biomarkers identified by this process can be used to diagnose and monitor infants in the NICU to highlight dysbiosis, indicate dysfunction, and predict risk factors to stratify infants and treat the underlying dysbiosis and/or dysfunction through therapies designed to treat the observed dysbiosis. In some cases the therapy may include the addition of Bifidobacterium and more specifically B. infantis to reverse dysbiosis in these preterm infants. Therapeutic steps for this invention are described in WO 2016/065324, WO 2016/149149, WO 2017/156550, and WO 2018/006080, incorporated herein by reference.
This information may also be used to design antimicrobial therapies that target microbial pathways without interfering with host metabolic pathways or those of beneficial bacteria.
The invention can be used to evaluate any microbiome associated with the body, including but not limited to the vaginal, gut, skin, buccal, milk, or other surfaces that have a specific microbiome that might be implicated in NEC. Surfaces in the environment may also be evaluated for their contribution of virus, bacteria, mold and/or yeast. In some embodiments, one or more of the microbiomes in the preterm or term infant, or surrounding the preterm or term infant, is used as part of the AI model. In other embodiments, host data including anthropometry, blood work, fecal cytokines, fecal calprotectin, and T cell profiles may also be used in an AI model to evaluate success in altering the risk profile for preterm infants born into specific hospital systems and to assess risk of NEC.
To assess risk to the preterm infant, a particular group may also be monitored as a group residing in a particular part of the hospital or health care system, such as, but not limited to, hospitalized patients in the neonatal intensive care unit, the pediatric intensive care unit, the intensive care unit for non-pediatric patients (i.e., adults), the emergency room, the cardiology unit, the psychiatric unit, or the neurology unit in which bacteria containing the elements of. It may also be applied to specific outpatient facilities with particular risks, including infections and more particularly antibiotic-resistant infections, where the risks are known but the best treatment strategy is unknown.
Machine learning as described herein may be used to understand the dispersion of antibiotic resistance genes across a health system and/or geographic region, to understand risk and provide data driven strategies to improve antibiotic stewardship and/or to understand the emergence of new resistance and/or to understand the full resistome to better prescribe antibiotics to reduce treatment failure in NEC.
A dashboard or risk-assessment system provides a tool for a clinician to monitor the health of a preterm infant, and to alter and/or implement a treatment regime for an infant who is at particular risk of a condition or disease based on the environment they find themselves in, their genetic predisposition to particular conditions, or a pre-clinical presentation of risk that is a precursor to overt symptoms (i.e., intestinal integrity).
A subset of proteins, enzymes, peptides, or metabolites selected from Table 5 and/or 6 can be monitored to inform the clinician of risk.
The genes identified in Tables 5 and 6 may be monitored with a PCR method that amplifies one or more genes from Table 5 or 6 using specific validated primers to look for fold changes. Inflammatory markers such as calprotectin or fecal cytokines may be monitored. ATP or lactate dehydrogenase levels may also be monitored.
Embodiments of this test may be used to improve known treatments and to ensure that treatment is effective in reducing the presence of the organisms and genes identified in Tables 5 and 6. The introduction of B. infantis in a diet that contains human milk oligosaccharides or their functional equivalents is one such treatment for the prevention or reduction in risk for NEC. Premature infant treatment is complicated by routine use of antibiotics and other medicines that may render the addition of probiotics and prebiotics to improve microbiome function less effective. In an embodiment, B. infantis, alone or in combination with other probiotic bacteria, is used as part of the standard of care. In a preferred embodiment, Bifidobacterium longum subsp. infantis may comprise a functional H5 gene cluster (genes required for successful colonization of the infant gut), including Bifidobacterium longum subsp. infantis EVC001 deposited under ATCC Accession No. PTA-125180 (“Deposited Bifidobacterium”).
Hospitals have the opportunity to assess risk based on banked fecal samples in different hospital units. A cohort may be established that analyzes the metagenomes of all hospitalized individuals within that cohort, separated into those that developed disease and those that did not, or into responders and non-responders to a given treatment. The analysis provides an output of major taxa, superpathways, metabolites, enzyme activities, or proteins associated with disease risk. In that particular unit for that particular condition, a treatment plan or protocol can be implemented aimed at eliminating a key risk factor. The success of the treatment, process or protocol may be assessed by collecting samples from the cohort after the change in practice. The post-change cohort validates the success of the reduction in risk associated with specific treatments, protocols or processes.
The above may be applied to environmental monitoring of hospital environments for key taxa associated with NEC. If Klebsiella were identified as a key risk in a specific hospital environment, a new cleaning protocol known to reduce Klebsiella on hospital surfaces would be implemented in order to reduce transmission to the infant. Following a set time frame, new fecal samples are taken to assess the success of the intervention. Machine learning requires a minimum of 30 independent samples to assess the success of any given treatment.
Impaired intestinal integrity is considered a risk factor for many disease conditions, including NEC and late-onset sepsis. Leaky gut results when there is insufficient intestinal integrity.
A B. infantis EVC001-dominant microbiome produces metabolites that improve enterocyte proliferation in vitro.
Short chain fatty acids (SCFAs) are an important energy source for host cells to maintain homeostasis. Indeed, SCFAs account for 50-70% of the energy used by intestinal epithelial cells (IECs) and provide nearly 10% of our daily caloric requirements. Given previous findings showing that infants colonized with B. infantis EVC001 have significantly increased fecal SCFA concentrations compared to infants not colonized with B. infantis, we investigated the effect of fecal water (FW) from these two distinct populations on enterocyte proliferation and morphology in vitro.
Fecal waters (FW) were derived from fecal samples from infants colonized with B. infantis EVC001 (EVC001) and infants not colonized with B. infantis (controls). FW were added to adult and premature enterocyte cell lines to assess growth, proliferation and cytotoxicity. Microscopic images were taken to observe morphological differences.
Intestinal epithelial cells (Caco-2 and HIEC-6 cells) exposed to EVC001 FW showed significantly increased proliferation, as shown by cell count and real-time ATP expression, compared to medium alone and control FW (P<0.0001). Conversely, significantly decreased lactate dehydrogenase, an indication of decreased membrane integrity, was detected in enterocytes exposed to EVC001 FW compared to control FW (P<0.01). Furthermore, control FW altered the morphology of enterocytes compared to cells exposed to EVC001 FW or medium alone.
EVC001 FW significantly increased enterocyte proliferation compared to control FW and medium alone, while control FW negatively affected cell growth, membrane integrity and cell morphology, suggesting that SCFAs produced by B. infantis EVC001 promote enterocyte growth and improve intestinal integrity in infants.
This in vitro model is applicable to assessing the effect of any of the metabolites identified herein, and specifically to evaluating the effect on intestinal integrity of fecal waters from microbiomes expected to deplete arginine (Arg). The addition of supplemental arginine can be investigated. This model may be used to evaluate fecal waters from healthy preterm infants, those supplemented with B. infantis and those with NEC. This model may also be used to evaluate the effect of specific inhibitors of microbial arginine pathways in limiting the growth of those organisms. This method can be used to help develop new targeted antimicrobials against the bacteria specifically implicated in NEC.
Filing Document | Filing Date | Country | Kind |
---|---|---|---|
PCT/US2020/012277 | 1/4/2020 | WO | 00 |