None.
This invention was made by or on behalf of parties to a joint research agreement entitled “Collaboration Agreement” effective as of May 13, 2019 between Viome, Inc. and Queensland University of Technology.
None.
Microbiome refers to the collection of microorganisms—bacteria, fungi and viruses—that inhabit the body of multicellular organisms. The microbiome inhabits many different parts of the human body, including, for example, mouth, throat, gut, skin, eye, nose, bronchi, urethra, and vagina. Microbes commonly found in the human microbiome include, for example, Escherichia, Haemophilus, Streptococcus, Neisseria, Bacteroides, Clostridium, Mycobacterium, Pseudomonas, Spirochaeta and Mycoplasma.
Microbiome composition (taxonomy) and activity can be associated with wellness and health conditions. Knowledge of such associations can be useful for the determination and treatment of such conditions. Alterations in a subject's microbiome content and activity can impact wellness and health.
Oral cancers express genes that healthy tissue does not. Oral cancer cells may also have genetic and epigenetic variations that are different from healthy tissues. These include primary sequence variants (SNPs, indels, translocations, etc.) and post-transcriptional modifications, such as RNA base modifications, splice variants, etc.
The accompanying drawings, which are incorporated herein and form a part of the specification, illustrate exemplary embodiments and, together with the description, further serve to enable a person skilled in the pertinent art to make and use these embodiments and others that will be apparent to those skilled in the art. The invention will be more particularly described in conjunction with the following drawings wherein:
In one aspect, provided herein is a method for inferring a state of oral cancer in a subject, comprising: a) providing a biological sample from a subject comprising an oral microbiome, and, optionally, somatic host cells; b) sequencing nucleic acids from the sample to produce sequence information; c) determining, from the sequence information, measures of activity of each of one or more microbial taxa and/or measures of activity of one or more gene orthologs, wherein the one or more measures are included in a feature set; and d) executing by computer a classification model that infers, from one or more features in the feature set, a state of oral cancer in the subject. In one embodiment the method further comprises e) outputting the inference to a user interface device or to computer-readable memory. In another embodiment the method further comprises e) delivering and/or administering to the subject a therapeutic intervention effective to treat the oral cancer. In another embodiment the classification model classifies presence or absence of oral cancer. In another embodiment the classification model classifies a stage of oral cancer (e.g., selected from stage 0, stage 1, stage 2, stage 3, stage 4). In another embodiment the nucleic acids comprise a microbial metatranscriptome. In another embodiment the nucleic acids further comprise host nucleic acids. In another embodiment the subject is a human. In another embodiment the classification model uses features selected from both microbial taxa activity and gene ortholog activity. In another embodiment the classification model uses one or more features selected from the features of Table 1.
In another embodiment the classification model uses at least, exactly or no more than any of 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119, 120, 121, 122, 123, 124, 125, 126, 127, 128, 129, 130, 131, 132, 133, 134, 135, 136, 137, 138, 139, 140, 141, 142, 143, 144, 145, 146, 147, 148, 149, 150, 151, 152, 153, 154, 155, 156, or 157 of the features selected from the features of Table 1. In another embodiment the classification model uses at least, exactly or no more than any of 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, or 17 of the features selected from: Actinobaculum sp. oral taxon 183, Actinomyces massiliensis, Actinomyces sp. oral taxon 448, Alloscardovia omnicolens, Selenomonas sp. CM52, Mycoplasma salivarium, Parvimonas sp. oral taxon 110, Rothia sp. HMSC062H08, K01697, K12452, Actinomyces johnsonii, Prevotella loescheii, Streptococcus cristatus, Streptococcus sobrinus, Streptococcus sp. HPH0090, Tannerella forsythia, and K02909. In another embodiment the features of Table 1 include one or more microbial taxa features and/or one or more gene ortholog features. In another embodiment the features of Table 1 include one or more positively associated features and/or one or more negatively associated features. In another embodiment the classification model uses only features selected from the features of Table 1. In another embodiment the oral cancer is selected from squamous cell carcinoma, verrucous carcinoma, minor salivary gland carcinoma, lymphoma, benign oral cavity tumor and basal cell carcinoma.
In another aspect provided herein is a method comprising: a) providing biological samples from each of a first set of subjects and a second set of subjects, wherein the biological samples comprise an oral microbiome, and, optionally, somatic host cells, and wherein the first set of subjects have oral cancer present and the second set of subjects have oral cancer absent; b) sequencing nucleic acids in the biological samples to provide sequence information; and c) performing a statistical analysis on the sequence information to produce a model that infers a state of oral cancer in a subject based on sequence information. In one embodiment the statistical analysis comprises a model developed by machine learning.
In another aspect provided herein is a method comprising: a) providing a biological sample from a subject, wherein the biological sample comprises an oral microbiome; b) sequencing nucleic acids in the biological sample to provide sequence information; c) executing a model of claim 14 on the sequence information to infer a state of oral cancer in the subject based on the sequence information; and d) outputting the inference to a user interface device or to computer-readable memory.
In another aspect provided herein is a method comprising: a) administering to a subject inferred to have oral cancer by a method of claim 1 or as disclosed herein, a therapeutic intervention effective to treat the oral cancer.
In another aspect provided herein is a system comprising: (a) a computer comprising: (i) a processor; and (ii) a memory, coupled to the processor, the memory storing a module comprising: (1) nucleic acid sequence information from a biological sample from a subject comprising an oral microbiome; (2) a classification model which, based on values including the measurements, classifies the subject as having oral cancer present or absent, wherein the classification model is configured to have a sensitivity of at least 75%, at least 85% or at least 95%; and (3) computer executable instructions for implementing the classification model on the test data.
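By way of non-limiting illustration, the sensitivity recited above can be checked against labeled evaluation data as the fraction of subjects with oral cancer present whom the model also classifies as present. The following is a minimal sketch only; the function name `sensitivity`, the label strings, and the example labels are hypothetical and are not taken from this disclosure.

```python
def sensitivity(true_labels, predicted_labels, positive="cancer"):
    """Fraction of truly positive subjects that the model also calls positive."""
    true_pos = 0
    false_neg = 0
    for truth, pred in zip(true_labels, predicted_labels):
        if truth != positive:
            continue  # sensitivity considers only subjects with the condition
        if pred == positive:
            true_pos += 1
        else:
            false_neg += 1
    total = true_pos + false_neg
    if total == 0:
        raise ValueError("no positive subjects in the evaluation set")
    return true_pos / total


# Hypothetical evaluation labels (illustrative only, not study data).
truth = ["cancer", "cancer", "cancer", "cancer", "healthy", "healthy"]
calls = ["cancer", "cancer", "cancer", "healthy", "healthy", "cancer"]
sens = sensitivity(truth, calls)

# Compare against the lowest of the recited thresholds (75%).
meets_threshold = sens >= 0.75
```

In this sketch, false positives among subjects without oral cancer do not affect the result; they would instead be reflected in specificity.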
In another aspect provided herein is a method for developing a computer model for inferring, from feature data, a state of oral cancer in a subject, the method comprising: a) training a machine learning algorithm on a training data set, wherein the training data set comprises, for each of a plurality of subjects, (1) a class label classifying a subject as having or not having an oral cancer; and (2) feature data comprising quantitative measures for each of a plurality of features selected from oral microbiome transcriptome expression, and wherein the machine learning algorithm develops a model that infers a class label for a subject based on the feature data.
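By way of non-limiting illustration, training on class labels and quantitative feature data can be sketched as follows. Logistic regression fitted by gradient descent is used here purely as a stand-in, since this passage does not specify a particular machine learning algorithm. The function names, hyperparameters, and feature values below are hypothetical.

```python
import math


def train_logistic(features, labels, lr=0.5, epochs=2000):
    """Fit logistic-regression weights by batch gradient descent."""
    n_features = len(features[0])
    n_samples = len(features)
    weights = [0.0] * n_features
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for x, y in zip(features, labels):
            z = bias + sum(w * xi for w, xi in zip(weights, x))
            p = 1.0 / (1.0 + math.exp(-z))  # predicted probability of class 1
            err = p - y
            for j, xi in enumerate(x):
                grad_w[j] += err * xi
            grad_b += err
        weights = [w - lr * g / n_samples for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n_samples
    return weights, bias


def predict(model, x):
    """Return 1 (oral cancer present) or 0 (absent) for one feature vector."""
    weights, bias = model
    z = bias + sum(w * xi for w, xi in zip(weights, x))
    return 1 if z >= 0 else 0


# Hypothetical training set: two activity features per subject.
# Label 1 = oral cancer present, 0 = absent. Values are illustrative only.
X = [[0.9, 0.1], [0.8, 0.2], [0.7, 0.3], [0.2, 0.8], [0.1, 0.9], [0.3, 0.7]]
y = [1, 1, 1, 0, 0, 0]
model = train_logistic(X, y)
train_calls = [predict(model, x) for x in X]
```

A practical model would be evaluated on subjects held out from training rather than on the training set itself.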
In another aspect provided herein is a method that infers a state of oral cancer in a subject, the method comprising: (a) providing a data set comprising, for the subject, feature data for each of a plurality of features selected from oral microbiome transcriptome gene expression data and taxa activity data; and (b) executing a computer model on the data set to infer the presence or absence of oral cancer in the subject.
In another aspect provided herein is a software product comprising a computer readable medium in tangible form comprising machine executable code, which, when executed by a computer processor, infers a state of oral cancer in a subject by: (a) accessing a data set comprising, for a subject, feature data for each of a plurality of features selected from oral microbiome transcriptome gene expression data and taxa activity data; and (b) executing a computer model on the data set to infer the state of oral cancer in the subject.
In another aspect provided herein is a method of treating oral cancer in a subject comprising: (a) determining the presence of oral cancer in a subject according to a method as described herein; and (b) administering a therapeutic intervention to the subject effective to treat the oral cancer.
In another aspect provided herein is a method for diagnosing and treating an oral cancer in a subject, the method comprising: (a) receiving from a subject a sample comprising an oral microbiome and, optionally, host somatic cells; (b) determining nucleic acid sequences of a microorganism component of the sample; (c) determining alignments of the nucleic acid sequence to reference nucleic acid sequences associated with the oral cancer; (d) generating a microbiome feature dataset for the subject based upon the alignments; (e) generating an inference of the oral cancer in the subject upon processing the microbiome feature dataset with an inference model derived from a population of subjects; and (f) at an output device associated with the subject, providing a therapy to the subject with the oral cancer upon processing the inference with a therapy model designed to treat the oral cancer.
In another aspect provided herein is a method comprising: (a) measuring, in a sample from a subject comprising an oral microbiome and, optionally, host somatic cells, activity of one or more biomarkers selected from Table 1; (b) inferring, from the measurements, presence of oral cancer in the subject; and (c) delivering to the subject a therapeutic intervention to treat the oral cancer. In one embodiment measuring comprises: (i) optionally, amplifying microbial metatranscriptome sequences in the sample; (ii) sequencing the microbial metatranscriptome from the sample to produce sequence reads; (iii) searching reference sequences in a reference sequence catalog for matches with the sequence reads; (iv) determining amounts of sequence reads matching reference sequences in the catalog to produce a data set; and (v) determining, from the data set, activity of each of the one or more biomarkers. In another embodiment determining activity comprises: (1) for biomarkers that are taxa categories, performing a taxonomic analysis with a metagenomic classifier to measure taxa activity; and (2) for biomarkers that are gene orthologs, performing a functional analysis by determining activity of genes having the same function across taxa based on sequences corresponding to microbial open reading frames (ORFs), and combining the activities to produce gene ortholog activity. In another embodiment inferring comprises: (i) executing by computer a classification model that infers presence or absence of oral cancer based on the biomarkers. In another embodiment the therapeutic intervention is selected from a drug, a dietary supplement, a food ingredient, and a food. In another embodiment measuring comprises: (i) selectively amplifying in the sample nucleic acids specific for the biomarkers; and (ii) determining amounts of the amplified nucleic acids.
In another aspect provided herein is a method comprising: a) providing biological samples from each of a first set of subjects and a second set of subjects having an oral cancer and having been subject to a therapeutic intervention, wherein the biological samples comprise an oral microbiome, and, optionally, host somatic cells, and wherein the first set of subjects responded positively to the therapeutic intervention and the second set of subjects did not respond positively to the therapeutic intervention; b) sequencing nucleic acids in the biological samples to provide sequence information; and c) performing a statistical analysis on the sequence information to produce a model that infers subject oral cancer having a positive response or lack of positive response to the therapeutic intervention.
In another aspect provided herein is a method of treating a subject with oral cancer comprising: (a) inferring that the subject will respond positively to each of one or more therapeutic interventions by executing a model on nucleic acid information from a biological sample from the subject comprising an oral microbiome and, optionally, host somatic cells; and (b) administering to the subject one or more of the therapeutic interventions.
Oral cancers will interact with the oral microbiome such that the microbes express genes, resulting in transcripts, that may not be expressed in the absence of oral cancers. Such transcripts may be found in saliva and be identified as biomarkers of oral cancer. By analyzing oral metatranscriptome, biomarkers of oral cancers may be found in the combination of human and microbial transcripts found in the mouth.
It has been discovered that features of a subject's oral metatranscriptome (RNA content) are associated with oral cancer. Accordingly, disclosed herein are methods for analyzing the oral metatranscriptome (MT), producing oral MT data, building machine-learning models to learn associations between oral cancers and MT data, and the use of such models to determine the presence or absence of oral cancer in a subject, as well as methods of treatment following such determination.
Methods of diagnosing oral cancer use a mouth sample from a subject. RNA from the mouth sample is sequenced to produce nucleic acid sequence information. For gene expression analysis only, an alternative method, such as microarray, could be used. RNA sequence information is subject to bioinformatics processing. Bioinformatics processing can produce information that indicates a measure of each of a plurality of genes or gene orthologs and of active microbial taxa in the sample. It can also produce information about the sequence and level of expression of human genes and transcripts, including specific sequence variants. These data, in turn, can be used as features in a dataset used to perform statistical analysis, e.g., to train a machine learning algorithm, to develop a model to classify a sample as consistent with presence of oral cancer or absence of oral cancer, or with a probability of cancer. Such models can be implemented on samples from test subjects. Subjects diagnosed with oral cancer according to the methods described herein can be administered a therapeutic intervention to treat the cancer.
The term “subject” refers to any animal. Animals can include vertebrates or invertebrates, including fish, amphibians, reptiles, birds and mammals. Mammalian hosts can include primates and, in particular, humans. Mammalian subjects also can include farm animals and companion animals. The term “host” refers to a subject organism serving as a vehicle for habitation of a microbiome. Because certain methods described herein include sequencing of a subject's microbiome, such subjects may also be referred to as “hosts.”
A human subject can be more than 20 years old or more than 50 years old. A subject can have a history of tobacco use or no history of tobacco use. As used herein, a subject with a history of tobacco use can be a current tobacco user or a former tobacco user. A current tobacco user is one who has used tobacco products four or more times per week in the past six months. A former tobacco user is one who has quit using tobacco products at the current time, but had previously used tobacco products four or more times per week for six months or more, within the last 20 years. A subject with no history of tobacco use is neither a current tobacco user nor a former tobacco user, that is, has not been a tobacco user for at least twenty years.
As used herein, the term “microbiome” includes a microbial community comprising one or a plurality of different microbial taxa inhabiting a host. As used herein, the term “oral microbiome” refers to a microbiome inhabiting a mouth (e.g., tongue, gums, cheek, saliva) or throat, of a host.
As used herein, the term metatranscriptome (MT) refers to the collection of microbial and, optionally, host, transcripts in a sample. Accordingly, a mouth metatranscriptome includes all microbiome and, optionally, host, components. Host components include any transcripts from somatic cells of the host and, in the case of an oral sample, in the mouth.
As used herein, the term “biological sample” refers to a sample that includes material of biological origin, such as cells, biological macromolecules (e.g., nucleic acids, proteins, carbohydrates or lipids) or their derivatives. Saliva is an exemplary biological sample.
As used herein, the term “mouth-sourced cell” refers to a cell sourced from the mouth of a subject. This includes, without limitation, cells from the mouth microbiome and host somatic cells, such as cheek cells, tongue cells, gum cells, etc.
Samples for diagnosis of oral cancer can comprise biological samples comprising a mouth MT of a subject. Mouth MT samples can be collected, for example, from saliva, sputum or a cheek swab from a subject.
Data used in developing a model to make the inferences described herein typically comprise large data sets including thousands, tens of thousands, hundreds of thousands or millions of individual measurements taken from or about a subject, typically at the systems biology level. The data can be derived from one or more (typically a plurality) different biological system components. These biological system components, also referred to herein as “feature groups”, include, without limitation, the genome (genomic), the epigenome (epigenomic), the transcriptome (transcriptomic), the proteome (proteomic), the metabolome (metabolomic), the organismal cellular lipid components (lipidome), organismal sugar components of complex carbohydrates (glycomic), the proteome and/or genome of the immune system (immunomics) component of a system, organism phenotype (phenome, phenomic, phenotypic) and environmental exposure (exposome). (These are generally referred to herein as “-omic” data or information.)
A mouth MT sample can be preserved for transport to a laboratory. The sample can be deposited into a container that comprises an aqueous liquid, e.g., a buffered solution. The aqueous liquid can further contain reagents to inhibit or slow degradation of one or more kinds of nucleic acid, such as DNA or RNA. As used herein, the term “nucleic acid preservative” refers to a compound or composition that inhibits degradation of nucleic acid. RNA preservatives include, without limitation, formalin, sulfate (e.g., ammonium sulfate), isothiocyanate (e.g., guanidinium isothiocyanate) and urea. Commercially available RNA preservatives include, for example, TRIzol (ThermoFisher), RNAlater (Ambion, Austin, Tex., USA), Allprotect tissue reagent (Qiagen), PAXgene Blood RNA System (PreAnalytiX GmbH, Hombrechtikon), RNA/DNA Shield® (Zymo Research, Irvine, Calif.), and DNAstable (MilliporeSigma, Burlington, Mass.).
Sample processing can proceed with cell lysis. Cell lysis can be performed by any method known in the art. This can include, for example, bead beating, a method that involves rapidly shaking a container containing solid particles such that cells in the container are lysed.
Polynucleotides can be extracted directly from the sample, or cells in the sample can first be lysed to release their polynucleotides. In one method, lysing cells comprises bead beating (e.g., with zirconium beads). In another method, ultrasonic lysis is used. Such a step may not be necessary for isolating cell-free nucleic acids.
After cell lysis, samples are further processed by the extraction or isolation of biomolecules in the container, e.g., biomolecules released from lysed cells. Isolated biomolecules typically include nucleic acids such as DNA and/or RNA. Other biomolecules to be isolated can include polypeptides, such as proteins.
Isolation of biomolecules can be performed with a liquid-handling robot. After cell lysis, biological molecules, such as nucleic acids, can be isolated or extracted from the sample.
Nucleic acids can be isolated from the sample by any means known in the art. Polynucleotides can be isolated from a sample by contacting the sample with a solid support comprising moieties that bind nucleic acids, e.g., a silica surface. For example, the solid support can be a column comprising silica or can comprise paramagnetic carboxylate coated beads or a silica membrane. After capturing nucleic acids in a sample, the beads can be immobilized with a magnet and impurities removed. In another method, nucleic acids can be isolated using cellulose, polyethylene glycol, or phenol/chloroform.
If the target polynucleotide is RNA, the sample can be exposed to an agent that degrades DNA, for example, a DNase. Commercially available DNase preparations include, for example, DNase I (Sigma-Aldrich), Turbo DNA-free (ThermoFisher) or RNase-Free DNase (Qiagen). Also, a Qiagen RNeasy kit can be used to purify RNA.
In another embodiment, a sample comprising DNA and RNA can be exposed to a low pH, for example, pH below pH 5, below pH 4 or below pH 3. At such pH, DNA is more subject to degradation than RNA.
DNA can be isolated with silica, cellulose, or other types of surfaces, e.g., Ampure SPRI beads. Kits for such procedures are commercially available from, e.g., Promega (Madison, Wis.) or Qiagen (Venlo, Netherlands).
Isolation of nucleic acids can further include elimination of non-informative RNA species from the sample. As used herein, the term “non-informative RNA” refers to a form of non-target or non-analyte species of RNA. Non-informative RNA species can include one or more of: human ribosomal RNA (rRNA), human transfer RNA (tRNA), microbial rRNA, and microbial tRNA. Non-informative RNA species can further comprise one or more of the most abundant mRNA species in a sample, for example, hemoglobin and myoglobin in a blood sample. Non-informative RNAs can be removed by contacting the sample with polynucleotide probes that hybridize with the non-informative species and that are attached to solid particles which can be removed from the sample. Examples of sequences that can be removed include microbial ribosomal RNA, including 16S rRNA, 5S rRNA, and 23S rRNA. Other examples of sequences that can be removed include host RNA. Examples include host rRNA, such as 18S rRNA, 5S rRNA, and 28S rRNA.
Isolated nucleic acids can be further processed to produce nucleic acid libraries. Production of nucleic acid libraries typically includes, in the case of RNA, converting RNA into DNA, e.g., by reverse transcription. Adaptors adapted for the DNA sequencing instrument to be used are typically attached to the DNA molecules.
According to one method, RNA molecules are reverse transcribed into cDNA using a reverse transcriptase. In certain embodiments, primers comprising a degenerate hexamer at their 3′ end hybridize to RNA molecules. The reverse transcriptase extends the primer and can leave a terminal poly-G overhang. In certain embodiments, the primer can also comprise adapter sequences. A template molecule comprising a Poly-C overhang and, optionally, adapter sequences, can be hybridized to the poly-G overhang and used to guide extension to produce an adapter tagged cDNA molecule comprising a cDNA insert flanked by adapter sequences.
If the target polynucleotide is DNA, then DNA can be isolated with silica, cellulose, or other types of surfaces, e.g., Ampure SPRI beads. Kits for such procedures are commercially available from, e.g., Promega (Madison, Wis.) or Qiagen (Venlo, Netherlands).
Methods of enriching nucleic acid samples include the use of oligonucleotide probes. Such probes can be used for either positive selection or negative selection. Such methods often reduce the amount of non-target nucleotides.
Adapter tagged cDNA molecules can be amplified using well-known techniques such as PCR, to produce a library.
In certain embodiments the nucleic acids to be sequenced are comprised in the transcriptome. As used herein, the term “metatranscriptome” refers to the set of RNA molecules in a population of cells. This can include all RNAs, but sometimes refers to only mRNA. In the present context it generally refers to RNA molecules produced by either human or microbial cells. In certain embodiments, the nucleic acids to be sequenced can be free or essentially free of host nucleic acids (“host-free nucleic acids”).
The isolated nucleic acids are generally sequenced for subsequent analysis. The methods described herein generally employ high throughput sequencing methods. As used herein, the term “high throughput sequencing” refers to the simultaneous or near simultaneous sequencing of thousands of nucleic acid molecules. High throughput sequencing is sometimes referred to as “next generation sequencing” or “massively parallel sequencing.” Platforms for high throughput sequencing include, without limitation, massively parallel signature sequencing (MPSS), Polony sequencing, 454 pyrosequencing, Illumina (Solexa) sequencing, SOLiD sequencing, Ion Torrent semiconductor sequencing, DNA nanoball sequencing (Complete Genomics), Heliscope single molecule sequencing, single molecule real time (SMRT) sequencing (PacBio), and nanopore DNA sequencing (e.g., Oxford Nanopore). Nucleotide sequences of nucleic acids produced by sequencing are referred to herein as “sequence information” or “sequence data”.
Also provided herein are methods of analyzing RNA transcripts in a heterogeneous microbial sample. The RNA transcripts can be part of a transcriptome for a cell or cells in the heterogeneous microbial sample. Information regarding the transcriptomes of a plurality of cells from different species may be obtained. The methods generally include isolating and sequencing the RNA found in a sample as described above.
The sequences obtained from these methods can be preprocessed prior to analysis. If the methods include sequencing a transcriptome, the transcriptome can be preprocessed prior to analysis. In one method, sequence reads for which there is paired end sequence data are selected. Alternatively, or in addition, sequence reads that align to a reference genome of the host are removed from the collection. This produces a set of host-free transcriptome sequences. Alternatively, or in addition, sequence reads that encode non-target nucleotides can be removed prior to analysis. As described above, non-target nucleotides include those that are over-represented in a sample or non-informative of taxonomic information. Removing sequence reads that encode such non-target nucleotides can improve performance of the systems, methods, and databases described herein. Limiting the sequence signature database to open reading frames (a part of a reading frame that has the ability to be translated) can reduce the size of the database, the amount of memory required to run the sequence signature generation analysis, the number of CPU cycles required to run the sequence signature generation analysis, the amount of storage required to store the database, the amount of time needed to compare sample sequences to the database, the number of alignments that must be performed to identify sequence signatures in a sample, the amount of memory required to run the sequence signature sample analysis, the number of CPU cycles required to run the sequence signature sample analysis, etc.
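By way of non-limiting illustration, the preprocessing filters described above can be sketched as follows. The sketch applies all three filters together, although the description treats each as optional. Exact sequence lookup is used as a simplified stand-in for alignment to a host reference genome. The read record layout, function name, and sequences are hypothetical.

```python
def preprocess_reads(reads, host_sequences, non_target_sequences):
    """Keep paired-end reads that match neither host nor non-target sequences.

    Exact set membership stands in for a real alignment step (assumption).
    """
    kept = []
    for read in reads:
        if not read["paired"]:
            continue  # select only reads with paired-end data
        if read["seq"] in host_sequences:
            continue  # drop reads attributed to the host
        if read["seq"] in non_target_sequences:
            continue  # drop non-informative reads (e.g., rRNA)
        kept.append(read)
    return kept


# Hypothetical reads and reference sets (illustrative sequences only).
reads = [
    {"id": "r1", "seq": "ACGTACGT", "paired": True},
    {"id": "r2", "seq": "TTTTGGGG", "paired": True},   # treated as host
    {"id": "r3", "seq": "CCCCAAAA", "paired": True},   # treated as rRNA
    {"id": "r4", "seq": "GATTACAG", "paired": False},  # no mate
    {"id": "r5", "seq": "GGCCTTAA", "paired": True},
]
host = {"TTTTGGGG"}
non_target = {"CCCCAAAA"}
host_free = preprocess_reads(reads, host, non_target)
kept_ids = [read["id"] for read in host_free]
```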
Subject data can include taxonomic data about the taxonomic classification and amounts of microbes in a microbiome of the subject. Such data is typically derived from nucleic acid sequence data obtained from the subject's microbiome. 16S rRNA sequences are a standard source of information for assigning taxonomic classifications. Non-rRNA transcriptome data are an alternative source of information for taxonomic classification. Such methods are described in international patent publication WO 2018/160899 (“Systems And Methods For Metagenomic Analysis”). Many metagenomic classifiers, aligners and profilers are publicly available. See, for example, Florian P Breitwieser et al., “A review of methods and databases for metagenomic classification and assembly,” Briefings in Bioinformatics, Volume 20, Issue 4, July 2019, Pages 1125-1136, doi.org/10.1093/bib/bbx120, Published: 23 Sep. 2017. These include, without limitation, Centrifuge, GOTTCHA, Kraken, Kraken2, CLARK, Kaiju, MetaPhlAn, MetaPhlAn2, MEGAN, LMAT, MetaFlow, mOTUs, and mOTUs2.
Another method of analysis includes analysis of composition of microbiomes (“ANCOM”). This method is described in, for example, Mandal S, et al., “Analysis of composition of microbiomes: a novel method for studying microbial composition”, Microb Ecol Health Dis. 2015 May 29; 26:27663. doi: 10.3402/mehd.v26.27663. eCollection 2015.
Taxonomic analysis can involve searching a sequence catalog of microbiome sequences for matches with sequences in the dataset, e.g., metatranscriptomic sequences. Matches are assigned to the proper taxonomic category. Numbers of matches with a taxonomic category can indicate quantities of microbes of that taxonomic category in the sample.
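By way of non-limiting illustration, searching a sequence catalog for matches and counting matches per taxonomic category can be sketched as follows. Substring matching is used as a simplified stand-in for the classifiers and aligners named herein. The catalog contents, taxon names, and function name are hypothetical.

```python
from collections import Counter


def count_taxon_matches(reads, catalog):
    """Count reads that match a reference sequence, grouped by taxon.

    `catalog` maps a reference sequence to its taxonomic category.
    """
    counts = Counter()
    for read in reads:
        for ref_seq, taxon in catalog.items():
            if read in ref_seq:
                counts[taxon] += 1
                break  # assign each read to at most one reference
    return counts


# Hypothetical catalog and reads (illustrative sequences only).
catalog = {
    "ACGTACGTTAGC": "Taxon A",
    "GGCATTCCAGGA": "Taxon B",
}
reads = ["ACGTT", "TTCCA", "CGTAC", "AAAAA"]  # last read matches nothing
matches = count_taxon_matches(reads, catalog)
```

Reads that match no reference are left unassigned and do not contribute to any category.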
The classifications can be at one or a plurality of different taxonomic levels, typically down to the species or strain level. Sequencing reads that map to sequences in the sub-catalog can then be labeled with tags indicating the taxonomic category at each level. The taxonomic label is assigned. Such systems can include classical or modern taxonomic classification systems.
As used herein, the term “taxon” (plural “taxa”) is a group of one or more populations of an organism or organisms seen by taxonomists to form a unit. A taxon is usually known by a particular name and given a particular ranking. For example, species are often designated using binomial nomenclature comprising a combination of a generic name for the genus and a specific name for the species. Likewise, subspecies are often designated using trinomial nomenclature comprising a generic name, a specific name, and a subspecific name. The taxonomic name for an organism at the taxonomic rank of genus is the generic name, the taxonomic name for an organism at the taxonomic rank of species is the specific name, and the taxonomic name for an organism at the taxonomic rank of subspecies is the subspecific name, when appropriate.
As used herein, the term “taxonomic level” refers to a level in a taxonomic hierarchy of organisms such as strain, species, genus, family, order, class, phylum, and kingdom. In some embodiments, each taxonomic level includes a plurality of “taxonomic categories”, that is, the different categories belonging to a particular taxonomic level. Some taxonomic levels only include a single member.
As used herein, the term “species” is intended to encompass both morphological and molecular methods of categorization. Species can be defined by genetic similarity. In some embodiments, a cladistic species is an evolutionarily divergent lineage and is the smallest group of populations that can be distinguished by a unique set of morphological or genetic traits.
Genomes imported into the reference catalog are typically indexed with a genome number. Various taxonomy indices, such as the NCBI taxonomy, categorize each genome number into a taxonomic classification. Consequently, sequencing reads that match reference sequences can also be taxonomically classified based on the genome number. Accordingly, using the taxonomic tree implicit in the taxonomic designation, the taxonomic source of any sequencing read can be identified and classified.
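As a minimal sketch of this lookup, assume a hypothetical in-memory taxonomy in which each node records its parent, rank and name. The node identifiers and the genome number "G0001" below are illustrative only and are not actual catalog or NCBI entries. The lineage of a read can then be recovered by walking from the matched genome up the tree:

```python
# Hypothetical taxonomy: node id -> (parent id, rank, name). Illustrative only.
TAXONOMY = {
    1: (None, "kingdom", "Bacteria"),
    2: (1, "phylum", "Firmicutes"),
    3: (2, "genus", "Streptococcus"),
    4: (3, "species", "Streptococcus mutans"),
}

# Hypothetical catalog index: genome number -> taxonomy node id.
GENOME_TO_TAXON = {"G0001": 4}


def lineage(genome_number):
    """Return {rank: name} for the genome's taxon and all of its ancestors."""
    node = GENOME_TO_TAXON[genome_number]
    ranks = {}
    while node is not None:
        parent, rank, name = TAXONOMY[node]
        ranks[rank] = name
        node = parent
    return ranks


def classify_read(matched_genome, level):
    """Label a read with the taxonomic category at the requested level.

    Returns None when the lineage has no entry at that level.
    """
    return lineage(matched_genome).get(level)
```

A read matching genome "G0001" would be labeled at the genus level with classify_read("G0001", "genus").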
Once classified, sequences in each category can be quantified or estimated to determine amounts of sequencing reads in each taxonomic category and the relative abundance of each taxonomic entity. The sequencing reads can be metatranscriptomic in origin. Accordingly, amounts of reads in a taxon represent transcriptional activity of the taxon, rather than pure numbers of organisms in the taxon in the sample. “Activity of a microbial taxon” can refer to transcriptional activity.
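The quantification step can be sketched as follows, assuming each classified read has already been reduced to a single taxon label (or None if unclassified). The function name and the choice to drop unclassified reads from the denominator are assumptions made for illustration:

```python
from collections import Counter


def relative_activity(read_labels):
    """Return the fraction of classified reads assigned to each taxon.

    read_labels: iterable of taxon labels, one per read; None marks a read
    that could not be classified and is excluded from the total.
    """
    counts = Counter(label for label in read_labels if label is not None)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {taxon: n / total for taxon, n in counts.items()}
```

Because the reads are metatranscriptomic, the resulting fractions reflect relative transcriptional activity rather than cell counts.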
The methods, systems and databases herein can be used to identify activity of a gene, a biochemical pathway or a functional activity from microbes present in the sample. In some embodiments, the methods include aligning sequencing reads to a database comprising open reading frame information that is associated with a particular biochemical activity or pathway. Some of such methods can include identifying taxonomic information for a sequence. Examples include the VIOMEGA algorithm (see WO 2018/160899 (Vuyisich et al.)) and the GOTTCHA algorithm, which detects sequence signatures that identify nucleic acids as originating from organisms at various taxonomic levels (Nucleic Acids Res. 2015 May 26; 43(10): e69). Other methods include MetaPhlAn, Bowtie2, mOTUs, Kraken, and BLAST. Some of such methods do not include identifying taxonomic information for the sequence, but instead may identify the biochemical activity, pathway, protein, functional RNA, product, or metabolite associated with a particular sequence read or sequence signature.
“Gene expression,” “gene activity” or “activity of a gene” is generally a function of transcription, e.g., the quantity of RNA in a sample encoding the gene. Gene activity can be measured at any taxonomic level. For example, gene activity could be a measure of activity of the gene in a single species, or it could be activity of the gene across organisms belonging to a common genus, class, order or phylum. Thus, the term “gene” can refer to orthologs of a gene across different species. As used herein, the term “gene ortholog” refers to a homologous version of a gene across different taxa having the same biological function. Typically, gene orthologs share a high degree of sequence identity. Such orthologs can be identified, for example, with the KEGG Orthology (KO) database (Kanehisa, M. and Goto, S.; KEGG: Kyoto Encyclopedia of Genes and Genomes. Nucleic Acids Res. 28, 27-30 (2000)). The KO database is a database of molecular functions represented in terms of functional orthologs. The KO databases include, among other things, genomic information, chemical information and systems information such as biological pathway maps. A functional ortholog is manually defined in the context of KEGG molecular networks, namely, KEGG pathway maps, BRITE hierarchies and KEGG modules. In the KEGG orthology, orthologs are identified by number. So, for example, “K01808” refers to rpiB, ribose 5-phosphate isomerase B [EC:5.3.1.6]. Orthologs can be searched at the world wide web site genome.jp/kegg/kegg2.html.
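Measuring gene activity at different taxonomic levels can be illustrated with a short sketch. It assumes reads have already been assigned to a (taxon, KO identifier) pair and counted; the function name, the tuple layout and any counts passed to it are hypothetical:

```python
from collections import defaultdict


def ko_activity(assignments, by_taxon=False):
    """Sum read counts per gene ortholog.

    assignments: iterable of (taxon, ko_id, read_count) tuples.
    by_taxon=False pools each ortholog across all taxa;
    by_taxon=True keeps a separate total for each (taxon, ko_id) pair.
    """
    totals = defaultdict(int)
    for taxon, ko_id, count in assignments:
        key = (taxon, ko_id) if by_taxon else ko_id
        totals[key] += count
    return dict(totals)
```

With by_taxon=False, reads for an ortholog such as K01808 are pooled across every organism expressing it; with by_taxon=True, the same ortholog is reported separately for each taxon.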
Nucleic acid sequence information is processed using bioinformatics to extract higher order information. In particular, two types of information that are usefully extracted from sequence data include gene activity information and taxa activity information.
The activities of one or more taxa groups can be determined from the amount of nucleic acid, e.g., RNA, in a sample originating from particular taxonomic groups. Microbial taxa include taxonomic designation at any taxonomic level, e.g., species, genus, order, class, or phylum. Active microbial taxa are taxa that are not merely present but that are metabolically active, e.g., as measured by transcriptional levels of the microbial genome. Taxa groups of interest include, without limitation, Prevotella (genus)/Bacteroides (genus) ratio, Eubacterium rectale (species), Eubacterium eligens (species), Faecalibacterium prausnitzii (species), Akkermansia muciniphila (species), metabolic-related probiotic species (functional group), Roseburia (genus), Bifidobacterium (genus), Lactobacillus (genus), Clostridium butyricum (species), Allobaculum (genus), Firmicutes (phylum)/Bacteroidetes (phylum) ratio, Lachnospiraceae (family), Enterobacteriaceae (family), Ralstonia pickettii (species), Bilophila wadsworthia (species).
Similar bioinformatic approaches can be used to analyze human gene expression, by identifying and counting the transcripts produced by human cells. Bioinformatic software to extract such information from sequence data is known in the art. Examples include the VIOMEGA algorithm (see WO 2018/160899 (Vuyisich et al.)) and the GOTTCHA algorithm, which detects sequence signatures that identify nucleic acids as originating from organisms at various taxonomic levels (Nucleic Acids Res. 2015 May 26; 43(10): e69). Other methods include MetaPhlAn, Bowtie2, mOTUs, Kraken, BLAST and Salmon.
“Functional activities” are biological activity categories including biological or health functions or conditions at the cellular, organ or organismal level. Functional activities are assigned functional activity scores based on such data. Functional activity scores represent quantitative measures of functional activity. A functional category can involve any function related to health or wellness. Functional categories can embrace health parameters, health indicators, biological conditions and health risks. The activity of the function is assessed by analyzing -omic, e.g., transcriptomic data, which is collected from active, living organisms, e.g., expressing RNA from their genomes.
Functional activity includes integrative functional activities and non-integrative functional activities. Non-integrative functional activities are based on a single type of data or function, such as microbiome pathway activity data, taxa group activity data and host transcriptomic data. Integrative functional activities can be based on a plurality of different kinds of data or functions. For example, such functional activities can combine pathway activity data with taxa activity data.
In certain embodiments, functional activities include the activities of one or more pathways. As used herein, the term “pathways” refers to biological pathways, which are sequences of proven molecular events (such as enzymatic reactions or signal transduction or transport of substances or morphological structure changes) that lead to specific functional outcomes (such as secretion of substances, sporulation, biofilm formation, motility). Many biological pathways are known in the art, and examples can be found on the web at wikipathways.org/index.php/WikiPathways, pathwaycommons.org, and proteinlounge.com/Pathway/Pathways.aspx. Manual expert curation of scientific literature also can be used to reconstruct or create custom biological pathways. Biological pathways can include a number of genes that encode peptides or proteins, which play specific signaling, metabolic, structural or other biochemical roles in order to carry out various molecular pathways.
As used herein, the terms “biochemical activity” and “biochemical pathway activity” refer to activity of a biochemical pathway. Pathways of interest include, without limitation, butyrate production pathways, LPS biosynthesis pathways, methane gas production pathways, sulfide gas production pathways, flagellar assembly pathways, ammonia production pathways, putrescine production pathways, oxalate metabolism pathways, uric acid production pathways, salt stress pathways, biofilm chemotaxis in virulence pathways, TMA production pathways, primary bile acid pathways, secondary bile acid pathways, acetate pathways, propionate pathways, branched chain amino acid pathways, long chain fatty acid metabolism pathways, long chain carbohydrate metabolic pathways, cadaverine production pathways, tryptophan pathways, starch metabolism pathways, fucose metabolism pathways.
In order to build models to make inferences about the presence or absence of oral cancer, a dataset must be assembled that includes data from a plurality of subjects. Subjects typically will include both those diagnosed as having oral cancer and those diagnosed as not having oral cancer. The number of subjects in each category should be sufficient to provide statistically meaningful results. For example, such a cohort can comprise at least any of 50, 100, 500, or 1000 subjects diagnosed with the disease and at least any of 50, 100, 500, or 1000 subjects diagnosed without the disease.
A. Data sets
In building or executing a model to predict the oral cancer of an individual subject, databases are provided that include information about one or a plurality of subjects. Raw data can include sequence data or information derived therefrom.
Models, or classification models, are algorithms that make inferences based on feature data measured from a test. Methods of generating models to predict oral cancer can involve providing a training dataset on which a machine learning algorithm can be trained to develop one or more models to predict oral cancer. The training dataset will include a plurality of training examples or instances, typically for each of a plurality of subjects and typically in the form of a vector. Each training example will include a plurality of features and, for each feature, data, e.g., in the form of numbers or descriptors. Where learning is to be supervised, the data will include a classification of the subject into a category of a categorical variable to be inferred. For example, the categorical variable may be “cancer diagnosis” and the categories or classifications of this variable can be “present” and “absent”. Typically, for machine learning, the training examples will have at least 10, at least 100, at least 500 or at least 1000 different features. The features selected are those on which prediction will be based. In the present case features can include genes or taxa or gene activity and/or taxa activity. The collection of features included in a dataset can be referred to as a “feature set”.
Accordingly, the collection of sequence data or gene activity and/or taxa activity data from an individual subject represent data for a particular instance. Each gene or taxon measured or determined represents a feature. A value, which can be a number or qualifier, is provided for an instance at a particular feature. The collection of data across a plurality of instances or examples, e.g. subjects, represents a dataset. Accordingly, each dataset can be represented as a vector of values for combinations of instances and features.
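One way to picture the relationship between instances, features and a dataset is the following sketch, which arranges per-subject measurements into one row per subject and one column per feature. The function name and the decision to record an unmeasured feature as 0.0 are assumptions for illustration only:

```python
def build_feature_matrix(subjects, feature_set):
    """Arrange per-subject measurements as rows aligned to feature_set.

    subjects: {subject_id: {feature_name: value}}
    feature_set: ordered list of feature names (taxa, KOs or genes).
    Returns (subject_ids, rows), where rows[i][j] is the value for
    subject_ids[i] at feature_set[j]. Features not measured in a subject
    are recorded as 0.0 (an assumption of this sketch).
    """
    subject_ids = sorted(subjects)
    rows = [
        [subjects[sid].get(feature, 0.0) for feature in feature_set]
        for sid in subject_ids
    ]
    return subject_ids, rows
```

Each row is then the feature vector for one instance, and the list of rows is the dataset passed to a statistical or machine learning method.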
A measurement of a variable, such as a phenotypic trait (e.g., presence or absence of cancer), quantity of microbes in a taxon, gene expression levels, biochemical pathway activity or a functional activity, can be any combination of numbers and words. A measure can be any scale, including nominal (e.g., name or category), ordinal (e.g., hierarchical order of categories), interval (distance between members of an order), ratio (interval compared to a meaningful “0”), or a cardinal number measurement that counts the number of things in a set. Measurements of a variable on a nominal scale indicate a name or category (e.g., a class label), such as “cancer” or “non-cancer”, “old” or “young”, “form 1” or “form 2”, “subject 1 . . . subject n,” etc. Measurements of a variable on an ordinal scale produce a ranking, such as “first”, “second”, “third”; or order from most to least. Measurements on a ratio scale include, for example, any measure on a pre-defined scale, such as number of molecules, weight, activity level, signal strength, concentration, age, etc., as well as statistical measurements such as frequency, mean, median, standard deviation, or quantile. Measurements on a ratio scale can be relative amounts or normalized measures. Quantitative measures can be given as a discrete or continuous range. Examples of quantitative measures include a number, a degree, a level, a range or bucket. A number can be a number on a scale, for example 1-10. Alternatively, the score can embrace a range. For example, ranges can be high, medium and low; severe, moderate and mild; or actionable and non-actionable. Buckets can comprise discrete numerals, such as 1-3, 4-6 and 7-10.
Models can be created by statistical methods. Statistical analysis can include any useful methodology including, without limitation, correlational analysis, Pearson correlation, Spearman correlation, chi-square, comparison of means (e.g., paired T-test, independent T-test, ANOVA), regression analysis (e.g., simple regression, multiple regression, linear regression, non-linear regression, logistic regression, polynomial regression, stepwise regression, ridge regression, lasso regression, elasticnet regression) or non-parametric analysis (e.g., Wilcoxon rank-sum test, Wilcoxon sign-rank test, sign test). Statistical analysis can be performed by hand or by computer. Computer methods include, for example, machine learning algorithms.
Machine learning involves training machine learning algorithms on training data sets comprising data from a plurality of test subjects. Machine learning algorithms are trained on the training dataset to generate models that predict the oral cancer of an individual based on sequence data or information derived therefrom. Predicted oral cancer can be translated into recommendations to the subject about therapeutic interventions to be taken.
The machine learning algorithm can be any suitable supervised machine learning algorithm, parametric or non-parametric. Machine learning algorithms include, without limitation, artificial neural networks (e.g., back propagation networks), decision trees (e.g., recursive partitioning processes, CART), random forests, discriminant analyses (e.g., Bayesian classifier or Fisher analysis), linear classifiers (e.g., multiple linear regression (MLR), partial least squares (PLS) regression, principal components regression (PCR)), mixed or random-effects models, non-parametric classifiers (e.g., k-nearest neighbors), support vector machines, and ensemble methods (e.g., bagging, boosting).
Methods for generating models to predict oral cancer can comprise the following operations. A dataset as described above is provided. The dataset includes, for each of a plurality of subjects, raw or processed data. The data set is used as a training dataset to train a machine learning algorithm to produce one or more models that predict oral cancer of a subject based on biomarkers identified from the data.
Biomarkers can be individual features used by the model in making an inference (e.g., diagnosis) of the category in question. For example, of thousands of features used in the original training dataset, the model may use no more than any of 1, 5, 10, 50, 100 or 500 features in determining the classification.
A model may be subsequently validated using a validation dataset. Validation datasets typically include data on the same features as the training dataset. The model is executed on the validation dataset and the number of true positives, true negatives, false positives and false negatives is determined, as a measure of performance of the model.
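The counting of true and false positives and negatives can be sketched as below, assuming the known diagnoses and the model's inferred labels are available as two parallel lists. The function name and the label strings "present" and "absent" are illustrative choices:

```python
def confusion_counts(actual, predicted, positive="present"):
    """Count TP, TN, FP and FN for paired actual and predicted class labels.

    Any label other than `positive` is treated as the negative class.
    """
    counts = {"TP": 0, "TN": 0, "FP": 0, "FN": 0}
    for truth, inferred in zip(actual, predicted):
        if inferred == positive:
            counts["TP" if truth == positive else "FP"] += 1
        else:
            counts["FN" if truth == positive else "TN"] += 1
    return counts
```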
The model can then be tested on a validation dataset to determine its usefulness. Typically, a learning algorithm will generate a plurality of models. In certain embodiments, models can be validated based on fidelity to standard clinical measures used to diagnose the condition under consideration. One or more of these can be selected based on its performance characteristics.
Inferring a state of oral cancer in a subject generally means using a model to assign a class label related to oral cancer to a test subject. The classifier can classify the condition according to any classification scheme useful to the operator. The class label can be “presence of oral cancer” or “absence of oral cancer”, or “likely presence of oral cancer” or “likely absence of oral cancer”. Alternatively, the class label can be a stage of oral cancer, including absence of oral cancer. Alternatively, the class label can be a type of oral cancer present, or the absence of oral cancer.
Oral cancers, the presence or absence of which can be inferred by the methods described herein, include, without limitation, cancer of the lip, tongue, inner lining of the cheek, gums, floor of the mouth and hard and soft palate. They further include squamous cell carcinoma, verrucous carcinoma, minor salivary gland carcinoma, lymphoma, benign oral cavity tumors and basal cell carcinomas.
Methods described herein can infer a stage of an oral cancer. Oral cancer stages include the following:
Stage 0 oral cancer: Cancer limited to the layer of cells lining the oral cavity or oropharynx (also referred to as “carcinoma in situ”). Treatment may include surgery, radiation, or a combination of both.
Stage 1 oral cancer: Tumor is 2 centimeters (cm) (about ¾ inches) or less in size. The cancer has not spread to the lymph nodes or to other places in the body. Also classified as “T1, N0, and M0” where T refers to tumor size, N refers to involvement of lymph nodes, and M refers to metastasis. Treatment may include surgery, radiation, or a combination of both.
Stage 2 oral cancer: Tumor is between 2 and 4 cm (about 1½ inches) in size. The cancer has not spread to the lymph nodes or other places in the body. Also classified as T2, N0, and M0. Treatment may include surgery, radiation, or a combination of both.
Stage 3 oral cancer: Tumor is larger than 4 cm (about 2 inches) and has not metastasized, but may have spread to the lymph nodes. Also classified as T3, N0, M0; T1, N1, M0; T2, N1, M0; and T3, N1, M0. Surgery or radiation or both are likely treatment options. Chemotherapy may be suggested to destroy any cancer that has spread, and other options include targeted treatments directed at a protein found on oral cancer cells called epidermal growth factor receptor (EGFR). The drug cetuximab specifically targets EGFR.
Stage 4 oral cancer: Tumor can be any size, but the cancer has spread to the lymph nodes or other parts of the body. Also classified as T(1 to 4), N number (0 to 3), and either M0 or M1. Treatment may include surgery, radiation, chemotherapy, targeted treatments, or a combination.
The model selected can result from either operator-executed statistical analysis or machine learning. In any case, the model can be used to make inferences (e.g., predictions) about a test subject. Test data can be generated from a sample taken from the test subject. The test dataset can include all of the same features used in the training dataset, or a subset of these features. Such a subset functions as a set of biomarkers. The model is then applied to or executed on the test dataset. Inferring oral cancer is a form of executing a model. The inference is typically performed by computer, but can be performed by a person; the choice may depend on the complexity of the operation of correlating. Executing the model produces an inference, e.g., a classification of a subject as belonging to a class (such as a diagnosis of oral cancer).
The classifier or model may generate, from the subject data, a single diagnostic number on which the classification is based. Classifying a subject as having oral cancer can involve determining whether the diagnostic number is above or below a threshold (“diagnostic level”). The threshold can be determined, for example, based on a certain deviation of the diagnostic number above that of subjects who do not have oral cancer. A measure of central tendency, such as mean, median or mode, of diagnostic numbers can be determined in a statistically significant number of normal and abnormal individuals. A cutoff above normal amounts can be selected as a diagnostic level of oral cancer. That cutoff can be, for example, a certain degree of deviation from the measure of central tendency, such as variance or standard deviation. In one embodiment the measure of deviation is a Z score or number of standard deviations from the normal average.
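The Z score embodiment can be sketched as follows, assuming diagnostic numbers are available for a reference group of subjects without oral cancer. The function names, the use of the sample standard deviation, and the default cutoff of 2.0 standard deviations are assumptions chosen for illustration, not values taken from this disclosure:

```python
from statistics import mean, stdev


def z_score(diagnostic_number, normal_numbers):
    """Number of standard deviations above the normal average."""
    return (diagnostic_number - mean(normal_numbers)) / stdev(normal_numbers)


def diagnostic_level(normal_numbers, z_cutoff=2.0):
    """Threshold on the diagnostic number that corresponds to z_cutoff."""
    return mean(normal_numbers) + z_cutoff * stdev(normal_numbers)


def classify(diagnostic_number, normal_numbers, z_cutoff=2.0):
    """Return "present" when the diagnostic number exceeds the cutoff."""
    if z_score(diagnostic_number, normal_numbers) > z_cutoff:
        return "present"
    return "absent"
```

In practice the cutoff would be chosen to give the sensitivity and specificity desired for the test.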
The model used to make an inference of oral cancer can be chosen to have any desired level of sensitivity, specificity, positive predictive value or negative predictive value.
Sensitivity refers to a value calculated according to the formula TP/(TP+FN), where TP is the number of true positive measurements (e.g., correctly inferring the presence of oral cancer in a subject) and FN is the number of false negative measurements (e.g., incorrectly inferring the absence of oral cancer in a subject). Sensitivity measures the percentage of subjects that actually have oral cancer who are inferred to have oral cancer by the test. In some embodiments, the diagnostic test can infer a presence or an absence of oral cancer with a sensitivity of greater than about any of: 70%, 75%, 80%, 85%, 86%, 87%, 88%, 89%, 90%, 91%, 92%, 93%, 94%, 95%, 96%, 97%, 98%, 99%, or 99.5%.
Specificity refers to a value calculated according to the formula TN/(TN+FP), where TN is the number of true negative measurements (e.g., correctly inferring an absence of oral cancer in a subject) and FP is the number of false positive measurements (e.g., incorrectly inferring the presence of oral cancer in a subject). Specificity measures the percentage of subjects that actually do not have oral cancer who are inferred to not have oral cancer by the test. In some embodiments, the diagnostic test can infer a presence or an absence of oral cancer with a specificity of greater than about any of: 70%, 75%, 80%, 85%, 86%, 87%, 88%, 89%, 90%, 91%, 92%, 93%, 94%, 95%, 96%, 97%, 98%, 99%, or 99.5%.
Positive Predictive Value (PPV) refers to a value calculated according to the formula TP/(TP+FP). A PPV value is the proportion of subjects inferred to be positive (presence of oral cancer) that actually have oral cancer. In some embodiments, the model, e.g., diagnostic test, may infer a presence or an absence of oral cancer in a subject at a PPV of greater than about any of: 70%, 75%, 80%, 85%, 86%, 87%, 88%, 89%, 90%, 91%, 92%, 93%, 94%, 95%, 96%, 97%, 98%, 99%, or 99.5%.
Negative Predictive Value (NPV) refers to a value calculated according to the formula TN/(TN+FN). An NPV value is the proportion of subjects inferred to be negative (absence of oral cancer) that actually do not have oral cancer. In some embodiments, the model, e.g., diagnostic test, may infer a presence or an absence of oral cancer in a subject at an NPV of greater than about any of: 70%, 75%, 80%, 85%, 86%, 87%, 88%, 89%, 90%, 91%, 92%, 93%, 94%, 95%, 96%, 97%, 98%, 99%, or 99.5%.
Accuracy can be measured by the percentage of subjects who test positive or negative that are true positives or true negatives, respectively. Accuracy can be calculated using the following formula: Accuracy=(TP+TN)/(TP+TN+FP+FN).
Precision can be measured by the percentage of subjects who test positive that are true positives and not false positives. Precision can be calculated using the following formula: precision=TP/(TP+FP).
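The formulas above for sensitivity, specificity, PPV, NPV, accuracy and precision can be computed together from the four counts. The following is a minimal sketch; the function name and the decision to return NaN when a denominator is zero are assumptions of the sketch:

```python
def performance(tp, tn, fp, fn):
    """Compute the performance measures defined above from TP, TN, FP, FN."""

    def ratio(numerator, denominator):
        # Undefined ratios (empty denominator) are reported as NaN.
        return numerator / denominator if denominator else float("nan")

    ppv = ratio(tp, tp + fp)
    return {
        "sensitivity": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
        "ppv": ppv,
        "npv": ratio(tn, tn + fn),
        "accuracy": ratio(tp + tn, tp + tn + fp + fn),
        "precision": ppv,  # same formula as PPV: TP/(TP+FP)
    }
```

Note that precision and PPV share the same formula, so they always agree.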
Classifications can be provided to a subject, for example, in the form of recommendations. In one embodiment, the recommendations include a positive recommendation to administer a therapeutic intervention, e.g., a chemotherapy drug.
Individual features may be found to contribute more or less to making an inference. Such significant features can be determined, for example, by leaving them out of a training data set and determining the deterioration in predictive ability of the ultimate models. Also, to the extent statistical analysis generates a plurality of predictive models, comparison of such models can show certain features present in many models.
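The leave-one-out approach to ranking features can be sketched as follows. The callable train_and_score is hypothetical: it is assumed to train a model on the given list of features and return a single performance value (for example, accuracy on a validation dataset), with higher values being better:

```python
def feature_importance(feature_set, train_and_score):
    """Estimate each feature's contribution by leaving it out.

    feature_set: list of feature names.
    train_and_score: hypothetical callable(list_of_features) -> float that
        trains a model on those features and returns its performance.
    Returns {feature: drop in performance when that feature is omitted}.
    """
    baseline = train_and_score(list(feature_set))
    drops = {}
    for feature in feature_set:
        reduced = [f for f in feature_set if f != feature]
        drops[feature] = baseline - train_and_score(reduced)
    return drops
```

Features whose removal causes the largest drop are candidates for the most significant biomarkers. This sketch retrains once per feature, which can be costly for feature sets in the thousands.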
Also provided herein are methods for using a companion diagnostic to infer response by a subject (e.g., will or will not respond positively or degree of response) to a therapeutic intervention for oral cancer. A companion diagnostic is an in vitro diagnostic test or device that provides information relevant to the safe and effective use of a corresponding therapeutic intervention, a therapy or adjuvant therapy. Such methods can infer possible adverse reactions to a therapeutic intervention or can infer responsiveness to a therapeutic intervention. Such inferences may include schedule, dose, discontinuation, or combinations of therapeutic agents. In some embodiments, the therapeutic intervention is selected by measuring one or more biomarkers in the subject.
Companion diagnostics can be developed by generating a dataset that includes subjects that are responsive to and nonresponsive to a particular therapeutic intervention. The dataset will further include nucleic acid sequence information derived from a biological sample comprising an oral microbiome of each subject. The dataset can be subject to statistical analysis to identify features, e.g. biomarkers, useful in inferring responsiveness. In some embodiments, the data set is used as a training dataset to train a machine learning algorithm to generate a classification model to classify a subject as responsive or nonresponsive to the particular therapeutic intervention.
The therapeutic intervention can be a primary intervention or an adjuvant therapy for the oral cancer. An adjuvant therapy is an additional therapeutic intervention given after a primary therapeutic intervention to lower the risk that the oral cancer will recur. Adjuvant therapies can include, for example, chemotherapy, radiation therapy, hormone therapy, targeted therapy, or biological therapy.
B. Microbiome Features Associated with Oral Cancer
Table 1 identifies microbial taxa and gene orthologs (e.g., microbial) (identified as KEGG orthologs) associated with oral cancer. The table indicates whether the association is positive (“+”) or negative (“−”). A classification model or rule to infer oral cancer in a subject can use a feature set that includes one or more of these markers as features. A variety of combinations of features are possible. These include, without limitation, feature sets including at least, exactly or no more than any of 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, or 80 features selected from the features of Table 1. In another embodiment, all, some or none of the features selected from the features of Table 1 are positively associated with oral cancer. In another embodiment, all, some or none of the features selected from the features of Table 1 are negatively associated with oral cancer. In another embodiment, all, some or none of the features selected from the features of Table 1 are taxonomic features, including features that are only positively associated with oral cancer, only negatively associated with oral cancer or a combination of positively and negatively associated features. In another embodiment, all, some or none of the features selected from the features of Table 1 are KEGG ortholog features, including features that are only positively associated with oral cancer, only negatively associated with oral cancer or a combination of positively and negatively associated features.
In another embodiment, features from Table 1 include both taxonomic features and KEGG ortholog features, including features that are only positively associated with oral cancer, only negatively associated with oral cancer or a combination of positively and negatively associated features. Each feature functions as a biomarker, that is, a measurable biological analyte associated with the condition in question.
Actinomyces gerencseriae
Actinomyces sp. ICM54
Actinomyces sp. oral taxon 170
Actinomyces sp. oral taxon 172
Actinomyces sp. oral taxon 181
Actinomyces sp. oral taxon 849
Actinomyces urogenitalis
Alloprevotella rava
Alloscardovia omnicolens
Arcanobacterium urinimassiliense
Bifidobacterium longum
Capnocytophaga gingivalis
Capnocytophaga sp. oral taxon
Corynebacterium argentoratense
Eikenella corrodens
Haemophilus sp. CCUG 66565
Lactobacillus fermentum
Mycoplasma salivarium
Parvimonas sp. oral taxon 110
Porphyromonas sp. oral taxon
Prevotella buccae
Rhodococcus sp. 008
Rothia aeria
Rothia sp. HMSC036D11
Rothia sp. HMSC061E04
Rothia sp. HMSC062F03
Rothia sp. HMSC062H08
Rothia sp. HMSC064D08
Rothia sp. HMSC069C01
Selenomonas sp. CM52
Selenomonas sp. oral taxon
Selenomonas sp. oral taxon
Selenomonas sputigena
Staphylococcus pasteuri
Streptococcus mitis
Streptococcus porcinus
Streptococcus sp. 343_SSPC
Streptococcus sp. oral taxon
Treponema medium
Treponema sp. OMZ 838
Veillonella atypica
Xylanimonas cellulosilytica
Actinobaculum sp. oral taxon
Actinobaculum suis
Actinomyces cardiffensis
Actinomyces johnsonii
Actinomyces massiliensis
Actinomyces sp. oral taxon 448
Actinomyces sp. oral taxon 848
Aggregatibacter actinomycetemcomitans
Aggregatibacter aphrophilus
Cardiobacterium hominis
Corynebacterium matruchotii
Entamoeba nuttalli
Kocuria kristinae
Leptotrichia buccalis
Mogibacterium diversum
Neisseria cinerea
Neisseria sp. HMSC077D05
Ottowia sp. oral taxon 894
Porphyromonas endodontalis
Prevotella loescheii
Prevotella sp. oral taxon 473
Propionibacterium australiense
Streptococcus cristatus
Streptococcus australis
Streptococcus lutetiensis
Streptococcus mutans
Streptococcus phage YMC-
Streptococcus salivarius
Streptococcus sobrinus
Streptococcus sp. F0442
Streptococcus sp. HPH0090
Streptococcus sp. NPS 308
Streptococcus timonensis
Tannerella forsythia
In certain embodiments, the features used in the model include one or more features selected from Actinobaculum sp. oral taxon 183, Actinomyces massiliensis, Actinomyces sp. oral taxon 448, Alloscardovia omnicolens, Selenomonas sp. CM52, Mycoplasma salivarium, Parvimonas sp. oral taxon 110, Rothia sp. HMSC062H08, K01697, K12452, Actinomyces johnsonii, Prevotella loescheii, Streptococcus cristatus, Streptococcus sobrinus, Streptococcus sp. HPH0090, Tannerella forsythia, and K02909.
Features used by a classification algorithm to infer presence of oral cancer can include a combination of microbial taxa activity scores, microbial KO activity scores, and host gene activity scores. Exemplary features are presented in Tables 2, 3 and 4. In the tables, the model coefficient indicates the degree of correlation with oral cancer. Greater absolute values indicate higher correlation. Negative and positive coefficients indicate, respectively, a lower or higher amount of a taxon, or down- or up-regulation of the activity of a KO or gene, compared with control.
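How signed model coefficients could combine feature values into a single score can be illustrated with a short sketch. It assumes, purely for illustration, a linear model whose score is passed through a logistic function; this disclosure does not limit the model to that form. The function names and any feature names or coefficient values supplied to them are hypothetical and are not taken from Tables 2, 3 or 4:

```python
import math


def linear_score(values, coefficients, intercept=0.0):
    """Weighted sum of feature values using signed model coefficients.

    values: {feature_name: measured activity}; missing features count as 0.0.
    coefficients: {feature_name: coefficient}; positive coefficients raise
        the score, negative coefficients lower it.
    """
    return intercept + sum(
        coef * values.get(feature, 0.0) for feature, coef in coefficients.items()
    )


def probability(score):
    """Map a linear score to a value between 0 and 1 (logistic function)."""
    return 1.0 / (1.0 + math.exp(-score))
```

A feature with a larger absolute coefficient moves the score more per unit of activity, which is consistent with larger absolute values indicating stronger association.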
Table 2 shows 88 expressed human genes that can be used in a model.
Table 3 shows 110 active microbial species that can be used in a model.
Corynebacterium matruchotii
Saccharomyces sp. ‘boulardii’
Tannerella forsythia
Actinomyces sp. oral taxon 180
Rothia sp. HMSC078H08
Streptococcus mutans
Campylobacter sp. 10_1_50
Prevotella sp. oral taxon 472
Porphyromonas endodontalis
Ralstonia sp. MD27
Gemella morbillorum
Ochrobactrum anthropi
Campylobacter concisus
Leucobacter chironomi
Capnocytophaga sp. ChDC OS43
Prevotella loescheii
Rothia sp. HMSC062F03
Actinomyces johnsonii
Actinobaculum sp. oral taxon 183
Actinomyces massiliensis
Prevotella nanceiensis
Capnocytophaga sp. oral taxon
Neisseria polysaccharea
Actinomyces sp. oral taxon 170
Bifidobacterium reuteri
Actinomyces viscosus
Selenomonas sp. CM52
Oribacterium parvum
Leptotrichia hofstadii
Peptoniphilus sp. oral taxon 836
Fusobacterium sp. oral taxon 370
Streptococcus vestibularis
Actinomyces sp. HMSC075C01
Selenomonas noxia
Actinomyces sp. oral taxon 849
Streptococcus sp. 343_SSPC
Actinomyces sp. Marseille-P2985
Alloscardovia omnicolens
Prevotella sp. oral taxon 299
Streptococcus sp. 1171_SSPC
Streptococcus sp. 400_SSPC
Fusobacterium sp. OBRC1
Actinomyces sp. oral taxon 877
Rothia aeria
Streptococcus anginosus
Eikenella corrodens
Streptococcus milleri
Bifidobacterium sp.
Actinomyces sp. oral taxon 448
Cardiobacterium hominis
Haemophilus sp. HMSC61B11
Streptococcus sp. HMSC034E12
Actinomyces sp. oral taxon 171
Actinomyces gerencseriae
Streptococcus sp. HMSC066F01
Haemophilus sp. HMSC71H05
Streptococcus viridans
Mogibacterium diversum
Streptococcus sanguinis
Abiotrophia sp. HMSC24B09
Fusobacterium sp. HMSC064B11
Rothia sp. HMSC036D11
Lactobacillus fermentum
Actinomyces sp. S6-Spd3
Streptococcus sp. HMSC072G04
Streptococcus sp. HMSC062D07
Corynebacterium durum
Haemophilus sp. HMSC073C03
Streptococcus timonensis
Bifidobacterium longum
Streptococcus sp. I-G2
Leptotrichia wadei
Bifidobacterium breve
Streptococcus sp. HMSC065C01
Streptococcus sp. I-P16
Fusobacterium nucleatum
Streptococcus sp. HMSC072D03
Rothia sp. HMSC064D08
Lactobacillus crispatus
Actinomyces sp. oral taxon 175
Haemophilus sp. HMSC061E01
Veillonella sp. oral taxon 158
Streptococcus constellatus
Streptococcus sp. AS20
Streptococcus sp. F0442
Rothia sp. HMSC071F11
Streptococcus sp. HMSC10E12
Rothia dentocariosa
Capnocytophaga sputigena
Oribacterium sinus
Streptococcus parasanguinis
Gemella sanguinis
Streptococcus sp. A12
Actinomyces sp. ICM47
Streptococcus sp. HMSC072C09
Rothia sp. HMSC069C01
Streptococcus sp. HMSC068F04
Streptococcus sp. SR4
Rothia sp. HMSC067H10
Prevotella melaninogenica
Leptotrichia sp. oral taxon 215
Actinomyces oris
Streptococcus salivarius
Prevotella sp. ICM33
Streptococcus sp. 449_SSPC
Bacteroides zoogleoformans
Streptococcus sp. HMSC064D12
Streptococcus cristatus
Streptococcus sp. HMSC065E03
Rothia mucilaginosa
Table 4 shows 72 active microbial KO functional features that can be used in a model.
3. Genesets Associated with Oral Cancer
Referring to Table 5, certain biological mechanisms are associated with oral cancer. Activity of taxa, microbial KOs and host genes that are involved in these mechanisms can be used as features in a classification model to infer oral cancer.
i. Pro-Inflammatory Activities Promoting Carcinogenesis
Among the prominent mechanisms of microbial oral carcinogenesis is the bacterial stimulation of chronic inflammation and production of proinflammatory mediators that facilitates cell proliferation, mutagenesis, oncogene activation, and angiogenesis.
Pathogens/pathobionts and their functions: The creation of a sustained dysbiotic proinflammatory environment by periodontal bacteria serves to functionally link periodontal disease and oral cancer. Moreover, traditional periodontal pathogens, such as Porphyromonas gingivalis, Fusobacterium nucleatum, and Treponema denticola, are among the species most frequently identified as being enriched in OSCC, and they possess a number of oncogenic properties. Among the pathogens predictive of OSCC, Porphyromonas, Treponema and Fusobacterium have higher abundances in oral swabs of patients with oral cancer. These organisms share the ability to attack and invade oral epithelial cells, and communicate with the host epithelium, and ultimately acquire phenotypes associated with cancer such as inhibition of apoptosis, increased proliferation, and increased migration of epithelial cells. Additionally, emerging properties of structured bacterial communities may increase oncogenic potential, and consortia of P. gingivalis and F. nucleatum are synergistically pathogenic within in vivo oral cancer models.
Interestingly, some species of oral streptococci can antagonize the phenotypes induced by oral pathogens, indicating functionally specialized roles for commensals and early colonizers in the oral biofilm. A number of top taxa features that are predictive of controls are components of the Viridans streptococci and commensal flora, such as Streptococcus milleri (Gossling, 1988), Actinomyces and Campylobacter concisus. C. concisus is associated with the human oral cavity and has been linked with periodontal lesions, including gingivitis and periodontitis. Clinical studies have linked Streptococcus sp. to both caries progression and early childhood caries. S. anginosus is thought to exist in the mouth as normal flora and to be located mainly in the gingiva and dental plaque, but data from one study strongly implicate S. anginosus infection in the carcinogenesis of head and neck squamous cell carcinoma.
LPS Biosynthesis: Bacterial outer membrane lipopolysaccharides (LPS) mediate proinflammatory immune responses and inflammation in host cells. LPS regulates gene expression of pro-inflammatory cytokines through activation of toll-like receptor 4 (TLR4) via NF-kB. The ‘O antigen’, an extremely polymorphic polysaccharide, binds to Lipid A to form the LPS outer membrane of Gram-negative bacteria, thereby imparting antigenic specificity to the organism. For instance, LPS from Porphyromonas, a positively associated taxon in the OSCC model, is known to activate macrophages and increase NO production of cancer cell lines.
Biofilm and Virulence: The OSCC model identifies a number of functional features associated with bacterial virulence as predictive of oral cancer. CheR are sugar transport and chemotaxis associated KOs respectively present in the oral microbes that are deterministic of virulence and pathogenesis. Cas3, a member of the CRISPR-associated protein (CRISPR-Cas) system, is predictive of OSCC in the model; CRISPR-Cas is important in biofilm formation, acquisition of resistance genes, DNA repair, and regulation of interspecific competition. The following are all features predictive of the oral cancer phenotype in the model: the Tar gene TagA, which is involved in the biosynthesis pathway of poly(ribitol phosphate), with potential involvement in capsular polysaccharide synthesis-mediated virulence; the autolysin regulator LytS; the rscC two-component system, which is involved in capsular polysaccharide synthesis-mediated virulence; and eutL, which is involved in ethanolamine utilization and virulence.
ii. Hydrogen Sulfide Production in OSCC
Sulfide (H2S) Producers and functional activities in OSCC: Hydrogen sulfide (H2S), a gaseous transmitter, is associated with oral periodontitis, is one of the main causes of halitosis, and is generally associated with many oral diseases, including oral cancer. Hydrogen sulfide promotes oral cancer cell proliferation through activation of the COX2, AKT and ERK1/2 pathways in a dose-dependent manner. Hydrogen sulfide and the enzymes that synthesize it, cystathionine β-synthase and cystathionine γ-lyase, are increased in different human malignancies. The expression of both enzymes and cellular H2S levels increase tumor survival and promote tumor dedifferentiation. Among the taxa, members of the Streptococcus anginosus group, Fusobacterium and Porphyromonas endodontalis are known producers of oral H2S. The KO CBS (cystathionine beta-synthase) is implicated in the production of oral H2S. The sulfide-producing bacteria as well as the functional KOs are all positive predictors of OSCC in the model.
iii. Microbial Contribution to Cancer-Specific Energy Metabolism
Sugar metabolism and alternative energy utilization pathways: Cancer cells strongly upregulate glucose uptake, which gives rise to increased pyruvate. Unlike in normal cells, the pyruvate is not coupled to the mitochondrial tricarboxylic acid (TCA) cycle; instead, it is shunted to lactate fermentation and kept away from mitochondrial oxidative metabolism. This shift from oxidative phosphorylation toward aerobic glycolysis, even in the presence of oxygen, is known as the “Warburg effect”. In cancer cells, the Pentose Phosphate Pathway (PPP), together with glycolysis, coordinates glucose flux and supports the cellular biogenesis of macromolecules such as lipids and DNA, as well as energy production. An increased PPP flux in human cancer cells is indicative of its role in meeting the bioenergetic demands of cancer cell proliferation and its contribution to the Warburg effect. Enzymes such as araA (L-arabinose isomerase), involved in pentose interconversion, as well as 6-phospho-beta-glucosidase, involved in sugar metabolism, are positively associated features in the model, suggesting microbial dysregulation of PPP flux in human cancer cells.
Anti-Inflammatory and Antimicrobial mechanism: The commensal bacterium Streptococcus salivarius establishes itself in the human oral cavity a few hours after birth and remains there as a predominant commensal and as a primary colonizer of biofilms. Upon strong adhesion mediated by glycosylated surface-exposed proteins such as SrpA, S. salivarius promotes innate immunity by suppressing proinflammatory cascades as well as by producing anti-microbial substances, such as bacteriocins, that antagonize the virulent streptococci involved in tooth decay or pharyngitis or pathogens involved in periodontitis (Kaci et al 2014). Similarly, Streptococcus gordonii, an early colonizing member of the oral biofilm, produces H2O2 to inhibit the growth of competitors, such as the mutans streptococci, as well as strict anaerobic middle and later colonizers of the dental biofilm. Interestingly, Veillonella species possess a putative catalase gene (catA) that mediates resistance to S. gordonii, thereby enabling direct physical interaction (coaggregation) with S. gordonii as well as Fusobacterium nucleatum that are late colonizers of biofilm. It is interesting to note that Fusobacterium and Veillonella are positive predictors of OSCC.
iv. Protein Fermentation as a Tumorigenic Mechanism
Lysine, Cadaverine metabolism and production pathways: Protein fermentation is a favorable condition in the tumor microenvironment because it results in the accumulation of by-products that cancer cells can use as resources. Polyamines such as putrescine and spermidine are products of microbial protein fermentation and are implicated in cancer initiation and development. Cancer cells accumulate increased concentrations of polyamines by increased uptake via their PTS (Polyamine Transport System) (Palmer et al 2009). Production of amino acids, such as lysine synthesis (LYSN), and enhanced putrescine production pathways (ornithine decarboxylase) are observed and are predictive of the oral cancer phenotype.
Microbial Ammonia production pathways: Cellular protein degradation produces ammonia as a by-product. However, the role of ammonia in cancer cells is still not very clear, as ammonia is not merely a toxic waste product but is recycled into central amino acid metabolism to maximize nitrogen utilization. The ammonia accumulated in the tumor microenvironment was used directly to generate amino acids through GDH activity. These data show that ammonia is not only a secreted waste product, but also a fundamental nitrogen source that can support tumor biomass. Evidence of increased microbial ammonia production is noted from altered narX, glnD, dadA, tenA, and pdxH, which are positively predictive of OSCC.
v. Tox Burden
Exposure to synthetic chemicals such as dyes, organopesticides and pharmaceuticals increases the toxicity burden of cells, which elevates cancer-causing potential in general. Features involved in benzoate degradation and atrazine degradation are detected in the predictive model for OSCC. Further, KOs indicating acetaldehyde production (ncd2, npd nitronate monooxygenase) are also observed to be predictive of oral cancer.
vi. Antibiotic Resistance
Antibiotic resistance and drug efflux: Microbes such as Streptococcus milleri (Han 2001), Prevotella and Fusobacterium species, which are known to show antibiotic resistance, are predictive of the oral cancer phenotype in the model. Fusobacterium nucleatum promoted chemoresistance in CRC via the TLR4/NF-κB pathway. Further, other model-predicted features, mdtB (a multidrug efflux pump) and eptA (via LPS modification), may also contribute to antibiotic resistance.
Porphyromonas, and
Fusobacterium,
Streptococcus cristatus,
Streptococcus milleri,
Streptococcus anginosus
Porphyromonas endodontalis,
Streptococcus milleri,
Streptococcus cristatus, eptA
Fusobacterium and
Porphyromonas endodontalis,
Streptococcus,
Fusobacterium nucleatum
Diagnostic methods described herein can be used to screen subjects for further testing or for definitive diagnosis. The current standard of care for OSCC screening and diagnosis relies on a physical exam by a healthcare provider and identification of lesion(s), followed by imaging, invasive biopsy and histopathological evaluation. For oral cancer, the most common type of biopsy is an incisional biopsy, which is regarded as the ‘Gold Standard’ for oral cancer diagnosis. A small piece of tissue is cut from the area that appears to be abnormal. A biopsy can be completed in an outpatient setting or the doctor's office if the abnormal tissue is sufficiently accessible and small in location and depth. While imaging scans may be completed as part of the diagnostic process, the images are intended to direct the biopsy.
Accordingly, a subject can be screened for oral cancer using the methods described herein. A subject who is inferred to have oral cancer by such methods can then be subject to more definitive diagnosis by other standard methods. So, for example, for such a subject, a provider can perform imaging (e.g., to determine the extent of the lesion), biopsy (e.g., incisional biopsy) and histological preparation (e.g., fixing the tissue, sectioning the tissue, staining the tissue) in the process of making a more definitive diagnosis.
A subject inferred to have oral cancer by the methods disclosed herein may need a therapeutic intervention. Provided herein are methods of treating a subject determined, by the methods disclosed herein, to have an oral cancer with a therapeutic intervention effective to treat the condition.
As used herein, the terms “therapeutic intervention”, “therapy” and “treatment” refer to an intervention that produces a therapeutic effect (e.g., treats) a pathological condition. A therapeutic effect is one that ameliorates, prevents, slows the progression of, delays the onset of symptoms of, improves the condition of (e.g., causes remission of), improves symptoms of, or cures a pathological condition, such as oral cancer.
As used herein, the term “effective” as modifying a therapeutic intervention or treatment (e.g., “therapeutic intervention effective to treat” or “an effective therapeutic intervention”) or an amount of a pharmaceutical drug, supplement or food (e.g., “amount effective to treat” or “an effective amount”), refers to a therapeutic intervention or amount sufficient to produce a therapeutic effect. For example, for a given parameter, a therapeutic intervention effective to treat a condition will show an increase or decrease in the parameter of at least 5%, 10%, 15%, 20%, 25%, 40%, 50%, 60%, 75%, 80%, 90%, or at least 100%. Therapeutic efficacy can also be expressed as a “-fold” increase or decrease. For example, a therapeutically effective amount can have at least a 1.2-fold, 1.5-fold, 2-fold, 5-fold, or more effect over a control.
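The percentage and “-fold” expressions of efficacy above reduce to simple arithmetic, sketched below. The function names and the example values are illustrative assumptions; the 5% default is the smallest threshold listed above, and which parameter is measured is left open.

```python
def percent_change(baseline: float, treated: float) -> float:
    """Signed percent change of a measured parameter relative to baseline."""
    if baseline == 0:
        raise ValueError("baseline must be non-zero")
    return (treated - baseline) / abs(baseline) * 100.0


def fold_change(control: float, treated: float) -> float:
    """Ratio of the treated value to the control value."""
    if control == 0:
        raise ValueError("control must be non-zero")
    return treated / control


def meets_threshold(baseline: float, treated: float, min_percent: float = 5.0) -> bool:
    """True if the parameter increased or decreased by at least min_percent."""
    return abs(percent_change(baseline, treated)) >= min_percent


# Hypothetical measurements, for illustration only.
print(percent_change(200.0, 150.0))   # a decrease of one quarter
print(fold_change(10.0, 15.0))        # treated is 1.5 times control
print(meets_threshold(100.0, 97.0))   # a 3% change is below the 5% threshold
```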
A therapeutic intervention can include, for example, surgical removal of cancerous tissue; administration of a chemotherapeutic agent; and administration of a dietary supplement, a food ingredient, or a food that diminishes a dysbiosis in the oral microbiome of the subject associated with the cancer, any of which can alleviate the cancer or its symptoms.
A therapeutic intervention can include, for example, administration of a treatment, administration of a pharmaceutical, or a biologic or nutraceutical substance with therapeutic intent. The response to a therapeutic intervention can be complete or partial. In some aspects, the severity of disease is reduced by at least 10%, as compared, e.g., to the individual before administration or to a control individual not undergoing treatment. In some aspects the severity of disease is reduced by at least 25%, 50%, 75%, 80%, or 90%, or in some cases, no longer detectable using standard diagnostic techniques.
Treatments can include administration of therapeutic interventions to re-balance the microbiome toward a taxonomic and/or functional biomarker profile associated with absence of cancer (e.g., associated with health). Such interventions can include administration of therapeutic compositions that reduce the taxa or proteins over-represented in oral cancer and/or encourage the growth of taxa or expression of proteins under-represented in oral cancer. For example, to the extent inflammation is associated with cancer, taxa and gene functions that promote inflammation may be re-balanced toward normal. For example, certain Gram-negative bacteria or production of lipopolysaccharide have been recognized as pro-inflammatory, while certain Clostridia or butyrate producing proteins have been recognized as anti-inflammatory.
One method involves increasing the abundance of an under-represented taxon. This can be achieved by directly providing taxon-specific nutrients to enhance its growth, providing substrates to other taxa that cross-feed the taxon of interest, reducing competing taxa that may inhibit the growth or sequester the nutrients from the taxon of interest, or providing the taxon of interest in the form of a probiotic.
Another method involves reducing the abundance of an over-represented taxon. This can be achieved by depriving the taxon of nutrients, targeting it with bacteriophages, targeting it with the immune system (for example with IgA or IgG antibodies), targeting it with small molecules, increasing the abundance of competing taxa, or reducing the abundance of cross-feeding taxa.
Another method involves reducing the abundance of a microbial function, that is, activity of a KO or a pathway (e.g., a function of Table 5). This can be achieved by reducing the taxon that is expressing the function, reducing the gene expression of the protein(s) involved in the function (by regulatory mechanisms or removal of the substrate), inhibition of the function, or stimulation of the redundant pathways (in the same taxon or another).
Another method involves increasing the abundance of a microbial function, that is, activity of a KO or a pathway (e.g., a function of Table 5). This can be achieved by increasing the taxon that is expressing the function, increasing the gene expression of the protein(s) involved in the function (by regulatory mechanisms or provision of the substrate), stimulation of the function (allosteric effects, post-transcriptional modification), or inhibition of the redundant pathways (in the same taxon or another).
Another method involves preventing the interactions between microorganisms or their molecules (metabolites, nucleic acids, proteins) and human tissue that may support cancer onset or progression. This can be achieved by maintaining a healthy mucosal barrier, reducing inflammation, avoiding detergents in food, avoiding alcohol, avoiding mouthwash, reducing taxa that consume the mucus, increasing the abundance of the taxa that stimulate mucus production, or inhibiting human molecules that respond to microbial stimuli.
Another method involves enhancing the interactions between microorganisms or their molecules (metabolites, nucleic acids, proteins) and human tissue that may inhibit cancer onset or progression. This can be achieved by increasing the expression of the human genes that respond to microbial stimuli, increasing microbial taxa or functions, increasing mucus-consuming taxa, or increasing the permeability of mucus.
In certain embodiments, after inferring presence of oral cancer in a subject and, optionally, a stage of cancer, the subject is provided with a therapeutic intervention to treat the cancer. Therapeutic interventions for oral cancer include, for example, surgery to remove the cancerous tissue, radiation therapy, chemotherapy, dietary changes, nutritional supplements and combinations of these. Examples include prebiotics (fibers, other molecules), probiotics, bacteriophages, and natural and synthetic small molecules. Providing a therapeutic intervention can include delivering to the subject a package containing a therapeutic composition, e.g., a drug, a food or a dietary supplement. Delivery can be, for example, by common carrier, such as a national postal system, or a private courier service, such as FedEx, UPS, or DHL.
The therapeutic intervention can include administration to a subject of a probiotic in an amount sufficient to balance a dysbiosis in the subject. For example, described herein are microbial taxa that are over-represented or under-represented in oral cancer compared to normal. The therapeutic intervention can include administering to the subject the microbes that are under-represented, or one or more microbes other than those over-represented, in order to re-balance the microbiome toward a healthy profile.
Models provided herein can be executed by programmable digital computer.
The CPU 9905 can execute a sequence of machine-readable instructions, which can be embodied in a program or software. The instructions may be stored in a memory location, such as the computer readable memory 9910. The instructions can be directed to the CPU 9905, which can subsequently program or otherwise configure the CPU 9905 to implement methods of the present disclosure.
The storage unit 9915 can store files, such as drivers, libraries and saved programs. The storage unit 9915 can store user data, e.g., user preferences and user programs. The computer system 9901 in some cases can include one or more additional data storage units that are external to the computer system 9901, such as located on a remote server that is in communication with the computer system 9901 through an intranet or the Internet.
The computer system 9901 can communicate with one or more remote computer systems through the network 9930.
Methods as described herein can be implemented by way of machine (e.g., computer processor) executable code stored on an electronic storage location of the computer system 9901, such as, for example, on the computer readable memory 9910 or electronic storage unit 9915. The machine executable or machine-readable code can be provided in the form of software. During use, the code can be executed by the processor 9905. In some cases, the code can be retrieved from the storage unit 9915 and stored on the memory 9910 for ready access by the processor 9905. In some situations, the electronic storage unit 9915 can be precluded, and machine-executable instructions are stored on memory 9910.
Machine-executable code can be stored on an electronic storage unit, such as memory (e.g., read-only memory, random-access memory, flash memory) or a hard disk. “Storage” type media can include any or all of the tangible memory of the computers, processors or the like, or associated modules thereof, such as various semiconductor memories, tape drives, disk drives and the like, which may provide non-transitory storage at any time for the software programming. All or portions of the software may at times be communicated through the Internet or various other telecommunication networks.
The computer system 9901 can include or be in communication with an electronic display 9935 that comprises a user interface (UI) 9940 for providing, for example, input parameters for methods described herein. Examples of UIs include, without limitation, a graphical user interface (GUI) and web-based user interface.
Processes described here can be performed using one or more computer systems that can be networked together. Calculations can be performed in a cloud computing system in which data on the host computer is communicated through the communications network to a cloud computer that performs computations and that communicates or outputs results to a user through a communications network. For example, nucleic acid sequencing can be performed on sequencing machines located at a user site. The resulting sequence data files can be transmitted to a cloud computing system where the sequence classification algorithm performs one or more operations of the methods described herein. At any step, the cloud computing system can transmit results of calculations back to the computer operated by the user.
Data can be transmitted electronically, e.g., over the Internet. Electronic communication can be, for example, over any communications network including, for example, a high-speed transmission network including, without limitation, Digital Subscriber Line (DSL), Cable Modem, Fiber, Wireless, Satellite, and Broadband over Powerlines (BPL). Information can be transmitted to a modem for transmission, e.g., wireless or wired transmission, to a computer such as a desktop computer. Alternatively, reports can be transmitted to a mobile device. Reports may be accessible through a subscription program in which a user accesses a website which displays the report. Reports can be transmitted to a user interface device accessible by the user. The user interface device could be, for example, a personal computer, a laptop, a smart phone or a wearable device, e.g., a watch, for example worn on the wrist.
Inference models as described herein can be executed on subject data to produce a predicted oral cancer state and/or recommendations for therapeutic intervention. In one embodiment, after making an inference about a state of oral cancer, the method can comprise developing a model for therapeutic intervention in the subject. The model can comprise, for example, pharmaceutical compositions to administer to the subject to treat the condition. Such a model can be communicated to the subject, for example, by transmitting the model and, optionally, the diagnosis, to a user interface of a personal computing device of the subject.
Inferences on a subject's cancer state and/or recommendations for therapeutic intervention can be provided to subjects through an Internet website. A website can be provided which can be accessed by a subject, e.g. a customer, through a password-protected portal. The website can include a clickable icon. Upon clicking the icon, the subject can receive personalized food recommendations. Such inferences and/or recommendations can be displayed on a webpage connected to the clickable icon. The subject can receive, from an Internet-connected server, notification that inferences and/or recommendations for the subject are available.
After wellness/therapeutic interventions are implemented, the effect of these interventions on the subject's condition can be remeasured. Such remeasurements can be used to generate updated inferences and/or recommendations as described herein.
A subject's saliva sample is collected in a sample collection and transport kit. The kit includes a saliva collection device that consists of three injection-molded polypropylene components:
Prior to sample collection, the saliva sample collection and transport device has an ambient temperature stability of 12 months. Saliva is deposited into the funnel at the top of the tube. The tube contains a 1.2 mL graduation on the outside wall to ensure an appropriate amount of saliva is collected. Patients are instructed to deposit saliva at least up to the 1.2 mL mark (saliva+preservative). The lab process requires a minimum of 175 µL (saliva+preservative). Once sufficient saliva is collected, the funnel is turned counterclockwise, which removes the stem and releases the RNA stabilizer into the tube.
Patients are instructed to cap the tube and shake thoroughly to mix the RNA stabilizer, which preserves RNA in the sample at room temperature for at least 28 days. The secondary container is then placed in a return mailer that further protects the sample.
The RNA stabilizer (1.2 mL per tube) is a commercial product called DNA/RNA Shield from Zymo Research. Note: this same stabilizer is used in Zymo Research's 510(k)-cleared collection device (K202641). This solution both inactivates pathogens and preserves RNA at ambient temperature for prolonged periods without cold-chain. The manufacturer states that “DNA/RNA Shield” viral transport solution has been demonstrated to inactivate Ebola, Influenza, and Herpes Simplex viruses while preserving the integrity of the RNA and DNA for subsequent molecular detection.
Saliva Sample Processing
Once the sample arrives at the laboratory, the lab visually inspects the tube integrity and approximate volume of the specimen to ensure it is adequate for processing. Each specimen is logged into a LIMS and, if more than 1 mL is available, it is split into aliquots, with any extra aliquots (beyond the one for testing) stored at −80° C. in case repeat testing is necessary (e.g., in the case of an invalid result). The specimen (either fresh or after thawing from −80° C.) is then lysed to release its contents using bead beating in a chemical denaturant. This step is performed using the MPBio FastPrep 24 instrument. The lysed specimen is centrifuged at 12,000 rpm for 3 minutes to clarify the lysate. Clarified lysate is transferred to a plate format and diluted with water (1:1).
Total RNA is extracted from clarified lysate using a modified mirVana protocol, which includes on-bead DNA removal by DNase. Total RNA is quantified using the RiboGreen kit, and up to 250 ng of total RNA is transferred to a new plate. Bacterial and human rRNAs are physically removed from the specimen using a subtractive hybridization method. Biotinylated DNA probes complementary to rRNAs are hybridized to the total RNA in a proprietary hybridization buffer. The probe-rRNA complexes are bound to streptavidin magnetic beads. The beads are removed from the solution with a magnet. The remaining RNAs, found in the supernatant, are aspirated and used downstream. Finally, the remaining RNAs are converted into Illumina sequencing libraries using a template-switching mechanism with random hexamers for the reverse transcription step.
The patient samples are run using a 96 well tray. To prepare the RNA samples for this high-throughput analysis, each specimen is barcoded with 11 bp dual unique molecular barcodes. During barcoding, PCR is performed with a limited number of cycles and limited primer amounts, leading to an equimolar concentration of each sample library at the end of PCR (due to exhaustion of the primers). Sample libraries are pooled by mixing equal volumes. Sample library pools are purified using AMPure XP beads, which remove buffer components and unincorporated nucleotides. Concentration of each sample library pool is determined using the Qubit 2.0 method with high sensitivity DNA kits.
Sample library pools are sequenced on Illumina NovaSeq 6000 to produce sequencing data.
The raw sequencing data from each flowcell is demultiplexed into FASTQ files corresponding to individual samples, and each sample's sequencing reads are then subjected to quality control steps. The quality control passing criteria include a minimum of 1 million reads and 50 strain-level taxa per sample. The remaining high-quality paired-end reads are used for detection and quantification of human genes, microbial taxonomies and microbial functions.
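The quality control criteria stated above can be expressed as a simple per-sample check. The sketch below assumes that both thresholds are inclusive and that read and taxa counts have already been computed upstream; the class name `SampleStats`, its fields, and the sample identifiers are hypothetical.

```python
from dataclasses import dataclass
from typing import List

MIN_READS = 1_000_000      # minimum reads per sample (from the stated criteria)
MIN_STRAIN_TAXA = 50       # minimum strain-level taxa per sample


@dataclass
class SampleStats:
    sample_id: str
    read_count: int
    strain_taxa_detected: int


def passes_qc(stats: SampleStats) -> bool:
    """Return True when a sample meets both thresholds (assumed inclusive)."""
    return (stats.read_count >= MIN_READS
            and stats.strain_taxa_detected >= MIN_STRAIN_TAXA)


def filter_samples(samples: List[SampleStats]) -> List[SampleStats]:
    """Keep only the samples that pass quality control."""
    return [s for s in samples if passes_qc(s)]


# Hypothetical samples, for illustration only.
batch = [
    SampleStats("S1", 2_500_000, 120),
    SampleStats("S2", 800_000, 95),     # too few reads
    SampleStats("S3", 1_200_000, 30),   # too few strain-level taxa
]
kept = filter_samples(batch)
print([s.sample_id for s in kept])
```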
For human gene (HG) detection, paired-end reads are mapped to the human genome. Gene expression levels are computed by aggregating transcripts per million estimates per gene using an approach based on Salmon version 1.1.0 (Patro et al., 2017). For taxonomic classification, reads are mapped to a custom catalog derived from genomic sequences from all domains of the phylogenetic tree, namely, bacteria, archaea, eukaryota, and viruses. Taxonomies are identified and their relative activities are calculated at three different taxonomic ranks (genus, species, and strain). To identify and quantify transcriptionally active genes in the microbial community, functional assignments (KOs) are obtained through alignment of the sequencing reads to another custom catalog of genes (derived from the Integrated non-redundant Gene Catalog of the human gut microbiome (IGC), among others) and the Kyoto Encyclopedia of Genes and Genomes (KEGG) databases.
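As a minimal sketch of the two quantities described above, the following shows gene-level aggregation of transcript-level TPM estimates and a relative-activity calculation over taxa. It is not Salmon's implementation; the transcript-to-gene map, the identifiers, and the choice to define relative activity as a fraction of total assigned reads are assumptions made for illustration.

```python
from collections import defaultdict
from typing import Dict


def aggregate_tpm_by_gene(transcript_tpm: Dict[str, float],
                          transcript_to_gene: Dict[str, str]) -> Dict[str, float]:
    """Sum transcript-level TPM estimates into one value per gene."""
    gene_tpm: Dict[str, float] = defaultdict(float)
    for transcript, tpm in transcript_tpm.items():
        gene = transcript_to_gene.get(transcript)
        if gene is not None:          # transcripts without a gene mapping are skipped
            gene_tpm[gene] += tpm
    return dict(gene_tpm)


def relative_activity(read_counts: Dict[str, float]) -> Dict[str, float]:
    """Express each taxon's assigned reads as a fraction of all assigned reads."""
    total = sum(read_counts.values())
    if total == 0:
        return {taxon: 0.0 for taxon in read_counts}
    return {taxon: count / total for taxon, count in read_counts.items()}


# Hypothetical identifiers and values, for illustration only.
tpm = {"tx1": 10.0, "tx2": 5.0, "tx3": 2.5}
tx2gene = {"tx1": "GENE1", "tx2": "GENE1", "tx3": "GENE2"}
genes = aggregate_tpm_by_gene(tpm, tx2gene)
taxa = relative_activity({"taxon_a": 30, "taxon_b": 10})
print(genes, taxa)
```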
The identified and quantified HGs, species and KOs for a given sample are then provided to the OSCC classifier, which classifies the sample as belonging to the “OSCC class” or the “Not OSCC class” within pre-specified performance criteria.
The final model produced from our V128 BDR model development protocol, which was validated on an independent sample set, encapsulates the following features:
Total number of features: 270
Number of Human Gene features: 88
Number of Species features: 110
Number of KO features: 72
The particular features are provided in Tables 2, 3 and 4.
Bioinformatics
Sequenced data is processed through a cloud-based bioinformatics pipeline and an OSCC classifier.
For developing a model for OSCC classification, the following steps were performed:
1. Following sample processing, perform a data quality check for effective sequencing depth, and preprocess the sample data by normalizing, computing relative abundance, and removing low prevalence genes;
2. Set up the algorithmic experiments with various combinations of feature sets and hyperparameters;
3. Perform a grid search algorithm by fitting logistic regression models for each feature set and hyperparameter set, cross-validating on the hyperparameter space, and selecting hyperparameter sets that meet the minimum performance criteria;
4. Select the final hyperparameter set based on all relevant performance criteria, and re-train a final model with all available samples.
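Steps 2 and 3 above can be sketched as follows. This is a minimal illustration under stated assumptions, not the development protocol itself: the logistic regression is a small gradient-descent implementation written for the example, the data are synthetic, the folds are interleaved rather than randomized, and the feature-set names, L2 grid and minimum criteria (`MIN_SENS`, `MIN_SPEC`) are hypothetical.

```python
import math
from itertools import product


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_logreg(X, y, l2, lr=0.1, epochs=200):
    """L2-regularized logistic regression fitted by batch gradient descent."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        gw, gb = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j in range(d):
                gw[j] += err * xi[j]
            gb += err
        w = [wj - lr * (g / n + l2 * wj) for wj, g in zip(w, gw)]
        b -= lr * gb / n
    return w, b


def predict(model, X):
    """Classify each row as 1 when the predicted probability is at least 0.5."""
    w, b = model
    return [int(sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) >= 0.5) for xi in X]


def sens_spec(y_true, y_pred):
    """Sensitivity and specificity; assumes both classes are present."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    pos = sum(y_true)
    return tp / pos, tn / (len(y_true) - pos)


def cross_validate(X, y, cols, l2, k=5):
    """Mean sensitivity and specificity over k interleaved folds."""
    Xs = [[row[c] for c in cols] for row in X]
    sens, spec = [], []
    for fold in range(k):
        train = [i for i in range(len(y)) if i % k != fold]
        test = [i for i in range(len(y)) if i % k == fold]
        model = fit_logreg([Xs[i] for i in train], [y[i] for i in train], l2)
        se, sp = sens_spec([y[i] for i in test], predict(model, [Xs[i] for i in test]))
        sens.append(se)
        spec.append(sp)
    return sum(sens) / k, sum(spec) / k


# Synthetic data: column 0 tracks the label, column 1 is unrelated to it.
y = [i % 2 for i in range(40)]
X = [[(2 * y[i] - 1) * (1 + 0.1 * (i % 3)), 1.0 if (i // 2) % 2 == 0 else -1.0]
     for i in range(40)]

feature_sets = {"informative": (0,), "noise": (1,), "both": (0, 1)}  # hypothetical
l2_grid = [0.0, 0.01, 0.1]                                           # hypothetical
MIN_SENS, MIN_SPEC = 0.9, 0.9                                        # assumed criteria

results = []
for (name, cols), l2 in product(feature_sets.items(), l2_grid):
    se, sp = cross_validate(X, y, cols, l2)
    results.append({"features": name, "l2": l2, "sensitivity": se, "specificity": sp})

# Keep only the hyperparameter sets that meet the minimum performance criteria.
passing = [r for r in results
           if r["sensitivity"] >= MIN_SENS and r["specificity"] >= MIN_SPEC]
```

Each entry of `passing` corresponds to one feature set and hyperparameter set that would be carried forward to the final selection in step 4.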
The classification algorithm was developed and trained on saliva specimens from 945 patients (80 OSCC Positive, 48 OPMD Positive, 12 OPC Positive, and 805 OSCC negative). The OSCC Positive cases were collected from a secondary care center (University Hospital). The patient data also included histopathology reports from pathologists and oncologists, spanning early and late stage OSCC. The 805 OSCC negative samples were obtained from a combination of primary care centers (which use the previously described standard of care techniques) and individuals self-reporting their cancer status based on their primary care provider's assessment.
In development, numerous different combinations of features (e.g., human genes, microbes) were interrogated to determine which had the best performance. The trained algorithm (or model) was considered to have passed the testing phase if it was able to classify the testing dataset correctly for at least 90% (sensitivity) of the test samples. The performance characteristics of the model (accuracy, specificity, sensitivity, etc.) were then computed using the results from the known test dataset.
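The performance characteristics named above follow from the counts of true and false positives and negatives. The sketch below is illustrative only: the labels are invented, and the 90% sensitivity threshold is the pass criterion stated in the text.

```python
def performance(y_true, y_pred):
    """Return accuracy, sensitivity and specificity for binary labels (1 = positive)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "sensitivity": tp / (tp + fn),  # fraction of positives detected
        "specificity": tn / (tn + fp),  # fraction of negatives correctly ruled out
    }


# Invented test labels: 10 positive and 10 negative samples.
y_true = [1] * 10 + [0] * 10
y_pred = [1] * 9 + [0] * 1 + [0] * 8 + [1] * 2

metrics = performance(y_true, y_pred)
passed_testing = metrics["sensitivity"] >= 0.90  # pass criterion from the text
```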
Out of the 93 hyperparameter sets (models) that met the performance constraints, the cross-validation performance was inspected, including ROC-AUC, sensitivity, specificity and the variance of the performance metrics. Viome selected the model that had the highest performance score, defined as the sum of average CV sensitivity and average CV specificity, among the models trained on a feature set containing human genes. The locked-down model for the independent validation contains a total of 270 features, which are used by the classifier for determining the preliminary OSCC status.
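The selection rule described above can be sketched as follows. The score definition (average CV sensitivity plus average CV specificity) and the restriction to feature sets containing human genes come from the text; the `Candidate` record, the feature-type labels and the numeric values are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    name: str                  # hypothetical model label
    feature_types: frozenset   # e.g. {"human_genes", "species", "kos"}
    cv_sensitivity: float      # average cross-validation sensitivity
    cv_specificity: float      # average cross-validation specificity

    @property
    def score(self) -> float:
        """Performance score as defined in the text."""
        return self.cv_sensitivity + self.cv_specificity


def select_model(candidates, required="human_genes"):
    """Pick the highest-scoring candidate whose feature set includes `required`."""
    eligible = [c for c in candidates if required in c.feature_types]
    if not eligible:
        raise ValueError("no candidate uses the required feature type")
    return max(eligible, key=lambda c: c.score)


# Invented candidates, not results from the study.
candidates = [
    Candidate("A", frozenset({"species", "kos"}), 0.95, 0.96),
    Candidate("B", frozenset({"human_genes", "species", "kos"}), 0.93, 0.94),
    Candidate("C", frozenset({"human_genes", "species"}), 0.91, 0.92),
]
chosen = select_model(candidates)
```

In this invented example candidate A has the highest score but is not eligible because it lacks human gene features, so candidate B is chosen.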
Once the model passed the testing phase, the trained classification model was able to take as input the data from an unknown sample and classify it as belonging to the “Oral Cancer class” or the “Not Oral Cancer class” within the desired performance characteristics. At that point, the machine-learnt model is considered to have learned the key properties (or “patterns”) corresponding to Oral Cancer within the training dataset.
The model was validated using saliva samples from 157 subjects (20 OSCC Positive and 137 OSCC Negative).
OSCC Classifier—Molecular Signature
The OSCC Classifier is a model derived from 270 features that included 88 human gene features and 182 microbial features (110 species and 72 KO). The specific features are listed in Tables 2, 3 and 4. This set of 270 features is collectively called the “molecular signature” of patients likely to have OSCC. The features in this molecular signature are associated with molecular processes associated with the biology of cancer.
The 88 human genes have a statistically significant overlap with several cancer hallmark genesets such as interferon Gamma, interferon Alpha, KRAS signaling and p53 pathways, with an analysis done via a Gene Set Enrichment Analysis (GSEA) tool. GSEA analysis relies on the enrichment score as the maximum deviation from zero encountered in the random walk; it corresponds to a weighted Kolmogorov-Smirnov-like statistic to compute the overlaps of a curated set from a Molecular Signatures Database (MSigDB) to a new set of genes originating from a new study. MSigDB is a collection of annotated gene sets divided into major collections, representing a universe of biological processes and pathways which are meaningful for insightful interpretation, each based on published experimental findings. This analysis, detailed in Table 5 and
The 182 microbial features (110 species and 72 KOs listed in Tables 3 and 4) are also collectively consistent with the evidence from a modified polymicrobial synergy and dysbiosis model for bacterial involvement in OSCC. Table 5 and
Gene set enrichment analysis was performed to compute the overlap between the gene set found in our model consisting of 88 genes and the MSigDB which is a curated collection of over 30,000 gene sets.
The Molecular Signatures Database (MSigDB) is a collection of annotated gene sets for use with gene set enrichment analysis (GSEA) software (worldwideweb site: https://gsea-msigdb.org/gsea/msigdb/index.jsp). This method and the accompanying software focus on groups of genes (genesets) that share a common biological function, location or regulation aspects. GSEA analysis relies on the enrichment score as the maximum deviation from zero encountered in the random walk; it corresponds to a weighted Kolmogorov-Smirnov-like statistic to compute the overlaps of a curated set from MSigDB to a new set of genes originating from a new study. In this manner, we are able to compare a list of genes in our oral cancer study with 31,117 gene sets (divided into 9 major collections) in the MSigDB [Liberzon, 2011]. MSigDB represents a universe of biological processes and pathways which are meaningful for insightful interpretation, each based on published experimental findings.
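The enrichment score described above, namely the maximum deviation from zero of a weighted running sum over a ranked gene list, can be sketched as follows. This follows the commonly published GSEA formulation and is an assumption about the form of the statistic; it is not the GSEA software itself, and the gene names, ranking scores and weighting exponent `p` are hypothetical.

```python
def enrichment_score(ranked_genes, scores, gene_set, p=1.0):
    """Return the running-sum value with the largest absolute deviation from zero.

    ranked_genes: genes ordered by a ranking metric (most positive first)
    scores:       ranking metric for each gene, in the same order
    gene_set:     set of genes to test for enrichment
    p:            weighting exponent (p = 0 gives the unweighted statistic)
    """
    hits = [g in gene_set for g in ranked_genes]
    n_miss = len(ranked_genes) - sum(hits)
    norm_hit = sum(abs(s) ** p for s, h in zip(scores, hits) if h)
    if norm_hit == 0 or n_miss == 0:
        raise ValueError("gene set must contain some, but not all, ranked genes")
    running, best = 0.0, 0.0
    for s, h in zip(scores, hits):
        # Step up by the weighted score on a hit, down by a constant on a miss.
        running += abs(s) ** p / norm_hit if h else -1.0 / n_miss
        if abs(running) > abs(best):
            best = running
    return best


# Invented ranked list, not data from the study.
ranked = ["G1", "G2", "G3", "G4", "G5", "G6"]
metric = [3.0, 2.0, 1.0, -1.0, -2.0, -3.0]

es_top = enrichment_score(ranked, metric, {"G1", "G2"})     # set at the top of the list
es_bottom = enrichment_score(ranked, metric, {"G5", "G6"})  # set at the bottom of the list
```

A gene set concentrated at the top of the ranking yields a positive score and one concentrated at the bottom yields a negative score.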
1. A method comprising:
a) providing a biological sample from a subject comprising mouth-sourced cells;
b) sequencing nucleic acids from the sample to produce sequence information;
c) determining, from the sequence information, (1) measures of activity of one or more microbial taxa, (2) measures of activity of one or more microbial gene orthologs, and/or (3) measures of activity of one or more somatic cell genes of the subject, wherein the one or more measures are included in a feature set;
d) executing by computer a classification model that infers, from one or more features in the feature set, a state of oral cancer in the subject.
2. The method of embodiment 1, wherein the biological sample comprises saliva.
3. The method of embodiment 1, wherein the biological sample comprises microbial cells and host cells.
4. The method of embodiment 1, wherein the subject is a human.
5. The method of embodiment 1, wherein the subject is over 50 years of age or has a history of tobacco use.
6. The method of embodiment 1, wherein the mouth-sourced cells comprise an oral microbiome and, optionally, somatic cells from the subject.
7. The method of embodiment 6, wherein the somatic cells from the subject comprise cells selected from cheek cells, gum cells and tongue cells.
8. The method of embodiment 1, wherein the nucleic acids sequenced comprise mRNA and the sequence information comprises metatranscriptomic information.
9. The method of embodiment 1, wherein the feature set used by the classification algorithm includes at least: (1) measures of activity of one or more microbial taxa.
10. The method of embodiment 9, wherein the feature set used by the classification algorithm further includes: (2) measures of activity of one or more microbial gene orthologs.
11. The method of embodiment 10, wherein the feature set used by the classification algorithm further includes: (3) measures of activity of one or more host somatic cell genes.
12. The method of embodiment 1, wherein the feature set used by the classification algorithm includes at least two of: (1) measures of activity of one or more microbial taxa, (2) measures of activity of one or more microbial gene orthologs, or (3) measures of activity of one or more somatic cell genes of the subject.
13. The method of embodiment 1, wherein the classification model uses one or more features selected from the features of Table 1.
14. The method of embodiment 1, wherein the classification model uses at least, exactly or no more than any of 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119, 120, 121, 122, 123, 124, 125, 126, 127, 128, 129, 130, 131, 132, 133, 134, 135, 136, 137, 138, 139, 140, 141, 142, 143, 144, 145, 146, 147, 148, 149, 150, 151, 152, 153, 154, 155, 156, or 157 of the features selected from the features of Table 1.
15. The method of embodiment 1, wherein the classification model uses at least, exactly or no more than any of 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, or 17 of the features selected from: Actinobaculum sp. oral taxon 183, Actinomyces massiliensis, Actinomyces sp. oral taxon 448, Alloscardovia omnicolens, Selenomonas sp. CM52, Mycoplasma salivarium, Parvimonas sp. oral taxon 110, Rothia sp. HMSC062H08, K01697, K12452, Actinomyces johnsonii, Prevotella loescheii, Streptococcus cristatus, Streptococcus sobrinus, Streptococcus sp. HPH0090, Tannerella forsythia, and K02909.
16. The method of embodiment 15, wherein the features of Table 1 include one or more microbial taxa features and/or one or more gene ortholog features.
17. The method of embodiment 15, wherein the features of Table 1 include one or more positively associated features and/or one or more negatively associated features.
18. The method of embodiment 1, wherein the classification model uses only features selected from the features of Table 1.
19. The method of embodiment 1, wherein the feature set used by the classification algorithm includes at least 30, at least 50, at least 100, at least 200 or all of the features selected from Tables 2, 3 or 4.
20. The method of embodiment 19, wherein the feature set used by the classification algorithm includes at least 10 microbial taxa features, at least 10 microbial gene ortholog features and at least 10 host cell gene features.
21. The method of embodiment 19, wherein the feature set used by the classification algorithm further includes: mechanism feature, a toxic burden feature (3) measures of activity of one or more host somatic cell genes.
22. The method of embodiment 19, wherein the features of Table 1 include one or more microbial taxa features and/or one or more gene ortholog features.
23. The method of embodiment 19, wherein the features of Table 1 include one or more positively associated features and/or one or more negatively associated features.
24. The method of embodiment 1, wherein the classification model uses only features selected from the features of Tables 2, 3 and 4.
25. The method of embodiment 1, wherein the classification model uses at least, exactly or no more than any of 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119, 120, 121, 122, 123, 124, 125, 126, 127, 128, 129, 130, 131, 132, 133, 134, 135, 136, 137, 138, 139, 140, 141, 142, 143, 144, 145, 146, 147, 148, 149, 150, 151, 152, 153, 154, 155, 156, 157, 158, 159, 160, 161, 162, 163, 164, 165, 166, 167, 168, 169, 170, 171, 172, 173, 174, 175, 176, 177, 178, 179, 180, 181, 182, 183, 184, 185, 186, 187, 188, 189, 190, 191, 192, 193, 194, 195, 196, 197, 198, 199, 200, 201, 202, 203, 204, 205, 206, 207, 208, 209, 210, 211, 212, 213, 214, 215, 216, 217, 218, 219, 220, 221, 222, 223, 224, 225, 226, 227, 228, 229, 230, 231, 232, 233, 234, 235, 236, 237, 238, 239, 240, 241, 242, 243, 244, 245, 246, 247, 248, 249, 250, 251, 252, 253, 254, 255, 256, 257, 258, 259, 260, 261, 262, 263, 264, 265, 266, 267, 268, 269, or 270 of the features selected from the features of Tables 2, 3 or 4.
26. The method of embodiment 1, wherein the feature set used by the classification algorithm includes one or more features selected from a pro-inflammatory activity feature, a hydrogen sulfide production activity feature, a microbial contribution to cancer-specific energy metabolism feature, a protein fermentation as a tumorigenic mechanism feature, a toxic burden feature, and a microbial antibiotic resistance in tumorigenesis feature.
27. The method of embodiment 26, wherein the selected features are from Table 5.
28. The method of embodiment 1, wherein the feature set used by the classification algorithm includes one or more features selected from a geneset of any of
29. The method of embodiment 1, wherein the feature set used by the classification algorithm includes an activity of microbial taxon or one or more taxa of
30. The method of embodiment 1, wherein the feature set used by the classification algorithm includes an activity of one or more microbial gene orthologs of
31. The method of embodiment 1, wherein the cancer is oral squamous cell carcinoma (“OSCC”).
32. The method of embodiment 31, wherein the inference is “likely presence of OSCC” or “unlikely presence of OSCC.”
33. The method of embodiment 1, wherein the oral cancer is selected from squamous cell carcinoma, verrucous carcinoma, minor salivary gland carcinoma, lymphoma, benign oral cavity tumor and basal cell carcinoma.
34. The method of embodiment 1, wherein the classification model classifies presence or absence of oral cancer.
35. The method of embodiment 1, wherein the classification model classifies a stage of oral cancer (e.g., selected from stage 0, stage 1, stage 2, stage 3, stage 4).
36. The method of embodiment 1, wherein the classification model is selected to have a sensitivity of at least 90% and a selectivity of at least 90%.
37. The method of embodiment 1, further comprising:
e) outputting the inference to a user interface device or to computer-readable memory.
38. The method of embodiment 1, further comprising:
e) delivering and/or administering to the subject a therapeutic intervention effective to treat the oral cancer.
39. The method of embodiment 1, further comprising:
e) for a subject inferred to have oral cancer, performing a confirmatory diagnostic step selected from biopsy or imaging.
40. A method comprising:
a) providing biological samples from each of a first set of subjects and a second set of subjects, wherein the biological samples comprise an oral microbiome, and, optionally, somatic host cells, and wherein the first set of subjects have oral cancer present and the second set of subjects have oral cancer absent;
b) sequencing nucleic acids in the biological samples to provide sequence information; and
c) performing a statistical analysis on the sequence information to produce a model that infers a state of oral cancer in a subject based on sequence information.
41. The method of embodiment 40, wherein the statistical analysis comprises a model developed by machine learning.
42. The method of embodiment 40, wherein the statistical analysis comprises an analysis selected from correlational, Pearson correlation, Spearman correlation, chi-square, comparison of means (e.g., paired T-test, independent T-test, ANOVA), regression analysis (e.g., simple regression, multiple regression, linear regression, non-linear regression, logistic regression, polynomial regression, stepwise regression, ridge regression, lasso regression, elasticnet regression) and non-parametric analysis (e.g., Wilcoxon rank-sum test, Wilcoxon sign-rank test, sign test).
43. A method comprising:
a) administering to a subject inferred to have oral cancer by a method of embodiment 1, a therapeutic intervention effective to treat the oral cancer.
44. The method of embodiment 43, wherein the therapeutic intervention is selected from surgical removal of cancerous tissue; administration of a chemotherapeutic agent; and administration of a dietary supplement, a food ingredient, or a food that diminishes a dysbiosis in oral microbiome of the subject associated with the cancer.
45. The method of embodiment 43, wherein the therapeutic intervention comprises one or more of:
1) increasing the abundance of an under-represented taxon;
2) reducing the abundance of an over-represented taxon;
3) reducing the abundance of a microbial function;
4) increasing the abundance of a microbial function;
5) decreasing interactions between microorganisms or their molecules (metabolites, nucleic acids, proteins) and human tissue that support cancer onset or progression; and
6) enhancing the interactions between microorganisms or their molecules (metabolites, nucleic acids, proteins) and human tissue that inhibit cancer onset or progression.
46. A system comprising:
(a) a computer comprising: (i) a processor; and (ii) a memory, coupled to the processor, the memory storing a module comprising:
(1) nucleic acid sequence information from a biological sample from a subject comprising an oral microbiome;
(2) a classification model which, based on values including the measurements, classifies the subject as having oral cancer present or absent, wherein the classification model is selected to have a sensitivity of at least 75%, at least 85% or at least 95%; and
(3) computer executable instructions for implementing the classification model on the test data.
47. A method for developing a computer model for inferring, from feature data, a state of oral cancer in a subject, the method comprising:
a) training a machine learning algorithm on a training data set, wherein the training data set comprises, for each of a plurality of subjects, (1) a class label classifying a subject as having or not having an oral cancer; and (2) feature data comprising quantitative measures for each of a plurality of features selected from oral microbiome transcriptome expression, and wherein the machine learning algorithm develops a model that infers a class label for a subject based on the feature data.
48. A method that infers a state of oral cancer in a subject, the method comprising:
(a) providing a data set comprising, for the subject, feature data for each of a plurality of features selected from oral microbiome transcriptome gene expression data and taxa activity data; and
(b) executing a computer model on the data set to infer the presence or absence of oral cancer in the subject.
49. A software product comprising a computer readable medium in tangible form comprising machine executable code, which, when executed by a computer processor, infers a state of oral cancer in a subject by:
(a) accessing a data set comprising, for a subject, feature data for each of a plurality of features selected from oral microbiome transcriptome gene expression data and taxa activity data; and
(b) executing a computer model on the data set to infer the state of oral cancer in the subject.
50. A method of treating oral cancer in a subject comprising:
(a) inferring the presence of oral cancer in a subject according to a method as described herein; and
(b) administering a therapeutic intervention to the subject effective to treat the oral cancer.
51. A method for diagnosing and treating an oral cancer in a subject, the method comprising:
(a) receiving from a subject a sample comprising an oral microbiome and, optionally, host somatic cells;
(b) determining nucleic acid sequences of a microorganism component of the sample;
(c) determining alignments of the nucleic acid sequence to reference nucleic acid sequences associated with the oral cancer;
(d) generating a microbiome feature dataset for the subject based upon the alignments;
(e) generating an inference of the oral cancer in the subject upon processing the microbiome feature dataset with an inference model derived from a population of subjects; and
(f) at an output device associated with the subject, providing a therapy to the subject with the oral cancer upon processing the inference with a therapy model designed to treat the oral cancer.
52. A method comprising:
(a) measuring, in a sample from a subject comprising an oral microbiome and, optionally, host somatic cells, activity of one or more biomarkers selected from Table 1, Table 2, Table 3 and/or Table 4;
(b) inferring, from the measurements, presence of oral cancer in the subject; and
(c) delivering to the subject a therapeutic intervention to treat the oral cancer.
53. The method of embodiment 52, wherein measuring comprises:
(i) optionally, amplifying microbial metatranscriptome sequences in the sample;
(ii) sequencing the microbial metatranscriptome from the sample to produce sequence reads;
(iii) searching reference sequences in a reference sequence catalog for matches with the sequence reads;
(iv) determining amounts of sequence reads matching reference sequences in the catalog to produce a data set; and
(v) determining, from the data set, activity of each of the one or more biomarkers.
54. The method of embodiment 53, wherein determining activity comprises:
(1) for biomarkers that are taxa categories, performing a taxonomic analysis with a metagenomic classifier to measure taxa activity;
(2) for biomarkers that are gene orthologs, performing a functional analysis by determining activity of genes having the same function across taxa based on sequences corresponding to microbial open reading frames (ORFs), and combining the activities to produce gene ortholog activity.
55. The method of embodiment 52, wherein inferring comprises:
(i) executing by computer a classification model that infers presence or absence of oral cancer based on the biomarkers.
56. The method of embodiment 52, wherein measuring comprises:
(i) selectively amplifying in the sample nucleic acids specific for the biomarkers; and
(ii) determining amounts of the amplified nucleic acids.
57. A method comprising:
a) providing biological samples from each of a first set of subjects and a second set of subjects having an oral cancer and having been subject to a therapeutic intervention, wherein the biological samples comprise an oral microbiome, and, optionally, host somatic cells, and wherein the first set of subjects responded positively to the therapeutic intervention and the second set of subjects did not respond positively to the therapeutic intervention;
b) sequencing nucleic acids in the biological samples to provide sequence information; and
c) performing a statistical analysis on the sequence information to produce a model that infers subject oral cancer having a positive response or lack of positive response to the therapeutic intervention.
58. A method of treating a subject with oral cancer comprising:
(a) inferring that the subject will respond positively to each of one or more therapeutic interventions by executing a model on nucleic acid information from a biological sample from the subject comprising an oral microbiome and, optionally, host somatic cells; and
(b) administering to the subject one or more therapeutic interventions to treat the cancer.
59. A method comprising:
(a) identifying a subject inferred to have oral cancer by a method of embodiment 1; and
(b) performing imaging or biopsy to confirm the inference.
60. The method of embodiment 59, wherein the oral cancer is squamous cell carcinoma (“OSCC”).
As used herein, the following meanings apply unless otherwise specified. The word “may” is used in a permissive sense (i.e., meaning having the potential to), rather than the mandatory sense (i.e., meaning must). The words “include”, “including”, and “includes” and the like mean including, but not limited to. The singular forms “a,” “an,” and “the” include plural referents. Thus, for example, reference to “an element” includes a combination of two or more elements, notwithstanding use of other terms and phrases for one or more elements, such as “one or more.” The phrase “at least one” includes “one”, “one or more”, “one or a plurality” and “a plurality”. The term “or” is, unless indicated otherwise, non-exclusive, i.e., encompassing both “and” and “or.” The term “any of” between a modifier and a sequence means that the modifier modifies each member of the sequence. So, for example, the phrase “at least any of 1, 2 or 3” means “at least 1, at least 2 or at least 3”. The term “consisting essentially of” refers to the inclusion of recited elements and other elements that do not materially affect the basic and novel characteristics of a claimed combination.
It should be understood that the description and the drawings are not intended to limit the invention to the particular form disclosed, but to the contrary, the intention is to cover all modifications, equivalents, and alternatives falling within the spirit and scope of the present invention as defined by the appended claims. Further modifications and alternative embodiments of various aspects of the invention will be apparent to those skilled in the art in view of this description. Accordingly, this description and the drawings are to be construed as illustrative only and are for the purpose of teaching those skilled in the art the general manner of carrying out the invention. It is to be understood that the forms of the invention shown and described herein are to be taken as examples of embodiments. Elements and materials may be substituted for those illustrated and described herein, parts and processes may be reversed or omitted, and certain features of the invention may be utilized independently, all as would be apparent to one skilled in the art after having the benefit of this description of the invention. Changes may be made in the elements described herein without departing from the spirit and scope of the invention as described in the following claims. Headings used herein are for organizational purposes only and are not meant to be used to limit the scope of the description.
All publications, patents, and patent applications mentioned in this specification are herein incorporated by reference to the same extent as if each individual publication, patent, or patent application was specifically and individually indicated to be incorporated by reference.
This application claims the benefit of U.S. provisional patent application 63/001,236, filed Mar. 27, 2020, the contents of which are incorporated herein in their entirety.
Filing Document | Filing Date | Country | Kind |
---|---|---|---|
PCT/US21/24547 | 3/28/2021 | WO |
Number | Date | Country | |
---|---|---|---|
63001236 | Mar 2020 | US |