The subject matter disclosed herein is generally directed to determining whether a subject suffering from inflammatory bowel disease (IBD) will respond to anti-TNF-blockade and treating the subject.
Inflammatory bowel diseases (IBDs) arise when homeostatic mechanisms regulating gastrointestinal (GI) tract tissue integrity, nutrient absorption, and protective immunity are replaced by pathogenic inflammation (Baumgart and Sandborn, 2012; Chang, 2020; Corridoni et al., 2020a; Friedrich et al., 2019; Graham and Xavier, 2020; Selin et al., 2021). The initiating triggers are not fully known, but host genetics and the microbiome are increasingly appreciated to play important, and in some cases causal, roles in the IBDs (Chang, 2020; Cohen et al., 2019; Franzosa et al., 2019; Jain et al., 2021; Limon et al., 2019). Though symptoms are widespread and overlapping, the endoscopic inflammation in the IBDs is often anatomically restricted: ulcerative colitis (UC) manifests primarily as a superficial inflammatory response restricted to the colon, while Crohn's disease (CD) presents predominantly in the terminal ileum and the proximal colon, though lesions may develop anywhere along the gastrointestinal tract (Baumgart and Sandborn, 2012; Chang, 2020; Kobayashi et al., 2020; Roda et al., 2020). Among the IBDs, pediatric-onset Crohn's disease (pediCD) is particularly common (25% of all IBD cases, 60-70% of pediatric IBD) and is a debilitating form due to its early presentation, impact on the terminal ileum and proximal colon, and the lack of disease-specific therapies developed with children in mind (Hyams et al., 1991; Ruemmele et al., 2014; Sykora et al., 2018; Turner et al., 2012; Ye et al., 2020). In contrast to pediCD, a group of pediatric disorders termed functional gastrointestinal disorders (FGIDs) includes GI symptoms but lacks the laboratory markers, endoscopic findings, and histologic evidence associated with inflammation (Black et al., 2020; Hyams et al., 2016; McOmber and Shulman, 2008; Santucci et al., 2020). FGID thus represents a critical non-inflamed control cohort with which to contextualize the inflammation observed in pediCD.
The current standard of care for pediCD (as with adult CD) is tailored to the patient's disease location, clinical behavior, and severity, though use of prednisone, immunomodulators, and biologics including anti-TNF-α monoclonal antibodies is common (Hyams et al., 1991; Ruemmele et al., 2014; Turner et al., 2012). While targeting TNF is common across many autoimmune and inflammatory conditions, it is not successful in all patients, and many go on to develop anti-TNF-refractory disease. Several factors based on individual immunogenicity and pharmacokinetics have been proposed to explain TNF-refractory disease, including gender (M>F), low albumin levels, high BMI, and high baseline C-Reactive Protein (CRP) (Atreya et al., 2020; Digby-Bell et al., 2020). However, no single identifiable clinical or biochemical biomarker reliably predicts disease response versus resistance to anti-TNF antibodies, suggesting a more complex etiology (Stevens et al., 2018). It would be of tremendous interest for the field to precisely understand in which patients anti-TNF therapy is not necessary (not on anti-TNF: NOA), in which patients it may succeed in controlling disease (full-responders: FRs), and which patients will either immediately or progressively gain resistance to treatment (partial-responders: PRs).
The primary cellular lineages sampled from intestinal biopsies of CD patients (from the terminal ileum or colon) represent both the epithelium and lamina propria, and include epithelial cells, stromal cells, hematopoietic cells and neuronal processes whose cell bodies are present outside of these regions (Buisine et al., 2001; Leeb et al., 2003; Leonard et al., 1995; Lilja et al., 2000; Müller et al., 1998; Souza et al., 1999; Stappenbeck and McGovern, 2017; Takayama et al., 2010). Alterations to all cellular lineages have been implicated in CD (Furey et al., 2019; Martin et al., 2019). Histologically, CD is characterized by a granulomatous inflammation, and noted alterations in almost every leukocyte cell type studied including an increase of cytotoxic lymphocytes, potential but equivocal alterations in γδ T cells, increases in mast cells and their production of TNF, activation and shifts in antibody isotypes towards IgM and IgG from B cells and plasma cells, and cytokine production by macrophages (Catalan-Serra et al., 2017; Lilja et al., 2000; Meijer et al., 1979; Mitsialis et al., 2020; Müller et al., 1998; Sieber et al., 1984; Takayama et al., 2010). In the stromal compartment, there is evidence for enhanced vascularization and increased expression of ICAM-1 and MAdCAM-1 by vascular beds, substantial remodeling of collecting lymphatics, and altered migratory potential of fibroblasts (Leeb et al., 2003; Souza et al., 1999). Epithelial barrier dysfunction has also been noted, including alterations to mucus production, microvilli, and Paneth cell dysfunction (Buisine et al., 2001; Stappenbeck and McGovern, 2017). Collectively previous studies have identified that all cell types may be meaningfully altered during CD, and highlight the important need to comprehensively understand the concerted cellular changes that define CD. Importantly, which changes are predictive of more severe disease or treatment resistance in pediCD remain largely unknown.
Massively parallel single-cell RNA-sequencing (scRNA-seq) is enhancing our ability to comprehensively map and resolve the cell types, subsets, and states present during health and disease. This has been particularly evident in the elucidation of novel human cell subsets and states within epithelial, stromal, immune, and neuronal cell lineages. Recent work has generated cellular atlases of previously treated (pre-treated) adult CD and UC, though a comprehensive single-cell atlas for untreated pediatric disease with follow-up of patient outcomes has yet to be reported for IBD (Corridoni et al., 2020a, 2020b; Drokhlyansky et al., 2019; Elmentaite et al., 2020; Huang et al., 2019; Kinchen et al., 2018; Martin et al., 2019; Parikh et al., 2019; Smillie et al., 2019). The potential impact of scRNA-seq on our understanding of IBD is evidenced by studies of adult UC, which have identified potential functional roles for poorly understood colonic cell subsets, such as the BEST4+ enterocyte, and identified pathological alterations in UC pinch biopsies compared to healthy controls, including an expansion of microfold-like cells, IL13RA2+IL11+ inflammatory fibroblasts, CD4+CD8+IL17A+ T cells and CD8+GZMK+ T cells (Smillie et al., 2019). A single-cell study of pre-treated adult CD patients comparing non-inflamed and inflamed tissue from surgically-resected bowel found IgG+ plasma cells, inflammatory mononuclear phagocytes, activated T cells and stromal cells comprising the “GIMATS” pathogenic cellular module (Martin et al., 2019). This module was used to derive a gene signature associated with resistance to anti-TNF therapy in a distinct cohort profiled by bulk RNA-seq. Very recent work has profiled how fetal transcription factors are reactivated in Crohn's disease epithelium (Elmentaite et al., 2020). 
Together, these and other studies have demonstrated the power of scRNA-seq to nominate individual and collective cell states that are associated with disease, and have also underscored the unmet need to apply these techniques to untreated disease and associate them with disease severity in order to more specifically identify pathognomonic and prognostic cell states.
Citation or identification of any document in this application is not an admission that such a document is available as prior art to the present invention.
In one aspect, the present invention provides for a method of treating a subject suffering from inflammatory bowel disease (IBD) comprising: determining whether the subject belongs to a risk group selected from: (i) well controlled without anti-TNF-blockade (NOA), (ii) anti-TNF-blockade full responder (FR), and (iii) anti-TNF-blockade partial responder (PR) by: detecting in a sample obtained from the subject at diagnosis or before treatment the frequency of one or more T cell/Natural Killer/Innate lymphoid cell (T/NK/ILC), myeloid and/or epithelial cell subsets selected from Table 1, determining the risk group of the subject by comparing the frequency of the detected cell subsets to a control frequency for the subsets along a trajectory of disease severity from NOA to FR to PR; and if the subject is in the NOA group, then treating the subject with a treatment that does not comprise anti-TNF-blockade; if the subject is in the FR group, then treating the subject with a treatment comprising anti-TNF-blockade; if the subject is in the PR group, then treating the subject with a treatment comprising anti-TNF-blockade and/or an additional treatment. 
In certain embodiments, the cell subsets are selected from the group consisting of: CD.T.MKI67.IFNG, CD.T.MKI67.FOXP3, CD.T.GNLY.CSF2, CD.NK.GNLY.FCER1G, CD.Mac.CXCL3.APOC1, CD.Mono/Mac.CXCL10.FCN1, CD.Mono.FCN1.S100A4, CD.Endth/Ven.LAMP3.LIPG, CD.Goblet.TFF1.TPSG1, CD.T.LAG3.BATF, CD.T.IFI44L.PTGER4, CD.T.IFI6.IRF7, CD.cDC2.CLEC10A.FCGR2B, CD.Fibro.IFI6.IFI44L, CD.Tuft.GNAT3.TRPM5, CD.EC.GSTA2.CES3, and CD.EC.GSTA2.TMPRSS15, wherein the frequency of the CD.T.MKI67.IFNG, CD.T.MKI67.FOXP3, CD.T.GNLY.CSF2, CD.NK.GNLY.FCER1G, CD.Mac.CXCL3.APOC1, CD.Mono/Mac.CXCL10.FCN1, CD.Mono.FCN1.S100A4, CD.Endth/Ven.LAMP3.LIPG, and CD.Goblet.TFF1.TPSG1 subsets is increased in PR subjects as compared to NOA subjects, and wherein the frequency of the CD.T.LAG3.BATF, CD.T.IFI44L.PTGER4, CD.T.IFI6.IRF7, CD.cDC2.CLEC10A.FCGR2B, CD.Fibro.IFI6.IFI44L, CD.Tuft.GNAT3.TRPM5, CD.EC.GSTA2.CES3, and CD.EC.GSTA2.TMPRSS15 subsets is decreased in PR subjects as compared to NOA subjects. In certain embodiments, the cell subsets are selected from the group consisting of: CD.NK.MKI67.GZMA, CD.T.MKI67.IL22, CD.Fibro.CCL19.IRF7 and CD.EC.SLC28A2.GSTA2, wherein the frequency of the CD.NK.MKI67.GZMA and CD.T.MKI67.IL22 subsets is increased in FR and PR subjects as compared to NOA subjects, and wherein the frequency of the CD.Fibro.CCL19.IRF7 and CD.EC.SLC28A2.GSTA2 subsets is decreased in FR and PR subjects as compared to NOA subjects.
In certain embodiments, the cell subsets are selected from the group consisting of: cDC2.CD1C.AREG, T.MAF.CTLA4, T.CCL20.RORA, Goblet.RETNLB.ITLN1, Mac.C1QB.CD14, Mono.CXCL3.FCN1, pDC.IRF7.IL3RA, Mac.CXCL3.APOC1, EC.NUPR1.LCN2, T.GNLY.CSF2, Mono.Mac.CXCL10.FCN1, T.MKI67.FOXP3, T.MKI67.IFNG, Mac.DC.CXCL10.CLEC4E, NK.GNLY.FCER1G, T.MKI67.IL22, NK.GNLY.IFNG, EC.OLFM4.MT.ND2, NK.GNLY.GZMB, Mono.Mac.CXCL10.CXCL11, Mono.FCN1.S100A4, T.CARD16.GB2, Mono.CXCL10.TNF, and NK.MKI67.GZMA, wherein the frequency of at least one subset from each of the T/NK/ILC, myeloid and epithelial cell states subsets is increased in PR subjects as compared to FR and NOA subjects. In certain embodiments, the cell subsets are selected from the group consisting of: CD.EpithStem.LINC00176.RPS4Y1, CD.MCell.CSRP2.SPIB, CD.EC.FABP6.PLCG2, and CD.EC.FABP1.ADIRF, wherein the frequency of the CD.EpithStem.LINC00176.RPS4Y1, CD.MCell.CSRP2.SPIB, CD.EC.FABP6.PLCG2, and CD.EC.FABP1.ADIRF subsets is decreased in FR subjects as compared to NOA subjects. In certain embodiments, the cell subset is the CD.B/DZ.HIST1H1B.MKI67 subset, wherein the frequency of the CD.B/DZ.HIST1H1B.MKI67 subset is increased in PR subjects as compared to FR subjects. In certain embodiments, the anti-TNF-blockade is a monoclonal antibody.
In another aspect, the present invention provides for a method of treating a subject suffering from inflammatory bowel disease (IBD) comprising: detecting in a sample obtained from the subject at diagnosis or before treatment the expression of one or more genes selected from Table 2; determining whether the subject is in the FR or PR risk group by comparing to a control level in FR and/or PR subjects; and if the subject is in the FR group, then treating the subject with a treatment comprising anti-TNF-blockade; if the subject is in the PR group, then treating the subject with a treatment comprising anti-TNF-blockade and/or an additional treatment. In certain embodiments, the one or more genes are detected in one or more cell subsets selected from the group consisting of CD.NK.CCL3.CD160, CD.Fibro.TFPI2.CCL13, CD.Paneth.DEFA6.ITLN2 and CD.Mac.APOE.PTGDS, wherein the one or more cell subsets are detected according to one or more genes in Table 1. In certain embodiments, the one or more genes are selected from the group consisting of IFITM1, APOA1, TPT1, FABP6, NACA, APOA4, MIF, HOPX, SPINK4, CMC1, TNFRSF11B, BRI3, COL1A2, NKG7, APOE, TFPI2, AREG, KLRC1, HTRA3, COL1A1, HIF1A, STAT1, SLC16A4, SERPINE2, CCL11, SAMHD1, TAX1BP1, TXN, GPR65, CEBPB, GSN, EMILIN1, CTNNB1, COL4A1, CLEC12A, PTGER4, BDKRB1, SKIL, and PFN1, wherein APOA1, FABP6, NACA, APOA4, TPT1, SPINK4, MIF, IFITM1, and HOPX are increased in FR relative to PR, and wherein TNFRSF11B, TFPI2, SERPINE2, GSN, COL1A1, HIF1A, COL1A2, CTNNB1, CCL11, EMILIN1, CEBPB, SLC16A4, HTRA3, CMC1, AREG, COL4A1, SKIL, KLRC1, PTGER4, BRI3, APOE, BDKRB1, TXN, GPR65, NKG7, SAMHD1, CLEC12A, STAT1, PFN1, and TAX1BP1 are increased in PR relative to FR. In certain embodiments, the anti-TNF-blockade is a monoclonal antibody.
In another aspect, the present invention provides for a method of treating a subject suffering from inflammatory bowel disease (IBD) comprising: detecting in a sample obtained from the subject at diagnosis or before treatment the expression of one or more genes selected from the group consisting of TNFAIP6, GZMB, S100A8, CSF2, CLEC4E, S100A9, IL1RN, FCGR1A, CLIC3, CD14, PLA2G7, FAM26F, IL3RA, NKG7, IL32, CCL3, OLR1, LILRA4, APOC1, and MYBL2; or Table 14; and if the subject has decreased expression of the one or more genes compared to a control, then treating the subject with a treatment comprising anti-TNF-blockade; if the subject has increased expression of the one or more genes compared to a control, then treating the subject with a treatment comprising anti-TNF-blockade and/or an additional treatment. In certain embodiments, the anti-TNF-blockade is a monoclonal antibody.
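By way of non-limiting illustration only, the comparison of signature-gene expression to a control described above may be sketched as follows. The scoring rule (an unweighted mean over the signature genes measured in both the sample and the control), the function names, and the "indeterminate" outcome are assumptions made for this example and are not prescribed by the present disclosure; the gene list is the one recited above.

```python
from statistics import mean

# Signature genes as recited in the text above.
SIGNATURE = [
    "TNFAIP6", "GZMB", "S100A8", "CSF2", "CLEC4E", "S100A9", "IL1RN",
    "FCGR1A", "CLIC3", "CD14", "PLA2G7", "FAM26F", "IL3RA", "NKG7",
    "IL32", "CCL3", "OLR1", "LILRA4", "APOC1", "MYBL2",
]


def signature_score(expression, genes):
    """Unweighted mean expression over the given genes (assumed scoring rule)."""
    return mean(expression[g] for g in genes)


def select_treatment(sample, control):
    """Compare a sample's signature score to a control's (hypothetical helper).

    `sample` and `control` map gene symbols to expression values measured
    on the same scale. Only genes present in both are scored.
    """
    shared = [g for g in SIGNATURE if g in sample and g in control]
    if not shared:
        raise ValueError("no signature genes measured in both sample and control")
    s = signature_score(sample, shared)
    c = signature_score(control, shared)
    if s < c:
        # Decreased expression relative to control.
        return "anti-TNF-blockade"
    if s > c:
        # Increased expression relative to control.
        return "anti-TNF-blockade and/or an additional treatment"
    return "indeterminate"
```

In practice, the control level, the normalization of expression values, and any threshold or margin around the control would need to be established on a reference cohort; none of these are fixed by this sketch.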
In another aspect, the present invention provides for a method of stratifying subjects suffering from IBD into a risk group comprising detecting in a sample obtained from a subject at diagnosis or before treatment the frequency of one or more T cell/Natural Killer/Innate lymphoid cell (T/NK/ILC), myeloid and/or epithelial cell subsets selected from Table 1, and determining if the subject is in a well-controlled without anti-TNF-blockade (NOA) risk group, an anti-TNF-blockade full responder (FR) risk group, or an anti-TNF-blockade partial responder (PR) risk group by comparing the frequency of the detected cell subsets to a control frequency for the subsets along a trajectory of disease severity from NOA to FR to PR. In certain embodiments, the cell subsets are selected from the group consisting of: CD.T.MKI67.IFNG, CD.T.MKI67.FOXP3, CD.T.GNLY.CSF2, CD.NK.GNLY.FCER1G, CD.Mac.CXCL3.APOC1, CD.Mono/Mac.CXCL10.FCN1, CD.Mono.FCN1.S100A4, CD.Endth/Ven.LAMP3.LIPG, CD.Goblet.TFF1.TPSG1, CD.T.LAG3.BATF, CD.T.IFI44L.PTGER4, CD.T.IFI6.IRF7, CD.cDC2.CLEC10A.FCGR2B, CD.Fibro.IFI6.IFI44L, CD.Tuft.GNAT3.TRPM5, CD.EC.GSTA2.CES3, and CD.EC.GSTA2.TMPRSS15, wherein the frequency of the CD.T.MKI67.IFNG, CD.T.MKI67.FOXP3, CD.T.GNLY.CSF2, CD.NK.GNLY.FCER1G, CD.Mac.CXCL3.APOC1, CD.Mono/Mac.CXCL10.FCN1, CD.Mono.FCN1.S100A4, CD.Endth/Ven.LAMP3.LIPG, and CD.Goblet.TFF1.TPSG1 subsets is increased in PR subjects as compared to NOA subjects, and wherein the frequency of the CD.T.LAG3.BATF, CD.T.IFI44L.PTGER4, CD.T.IFI6.IRF7, CD.cDC2.CLEC10A.FCGR2B, CD.Fibro.IFI6.IFI44L, CD.Tuft.GNAT3.TRPM5, CD.EC.GSTA2.CES3, and CD.EC.GSTA2.TMPRSS15 subsets is decreased in PR subjects as compared to NOA subjects.
In certain embodiments, the cell subsets are selected from the group consisting of: CD.NK.MKI67.GZMA, CD.T.MKI67.IL22, CD.Fibro.CCL19.IRF7 and CD.EC.SLC28A2.GSTA2, wherein the frequency of the CD.NK.MKI67.GZMA and CD.T.MKI67.IL22 subsets is increased in FR and PR subjects as compared to NOA subjects, and wherein the frequency of the CD.Fibro.CCL19.IRF7 and CD.EC.SLC28A2.GSTA2 subsets is decreased in FR and PR subjects as compared to NOA subjects. In certain embodiments, the cell subsets are selected from the group consisting of: cDC2.CD1C.AREG, T.MAF.CTLA4, T.CCL20.RORA, Goblet.RETNLB.ITLN1, Mac.C1QB.CD14, Mono.CXCL3.FCN1, pDC.IRF7.IL3RA, Mac.CXCL3.APOC1, EC.NUPR1.LCN2, T.GNLY.CSF2, Mono.Mac.CXCL10.FCN1, T.MKI67.FOXP3, T.MKI67.IFNG, Mac.DC.CXCL10.CLEC4E, NK.GNLY.FCER1G, T.MKI67.IL22, NK.GNLY.IFNG, EC.OLFM4.MT.ND2, NK.GNLY.GZMB, Mono.Mac.CXCL10.CXCL11, Mono.FCN1.S100A4, T.CARD16.GB2, Mono.CXCL10.TNF, and NK.MKI67.GZMA, wherein the frequency of at least one subset from each of the T/NK/ILC, myeloid and epithelial cell states subsets is increased in PR subjects as compared to FR and NOA subjects. In certain embodiments, the cell subsets are selected from the group consisting of: CD.EpithStem.LINC00176.RPS4Y1, CD.MCell.CSRP2.SPIB, CD.EC.FABP6.PLCG2, and CD.EC.FABP1.ADIRF, wherein the frequency of the CD.EpithStem.LINC00176.RPS4Y1, CD.MCell.CSRP2.SPIB, CD.EC.FABP6.PLCG2, and CD.EC.FABP1.ADIRF subsets is decreased in FR subjects as compared to NOA subjects. In certain embodiments, the cell subset is the CD.B/DZ.HIST1H1B.MKI67 subset, wherein the frequency of the CD.B/DZ.HIST1H1B.MKI67 subset is increased in PR subjects as compared to FR subjects. In certain embodiments, the IBD is Crohn's Disease (CD).
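By way of non-limiting illustration only, one possible way to compare detected subset frequencies to control frequencies for the NOA, FR, and PR groups is a nearest-reference-profile rule, sketched below. The use of Euclidean distance, the function name, and the data layout are assumptions made for this example; the disclosure does not specify this particular comparison, and the reference frequencies would have to be derived from a characterized cohort.

```python
import math

# Risk groups ordered along the severity trajectory described above.
GROUPS = ("NOA", "FR", "PR")


def assign_risk_group(sample_freqs, reference_freqs):
    """Return the group whose reference profile is closest to the sample.

    sample_freqs:    {subset name: frequency in the subject's sample}
    reference_freqs: {group: {subset name: control frequency for that group}}
    Distance is Euclidean over the subsets present in both the sample and
    the group's reference profile (an assumed comparison rule).
    """
    def distance(group):
        ref = reference_freqs[group]
        shared = [s for s in sample_freqs if s in ref]
        if not shared:
            raise ValueError(f"no shared subsets with reference for {group}")
        return math.sqrt(sum((sample_freqs[s] - ref[s]) ** 2 for s in shared))

    # Ties resolve toward the earlier (less severe) group in GROUPS.
    return min(GROUPS, key=distance)
```

A subject whose frequencies of subsets such as CD.T.MKI67.IFNG are elevated and whose frequencies of subsets such as CD.T.LAG3.BATF are reduced would, under this rule, fall closest to a PR reference profile that reflects the directions of change recited above.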
In another aspect, the present invention provides for a method of stratifying subjects suffering from IBD into a risk group comprising: detecting in a sample obtained from a subject at diagnosis or before treatment the expression of one or more genes selected from the group consisting of TNFAIP6, GZMB, S100A8, CSF2, CLEC4E, S100A9, IL1RN, FCGR1A, CLIC3, CD14, PLA2G7, FAM26F, IL3RA, NKG7, IL32, CCL3, OLR1, LILRA4, APOC1, and MYBL2; or Table 14, and determining if the subject is in a well-controlled without anti-TNF-blockade (NOA) risk group, an anti-TNF-blockade full responder (FR) risk group, or anti-TNF-blockade partial responder (PR) risk group by comparing the expression of the one or more genes to a control expression for the subsets along a trajectory of disease severity from NOA to FR to PR. In certain embodiments, the IBD is Crohn's Disease (CD).
In certain embodiments, the cell states or genes are detected by RNA-seq, immunohistochemistry (IHC), fluorescently bar-coded oligonucleotide probes, RNA FISH, FACS, or any combination thereof. In certain embodiments, the cell states are inferred from bulk RNA-seq. In certain embodiments, the cell states are determined by single cell RNA-seq. In certain embodiments, the sample is obtained by biopsy. In certain embodiments, the subject is younger than 35, 25, 20, or 18 years old. In certain embodiments, when the frequency of a cell state increases, the frequency of a cell state in the parent cells for the control subject is less than 0, 5, 10, or 50 percent of the parent cell. In certain embodiments, when the frequency of a cell state decreases, the frequency of a cell state in the parent cells for the control subject is greater than 0, 5, 10, or 50 percent of the parent cell.
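By way of non-limiting illustration only, the frequency of a cell state expressed as a percentage of its parent cells may be computed from per-cell annotations as sketched below. The input layout (one `(parent type, subset)` pair per cell) and the function name are assumptions made for this example, and subset names are assumed to be unique across parent types.

```python
from collections import Counter


def subset_frequencies(cells):
    """Percent of each subset among cells of its parent type.

    cells: iterable of (parent_type, subset) pairs, one pair per cell.
    Returns {subset: percentage of parent-type cells}.
    """
    cells = list(cells)
    parent_counts = Counter(parent for parent, _ in cells)
    pair_counts = Counter(cells)
    return {
        subset: 100.0 * n / parent_counts[parent]
        for (parent, subset), n in pair_counts.items()
    }
```

For example, two CD.T.MKI67.IFNG cells among eight annotated T/NK/ILC cells correspond to a frequency of 25 percent of the parent cells.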
In certain embodiments, the CD.NK.MKI67.GZMA cell state is detected by detecting one or more genes selected from the group consisting of GNLY, CCL3, KLRD1, IL2RB and EOMES. In certain embodiments, the CD.T.MKI67.IL22 cell state is detected by detecting one or more genes selected from the group consisting of IFNG, CCL20, IL22, IL26, CD40LG and ITGAE. In certain embodiments, the CD.Fibro.CCL19.IRF7 cell state is detected by detecting one or more genes selected from the group consisting of CCL19, CCL11, CXCL1, CCL2, OAS1 and IRF7. In certain embodiments, the CD.EC.SLC28A2.GSTA2 cell state is detected by detecting one or more genes selected from the group consisting of SLC28A2 and GSTA2. In certain embodiments, the CD.T.MKI67.IFNG cell state is detected by detecting one or more genes selected from the group consisting of IFNG, GNLY, HOPX, ITGAE and IL26. In certain embodiments, the CD.T.MKI67.FOXP3 cell state is detected by detecting one or more genes selected from the group consisting of IL2RA, BATF, CTLA4, TNFRSF1B, CXCR3, and FOXP3. In certain embodiments, the CD.T.GNLY.CSF2 cell state is detected by detecting one or more genes selected from the group consisting of GNLY, GZMB, GZMA, PRF1, IFNG, CXCR6, and CSF2. In certain embodiments, the CD.NK.GNLY.FCER1G cell state is detected by detecting one or more genes selected from the group consisting of GNLY, GZMB, GZMA, PRF1, AREG, TYROBP, and KLRF1. In certain embodiments, the CD.Mac.CXCL3.APOC1 cell state is detected by detecting one or more genes selected from the group consisting of CCL3, CCL4, CXCL3, CXCL2, CXCL1, CCL20, CCL8, TNF and IL1B. In certain embodiments, the CD.Mono/Mac.CXCL10.FCN1 cell state is detected by detecting one or more genes selected from the group consisting of CXCL9, CXCL10, CXCL11, GBP1, GBP2, GBP4, GBP5, and Type II IFN-gamma. In certain embodiments, the CD.Mono.FCN1.S100A4 cell state is detected by detecting one or more genes selected from the group consisting of S100A4, S100A6, and FCN1.
These and other aspects, objects, features, and advantages of the example embodiments will become apparent to those having ordinary skill in the art upon consideration of the following detailed description of example embodiments.
An understanding of the features and advantages of the present invention will be obtained by reference to the following detailed description that sets forth illustrative embodiments, in which the principles of the invention may be utilized, and the accompanying drawings of which:
The figures herein are for illustrative purposes only and are not necessarily drawn to scale.
Unless defined otherwise, technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this disclosure pertains. Definitions of common terms and techniques in molecular biology may be found in Molecular Cloning: A Laboratory Manual, 2nd edition (1989) (Sambrook, Fritsch, and Maniatis); Molecular Cloning: A Laboratory Manual, 4th edition (2012) (Green and Sambrook); Current Protocols in Molecular Biology (1987) (F. M. Ausubel et al. eds.); the series Methods in Enzymology (Academic Press, Inc.): PCR 2: A Practical Approach (1995) (M. J. MacPherson, B. D. Hames, and G. R. Taylor eds.); Antibodies, A Laboratory Manual (1988) (Harlow and Lane, eds.); Antibodies, A Laboratory Manual, 2nd edition (2013) (E. A. Greenfield ed.); Animal Cell Culture (1987) (R. I. Freshney, ed.); Benjamin Lewin, Genes IX, published by Jones and Bartlett, 2008 (ISBN 0763752223); Kendrew et al. (eds.), The Encyclopedia of Molecular Biology, published by Blackwell Science Ltd., 1994 (ISBN 0632021829); Robert A. Meyers (ed.), Molecular Biology and Biotechnology: a Comprehensive Desk Reference, published by VCH Publishers, Inc., 1995 (ISBN 9780471185710); Singleton et al., Dictionary of Microbiology and Molecular Biology 2nd ed., J. Wiley & Sons (New York, N.Y. 1994); March, Advanced Organic Chemistry Reactions, Mechanisms and Structure 4th ed., John Wiley & Sons (New York, N.Y. 1992); and Marten H. Hofker and Jan van Deursen, Transgenic Mouse Methods and Protocols, 2nd edition (2011).
As used herein, the singular forms “a”, “an”, and “the” include both singular and plural referents unless the context clearly dictates otherwise.
The term “optional” or “optionally” means that the subsequent described event, circumstance or substituent may or may not occur, and that the description includes instances where the event or circumstance occurs and instances where it does not.
The recitation of numerical ranges by endpoints includes all numbers and fractions subsumed within the respective ranges, as well as the recited endpoints.
The terms “about” or “approximately” as used herein when referring to a measurable value such as a parameter, an amount, a temporal duration, and the like, are meant to encompass variations of and from the specified value, such as variations of +/−10% or less, +/−5% or less, +/−1% or less, and +/−0.1% or less of and from the specified value, insofar such variations are appropriate to perform in the disclosed invention. It is to be understood that the value to which the modifier “about” or “approximately” refers is itself also specifically, and preferably, disclosed.
As used herein, a “biological sample” may contain whole cells and/or live cells and/or cell debris. The biological sample may contain (or be derived from) a “bodily fluid”. The present invention encompasses embodiments wherein the bodily fluid is selected from amniotic fluid, aqueous humour, vitreous humour, bile, blood serum, breast milk, cerebrospinal fluid, cerumen (earwax), chyle, chyme, endolymph, perilymph, exudates, feces, female ejaculate, gastric acid, gastric juice, lymph, mucus (including nasal drainage and phlegm), pericardial fluid, peritoneal fluid, pleural fluid, pus, rheum, saliva, sebum (skin oil), semen, sputum, synovial fluid, sweat, tears, urine, vaginal secretion, vomit and mixtures of one or more thereof. Biological samples include cell cultures, bodily fluids, cell cultures from bodily fluids. Bodily fluids may be obtained from a mammal organism, for example by puncture, or other collecting or sampling procedures.
The terms “subject,” “individual,” and “patient” are used interchangeably herein to refer to a vertebrate, preferably a mammal, more preferably a human. Mammals include, but are not limited to, murines, simians, humans, farm animals, sport animals, and pets. Tissues, cells and their progeny of a biological entity obtained in vivo or cultured in vitro are also encompassed.
Various embodiments are described hereinafter. It should be noted that the specific embodiments are not intended as an exhaustive description or as a limitation to the broader aspects discussed herein. One aspect described in conjunction with a particular embodiment is not necessarily limited to that embodiment and can be practiced with any other embodiment(s). Reference throughout this specification to “one embodiment”, “an embodiment,” “an example embodiment,” means that a particular feature, structure or characteristic described in connection with the embodiment is included in at least one embodiment of the present invention. Thus, appearances of the phrases “in one embodiment,” “in an embodiment,” or “an example embodiment” in various places throughout this specification are not necessarily all referring to the same embodiment, but may. Furthermore, the particular features, structures or characteristics may be combined in any suitable manner, as would be apparent to a person skilled in the art from this disclosure, in one or more embodiments. Furthermore, while some embodiments described herein include some but not other features included in other embodiments, combinations of features of different embodiments are meant to be within the scope of the invention. For example, in the appended claims, any of the claimed embodiments can be used in any combination.
All publications, published patent documents, and patent applications cited herein are hereby incorporated by reference to the same extent as though each individual publication, published patent document, or patent application was specifically and individually indicated as being incorporated by reference.
Embodiments disclosed herein provide methods of treating IBD based on detection of specific cell types, subsets, and states in the subject that indicate whether the subject will respond to anti-TNF-blockade. Single-cell approaches are transforming our ability to understand the barrier tissue biology of inflammatory diseases. Crohn's disease is an inflammatory bowel disease (IBD) which most often presents with patchy lesions in the terminal ileum and proximal colon and requires complex clinical care. Recent advances in the targeting of cytokines and leukocyte migration have greatly advanced treatment options, but most patients still relapse and inevitably progress. As comprehensive single-cell RNA-sequencing (scRNA-seq) atlases of IBD to date have been confounded by sampling treated patients with established disease, there is a lack of a rigorous understanding of which cell types, subsets, and states at diagnosis are predictive of disease severity and response to treatment. Here, through a combined clinical, flow cytometric, and single-cell RNA-sequencing study, Applicants profile primary human biopsies from the terminal ileum of treatment-naïve pediatric patients with non-inflammatory functional gastrointestinal disorder (FGID; n=13) or Crohn's disease (pediCD; n=14). Applicants report transcriptomes of 201,883 cells, which enabled the deployment of a principled and unbiased tiered clustering approach, ARBOL, to fully resolve and annotate epithelial, stromal, and immune cell states, yielding 138 FGID and 305 pediCD end cell clusters. Thus, Applicants have generated a single-cell pediatric Crohn's disease (pediCD) and FGID atlas. Notably, through both flow cytometry and scRNA-seq, Applicants observe that at the level of broad cell types, treatment-naïve Crohn's disease (pediCD) does not significantly vary from FGID in cellular composition.
However, by using the high-resolution scRNA-seq analysis, Applicants identified significant differences in cell states that arise during Crohn's disease relative to FGID. Furthermore, by closely linking the scRNA-seq analysis with clinical meta-data, Applicants resolved a vector of T/NK/ILC (lymphoid), myeloid, and epithelial cell states in treatment-naïve samples which can distinguish patients with less severe disease (those not on anti-TNF therapies (NOA)) from those with more severe disease at presentation who require anti-TNF therapies. Moreover, this vector was also able to distinguish those patients who achieve a full response (FR) to anti-TNF blockade from those more treatment-resistant patients who only achieve a partial response (PR). Applicants find significant changes in cell states across all cell types in PRs relative to NOAs and FRs, highlighting cytotoxic lymphocytes (NK.MKI67.GZMA, NK.GNLY.FCER1G), substantial remodeling of the myeloid compartment (Mono.FCN1.S100A4, Mono/Mac.CXCL10.FCN1, Mac.CXCL3.APOC1) and shifts in epithelial cell phenotypes (Goblet.RETNLB.ITLN1, EC.NUPR1.LCN2) associated with increased disease severity and anti-TNF treatment non-response. Cell subsets described further herein are defined by the specific cell states identified and the terms can be used interchangeably. This study jointly leverages a treatment-naïve cohort, high-resolution principled scRNA-seq data analysis, and clinical outcomes to understand which baseline cell states can be used to predict inflammatory disease trajectory. Thus, the present invention advantageously provides for predicting patient response in IBD. Applicants provide a first treatment-naïve atlas from any inflammatory disease. Applicants identify cell states specific to severe ileal Crohn's disease. Baseline cell states are disclosed that can predict treatment response and non-response in IBD. Applicants provide for novel analysis methods.
As used herein, the terms “NOA” or “Not On Anti-TNF” refer to a subject having biopsy-proven pediCD, but for whom clinical symptoms were sufficiently mild that the treating physician did not prescribe anti-TNF agents. NOA can also refer to subjects in which anti-TNF therapy is not necessary. As used herein, the terms “FR” and “full responder” refer to a subject having pediCD and treated with anti-TNF agents who achieved a full response (FR). FR can also refer to subjects in which anti-TNF therapy may succeed in controlling disease. As used herein, the terms “PR” and “partial responder” refer to a subject having pediCD and treated with anti-TNF agents who achieved a partial response (PR). PR can also refer to subjects who will either immediately or progressively gain resistance to anti-TNF therapy. PR can also refer to subjects in which anti-TNF therapy will not succeed in controlling disease. As used herein, “controlling disease” refers to clinical symptom control and biochemical response (measuring CRP, ESR, albumin, and complete blood counts (CBC)), with a weighted Pediatric Crohn's Disease Activity Index (PCDAI) score of <12.5 on maintenance anti-TNF therapy with no dose adjustments required (Cappello and Morreale, 2016; Hyams et al., 1991; Sandborn, 2014; Turner et al., 2012, 2017). FR can be defined as clinical symptom control and biochemical response. PR to anti-TNF therapy can be defined as a lack of full clinical symptom control as determined by the treating physician or lack of full biochemical response, with documented escalation of anti-TNF therapy or addition of other agents.
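By way of non-limiting illustration only, the NOA, FR, and PR definitions above may be encoded as a simple labeling rule, sketched below. The record fields, the function name, and the "unclassified" fallback for follow-up records that satisfy neither the FR nor the PR definition are assumptions made for this example; the weighted PCDAI threshold of 12.5 is taken from the text above.

```python
from dataclasses import dataclass

WPCDAI_THRESHOLD = 12.5  # weighted PCDAI cutoff recited in the text above


@dataclass
class FollowUp:
    """Hypothetical follow-up record for one subject."""
    on_anti_tnf: bool            # anti-TNF agents were prescribed
    symptom_control: bool        # clinical symptom control achieved
    biochemical_response: bool   # CRP, ESR, albumin, CBC response achieved
    wpcdai: float                # weighted PCDAI on maintenance therapy
    therapy_escalated: bool      # anti-TNF dose escalation or added agents


def classify_response(f):
    """Label a follow-up record as NOA, FR, PR, or unclassified."""
    if not f.on_anti_tnf:
        return "NOA"
    full_response = (
        f.symptom_control
        and f.biochemical_response
        and f.wpcdai < WPCDAI_THRESHOLD
        and not f.therapy_escalated
    )
    if full_response:
        return "FR"
    incomplete = not f.symptom_control or not f.biochemical_response
    if incomplete and f.therapy_escalated:
        return "PR"
    # Neither definition is met, e.g. controlled but wPCDAI at or above 12.5.
    return "unclassified"
```

Such labels describe outcomes observed at follow-up; they are the categories that baseline cell states and gene signatures are intended to predict, and are not themselves the predictive method.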
In certain embodiments, shifts in cell types or subsets of a cell type are used to predict a disease state and for selecting a treatment. In certain embodiments, shifts in cell states in cell types or subsets of a cell type are used to predict a disease state and for selecting a treatment. As used herein, cell state refers to the differential expression of genes in specific cell subsets. As used herein, gene expression is not limited to mRNA expression and may also include protein expression. In certain embodiments, the cell subset frequency and/or cell states can be detected for screening novel therapeutics. The present invention provides for subsets of cell types in CD and FGID. In certain embodiments, the frequency of the cell subsets is shifted in disease states. Disease states may include disease severity or response to any treatment in the standard of care for the disease. In certain embodiments, the disease is an inflammatory disease. In certain embodiments, the inflammatory disease is a disease of a barrier tissue. As used herein a “barrier cell” or “barrier tissues” refers generally to various epithelial tissues of the body such as, but not limited to, those that line the respiratory system, digestive system, urinary system, and reproductive system as well as cutaneous systems. The epithelial barrier may vary in composition between tissues but is composed of basal and apical components, or crypt/villus components in the case of intestine.
In certain embodiments, disease states or conditions are treated, monitored or detected. In certain embodiments, diseases relevant to the present invention are inflammatory diseases of a barrier tissue. In certain embodiments, the cell subset composition or frequency and cell states are shifted in any such inflammatory disease. In certain embodiments, detection of specific cell subsets and/or cell states indicates whether the disease can be treated with anti-TNF blockade. Exemplary diseases include, but are not limited to, inflammatory bowel disease (IBD) including Crohn's disease (CD) and ulcerative colitis (UC), asthma, allergy, allergic rhinitis, allergic airway inflammation, atopic dermatitis (AD), chronic obstructive pulmonary disease (COPD), irritable bowel syndrome (IBS), arthritis, psoriasis, eosinophilic esophagitis, eosinophilic pneumonia, eosinophilic psoriasis, hypereosinophilic syndrome, and Eosinophilic Granulomatosis with Polyangiitis (Churg-Strauss Syndrome).
In certain embodiments, the methods of the present invention use control values for the frequency of subsets and cell states. In example embodiments, the control values can be determined for control samples that represent different states of severity along a trajectory from least severe to most severe (e.g., NOA to FR to PR). As used herein, “cell subset” refers to cells that belong to a specific cell type, such as T cells, goblet cells, dendritic cells, but can be distinguished among the specific cell type by a specific cell state or expression of specific genes. For example, subsets of T cells can include proliferating T cells, subsets of NK cells can include cytotoxic NK cells, subsets of monocytes/macrophages can include specific monocytes/macrophages, subsets of dendritic cells can include plasmacytoid dendritic cells (pDCs) and subsets of epithelial cells can include metabolically-specialized epithelial cell subsets. For example, the present cell atlases provide for the frequency of cell subsets and cell states for each of NOA, FR and PR, but control values can also be determined using additional annotated samples. As used herein the frequency of cell subsets (i.e., comparison of the number of cells) may be determined by the frequency of a subset amongst total cells or the frequency of a subset amongst its own cell type (e.g., T cell/Natural Killer/Innate lymphoid cell (T/NK/ILC), myeloid and/or epithelial cell subsets; or individual cell types within T cell/Natural Killer/Innate lymphoid cell (T/NK/ILC), myeloid and/or epithelial cell subsets). Applicants determined that the composition of cell types does not significantly differ between CD and FG samples. Thus, a change in frequency of a subset of the cell types in a sample can be detected by comparing the number of cells of a subset to the total of all cells or the total of all cells of the cell type. 
In example embodiments, the frequency of a subset of a specific cell type is compared to the total of the specific cell type. The determined frequency can then be compared to control values to determine risk for severity and treatment groups.
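By way of illustration only, the frequency calculation described above can be sketched as follows. The sketch counts labeled cells, expresses each subset as a fraction of its parent cell type, and then picks the control group whose value is closest. The function names, the toy cell counts, the control values and the nearest-value rule are all assumptions made for this example; they are not taken from the cell atlases or tables referenced herein.

```python
from collections import Counter


def subset_frequencies(cells):
    """Return {subset: fraction of its parent cell type}.

    cells: iterable of (cell_type, subset) labels, one entry per cell.
    """
    cells = list(cells)
    type_totals = Counter(cell_type for cell_type, _ in cells)
    subset_counts = Counter(cells)
    return {
        subset: count / type_totals[cell_type]
        for (cell_type, subset), count in subset_counts.items()
    }


def closest_group(frequency, control_values):
    """Pick the control group whose frequency is nearest (a simplifying assumption)."""
    return min(control_values, key=lambda group: abs(control_values[group] - frequency))


# Toy labels for illustration only; counts are invented.
cells = (
    [("Myeloid", "Mono.FCN1.S100A4")] * 3
    + [("Myeloid", "Mac.C1QB.CD14")] * 7
    + [("T/NK/ILC", "NK.MKI67.GZMA")] * 2
    + [("T/NK/ILC", "T.MAF.CTLA4")] * 8
)
freqs = subset_frequencies(cells)

# Hypothetical control frequencies for one subset along NOA -> FR -> PR.
controls = {"NOA": 0.05, "FR": 0.15, "PR": 0.30}
group = closest_group(freqs["Mono.FCN1.S100A4"], controls)
```

In practice a comparison to control values could use any suitable statistical rule; the nearest-value choice here is only the simplest one to show.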
Cells such as disclosed herein may in the context of the present specification be said to “comprise the expression” or conversely to “not express” one or more markers, such as one or more genes or gene products; or be described as “positive” or conversely as “negative” for one or more markers, such as one or more genes or gene products; or be said to “comprise” a defined “gene or gene product signature”. Such terms are commonplace and well-understood by the skilled person when characterizing cell phenotypes. By means of additional guidance, when a cell is said to be positive for or to express or comprise expression of a given marker, such as a given gene or gene product, a skilled person would conclude the presence or evidence of a distinct signal for the marker when carrying out a measurement capable of detecting or quantifying the marker in or on the cell. Suitably, the presence or evidence of the distinct signal for the marker would be concluded based on a comparison of the measurement result obtained for the cell to a result of the same measurement carried out for a negative control (for example, a cell known to not express the marker) and/or a positive control (for example, a cell known to express the marker). Where the measurement method allows for a quantitative assessment of the marker, a positive cell may generate a signal for the marker that is at least 1.5-fold higher than a signal generated for the marker by a negative control cell or than an average signal generated for the marker by a population of negative control cells, e.g., at least 2-fold, at least 4-fold, at least 10-fold, at least 20-fold, at least 30-fold, at least 40-fold, at least 50-fold higher or even higher. Further, a positive cell may generate a signal for the marker that is 3.0 or more standard deviations, e.g., 3.5 or more, 4.0 or more, 4.5 or more, or 5.0 or more standard deviations, higher than an average signal generated for the marker by a population of negative control cells. 
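As a non-limiting sketch, the two quantitative criteria above (a signal at least 1.5-fold higher than the negative-control average, or 3.0 or more standard deviations above it) could be checked as shown below. The function name, the use of the sample standard deviation and the example signal values are assumptions introduced for illustration.

```python
from statistics import mean, stdev


def marker_calls(signal, negative_controls, min_fold=1.5, min_sd=3.0):
    """Return (fold_ok, sd_ok) for one cell's marker signal.

    fold_ok: signal is at least min_fold times the negative-control mean.
    sd_ok:   signal is at least min_sd sample standard deviations above that mean.
    """
    baseline = mean(negative_controls)
    spread = stdev(negative_controls)
    fold_ok = baseline > 0 and signal >= min_fold * baseline
    sd_ok = spread > 0 and signal >= baseline + min_sd * spread
    return fold_ok, sd_ok


# Invented negative-control signals (arbitrary units), mean 1.0.
negatives = [1.0, 1.2, 0.8, 1.1, 0.9]

strong = marker_calls(2.0, negatives)    # passes both criteria
border = marker_calls(1.48, negatives)   # fails the fold rule, passes the SD rule
weak = marker_calls(1.2, negatives)      # fails both criteria
```

The two criteria are returned separately because the text presents them as alternative ways of concluding a distinct signal, and they need not agree for a given measurement.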
In regards to frequency, a cell subset may be present or not present. In certain embodiments, a cell subset may be 5, 10, 20, 30, 40, 50, 60, 70, 80 or 90% more frequent in a parent cell population as compared to a control level.
A Method of Stratifying Subjects Suffering from IBD
In one example embodiment, a method for stratifying subjects suffering from IBD into risk groups comprises detecting in a sample obtained from a subject the frequency of one or more T cell/Natural Killer/Innate Lymphoid cell (T/NK/ILC), myeloid and/or epithelial cell subsets selected from Table 1, and determining if the subject is in a well-controlled without anti-TNF-blockade (NOA) risk group, an anti-TNF-blockade full responder (FR) risk group, or anti-TNF-blockade partial responder (PR) risk group by comparing the frequency of the detected cell subsets to a control frequency for the subject along a trajectory of disease severity from NOA, to FR, to PR. Table 10 provides for frequencies of each subset in each pediCD patient.
Table 1 provides for cell subset specific gene markers in the pediCD atlas. Table 1A provides for subset specific markers for all subsets identified in the pediCD atlas. The genes with the lowest adjusted p value are shown first for each subset. As discussed herein, when the adjusted p value is 0, more than 40% of cells within the cluster are positive for the gene and fewer than 6% of other cells are positive for the gene. In one example embodiment, 1, 2, 3, 4, 5, 6, 7, 8, 9 or 10 or more genes are detected. In another example embodiment, detecting 2 or more of the subset markers increases the probability of detecting a cell subset. Table 1B provides for subset specific markers with a higher adjusted p value cutoff for subsets that are shifted in frequency between NOA, FR and PR.
In one example embodiment, the cell subsets have higher expression of one or more principal components (PC) determined using dimension reduction (see, e.g., Shalek, A. K. et al. Single-cell transcriptomics reveals bimodality in expression and splicing in immune cells. Nature 498, 236-240, doi:10.1038/nature12172 (2013)). Cell subsets can be identified as clusters of cells using any dimension reduction method (see, e.g., Becht et al., Evaluation of UMAP as an alternative to t-SNE for single-cell data, bioRxiv 298430; doi.org/10.1101/298430; Becht et al., 2019, Dimensionality reduction for visualizing single-cell data using UMAP, Nature Biotechnology volume 37, pages 38-44; and Moon et al., PHATE: A Dimensionality Reduction Method for Visualizing Trajectory Structures in High-Dimensional Biological Data, bioRxiv 120378; doi: doi.org/10.1101/120378). Cell subsets or cell states can also be referred to by a cluster name. Table 3 shows PC loadings for the cell subsets in the pediCD atlas. Table 11 shows PCA Loadings for the joint Epithelial, Myeloid, T/NK/ILC vectors. In one example embodiment, cell subsets that are the top negative loadings of PC2 are most predictive of NOA, FR and PR. In certain embodiments, top cell subsets for the negative loadings of PC2 include one or more of cDC2.CD1C.AREG, T.MAF.CTLA4, T.CCL20.RORA, Goblet.RETNLB.ITLN1, Mac.C1QB.CD14, Mono.CXCL3.FCN1, pDC.IRF7.IL3RA, Mac.CXCL3.APOC1, EC.NUPR1.LCN2, T.GNLY.CSF2, Mono.Mac.CXCL10.FCN1, T.MKI67.FOXP3, T.MKI67.IFNG, Mac.DC.CXCL10.CLEC4E, NK.GNLY.FCER1G, T.MKI67.IL22, NK.GNLY.IFNG, EC.OLFM4.MT.ND2, NK.GNLY.GZMB, Mono.Mac.CXCL10.CXCL11, Mono.FCN1.S100A4, T.CARD16.GB2, Mono.CXCL10.TNF, and NK.MKI67.GZMA. In one example embodiment, marker genes are detected for the top negative loadings for PC2.
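For illustration, selecting the subsets with the most negative loadings on a component, and scoring a sample along that component, might be sketched as below. The loading values, cohort means and sample frequencies are placeholders invented for this example and are not values from Table 3 or Table 11; the function names and the entry "Hypothetical.Subset" are likewise assumptions. Note also that the sign of a principal component is a software convention, so "negative" is meaningful only relative to the loadings as tabulated.

```python
def top_negative_loadings(loadings, k=3):
    """Return up to k subset names with negative loadings, most negative first.

    loadings: {subset_name: loading on the chosen principal component}.
    """
    negatives = [(value, name) for name, value in loadings.items() if value < 0]
    return [name for value, name in sorted(negatives)[:k]]


def pc_score(sample_freqs, cohort_means, loadings):
    """Project one sample onto the component: sum of loading * centered frequency."""
    return sum(
        weight * (sample_freqs[name] - cohort_means[name])
        for name, weight in loadings.items()
    )


# Placeholder numbers only; NOT taken from Table 3 or Table 11.
pc2 = {
    "NK.MKI67.GZMA": -0.40,
    "Mono.FCN1.S100A4": -0.30,
    "cDC2.CD1C.AREG": -0.10,
    "Hypothetical.Subset": 0.25,
}
means = {
    "NK.MKI67.GZMA": 0.10,
    "Mono.FCN1.S100A4": 0.20,
    "cDC2.CD1C.AREG": 0.05,
    "Hypothetical.Subset": 0.30,
}
sample = {
    "NK.MKI67.GZMA": 0.20,
    "Mono.FCN1.S100A4": 0.30,
    "cDC2.CD1C.AREG": 0.05,
    "Hypothetical.Subset": 0.20,
}

top = top_negative_loadings(pc2, k=2)
score = pc_score(sample, means, pc2)
```

With these placeholder values, a sample enriched for the negatively loaded subsets receives a more negative score on the component.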
In certain embodiments, the subsets detected include one or more of cDC2.CD1C.AREG, T.MAF.CTLA4, T.CCL20.RORA, Goblet.RETNLB.ITLN1, Mac.C1QB.CD14, Mono.CXCL3.FCN1, pDC.IRF7.IL3RA, Mac.CXCL3.APOC1, EC.NUPR1.LCN2, T.GNLY.CSF2, Mono.Mac.CXCL10.FCN1, T.MKI67.FOXP3, T.MKI67.IFNG, NK.GNLY.FCER1G, T.MKI67.IL22, NK.GNLY.IFNG, NK.GNLY.GZMB, Mono.FCN1.S100A4 and NK.MKI67.GZMA. In one example embodiment, the subsets detected include one or more of Goblet.RETNLB.ITLN1, Mac.CXCL3.APOC1, EC.NUPR1.LCN2, Mono.Mac.CXCL10.FCN1, NK.GNLY.FCER1G, Mono.FCN1.S100A4, and NK.MKI67.GZMA. In one example embodiment, the subsets detected include one or more of cDC2.CD1C.AREG, T.MAF.CTLA4, T.CCL20.RORA, Mac.C1QB.CD14, Mono.CXCL3.FCN1, pDC.IRF7.IL3RA, T.GNLY.CSF2, Mono.Mac.CXCL10.FCN1, T.MKI67.FOXP3, T.MKI67.IFNG, NK.GNLY.FCER1G, NK.GNLY.IFNG, NK.GNLY.GZMB, and NK.MKI67.GZMA. In one example embodiment, the subsets detected include one or more of Mono.Mac.CXCL10.FCN1, NK.GNLY.FCER1G, and NK.MKI67.GZMA.
In one example embodiment, one or more cell subsets are detected that have a shift in frequency in NOA as compared to FR and PR. In one example embodiment, an increase in frequency of CD.NK.MKI67.GZMA and CD.T.MKI67.IL22 indicates FR or PR and a decreased frequency indicates NOA. In one example embodiment, a decrease in frequency of CD.Fibro.CCL19.IRF7 and CD.EC.SLC28A2.GSTA2 indicates FR or PR and an increased frequency indicates NOA.
In one example embodiment, one or more cell subsets are detected that have a shift in frequency in NOA as compared to PR. In one example embodiment, an increase in frequency of CD.T.MKI67.IFNG, CD.T.MKI67.FOXP3, CD.T.GNLY.CSF2, CD.NK.GNLY.FCER1G, CD.Mac.CXCL3.APOC1, CD.Mono/Mac.CXCL10.FCN1, CD.Mono.FCN1.S100A4, CD.Endth/Ven.LAMP3.LIPG, and CD.Goblet.TFF1.TPSG1 indicates PR and a decreased frequency indicates NOA. In certain embodiments, a decrease in frequency of CD.T.LAG3.BATF, CD.T.IFI44L.PTGER4, and CD.T.IFI6.IRF7, CD.cDC2.CLEC10A.FCGR2B, CD.Fibro.IFI6.IFI44L, CD.Tuft.GNAT3.TRPM5, CD.EC.GSTA2.CES3, and CD.EC.GSTA2.TMPRSS15 indicates PR and an increased frequency indicates NOA.
In one example embodiment, one or more cell subsets are detected that have a shift in frequency in NOA as compared to FR. In one example embodiment, a decrease in frequency of CD.EpithStem.LINC00176.RPS4Y1, CD.MCell.CSRP2.SPIB, CD.EC.FABP6.PLCG2, and CD.EC.FABP1.ADIRF indicates FR and an increased frequency indicates NOA.
In certain embodiments, one or more cell subsets are detected that have a shift in frequency in FR as compared to PR. In certain embodiments, an increase in frequency of CD.B/DZ.HIST1H1B.MKI67 indicates PR and a decreased frequency indicates FR.
In one example embodiment, cell subsets identified in FGID are detected. Table 4 provides for subset specific markers for each subset.
In another example embodiment, a method for stratifying subjects suffering from IBD into risk groups comprises detecting in a sample obtained from a subject one or more signature genes or a gene signature. Applicants have identified specific cell states, or gene signatures, that are shifted along a trajectory of disease severity. Thus, detecting cell states can be used for diagnostic and therapeutic methods. In particular, the cell states are shifted between anti-TNF-blockade full responder (FR) and anti-TNF-blockade partial responder (PR) subjects. In one example embodiment, one or more differentially expressed genes are detected (Table 2). In one example embodiment, the one or more genes are detected in a specific cell subset. In one example embodiment, cell subset specific markers are used to determine a subset and one or more differentially expressed genes in that subset are detected in combination. Thus, one or more markers can be used to identify the cell subset and differentially expressed genes can be detected in only that subset. In one example embodiment, genes differentially expressed between FR and PR are selected from Table 2A, 2B or 2C. Table 2A shows the top differentially expressed genes in each subset. Table 2B shows genes differentially expressed in the cell subsets having the most differentially expressed genes. In certain embodiments, APOA1, FABP6, NACA, APOA4, TPT1, SPINK4, MIF, IFITM1, HOPX, and HOPX are increased in FR relative to PR, and TNFRSF11B, TFPI2, SERPINE2, GSN, COL1A1, HIF1A, COL1A2, CTNNB1, CCL11, EMILIN1, CEBPB, SLC16A4, HTRA3, CMC1, AREG, COL4A1, SKIL, KLRC1, PTGER4, BRI3, APOE, BDKRB1, TXN, GPR65, NKG7, SAMHD1, CLEC12A, STAT1, PFN1, and TAX1BP1 are increased in PR relative to FR. Table 2C shows all of the differentially expressed genes in the two subsets with the most differentially expressed genes. In certain embodiments, the cell state is a gene program comprising one or more up and down regulated genes.
In example embodiments, one or more genes of cell states associated with disease severity and treatment outcomes are detected. In example embodiments, the disease severity gene signature includes one or more of the top 92 markers of the 25 cell states associated with disease severity and treatment outcomes (Table 14). In example embodiments, one or more of TNFAIP6, GZMB, S100A8, CSF2, CLEC4E, S100A9, IL1RN, FCGR1A, CLIC3, CD14, PLA2G7, FAM26F, IL3RA, NKG7, IL32, CCL3, OLR1, LILRA4, APOC1 and MYBL2 are detected to predict anti-TNF therapy outcome in newly diagnosed patients. In example embodiments, the one or more genes are detected in bulk samples or in single cells.
Clusters (subsets) and gene programs as described herein can also be described as a metagene. As used herein a “metagene” refers to a pattern or aggregate of gene expression and not an actual gene. Each metagene may represent a collection or aggregate of genes behaving in a functionally correlated fashion within the genome. The metagene can be increased if the pattern is increased. As used herein the term “gene program” or “program” can be used interchangeably with “cell state”, “biological program”, “expression program”, “transcriptional program”, “expression profile”, “signature”, or “gene signature” and may refer to a set of genes that share a role in a biological function (e.g., an inflammatory program, cell differentiation program, proliferation program). Biological programs can include a pattern of gene expression that results in a corresponding physiological event or phenotypic trait (e.g., inflammation). Biological programs can include up to several hundred genes that are expressed in a spatially and temporally controlled fashion. Expression of individual genes can be shared between biological programs. Expression of individual genes can be shared among different single cell subtypes; however, expression of a biological program may be cell subtype specific or temporally specific (e.g., the biological program is expressed in a cell subtype at a specific time). Multiple biological programs may include the same gene, reflecting the gene's roles in different processes. Expression of a biological program may be regulated by a master switch, such as a nuclear receptor or transcription factor.
As used herein a “signature” or “gene program” may encompass any gene or genes, protein or proteins, or epigenetic element(s) whose expression profile or whose occurrence is associated with a specific cell type, subtype, or cell state of a specific cell type or subtype within a population of cells. For ease of discussion, when discussing gene expression, any of gene or genes, protein or proteins, or epigenetic element(s) may be substituted. Levels of expression or activity or prevalence may be compared between different cells in order to characterize or identify for instance signatures specific for cell (sub)populations. Increased or decreased expression or activity or prevalence of signature genes may be compared between different cells in order to characterize or identify for instance specific cell (sub)populations. The detection of a signature in single cells may be used to identify and quantitate for instance specific cell (sub)populations. A signature may include a gene or genes, protein or proteins, or epigenetic element(s) whose expression or occurrence is specific to a cell (sub)population, such that expression or occurrence is exclusive to the cell (sub)population. A gene signature as used herein, may thus refer to any set of up- and down-regulated genes that are representative of a cell type or subtype. A gene signature as used herein, may also refer to any set of up- and down-regulated genes between different cells or cell (sub)populations derived from a gene-expression profile. For example, a gene signature may comprise a list of genes differentially expressed in a distinction of interest.
The signature as defined herein (be it a gene signature, protein signature or other genetic or epigenetic signature) can be used to indicate the presence of a cell type, a subtype of the cell type, the state of the microenvironment of a population of cells, a particular cell type population or subpopulation, and/or the overall status of the entire cell (sub)population. Furthermore, the signature may be indicative of cells within a population of cells in vivo. The signature may also be used to suggest for instance particular therapies, or to follow up treatment, or to suggest ways to modulate immune systems. The presence of subtypes or cell states may be determined by subtype specific or cell state specific signatures. The presence of these specific cell (sub)types or cell states may be determined by applying the signature genes to bulk sequencing data in a sample. Not being bound by a theory, the signatures of the present invention may be microenvironment specific, such as their expression in a particular spatio-temporal context. Not being bound by a theory, signatures as discussed herein are specific to a particular pathological context. Not being bound by a theory, a combination of cell subtypes having a particular signature may indicate an outcome. Not being bound by a theory, the signatures can be used to deconvolute the network of cells present in a particular pathological condition. Not being bound by a theory, the presence of specific cells and cell subtypes is indicative of a particular response to treatment, such as including increased or decreased susceptibility to treatment. The signature may indicate the presence of one particular cell type.
In one embodiment, the novel signatures are used to detect multiple cell states or hierarchies that occur in subpopulations of immune cells that are linked to particular pathological condition (e.g., inflammation), or linked to a particular outcome or progression of the disease (e.g., autoimmunity), or linked to a particular response to treatment of the disease.
The signature according to certain embodiments of the present invention may comprise or consist of one or more genes, proteins and/or epigenetic elements, such as for instance 1, 2, 3, 4, 5, 6, 7, 8, 9, 10 or more. In certain embodiments, the signature may comprise or consist of two or more genes, proteins and/or epigenetic elements, such as for instance 2, 3, 4, 5, 6, 7, 8, 9, 10 or more. In certain embodiments, the signature may comprise or consist of three or more genes, proteins and/or epigenetic elements, such as for instance 3, 4, 5, 6, 7, 8, 9, 10 or more. In certain embodiments, the signature may comprise or consist of four or more genes, proteins and/or epigenetic elements, such as for instance 4, 5, 6, 7, 8, 9, 10 or more. In certain embodiments, the signature may comprise or consist of five or more genes, proteins and/or epigenetic elements, such as for instance 5, 6, 7, 8, 9, 10 or more. In certain embodiments, the signature may comprise or consist of six or more genes, proteins and/or epigenetic elements, such as for instance 6, 7, 8, 9, 10 or more. In certain embodiments, the signature may comprise or consist of seven or more genes, proteins and/or epigenetic elements, such as for instance 7, 8, 9, 10 or more. In certain embodiments, the signature may comprise or consist of eight or more genes, proteins and/or epigenetic elements, such as for instance 8, 9, 10 or more. In certain embodiments, the signature may comprise or consist of nine or more genes, proteins and/or epigenetic elements, such as for instance 9, 10 or more. In certain embodiments, the signature may comprise or consist of ten or more genes, proteins and/or epigenetic elements, such as for instance 10, 11, 12, 13, 14, 15, or more. It is to be understood that a signature according to the invention may for instance also include genes or proteins as well as epigenetic elements combined.
It is to be understood that “differentially expressed” genes/proteins include genes/proteins which are up- or down-regulated as well as genes/proteins which are turned on or off. When referring to up- or down-regulation, in certain embodiments, such up- or down-regulation is preferably at least two-fold, such as two-fold, three-fold, four-fold, five-fold, or more, such as for instance at least ten-fold, at least 20-fold, at least 30-fold, at least 40-fold, at least 50-fold, or more. Alternatively, or in addition, differential expression may be determined based on common statistical tests, as is known in the art.
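A minimal sketch of the fold-change criterion described above is given below, assuming non-negative expression values for a single gene in a reference condition and a test condition. The function name, the category labels and the example values are illustrative assumptions; a statistical test, as noted above, may be used instead of or in addition to a fold threshold.

```python
def classify_change(reference, test, min_fold=2.0):
    """Label the change in one gene's expression from reference to test.

    Zero expression in exactly one condition is reported as turned on/off;
    otherwise the ratio test/reference is compared with min_fold.
    """
    if reference == 0 and test == 0:
        return "unchanged"
    if reference == 0:
        return "turned on"
    if test == 0:
        return "turned off"
    ratio = test / reference
    if ratio >= min_fold:
        return "up-regulated"
    if ratio <= 1.0 / min_fold:
        return "down-regulated"
    return "unchanged"


# Invented expression values (arbitrary units) for illustration.
calls = [
    classify_change(10.0, 25.0),  # 2.5-fold increase
    classify_change(10.0, 4.0),   # 0.4-fold, i.e. a 2.5-fold decrease
    classify_change(10.0, 15.0),  # below the two-fold threshold
    classify_change(0.0, 5.0),    # absent in reference, present in test
    classify_change(5.0, 0.0),    # present in reference, absent in test
]
```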
As discussed herein, differentially expressed genes/proteins, or differential epigenetic elements may be differentially expressed on a single cell level, or may be differentially expressed on a cell population level. Preferably, the differentially expressed genes/proteins or epigenetic elements as discussed herein, such as constituting the gene signatures as discussed herein, when referring to the cell population level, refer to genes that are differentially expressed in all or substantially all cells of the population (such as at least 80%, preferably at least 90%, such as at least 95% of the individual cells). This allows one to define a particular subpopulation of tumor cells. As referred to herein, a “subpopulation” of cells preferably refers to a particular subset of cells of a particular cell type which can be distinguished or are uniquely identifiable and set apart from other cells of this cell type. The cell subpopulation may be phenotypically characterized, and is preferably characterized by the signature as discussed herein. A cell (sub)population as referred to herein may consist of a (sub)population of cells of a particular cell type characterized by a specific cell state.
When referring to induction, or alternatively suppression of a particular signature, preferably is meant induction or alternatively suppression (or upregulation or downregulation) of at least one gene/protein and/or epigenetic element of the signature, such as for instance at least two, at least three, at least four, at least five, at least six, or all genes/proteins and/or epigenetic elements of the signature.
As used herein, all gene name symbols refer to the gene as commonly known in the art. The examples described herein that refer to the human gene names are to be understood to also encompass mouse genes, as well as genes in any other organism (e.g., homologous, orthologous genes). Any reference to the gene symbol is a reference made to the entire gene or variants of the gene. Any reference to the gene symbol is also a reference made to the gene product (e.g., protein). The term, homolog, may apply to the relationship between genes separated by the event of speciation (e.g., ortholog). Orthologs are genes in different species that evolved from a common ancestral gene by speciation. Normally, orthologs retain the same function in the course of evolution. Gene symbols may be those referred to by the HUGO Gene Nomenclature Committee (HGNC) or National Center for Biotechnology Information (NCBI). The signature as described herein may encompass any of the genes described herein.
In certain embodiments, detecting cell subset markers or differentially expressed genes can be used to determine a treatment for a subject suffering from a disease or stratify a subject. The invention provides biomarkers (e.g., phenotype specific or cell subtype) for the identification, diagnosis, prognosis and manipulation of cell properties, for use in a variety of diagnostic and/or therapeutic indications. Biomarkers in the context of the present invention encompass, without limitation, nucleic acids, proteins, reaction products, and metabolites, together with their polymorphisms, mutations, variants, modifications, subunits, fragments, and other analytes or sample-derived measures. In certain embodiments, biomarkers include the signature genes or signature gene products, and/or cells as described herein.
The terms “diagnosis” and “monitoring” are commonplace and well-understood in medical practice. By means of further explanation and without limitation the term “diagnosis” generally refers to the process or act of recognising, deciding on or concluding on a disease or condition in a subject on the basis of symptoms and signs and/or from results of various diagnostic procedures (such as, for example, from knowing the presence, absence and/or quantity of one or more biomarkers characteristic of the diagnosed disease or condition).
The terms “prognosing” or “prognosis” generally refer to an anticipation on the progression of a disease or condition and the prospect (e.g., the probability, duration, and/or extent) of recovery. A good prognosis of the diseases or conditions taught herein may generally encompass anticipation of a satisfactory partial or complete recovery from the diseases or conditions, preferably within an acceptable time period. A good prognosis of such may more commonly encompass anticipation of not further worsening or aggravating of such, preferably within a given time period. A poor prognosis of the diseases or conditions as taught herein may generally encompass anticipation of a substandard recovery and/or unsatisfactorily slow recovery, or to substantially no recovery or even further worsening of such.
The biomarkers of the present invention are useful in methods of identifying patient populations who would benefit or not benefit from anti-TNF blockade based on a detected level of expression, activity and/or function of one or more biomarkers. These biomarkers are also useful in monitoring subjects undergoing treatments and therapies for suitable or aberrant response(s) to determine efficaciousness of the treatment or therapy and for selecting or modifying therapies and treatments that would be efficacious in treating, delaying the progression of or otherwise ameliorating a symptom. The biomarkers provided herein are useful for selecting a group of patients at a specific state of a disease with accuracy that facilitates selection of treatments.
The term “monitoring” generally refers to the follow-up of a disease or a condition in a subject for any changes which may occur over time.
The terms also encompass prediction of a disease. The terms “predicting” or “prediction” generally refer to an advance declaration, indication or foretelling of a disease or condition in a subject not (yet) having said disease or condition. For example, a prediction of a disease or condition in a subject may indicate a probability, chance or risk that the subject will develop said disease or condition, for example within a certain time period or by a certain age. Said probability, chance or risk may be indicated inter alia as an absolute value, range or statistics, or may be indicated relative to a suitable control subject or subject population (such as, e.g., relative to a general, normal or healthy subject or subject population). Hence, the probability, chance or risk that a subject will develop a disease or condition may be advantageously indicated as increased or decreased, or as fold-increased or fold-decreased relative to a suitable control subject or subject population. As used herein, the term “prediction” of the conditions or diseases as taught herein in a subject may also particularly mean that the subject has a ‘positive’ prediction of such, i.e., that the subject is at risk of having such (e.g., the risk is significantly increased vis-à-vis a control subject or subject population). The term “prediction of no” diseases or conditions as taught herein in a subject may particularly mean that the subject has a ‘negative’ prediction of such, i.e., that the subject's risk of having such is not significantly increased vis-à-vis a control subject or subject population.
Suitably, an altered quantity or phenotype of the cells in the subject compared to a control subject having normal status or not having a disease indicates response to treatment. Hence, the methods may rely on comparing the quantity of cell populations, biomarkers, or gene or gene product signatures measured in samples from patients with reference values, wherein said reference values represent known predictions, diagnoses and/or prognoses of diseases or conditions as taught herein.
For example, distinct reference values may represent the prediction of a risk (e.g., an abnormally elevated risk) of having a given disease or condition as taught herein vs. the prediction of no or normal risk of having said disease or condition. In another example, distinct reference values may represent predictions of differing degrees of risk of having such disease or condition.
In a further example, distinct reference values can represent the diagnosis of a given disease or condition as taught herein vs. the diagnosis of no such disease or condition (such as, e.g., the diagnosis of healthy, or recovered from said disease or condition, etc.). In another example, distinct reference values may represent the diagnosis of such disease or condition of varying severity.
In yet another example, distinct reference values may represent a good prognosis for a given disease or condition as taught herein vs. a poor prognosis for said disease or condition. In a further example, distinct reference values may represent varyingly favourable or unfavourable prognoses for such disease or condition.
Such comparison may generally include any means to determine the presence or absence of at least one difference and optionally of the size of such difference between values being compared. A comparison may include a visual inspection or an arithmetical or statistical comparison of measurements. Such statistical comparisons include, but are not limited to, applying a rule.
Reference values may be established according to known procedures previously employed for other cell populations, biomarkers and gene or gene product signatures. For example, a reference value may be established in an individual or a population of individuals characterised by a particular diagnosis, prediction and/or prognosis of said disease or condition (i.e., for whom said diagnosis, prediction and/or prognosis of the disease or condition holds true). Such population may comprise without limitation 2 or more, 10 or more, 100 or more, or even several hundred or more individuals.
A “deviation” of a first value from a second value may generally encompass any direction (e.g., increase: first value>second value; or decrease: first value<second value) and any extent of alteration.
For example, a deviation may encompass a decrease in a first value by, without limitation, at least about 10% (about 0.9-fold or less), or by at least about 20% (about 0.8-fold or less), or by at least about 30% (about 0.7-fold or less), or by at least about 40% (about 0.6-fold or less), or by at least about 50% (about 0.5-fold or less), or by at least about 60% (about 0.4-fold or less), or by at least about 70% (about 0.3-fold or less), or by at least about 80% (about 0.2-fold or less), or by at least about 90% (about 0.1-fold or less), relative to a second value with which a comparison is being made.
For example, a deviation may encompass an increase of a first value by, without limitation, at least about 10% (about 1.1-fold or more), or by at least about 20% (about 1.2-fold or more), or by at least about 30% (about 1.3-fold or more), or by at least about 40% (about 1.4-fold or more), or by at least about 50% (about 1.5-fold or more), or by at least about 60% (about 1.6-fold or more), or by at least about 70% (about 1.7-fold or more), or by at least about 80% (about 1.8-fold or more), or by at least about 90% (about 1.9-fold or more), or by at least about 100% (about 2-fold or more), or by at least about 150% (about 2.5-fold or more), or by at least about 200% (about 3-fold or more), or by at least about 500% (about 6-fold or more), or by at least about 700% (about 8-fold or more), or the like, relative to a second value with which a comparison is being made.
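The relationship between the percentages and fold values recited above can be illustrated with a short calculation. The following is a minimal sketch only; the function names are hypothetical and it assumes positive, non-zero measured values.

```python
def fold_change(first: float, second: float) -> float:
    """Ratio of the first value to the second (reference) value."""
    if second == 0:
        raise ValueError("second (reference) value must be non-zero")
    return first / second


def percent_deviation(first: float, second: float) -> float:
    """Signed percent change of the first value relative to the second."""
    return (fold_change(first, second) - 1.0) * 100.0


def describe_deviation(first: float, second: float):
    """Return (direction, absolute percent change, fold) for two values."""
    pct = percent_deviation(first, second)
    if pct > 0:
        direction = "increase"
    elif pct < 0:
        direction = "decrease"
    else:
        direction = "none"
    return direction, abs(pct), fold_change(first, second)


# A first value of 30 against a second value of 10 is a 200% increase (3-fold);
# a first value of 5 against a second value of 10 is a 50% decrease (0.5-fold).
print(describe_deviation(30, 10))
print(describe_deviation(5, 10))
```

An increase of X% thus corresponds to a fold value of 1 + X/100, and a decrease of X% to a fold value of 1 − X/100.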
Preferably, a deviation may refer to a statistically significant observed alteration. For example, a deviation may refer to an observed alteration which falls outside of error margins of reference values in a given population (as expressed, for example, by standard deviation or standard error, or by a predetermined multiple thereof, e.g., ±1×SD or ±2×SD or ±3×SD, or ±1×SE or ±2×SE or ±3×SE). Deviation may also refer to a value falling outside of a reference range defined by values in a given population (for example, outside of a range which comprises ≥40%, ≥50%, ≥60%, ≥70%, ≥75% or ≥80% or ≥85% or ≥90% or ≥95% or even 100% of values in said population).
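By way of illustration, the two criteria above (a multiple of the standard deviation, and a reference range covering a chosen fraction of the population) may be sketched as follows. This is a non-limiting example; the function names, the linear-interpolation quantile and the sample reference values are assumptions made for illustration only.

```python
import statistics


def outside_sd_margin(value, reference_values, k=2.0):
    """True if value lies more than k sample standard deviations from the mean."""
    mean = statistics.mean(reference_values)
    sd = statistics.stdev(reference_values)
    return abs(value - mean) > k * sd


def _quantile(sorted_vals, q):
    """Quantile by linear interpolation between neighbouring sorted values."""
    pos = q * (len(sorted_vals) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(sorted_vals) - 1)
    return sorted_vals[lo] + (pos - lo) * (sorted_vals[hi] - sorted_vals[lo])


def outside_central_range(value, reference_values, coverage=0.95):
    """True if value lies outside the central range holding `coverage` of the data."""
    vals = sorted(reference_values)
    tail = (1.0 - coverage) / 2.0
    return value < _quantile(vals, tail) or value > _quantile(vals, 1.0 - tail)


# Illustrative reference population (placeholder numbers, not measured data).
reference = [10, 11, 9, 10, 12, 8, 10, 11, 9, 10]
print(outside_sd_margin(13, reference, k=2.0))
print(outside_central_range(12, reference, coverage=0.8))
```

The multiple `k` and the `coverage` fraction correspond to the predetermined multiple and the population percentage described above and would be chosen for the particular application.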
In a further embodiment, a deviation may be concluded if an observed alteration is beyond a given threshold or cut-off. Such threshold or cut-off may be selected as generally known in the art to provide for a chosen sensitivity and/or specificity of the prediction methods, e.g., sensitivity and/or specificity of at least 50%, or at least 60%, or at least 70%, or at least 80%, or at least 85%, or at least 90%, or at least 95%.
For example, receiver-operating characteristic (ROC) curve analysis can be used to select an optimal cut-off value of the quantity of a given immune cell population, biomarker or gene or gene product signatures, for clinical use of the present diagnostic tests, based on acceptable sensitivity and specificity, or related performance measures which are well-known per se, such as positive predictive value (PPV), negative predictive value (NPV), positive likelihood ratio (LR+), negative likelihood ratio (LR−), Youden index, or similar.
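As a non-limiting sketch of such cut-off selection, the following computes sensitivity and specificity at each candidate cut-off and selects the cut-off that maximizes the Youden index (sensitivity + specificity − 1). It assumes that higher measured quantities indicate the condition; the function names and the example values are hypothetical placeholders rather than data from any study.

```python
def roc_points(values, labels):
    """Return (cutoff, sensitivity, specificity) for each candidate cut-off.

    labels: 1 = condition present, 0 = condition absent.
    A sample is called positive when its value is >= the cut-off.
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("both classes must be represented")
    points = []
    for cutoff in sorted(set(values)):
        tp = sum(1 for v, y in zip(values, labels) if y == 1 and v >= cutoff)
        tn = sum(1 for v, y in zip(values, labels) if y == 0 and v < cutoff)
        points.append((cutoff, tp / positives, tn / negatives))
    return points


def youden_cutoff(values, labels):
    """Cut-off with the largest Youden index J = sensitivity + specificity - 1."""
    return max(roc_points(values, labels), key=lambda p: p[1] + p[2] - 1.0)


# Placeholder measurements and known outcomes, for illustration only.
values = [1, 2, 3, 4, 5, 6, 7, 8]
labels = [0, 0, 1, 0, 1, 1, 1, 1]
cutoff, sensitivity, specificity = youden_cutoff(values, labels)
print(cutoff, sensitivity, specificity)
```

Other performance measures named above, such as PPV, NPV or likelihood ratios, could be substituted for the Youden index as the selection criterion.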
In one embodiment, the signature genes, biomarkers, and/or cells may be detected by immunofluorescence, immunohistochemistry (IHC), fluorescence activated cell sorting (FACS), mass spectrometry (MS), mass cytometry (CyTOF), RNA-seq, single cell RNA-seq (described further herein), quantitative RT-PCR, single cell qPCR, FISH, RNA-FISH, MERFISH (multiplex (in situ) RNA FISH) (Chen et al., Spatially resolved, highly multiplexed RNA profiling in single cells. Science, 2015, 348:aaa6090; and Xia et al., Multiplexed detection of RNA using MERFISH and branched DNA amplification. Sci Rep. 2019 May 22; 9(1):7721. doi: 10.1038/s41598-019-43943-8), ExSeq (Alon, S. et al. Expansion Sequencing: Spatially Precise In Situ Transcriptomics in Intact Biological Systems. biorxiv.org/lookup/doi/10.1101/2020.05.13.094268 (2020) doi:10.1101/2020.05.13.094268), and/or by in situ hybridization. Other methods including absorbance assays and colorimetric assays are known in the art and may be used herein. Detection may comprise primers and/or probes or fluorescently bar-coded oligonucleotide probes for hybridization to RNA (see e.g., Geiss G K, et al., Direct multiplexed measurement of gene expression with color-coded probe pairs. Nat Biotechnol. 2008 March; 26(3):317-25).
In certain embodiments, a tissue sample may be obtained and analyzed for specific cell markers (IHC) or specific transcripts (e.g., RNA-FISH). Tissue samples for diagnosis, prognosis or detecting may be obtained by endoscopy. In one embodiment, a sample may be obtained by endoscopy and analyzed by FACS. As used herein, “endoscopy” refers to a procedure that uses an endoscope to examine the interior of a hollow organ or cavity of the body. The endoscope may include a camera and a light source. The endoscope may include tools for dissection or for obtaining a biological sample (e.g., a biopsy).
The present invention also may comprise a kit with a detection reagent that binds to one or more biomarkers or can be used to detect one or more biomarkers.
Immunoassay methods are based on the reaction of an antibody to its corresponding target or analyte and can detect the analyte in a sample depending on the specific assay format. To improve specificity and sensitivity of an assay method based on immunoreactivity, monoclonal antibodies are often used because of their specific epitope recognition. Polyclonal antibodies have also been successfully used in various immunoassays because of their increased affinity for the target as compared to monoclonal antibodies. Immunoassays have been designed for use with a wide range of biological sample matrices. Immunoassay formats have been designed to provide qualitative, semi-quantitative, and quantitative results.
Quantitative results may be generated through the use of a standard curve created with known concentrations of the specific analyte to be detected. The response or signal from an unknown sample is plotted onto the standard curve, and a quantity or value corresponding to the target in the unknown sample is established.
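The standard-curve readout described above may be sketched as follows. This minimal example assumes a monotonic curve and simple piecewise-linear interpolation between standards; in practice other curve fits may be used. The function name and the standard concentrations and signals are illustrative placeholders.

```python
def interpolate_concentration(signal, standards):
    """Estimate an analyte quantity from a signal using a standard curve.

    standards: list of (known_concentration, measured_signal) pairs.
    Interpolates linearly between the two standards that bracket the signal.
    """
    pts = sorted(standards, key=lambda p: p[1])  # order by signal
    for (c0, s0), (c1, s1) in zip(pts, pts[1:]):
        if s0 <= signal <= s1:
            if s1 == s0:
                return c0
            return c0 + (signal - s0) * (c1 - c0) / (s1 - s0)
    raise ValueError("signal outside the range of the standard curve")


# Placeholder standards: (concentration, signal), for illustration only.
standards = [(0, 0.05), (10, 0.25), (50, 1.05), (100, 2.05)]
print(interpolate_concentration(0.65, standards))
```

A signal that falls outside the range of the standards raises an error rather than being extrapolated, since the curve is established only for the known concentrations.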
Numerous immunoassay formats have been designed. ELISA or EIA can be quantitative for the detection of an analyte/biomarker. This method relies on attachment of a label to either the analyte or the antibody and the label component includes, either directly or indirectly, an enzyme. ELISA tests may be formatted for direct, indirect, competitive, or sandwich detection of the analyte. Other methods rely on labels such as, for example, radioisotopes (e.g., iodine-125) or fluorescence. Additional techniques include, for example, agglutination, nephelometry, turbidimetry, Western blot, immunoprecipitation, immunocytochemistry, immunohistochemistry, flow cytometry, Luminex assay, and others (see ImmunoAssay: A Practical Guide, edited by Brian Law, published by Taylor & Francis, Ltd., 2005 edition).
Exemplary assay formats include enzyme-linked immunosorbent assay (ELISA), radioimmunoassay, fluorescent, chemiluminescence, and fluorescence resonance energy transfer (FRET) or time resolved-FRET (TR-FRET) immunoassays. Examples of procedures for detecting biomarkers include biomarker immunoprecipitation followed by quantitative methods that allow size and peptide level discrimination, such as gel electrophoresis, capillary electrophoresis, planar electrochromatography, and the like.
Methods of detecting and/or quantifying a detectable label or signal generating material depend on the nature of the label. The products of reactions catalyzed by appropriate enzymes (where the detectable label is an enzyme; see above) can be, without limitation, fluorescent, luminescent, or radioactive or they may absorb visible or ultraviolet light. Examples of detectors suitable for detecting such detectable labels include, without limitation, x-ray film, radioactivity counters, scintillation counters, spectrophotometers, colorimeters, fluorometers, luminometers, and densitometers.
Any of the methods for detection can be performed in any format that allows for any suitable preparation, processing, and analysis of the reactions. This can be, for example, in multi-well assay plates (e.g., 96 wells or 384 wells) or using any suitable array or microarray. Stock solutions for various agents can be made manually or robotically, and all subsequent pipetting, diluting, mixing, distribution, washing, incubating, sample readout, data collection and analysis can be done robotically using commercially available analysis software, robotics, and detection instrumentation capable of detecting a detectable label.
Such applications are hybridization assays in which a nucleic acid array that displays “probe” nucleic acids for each of the genes to be assayed/profiled in the profile to be generated is employed. In these assays, a sample of target nucleic acids is first prepared from the initial nucleic acid sample being assayed, where preparation may include labeling of the target nucleic acids with a label, e.g., a member of a signal producing system. Following target nucleic acid sample preparation, the sample is contacted with the array under hybridization conditions, whereby complexes are formed between target nucleic acids that are complementary to probe sequences attached to the array surface. The presence of hybridized complexes is then detected, either qualitatively or quantitatively. Specific hybridization technology which may be practiced to generate the expression profiles employed in the subject methods includes the technology described in U.S. Pat. Nos. 5,143,854; 5,288,644; 5,324,633; 5,432,049; 5,470,710; 5,492,806; 5,503,980; 5,510,270; 5,525,464; 5,547,839; 5,580,732; 5,661,028; 5,800,992; the disclosures of which are herein incorporated by reference; as well as WO 95/21265; WO 96/31622; WO 97/10365; WO 97/27317; EP 373 203; and EP 785 280. In these methods, an array of “probe” nucleic acids that includes a probe for each of the biomarkers whose expression is being assayed is contacted with target nucleic acids as described above. Contact is carried out under hybridization conditions, e.g., stringent hybridization conditions as described above, and unbound nucleic acid is then removed. The resultant pattern of hybridized nucleic acids provides information regarding expression for each of the biomarkers that have been probed, where the expression information is in terms of whether or not the gene is expressed and, typically, at what level, where the expression data, i.e., expression profile, may be both qualitative and quantitative.
Optimal hybridization conditions will depend on the length (e.g., oligomer vs. polynucleotide greater than 200 bases) and type (e.g., RNA, DNA, PNA) of labeled probe and immobilized polynucleotide or oligonucleotide. General parameters for specific (i.e., stringent) hybridization conditions for nucleic acids are described in Sambrook et al., supra, and in Ausubel et al., “Current Protocols in Molecular Biology”, Greene Publishing and Wiley-Interscience, NY (1987), which is incorporated in its entirety for all purposes. When the cDNA microarrays are used, typical hybridization conditions are hybridization in 5×SSC plus 0.2% SDS at 65° C. for 4 hours followed by washes at 25° C. in low stringency wash buffer (1×SSC plus 0.2% SDS) followed by 10 minutes at 25° C. in high stringency wash buffer (0.1×SSC plus 0.2% SDS) (see Schena et al., Proc. Natl. Acad. Sci. USA, Vol. 93, p. 10614 (1996)). Useful hybridization conditions are also provided in, e.g., Tijssen, “Hybridization With Nucleic Acid Probes”, Elsevier Science Publishers B.V. (1993) and Kricka, “Nonisotopic DNA Probe Techniques”, Academic Press, San Diego, Calif. (1992).
In certain embodiments, sequencing comprises high-throughput (formerly “next-generation”) technologies to generate sequencing reads. In DNA sequencing, a read is an inferred sequence of base pairs (or base pair probabilities) corresponding to all or part of a single DNA fragment. A typical sequencing experiment involves fragmentation of the genome into millions of molecules or generating complementary DNA (cDNA) fragments, which are size-selected and ligated to adapters. The set of fragments is referred to as a sequencing library, which is sequenced to produce a set of reads. Methods for constructing sequencing libraries are known in the art (see, e.g., Head et al., Library construction for next-generation sequencing: Overviews and challenges. Biotechniques. 2014; 56(2): 61-77; Trombetta, J. J., Gennert, D., Lu, D., Satija, R., Shalek, A. K. & Regev, A. Preparation of Single-Cell RNA-Seq Libraries for Next Generation Sequencing. Curr Protoc Mol Biol. 107, 4.22.1-4.22.17, doi:10.1002/0471142727.mb0422s107 (2014). PMCID:4338574). A “library” or “fragment library” may be a collection of nucleic acid molecules derived from one or more nucleic acid samples, in which fragments of nucleic acid have been modified, generally by incorporating terminal adapter sequences comprising one or more primer binding sites and identifiable sequence tags. In certain embodiments, the library members (e.g., genomic DNA, cDNA) may include sequencing adaptors that are compatible with use in, e.g., Illumina's reversible terminator method, long read nanopore sequencing, Roche's pyrosequencing method (454), Life Technologies' sequencing by ligation (the SOLiD platform) or Life Technologies' Ion Torrent platform. Examples of such methods are described in the following references: Margulies et al (Nature 2005 437: 376-80); Schneider and Dekker (Nat Biotechnol. 2012 Apr. 10; 30(4):326-8); Ronaghi et al (Analytical Biochemistry 1996 242: 84-9); Shendure et al (Science 2005 309: 1728-32); Imelfort et al (Brief Bioinform. 2009 10:609-18); Fox et al (Methods Mol. Biol. 2009; 553:79-108); Appleby et al (Methods Mol Biol. 2009; 513:19-39); and Morozova et al (Genomics. 2008 92:255-64), which are incorporated by reference for the general descriptions of the methods and the particular steps of the methods, including all starting products, reagents, and final products for each of the steps. In example embodiments, sequencing includes bulk RNA sequencing (RNA-seq).
In certain embodiments, the invention involves single cell RNA sequencing (see, e.g., Kalisky, T., Blainey, P. & Quake, S. R. Genomic Analysis at the Single-Cell Level. Annual review of genetics 45, 431-445, (2011); Kalisky, T. & Quake, S. R. Single-cell genomics. Nature Methods 8, 311-314 (2011); Islam, S. et al. Characterization of the single-cell transcriptional landscape by highly multiplex RNA-seq. Genome Research, (2011); Tang, F. et al. RNA-Seq analysis to capture the transcriptome landscape of a single cell. Nature Protocols 5, 516-535, (2010); Tang, F. et al. mRNA-Seq whole-transcriptome analysis of a single cell. Nature Methods 6, 377-382, (2009); Ramskold, D. et al. Full-length mRNA-Seq from single-cell levels of RNA and individual circulating tumor cells. Nature Biotechnology 30, 777-782, (2012); and Hashimshony, T., Wagner, F., Sher, N. & Yanai, I. CEL-Seq: Single-Cell RNA-Seq by Multiplexed Linear Amplification. Cell Reports, Volume 2, Issue 3, p666-673, 2012).
In certain embodiments, the invention involves plate based single cell RNA sequencing (see, e.g., Picelli, S. et al., 2014, “Full-length RNA-seq from single cells using Smart-seq2” Nature protocols 9, 171-181, doi:10.1038/nprot.2014.006).
In certain embodiments, the invention involves high-throughput single-cell RNA-seq. In this regard reference is made to Macosko et al., 2015, “Highly Parallel Genome-wide Expression Profiling of Individual Cells Using Nanoliter Droplets” Cell 161, 1202-1214; International patent application number PCT/US2015/049178, published as WO2016/040476 on Mar. 17, 2016; Klein et al., 2015, “Droplet Barcoding for Single-Cell Transcriptomics Applied to Embryonic Stem Cells” Cell 161, 1187-1201; International patent application number PCT/US2016/027734, published as WO2016168584A1 on Oct. 20, 2016; Zheng, et al., 2016, “Haplotyping germline and cancer genomes with high-throughput linked-read sequencing” Nature Biotechnology 34, 303-311; Zheng, et al., 2017, “Massively parallel digital transcriptional profiling of single cells” Nat. Commun. 8, 14049 doi: 10.1038/ncomms14049; International patent publication number WO2014210353A2; Zilionis, et al., 2017, “Single-cell barcoding and sequencing using droplet microfluidics” Nat Protoc. January; 12(1):44-73; Cao et al., 2017, “Comprehensive single cell transcriptional profiling of a multicellular organism by combinatorial indexing” bioRxiv preprint first posted online Feb. 2, 2017, doi: dx.doi.org/10.1101/104844; Rosenberg et al., 2017, “Scaling single cell transcriptomics through split pool barcoding” bioRxiv preprint first posted online Feb. 2, 2017, doi: dx.doi.org/10.1101/105163; Rosenberg et al., “Single-cell profiling of the developing mouse brain and spinal cord with split-pool barcoding” Science 15 Mar. 2018; Vitak, et al., “Sequencing thousands of single-cell genomes with combinatorial indexing” Nature Methods, 14(3):302-308, 2017; Cao, et al., Comprehensive single-cell transcriptional profiling of a multicellular organism. Science, 357(6352):661-667, 2017; Gierahn et al., “Seq-Well: portable, low-cost RNA sequencing of single cells at high throughput” Nature Methods 14, 395-398 (2017); and Hughes, et al., “Highly Efficient, Massively-Parallel Single-Cell RNA-Seq Reveals Cellular States and Molecular Features of Human Skin Pathology” bioRxiv 689273; doi: doi.org/10.1101/689273, all the contents and disclosure of each of which are herein incorporated by reference in their entirety.
In certain embodiments, the invention involves single nucleus RNA sequencing. In this regard reference is made to Swiech et al., 2014, “In vivo interrogation of gene function in the mammalian brain using CRISPR-Cas9” Nature Biotechnology Vol. 33, pp. 102-106; Habib et al., 2016, “Div-Seq: Single-nucleus RNA-Seq reveals dynamics of rare adult newborn neurons” Science, Vol. 353, Issue 6302, pp. 925-928; Habib et al., 2017, “Massively parallel single-nucleus RNA-seq with DroNc-seq” Nat Methods. 2017 October; 14(10):955-958; International Patent Application No. PCT/US2016/059239, published as WO2017164936 on Sep. 28, 2017; International Patent Application No. PCT/US2018/060860, published as WO/2019/094984 on May 16, 2019; International Patent Application No. PCT/US2019/055894, published as WO/2020/077236 on Apr. 16, 2020; and Drokhlyansky, et al., “The enteric nervous system of the human and mouse colon at a single-cell resolution,” bioRxiv 746743; doi: doi.org/10.1101/746743, which are herein incorporated by reference in their entirety.
Biomarker detection may also be evaluated using mass spectrometry methods. A variety of configurations of mass spectrometers can be used to detect biomarker values. Several types of mass spectrometers are available or can be produced with various configurations. In general, a mass spectrometer has the following major components: a sample inlet, an ion source, a mass analyzer, a detector, a vacuum system, an instrument-control system, and a data system. Differences in the sample inlet, ion source, and mass analyzer generally define the type of instrument and its capabilities. For example, an inlet can be a capillary-column liquid chromatography source or can be a direct probe or stage such as used in matrix-assisted laser desorption. Common ion sources are, for example, electrospray, including nanospray and microspray, or matrix-assisted laser desorption. Common mass analyzers include a quadrupole mass filter, ion trap mass analyzer and time-of-flight mass analyzer. Additional mass spectrometry methods are well known in the art (see Burlingame et al., Anal. Chem. 70:647R-716R (1998); Kinter and Sherman, New York (2000)).
Protein biomarkers and biomarker values can be detected and measured by any of the following: electrospray ionization mass spectrometry (ESI-MS), ESI-MS/MS, ESI-MS/(MS)n, matrix-assisted laser desorption ionization time-of-flight mass spectrometry (MALDI-TOF-MS), surface-enhanced laser desorption/ionization time-of-flight mass spectrometry (SELDI-TOF-MS), desorption/ionization on silicon (DIOS), secondary ion mass spectrometry (SIMS), quadrupole time-of-flight (Q-TOF), tandem time-of-flight (TOF/TOF) technology, called ultraflex III TOF/TOF, atmospheric pressure chemical ionization mass spectrometry (APCI-MS), APCI-MS/MS, APCI-(MS)n, atmospheric pressure photoionization mass spectrometry (APPI-MS), APPI-MS/MS, and APPI-(MS)n, quadrupole mass spectrometry, Fourier transform mass spectrometry (FTMS), quantitative mass spectrometry, and ion trap mass spectrometry.
Sample preparation strategies are used to label and enrich samples before mass spectrometric characterization of protein biomarkers and determination of biomarker values. Labeling methods include but are not limited to isobaric tag for relative and absolute quantitation (iTRAQ) and stable isotope labeling with amino acids in cell culture (SILAC). Capture reagents used to selectively enrich samples for candidate biomarker proteins prior to mass spectrometric analysis include but are not limited to aptamers, antibodies, nucleic acid probes, chimeras, small molecules, an F(ab′)2 fragment, a single chain antibody fragment, an Fv fragment, a single chain Fv fragment, a nucleic acid, a lectin, a ligand-binding receptor, affibodies, nanobodies, ankyrins, domain antibodies, alternative antibody scaffolds (e.g., diabodies, etc.), imprinted polymers, avimers, peptidomimetics, peptoids, peptide nucleic acids, threose nucleic acid, a hormone receptor, a cytokine receptor, and synthetic receptors, and modifications and fragments of these.
In one example embodiment, a method of treatment comprises stratifying subjects suffering from IBD into risk groups as described herein and further comprising selecting a treatment, wherein if the subject is in the NOA group, then treating the subject with a treatment that does not comprise anti-TNF-blockade; if the subject is in the FR group, then treating the subject with a treatment comprising anti-TNF-blockade; and if the subject is in the PR group, then treating the subject with a treatment comprising anti-TNF-blockade and/or an additional treatment. In one example embodiment, the method for stratifying subjects suffering from IBD into risk groups comprises detecting in a sample obtained from a subject the frequency of one or more T cell/Natural Killer/Innate Lymphoid cell (T/NK/ILC), myeloid and/or epithelial cell subsets selected from Table 1, and determining if the subject is in a well-controlled without anti-TNF-blockade (NOA) risk group, an anti-TNF-blockade full responder (FR) risk group, or anti-TNF-blockade partial responder (PR) risk group by comparing the frequency of the detected cell subsets to a control frequency for the subject along a trajectory of disease severity from NOA, to FR, to PR. In one example embodiment, the method for stratifying subjects suffering from IBD into risk groups comprises detecting in a sample obtained from a subject one or more signature genes or a gene signature selected from Table 2 or Table 14.
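The treatment-selection logic recited above may be sketched as follows. The mapping from risk group to treatment follows the text; the `nearest_group` helper, which assigns a subject to the group whose reference frequency is closest to the detected frequency, is a hypothetical simplification of the comparison step and is not the disclosed classification method. The reference frequencies in the usage example are placeholders and are not taken from Table 1.

```python
from enum import Enum


class RiskGroup(Enum):
    NOA = "well-controlled without anti-TNF-blockade"
    FR = "anti-TNF-blockade full responder"
    PR = "anti-TNF-blockade partial responder"


def select_treatment(group: RiskGroup) -> str:
    """Map a risk group to the treatment category described in the text."""
    if group is RiskGroup.NOA:
        return "treatment that does not comprise anti-TNF-blockade"
    if group is RiskGroup.FR:
        return "treatment comprising anti-TNF-blockade"
    if group is RiskGroup.PR:
        return "treatment comprising anti-TNF-blockade and/or an additional treatment"
    raise ValueError(f"unknown risk group: {group!r}")


def nearest_group(frequency: float, reference_frequencies: dict) -> RiskGroup:
    """Hypothetical helper: pick the group with the closest reference frequency."""
    return min(
        reference_frequencies,
        key=lambda g: abs(frequency - reference_frequencies[g]),
    )


# Placeholder reference frequencies for one cell subset (illustrative only).
references = {RiskGroup.NOA: 0.02, RiskGroup.FR: 0.05, RiskGroup.PR: 0.12}
group = nearest_group(0.10, references)
print(group.name, "->", select_treatment(group))
```

In practice the assignment would rest on the cell subset frequencies or gene signatures identified herein rather than on a single nearest-value comparison.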
There is currently no cure for Crohn's disease, and there is no single treatment that works for all subjects. In certain embodiments, the methods of the present invention are used to select any treatment within the current standard of care and provide for less toxicity and improved treatment. In preferred embodiments, the treatment selected is anti-TNF blockade. The term “standard of care” as used herein refers to the current treatment that is accepted by medical experts as a proper treatment for a certain type of disease and that is widely used by healthcare professionals. Standard of care is also called best practice, standard medical care, and standard therapy. In example embodiments, the present invention provides improved treatment selection, for example, PCDAI (Pediatric Crohn's Disease Activity Index) (see, e.g., Zubin G, Peter L. Predicting Endoscopic Crohn's Disease Activity Before and After Induction Therapy in Children: A Comprehensive Assessment of PCDAI, CRP, and Fecal Calprotectin. Inflamm Bowel Dis. 2015; 21(6):1386-1391).
As used herein, “treatment” or “treating,” or “palliating” or “ameliorating” are used interchangeably. These terms refer to an approach for obtaining beneficial or desired results including but not limited to a therapeutic benefit and/or a prophylactic benefit. By therapeutic benefit is meant any therapeutically relevant improvement in or effect on one or more diseases, conditions, or symptoms under treatment. For prophylactic benefit, the compositions may be administered to a subject at risk of developing a particular disease, condition, or symptom, or to a subject reporting one or more of the physiological symptoms of a disease, even though the disease, condition, or symptom may not have yet been manifested. As used herein “treating” includes ameliorating, curing, preventing it from becoming worse, slowing the rate of progression, or preventing the disorder from re-occurring (i.e., to prevent a relapse).
In certain embodiments, the therapeutic agents are administered in an effective amount or therapeutically effective amount. The term “effective amount” or “therapeutically effective amount” refers to the amount of an agent that is sufficient to effect beneficial or desired results. The therapeutically effective amount may vary depending upon one or more of: the subject and disease condition being treated, the weight and age of the subject, the severity of the disease condition, the manner of administration and the like, which can readily be determined by one of ordinary skill in the art. The term also applies to a dose that will provide an image for detection by any one of the imaging methods described herein. The specific dose may vary depending on one or more of: the particular agent chosen, the dosing regimen to be followed, whether it is administered in combination with other compounds, timing of administration, the tissue to be imaged, and the physical delivery system in which it is carried.
In certain embodiments, IBD is treated by selecting subjects who will benefit from anti-TNF blockade. Inflammatory bowel disease (IBD) is a chronic disabling inflammatory process that affects mainly the gastrointestinal tract and may present associated extraintestinal manifestations (see, e.g., Catalan-Serra I, Brenna Ø. Immunotherapy in inflammatory bowel disease: Novel and emerging treatments. Hum Vaccin Immunother. 2018; 14(11):2597-2611). IBD includes both ulcerative colitis (UC) and Crohn's disease (CD). Id. Current pharmacological treatments used in clinical practice like thiopurines or anti-TNF are effective but can produce significant side effects and their efficacy may diminish over time. Id. The current treatment of IBD includes mesalazine (oral and rectal formulations), glucocorticoids (conventional and other forms like budesonide or beclomethasone), antibiotics (typically ciprofloxacin and metronidazole), immunosuppressants (mostly azathioprine/6-mercaptopurine or methotrexate) and anti-TNF agents (infliximab, adalimumab, certolizumab pegol and golimumab). Recently, the anti-integrin antibody vedolizumab and the antibody against IL-12/23 ustekinumab have been approved for IBD. Id. Corticosteroids may be used for short-term (three to four months) symptom improvement and to induce remission. Corticosteroids may also be used in combination with an immune system suppressor. Azathioprine (Azasan, Imuran) and mercaptopurine (Purinethol, Purixan) are the most widely used immunosuppressants for treatment of inflammatory bowel disease. Taking them requires follow-up to look for side effects, such as a lowered resistance to infection and inflammation of the liver. Methotrexate (Trexall) is sometimes used for people with Crohn's disease who don't respond well to other medications.
In certain embodiments, selecting subjects that are responsive can be used to avoid producing significant side effects in subjects that will not benefit from the treatment. In certain embodiments, an alternative treatment is administered to non-responsive subjects such that side effects are diminished. In certain embodiments, a drug is administered to shift a subject to be responsive.
The present invention also contemplates use of tumor necrosis factor (TNF) inhibitors for treatment (e.g., anti-TNF blockade). In certain embodiments, the invention described herein is related to a method of treatment in which one or more TNF inhibitors are administered to a patient in need thereof, treatment which may be determined in whole or in part by the systems and methodologies described herein. In one embodiment, TNF-α inhibitor antibodies, or antigen binding fragments thereof, are contemplated for use. In an aspect, the TNF inhibitor is an immunosuppressive medication. In an embodiment, the TNF inhibitor is a monoclonal antibody. In particular embodiments, the TNF inhibitor binds to soluble forms of TNF-alpha, the transmembrane form of TNF-alpha, or both forms of TNF-alpha. In one example embodiment, the TNF inhibitor is adalimumab or a biosimilar thereof. The TNF inhibitor may comprise a chimeric antibody, such as infliximab or a biosimilar thereof, which comprises the TNF alpha trimer, a variable murine binding site for TNF-alpha and an Fc constant region. In an embodiment, the anti-TNF antibody is certolizumab pegol or golimumab or a biosimilar thereof. In an aspect, the inhibitor may comprise enhancing soluble TNF receptor 2, a receptor that binds to TNF-alpha by either delivery of a fusion protein or by the upregulation of TNF receptor 2 expression. Thus, in an embodiment, the TNF inhibitor is etanercept, a circulating TNF receptor-IgG fusion protein that binds to TNF-alpha. Administration of treatments etanercept, adalimumab, certolizumab and golimumab may be subcutaneous. Administration of infliximab and golimumab may be intravenous.
Small molecules such as thalidomide, lenalidomide and pomalidomide may also be used for treatment. Additionally, oral pentoxifylline or bupropion have also been used as TNF-alpha inhibitor treatment. See, e.g., Brustolim D, Ribeiro-dos-Santos R, Kast R E, Altschuler E L, Soares M B. Int. Immunopharmacol. 6 (6): 903-7. doi:10.1016/j.intimp.2005.12.007 (June 2006) (bupropion lowers production of TNF-alpha in mice). In an aspect, 5-HT2A receptor agonists such as (R)-DOI, N,N-Dimethyltryptamine, paliperidone, APD791, YKP-1358, lurasidone, lisuride, methysergide, lorcaserin and other agonists known in the art may be utilized for treatment. See, e.g., Yu et al., “Serotonin 5-Hydroxytryptamine2A Receptor Activation Suppresses Tumor Necrosis Factor-α-Induced Inflammation with Extraordinary Potency,” J. Pharm and Exp Ther. Nov. 2008, 327(2) 316-323; doi: 10.1124/jpet.108.143461. Additionally, activation of 5-HT2A receptors via genome editing may also be utilized for inhibition of TNF-alpha.
TNFR1 and/or TNFR2 receptors of TNF-alpha may be targeted for inhibition of TNF-alpha. In an example embodiment, CRISPR based systems may be used for the repression or activation of inflammatory cytokine cell receptor TNFR1 and/or anti-inflammatory and antiapoptotic interactions at TNFR2 receptors of TNF-alpha. See, Farhang et al., Tissue Eng Part A. 2017 Aug. 1; 23(15-16): 738-749, doi: 10.1089/ten.tea.2016.0441. Inhibition of the activation of the extracellular signal-regulated kinase may also be a target for RNAi or CRISPR related treatments or small molecule administration. In one embodiment, gliovirin, an epipolythiodiketopiperazine that suppresses TNF-alpha synthesis by inhibiting the activation of extracellular signal-regulated kinase (ERK) may be utilized. See, Rether et al., Biol Chem. 2007 Jun; 388(6):627-37 doi: 10.1515/BC.2007.066. Knockdown of TNF-alpha by DNAzyme gold nanoparticles is also contemplated for use as treatment, with local injection being one approach for treatment with DNA-zyme-conjugated particles. See, e.g. Somasuntharam et al., Biomaterials. 2016 March; 83:12-22. doi: 10.1016/j.biomaterials.2015.12.02.
In example embodiments, subjects that are not fully responsive to TNF inhibitors are treated with additional treatments specific to those subjects. In example embodiments, the additional treatments target cell subsets enriched in frequency in subjects that are partial responders. In example embodiments, the additional treatments target genes or pathways differentially expressed in cell subsets in subjects that are partial responders. In example embodiments, the additional treatments are administered in combination with TNF inhibitors. In example embodiments, additional treatments include CD40L-blocking antibodies, IL-22 agonists, agents blocking inflammatory cytokines, such as IL-1, targeted anti-proliferation agents, and anti-GM-CSF antibodies (Betts et al., 2017; Lindemans et al., 2015; Miura et al., 2021; Ramanujam et al., 2020; Sootome et al., 2020; Ai et al., 2021; Aschenbrenner et al., 2021; Castro-Dopico et al., 2020; Mehta et al., 2020; Mitsialis et al., 2020; Muro and Mrowiec, 2015). In other example embodiments, any standard of care treatment discussed above can be used as an additional treatment. In example embodiments, one or more of the additional treatments are administered in combination with a standard treatment. The combinations may provide for enhanced or otherwise previously unknown activity in the treatment of disease. In certain embodiments, targeting the combination may require less of the standard agent as compared to the current standard of care and provide for less toxicity and improved treatment.
Non-limiting examples of CD40L inhibitors include toralizumab/IDEC-131 (see, e.g., Fadul C E, Mao-Draayer Y, Ryan K A, et al. Safety and Immune Effects of Blocking CD40 Ligand in Multiple Sclerosis. Neurol Neuroimmunol Neuroinflamm. 2021; 8(6):e1096) and CDP7657 (see, e.g., Shock A, Burkly L, Wakefield I, et al. CDP7657, an anti-CD40L antibody lacking an Fc domain, inhibits CD40L-dependent immune responses without thrombotic complications: an in vivo study. Arthritis Res Ther. 2015; 17(1):234).
Non-limiting examples of IL-22 agonists include an IL-22 polypeptide, an IL-22 Fc fusion protein, an IL-22 agonist, an IL-19 polypeptide, an IL-19 Fc fusion protein, an IL-19 agonist, an IL-20 polypeptide, an IL-20 Fc fusion protein, an IL-20 agonist, an IL-24 polypeptide, an IL-24 Fc fusion protein, an IL-24 agonist, an IL-26 polypeptide, an IL-26 Fc fusion protein, an IL-26 agonist, an IL-22R1, an antibody that binds IL-22BP and blocks or inhibits binding of IL-22BP to IL-22, and TLR7 agonists (see, e.g., U.S. Pat. No. 11,155,591 B2; US Patent Application US20210338778A1; Wang Q, Kim S Y, Matsushita H, et al. Oral administration of PEGylated TLR7 ligand ameliorates alcohol-associated liver disease via the induction of IL-22. Proc Natl Acad Sci USA. 2021; 118(1):e2020868118).
Non-limiting examples of anti-GM-CSF antibodies include Gimsilumab, lenzilumab, namilumab, and otilimab, which target GM-CSF directly, neutralizing the biological function of GM-CSF by blocking the interaction of GM-CSF with its cell surface receptor (see, e.g., Mehta P, Porter J C, Manson J J, et al. Therapeutic blockade of granulocyte macrophage colony-stimulating factor in COVID-19-associated hyperinflammation: challenges and opportunities. Lancet Respir Med. 2020; 8(8):822-830; Lang F M, Lee K M, Teijaro J R, Becher B, Hamilton J A. GM-CSF-based treatments in COVID-19: reconciling opposing therapeutic approaches. Nat Rev Immunol. 2020; 20(8):507-514; and Temesgen Z, Assi M, Shweta F N U, et al. GM-CSF Neutralization with lenzilumab in severe COVID-19 pneumonia: a case-cohort study. Mayo Clin Proc. 2020; 95(11):2382-2394). Non-limiting examples of anti-GM-CSF antibodies also include Mavrilimumab, which targets the alpha subunit of the GM-CSF receptor, blocking intracellular signaling of GM-CSF (see, e.g., Lang F M, Lee K M, Teijaro J R, Becher B, Hamilton J A. GM-CSF-based treatments in COVID-19: reconciling opposing therapeutic approaches. Nat Rev Immunol. 2020; 20(8):507-514; and Burmester G R, Feist E, Sleeman M A, Wang B, White B, Magrini F. Mavrilimumab, a human monoclonal antibody targeting GM-CSF receptor-alpha, in subjects with rheumatoid arthritis: a randomised, double-blind, placebo-controlled, Phase I, first-in-human study. Ann Rheum Dis. 2011; 70(9):1542-1549).
In certain embodiments, the cell subset frequency and/or differential cell states can be detected for screening novel therapeutic agents. In certain embodiments, the present invention can be used to identify improved treatments by monitoring the identified cell states in a subject undergoing an experimental treatment. In certain embodiments, an animal model is used to detect shifts in the identified cell states to identify agents capable of shifting a subject from a PR to FR or NOA.
In certain embodiments, the cell states identified herein are detected in a mouse model of an inflammatory disease. Exemplary IBD mouse models include those which are chemically induced, those which are achieved by adoptive transfer of T cell subsets, and those that develop spontaneously in genetically modified mice. Examples include acute and chronic dextran sulfate sodium (DSS)-induced colitis mouse models, the poly I:C-induced intestinal inflammation model, the trinitrobenzene sulfonic acid (TNBS)-induced colitis mouse model, adoptive transfer of CD4+CD45RBhigh T cells, and IL-10 KO mice (see, e.g., Boismenu R, Chen Y. Insights from mouse models of colitis. J Leukoc Biol. 2000 March; 67(3):267-78, Table 2).
In certain embodiments, candidate agents are screened. The term “agent” broadly encompasses any condition, substance or agent capable of modulating one or more phenotypic aspects of a cell or cell population as disclosed herein. Such conditions, substances or agents may be of physical, chemical, biochemical and/or biological nature. The term “candidate agent” refers to any condition, substance or agent that is being examined for the ability to modulate one or more phenotypic aspects of a cell or cell population as disclosed herein in a method comprising applying the candidate agent to the cell or cell population (e.g., exposing the cell or cell population to the candidate agent or contacting the cell or cell population with the candidate agent) and observing whether the desired modulation takes place.
Agents may include any potential class of biologically active conditions, substances or agents, such as for instance antibodies, proteins, peptides, nucleic acids, oligonucleotides, small molecules, or combinations thereof, as described herein.
The terms “therapeutic agent”, “therapeutic capable agent” or “treatment agent” are used interchangeably and refer to a molecule or compound that confers some beneficial effect upon administration to a subject. The beneficial effect includes enablement of diagnostic determinations; amelioration of a disease, symptom, disorder, or pathological condition; reducing or preventing the onset of a disease, symptom, disorder or condition; and generally counteracting a disease, symptom, disorder or pathological condition.
In certain embodiments, the present invention provides for gene signature screening to identify agents that shift expression of the gene targets described herein (e.g., cell subset markers and differentially expressed genes). The concept of signature screening was introduced by Stegmaier et al. (Gene expression-based high-throughput screening (GE-HTS) and application to leukemia differentiation. Nature Genet. 36, 257-263 (2004)), who realized that if a gene-expression signature was the proxy for a phenotype of interest, it could be used to find small molecules that effect that phenotype without knowledge of a validated drug target. The signatures or biological programs of the present invention may be used to screen for drugs that reduce the signature or biological program in cells as described herein.
The Connectivity Map (cmap) is a collection of genome-wide transcriptional expression data from cultured human cells treated with bioactive small molecules and simple pattern-matching algorithms that together enable the discovery of functional connections between drugs, genes and diseases through the transitory feature of common gene-expression changes (see, Lamb et al., The Connectivity Map: Using Gene-Expression Signatures to Connect Small Molecules, Genes, and Disease. Science 29 Sep. 2006: Vol. 313, Issue 5795, pp. 1929-1935, DOI: 10.1126/science.1132939; and Lamb, J., The Connectivity Map: a new tool for biomedical research. Nature Reviews Cancer January 2007: Vol. 7, pp. 54-60). In certain embodiments, Cmap can be used to identify small molecules capable of modulating a signature or biological program of the present invention in silico.
Further embodiments are illustrated in the following Examples which are given for illustrative purposes only and are not intended to limit the scope of the invention.
To Applicants' knowledge, all present comprehensive scRNA-seq atlases of inflammatory disease conditions consist of patients being treated with a variety of agents, and the biopsies included in these studies often reflect a partial treatment-refractory state to combinations of antibiotics, 5-ASA, corticosteroids, and anti-TNF mAbs. A treatment-naïve single-cell atlas in any inflammatory disease condition has yet to be reported. In order to address this unmet need and generate a comprehensive cellular atlas from treatment-naïve pediCD compared to uninflamed age-matched controls, Applicants created the prospective PREDICT study (Clinicaltrials.gov #NCT03369353) to help identify, profile, and understand pediatric IBD and FGID. Here, Applicants present detailed diagnostic and treatment data from the first cohort of 27 patients enrolled on PREDICT, including 14 pediCD and 13 FGID patients, together with flow cytometric and scRNA-seq studies of the cellular composition of the terminal ileum (
Applicants collected terminal ileum biopsies from 13 FGID patients and from 14 pediCD patients, and prepared single cell suspensions for flow cytometry and scRNA-seq. Biopsies from pediCD were from inflamed areas adjacent to active ulcerations. Biopsies from FGID were also taken. The epithelium was first separated from the lamina propria before enzymatic dissociation, and flow cytometric analysis was performed on the remaining viable single-cell fraction which recovered predominantly hematopoietic cells with some remnant epithelial cells (<20% of all cells), likely representing those in deeper crypt regions (
In addition to flow cytometry, Applicants performed droplet-based scRNA-seq on cell suspensions from the 14 pediCD/13 FGID patient cohort using the 10× Genomics V2 3′ platform (
Following library preparation and sequencing, Applicants derived a unified cells-by-genes expression matrix from the 27 samples, containing digital gene expression values for all cells passing quality thresholds (n=254,911 cells). Applicants then performed dimensionality reduction and graph-based clustering, noting that despite no integration methods being used, FGID and CD were essentially indistinguishable from each other when visualized on a uniform manifold approximation and projection (UMAP) plot. Applicants recovered the following cell types from both patient groups: epithelial cells, T cells, B cells, plasma cells, glial cells, endothelial cells, myeloid cells, mast cells, fibroblasts, and a proliferating cluster. Applicants noted that the fractional composition amongst all cells of T cells, B cells, and myeloid cells was not significantly different between FGID and pediCD, similar to the flow cytometric data, and this was also the case for endothelial, epithelial, fibroblasts, glial, mast and plasma cells, which were not measured through flow cytometry. This provided validation and extension of the flow cytometry data that the broad cell type composition of FGID and pediCD is not significantly altered, despite highly distinct clinical diseases.
Applicants then systematically re-clustered each broad cell type, identifying increasing heterogeneity within each type. Given that Applicants detected changes in the frequency of HLA-DR+ macrophages/dendritic cells and pDCs by flow cytometry, Applicants initially focused on the myeloid cell type sub-clustering, containing dendritic cells, macrophages, monocytes, and pDCs. However, it soon became evident that this traditional clustering approach raised several challenges with identifying the boundaries of clusters, and whether a cluster composed primarily of pediCD cells represented a unique cell subset, or a cell state overlaid onto a core cell subset gene expression program (Methods). This would influence whether a comparison would primarily focus on differential expression testing or differential composition testing. It also raised the possibility that this joint clustering approach, informed by the inclusion of both FGID and pediCD cell types, subsets and states could muddle some of the unique biology of FGID and pediCD. This could lead to clusters, and correspondingly critical gene-reference lists for each cluster, that may not accurately represent that cell type, subset, or state, as the cluster is representative of a hybrid informed by cells from an FGID and pediCD intestine.
In order to approach this challenge from a more principled direction, Applicants made four key changes to the analytical workflow: 1. Applicants proceeded to analyze FGID and pediCD samples separately to define cell type, subset, and state clusters and markers, 2. implemented an automated iterative tiered clustering (ITC) approach to optimize the silhouette score at each tier of iterative sub-clustering and stop when a specific granularity is reached, 3. accounted for the diversity of patients which compose that cluster using Simpson's Index of Diversity, and 4. generated and optimized a Random Forest classifier to identify correspondence between the resultant FGID and pediCD atlases (Methods). Using this approach, each tier of analysis is typically under-clustered relative to traditional empirical analyses, but the automation proceeds through several more tiers (typically 6 to 7) until stop conditions (e.g. cell numbers and differentially expressed genes, see Methods) are met. Applicants then inspected all outputs (FGID and pediCD clusters) and provided descriptive cell cluster names independently for FGID and pediCD. Applicants also focused at this stage on flagging putative doublet clusters or clusters where the majority of differentially-expressed genes which triggered further clustering consist of known technical confounders in scRNA-seq data (e.g. mitochondrial, ribosomal, and spillover genes from cells with high secretory capacity) but did not remove them, as end users of this resource are likely to encounter these clusters and may be interested in their prospective identification.
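By way of non-limiting illustration, the patient-diversity measure referred to above can be sketched as follows. The sketch assumes the common form of Simpson's Index of Diversity, 1 − Σ p_i², where p_i is the fraction of a cluster's cells contributed by patient i; the function name and the example patient labels are hypothetical and are not taken from the Methods.

```python
from collections import Counter


def simpsons_index_of_diversity(patient_labels):
    """Return 1 - sum(p_i^2), where p_i is the fraction of the cluster's
    cells that come from patient i (assumed form of the index)."""
    counts = Counter(patient_labels)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("cluster contains no cells")
    return 1.0 - sum((n / total) ** 2 for n in counts.values())


# Hypothetical clusters: one drawn from a single patient, one evenly
# drawn from four patients.
single_patient_cluster = ["p1"] * 10
mixed_cluster = ["p1", "p2", "p3", "p4"] * 5

print(simpsons_index_of_diversity(single_patient_cluster))  # 0.0
print(simpsons_index_of_diversity(mixed_cluster))           # 0.75
```

Under this assumed form, a value near 0 indicates a cluster dominated by one patient, while larger values indicate a cluster recovered across several patients.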
Applicants then hierarchically clustered all end cell state clusters in order to generate the final dendrograms for FGID and pediCD, and performed 1 vs. rest within-cell-type differential expression to provide systematic names for cells based on their cell type classification and two genes (Methods). As several cell types contained readily identifiable and meaningful cell subsets, Applicants utilized curation of literature-based markers to provide further guidance within each cell type. For example, within Tier 1 T Cells Applicants could identify T cells, NK cells and ILCs, within Tier 1 Myeloid cells, Applicants could identify monocytes, cDC1, cDC2, macrophages and pDCs, within Tier 1 B cells germinal center, germinal center dark zone and light zone cells, and within Tier 1 Endothelial cells Applicants could identify arterioles, capillaries, lymphatics, mural cells and venules, and so forth for other cell types. To illustrate this process for one cluster, upon automated hierarchical tiered clustering of T cells, Applicants identified a cluster that was Tier 0: pediCD, Tier 1: T cells, Tier 2: cytotoxic, Tier 3: IEL_FCER1G_NKG7_TYROBP_CD160_AREG. Upon inspection of CD3 genes (CD247, CD3D, etc.), TCR genes (TRAC, TRBC1, etc.), and NK cell genes (NCAM1, NCR1), it became readily apparent these cells were NK cells, and 1 vs. rest within-cell-type differential expression identified CCL3 and CD160 as two genes significantly enriched in this cluster (adj. p-value=0, expression within-cluster>40% cells positive and in other Tier 1 T cells<6%). This resulted in a final name for this cluster of CD.NK.CCL3.CD160. Applicants repeated this process for all FGID (183 end clusters) and pediCD (426 end clusters) within Tier 1 B cell, Endothelial, Epithelial, Fibroblast, Plasma Cell, Myeloid Cell, Mast Cell, and T Cell identified clusters, and provide systematically generated names for all, as well as 1 vs. rest within-cell type gene lists (Table 1 and Table 4).
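By way of non-limiting illustration, the naming convention described above (atlas prefix, curated subset label, and two enriched genes) can be sketched as follows. The ranking rule used here (adjusted p-value first, then the difference between within-cluster and out-of-cluster fraction of positive cells), the function name, and the numerical marker values are assumptions made for illustration only and may differ from the procedure in the Methods.

```python
def systematic_name(atlas_prefix, subset_label, markers):
    """Build a name of the form PREFIX.SUBSET.GENE1.GENE2.

    markers: list of dicts with keys 'gene', 'adj_p', 'pct_in', 'pct_out'
    from a 1 vs. rest within-cell-type differential expression test.
    The ranking rule below is a hypothetical choice.
    """
    ranked = sorted(
        markers,
        key=lambda m: (m["adj_p"], -(m["pct_in"] - m["pct_out"])),
    )
    top_two = [m["gene"] for m in ranked[:2]]
    return ".".join([atlas_prefix, subset_label] + top_two)


# Hypothetical marker table for the NK cluster discussed above.
nk_markers = [
    {"gene": "NKG7", "adj_p": 1e-5, "pct_in": 0.90, "pct_out": 0.60},
    {"gene": "CD160", "adj_p": 0.0, "pct_in": 0.42, "pct_out": 0.05},
    {"gene": "CCL3", "adj_p": 0.0, "pct_in": 0.45, "pct_out": 0.05},
]

print(systematic_name("CD", "NK", nk_markers))  # CD.NK.CCL3.CD160
```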
Using this analytical workflow, Applicants present two comprehensive cellular atlases of FGID (
From the 99,488 cells profiled from 13 FGID patients, Applicants recovered 12 Tier 1 clusters, which Applicants display on a t-distributed stochastic neighbor embedding (t-SNE) plot colored by cluster identity (
Within B cells, Applicants identified a strong division between non-cycling and cycling B cells, with those found in the cycling compartment readily identifiable by germinal center markers and further dark zone (AICDA) and light zone (CD83) genes resulting in FG.B/DZ.AICDA.IGKC, and FG.B/LZ.CD74.CD83 clusters (
Within Myeloid cells, Applicants identified, and confirmed using extensive inspection of literature curated markers, cell subsets corresponding to monocytes (CD14, FCGR3A, FCN1, S100A8, S100A9, etc.), macrophages (CSF1R, MERTK, MAF, C1QA, etc.), cDC1 (CLEC9A, XCR1, BATF3), cDC2 (FCER1A, CLEC10A, CD1C, IRF4 etc.), and pDCs (IL3RA, LILRA4, IRF7) (
Within T cells, Applicants followed a similar approach as utilized for Myeloid cells and identified principal cell subsets of T cells (joint expression of CD247, CD3D, CD3E, CD3G with TRAC, TRBC1, TRBC2, or TRGC1, TRGC2 and TRDC), and a combined cluster of cytotoxic cells (FG.T/NK/ILC.GNLY.TYROBP) likely including T cells, NK cells (lower expression of TCR-complex genes with NCAM1, NCR1 and TYROBP), and some ILCs (KIT, NCR2, RORC and low expression of CD3-complex genes) (
Within Epithelial cells, most cells expressed high levels of OLFM4, identifying them as crypt-localized cells. Applicants readily identified subsets of stem cells (LGR5), proliferating cells (TOP2A), goblet cells (SPINK4, ZG16, various MUCs), enteroendocrine cells (SCG3, ISL1), Paneth cells (ITLN2, PRSS2, LYZ), tuft cells (GNG13, SH2D6, TRPM5) and enterocytes (APOC3, APOA1, FABP6, etc.).
Within Endothelial cells, Applicants readily identified vascular and lymphatic endothelial cells (LYVE1, PROX1), with the vascular cells able to be further identified as capillaries (CA4) or venular endothelial cells (ACKR1, MADCAM1). Applicants also identified a subset of cells (FG.Endth/Peri.FRZB.NOTCH3) expressing high levels of FRZB and NOTCH3, which, rather than being arterioles, likely represent arteriole-associated pericytes or smooth muscle cells given the absence of EFNB2, SOX17, BMX, and HEY1, and the presence of ACTA2 and MYL9, as cluster-defining genes. Applicants highlight that the FG.Endth/Ven.ACKR1.MADCAM1 cluster is characterized by expression of markers for postcapillary venules specialized in leukocyte recruitment.
Within Fibroblasts, Applicants identified principal subsets characterized by their structural roles (COL3A1, ADAMDEC1, FBLN1, LUM, etc.), myofibroblasts (MYH11, ACTA2, ACTG2, etc.), and organization of lymphoid cells (CCL19, CCL21 etc.). Within the lymphoid-organizing fibroblasts, Applicants draw attention to the FG.Fibro.C3.FDCSP, FG.Fibro.CCL19.C3, and FG.Fibro.CCL21.CCL19 subsets, which appear to have some characteristics of follicular dendritic cells and variable expression of CCL19/CCL21 (T-cell or migratory dendritic cell chemoattractants) and CXCL13 (B-cell chemoattractant). Applicants also identified a separate Tier 1 cluster of Glial cells characterized by CRYAB and CLU. Intriguingly, within the Glial cell Tier 1 cluster, Applicants then recovered a cell subset expressing FDCSP, CXCL13, and CR2, a key complement receptor which allows for complement-bound antigens to be recycled and presented by follicular dendritic cells. This highlights the power of iterative tiered clustering to recover discrete cell states that might otherwise not be fully resolved during clustering, and that would then go unidentified while also altering the gene signatures of their larger parent cell cluster. In the hierarchical cluster tree, this FG.Glial/fDC.FDCSP.CXCL13 cluster then assorts within the lymphoid-organizing stromal cells.
The Mast cells recovered did not further sub-cluster in an automated fashion, and were largely marked by TPSB2 and TPSAB1 (>97%), with minimal CMA1 (<20%) expressing cells, suggesting they are largely classical MC-T cells in FGID intestine.
Applicants identified four Tier 1 clusters for Plasma cells, which are characterized by their strong expression of IGH* immunoglobulin heavy-chain genes together with either IGK* (kappa light chain) or IGL* (lambda light chain) genes. This resolved IgA IgK plasma cells, IgA IgL plasma cells, IgM plasma cells, and IgG plasma cells. Iterative tiered clustering identified further heterogeneity within all clusters of IgA and IgG plasma cells, though given the 3′-bias of this dataset, Applicants note that a principled investigation of these clusters would ideally use 5′ sequencing with targeted VDJ amplification.
Together, the treatment-naïve cell atlas from 13 FGID patients captures 118 cell clusters from a non-inflammatory state of pediatric ileum.
From the 124,054 cells profiled from 14 pediCD patients, Applicants recovered 12 Tier 1 clusters, which Applicants display here on a t-distributed stochastic neighbor embedding (t-SNE) plot colored by cluster identity, and which represent the main cellular lineages found in the epithelium and lamina propria of an ileal biopsy (
Within B cells, Applicants also identified a strong division between non-cycling and cycling B cells, with those found in the cycling compartment readily identifiable by germinal center markers and further dark zone (AICDA) and light zone (CD83) genes, as in FGID. Within cells expressing germinal center markers, a highly-proliferative branch including clusters such as CD.B/LZ.CCL22.NPW, CD.B/GC.MKI67.RRM2, and CD.B/DZ.HIST1H1B.MKI67 emerged (
Within Myeloid cells, Applicants identified, and confirmed using the same extensive inspection of literature curated markers as in FGID, cell subsets corresponding to monocytes (CD14, FCGR3A, FCN1, S100A8, S100A9, etc.), macrophages (CSF1R, MERTK, MAF, C1QA, etc.), cDC1 (CLEC9A, XCR1, BATF3), cDC2 (FCER1A, CLEC10A, CD1C, IRF4 etc.), and pDCs (IL3RA, LILRA4, IRF7) (
Within T cells, Applicants followed a similar approach as utilized for FGID T cells and identified cell subsets of T cells (joint expression of CD247, CD3D, CD3E, CD3G with TRAC, TRBC1, TRBC2, or TRGC1, TRGC2 and TRDC), but in pediCD also identified several discrete clusters of NK cells (lower expression of TCR-complex genes with FCGR3A or NCAM1, NCR1 and TYROBP), and ILCs (KIT, NCR2, RORC and low expression of CD3-complex genes) (
Within Epithelial cells, most cells expressed high levels of OLFM4 as well, identifying them as crypt-localized cells. Applicants readily identified subsets of stem cells (LGR5), proliferating cells (TOP2A), goblet cells (SPINK4, ZG16, various MUCs), enteroendocrine cells (SCG3, ISL1), Paneth cells (ITLN2, PRSS2, LYZ), tuft cells (GNG13, SH2D6, TRPM5) and enterocytes (APOC3, APOA1, FABP6, etc.). Amongst several clusters characterized by CCL25 and OLFM4 expression, Applicants identified a subset marked by LGR5 expression, characteristic of intestinal stem cells (CD.EpithStem.LINC00176.RPS4YA1). Applicants identified several subsets expressing CD24, indicative of crypt localization, with expression of REG1B (CD.Secretory.GSTA1.REG1B; CD.Secretory.REG1B.REG1A). Applicants also identified an early enterocyte cluster, CD.EC.ANPEP.DUOX2, characterized by FABP4 and ALDOB and expressing DUOX2 and MUC1. Applicants resolved several clusters of enteroendocrine cells, including CD.Enteroendocrine.TFPI2.TPH1 and CD.Enteroendocrine.NEUROG3.MLN. Applicants also found two clusters that Applicants labeled as M cells based on expression of SPIB (CD.Mcell.CCL23.SPIB; CD.MCell.CSRP2.SPIB). Paneth cells did not further sub-cluster despite forming an independent Tier 1 cluster (CD.Epith.Paneth). Most strikingly, Applicants identified a diversity of goblet cells recovered across multiple patients including CD.Goblet.HES6.COLCA2 expressing REG4 and LGALS9, and CD.Goblet.TFF1.TPSG1 expressing TFF1 and ITLN1, amongst others. Applicants also identified a cluster of tuft cells: CD.EC.GNAT3.TRPM5.
Within Endothelial cells, Applicants readily identified vascular and lymphatic endothelial cells (LYVE1, PROX1), with the vascular cells able to be further identified as capillaries (CA4) or venular endothelial cells (ACKR1, MADCAM1). Applicants also identified a subset of cells (FG.Endth/Peri.FRZB.NOTCH3) expressing high levels of FRZB and NOTCH3, which, rather than being arterioles, likely represent arteriole-associated pericytes or smooth muscle cells given the absence of EFNB2, SOX17, BMX, and HEY1, and the presence of ACTA2 and MYL9, as cluster-defining genes. In pediCD, Applicants also identified a cluster of arteriole endothelial cells, CD.Endth/Art.SEMA3G.SSUH2, identified by expression of HEY1, EFNB2, and SOX17. Applicants also highlight that the endothelial venules characterized by expression of markers for postcapillary venules specialized in leukocyte recruitment, such as CD.Endth/Ven.ADGRG6.ACKR1 and CD.Endth/Ven.POSTN.ACKR1, exhibited greater diversity than in FGID with multiple end cell clusters identified.
Within Fibroblasts, Applicants identified principal subsets characterized by their structural roles (COL3A1, ADAMDEC1, FBLN1, LUM, etc.), myofibroblasts (MYH11, ACTA2, ACTG2, etc.), and organization of lymphoid cells (CCL19, CCL21 etc.). The principal hierarchy in fibroblasts in pediCD was between FRZB-, EDNRB- and F3-expressing subsets such as CD.Fibro.LY6H.PAPPA2 and CD.Fibro.AGT.F3, which were also enriched for CTGF and MMP1 expression, and ADAMDEC1-expressing fibroblasts, which were enriched for several chemokines such as CXCL12, and in some specific clusters CXCL6, CXCL1, CCL11, and other chemokines. Amongst three fibroblast subsets marked by C3 expression, Applicants identified follicular dendritic cells (CD.Fibro/fDC.FDCSP.CXCL13), along with fibroblasts expressing CCL21, CCL19, and the interferon-stimulated chemokines CXCL9 and CXCL10 (CD.Fibro.CCL21.CCL19; CD.Fibro.TNFSF11.CD24). Distinct from the FGID atlas, within the pediCD atlas, glial cells clustered within fibroblasts, but were also marked by S100B, PLP1 and SPP1 expression. Applicants note that many fibroblasts were found with T cells, generating extensive doublet clusters.
The Mast cells recovered in pediCD did further sub-cluster in an automated fashion and were largely marked by TPSB2 (>90%), with minimal CMA1 (<16%) expressing cells, suggesting they are largely classical MC-T cells in pediCD intestine. Intriguingly, some subsets (CD.Mstcl.AREG.ADCYAP1) were enriched for IL13 expression. Applicants also detected a small cluster of proliferating mast cells from several patients (CD.Mstcl.CDK1.KIAA0101).
Applicants also identified four Tier 1 clusters for Plasma cells, which are characterized by their strong expression of IGH* immunoglobulin heavy-chain genes together with either IGK* (kappa light chain) or IGL* (lambda light chain) genes. This resolved IgA IgK plasma cells, IgA IgL plasma cells, IgM plasma cells, and IgG plasma cells. Iterative tiered clustering identified further heterogeneity within all clusters of IgA plasma cells, though given the 3′-bias of this dataset, Applicants note that a principled investigation of these clusters would ideally use 5′ sequencing with targeted VDJ amplification.
Together, the treatment-naïve cell atlas from 14 pediCD patients captures 305 cell clusters from an inflammatory state of pediatric ileum.
Because this pediCD atlas was curated from treatment-naïve diagnostic samples, Applicants were able to interrogate the data to test whether overall shifts in cellular composition, specific cell states, and/or gene expression signatures underlie clinically-appreciated disease severity and treatment decisions (NOA vs. FR/PR), and which of these are associated with either FRs or PRs to anti-TNF blockade. Here, Applicants leveraged the detailed clinical trajectories collected from all patients in order to resolve distinctions between cellular composition and cell states with disease and treatment outcomes.
In order to capture the overall principal axes of variation explaining changes in cellular composition, Applicants calculated the fractional composition of all 305 end cell clusters in pediCD within its parent cell type (“per cell type”), or within all cells (“per total cells”), and performed a principal component analysis (PCA) over both of these sample x cell cluster frequency tables. Applicants then used the PC1 (13.4% variation “per cell type” and 13.5% variation “per total cells”) and PC2 (12.7% variation “per cell type” and 11.8% variation “per total cells”) as numerical variables which Applicants correlated with clinical metadata including categorical variables (patient ID, ethnicity, gender, etc.), ordinal variables (TI-macroscopic, TI-microscopic, Anti-TNF in 30 days, anti-TNF_NOA_FR_PR, etc.) and numerical variables (Height, BMI, CRP, ESR, PLT, PCDAI (Pediatric Crohn's Disease Activity Index), wPCDAI, etc.) (
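By way of non-limiting illustration, the two frequency tables described above ("per cell type" and "per total cells") can be sketched as follows before any PCA is applied. The function name, the input layout (one tuple per cell), and the example sample and cluster labels are hypothetical; the sketch does not perform the PCA or the correlation with clinical metadata.

```python
from collections import Counter


def fractional_composition(cells):
    """cells: iterable of (sample, cell_type, end_cluster), one per cell.

    Returns two dicts keyed by (sample, end_cluster): the fraction of the
    cluster within its parent cell type, and within all cells of the sample.
    """
    cluster_counts = Counter()
    type_counts = Counter()
    sample_counts = Counter()
    parent = {}  # end cluster -> parent cell type
    for sample, cell_type, cluster in cells:
        cluster_counts[(sample, cluster)] += 1
        type_counts[(sample, cell_type)] += 1
        sample_counts[sample] += 1
        parent[cluster] = cell_type

    per_cell_type = {}
    per_total_cells = {}
    for (sample, cluster), n in cluster_counts.items():
        per_cell_type[(sample, cluster)] = n / type_counts[(sample, parent[cluster])]
        per_total_cells[(sample, cluster)] = n / sample_counts[sample]
    return per_cell_type, per_total_cells


# Hypothetical sample with 4 T cells (two end clusters) and 4 myeloid cells.
example_cells = (
    [("s1", "T", "T.a")] * 2
    + [("s1", "T", "T.b")] * 2
    + [("s1", "Myeloid", "M.a")] * 4
)
per_type, per_total = fractional_composition(example_cells)
print(per_type[("s1", "T.a")], per_total[("s1", "T.a")])  # 0.5 0.25
```

Each dict can then be arranged as a sample x cell cluster table, which is the input over which a PCA of the kind described above would be computed.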
In order to understand if any cell types were predominantly driving associations with disease severity, Applicants then further decomposed the overall PCA on 305 end clusters and performed PCA over each cell type's fractional composition of end clusters individually (B cells: 33 clusters, Endothelial: 18 clusters, Epithelial: 68 clusters, Fibroblast: 45 clusters, Myeloid: 54 clusters, T cells: 57 clusters), and correlated the first two PCs (all PC1s and PC2s accounted for >13% variance each) with all of the clinical variables. The PCs derived from T/NK/ILC cells, Myeloid cells, and Epithelial cells were all moderately correlated with anti-TNF_NOA_FR_PR status (>0.49) and had higher values than the other cell types, so Applicants asked if a PCA-based metric considering all three cell types would capture disease severity and treatment response. When Applicants calculated the PCA accounting for frequencies within each cell type of T/NK/ILC cells, Myeloid cells, and Epithelial cells, Applicants found strong correlation for PC2 with both anti-TNF within 30 days (r=−0.83) and anti-TNF-NOA_FR_PR status (r=−0.87) (
Applicants next focused on further deconstructing this severity vector: identifying which cell clusters accounted for the most significant changes in abundance based on the relative frequency of an end cell cluster within its parent cell type. Applicants focus on this form of analysis, as may typically be reported for flow cytometry, and further discuss approaches to enumerate total cell numbers which would be critical to identify changes in overall cellularity in the different pediCD treatment and response categories (Discussion). Applicants first performed a Fisher's exact test between NOA vs. FR, NOA vs. PR or FR vs. PR, and then performed a Mann-Whitney U test to highlight specific clusters and discuss results from clusters with high Simpson's index of diversity (i.e. recovered from multiple patients) as shown for T/NK/ILCs and Myeloid Cell Types (
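The patient-diversity filter mentioned above can be illustrated with a minimal sketch that computes Simpson's index of diversity for one end cluster from per-patient cell counts. It assumes the common form 1 − Σ p_i², where p_i is the fraction of the cluster's cells contributed by patient i. The exact estimator and any cutoff Applicants used are not stated here, and the patient IDs and counts below are hypothetical.

```python
def simpson_diversity(cells_per_patient):
    """Simpson's index of diversity, 1 - sum(p_i^2), over patients.

    Values near 0 mean the cluster comes almost entirely from one patient;
    values near 1 - 1/k mean it is spread evenly over k patients.
    """
    total = sum(cells_per_patient.values())
    if total == 0:
        raise ValueError("cluster has no cells")
    return 1.0 - sum((n / total) ** 2 for n in cells_per_patient.values())


# Hypothetical per-patient cell counts for two end clusters.
shared_cluster = {"p01": 40, "p02": 35, "p03": 25}
private_cluster = {"p01": 98, "p02": 2, "p03": 0}

shared_score = simpson_diversity(shared_cluster)
private_score = simpson_diversity(private_cluster)
```

Under this assumed definition, a cluster recovered from several patients scores higher than one dominated by a single patient, which is the property the text relies on when restricting discussion to clusters recovered from multiple patients.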
Cell Subsets that are NOA→RESP and PR
Between NOAs and both FRs and PRs, two subsets with significantly increased frequency amongst T cells, NK cells, and ILCs were identified. These were CD.NK.MKI67.GZMA and CD.T.MKI67.IL22 (
Cell Subsets that are NOA→PR
Applicants next focused on those cell subsets that were significantly changed only between NOAs and PRs. Here Applicants note several more distinct clusters within the lymphocyte cell type, including increases of CD.T.MKI67.IFNG, CD.T.MKI67.FOXP3, CD.T.GNLY.CSF2, and CD.NK.GNLY.FCER1G in the PR patients compared to NOA patients. The two MKI67 clusters again highlighted an increase in proliferative cells, specifically cells enriched for IFNG, GNLY, HOPX, ITGAE and IL26 (CD.T.MKI67.IFNG), and IL2RA, BATF, CTLA4, TNFRSF1B, CXCR3, and FOXP3 (CD.T.MKI67.FOXP3), the latter of which may be indicative of proliferating regulatory T cells. The two GNLY clusters emphasized cytotoxicity, specifically cell clusters were both enriched for GNLY, GZMB, GZMA, PRF1 and more specifically for IFNG, CXCR6, and CSF2 (CD.T.GNLY.CSF2), or AREG, TYROBP, and KLRF1 (CD.NK.GNLY.FCER1G). Amongst myeloid cells, there was an increase in CD.Mac.CXCL3.APOC1, CD.Mono/Mac.CXCL10.FCN1, and CD.Mono.FCN1.S100A4 in PR versus NOA. The CD.Mac.CXCL3.APOC1 cluster was enriched for a variety of chemokines including CCL3, CCL4, CXCL3, CXCL2, CXCL1, CCL20, and CCL8. It was also enriched for TNF and IL1B. The CD.Mono/Mac.CXCL10.FCN1 cluster was enriched for CXCL9, CXCL10, CXCL11, GBP1, GBP2, GBP4, GBP5, suggestive of activation by IFN, and more specifically Type II IFN-gamma, based on the GBP gene cluster. CD.Mono.FCN1.S100A4 was characterized by S100A4, S100A6, and FCN1 expression. These two hematopoietic clusters were paralleled by increases in certain clusters within endothelial cells (CD.Endth/Ven.LAMP3.LIPG) and epithelial cells (CD.Goblet.TFF1.TPSG1).
Several clusters of cells were decreased in PR versus NOA, including CD.T.LAG3.BATF, CD.T.IFI44L.PTGER4, and CD.T.IFI6.IRF7 amongst lymphocytes. Amongst myeloid cells, CD.cDC2.CLEC10A.FCGR2B were decreased, and amongst fibroblasts CD.Fibro.IFI6.IFI44L were decreased. In epithelial cells, CD.Tuft.GNAT3.TRPM5 cells were decreased. Alongside the decrease in Tuft cells amongst epithelial cells, two more clusters closely related to the aforementioned CD.EC.GSTA2.SLC28A3 cluster, also marked by GSTA2 expression, were significantly decreased (CD.EC.GSTA2.CES3, and CD.EC.GSTA2.TMPRSS15).
Cell Subsets that are NOA→RESP
Applicants also detected significant decreases in FRs relative to NOAs in certain cell types, particularly within Epithelial cells including CD.EpithStem.LINC00176.RPS4Y1, CD.MCell.CSRP2.SPIB, CD.EC.FABP6.PLCG2, and CD.EC.FABP1.ADIRF. Applicants note that the relative decrease in M cells is in stark contrast to the “ectopic” M-like cells that were detected in adult ulcerative colitis.
Cell Subsets that are FR→PR
Lastly, Applicants assessed the compositional differences between FRs and PRs and only identified one cell cluster which was significantly increased in PRs: CD.B/DZ.HIST1H1B.MKI67, which are proliferating dark zone B cells. Together, these data suggest that at the earlier stages of pediCD, there are a series of gradual changes in the multiple cell types that encapsulate the progression from NOA to FR to PR patients. These changes were particularly notable within proliferating T cells, cytotoxic NK cells, and monocytes/macrophages that together provide a numerical variable in PC2-“T/NK/ILC/Myeloid/Epithelial” which correlates strongly with both anti-TNF use within 30 days and anti-TNF_NOA_FR_PR status.
As Applicants had generated independent cellular atlases for FGID and pediCD, Applicants next sought to identify correspondence between jointly detected cell subsets. As the study progressed, several analytical methods to integrate scRNA-seq emerged which utilize distinct principles to either predict cell type names given reference gene lists or directly integrate two datasets that were collected from distinct perturbations, tissue, or even species. However, many of these methods are benchmarked on broad cell type or subset integration, and thus their applicability for fine cell states, as in the end clusters Applicants identify here, remains unknown. Thus, Applicants employed a random forest classifier based approach, which has recently also been applied successfully in work to identify correspondence in fine sub-clusters in the mammalian retina. Specifically, Applicants employed cross validation within FGID or pediCD cell types before running between FGID and pediCD in both directions (Methods). Applicants applied this to all cell types, and here focus the discussion on Myeloid cells and T/NK/ILC cells (
Comparing across Myeloid cells between pediCD and FGID, Applicants could identify strong correspondence of specific cell subsets such as cDC1s and pDCs (
For T/NK/ILC cells, Applicants identified more discrete patterns relative to Myeloid cells based on comparison of the Random Forest result. Within the two FGID cytotoxic T cell clusters, Applicants identified correspondence by 18 pediCD clusters, representing Type 17 ILCs, and cytotoxic NK cells and T cells (
Importantly, when Applicants jointly clustered macrophages from FGID and pediCD, Applicants found that several of the original end clusters identified through iterative tiered clustering in pediCD were divided across the UMAP, ending up split into distinct clusters of cells. This highlights the challenge posed by multiple cell type, subset, and state vectors that are simultaneously accounted for when clustering over a set of highly variable genes jointly derived from multiple cells and disease conditions, and shows that highly homogeneous cell clusters may be dispersed across a space depending on the other cells that they are being compared against and the parameters used for clustering.
Based on their over-representation within clusters showing more significant differences within pediCD, Applicants then focused on performing pseudotime over a shared gene expression space of the T/NK/ILCs and monocytes/macrophages. Applicants utilized a list of genes that were cell-type defining genes in either FGID or pediCD (Table 1 and Table 4), but removed genes that were differentially-expressed between FGID and pediCD (Table 2), to allow for cell type/subset to drive placement on the pseudotime axis (Methods). This allowed Applicants to place the fine-grained clusters within a joint gene-expression space to relate FGID to pediCD. In the T/NK/ILC analysis, Applicants observed a gradient of naïve T cells to the left, and two paths leading to helper T cells and ILCs along the top axis, and cytotoxic T cells and NK cells on the bottom axis (
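The gene-selection step described above (cell-type-defining genes from either atlas, with disease-associated differentially expressed genes removed) amounts to a set union followed by a set difference. The sketch below is illustrative only; the function name and the placeholder gene identifiers are hypothetical and do not reproduce the contents of Tables 1, 2, or 4.

```python
def pseudotime_gene_space(fgid_markers, pedicd_markers, disease_de_genes):
    """Cell-type-defining genes from either atlas, minus disease DE genes."""
    defining = set(fgid_markers) | set(pedicd_markers)
    return sorted(defining - set(disease_de_genes))


# Placeholder gene identifiers (hypothetical, for illustration only).
fgid_markers = ["geneA", "geneB", "geneC", "geneD"]
pedicd_markers = ["geneA", "geneC", "geneE", "geneF"]
disease_de_genes = ["geneB", "geneF"]

shared_genes = pseudotime_gene_space(fgid_markers, pedicd_markers, disease_de_genes)
```

Removing the disease-associated genes before ordering cells is what lets cell type and subset identity, rather than disease status, drive placement along the pseudotime axis.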
Within monocytes and macrophages, Applicants identified a gradient from right to left, with macrophages having a more homeostatic gene expression signature (MMP9, APOE) as the origin (
Most comprehensive scRNA-seq atlases of inflammatory disease conditions consist of patients being treated with a variety of agents, and for which the biopsies included often reflect a partial treatment-refractory state to combinations of antibiotics, corticosteroids, immunomodulators, and biologics including anti-TNF monoclonal antibodies. A treatment-naïve single-cell atlas in an inflammatory disease condition linking observed baseline cell clusters with disease trajectory and treatment outcomes has yet to be reported. In order to address this unmet need in pediCD, Applicants created the prospective PREDICT study (Clinicaltrials.gov #NCT03369353) to help identify, profile, and understand pediatric IBD and FGID controls. Here, Applicants present detailed diagnostic data from the first cohort of 27 patients enrolled on PREDICT, including 14 pediCD and 13 FGID patients, together with flow cytometric and scRNA-seq studies of the cellular composition of the terminal ileum (
Several analytical approaches have been developed to enable the generation and interrogation of clusters during the curation of single-cell transcriptomic atlases (Hie et al., 2020). One such method, sub-clustering of broad clusters, has proven to be a powerful tool for isolating highly specific axes of variation that are obscured by analyses whose principal axes of variation are broad cell types (La Manno et al., 2021; Tasic et al., 2018; Zeisel et al., 2018). However, while sub-clustering analysis is a powerful tool allowing access to the hierarchy of cell states, this method is manually intensive and there is little consensus, control, or standard in clustering parameters or annotation methods. To address this issue, Applicants developed a principled, modular, automated sub-clustering routine made possible by application of parameter scanning methods (Rousseeuw, 1987; Shekhar et al., 2016). Applicants developed this tool, ARBOL, of which iterative tiered clustering (ITC) is a key component, in R, integrating with Seurat functions, to make it accessible and easily incorporated into common workflows and have curated a GitHub repository with illustrative vignettes. Here, Applicants use ARBOL to standardize fine-grained cell state discovery by the creation and cultivation of a tree of cell states, followed by the generation of automated cell names to aid in the annotation of end clusters by unique and descriptive genes.
Together Applicants present two cellular atlases for pediatric GI disease, consisting of 94,451 cells for FGID and 107,432 for pediCD. Applicants provide key gene-list resources for further studies, identify correspondence between disease states, and nominate a vector of lymphoid, myeloid and epithelial cell states which predicts disease severity and treatment outcomes. This cellular vector correlates strongly with both the clinical presentation of pediCD severity, and to the distinction between anti-TNF full or partial response. The significant changes in cell composition associated with disease severity were increases of proliferating T cells, cytotoxic NK cells, specific monocytes/macrophages, and plasmacytoid dendritic cells (pDCs) accompanied by decreases of metabolically-specialized epithelial cell subsets. Applicants further validate this vector in two bulk RNA-seq treatment-naïve IBD cohorts.
The PREDICT study prospectively enrolled treatment-naïve, previously undiagnosed pediatric patients with GI complaints necessitating diagnostic endoscopy. The current analysis focuses on patients enrolled in the first year of the study, during which time 14 patients with pediCD and 13 patients with FGID were enrolled and had adequate ileal samples for single cell analysis (
Patients with pediCD were initially divided into two cohorts. Those with milder disease characteristics (n=4) as determined by their treating physician, were not put on anti-TNF therapies, and are noted as NOA. For patients with more severe disease (n=10), anti-TNF therapy (with either infliximab or adalimumab, Table 6) was initiated within 90 days of diagnostic endoscopy. All pediCD patients were followed prospectively and categorized as FR (n=5) or PR (n=5) to anti-TNF therapy based on the following criteria: FR was defined as clinical symptom control and biochemical response (measuring CRP, ESR, albumin, and complete blood counts (CBC)), and with a weighted Pediatric Crohn's Disease Activity Index (PCDAI) score of <12.5 on maintenance anti-TNF therapy with no dose adjustments required (Cappello and Morreale, 2016; Hyams et al., 1991; Sandborn, 2014; Turner et al., 2012, 2017). PR to anti-TNF therapy was defined as a lack of full clinical symptom control as determined by the treating physician or lack of full biochemical response, with documented escalation of anti-TNF therapy or addition of other agents (
Applicants collected terminal ileum biopsies from 14 pediCD patients and from 13 uninflamed FGID patients, and prepared single-cell suspensions for flow cytometry and scRNA-seq. Biopsies from pediCD were from actively-inflamed areas adjacent to ulcerations. Biopsies from FGID were from non-inflamed terminal ileum. The epithelium was first separated from the lamina propria before enzymatic dissociation, and flow cytometric analysis was performed on the viable single-cell fraction, which recovered predominantly hematopoietic cells with some remnant epithelial cells (<20% of all cells), likely representing those in deeper crypt regions (
In addition to flow cytometry, Applicants performed droplet-based scRNA-seq on cell suspensions from the 14 pediCD/13 FGID patient cohort using the 10× Genomics V2 3′ platform (
Following library preparation and sequencing, Applicants derived a unified cells-by-genes expression matrix from the 27 samples, containing digital gene expression values for all cells passing quality thresholds (n=254,911 cells;
Applicants then systematically re-clustered each broad cell type, identifying increasing cellular heterogeneity. Given that Applicants detected changes in the frequency of HLA-DR+ macrophages/dendritic cells and pDCs between pediCD and FGID by flow cytometry, Applicants initially focused on the myeloid cell type sub-clustering, containing dendritic cells, macrophages, monocytes, and pDCs (
In order to approach this challenge from a more principled direction, Applicants made five key changes to the analytical workflow, which Applicants jointly refer to as ARBOL (github.com/jo-m-lab/ARBOL): 1. Applicants proceeded to analyze FGID and pediCD samples separately to define corresponding cell type, subset, and state clusters and markers, 2. Applicants implemented an automated ITC approach to optimize the silhouette score at each tier of iterative sub-clustering and stop when a specific granularity was reached (
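The silhouette-guided choice of clustering granularity mentioned above can be sketched as follows. This is a minimal, standard-library illustration of the silhouette score (Rousseeuw, 1987) used to pick between candidate labelings. It is not the ARBOL implementation, which is written in R; the toy coordinates, candidate names, and the singleton convention are assumptions.

```python
from math import dist


def mean_silhouette(points, labels):
    """Mean silhouette width over all points (Euclidean distance)."""
    clusters = {}
    for idx, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(idx)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")
    scores = []
    for i, lab in enumerate(labels):
        own = [j for j in clusters[lab] if j != i]
        if not own:
            scores.append(0.0)  # assumed convention for singleton clusters
            continue
        # a: mean distance to the other members of the point's own cluster
        a = sum(dist(points[i], points[j]) for j in own) / len(own)
        # b: smallest mean distance to the members of any other cluster
        b = min(
            sum(dist(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items()
            if other != lab
        )
        denom = max(a, b)
        scores.append(0.0 if denom == 0 else (b - a) / denom)
    return sum(scores) / len(scores)


# Toy embedding with two well-separated groups (hypothetical coordinates).
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
candidates = {
    "two_groups": [0, 0, 0, 1, 1, 1],
    "interleaved": [0, 1, 0, 1, 0, 1],
}
best = max(candidates, key=lambda name: mean_silhouette(points, candidates[name]))
```

In a parameter scan, each candidate labeling would come from a different clustering resolution, and the labeling with the highest mean silhouette would be carried into the next tier.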
Applicants then hierarchically clustered all end cell state clusters to generate the final dendrograms for FGID and pediCD, and performed 1 vs. rest within-Tier 1 clusters (i.e. broad cell types) differential expression to provide systematic names for cells based on their cell type classification and two genes (
Applicants generated gene lists for cell types (1 vs. rest across all cells), subsets (1 vs. rest across all cells), and states (1 vs. rest within-Tier 1 cell-type) (see Table 1 and Table 4). To select marker genes for naming in a data-driven manner, Applicants used 1 vs. rest within-cell-type differential expression (Table 1 and Table 4; Wilcoxon, Bonferroni adjusted p<0.05). To account for genes that might be highly expressed in just a few cells, Applicants ranked the marker genes by a score combining their significance, the fold change in expression, and the fold change of the percent of gene-positive cells in the subset versus the percent of gene-positive cells outside the subset. The collected metrics were multiplied together to provide a single score by which the genes were ranked: (−log(sig+1)*avg_log FC*(pct.in/pct.out)). For most subsets Applicants selected the top 2 of these marker genes. For T/NK/ILC cells and myeloid cells Applicants occasionally chose a slightly lower ranking gene from the top 10 if it was well supported and recognized by the literature. Using this ranking system, Applicants identified CCL3 and CD160 as two genes significantly enriched in one NK cluster (adj. p-value=0, expression within-cluster >40% cells positive and in other Tier 1 T cells <6%). This resulted in a final name for this cluster of CD.NK.CCL3.CD160. Applicants repeated this process for all FGID (138 end clusters) and pediCD (305 end clusters) within Tier 1 B cell, endothelial, epithelial, fibroblast, plasma cell, myeloid cell, mast cell, and T cell identified clusters, and provide systematically generated names for all (Tables 8 and 9).
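The naming convention described above (condition prefix, subset label, then the two top-ranked marker genes) can be sketched as below. The function name is hypothetical, the scores are placeholder values rather than output of the ranking formula, and treating a higher score as a better rank is an assumption.

```python
def name_end_cluster(condition_prefix, subset_label, gene_scores, n_genes=2):
    """Join prefix, subset label, and the top-ranked marker genes with dots."""
    # Assumption: a higher combined score means a better-ranked marker gene.
    top = sorted(gene_scores, key=gene_scores.get, reverse=True)[:n_genes]
    return ".".join([condition_prefix, subset_label, *top])


# Placeholder scores (hypothetical); GENE_X stands in for a lower-ranked marker.
gene_scores = {"CD160": 7.4, "GENE_X": 3.2, "CCL3": 9.1}
cluster_name = name_end_cluster("CD", "NK", gene_scores)
```

With these placeholder scores the sketch reproduces the form of the CD.NK.CCL3.CD160 name given in the text. A manual override, such as choosing a literature-supported gene from the top 10, would replace the automatic top-2 selection.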
Using this analytical workflow, Applicants present comprehensive cellular atlases of FGID (
From 13 FGID patients, Applicants recovered 12 Tier 1 clusters which Applicants display on a t-distributed stochastic neighbor embedding (t-SNE) plot colored by cluster identity containing 99,488 cells (
Within B cells, Applicants identified a strong division between non-cycling and cycling B cells, with those found in the cycling compartment readily identifiable by germinal center markers and further dark zone (AICDA) and light zone (CD83) genes resulting in FG.B/DZ.AICDA.IGKC and FG.B/LZ.CD74.CD83 clusters (
Within myeloid cells, Applicants identified, and confirmed using extensive inspection of literature curated markers, cell subsets corresponding to monocytes (CD14, FCGR3A, FCN1, S100A8, S100A9, etc.), macrophages (CSF1R, MERTK, MAF, C1QA, etc.), cDC1 (CLEC9A, XCR1, BATF3), cDC2 (FCER1A, CLEC10A, CD1C, IRF4 etc.), and pDCs (IL3RA, LILRA4, IRF7) (
Within T cells, Applicants followed a similar approach as utilized for Myeloid cells and identified principal cell subsets of T cells (joint expression of CD247, CD3D, CD3E, CD3G with TRAC, TRBC1, TRBC2, or TRGC1, TRGC2 and TRDC), and a combined cluster of cytotoxic cells (FG.T/NK/ILC.GNLY.TYROBP) likely including T cells, NK cells (lower expression of TCR-complex genes with NCAM1, NCR1 and TYROBP), and some ILCs (KIT, NCR2, RORC and low expression of CD3-complex genes) (
Within epithelial cells, most cells expressed high levels of OLFM4, identifying them as crypt-localized cells (Moor et al., 2018). Applicants readily identified subsets of stem cells (LGR5), proliferating cells (TOP2A), goblet cells (SPINK4, ZG16, various MUCs), enteroendocrine cells (SCG3, ISL1), Paneth cells (ITLN2, PRSS2, LYZ), tuft cells (GNG13, SH2D6, TRPM5) and enterocytes (APOC3, APOA1, FABP6, etc.) (
Within endothelial cells, Applicants readily identified vascular and lymphatic endothelial cells (LYVE1, PROX1), with the vascular cells able to be further identified as capillaries (CA4) or venular endothelial cells (ACKR1, MADCAM1) (Brulois et al., 2020). Applicants also identified a subset of cells (FG.Endth/Peri.FRZB.NOTCH3) expressing high levels of FRZB and NOTCH3, which, rather than being arterioles, likely represent arteriole-associated pericytes or smooth muscle cells given the absence of EFNB2, SOX17, BMX, and HEY1, and the presence of ACTA2 and MYL9, as cluster-defining genes (
Within fibroblasts, Applicants identified principal subsets characterized by their structural roles (COL3A1, ADAMDEC1, FBLN1, LUM, etc.), myofibroblasts (MYH11, ACTA2, ACTG2, etc.), and organization of lymphoid cells (CCL19, CCL21 etc.) (
The mast cells recovered did not further sub-cluster in an automated fashion, and were largely marked by TPSB2 and TPSAB1 (>97%), with minimal CMA1 (<20%) expressing cells, suggesting they are largely classical MC-T cells in FGID intestine (Dwyer et al., 2021).
Applicants identified four Tier 1 clusters for plasma cells, which are characterized by their strong expression of IGH* immunoglobulin heavy-chain genes together with either IGK* (kappa light chain) or IGL* (lambda light chain) genes (Cyster and Allen, 2019; James et al., 2020). This resolved IgA IgK plasma cells, IgA IgL plasma cells, IgM plasma cells, and IgG plasma cells. Iterative tiered clustering identified further heterogeneity within all clusters of IgA and IgG plasma cells, though given the 3′-bias of this dataset, Applicants note that a principled investigation of these clusters would ideally use 5′ sequencing with targeted VDJ amplification.
Together, the treatment-naïve cell atlas from 13 FGID patients captures 138 cell clusters from a non-inflammatory state of pediatric ileum which Applicants annotated and named in a principled fashion.
From the 124,054 cells profiled from 14 pediCD patients, Applicants recovered 12 Tier 1 clusters which here Applicants display on a t-SNE plot colored by cluster identity (
Within B cells, Applicants also identified a strong division between non-cycling and cycling B cells, with those found in the cycling compartment readily identifiable by germinal center markers and further dark zone (AICDA) and light zone (CD83) genes, as in FGID (Victora et al., 2010). Within cells expressing germinal center markers, a highly-proliferative branch including clusters such as CD.B/LZ.CCL22.NPW, CD.B/GC.MKI67.RRM2, and CD.B/DZ.HIST1H1B.MKI67 emerged (
Within myeloid cells, Applicants identified, and confirmed using the same extensive inspection of literature curated markers as in FGID, cell subsets corresponding to monocytes (CD14, FCGR3A, FCN1, S100A8, S100A9, etc.), macrophages (CSF1R, MERTK, MAF, C1QA, etc.), cDC1 (CLEC9A, XCR1, BATF3), cDC2 (FCER1A, CLEC10A, CD1C, IRF4 etc.), and pDCs (IL3RA, LILRA4, IRF7) (
Within T cells, Applicants followed a similar approach as utilized for FGID T cells and identified cell subsets of T cells (joint expression of CD247, CD3D, CD3E, CD3G with TRAC, TRBC1, TRBC2, or TRGC1, TRGC2 and TRDC), but in pediCD also identified several discrete clusters of NK cells (lower expression of TCR-complex genes with FCGR3A or NCAM1, NCR1 and TYROBP), and ILCs (KIT, NCR2, RORC and low expression of CD3-complex genes) (
Within epithelial cells Applicants identified substantial heterogeneity in CD. Most cells expressed high levels of OLFM4, identifying them as crypt-localized cells (Moor et al., 2018). Applicants readily identified subsets of stem cells (LGR5), proliferating cells (TOP2A), goblet cells (SPINK4, ZG16, various MUCs), enteroendocrine cells (SCG3, ISL1), Paneth cells (ITLN2, PRSS2, LYZ), tuft cells (GNG13, SH2D6, TRPM5) and enterocytes (APOC3, APOA1, FABP6, etc.) (
Within endothelial cells, Applicants also readily identified vascular and lymphatic endothelial cells (LYVE1, PROX1), with the vascular cells able to be further identified as capillaries (CA4) or venular endothelial cells (ACKR1, MADCAM1) (
Within fibroblasts, Applicants identified principal subsets characterized by their structural roles (COL3A1, ADAMDEC1, FBLN1, LUM, etc.), myofibroblasts (MYH11, ACTA2, ACTG2, etc.), and organization of lymphoid cells (CCL19, CCL21 etc.) (
The mast cells recovered in pediCD did further sub-cluster in an automated fashion, and were largely marked by TPSB2 (>90%), with minimal CMA1 (<16%) expressing cells, suggesting they are largely classical MC-T cells in pediCD intestine (
Applicants also identified four Tier 1 clusters for plasma cells, which are characterized by their strong expression of IGH* immunoglobulin heavy-chain genes together with either IGK* (kappa light chain) or IGL* (lambda light chain) genes. This resolved IgA IgK plasma cells, IgA IgL plasma cells, IgM plasma cells, and IgG plasma cells. Iterative tiered clustering identified further heterogeneity within all clusters of IgA plasma cells, though given the 3′-bias of this dataset, Applicants note that a principled investigation of these clusters would ideally use 5′ sequencing with targeted VDJ amplification.
Together, the treatment-naïve cell atlas from 14 pediCD patients captures 305 cell clusters from an inflammatory state of the pediatric ileum, suggesting an increase in the number and diversity of cell states present in the intestine during overt inflammatory disease.
As this pediCD atlas was curated from treatment-naïve diagnostic samples, Applicants were able to interrogate the data to test if overall shifts in cellular composition, specific cell states, and/or gene expression signatures underlie clinically-appreciated disease severity and treatment decisions (NOA vs. FR/PR), and those that are further associated with response to anti-TNF therapies (either FRs or PRs). Here, Applicants leveraged the detailed clinical trajectories collected from all patients as the ultimate functional test: resolving how cellular composition and cell states predict disease and treatment outcomes.
In order to capture the overall principal axes of variation explaining changes in cellular composition, Applicants calculated the fractional composition of all 305 end cell clusters in pediCD within its parent cell type (“per cell type”), or within all cells (“per total cells”), and performed a principal component analysis (PCA) over both of these sample x cell cluster frequency tables (Table 11) (Mathew et al., 2020). Applicants then used the PC1 (13.4% variation “per cell type” and 13.5% variation “per total cells”) and PC2 (12.7% variation “per cell type” and 11.8% variation “per total cells”) as numerical variables which Applicants correlated with clinical metadata including categorical variables (patient ID, ethnicity, gender, etc.), ordinal variables (Terminal Ileum (TI)-macroscopic endoscopic evidence, TI-microscopic histopathology, Anti-TNF treatment within 90 days of diagnosis, and treatment decision/response coded as anti-TNF_NOA_FR_PR, etc.) and numerical variables (Height, BMI, CRP, ESR, PLT, PCDAI, wPCDAI, etc.) (
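The two frequency tables described above differ only in their denominator: an end cluster's cell count is divided either by the cell count of its parent cell type or by all cells in the sample. The following minimal sketch builds both from per-cell annotations. The input layout, function name, and toy labels are assumptions, and the PCA step itself is not shown.

```python
from collections import Counter


def composition_tables(cells):
    """cells: list of (sample, cell_type, end_cluster) tuples, one per cell.

    Returns two dicts keyed by (sample, end_cluster):
    fraction within the parent cell type, and fraction within all cells.
    """
    per_cluster = Counter((s, c) for s, _, c in cells)
    per_type = Counter((s, t) for s, t, _ in cells)
    per_sample = Counter(s for s, _, _ in cells)
    parent = {c: t for _, t, c in cells}  # each end cluster has one parent type

    per_cell_type, per_total_cells = {}, {}
    for (s, c), n in per_cluster.items():
        per_cell_type[(s, c)] = n / per_type[(s, parent[c])]
        per_total_cells[(s, c)] = n / per_sample[s]
    return per_cell_type, per_total_cells


# Toy sample: 4 T cells in two end clusters and 4 myeloid cells in one.
cells = (
    [("s1", "T", "T.a")] * 3
    + [("s1", "T", "T.b")] * 1
    + [("s1", "Myeloid", "M.a")] * 4
)
per_cell_type, per_total_cells = composition_tables(cells)
```

Clusters absent from a sample would be entered as zero when the dictionaries are pivoted into a sample x cell cluster table for PCA.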
In order to understand if any cell types were predominantly driving associations with clinical disease severity at initial presentation, Applicants then further deconvoluted the overall PCA on 305 end clusters and performed PCA over each cell type's fractional composition of end clusters individually (B cells: 33 clusters, Endothelial: 18 clusters, Epithelial: 68 clusters, Fibroblast 45 clusters, Myeloid: 54 clusters, T/NK/ILCs: 57 clusters), and correlated the first two PC's (all PC1's and PC2's each accounted for >13% variance) with all of the clinical variables (Table 11). The PCs derived from T/NK/ILC cells, myeloid cells, and epithelial cells were all moderately correlated with anti-TNF_NOA_FR_PR status (r>0.49) individually and had higher values than the other cell types; therefore, Applicants asked if a PCA-based metric considering all three cell types would synergistically capture both disease severity and treatment response. When Applicants calculated the PCA accounting for frequencies within each cell type of T/NK/ILC cells, myeloid cells, and epithelial cells, Applicants found strong correlation for PC2 with both anti-TNF within 90 days (r=−0.83) and anti-TNF-NOA_FR_PR status (r=−0.87) (
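Correlating a principal component with an ordinal clinical variable, as described above, requires coding the categories numerically. The sketch below assumes a Pearson correlation and the coding NOA=0, FR=1, PR=2. Neither the coefficient type nor the coding is specified in the text, and the PC2 values are hypothetical.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


STATUS_CODE = {"NOA": 0, "FR": 1, "PR": 2}  # assumed ordinal coding

# Hypothetical per-patient PC2 scores and treatment/response categories.
status = ["NOA", "NOA", "FR", "FR", "PR", "PR"]
pc2 = [2.0, 1.0, 0.5, -0.5, -1.0, -2.0]

r = pearson_r(pc2, [STATUS_CODE[s] for s in status])
```

A negative coefficient in this toy example means PC2 decreases from NOA through FR to PR. The sign of a principal component is arbitrary, so only the magnitude of such a correlation is informative.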
Applicants next focused on further deconstructing the disease severity vector: identifying which cell clusters accounted for the most significant changes in abundance based on the relative frequency of an end cell cluster within its parent cell type. Applicants focus on this form of analysis for scRNA-seq, similar to what is typically reported for flow cytometry, and further discuss approaches to enumerate total cell numbers which would be critical to identify changes in overall cellularity in the different pediCD treatment and response categories (Discussion) (Gomariz et al., 2018). Applicants first performed a Fisher's exact test between NOA vs. FR; NOA vs. PR; or FR vs. PR, and then performed a Mann-Whitney U test to highlight specific clusters and discuss results from clusters with high Simpson's index of diversity (i.e. recovered from multiple patients) as shown for T/NK/ILCs and Myeloid Cell Types (
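A two-sided Fisher's exact test on a 2x2 table, as used for the pairwise group comparisons above, can be computed directly from the hypergeometric distribution. In this sketch the table layout (rows are patient groups, columns are cells inside versus outside a given end cluster within its parent cell type) is an assumption about how the test was applied, and the counts are hypothetical.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of seeing x in the top-left cell.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum every table at least as unlikely as the observed one.
    total = sum(
        prob(x) for x in range(lo, hi + 1) if prob(x) <= p_obs * (1 + 1e-9)
    )
    return min(1.0, total)


# Hypothetical counts: cells in the cluster vs. other cells of the parent type.
noa_in, noa_out = 5, 95
pr_in, pr_out = 20, 80
p_value = fisher_exact_two_sided(noa_in, noa_out, pr_in, pr_out)
```

Pooling cells across patients in this way treats each cell as independent, which is one reason the text pairs the test with a per-patient Mann-Whitney U test and a patient-diversity filter.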
When comparing FR/PRs to NOAs, two subsets with significantly increased frequency in FR/PR patients amongst T cells, NK cells, and ILCs were identified. These were CD.NK.MKI67.GZMA and CD.T.MKI67.IL22 (
Applicants also detected significant decreases in FRs relative to NOAs in certain cell types, particularly within Epithelial cells including CD.EpithStem.LINC00176.RPS4Y1, CD.MCell.CSRP2.SPIB, CD.EC.FABP6.PLCG2, and CD.EC.FABP1.ADIRF (
Applicants next focused on those cell subsets that were significantly changed only between PRs and NOAs (
Several clusters of cells were decreased in PR versus NOA, including CD.T.LAG3.BATF, CD.T.IFI44L.PTGER4, and CD.T.IFI6.IRF7 amongst lymphocytes (
Applicants assessed the compositional differences between FRs and PRs and only identified one cell cluster which was significantly increased in PRs: CD.B/DZ.HIST1H1B.MKI67, which are proliferating dark zone B cells. CD.T.EGR1.TNF T cells were significantly decreased in PR versus FR (
Together, the significant changes in cell composition between the clinically-defined patient groups were particularly notable within proliferating T cells, cytotoxic NK cells, monocytes/macrophages, and epithelial cells that could be combined to calculate a numerical variable for “PC2-T/NK/ILC/Myeloid/Epithelial” which correlated strongly with both the clinical presentation leading to a decision to treat or not with anti-TNF therapies and to the distinction between anti-TNF_NOA_FR_PR status (
To determine whether the disease severity gene signature that Applicants discovered in the PREDICT study can be found in other cohorts, Applicants selected the top 92 markers of the 25 cell states associated with disease severity and treatment outcomes (Table 14) and performed a gene-set enrichment analysis (GSEA) (
As Applicants had generated independent cellular atlases for FGID and pediCD to mitigate “discovery” of hybrid cell clusters that may not represent bona fide biological cell states, Applicants next sought to match and identify correspondence between pediCD and FGID cell subsets. As the study progressed, several analytical methods to integrate scRNA-seq emerged which utilize distinct principles to either predict cell type names given reference gene lists or directly integrate two datasets that were collected from distinct perturbations, tissues, or even species (Hao et al., 2021; Hie et al., 2019; Korsunsky et al., 2019; Pliner et al., 2019). However, many of these methods are benchmarked on broad cell type or subset integration, and thus their applicability for fine cell states, as in the end clusters Applicants identify here, remains unknown. Thus, Applicants employed a random forest (RF) classifier-based approach, which has recently also been applied successfully in work to identify correspondence in fine sub-clusters in the mammalian retina (Peng et al., 2019; Shekhar et al., 2016). Specifically, Applicants employed paired RF models (one trained on FGID, the other trained on pediCD) to obtain cross-dataset predictions per cell. Applicants trained these models in scikit-learn with 5-fold cross-validation and the parameters min_samples_leaf=1, oob_score=True, criterion=“gini”, max_depth=200, n_estimators=700, max_features=“sqrt” (Pedregosa et al., 2011). The training set (but not the test set) was sampled with replacement such that all classes contained as many samples as the maximum proportioned class. This up-sampling procedure provided the largest gain to the test accuracy, sensitivity, and specificity scores, increasing accuracy ~10-15% across each cell type. With the final model, Applicants attained cross-dataset predictions (pediCD to FGID & FGID to pediCD) for each cell, giving a probability score of a cell belonging to a subset in the other disease condition (
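The class-balancing step described above (sampling the training set with replacement until every class matches the largest class) can be sketched with the standard library alone. The function name, seed, and toy data are hypothetical. The hyperparameters are copied from the text and held in a dictionary; passing them to scikit-learn's RandomForestClassifier is indicated only in a comment so that the sketch runs without third-party packages.

```python
import random
from collections import Counter, defaultdict

# Hyperparameters as listed in the text; they could be passed as
# sklearn.ensemble.RandomForestClassifier(**RF_PARAMS) (not executed here).
RF_PARAMS = dict(
    min_samples_leaf=1, oob_score=True, criterion="gini",
    max_depth=200, n_estimators=700, max_features="sqrt",
)


def upsample_to_max_class(samples, labels, seed=0):
    """Resample each class with replacement up to the largest class size."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for x, y in zip(samples, labels):
        by_class[y].append(x)
    target = max(len(members) for members in by_class.values())
    out_x, out_y = [], []
    for y, members in by_class.items():
        out_x.extend(rng.choices(members, k=target))  # with replacement
        out_y.extend([y] * target)
    return out_x, out_y


# Toy training set: integer IDs stand in for per-cell expression vectors.
train_x = list(range(8))
train_y = ["A"] * 5 + ["B"] * 2 + ["C"] * 1
balanced_x, balanced_y = upsample_to_max_class(train_x, train_y)
class_sizes = Counter(balanced_y)
```

Only the training folds are resampled. The held-out fold keeps its original class proportions so that the reported accuracy, sensitivity, and specificity reflect the real distribution of cell subsets.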
Comparing across myeloid cells between pediCD and FGID, Applicants could identify strong correspondence of specific cell subsets such as cDC1s or pDCs (
For T/NK/ILC cells, Applicants identified more discrete patterns relative to Myeloid cells based on comparison of the RF result. Within the two FGID cytotoxic T cell clusters, Applicants identified correspondence by 18 pediCD clusters, representing ILC3s, and cytotoxic NK cells and T cells (
Importantly, when Applicants jointly clustered macrophages from FGID and pediCD together, Applicants identified that several of the original end clusters identified through ARBOL in pediCD were divided across the UMAP: being split into distinct clusters of cells (
Based on their over-representation within clusters showing more significant differences within pediCD, Applicants next focused on performing an analysis over a shared gene expression space of FGID and pediCD of the monocytes/macrophages (
Within FGID monocytes/macrophages, Applicants identified that the majority of clusters occupied the periphery of the UMAP space, including chemokine-expressing clusters (FG.Mac.CCL3.HES1; FG.Mac.CXCL8.IL1B) and metabolic clusters (FG.Mac.APOE.PTGDS) (
Within T/NK/ILCs, Applicants identified that FGID cells were more uniformly mixed with the pediCD cells relative to monocytes/macrophages (
In order to provide tissue-scale context and understand the impact on other cell types for these anti-TNF response associated lymphocyte states, Applicants assessed the relationship of these proliferating T and NK cell clusters with epithelial and myeloid cells (
Applicants present two comprehensive cellular atlases of FGID and pediCD, and then identify correspondence between the two. Applicants generated complete gene lists for cell types (1 vs. rest across all cells), subsets (1 vs. rest across all cells), and states (1 vs. rest within cell type). Applicants then focused on pediCD, and those cell states and gene expression which distinguish between disease severity and FRs vs. PRs (Tables 1, 2, 3, and 14). The study addresses a critical unmet need in the fields of IBD and systems immunology: the creation of an atlas of newly-diagnosed and untreated diseased tissue, coupled with detailed clinical follow-up to link diagnostic cell types and states with disease trajectory. This is especially true for GI disease and others like it which afflict tissues that are not easily accessible without operative or endoscopic intervention, and where tissue-specific immune pathology dictates disease severity and trajectory. Likewise, cross-sectional studies, as have been the norm for most previous scRNA-seq studies of IBD, are not able to overlay disease trajectory and treatment response onto the topography of a complex multi-cellular atlas, thus limiting the mechanistic and predictive inferences that can be drawn from the generated atlas (Corridoni et al., 2020a, 2020b; Drokhlyansky et al., 2019; Elmentaite et al., 2020; Huang et al., 2019; Kinchen et al., 2018; Martin et al., 2019; Parikh et al., 2019; Smillie et al., 2019). Furthermore, mouse models of CD, and of IBD more broadly, may not be the most appropriate models for understanding treatment resistance in pediCD (Neurath, 2019). To surmount these limitations, Applicants created a prospective clinical study, and enrolled patients requiring a diagnostic biopsy for possible IBD, prior to diagnosis. This allowed Applicants to capture a tremendously valuable control group: those patients with FGID, who experience GI symptoms without evidence of GI inflammation or autoimmunity.
These uninflamed controls served as a critical comparator to contextualize the evidence of immune pathology that Applicants observed in patients with pediCD. With these detailed clinical phenotypes as the foundation, Applicants developed an automated ITC algorithm for scRNA-seq data, ARBOL, which defines a vector of T cells, myeloid cells and epithelial cells that cleanly stratifies both Crohn's disease severity and response to treatment.
The availability of comprehensive clinical, flow cytometric and scRNA-seq data from patients with pediCD and from uninflamed FGID controls created an unprecedented opportunity for comparative atlas creation. Applicants took the opportunity to develop a methodical, unbiased, approach to cell state discovery, ARBOL (github.com/jo-m-lab/ARBOL). ARBOL iteratively explores axes of variation in scRNA-seq data by clustering and subclustering until variation between cells becomes noise. The philosophy of ARBOL is that every axis of variation could be biologically meaningful so each should be explored, and that axes of variation are relative to the comparative outgroup, meaning that similar cell states may arise at distinct tiers. Once every possibility is explored, curation and a statistical interrogation of resolution are used to collapse clusters into the elemental transcriptomes of the dataset. ARBOL inherently builds a tree of subclustering events. As data is separated by major axes of variation in each subset, later rounds capture less pronounced variables. This comes with some caveats: variation shared by all cell types (for example, cell cycle stage) can make up one of the major axes of variation in the first round of clustering. Cell types can split up at the beginning, so the same splitting of B and T cells, for example, may happen further down in separate branches. The resulting tree of clustering events (
One of the primary remaining challenges going forward will be to identify which clusters are truly patient-unique and which are simply patient-unique at the cohort size to which Applicants are currently limited. Applicants calculate a diversity metric for each end cluster to highlight those which are largely conserved between patients, and provide complete cluster-defining gene lists for both FGID and pediCD at three levels of clustering. Applicants also provide links to the data visualization portal to enable cross-atlas comparisons: singlecell.broadinstitute.org/single_cell/study/SCP1422/predict-2021-paper-fgid and singlecell.broadinstitute.org/single_cell/study/SCP1423/predict-2021-paper-cd.
One of the chief advantages of enrolling pediCD patients at diagnosis, and prior to any therapeutic intervention, was that Applicants were able to relate their diagnostic immune landscape with disease trajectory. In the pediCD group, Applicants identified 3 clinical subgroups. The first distinction was made by treating physicians, and classified patients with milder versus more severe clinical disease characteristics at diagnosis. The milder patients were not placed on anti-TNF agents (NOA), while the more severe patients were treated with monoclonal antibodies that neutralize TNF including infliximab and adalimumab. The second distinction between patient groups could not be made at diagnosis, but rather, was based on clinical and biochemical response to anti-TNF agents. Thus, of those patients treated with anti-TNF therapeutics, some were FRs, and some were only PRs, with PRs requiring anti-TNF dose modifications and the addition of other agents, and with ongoing, uncontrolled disease signs and symptoms. While differences in anti-TNF pharmacokinetics have been partially implicated in the need to dose-escalate anti-TNF agents in some pediCD patients, the study identifies foundational differences in the immune state at diagnosis in PR patients compared to the NOA and FR subgroups (D'Haens and Deventer, 2021; Ordis et al., 2012; Yarur et al., 2016). While standard flow cytometry was not able to distinguish the immune phenotype of NOA versus treated patients, scRNA-seq identified significant differences. The contextualization of the scRNA-seq derived predictive cellular vector with two other treatment-naïve bulk RNA-seq studies of Crohn's disease underscores the broader applicability of the findings (Kugathasan et al., 2017; Verstockt et al., 2019).
Applicants noted significant cell state changes at diagnosis underlying clinically-appreciated disease severity that impacted the clinical decision to treat or not to treat with anti-TNF agents. These occurred within multiple clusters of T, NK, fibroblast, epithelial, monocyte, macrophage, and dendritic cells. For anti-TNF response, very few clusters exhibited significantly differential composition between FR and PR individuals. This suggests that multiple collective changes in several cell types may conspire to lead to differences in treatment outcomes. Indeed, when Applicants jointly considered a cellular principal component vector comprising epithelial cells, T/NK/ILCs, and myeloid cells, Applicants identified several clusters that together could delineate the full spectrum of NOA, FR, and PR. This cellular vector indicated that multiple T cell subsets, NK cells, monocytes, macrophages, and epithelial cells were altered in disease. Intriguingly, by finely clustering each cell type, Applicants found that proliferating T and NK cells do not represent a uniform population, but rather reflect functional specialization capturing FOXP3, IFNG, IL22, and GZMA as cluster-defining genes. Enriched in NOA individuals were epithelial cells involved in chemosensation (Tuft.GNAT3.TRPM5) and absorption of metabolites (EC.GSTA3.TMPRSS15), as well as stem cells (Banerjee et al., 2020; von Moltke et al., 2016; Sido et al., 1998). That pediCD severity is not uniquely predicted by a singular cell subset or gene is reflective of the complex genetics and environmental factors that have been implicated, along with the rich literature that has found significant changes by histology, flow cytometry, or mass cytometry in CD relative to control tissue (Buisine et al., 2001; Leeb et al., 2003; Leonard et al., 1995; Lilja et al., 2000; Mitsialis et al., 2020; Müller et al., 1998; Souza et al., 1999; Stappenbeck and McGovern, 2017; Takayama et al., 2010). 
However, with the PREDICT study, Applicants have discovered precisely which changes in CD cellular composition come together to form a predictive vector for both disease severity and treatment response. Intriguingly, the quantification and visualization of this response vector predicted a later escalation of one of the patients (p022; who appeared as an outlier FR in
When considering the relationships between T cells and NK cells along with epithelial cells, Applicants captured that proliferating cytotoxic NK cell subsets like CD.NK.MKI67.GZMA were significantly negatively correlated with critical metabolic and progenitor epithelial cell subsets in pediCD. Conversely, proliferating regulatory CD.T.MKI67.FOXP3 were positively associated with secretory epithelial cells in pediCD, but did not appear related to the decrease in metabolic or progenitor cells. How T cell-derived cytokines impact intestinal regeneration and differentiation has recently been the focus of several studies, but the relationship of these fine-grained T cell subsets with specific epithelial cell states observed in the human intestine remained unknown (Biton et al., 2018; Lindemans et al., 2015). This work suggests that in the context of the ileum impacted by CD that there is further complexity to understand, particularly as it pertains to cytotoxic NK cells and T cells and their impact on epithelial cell homeostasis and regeneration.
The mapping of these disease severity-associated cell networks identifies a host of new potential therapeutic targets for pediCD, for many of which there are clinical-stage therapeutics that could be investigated. These include CD40L-blocking antibodies, IL-22 agonists, and targeted anti-proliferation agents (Betts et al., 2017; Lindemans et al., 2015; Miura et al., 2021; Ramanujam et al., 2020; Sootome et al., 2020). A case can also be built for targeting inflammatory cytokines such as IL-1, and for interrogating agents aimed at mucosal healing including new anti-GM-CSF antibodies, given that several prominent cell subsets marked by CSF2 were enriched in the PR patients (Ai et al., 2021; Aschenbrenner et al., 2021; Castro-Dopico et al., 2020; Mehta et al., 2020; Mitsialis et al., 2020; Muro and Mrowiec, 2015). This atlas therefore provides a rigorous evidence-based rationale for proposing new therapeutic interventions, as well as a mechanism for interrogating the impact of new agents on the longitudinal immune landscape of pediCD patients.
Recent work on COVID-19 has also highlighted the challenges faced by systems approaches to capture baseline cell states that predict disease trajectory (Kaczorowski et al., 2017; Lucas et al., 2020; Mathew et al., 2020; Schulte-Schrepping et al., 2020; Su et al., 2020). In a disease of known infectious etiology with SARS-CoV-2, monocytes, macrophages, granulocytes, T cells, B cells, antibodies, and interferon state have all independently been associated with disease outcomes. Few studies have considered how multiple collective changes at baseline may influence outcome, yet are likely more reflective of the disease. With the complex and protracted presentation of a multifactorial disease like Crohn's disease, Applicants posit that multiple concerted effects are required to dictate both the severity (NOA vs FR/PR) and the treatment-response (FR vs PR). Additional work can consider which cell subsets are recovered during mucosal healing, and how closely the treated state reflects each individual patient's baseline presentation.
Pediatric patients less than 20 years of age with suspected inflammatory bowel disease were enrolled in the PREDICT Study (ClinicalTrials.gov #NCT03369353). Enrollment took place between Nov. 9, 2017 and Dec. 21, 2018 in accordance with an institutional review board-approved protocol, with written informed consent and assent when applicable. Patients diagnosed with Crohn's Disease (CD) were included, and patients without gut inflammation on endoscopy and histology, and who were diagnosed with Functional GI Disease (FGID), served as a comparative cohort for this study. Terminal ileum and blood samples were taken during the diagnostic endoscopy procedures prior to initiation of therapy. Patients diagnosed with other inflammatory or infectious etiologies on endoscopy and biopsy were excluded from the analysis.
Clinical course and variables were monitored at the time of enrollment and for 3 years after initial endoscopy, with median follow up for CD being 32.5 months and FGID being 31 months at the time of clinical database lock (Dec. 1, 2020). Medical management was dictated by clinicians. Clinical variables obtained included sex, race, age at diagnosis, weight z-score, height z-score, BMI z-score, clinical disease severity using the Pediatric Crohn's Disease Activity Index (PCDAI), and disease location and phenotype using the Montreal Criteria (Hyams et al., 1991; Silverberg et al., 2005). Laboratory evaluation included C-reactive protein, ESR, hemoglobin, albumin, white blood cell count, and platelet count.
Early anti-TNF or immunomodulator therapy was defined as initiation of immunosuppression within 90 days of diagnostic endoscopy. Anti-TNF monoclonal antibody was started in 10 patients with CD. All patients were followed prospectively and categorized as full responders (FR), partial responders (PR), or not on anti-TNF (NOA). Full response to anti-TNF is defined as clinical symptom control and biochemical response with wPCDAI score of <12.5 on maintenance anti-TNF therapy and partial response defined as lack of clinical symptom control and biochemical response with documented escalation of anti-TNF therapy.
Clinical variables are expressed as median (lower and upper confidence interval; range) and compared using the Mann-Whitney U test. Categorical variables were described as frequencies and percentages and compared using the chi-square test. Clinical laboratory values are represented by mean and standard error of the mean (range) and compared with the Mann-Whitney U test. Significance is indicated by a P value of <0.05. Clinical statistical analysis was performed using GraphPad Prism version 8.3.0.
Tissue Dissociation into Single-Cell Suspensions
Human Ileum. Single-cell suspensions were collected from intestinal biopsies using a modified version of a previously published protocol (Persson et al., 2013) as described in (Smillie et al., 2019). One biopsy from the terminal ileum was received directly in hand and processed with an average time from patient to loading on the 10× Chromium platform of 2.5 total hours, and never exceeding 3.5 hours. While intact, biopsy bites were handled using a P1000 pipette applying gentle suction, and all centrifugation steps done in a temperature controlled 4° C. centrifuge. Biopsy bites were first rinsed in 30 mL of ice-cold PBS (ThermoFisher 10010-049) and allowed to settle. Each individual bite was then transferred to 10 mL epithelial cell solution (HBSS Ca/Mg-Free [ThermoFisher 14175-103], 10 mM EDTA [ThermoFisher AM9261], 100 U/ml penicillin [ThermoFisher 15140-122], 100 μg/mL streptomycin [ThermoFisher 15140-122], 10 mM HEPES [ThermoFisher 15630-080], and 2% FCS [ThermoFisher 10082-147]) freshly supplemented with 200 μL of 0.5M EDTA. Separation of the epithelial layer from the underlying lamina propria was performed for 15 minutes at 37° C. with rotation at 120RPM. The tube was then removed and placed on ice immediately for 10 minutes before shaking vigorously 15 times. Visual macroscopic inspection of the tube at this point yielded visible epithelial sheets, and microscopic examination confirmed the presence of single-layer sheets and crypt-like structures.
The remnant tissue bite was carefully removed and placed into a large volume of ice-cold PBS to rinse before transferring to 5 mL of enzymatic digestion mix (Base: RPMI1640, 100 U/ml penicillin [ThermoFisher 15140-122], 100 μg/mL streptomycin [ThermoFisher 15140-122], 10 mM HEPES [ThermoFisher 15630-080], 2% FCS [ThermoFisher 10082-147], and 50 μg/mL gentamicin [ThermoFisher 15750-060], freshly supplemented immediately before use with 100 μg/mL of Liberase TM [Roche 5401127001] and 100 μg/mL of DNase I [Roche 10104159001]), at 37° C. with 120 rpm rotation for 30 minutes. During this 30-minute lamina propria (LP) digestion, the epithelial (EPI) fraction was spun down at 400 g for 7 minutes and resuspended in 1 mL of epithelial cell solution before transferring to a 1.5 mL Eppendorf tube in order to minimize time spent centrifuging and provide a more concentrated cell pellet. Cells were spun down at 800 g for 2 minutes and resuspended in TrypLE express enzyme [ThermoFisher 12604013] for 5 minutes in a 37° C. bath followed by gentle trituration with a P1000 pipette. Cells were spun down at 800 g for 2 minutes and resuspended in ACK lysis buffer [ThermoFisher A1049201] for 3 minutes on ice to remove red blood cells, even if no RBC contamination was visibly observed, in order to maintain consistency across samples. Cells were spun down at 800 g for 2 minutes and resuspended in 1 mL of epithelial cell solution and placed on ice for 3 minutes before triturating with a P1000 pipette and filtering into a new Eppendorf tube through a 40 μm cell strainer [Falcon/VWR 21008-949]. Cells were spun down at 800 g for 2 minutes and then resuspended in 200 μL of epithelial cell solution and placed on ice while final steps of LP dissociation occurred. After 30 minutes, the LP enzymatic dissociation was quenched by addition of 1 ml of 100% FCS [ThermoFisher 10082-147] and 80 μL of 0.5M EDTA and placing on ice for five minutes.
Samples were typically fully dissociated at this step and, after gentle trituration with a P1000 pipette, were filtered through a 40 μm cell strainer into a new 50 mL conical tube and rinsed with PBS to 30 mL total volume. This tube was spun down at 400 g for 10 minutes and resuspended in 1 mL of ACK and placed on ice for 3 minutes. LP cells were spun down at 800 g for 2 minutes and resuspended in 1 mL of epithelial cell solution and spun down at 800 g for 2 minutes and resuspended in 200 μL of epithelial cell solution and placed on ice. Following centrifugation, the cells from both EPI and LP fractions were counted and prepared as a single-cell suspension for scRNA-seq. Since the full EPI isolation was not performed on all patients, limiting sample sizes, here Applicants focus the analysis on LP fractions.
Multicolor flow cytometry was performed on tissue samples to examine the immune composition for enrolled patients. FlowJo software was used to phenotypically define cell populations that will be analyzed and compared in patients using two-way ANOVAs (or non-parametric equivalent). Antibodies used include: CD3 APC, SP34-2 (BD Biosciences); CD3 BUV661, UCHT1 (BD Biosciences); CD3 BV711, OKT3, (Biolegend); CD3 PE, SP34 (BD Biosciences); CD4 BV785, OKT4 (Biolegend); CD8a BUV395, RPA-T8 (BD Biosciences); CD8b FITC, REA715 (Miltenyi Biotec); CD11b APC-Cy7, ICRF44 (BD Biosciences); CD11c APC-eFluor 780, BU15 (Fisher Scientific); CD11c BUV661, B-ly6 (BD Biosciences); CD14 APC-eFluor 780, 61D3 (Fisher Scientific); CD14 BUV737, M5E2 (BD Biosciences); CD20 APC-eFluor 780, 2H7 (Fisher Scientific); CD20 PE-Cy7, L27 (BD Biosciences); CD38 APC, HIT2 (BD Biosciences); CD45 PerCP/Cy5.5, HI30 (Biolegend); CD45RA BV605, HI100 (Biolegend); CD56 (NCAM) FITC, TULY56 (Fisher); CD94 APC-Vio770, REA113 (Miltenyi Biotec); CD117 (c-kit) BV421, 104D2 (Biolegend); CD123 BV711, 9F5 (BD Biosciences); CD127 Biotin, HIL-7R-M21 (BD Biosciences); CD161 BV711, DX12 (BD Biosciences); CD197 (CCR7) BV421, G043H7 (Biolegend); CD294 (CRTH2) BV605, BM16 (Biolegend); CD326 (Epcam) APC, HEA-125 (Miltenyi Biotec); HLA-DR APC-H7, L243 (G46-6) (BD Biosciences); TCR PAN γδ PE-Cy7, IMMU510 (Beckman Coulter); α4-β7 integrin (Act-1), (NIH AIDS Reagent Program); Streptavidin BUV737 (Fisher); Live/dead Fix Aqua (Fisher); R-PE Antibody Labeling Kit (300 mcg) (Abcam).
10× v2 3′. Single cells were loaded onto 3′ library chips as per the manufacturer's protocol for Chromium Single Cell 3′ Library (v2) (10× Genomics). The LP fraction was captured in its own channel of the 10× Chromium Single Cell Platform, in order to recover sufficient numbers of cells for downstream analyses. An input of 10,000 single cells was added to each channel with a recovery rate of 9,514 cells per sample based on median across samples. Briefly, single cells were partitioned into Gel Beads in Emulsion (GEMs) in the Chromium controller with cell lysis and barcoded reverse transcription of RNA, followed by cDNA amplification, enzymatic fragmentation and 5′ adaptor and sample index attachment. Libraries were sequenced on a HiSeq or NovaSeq flow cell. The read structure was paired end with length of read 1 26 bp, length of read 2 91 bp, and the length of index 1 (i7 primer) 8 bp. Quality-filtered base calls were converted to demultiplexed FASTQ files.
FASTQ files were aligned to GRCh38 using the Cellranger v2.2 pipeline on the Cumulus/Terra cloud pipeline (portal.firecloud.org/?return=firecloud#methods/cumulus/cellranger_workflow/10), generating 27 cell-by-gene matrices (13 FGID, 14 CD), one for each patient. Applicants used default parameters of the 10th snapshot version of the pipeline, aside from requiring that it use cellranger v2.2.0.
Every sample was first filtered, excluding genes measured in fewer than 3 cells and cells with fewer than 200 unique genes. To control for doublets and low-quality cells, Applicants then further filtered each sample individually, attempting to match the approximately 10,000 cells loaded onto the sample lane and balancing the thresholds so as not to cut out dense regions of an Ncounts-by-Nfeatures scatter plot. Pre-filtering, Applicants looked for outlier samples based on proportion of mitochondrial genes, number of counts, and number of features; none fell beyond the 1.5 times IQR threshold.
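The initial filter and the outlier check described above can be sketched as follows. This is an illustrative sketch under stated assumptions, not the pipeline Applicants ran: the function names `filter_counts` and `iqr_outliers` and the dictionary-based matrix layout are hypothetical, and applying the gene filter before the cell filter is an assumed order. The quartile convention is that of Python's `statistics.quantiles`, which may differ slightly from other tools.

```python
import statistics
from collections import Counter


def filter_counts(counts, min_cells=3, min_genes=200):
    """Drop rarely detected genes, then cells with too few detected genes.

    counts: {cell_id: {gene: umi_count}} holding nonzero entries.
    Assumption: genes are filtered first, and the per-cell gene tally is
    taken on the genes that survive that first filter.
    """
    # Number of cells in which each gene is detected.
    cells_per_gene = Counter(
        gene for genes in counts.values() for gene, n in genes.items() if n > 0
    )
    kept_genes = {g for g, n in cells_per_gene.items() if n >= min_cells}

    filtered = {}
    for cell, genes in counts.items():
        kept = {g: n for g, n in genes.items() if n > 0 and g in kept_genes}
        if len(kept) >= min_genes:
            filtered[cell] = kept
    return filtered


def iqr_outliers(values, k=1.5):
    """Return values lying more than k * IQR outside the quartiles."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


if __name__ == "__main__":
    # Toy per-sample summary values (hypothetical numbers).
    print(iqr_outliers([1, 2, 3, 4, 5, 6, 7, 8, 100]))
```

The sample-specific thresholds applied afterwards were chosen per sample from the scatter plots, so they are not reproduced by this sketch.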
Exact thresholds used for each sample:
Post filtering, Applicants merged sample matrices using an outer join to create an FGID dataset (115,569 cells) and a CD dataset (139,342 cells).
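As a small illustration of what an outer join of per-sample matrices implies, the sketch below takes the union of genes across samples and fills genes absent from a sample with zero. The function name `outer_join_samples`, the nested-dictionary layout, and the convention of prefixing barcodes with the sample identifier are all hypothetical choices made for this example.

```python
def outer_join_samples(samples):
    """Merge per-sample count matrices over the union of their genes.

    samples: {sample_id: {cell_barcode: {gene: count}}}
    Returns (genes, merged), where genes is the sorted union of gene names
    and merged maps "sample_barcode" to a dense row aligned to genes.
    Genes not measured in a given sample are filled with 0 (outer join).
    """
    genes = sorted(
        {g for cells in samples.values() for row in cells.values() for g in row}
    )
    merged = {}
    for sample_id, cells in samples.items():
        for barcode, row in cells.items():
            # Prefix with the sample id so identical barcodes from
            # different samples do not collide.
            merged[f"{sample_id}_{barcode}"] = [row.get(g, 0) for g in genes]
    return genes, merged


if __name__ == "__main__":
    toy = {
        "p1": {"AAAC": {"CD3E": 2}},
        "p2": {"AAAC": {"CD3E": 1, "EPCAM": 4}},
    }
    genes, merged = outer_join_samples(toy)
    print(genes, merged)
```

An outer join, unlike an inner join, retains genes that passed filtering in only some samples, so no sample-specific gene is lost at the merge step.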
Preprocessing & Clustering of scRNA-Seq Data
Applicants began initial analysis following traditional clustering and annotation techniques; however, these methods using manual and at times subjective metrics scaled poorly to the size and scope of the dataset and moreover did not give clear distinction between disease specific cell states and compositional shifts within cell states across disease.
For the first pass at analysis, Applicants grouped the FGID and CD datasets together (254,911 cells) and proceeded with the standard Seurat v3.1.5 pipeline (Stuart et al., 2019). Applicants used manual heuristics of gene marker specificity to choose cluster resolution and isolate 9 major cell types (T, B, plasma, epithelial, endothelial, fibroblast, myeloid, mast cell, and glial) and 1 aggregate cluster of T, B, myeloid, and epithelial cells with a strong proliferation signature. Applicants then subclustered the proliferating group and manually merged the proliferating cells with their corresponding cell type based on marker gene expression, and separately re-preprocessed and clustered each cell type annotating based on one vs. rest differential expression (Wilcoxon, fdr<0.05) within the cell type.
Applicants found several disadvantages to this approach. First, Applicants found it difficult to determine for each cluster whether Applicants should be looking for changes in compositional frequency or gene expression. Particularly within the myeloid major cell group, Applicants would find extremely disease-biased sub-clusters, with as much as a 9:1 ratio of CD to FGID. It was unclear whether there was a massive compositional shift within a conserved cell state or if instead a base cell state was split into multiple clusters based on phenotypic differences in disease, in which case Applicants should perform a differential expression test between it and neighboring FGID-biased clusters. Second, after two rounds of manual processing Applicants were still unsure if Applicants had reached a base level with each end cluster corresponding to a unique and biologically homogeneous cell state. Third, at that point, having partitioned over 100 distinct clusters, individually supervising each subset's processing and sub-clustering was infeasible. Applicants needed a more systematic method to address these challenges.
It is common to organize cell identity ontologies in a tree structure. With major groups such as immune, stromal, and epithelial at the top and branching down a level, you might set more nuanced identities like T, B, endothelial and goblet cell types as a second tier, and even more nuanced identities like CD4+ vs CD8+ T cells as a third. In ideal circumstances, this mental model conforms well to RNA-seq data where Applicants can layer gene modules with more and more specific variation together to describe highly particular cell identities and states. And, by clustering at a high level with genes that vary across the entire dataset, then sub-clustering with genes that vary only within a particular parent cluster, Applicants are able to uncover this hierarchy of cell identity. Reality is of course much messier than theory, and many additional factors beyond cell identity contribute to the variation in gene expression within actual datasets, particularly disease condition, as Applicants found during the first approach.
To be able to choose the appropriate future analyses and comparisons, Applicants need a highly accurate representation of cell identity and state. The underlying issue in the first pass at clustering was that in combining the disease conditions together, the variable genes selected at each stage represented a combination of differences between cell identity & disease. This combination could have been manageable if either disease or cell identity were consistently more variable: Applicants could then isolate one factor at a specific tier in the hierarchy before sub-clustering to isolate the other. In this case, disease and cell identity both had many overlapping scales of variation. To address this problem, Applicants isolated cell identity by separating the dataset by disease and clustering for cell identity within each disease set (FGID 115,569 cells, CD 139,342 cells). This approach did then require Applicants to perform an additional stage of analysis to find corresponding clusters between the two datasets, but allowed for far more effectively distinguishing type, scale, and specificity of disease differences.
Within each disease set Applicants still needed a method to ensure Applicants were reaching the bottom level of biological heterogeneity, and preferably an automated method as the first pass had shown the potential for isolating hundreds of cell states. To efficiently cluster and isolate these cell states Applicants wrote a cloud-based pipeline to systematically optimize parameter selection and stop when biological heterogeneity is exhausted. Homogenous cell subsets were isolated by recursively normalizing, selecting variable genes, and clustering based on silhouette score. Applicants stopped recursing into sub-clusters once Applicants reached one of four end conditions defined as:
Having a group of less than 100 cells (though Applicants did partition many clusters smaller than 100 cells after clustering groups just larger than that cutoff).
Isolating an optimized clustering of only one cluster.
Finding two clusters that have fewer than 25 genes (fdr<0.05 & |log fold change|>1.5 & percent expression>=20% in at least one cluster) differentially upregulated between each cluster using a bimodal test developed in Shekhar et al., 2016 (doi: 10.1016/j.cell.2016.07.054). If this condition is reached, Applicants reject the clustering and return the cells as a single end cluster.
Having reached a max tier limit. Setting this value to 10, Applicants never triggered this condition with either FGID or CD datasets, but included it to prevent runaway recursion.
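The four end conditions listed above can be expressed as a short recursive sketch. This is a simplified illustration, not the ARBOL implementation: `cluster_fn` (which returns a list of sub-clusters for a group of cells) and `count_de_genes` (which returns the number of differentially upregulated genes between two clusters) are hypothetical stand-ins for the normalization, clustering, and differential expression steps, and how the gene count is aggregated across the two clusters is left to that callable. The path labels follow the tier/cluster naming pattern used for end clusters in this document.

```python
def tiered_clustering(cells, cluster_fn, count_de_genes, tier=0, path="T0C0",
                      min_cells=100, min_de_genes=25, max_tier=10):
    """Recursively sub-cluster until an end condition is met.

    Returns {path_label: cells} for every end cluster.
    """
    # End condition: group smaller than the minimum size.
    if len(cells) < min_cells:
        return {path: cells}
    # End condition: maximum tier reached (guards against runaway recursion).
    if tier >= max_tier:
        return {path: cells}

    clusters = cluster_fn(cells)

    # End condition: the optimized clustering yields only one cluster.
    if len(clusters) <= 1:
        return {path: cells}
    # End condition: two clusters separated by too few DE genes; the split
    # is rejected and the cells are returned as a single end cluster.
    if len(clusters) == 2 and count_de_genes(clusters[0], clusters[1]) < min_de_genes:
        return {path: cells}

    end_clusters = {}
    for index, members in enumerate(clusters):
        child_path = f"{path}.T{tier + 1}C{index}"
        end_clusters.update(
            tiered_clustering(members, cluster_fn, count_de_genes, tier + 1,
                              child_path, min_cells, min_de_genes, max_tier)
        )
    return end_clusters


if __name__ == "__main__":
    # Toy stand-ins: split large groups in half, always report 30 DE genes.
    def split_halves(group):
        if len(group) <= 200:
            return [group]
        return [group[: len(group) // 2], group[len(group) // 2:]]

    result = tiered_clustering(list(range(400)), split_halves, lambda a, b: 30)
    print({label: len(members) for label, members in result.items()})
```

Because the size check runs before clustering, a group just above the cutoff can still be split into children smaller than 100 cells, which matches the note under the first end condition.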
Code for generating this tree of cell clusters is currently available here: (jo-m-lab.github.io/ARBOL/ARBOLtutorial.html). Within each recursion, the established steps were processed using Seurat version 3.1.5 (github.com/satijalab/seurat). Normalization and variable gene selection were processed with SCTransform (github.com/ChristophH/sctransform) (Hafemeister and Satija, 2019). Clustering for major cell types was performed using Louvain clustering on dimensionally reduced principal components.
Parameters depend on the size of the dataset, and thus must be adjusted based on how many cells are being partitioned at each recursive step. When calculating nearest neighbor graphs and clustering, Applicants set the K parameter to ‘ceiling(0.5*sqrt(N))’. Applicants chose the number of principal components based on the top 15 percentile of calculated improvement of variance explained. For subsets of fewer than 500 cells, Applicants used Jackstraw to calculate significant principal components. If neither method succeeded, Applicants chose the first two principal components. Applicants set clustering resolution via a grid search optimizing for maximum average silhouette score. (Silhouette compares each cell's mean intra-cluster distance with its mean distance to the nearest other cluster; a high score means highly distinct clusters.) For stages where Applicants were clustering more than 500 cells, a randomized subsample of N cells/10 was used to calculate the average silhouette score.
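The size-dependent K parameter and the silhouette-based resolution search can be sketched as below. This is a minimal sketch using the standard silhouette definition with Euclidean distance; it does not reproduce the Seurat-based pipeline. The names `neighbors_k`, `mean_silhouette`, and `best_resolution` are hypothetical, as is `cluster_at`, a caller-supplied callable assumed to return one cluster label per cell for a given resolution.

```python
import math
import random


def neighbors_k(n_cells):
    """K for the nearest neighbor graph: ceiling(0.5 * sqrt(N))."""
    return math.ceil(0.5 * math.sqrt(n_cells))


def mean_silhouette(points, labels):
    """Average silhouette score; requires at least two clusters."""
    clusters = {}
    for index, label in enumerate(labels):
        clusters.setdefault(label, []).append(index)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")

    scores = []
    for i, label in enumerate(labels):
        own = [j for j in clusters[label] if j != i]
        if not own:  # singleton cluster scores 0 by convention
            scores.append(0.0)
            continue
        # a: mean distance to the other members of the same cluster.
        a = sum(math.dist(points[i], points[j]) for j in own) / len(own)
        # b: mean distance to the nearest other cluster.
        b = min(
            sum(math.dist(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items() if other != label
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)


def best_resolution(points, resolutions, cluster_at, seed=0):
    """Grid search for the resolution with the highest mean silhouette.

    For more than 500 cells, the score is computed on a random subsample
    of N/10 cells, as described in the text.
    """
    rng = random.Random(seed)
    indices = list(range(len(points)))
    if len(points) > 500:
        indices = rng.sample(indices, len(points) // 10)

    best = None
    for resolution in resolutions:
        labels = cluster_at(resolution)
        sub_labels = [labels[i] for i in indices]
        if len(set(sub_labels)) < 2:
            continue  # a single cluster cannot be scored
        score = mean_silhouette([points[i] for i in indices], sub_labels)
        if best is None or score > best[1]:
            best = (resolution, score)
    return best
```

Subsampling keeps the pairwise distance computation tractable, since the cost of the silhouette score grows quadratically with the number of cells scored.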
Additionally, at each recursive step Applicants output quality metrics and basic plots, such as 1:rest differential expression from the optimal partitioning at each stage and UMAP representations painted by sample metadata (sample ID, cluster number). The pipeline saved output as a directory structure matching the tree discovered by this recursive clustering. This tree represents the lower levels of variance discovered at each tier. At any tier level Applicants are able to extract the cell's partitioning. Due to the intermixing of patient and cell identity effects at multiple levels of the tree (a fraction of a single patient's cells might separate out at a high level, but then continue to separate into identifiable cell types, or vice versa), Applicants found the most meaningful levels at the top and bottom of the tree. The clustering tree is useful for understanding the levels of variance in the dataset, but Applicants found it contains too much noise to be easily interpretable. Thus, Applicants later generated a hierarchical clustering of the bottom level clusters based on pairwise differential expression, which is displayed in figures (
Cell Type and Subset Annotation from Tiered Clustering
After running the hierarchical tiered clustering pipeline, Applicants manually curated the generated tree of clusters. Specifically, tree generation was reinitiated for the B Cells within the FGID dataset, as it had stopped at the first tier on two clusters with <50 genes differentially expressed; however, Applicants could see in this case that there was additional biological stratification based on strong differential expression of CXCR4 (Wilcoxon; log FC=1.22860917, Bonferroni.p=1.0E-300), CD69 (Wilcoxon; log FC=1.27527652, Bonferroni.p=2.99E-151), HMGN1 (Wilcoxon; log FC=−1.1688612, Bonferroni.p=1.62E-227) and HMGA1 (Wilcoxon; log FC=−1.28838294, Bonferroni.p=1.06E-209) among others. This formed a clear divide between non-proliferating and proliferating B Cells, further validated by a clear separation within the UMAP based on PCA reduced variable genes within the B cells. Applicants further examined each branching point of the tree to determine its splitting cause, noting splits based on spillover, doublet, and singular patient effects. Splits at higher tiers based on doublets often split again, allowing Applicants to recover cells that did not have the dual expression profile. Splits that only had patient splits below (measured by having only clusters of single patients) were manually marked as end clusters, thereby merging all clusters below that split. With these manual steps made, Applicants performed pairwise differential expression to ensure each partitioned subset is distinct from its neighbors.
Applicants annotated these final clusters with four methods, attempting to balance descriptiveness, ease of understanding, and ease of name generation. The first method is generated during the hierarchical tiered clustering by following the path from the end cluster up to the original tier. An example annotation is T0C0.T1C3.T2C3.T3C5 marking an end cluster that split at tier 1 into cluster 0 and at tier 2 into cluster 3. These annotations do not provide any biological information to the reader, but do provide a unique ID for the end cluster. The second method is far more descriptive: Applicants manually annotate the main reason for each particular split. This still follows the original ranking of variation as found by the hierarchical tiered clustering, while also providing biological interpretation. An example is CD.Mloid.macrophage_chemokine.S100A8 S100A9_CXCL9_CXCL10_TNF_inflamonocyte.
This method of annotation was particularly useful during analysis, as Applicants were immediately able to see how early or late two clusters had split from each other, as well as a number of the subset-defining genes. Unfortunately, this method also produces extremely long names that are difficult to display and refer to. It is also a highly manual process and difficult to reproduce precisely. To better present the findings and aid others in reproducing the results, the third method automates this annotation. This method is performed by taking each major cell type, which in this case matched the tier one splits, and performing 1:rest differential expression testing (Wilcoxon; adj.p<0.05, only.pos=True) within each major cell type. Applicants then ranked the genes based on the product of ‘−log(Bonferroni.p)’, ‘avg_logFC’, and ‘pct.exp.1/pct.exp.rest’ and took the top 5, forming a name like CD.Mloid_CCL3_CCL4_CCL3L3_TNF_TNFAIP6. This scheme was again useful, but did not quite meet the demands of recognizability and brevity. Thus, for T and myeloid cells, Applicants adjusted these names to a finer degree of specificity by visualizing the expression profiles of each subset with a dotplot of canonical marker genes drawn from current literature, and limiting to the top 2 genes based on the method 3 rankings and the dotplot of canonical markers, thereby producing the fourth and final annotations in the form CD.Mono.CXCL10.TNF. Due to the limited current characterization of stromal and epithelial cells, Applicants were unable to match the degree of specificity achieved for the T and myeloid cells; however, where possible Applicants did adjust from the major cell type to the most specific label that Applicants could be confident of, for instance adjusting “Epith” to “Goblet” based on marker expression of TFF3 and MUCN13.
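By way of non-limiting illustration, the automated naming of the third method may be sketched as follows. The function names, the dictionary keys, the floor applied to p-values reported as zero, and the example values are all hypothetical and are not taken from the pipeline itself; the base of the logarithm is likewise an assumption, although it does not affect the ranking because it only rescales every score by the same constant.

```python
import math


def rank_genes(de_rows, top_n=5, p_floor=1e-300, pct_floor=1e-3):
    """Rank 1:rest differential expression rows by
    -log(Bonferroni p) * avg_logFC * (pct expressing in subset / pct in rest).

    p_floor and pct_floor are illustrative guards against log(0) and
    division by zero; they are assumptions, not values from the source.
    """
    scored = []
    for row in de_rows:
        p = max(row["bonferroni_p"], p_floor)
        pct_ratio = row["pct_in"] / max(row["pct_out"], pct_floor)
        score = -math.log10(p) * row["avg_logFC"] * pct_ratio
        scored.append((score, row["gene"]))
    # Highest combined score first.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [gene for _, gene in scored[:top_n]]


def subset_name(prefix, de_rows, top_n=5):
    """Join a condition/cell-type prefix with the top-ranked genes."""
    return "_".join([prefix] + rank_genes(de_rows, top_n))


# Made-up rows purely for illustration.
example_rows = [
    {"gene": "CCL3", "bonferroni_p": 1e-50, "avg_logFC": 2.0, "pct_in": 0.8, "pct_out": 0.1},
    {"gene": "CCL4", "bonferroni_p": 1e-40, "avg_logFC": 1.5, "pct_in": 0.6, "pct_out": 0.2},
    {"gene": "TNF", "bonferroni_p": 1e-10, "avg_logFC": 1.0, "pct_in": 0.5, "pct_out": 0.25},
    {"gene": "ACTB", "bonferroni_p": 1.0, "avg_logFC": 0.1, "pct_in": 0.9, "pct_out": 0.9},
]
print(subset_name("CD.Mloid", example_rows, top_n=3))
```

Multiplying the three terms rewards genes that are simultaneously significant, strongly up-regulated, and specific to the subset, so a gene that scores highly on only one axis tends to rank lower.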
At this point Applicants had generated a hierarchical representation of the datasets from the top down, showing the splits of highest variation at every level. By necessity, this means that each level is controlled by and represents a different selection of genes, which may have no relation to the genes selected in another branch. To understand the relations of cell subsets and compare across cell types, Applicants needed a unified set of genes. For each dataset (FGID and CD) Applicants performed pairwise differential expression (Wilcoxon; Bonferroni.p<0.01) and selected the top 50 most significant genes from each test. Gene lists were merged as a union, finding 4445 unique genes for FGID and 1760 unique genes for CD that best differentiate the subsets. Subset centers were calculated from these selected genes as the median expression of cells grouped by subset. The resulting table was then hierarchically clustered using correlation distance and complete linkage. Clustering was performed in R using the pvclust package (github.com/cran/pvclust).
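As a hedged, non-limiting sketch of the unified gene space described above, the union of top genes, the median subset centers, and the correlation distance between two centers might be computed as follows. The function names and input layouts are hypothetical; the linkage step itself (complete linkage, performed with pvclust in R) is not reproduced here.

```python
import math
from statistics import mean, median


def union_top_genes(pairwise_results, top_n=50, p_max=0.01):
    """Union of the top_n most significant genes from each pairwise test.

    pairwise_results: list of tests, each a list of (gene, bonferroni_p).
    """
    selected = set()
    for result in pairwise_results:
        significant = sorted((p, gene) for gene, p in result if p < p_max)
        selected.update(gene for _, gene in significant[:top_n])
    return sorted(selected)


def subset_centers(expr, labels, genes):
    """Median expression per subset over the selected genes.

    expr: dict cell -> dict gene -> value; labels: dict cell -> subset.
    """
    by_subset = {}
    for cell, subset in labels.items():
        by_subset.setdefault(subset, []).append(cell)
    return {
        subset: [median(expr[cell][gene] for cell in cells) for gene in genes]
        for subset, cells in by_subset.items()
    }


def correlation_distance(u, v):
    """1 - Pearson correlation; undefined for constant vectors."""
    mu, mv = mean(u), mean(v)
    cov = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    su = math.sqrt(sum((a - mu) ** 2 for a in u))
    sv = math.sqrt(sum((b - mv) ** 2 for b in v))
    return 1.0 - cov / (su * sv)
```

The pairwise distances between subset centers would then be passed to a complete-linkage routine to build the bottom-up tree.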
The resulting tree shows, from the bottom up, the relationships between cell subsets, and allows cell subsets that were potentially misclassified at a high split in hierarchical tiered clustering to find their biologically neighboring subsets. As previously mentioned within the description of hierarchical tiered clustering, Applicants did not find any end cluster subsets that met the thresholds for merging. This does not mean that Applicants did not observe shuffling from the initial tiered splits. While overall there was good agreement between the two methods, Applicants noted subsets jumping between major cell types as defined by the first splits of the tiered clustering. Applicants identified the majority of these jumping subsets as doublet clusters by exploring their differential gene results at multiple levels of the tiered clustering tree. Applicants removed these doublet subsets, and others, based on expression programs that flipped between tiers: for instance, a cluster resembling T cells by expressing TRAC and IL7R within an epithelial cluster, then expressing KRT18 and PIGR at the next tier. After removing doublets, Applicants recalculated subset distances and dimensional reductions, as presented in the main figures.
Separating the data on disease condition into two datasets was important as it allowed Applicants to isolate the axis of cell identity within each disease and be confident in the homogeneity of each subset.
The first attempt to find corresponding clusters followed the methods of Tasic et al. 2016. Applicants used the best differentiating gene sets created for the unified gene space clustering as the mapping space for a nearest-neighbor classifier. For each cell within a disease condition, Applicants could map it to the nearest cell subset within the other disease condition. As a trial run, Applicants created this gene space for each major cell type of the FGID disease condition and performed 5-fold cross-validation.
Applicants further used an automated system to choose genes as the most significantly differentially expressed genes in order to create enough separation between cluster centers to effectively classify new cells. Applicants chose to use a random forest classifier as it allowed Applicants to train for the optimal selection of genes, required little to no preparation of data, and provided probabilities of each cell being predicted to each class. These per-class probabilities proved particularly useful due to a second realization. Because the number of subsets differs between disease conditions, Applicants cannot assume that there is a one-to-one relationship between conditions. Nor can Applicants assume that the many-to-one relationships are unidirectional, with one base subset splitting into many states only from FGID toward CD. A single classifier would not allow Applicants to distinguish between these many types of relationships. However, Applicants realized that by creating a classifier for both directions (FGID to pediCD and pediCD to FGID) Applicants could take advantage of the difference in confidence between the two classifications to discover the direction and type of relationship. For 1:1 relationships, Applicants would expect all cells of subset A in condition X to match with 100% confidence to subset A in condition Y, and vice versa. In that particular case the summed probability would equal 2 and there would be zero difference between the confidence of one classifier and that of its matching classifier. For non-1:1 relationships, Applicants might instead see 90% of cells of subset A in condition X matching with >85% confidence to subset B in condition Y, and only 30% of subset B in condition Y matching with >85% confidence to subset A. From this discrepancy Applicants can infer that subset A may be a cell state in condition X that is layered on top of a base state B in condition Y. Low confidence in both directions indicates subsets unique to a particular condition.
After these realizations Applicants trained random forest classifiers for each cell type in each disease condition using SciKit-Learn v0.22.2, with the intent to classify each cell to the subset in the opposing dataset to which the cell is most similar (Pedregosa et al., 2011). For each cell type Applicants optimized a classifier for accuracy using a grid search over number of trees, depth, number of features, criterion, and min samples per leaf, with 5-fold cross-validation for each set of tuning parameters. Applicants never observed full overfitting, where the accuracy on test folds begins to drop with increased model size, but Applicants did quickly find diminishing returns as model size increased. For simplicity, and because optimal tuning parameters were robust to overfitting, Applicants chose to use the same largest model parameters for all models (number of trees=500, depth=200, number of features=sqrt, criterion=gini, min samples per leaf=1). The initial training rounds found accuracies in the mid-60% range, a definite increase over the NN classifier but not high enough for Applicants to be confident in the results. Applicants eventually determined the main issue to be the uneven class distributions (far more cells in subset A than subset B), which caused the smaller subsets to be undertrained. To compensate, Applicants up-sampled with replacement each subset within the training fold to contain at least the 75th quantile number of cells. This single change produced the largest improvement in accuracy on the unmodified test fold, varying from 5-15% improvement in accuracy, precision, and recall across cell types, and provided accuracies ranging from the high 70s to the low 90s percent per major cell type.
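The up-sampling step may be illustrated by the following minimal sketch, which reflects one reading of the description above: the target size is taken as the 75th percentile of the per-subset cell counts in the training fold, and any smaller subset is resampled with replacement up to that target. The function name, the inclusive quantile method, the rounding, and the fixed seed are assumptions made for illustration only.

```python
import math
import random
from collections import defaultdict
from statistics import quantiles


def upsample_training_fold(cells, labels, seed=0):
    """Resample small subsets (with replacement) up to the 75th-percentile size.

    cells: list of per-cell feature records; labels: parallel list of subsets.
    Only the training fold should be passed in; the test fold stays unmodified.
    """
    rng = random.Random(seed)
    by_subset = defaultdict(list)
    for cell, label in zip(cells, labels):
        by_subset[label].append(cell)

    sizes = [len(members) for members in by_subset.values()]
    if len(sizes) > 1:
        # Third quartile of subset sizes, rounded up to a whole cell count.
        target = math.ceil(quantiles(sizes, n=4, method="inclusive")[2])
    else:
        target = sizes[0]

    out_cells, out_labels = [], []
    for label, members in by_subset.items():
        extra = max(0, target - len(members))
        # Original cells are kept; only the shortfall is drawn with replacement.
        resampled = members + rng.choices(members, k=extra)
        out_cells.extend(resampled)
        out_labels.extend([label] * len(resampled))
    return out_cells, out_labels
```

The resampled fold would then be used to fit the classifier, while accuracy, precision, and recall are still evaluated on the held-out fold so that duplicated cells cannot inflate the reported metrics.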
Applicants ran the random forest model across the disease conditions. Applicants trained each random forest model with optimized parameters on all folds of its dataset, then obtained probability predictions for each cell of the opposing disease condition against the subsets of the trained disease condition. With these class probabilities per cell, Applicants could aggregate for each disease condition by taking the mean class probabilities for each group, leaving Applicants with two n-by-m tables, where n equals the number of subset groups and m equals the number of subset classes in the opposing disease condition. Using the mean probabilities for the group allowed more information from the cell level to rise to the aggregated level than using the individual class prediction alone (computed as the class with maximum confidence of cell membership). These tables also provide confidences for all classes, which is important for understanding the transverse confidence in both directions.
It is especially important to understand the many-to-one relationships between disease conditions and to find where a base cell state becomes layered with additional expression profiles, as these are the exact cases where Applicants can infer the underlying signaling patterns that diversify or concentrate cell state profiles. Where a subset splits diversely across disease, Applicants can start to understand the heterogeneity of patient response to treatment, as it becomes clear which particular cell profiles are correlated with strong and poor response. To gain insight into these changes, Applicants care about where there is strong confidence in both directions and where there is strong confidence in only one direction. The simplest method to calculate these is to separately take the sum and the difference of the pairwise prediction confidences. Applicants call the sum of confidences the correspondence of a subset, and the difference the bias.
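A minimal, non-limiting sketch of the correspondence and bias calculation is given below. It assumes the two mean-probability tables are available as nested dictionaries; the names, the sign convention for the bias (FGID-to-CD confidence minus CD-to-FGID confidence), and the helper for retaining the highest-correspondence fraction are hypothetical.

```python
def correspondence_and_bias(fgid_to_cd, cd_to_fgid):
    """Combine the two directional mean-probability tables.

    fgid_to_cd[a][b]: mean probability that cells of FGID subset a are
    assigned to CD subset b; cd_to_fgid[b][a]: the converse direction.
    Returns one record per (FGID subset, CD subset) pair.
    """
    records = []
    for a, probs in fgid_to_cd.items():
        for b, p_ab in probs.items():
            p_ba = cd_to_fgid[b][a]
            records.append({
                "fgid": a,
                "cd": b,
                "correspondence": p_ab + p_ba,  # 2.0 for a perfect 1:1 match
                "bias": p_ab - p_ba,            # 0.0 when both directions agree
            })
    return records


def top_fraction(records, fraction=0.10):
    """Keep the given fraction of pairs with the highest correspondence."""
    ranked = sorted(records, key=lambda r: r["correspondence"], reverse=True)
    keep = max(1, int(round(len(ranked) * fraction)))
    return ranked[:keep]
```

Under this convention, a pair with high correspondence and near-zero bias suggests a subset whose phenotype is largely unchanged, whereas high correspondence with a large absolute bias suggests a state in one condition layered on a base state in the other.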
Applicants plot these metrics on a dot plot where each possible connection is laid out on a grid. For each dot, Applicants set the size to match the correspondence and color the dot based on the bias, such that a perfect match would appear as a large white circle. A more unidirectional match would be tinted darker in the color matching the disease condition with more confidence. Matches with more bias tend to indicate a subset matching a base cell state but also expressing some additional gene modules. To help the eye pick up the major patterns, Applicants filter to show only the top 10% highest correspondences. This parameter was chosen after looking at the distribution of correspondence scores and selecting the majority of the right tail of the distribution. It keeps the strongest matches in both directions as well as the strongest of the highly biased matches. Also to aid visual interpretation, Applicants perform a hierarchical clustering using cosine distance and complete linkage on the prediction confidences and compute an optimal ordering based on the cosine distances using the “cba” package in R: cran.r-project.org/web/packages/cba/index.html. This allows Applicants to sort subsets on the rows and columns such that subsets that are predicted similarly are next to each other. From this visualization Applicants are able to easily discern which FGID subsets split into many phenotypes within CD (high correspondence and high bias), which subsets change phenotype very little (high correspondence and low bias), and which subsets are potentially unique to a disease condition (very low correspondence and bias).
Compositional differences are an important metric for understanding the baseline differences that prognose a patient's response to treatment. Applicants measure these differences as the proportional enrichment of particular cell subsets within each patient, and find the significantly reproducible enrichments across disease. As an extreme example, Applicants might find that subset A cells comprise as much as 80% of cells sampled in one condition, whereas they might comprise only 30% in a different condition. This type of compositional analysis is highly affected by the number and choice of subsets included, and by the sampling depth per patient (how many cells are collected). The first factor is controlled by the confidence in the clustering and by using computationally optimized parameters. Applicants further control this factor by limiting analysis of compositional shifts of cell states to within major cell types. This prevents an error in one cell type from affecting the entire analysis and allows Applicants to gain a more direct biological insight into the rise and fall of particular cell states in the context of similar subsets. Applicants control the second factor, sampling depth differences, by computing a normalized cell count score per patient of the form (ncells in subset/ncells in patient's major cell type)*1e6. This score provides Applicants with the number of cells expected per million.
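The normalized score can be illustrated with the following short sketch, in which each cell is represented by a hypothetical (patient, major cell type, subset) triple; the function name and input layout are assumptions.

```python
from collections import Counter


def cells_per_million(cells):
    """Per-patient score: (n cells in subset / n cells in the patient's
    major cell type) * 1e6.

    cells: iterable of (patient, major_type, subset) triples, one per cell.
    Returns a dict keyed by the same triple.
    """
    cells = list(cells)
    subset_counts = Counter(cells)
    type_counts = Counter((patient, major) for patient, major, _ in cells)
    return {
        key: count / type_counts[(key[0], key[1])] * 1e6
        for key, count in subset_counts.items()
    }


# Illustrative input: one patient with four T cells in two subsets.
example = [("P1", "T", "A")] * 3 + [("P1", "T", "B")]
print(cells_per_million(example))
```

Because the denominator is the patient's own major cell type, the score is insensitive to how many cells of other lineages happened to be captured from that patient.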
Applicants input the cells per million score into a two-sample Wilcoxon test in base R, which is equivalent to the Mann-Whitney rank-sum test. Applicants set a significance threshold of p_value <0.05. Applicants made 5 different pairwise comparisons (FGID vs FR, FGID vs PR, NOA vs FR, NOA vs PR, FR vs PR). Comparisons between FGID and pediCD groups were determined by finding the maximum correspondence between the disease conditions for each subset. Because Applicants are interested not only in differences between FGID and CD, but also in baseline differences within CD that lead to different treatment responses, Applicants are slightly underpowered in comparisons within CD, where the sample size of 14 is split into groups of 4, 5, and 5. While Applicants do find significantly enriched subsets between subgroups of CD, they are not necessarily robust to multiple testing correction. However, Applicants are confident that the split is justified: first, because Applicants determined the split based on robust clinical markers (see clinical methods section), and second, because Applicants do find consistent biological changes across numerous analyses. Applicants are additionally confident in the results of the Mann-Whitney tests as they correspond to the largest effect size changes among those considered significant in the more lenient Fisher's exact test.
A compositional analysis similar to that done with the Mann-Whitney test was performed with a Fisher's exact test. Owing to the difference between the tests, Applicants input, for each subset, the number of cells in that subset against the number of cells not in that subset within the major cell type, with rows split by the pairwise comparison (NOA vs FR, NOA vs PR, FR vs PR). Applicants computed FDR correction of p_values at the major cell type and entire dataset levels and found significant subsets at both levels. Most interestingly, in comparing the two tests Applicants found that the Mann-Whitney test identified as significant (pval<0.05) the portion of cell subsets with the largest effect sizes. Recognizing the limited patient number in these within-CD comparisons, and wanting to report only the results most likely to be reproducible biology, Applicants determined to follow only those subsets reported as significant within both the Mann-Whitney and Fisher's exact tests.
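For illustration only, the 2x2 contingency layout and the intersection of the two tests might be sketched as below. The two-sided Fisher's exact p-value is implemented directly from the hypergeometric distribution so that the example is self-contained; in practice a statistics library routine would ordinarily be used. The function names, the table orientation (rows are the two response groups, columns are cells in the subset versus the remaining cells of the major cell type), and the dictionary inputs are assumptions.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact p-value for the table [[a, b], [c, d]].

    a, b: subset cells and other cells of the major type in group 1;
    c, d: the same counts in group 2.
    """
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of x subset cells falling in group 1.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum every table at least as extreme (no more probable) than observed.
    total = sum(
        p for p in (prob(x) for x in range(lo, hi + 1))
        if p <= p_obs * (1 + 1e-7)
    )
    return min(1.0, total)


def consensus_subsets(mann_whitney_p, fisher_q, alpha=0.05):
    """Subsets called significant by both tests."""
    return sorted(
        s for s, p in mann_whitney_p.items()
        if p < alpha and fisher_q.get(s, 1.0) < alpha
    )
```

Requiring agreement between a patient-level rank test and a cell-level count test is a conservative filter: the former guards against a single patient dominating the result, while the latter has more power at small patient numbers.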
Two visualizations of these tests proved particularly useful. The first, a heatmap of the cells per million score split by treatment condition, was especially powerful in conjunction with the previously described correspondence dot plot. Together, those plots allowed Applicants to follow by eye directly from a significantly compositionally enriched subset in PR to its neighbors within CD and within FGID, providing a complete picture of where to direct the next analysis. The second also represents the cells per million score, but as a scatter plot with a dot for each patient, grouped by treatment response. This allowed quick visual confirmation that results were not simply due to outlier error.
Cell frequencies were calculated per patient for cell subsets (i.e. end clusters) within parent cell types and cell subsets (i.e. end clusters) within all cells as CPM=(count/sum(count))*1e6. Principal component analysis (PCA) was performed on the resulting patient×CPM matrices using the R function stats::prcomp(., scale=TRUE). Variance explained per PC was calculated as std^2/sum(std^2). PCA loadings per patient and per cell subset were extracted from the prcomp() result. PCA1 and PCA2 from the total PCA×patient and from each celltype's PCA×patient matrix were correlated with clinical metadata using Spearman rank correlation as calculated by the R function stats::cor.test(., method=‘spearman’). P values were recorded from the cor.test() call, and FDR was calculated using R's fdrtool::fdrtool(p.values, statistic=“pvalue”). For combined celltype PCAs, patient×CPM tables were concatenated before PCA.
Fold changes between patients responding or not responding to anti-TNF therapy from RISK and E-MTAB-7604 cohorts were calculated with Seurat (v4.0.3) (Haberman et al., 2014; Hao et al., 2021) and DESeq2 (v1.30.1) (Love et al., 2014) packages, respectively. GSEA analysis was performed using the fgsea R package (v1.16.0) (Korotkevich et al., 2021). Genes with similar fold changes were preranked in a random order. The code for this analysis can be found in the GitHub repository jo-m-lab.github.io/3p-PREDICT-Paper/4_GSEA/PREDICT_GSEA_final.html.
The micrograin structure found through hierarchical tiered clustering is vital for being able to directly compare like cells across disease conditions, and to find significant changes in phenotype and composition within individual subsets. It is also vital to understand how those like subsets relate to each other within a disease condition and how the larger macrograin structure differs across conditions. This macrograin structure can be explored through the gradients of gene expression among cells of a major type. Pseudotime and RNA-velocity are both excellent tools for exploring these gradients. For both tools, the choice of genes directly determines the structure found within the dimensional reduction, and thus which genes are identified as significantly location-specific within the resulting landscape of cells. For these purposes, as Applicants knew Applicants would be exploring a single cell lineage and the relationships of cell states within that space, Applicants required for the dimensional reduction the genes common to that space. Applicants selected genes by performing differential expression between the major cell type and all other cell types within each disease condition. Applicants took the outer union of those genes, then removed from the list genes found to be differentially expressed between disease conditions at the major cell type level. From these genes Applicants performed PCA to 50 principal components and then computed a UMAP reduction to 2 components. This selection process allows the dimensional reduction to find smooth gradients between cells and provides a common space in which cells of multiple disease conditions can exist.
From this common expression landscape Applicants utilized Monocle3 (cole-trapnell-lab.github.io/monocle3/) (Cao et al., 2019) to extract a best-estimate linear path through the space. Applicants calculated a diffusion pseudotime, allowing Applicants to numerically estimate the distribution of cells within the expression landscape. To compute the significance of changes in that distribution, Applicants used a permutation test of Hellinger distance between distributions. At each of 10,000 permutations Applicants shuffled the group ordering within the comparison pair. Applicants performed this test five times, for comparisons between FGID vs FR, FGID vs PR, NOA vs FR, NOA vs PR, and FR vs PR. The threshold was set as Bonferroni-corrected p_value <0.05.
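One possible, non-limiting form of such a permutation test is sketched below. It assumes that pseudotime values are binned into a shared histogram and that group labels are shuffled at the level of individual cells; the number of bins, the cell-level shuffling, the add-one p-value estimate, and all function names are assumptions rather than details stated above.

```python
import math
import random


def histogram(values, lo, hi, n_bins):
    """Normalized histogram of values over the shared range [lo, hi]."""
    counts = [0] * n_bins
    width = (hi - lo) or 1.0
    for v in values:
        idx = min(int((v - lo) / width * n_bins), n_bins - 1)
        counts[idx] += 1
    return [c / len(values) for c in counts]


def hellinger(p, q):
    """Hellinger distance between two discrete distributions (0 to 1)."""
    bc = sum(math.sqrt(a * b) for a, b in zip(p, q))  # Bhattacharyya coefficient
    return math.sqrt(max(0.0, 1.0 - bc))


def permutation_test(x, y, n_perm=10000, n_bins=20, seed=0):
    """Return (observed distance, permutation p-value) for two groups."""
    rng = random.Random(seed)
    pooled = list(x) + list(y)
    lo, hi = min(pooled), max(pooled)

    def dist(a, b):
        return hellinger(histogram(a, lo, hi, n_bins), histogram(b, lo, hi, n_bins))

    observed = dist(x, y)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)  # reassign cells to the two groups at random
        if dist(pooled[:len(x)], pooled[len(x):]) >= observed:
            hits += 1
    # Add-one estimate keeps the p-value strictly above zero.
    return observed, (hits + 1) / (n_perm + 1)
```

With five comparisons, a Bonferroni-corrected threshold of 0.05 would correspond to requiring a raw permutation p-value below 0.01 for each comparison.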
FGID and pediCD Integration Using STACAS
Integration of T cells from the FGID and pediCD datasets (n=29640 and 38031, respectively) was performed using the STACAS package (v1.1.0) (Andreatta and Carmona, 2021). A Sankey plot was created using RAWGraphs 2.0 beta (https://github.com/rawgraphs) (Mauri et al., 2017).
To calculate differential expression between FR and PR groups, for each subset with at least 50 cells in each condition Applicants used a Wilcoxon test thresholded at a Bonferroni-corrected p-value of 0.05, and down-sampled to a maximum of 5000 cells using the “max.cells.per.ident” argument within Seurat's ‘FindMarkers’ function. The limits on the minimum and maximum number of cells were chosen to mitigate issues with comparisons between disproportionate populations and to improve computational efficiency. There still exist 2 orders of magnitude between the minimum and maximum; however, the subsets of most interest, reported in Table 2, are all of the same order of magnitude.
There are noted spillover effects within the expression tests. Applicants observe ubiquitous contamination by genes such as IGHA1, IGHG1, and DEFA5 across all cell types and subsets. These genes are routinely found to be enriched in more severe inflammation, even beyond this dataset. This is a real effect, but of limited use for understanding driving factors within individual cell subsets. Applicants therefore focused on significantly differentially expressed genes that also have a high pct.cells.expressing.in/pct.cells.expressing.out ratio. Applicants can then filter the subsets to find those with the greatest number of specific differentially expressed genes between the FR and PR groups.
Parameters such as sample size, number of replicates, number of independent experiments, measures of center, dispersion, and precision (mean ± SEM), and statistical significances are reported in Figures and Figure Legends. A p-value less than 0.05 was considered significant. Where appropriate, a Bonferroni or FDR correction was used to account for multiple tests, as noted in the figure legends or Methods. All statistical tests corresponding to differential gene expression are described above and were completed using the R language for Statistical Computing.
Various modifications and variations of the described methods, pharmaceutical compositions, and kits of the invention will be apparent to those skilled in the art without departing from the scope and spirit of the invention. Although the invention has been described in connection with specific embodiments, it will be understood that it is capable of further modifications and that the invention as claimed should not be unduly limited to such specific embodiments. Indeed, various modifications of the described modes for carrying out the invention that are obvious to those skilled in the art are intended to be within the scope of the invention. This application is intended to cover any variations, uses, or adaptations of the invention following, in general, the principles of the invention and including such departures from the present disclosure as come within known customary practice within the art to which the invention pertains and as may be applied to the essential features hereinbefore set forth.
This application claims the benefit of U.S. Provisional Application No. 63/158,711, filed Mar. 9, 2021. The entire contents of the above-identified application are hereby fully incorporated herein by reference.
This invention was made with government support under Grant Nos. DK034854, AI118672, HL095791, HL158504, and AI051731 awarded by the National Institutes of Health. The government has certain rights in the invention.
Filing Document | Filing Date | Country | Kind |
---|---|---|---|
PCT/US2022/019582 | 3/9/2022 | WO |
Number | Date | Country | |
---|---|---|---|
63158711 | Mar 2021 | US |