Diffuse large B-cell lymphoma (DLBCL) is the most common B-cell lymphoma and is clinically heterogeneous. Gene expression profiling (GEP) classified DLBCL into 2 major molecular subtypes according to their cell of origin (COO): germinal-center B-cell-like (GCB) and activated B-cell-like (ABC) DLBCL.1 ABC-COO is associated with poorer clinical outcomes in DLBCL irrespective of treatment: CHOP (cyclophosphamide, doxorubicin, vincristine, and prednisone), rituximab (R)-CHOP,1-3 obinutuzumab (G)-CHOP,4 or classical salvage chemotherapy R-DHAP (rituximab, dexamethasone, high-dose cytarabine, and cisplatin) followed by intensive therapy plus autologous stem cell transplantation.5 However, several novel agents, including lenalidomide,6-8 ibrutinib,8,9 and bortezomib alone10 or in combination with durvalumab (anti-PD-L1),11 showed selective or better clinical efficacy in ABC- vs GCB-DLBCL. The prognostic and therapeutic differences between ABC- and GCB-DLBCL have a molecular basis, such as higher frequencies of mutations in CD79, MYD88, CARD11, PRDM1, and TNFAIP3,12 chronic active B-cell receptor signaling,13 and more frequent MYC/BCL2 double expression in the absence of genetic MYC/BCL2 double hit14 in ABC-DLBCL. In addition, the subcellular distribution and mechanism of action of doxorubicin in ABC-DLBCL are different from those in GCB-DLBCL.15 To guide clinical therapeutics, distinction of the GCB vs ABC/non-GC subtype has become the standard practice according to the 2016 revision of the World Health Organization classification of lymphoid neoplasms.16
Significant efforts have been put into establishing clinically applicable assays and accurate classification of DLBCL, and methodology to determine COO has been evolving in the last 2 decades. The original Lymphochip spotted cDNA microarray and the gold standard classification algorithm are robust in COO classification but impracticable for routine clinical practice.1-3 Researchers thus developed algorithms to distinguish GC from non-GC subtypes based on protein expression of 3 to 5 biomarkers in formalin-fixed, paraffin-embedded (FFPE) tissue samples readily assessed by immunohistochemistry (IHC) in the clinic.17-24 However, the accuracy of these IHC algorithms and the prognostic significance of COO subtypes determined by IHC algorithms5,25 are not consistent.23,26-28 To enable GEP by DNA microarrays to classify DLBCL using clinical FFPE tissues that yield highly fragmented RNA samples, new RNA amplification and labeling techniques and classification models were developed, including a 100-gene classifier for Affymetrix GeneChip (Affymetrix, Inc) data29 and a 20-gene DLBCL Automatic classifier for Illumina WG-DASL platform (Illumina United Kingdom) data30 developed from a previous platform-independent 27-gene DLBCL subgroup predictor31 that showed reproducibility and prognostic value.
To simplify the GEP process for FFPE samples, a multiplexed quantitative nuclease protection assay (qNPA) was developed that directly hybridizes mRNA in situ using 50-mer probes for genes of interest, followed by probe capture and quantitative imaging, thereby reliably detecting mRNA levels in FFPE samples without RNA extraction and amplification.32-34 The qNPA platform (HTG Molecular Diagnostics, Inc.) can accurately classify DLBCL using a 14-gene signature.35 The current HTG EdgeSeq DLBCL COO assay has been applied in a clinical trial.36 However, the most successful simplified variation of microarray for rapid COO determination is the NanoString nCounter System (NanoString Technologies), which elegantly detects target mRNA of interest in extracted nonamplified RNA samples using a capture probe and a color-coded reporter probe, followed by purification, immobilization, and digital readout.37 Several different small gene panel-based DLBCL-COO assays, including the most widely used Lymph2Cx 20-gene assay,38 have been applied in research studies and clinical trials,4,39-45 although a large gene panel (145 genes) was also achievable for the NanoString nCounter system.46 COO determined by the Lymph2Cx 20-gene assay either exhibited high concordance with GEP-determined COO or showed significant prognostic value in 4 retrospective studies47-50 and a clinical trial,51 but not in 2 clinical trials52 and 1 retrospective study.53
Reverse transcriptase-multiplex ligation-dependent probe amplification, which ligates the left and right probes annealed to cDNA target sequences, permitting amplification of specific genes,54 is another type of assay that has been applied for DLBCL-COO classification based on expression of 14 or 21 genes.55,56 This method is sensitive and cost-effective without using a dedicated platform but has relatively poor dynamic range and is unable to include some COO-specific genes.55
DLBCL outcome predictors that link GEP signatures directly to clinical outcome instead of COO have also been developed,2,3,57,58 but the reproducibility between different studies was poor, and the predictive value for therapies other than the standard treatment is uncertain. In contrast, COO classification, which has an underlying biological basis,9 also has predictive value for novel therapies, as demonstrated in phase 1/2 and 2/3 clinical trials.6-8,10,11 However, recent clinical trials for adding ibrutinib (phase 336) and bortezomib (phases 259 and 360) to the standard R-CHOP in previously untreated ABC (by Hans algorithm and HTG EdgeSeq36 or by Illumina DASL assay60) or non-GC (by Hans algorithm and NanoString Lymph2Cx assay59) DLBCL patients failed to show improved clinical outcome.
To better classify DLBCL biologically and to guide therapeutic clinical trials, genetic alteration signatures have been explored to subtype DLBCL in large numbers of patients, as genetic alterations upstream of the oncogenic biology in DLBCL can define the response to novel targeted therapies. Schmitz et al61 used a GenClass algorithm, and Chapuy et al62 used a nonnegative matrix factorization (NMF) consensus clustering algorithm to analyze high-content genetic data of 574 and 304 patients, respectively, and uncovered genetically distinct subtypes within or independent of COO subtypes, most of which demonstrated robust prognostic significance and potential therapeutic relevance.61,62 However, the pathogenic driver roles of many mutations in signatures vary or have not been validated,63,64 and how to accurately assign a genetic subtype to new individual patients at presentation in real time is less clear than the current COO classification. In a phase 3 GOYA study (NCT01287741),43 approximation of EZB, BN2, N1, and MCD subtypes based on presence of subtype founder gene alterations in targeted next-generation sequencing (NGS) data of 465 genes did not find prognostic effect, whereas clusters (C) C2, C3, and C5 identified by applying NMF consensus clustering to the study cohort showed poorer prognosis compared with C0, C1, and C4 clusters.
In another prospective study from the LNH03B LYSA (Lymphoma Study Association) clinical trials with targeted NGS of 34 key genes and genomic copy number variation analysis, none of the genetic subtypes identified by the GenClass algorithm or NMF consensus clustering showed prognostic significance.65 The inconsistent prognostic values could result from the highly variable sequencing panels and NGS data quality in different studies, inaccurate subtyping, and the clinical heterogeneity within defined genetic subtypes underscored by phenotypic biologic (eg, MYC/BCL2 expression66) heterogeneity arising from many other underlying mechanisms, for example, epigenetic deregulation and genetic alterations in noncoding regions.67 In fact, in the cohort of Schmitz et al, MCD patients with MYD88/CD79B double mutations had better survival compared with other MCD patients,66 and the EZB subtype has recently been further divided into the unfavorable EZB-MYC+ and favorable EZB-MYC− subtypes by a LymphGen algorithm.68 A LymphGen webtool is publicly accessible and can assign genetic subtypes to patients if the input is from a cohort, but not if it is from only 1 patient.
In a first aspect, the present invention is a method of treating diffuse large B-cell lymphoma, comprising obtaining a sample from a patient having diffuse large B-cell lymphoma; detecting in the sample, by an assay, mutation in each gene in a first panel; quantifying in the sample an expression level of each gene in a second panel; classifying the diffuse large B-cell lymphoma of the patient as having a cell of origin of either (i) germinal-center B-cell-like or (ii) activated B-cell-like; and treating the patient with a cancer treatment therapy regime. The first panel comprises at least one gene selected from the group consisting of EZH1 and MYD88; and the second panel comprises at least one gene selected from the group consisting of IRF4, MYBL1, RASGRF1, S1PR2 and SSBP2.
In a second aspect, the present invention is a method, comprising detecting in a sample from a patient having diffuse large B-cell lymphoma, by an assay, mutation in each gene in a first panel; and quantifying in the sample an expression level of each gene in a second panel. The first panel comprises at least one gene selected from the group consisting of EZH1 and MYD88; and the second panel comprises at least one gene selected from the group consisting of IRF4, MYBL1, RASGRF1, S1PR2 and SSBP2.
In a third aspect, the present invention is a method, comprising detecting in a sample from a patient having diffuse large B-cell lymphoma, by an assay, mutation in each gene in a first panel; and quantifying in the sample an expression level of each gene in a second panel. The first panel comprises TP53; and the second panel comprises at least one gene selected from the group consisting of CARD11, BCL6, MALAT1, RABEP1 and BCORL1.
In a fourth aspect, the present invention is a method, comprising detecting in a sample from a patient having diffuse large B-cell lymphoma, by an assay, mutation in each gene in a first panel; and quantifying in the sample an expression level of each gene in a second panel. The first panel comprises TP53; and the second panel comprises at least one gene selected from the group consisting of CDK8, LMO2, BCR, TGFBR2, CHD2 and ETS1.
Based on these previous studies, we hypothesized that combined high-throughput genetic and gene expression signature analysis may improve the DLBCL classification for prognostic stratification and therapeutic implication. To be clinically applicable, fast and economical assays on FFPE samples that provide both genetic and expression data with low sample input are needed. We therefore implemented targeted RNA sequencing (RNA-Seq) of 1408 genes with NGS technology that simultaneously sequences and quantitates expressed mRNA molecules in a single assay. Artificial intelligence (AI) was implemented to build predictive models based on both genetic and gene expression data of a large number of DLBCL FFPE samples. The robustness of the predictive models was tested in validation cohorts supporting our hypothesis. The full details of the study have been published84 (Xu-Monette Z Y, et al. A refined cell-of-origin classifier with targeted NGS and artificial intelligence shows robust predictive value in DLBCL. Blood Advances 2020; 14(4):3391), the contents of which are hereby incorporated by reference in their entirety, except where inconsistent with the present application. Three models were developed: a model for classifying the cell of origin (COO) of the DLBCL, a prognostic model for DLBCL overall survival (OS), and a prognostic model for DLBCL progression-free survival (PFS).
The model for classifying the COO of the DLBCL includes detecting in a sample from a patient mutation in each gene in a first COO panel, and quantifying expression levels of each gene in a second COO panel. Each COO panel includes at least one gene. The first COO panel may include one or more of the genes EZH1 and MYD88. Preferably the first COO panel includes EZH1, and more preferably both EZH1 and MYD88. The second COO panel may include one or more of the genes AFF3, AHR, AUTS2, BCAS4, BCL6, BTLA, CARD11, CCND2, CCND3, CD22, CD44, COL9A3, CREB3L2, EBF1, ETV6, FAM46C, FOXP1, IKZF1, IL2RA, IRF4, IRS1, KANK1, LCK, LMO2, LPP, LRMP, LRP5, LRRK2, LYL1, LYN, METTL7B, MYBL1, P2RY8, PAG1, PAK6, PDGFD, PIK3CG, PIM1, PTK2, PTK2B, PTPN2, RASGRF1, S1PR2, SSBP2, STAT3 and TBL1XR1. Preferably, the second COO panel includes one or more of IRF4, MYBL1, RASGRF1, S1PR2 and SSBP2, and more preferably the second COO panel includes all 5 of these genes. The second COO panel may include 1 to 46 genes including 2, 3, 4, 5, 10, 15, 20, 25, 30, 35, 40 or 45 genes.
Optionally, the COO of the DLBCL may be identified, and then optionally the clinician can select the most appropriate cancer treatment therapy regime. Preferably, the confidence of classifying the COO is a probability of at least 0.8. Examples of cancer treatment therapy regimes include administering one or more of cyclophosphamide, doxorubicin, vincristine, prednisone, rituximab, obinutuzumab, dexamethasone, cytarabine, cisplatin, lenalidomide, ibrutinib, bortezomib, durvalumab and autologous stem cell transplantation. The best selection of treatment of DLBCL based on the COO of the DLBCL may be found, for example in UpToDate, a clinical decision support resource that is used to aid medical professionals in diagnosing and making treatment decisions (UpToDate, Wolters Kluwer, www.uptodate.com/home).
The prognostic model for DLBCL OS includes detecting in a sample from a patient mutation in each gene in a first OS panel, and quantifying expression levels of each gene in a second OS panel. Each OS panel includes at least one gene. The first OS panel may include one or more of the genes TP53 and TET2. Preferably the first OS panel includes TP53, and more preferably both TP53 and TET2. The second OS panel may include one or more of the genes AFF3, ASPSCR1, BCL2, BCL6, BCORL1, BHLHE22, BTK, CARD11, CCND2, CD58, CHEK2, CIT, CREB3L2, DST, ETS1, EYA2, FANCF, FZD6, GAS5, HMGA1, HOXA9, IRF4, KDM5C, KLK2, LFNG, LMO2, MACROD1, MALAT1, MEF2B/MEF2BNB-MEF2B, MFNG, MLLT4, MTCP1, MYC, PIM1, POLD1, PPP3CA, RABEP1, RAD51B, RBM6, RECQL4, RHBDF2, RLTPR, RTEL1-TNFRSF6B, SMAD3, SPTBN1, SRRM3, ST6GAL1, SULF1, SYP, TEAD2, TFAP2A, TGFBR3, U2AF2 and ZIC2. Preferably, the second OS panel includes one or more of CARD11, BCL6, MALAT1, RABEP1 and BCORL1, and more preferably the second OS panel includes all 5 of these genes. The second OS panel may include 1 to 54 genes including 2, 3, 4, 5, 10, 15, 20, 25, 30, 35, 40, 45, 50 or 53 genes. Preferably, the age of the patient (that is, the patient being over or under 60 years of age) is also included as a factor in determining OS.
The prognostic model for DLBCL PFS includes detecting in a sample from a patient mutation in each gene in a first PFS panel, and quantifying expression levels of each gene in a second PFS panel. Each PFS panel includes at least one gene. The first PFS panel includes TP53. The second PFS panel may include one or more of the genes AFF1, AFF3, ASPSCR1, ATM, BCL2, BCR, BTG2, BTK, BTLA, CDK12, CDK8, CHD2, CHEK2, CIRH1A, CREB3L2, DDIT3, EDNRB, EPHB6, ETS1, FANCF, FOXP1, FZD6, GAB1, GAS5, GPR34, IQCG, ITGA7, KDM5C, KDSR, LAMA5, LFNG, LIFR, LMO2, MACROD1, MAP2K5, MFNG, MYC, NCSTN, NR6A1, POU2AF1, PRKCB, RLTPR, RPL22, SHC2, SMAD3, SPTBN1, ST6GAL1, TEAD2 and TGFBR2. Preferably, the second PFS panel includes one or more of CDK8, LMO2, BCR, TGFBR2, CHD2 and ETS1, and more preferably the second PFS panel includes all 6 of these genes. The second PFS panel may include 1 to 49 genes including 2, 3, 4, 5, 10, 15, 20, 25, 30, 35, 40, 45 or 48 genes.
A sample from a patient may be a tissue sample (such as a tumor tissue sample or bone marrow tissue sample) or a cell free RNA sample. The sample may be fresh tissue or formalin-fixed, paraffin-embedded (FFPE) tissue. Preferably, the patient has already been diagnosed with DLBCL. A variety of techniques to detect mutation in genes and the expression level of genes in a sample are known. Preferably, both the mutations and the expression levels are determined using next generation sequencing (NGS) in a single assay. Preferably, all the genes of interest in all 3 models, both mutations and expression levels, are determined in a single assay. Preferably, the expression levels are normalized, for example normalized to the expression level of the PAX5 gene.
RNA-seq was performed for 444 patients with de novo DLBCL diagnosed in 1998 to 2008 treated with R-CHOP at 22 medical centers. Cases were organized for retrospective studies as part of the DLBCL Consortium Program,69 which has been approved by the institutional review board of each participating medical center and conducted in accordance with the Declaration of Helsinki. Patients with transformed DLBCL, primary mediastinal large B-cell lymphoma, primary central nervous system DLBCL, or primary cutaneous DLBCL have been excluded. Molecular characterization of the study cohort has been previously summarized.70,71 Fluorescence in situ hybridization identified 12 of 293 cases as high-grade B-cell lymphoma with MYC and BCL2 and/or BCL6 rearrangements (7 MYC/BCL2 double/triple-hit and 5 MYC/BCL6 double-hit cases).
Data for 418 cases were further analyzed after data quality control. GEP was performed in 366 of the 418 patients using Affymetrix GeneChip Human Genome U133 Plus 2.0 (deposited in Gene Expression Omnibus GSE #31312).24 Using a Bayesian model, 172, 160, and 34 cases were determined as GCB, ABC, and unclassified DLBCL, respectively. For the 34 GEP-unclassified cases, the Visco-Young IHC algorithm24 was applied, which assigned 15 cases to GCB and 19 cases to ABC. For the other 52 cases in which GEP was not performed, the Visco-Young algorithm classified 22 cases as GCB and 23 cases as ABC.
To further validate the COO model, 60 independent DLBCL samples were obtained and classified into ABC/GCB subtypes using the Lymph2Cx NanoString nCounter assay according to the manufacturer's instructions.
Raw RNA-Seq and Affymetrix GEP data were preprocessed and normalized by robust multichip average using the R package (version 1.65.1).72 Two-class unpaired significance analysis of microarrays was performed to identify significantly differentially expressed genes (DEGs) between the 2 groups.73 Gene expression data were analyzed via CLUSTER software using the average linkage metric and then displayed by JAVA TREEVIEW (www.java.com/en).74
The Agencourt FormaPure Total 96-Prep Kit was used to extract both DNA and RNA from the same FFPE tissue lysates using an automated KingFisher Flex and protocols as recommended by each manufacturer. Samples were selectively enriched for 1408 cancer-associated genes using reagents provided in an Illumina TruSight RNA Pan-Cancer Panel. The cDNA was generated from the cleaved RNA fragments using random primers during the first- and second-strand synthesis. Then, sequencing adapters were ligated to the resulting double-stranded cDNA fragments. The coding regions of expressed genes were captured from this library using sequence-specific probes to create the final library. Sequencing was performed on an Illumina NextSeq 550 System platform. Ten million reads per sample in a single run were required. The read length was 2×150 bp. The sequencing depth was 10× to 1739×, with a median of 41×. An expression profile was generated from the sequencing coverage profile of each individual sample using Cufflinks. Expression levels were measured using fragments per kilobase of transcript per million and further normalized using the B-cell PAX5 RNA expression levels to adjust for variability in the percentage of DLBCL cells in samples.
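The expression normalization described above can be illustrated with a minimal Python sketch. It assumes that normalizing to PAX5 means dividing each gene's FPKM value by the PAX5 FPKM value of the same sample; the text does not state the exact arithmetic, and the function names and read counts below are hypothetical.

```python
def fpkm(fragments, transcript_length_bp, total_mapped_fragments):
    """Fragments per kilobase of transcript per million mapped fragments."""
    kilobases = transcript_length_bp / 1_000
    millions = total_mapped_fragments / 1_000_000
    return fragments / kilobases / millions


def normalize_to_reference(expression, reference_gene="PAX5"):
    """Divide every gene's FPKM by the reference gene's FPKM (assumed ratio)."""
    ref = expression[reference_gene]
    if ref <= 0:
        raise ValueError(f"{reference_gene} expression must be positive")
    return {gene: value / ref for gene, value in expression.items()}


# Hypothetical counts for one sample (illustrative numbers only).
total = 10_000_000
sample = {
    "PAX5": fpkm(2_000, 4_000, total),
    "IRF4": fpkm(1_500, 5_000, total),
}
normalized = normalize_to_reference(sample)
```

Expressing each gene relative to a B-cell marker in this way makes samples with different tumor-cell fractions more comparable, which is the stated purpose of the PAX5 adjustment.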
Alignment of sequencing data and variant calling were performed with the DRAGEN Somatic Pipeline (Illumina) using tumor-only analysis against the GRCh37 reference genome to identify 2 classes of mutations: single nucleotide variants and indels. Tumor samples were analyzed without a matching normal.
To build robust DLBCL classification models, we randomly selected 60% of cases to fit the model (training set) and then validated the model using the remaining 40% (validation set). Sixty independent DLBCL samples classified by the NanoString Lymph2Cx assay were used as a second validation set.
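A random 60%/40% split of this kind might be sketched as follows. The case identifiers, the fixed seed, and the rounding of the training-set size are assumptions made for illustration; the text does not describe how the random selection was implemented.

```python
import random


def split_cases(case_ids, train_fraction=0.6, seed=0):
    """Shuffle the cases reproducibly and cut them into training/validation."""
    rng = random.Random(seed)          # fixed seed so the split can be repeated
    shuffled = list(case_ids)
    rng.shuffle(shuffled)
    n_train = round(len(shuffled) * train_fraction)
    return shuffled[:n_train], shuffled[n_train:]


# Hypothetical identifiers for the 418 cases that passed quality control.
cases = [f"case_{i:03d}" for i in range(418)]
train, validation = split_cases(cases)
```

Each case lands in exactly one of the two sets, so the validation cases play no part in fitting the model.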
First, univariate significance tests were used to screen the large number of variables. Normalized RNA expression data and mutation data were included as variables to build a classification model. For interpretability and simplicity, we divided the gene expression values into 4 or 10 equal parts using the quartiles (Q1, Q2, and Q3) and deciles and selected mutation data of 39 highly recurrent genes that had mutations in at least 10 patients. Fisher's exact test was used after discretizing RNA expression values using their quartiles, and 228 variables were statistically significant with P<0.01. After adjusting for multiple hypothesis testing using the Benjamini-Hochberg method and setting the cutoff for false discovery rate (FDR) at 0.01, statistically significant variables were narrowed down to 129. Finally, setting the cutoff for FDR at 0.0001, 48 variables were selected with either small adjusted P values or high area under the receiver operating curve (AUC).
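Two steps of this screening, quartile discretization and the Benjamini-Hochberg adjustment, can be sketched in Python as below. This is an illustration under stated assumptions rather than the code used in the study: the quartiles follow the default method of `statistics.quantiles`, and the P values passed to `benjamini_hochberg` stand in for those that Fisher's exact test would produce.

```python
import statistics


def quartile_bins(values):
    """Map each value to a bin 0..3 using the sample quartiles Q1, Q2, Q3."""
    q1, q2, q3 = statistics.quantiles(values, n=4)
    return [sum(v > q for q in (q1, q2, q3)) for v in values]


def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted P values in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def select_by_fdr(names, p_values, cutoff):
    """Keep the variables whose adjusted P value is below the FDR cutoff."""
    adjusted = benjamini_hochberg(p_values)
    return [n for n, q in zip(names, adjusted) if q < cutoff]
```

Lowering `cutoff` from 0.01 to 0.0001 corresponds to the narrowing of the candidate list described above.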
We selected 252 DLBCLs with high confidence COO assignment to develop risk stratification models directly correlating with survival. We randomly selected 60% (152) of subjects as the training set to fit the model and tested the performance in the remaining 40% (100) patients. Kaplan-Meier and Cox proportional hazards (CPH) analysis was used to identify variables with significant prognostic impact.
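For reference, the Kaplan-Meier estimate used in such survival analyses can be written in a few lines of Python. The sketch below is a generic product-limit estimator, not the software used in the study; the argument names are illustrative.

```python
def kaplan_meier(times, events):
    """Return [(time, survival probability)] at each distinct event time.

    times  : follow-up time for each patient
    events : 1 if the event (e.g. death) was observed, 0 if censored
    """
    data = sorted(zip(times, events))
    at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = 0
        removed = 0
        # Group all patients who share the same follow-up time.
        while i < len(data) and data[i][0] == t:
            deaths += data[i][1]
            removed += 1
            i += 1
        if deaths:
            survival *= 1 - deaths / at_risk
            curve.append((t, survival))
        at_risk -= removed   # events and censored cases both leave the risk set
    return curve
```

For example, `kaplan_meier([5, 8, 12, 12, 20], [1, 0, 1, 1, 0])` yields a step at time 5 and a second step at time 12, while the censored patients only reduce the number at risk.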
Multiple statistical approaches were tested for modeling performance, and models built through deep learning techniques75,76 were most predictive and robust. We used autoencoders for nonlinear transformations of autoencoded features into 2-dimensional latent space. Logistic regression and CPH models were used for building the COO model and clinical risk models, respectively.
Mutation status of each gene was analyzed for prognostic significance. Table 2 lists frequently mutated genes with significant mutational effects on overall survival (OS) by univariate analysis. The DLBCL group had 418 patients. The GCB group, determined by gene-expression profiling, had 172 patients. The ABC group, determined by gene-expression profiling, had 160 patients. The impact on OS was based on univariate analysis for each gene. Among genes with mutations occurring in at least 9 patients, TP53, TET2, KMT2D (in overall cohort, P=0.0005, 0.011, and 0.012, respectively), NOTCH2 (in GCB, P=0.005), and ATM (in ABC, P=0.003) mutations showed significantly adverse effects, whereas EZH2 and GNA13 mutations showed significantly favorable effects (P=0.007 and 0.047, respectively).
RNA-Seq gene expression84, gene fusion, and mutation data were used to develop a model for DLBCL-COO classification in the training set. Fisher's exact test and multiple hypothesis testing adjustment were used to identify RNA-Seq variables showing significant difference between GCB and ABC subtypes. Finally, the top 48 variables (Table 3) that differed significantly between GCB and ABC subtypes with FDR<0.0001 or high AUC were chosen to build a new classification model for RNA-Seq data, including the mutation status of 2 genes (MYD88 and EZH2) and the RNA expression levels of 46 genes.
Several statistical models were built on the 48 variables in the training set (without knowing classification) and then tested in the validation sets. The COO model based on autoencoder, an unsupervised deep learning technique, showed the best performance. An autoencoder neural network was built with 5 hidden layers.75,76 The first 2 layers and the last 2 layers each had 100 neurons; the middle layer (bottleneck) had 2 neurons, which captured latent (unobserved) features of the data. The values of these 2 neurons formed a low-dimensional (2) representation of the data; that is, it aggregated the 48 variables into 2 latent features. The top 7 contributing variables to the latent features were MYD88 mutation, EZH2 mutation, RASGRF1 expression, MYBL1 expression, S1PR2 expression, SSBP2 expression, and IRF4 expression. Based on the latent features, a logistic regression model was built for GCB/ABC classification (named as NGS-COO classifier). As shown in
The NGS-COO classifier developed from the training set was then applied to the validation set. A probability score was generated for each case. Approximately 30% of the cases had a score between 0.5 and 0.75, indicating low confidence for classification. For the remaining 70% with high confidence for assigning to 1 of the 2 subtypes (probability of 0.8 or higher), the ABC vs GCB classification showed sensitivity and specificity of 96% and 97% for classification in the validation set. The accuracy/concordance rate with previous GCB/ABC classification was 95.6%. The corresponding AUC was 96.2%.
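The confidence cutoff and the performance metrics described above can be illustrated with the following Python sketch. It assumes that the classifier outputs the probability of the ABC subtype, that the confidence of a call is the larger of that probability and its complement, and that ABC is treated as the positive class; none of these conventions is stated in the text, and the function names are hypothetical.

```python
def call_subtype(p_abc, min_confidence=0.8):
    """Return 'ABC', 'GCB', or None when neither call reaches min_confidence."""
    if p_abc >= min_confidence:
        return "ABC"
    if 1 - p_abc >= min_confidence:
        return "GCB"
    return None


def sensitivity_specificity(predicted, reference, positive="ABC"):
    """Compute both metrics over the cases that received a confident call."""
    pairs = [(p, r) for p, r in zip(predicted, reference) if p is not None]
    tp = sum(p == positive and r == positive for p, r in pairs)
    fn = sum(p != positive and r == positive for p, r in pairs)
    tn = sum(p != positive and r != positive for p, r in pairs)
    fp = sum(p == positive and r != positive for p, r in pairs)
    return tp / (tp + fn), tn / (tn + fp)


# Hypothetical probabilities and reference labels (illustrative only).
probabilities = [0.95, 0.10, 0.60, 0.85, 0.30, 0.02]
reference = ["ABC", "GCB", "ABC", "GCB", "GCB", "GCB"]
calls = [call_subtype(p) for p in probabilities]
sens, spec = sensitivity_specificity(calls, reference)
```

Cases for which `call_subtype` returns `None` correspond to the low-confidence group and are left out of the metric calculation.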
In the training and validation sets, in total, 216 cases were determined as the ABC subtype and 202 cases as the GCB subtype. The new GCB/ABC cases were also associated with 1319 significant DEGs with FDR<0.0001 in GEP analysis using our previous Affymetrix GeneChip DNA microarray data and multiple biomarkers characterized in previous studies by our Consortium program.
To further evaluate the performance of the NGS-COO classification model, we applied the same approach to 60 independent cases as an external validation cohort. Our NGS-COO model showed sensitivity and specificity of 96% and 97%, respectively, with the previous COO classification by the NanoString Lymph2Cx assay. The concordance rate was 92.9%. The corresponding AUC was 95.7%. As shown in
The performance of our NGS-COO classifier was also evaluated by correlating with survival outcomes. Although the autoencoder was only trained for COO classification in the training set, the NGS-COO classifier was significantly associated with OS and progression-free survival (PFS) in DLBCL, similar to the previous COO classification (
To build robust prognostic models aggregating small contributions of a large number of variables directly to patient survival, we used a similar procedure and the AI method to develop models in the training set and test the performance in the validation set based on both gene expression and genetic variables plus 2 additional factors: age and sex of patients. We first screened for significant variables using Kaplan-Meier and CPH for OS in the training set. Although 61 variables showed significant prognostic effects by log-rank test and 110 variables by CPH regression (P<0.05), only the TP53 mutation remained statistically significant after adjusting for multiple hypothesis testing (FDR<0.0001). Therefore, we selected 57 variables with the top 2% AUC values or P<0.01 (either based on log-rank test or CPH; Table 4).
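The log-rank screening step can be illustrated with a self-contained Python sketch of the standard two-group test. It is a textbook formulation offered under that assumption, not the study's code, and the group and argument names are illustrative.

```python
import math


def logrank_test(times_a, events_a, times_b, events_b):
    """Two-group log-rank test; returns (chi-square statistic, P value)."""
    event_times = sorted(
        {t for t, e in zip(times_a, events_a) if e}
        | {t for t, e in zip(times_b, events_b) if e}
    )
    observed_minus_expected = 0.0
    variance = 0.0
    for t in event_times:
        n_a = sum(1 for x in times_a if x >= t)   # at risk in group A
        n_b = sum(1 for x in times_b if x >= t)   # at risk in group B
        d_a = sum(1 for x, e in zip(times_a, events_a) if x == t and e)
        d_b = sum(1 for x, e in zip(times_b, events_b) if x == t and e)
        n, d = n_a + n_b, d_a + d_b
        observed_minus_expected += d_a - d * n_a / n
        if n > 1:
            variance += d * (n_a / n) * (n_b / n) * (n - d) / (n - 1)
    chi2 = observed_minus_expected ** 2 / variance
    # Survival function of a chi-square variable with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value
```

In a screen like the one described, each candidate variable would split the training cases into two groups (for example, mutated vs wild type), and the resulting P value would be compared with the chosen threshold.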
We used a similar neural network architecture as described for COO modeling and again included 2 neurons in the bottleneck layer to reduce the data into 2 dimensions (latent features). The top 7 variables contributing to the 2 latent features are age >60, TP53 mutation, CARD11 expression, BCL6 expression, MALAT1 expression, RABEP1 expression, and BCORL1 expression. A simple CPH model was built based on the 2 latent features obtained from the autoencoder (which are nonlinear combinations of the 57 variables) and provided a risk score (NGS-OS score) for each case, which was normalized to be between 0 (lowest risk) and 100 (highest risk). As shown in
We followed a similar procedure to build a CPH model for PFS with 50 selected variables based on a 2-dimensional feature set obtained from an autoencoder (Table 5). The top 7 variables contributing to the model are TP53 mutation, CDK8 expression, LMO2 expression, BCR expression, TGFBR2 expression, CHD2 expression, and ETS1 expression. Although 24 variables are shared by the NGS-OS and NGS-PFS models, there are only 7 genes (AFF3, BCL6, CARD11, CCND2, IRF4, LMO2, and PIM1) shared by the NGS-COO and NGS-OS models and 5 genes (AFF3, BTLA, CREB3L2, FOXP1, and LMO2) shared by the NGS-COO and NGS-PFS models.
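The mapping from the 2 latent features to a risk score between 0 and 100 can be sketched as follows. The text does not specify how the scores were normalized; this sketch assumes that the raw score is the Cox linear predictor and that min-max scaling is applied across cases. The coefficients and latent values are hypothetical.

```python
def linear_predictor(latent, coefficients):
    """Cox linear predictor: weighted sum of the latent features."""
    return sum(z * b for z, b in zip(latent, coefficients))


def scale_to_0_100(raw_scores):
    """Min-max scale so the lowest risk maps to 0 and the highest to 100."""
    low, high = min(raw_scores), max(raw_scores)
    if high == low:
        raise ValueError("scores are constant; cannot rescale")
    return [100 * (s - low) / (high - low) for s in raw_scores]


# Hypothetical coefficients and latent features for three cases.
coefficients = (0.5, -1.0)
latent_features = [(0.0, 0.0), (1.0, -1.0), (2.0, -2.0)]
raw = [linear_predictor(z, coefficients) for z in latent_features]
risk_scores = scale_to_0_100(raw)
```

With this convention a new case would be scaled using the minimum and maximum from the training set, so its score could fall slightly outside the 0 to 100 range unless it is clipped.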
Similar to the NGS-OS risk scores, NGS-PFS risk scores identified one third of the training set and 30% of the validation set as high-risk patients (
In this study, we developed novel DLBCL classification models based on both genetic and transcriptional variables derived from comprehensive RNA-Seq annotation and quantitative data. Our results demonstrated that both the NGS-COO classifier and NGS survival predictors were robust, and AI was able to assign COO/risk scores to new DLBCL cases (patients in the validation sets). Our NGS-COO classifier shared 8 genes with the 27-gene predictor by Wright et al (BCL6, CCND2, ETV6, IRF4, LMO2, LRMP, MYBL1, and PIM1),31 8 genes with the 20-gene DLBCL Automatic classifier by Barrans et al (BCL6, CCND2, ETV6, FOXP1, IRF4, LMO2, LRMP, and PIM1),30 7 genes with the 14-gene-qNPA assay (BCL6, CCND2, IRF4, LMO2, LRMP, MYBL1 and PIM1),35 and 3 genes with the NanoString Lymph2Cx assay (CREB3L2, MYBL1, and S1PR2).38 Seven of the total 11 common genes (BCL6, CCND2, CREB3L2, FOXP1, IRF4, LMO2, and PIM1) are also shared by our NGS survival predictors, consistent with the association of COO with clinical outcome. The NGS-OS/PFS risk predictors had more significant P values in prognostic analysis than the NGS-COO classifier in the same patient cohort, suggesting that COO is only one of the biological contributors to DLBCL clinical outcome. However, the performance of NGS-OS/PFS risk predictors for other therapies is unknown. Different from previous COO/prognostic models, we integrated genetic abnormalities: MYD88 and EZH2 mutations in the NGS-COO classification model, TP53 and TET2 mutations in the NGS-OS risk model, and TP53 mutation in the NGS-PFS model.
The high-throughput RNA-Seq assays developed in this study using an NGS benchtop sequencer with approximately 3-day turnaround time have important practical implications. Although targeted NGS platforms have been implemented in the clinic to aid in diagnosis and therapeutic decisions,80 and AI is emerging as an efficient tool in health care for large data processing and sophisticated model construction,76,81 currently no NGS panels and AI implementation have been developed for lymphoma diagnosis and management. Our study supports the reliability and practicality of using targeted NGS along with AI in generating clinically useful objective information. Compared with current IHC assays, DNA microarrays, and other GEP analysis techniques used for DLBCL COO classification, targeted RNA-Seq has a balanced advantage of genome-wide coverage, dynamic range of quantification, reproducibility, high throughput, and accuracy, as well as high sensitivity, automation, affordability, short assay time, and flexibility.82 As RNA-seq has become less costly and been integrated into clinical practice,80 we expect that the generated RNA-seq data will be used not only to answer the COO and prognostic questions but also other diagnostic and clinical questions impacting clinical decisions, such as predicting clinical responses to novel therapies in the clinic and in future prospective or retrospective studies.80,83
The current proof-of-principle study demonstrates the potential utility of the targeted RNA-Seq assay for accurate and reproducible DLBCL-COO subclassification in daily clinical practice using a commercially available NGS platform; streamlined analysis of high-throughput RNA-Seq data, COO assignment, and risk prediction by AI can further improve the workflow.
Here, the input data are denoted as Xin, the hidden layers as h, and the output layer as Xout. The model can be presented as follows:
$$h_0 = X_{\mathrm{in}}$$

$$h_l = f_l(h_{l-1} W_l + b_l), \quad l = 1, \ldots, 5$$

$$h_6 = X_{\mathrm{out}} = g(h_5 W_{\mathrm{out}} + b_{\mathrm{out}})$$
where W_l is the weight matrix connecting layers l−1 and l, b_l is the corresponding vector of biases, f_l is a nonlinear transformation function, and g is a link function that maps the last layer to X_out with the corresponding weight matrix W_out and bias vector b_out.
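The forward pass defined by these equations can be sketched in plain Python. The layer sizes (48 inputs, hidden layers of 100, 100, 2, 100 and 100 neurons) follow the architecture described earlier, and tanh with an identity link follows the choices stated for this model; the random initial weights, helper names, and input vector are assumptions for illustration, and no training is performed.

```python
import math
import random


def dense(x, weights, biases, activation):
    """Compute activation(x W + b) for one input vector x."""
    return [
        activation(sum(xi * row[j] for xi, row in zip(x, weights)) + biases[j])
        for j in range(len(biases))
    ]


def init_layer(n_in, n_out, rng):
    """Small random weights and zero biases (untrained, for illustration)."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_out)] for _ in range(n_in)]
    return weights, [0.0] * n_out


def forward(x_in, hidden_layers, output_layer):
    """Return (X_out, bottleneck activations) for one input vector."""
    h = x_in                                     # h_0 = X_in
    activations = []
    for weights, biases in hidden_layers:        # h_l = tanh(h_{l-1} W_l + b_l)
        h = dense(h, weights, biases, math.tanh)
        activations.append(h)
    w_out, b_out = output_layer
    x_out = dense(h, w_out, b_out, lambda v: v)  # identity link g
    return x_out, activations[2]                 # third hidden layer = bottleneck


rng = random.Random(0)
sizes = [48, 100, 100, 2, 100, 100]
hidden = [init_layer(a, b, rng) for a, b in zip(sizes, sizes[1:])]
output = init_layer(sizes[-1], 48, rng)
x_out, latent = forward([rng.random() for _ in range(48)], hidden, output)
```

After training, the 2 values in `latent` would serve as the low-dimensional features passed to the logistic regression or Cox model.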
The inventor set f to the hyperbolic tangent (tanh) function for all the hidden layers. Assuming that the output has Gaussian distribution, the inventor set the link function g to the identity function. The model parameters, W and b, are estimated such that Xout becomes a close approximation (in terms of mean square error) of Xin. More specifically, the H2O package in R was used to fit the model, and the corresponding code is presented below. To improve the generalizability of the model, the drop-out ratio from the input layer was set to 50% and the L2 regularization term was set to 0.01. The top 7 contributing variables to the model were Mute.MYD88, Mute.EZH2, RASGRF1, MYBL1, S1PR2, SSBP2, IRF4.
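The training objective described in this paragraph can be illustrated in plain Python. This is not the H2O R code referred to above; it is a generic sketch that assumes the penalty is the L2 coefficient times the sum of squared weights and that input drop-out simply zeroes each input with the given probability. H2O's internal scaling of either term may differ.

```python
import random


def mean_squared_error(x_in, x_out):
    """Reconstruction error between the input and its reconstruction."""
    return sum((a - b) ** 2 for a, b in zip(x_in, x_out)) / len(x_in)


def l2_penalty(weight_matrices, l2=0.01):
    """L2 coefficient times the sum of squared weights (assumed form)."""
    return l2 * sum(w * w for matrix in weight_matrices for row in matrix for w in row)


def input_dropout(x_in, ratio=0.5, rng=random):
    """Zero each input independently with probability `ratio`."""
    return [0.0 if rng.random() < ratio else v for v in x_in]


def objective(x_in, x_out, weight_matrices, l2=0.01):
    """Quantity minimized during training: reconstruction error plus penalty."""
    return mean_squared_error(x_in, x_out) + l2_penalty(weight_matrices, l2)
```

During training the network would see `input_dropout(x_in)` as its input while the error is still measured against the original `x_in`, which discourages reliance on any single variable.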
This application claims the benefit of U.S. Provisional Application No. 62/899,007, filed 11 Sep. 2019, entitled “CELL OF ORIGIN CLASSIFICATION OF DLBCL USING TARGETED NGS EXPRESSION PROFILING AND DEEP LEARNING”, attorney docket no. 104183.0001PRO, the contents of which are hereby incorporated by reference in their entirety, except where inconsistent with the present application.
Number | Date | Country
---|---|---
62899007 | Sep 2019 | US