The invention is in the field of medical diagnostics, in particular in the field of disease diagnostics and monitoring. The invention is directed to markers for the detection of disease, to methods for detecting disease, and to a method for determining the efficacy of a disease treatment.
Cancer is one of the leading causes of death in developed countries. Studies have revealed that many cancer patients are diagnosed at a late stage, when they are more difficult to treat. Cancer is mainly driven by successive mutations in normal cells, resulting in DNA damages and ultimately causing significant gene alterations that contribute to a cancerous state.
Cancer is often diagnosed on the basis of tumor markers. Tumor markers are substances that are present in a cancer cell or that is produced in another cell in response to a cancer. Some tumor markers are also present in normal cells but, for example, in an alternative form of at higher levels, in a cancerous cell. Tumor markers can often be identified in a liquid sample, such as blood, urine, stool, or bodily fluids.
Most presently used tumor markers are proteins. An example is prostate-specific antigen (PSA), which is used as a tumor marker for prostate cancer. Most single tumor markers are not reliable to be useful in the management of an individual patient with cancer. Alternative markers, such as gene expression levels and DNA alterations such as DNA methylation, have begun to be used as tumor markers. The identification of alterations in expression levels and/or genomic DNA of multiple genes may improve detection, diagnosis, prognosis and treatment of cancer. Extensive data mining and statistical analysis is required to discover combinations of tumor markers that can differentiate between normal variation and a cancerous state.
Blood-based liquid biopsies, including tumor-educated blood platelets (TEPs; Nilsson et al., 2011. Blood 118: 3680-3683; Best et al., 2015. Cancer Cell 28: 666-676; Nilsson et al., 2015. Oncotarget 7: 1066-1075), have emerged as promising biomarker sources for non-invasive detection of cancer and therapy selection. A notorious challenge is the identification of optimal biomarker panels from such liquid biosources. To select robust biomarker panels for disease classification the use of ‘swarm intelligence’ was proposed, especially particle swarm optimization (PSO) (Kennedy et al., 2001. The Morgan Kaufmann Series in Evolutionary Computation. Ed: David B. Fogel; Bonyadi and Michalewicz 2016. Evolutionary computation: 1-54; Kennedy and Eberhart, 1995. Proceedings of IEEE International Conference on Neural Networks: 1942-1948).
PSO-driven algorithms are inspired by the concomitant swarm of birds and schools of fish that by self-organisation efficiently adapt to their environment or identify sources of food. Bioinformatically, PSO algorithms are exploited for the identification of optimal solutions for complex parameter selection procedures, including the selection of biomarker gene lists (Alshamlan et al., 2015. Computational Biol Chem 56: 49-60; Martinez et al., 2010. Computational Biol Chem 34: 244-250).
Targeted therapy and personalized medicine are critically depending on disease profiling and the development of companion diagnostics. Mutations in disease-derived nucleic acids can be highly predictive for the response to targeted treatment. However, obtaining easily accessible high-quality nucleic acids remains a significant developmental hurdle. Blood generally contains 150,000-350,000 thrombocytes (platelets) per microliter, providing a highly available biomarker source for research and clinical use. Moreover, thrombocyte isolation is relatively simple and is a standard procedure in blood bank/hematology labs. Since platelets do not contain a nucleus, their RNA transcripts—needed for functional maintenance—are derived from bone marrow megakaryocytes during thrombocyte origination. In addition, thrombocytes may take up RNA and/or DNA from other cells during circulation via various transfer mechanisms. Tumor cells for instance release an abundant collection of genetic material, some of which is secreted by microvesicles in the form of mutant RNA During circulation in the blood stream thrombocytes may absorb the genetic material secreted by cancer cells and other diseased cells, serving as an attractive platform for the companion diagnostics of cancer, specifically in the context of personalized medicine.
The present invention provides a method of administering immunotherapy that modulates an interaction between programmed death protein 1 (PD-1) and its ligand, to a cancer patient, comprising the steps of providing a sample from the patient, the sample comprising mRNA products that are obtained from anucleated cells of said patient; determining a gene expression level for at least four genes, more preferred at least five genes, more preferred at least six genes listed in Table 1 in said sample; comparing said determined gene expression level to a reference expression level of said genes in a reference sample; typing the patient as a positive responder to said immunotherapy, or as a not-positive responder, based on the comparison with the reference; and administering immunotherapy to a cancer patient that is typed as a positive responder.
In a preferred method of the invention, a gene expression level is determined for at least four genes listed in Table 1, more preferred at least five genes, more preferred at least six genes, more preferred at least ten genes, more preferred at least fifty genes, more preferred all genes, listed in Table 1.
Said immunotherapy that modulates an interaction between PD-1 and its ligand, PD-L1 or PD-L2, is aimed at activating the immune system to attack the cancer of the patient. Known modulators that inhibit interaction between PD-1 and its ligand include monoclonal antibodies such as atezolizumab (Genentech Oncology/Roche), avelumab (Merck/Pfizer), durvalumab (AstraZeneca/MedImmune), nivolumab (Bristol-Myers Squibb), lambrolizumab (Merck), pidilizumab (CureTech) and pembrolizumab (Merck), and fusion proteins such as AMP-224 (GlaxoSmithKline). A preferred immunotherapy comprises nivolumab.
In another embodiment, the invention provides a method of typing a sample of a subject for the presence or absence of a lung cancer, comprising the steps of providing a sample from the subject, whereby the sample comprises mRNA products that are obtained from anucleated cells of said subject; determining a gene expression level for at least five genes listed in Table 2; comparing said determined gene expression level to a reference expression level of said genes in a reference sample; and typing said sample for the presence or absence of a lung cancer on the basis of the comparison between the determined gene expression level and the reference gene expression level.
Said subject, a mammalian, preferably a human, is not known to suffer from lung cancer. Said lung cancer preferably is a non-small cell lung cancer.
In a preferred method of the invention, a gene expression level is determined for at least ten genes listed in Table 2, more preferred at least forty five genes, more preferred at least fifty genes, more preferred all genes, listed in Table 2.
Anucleated cells, as referred to herein above, may act as local and systemic responders during tumorigenesis and cancer metastasis, thereby being exposed to tumor-mediated education, and resulting in altered behaviour. Anucleated cells, such as thrombocytes can function as a RNA biomarker trove to detect and classify cancer from diverse sources. Said RNA present in anucleated cells preferably originates from tumor cells, and is transferred from tumor cells to anucleated cells. These anucleated cells can be easily isolated from a liquid biopsy such as blood and may contain RNA from nucleated tumor cells.
Said sample comprising mRNA products is preferably obtained from a liquid biopsy, preferably blood. Said anucleated cells preferably are or comprise thrombocytes. In a preferred embodiment, thrombocytes are isolated from a blood sample and mRNA is subsequently isolated from said isolated thrombocytes.
A gene expression level for at least four genes listed in Table 1, more preferred at least five genes listed in Table 1, and/or for at least five genes listed in Table 2, in said sample may be determined by any method known in the art, including micro-array-based analyses, serial analysis of gene expression (SAGE), multiplex Polymerase Chain Reaction (PCR), multiplex Ligation-dependent Probe Amplification (MLPA), bead based multiplexing such as Luminex/XMAP, and high-throughput sequencing including next generation sequencing. The gene expression level is preferably determined by next generation sequencing.
The invention further provides a method of treating a cancer patient, preferably a lung cancer patient, by assigning immunotherapy that modulates an interaction between PD-1 and its ligand to said patient, wherein said cancer patient is selected by typing a sample from the patient, the sample comprising mRNA products that are obtained from anucleated cells of said subject; determining a gene expression level for at least four genes listed in Table 1, more preferred at least five genes listed in Table 1; comparing said determined gene expression level to an expression level of said genes in a reference sample; typing the patient as a positive responder to said immunotherapy, or as a not-positive responder, based on the comparison with the reference; and assigning immunotherapy to a cancer patient that is selected as a positive responder.
Further provided is immunotherapy that modulates an interaction between PD-1 and its ligand, for use in a method of treating a cancer patient, preferably a lung cancer patient, wherein said cancer patient is selected by typing a sample from the patient, the sample comprising mRNA products that are obtained from anucleated cells of said subject; determining a gene expression level for at least four genes listed in Table 1, more preferred at least five genes listed in Table 1; comparing said determined gene expression level to an expression level of said genes in a reference sample; typing the patient as a positive responder to said immunotherapy, or as a not-positive responder, based on the comparison with the reference; and assigning immunotherapy to a cancer patient that is selected as a positive responder.
As is indicated herein above, said immunotherapy that modulates an interaction between PD-1 and its ligand. PD-L1 or PD-L2, is aimed at activating the immune system to attack the cancer of the patient. Known modulators that inhibit interaction between PD-1 and its ligand include monoclonal antibodies such as atezolizumab (Genentech Oncology/Roche), avelumab (Merck/Pfizer), durvalumab (AstraZeneca/MedImmune), nivolumab (Bristol-Myers Squibb), lambrolizumab (Merck), pidilizumab (CureTech) and pembrolizumab (Merck), and fusion proteins such as AMP-224 (GlaxoSmithKline). A preferred immunotherapy comprises nivolumab.
The invention further provides a method for obtaining a biomarker panel for typing of a sample from a subject, the method comprising isolating anucleated cells, preferably thrombocytes, from a liquid sample of a subject having condition A; isolating RNA from said isolated cells; determining RNA expression levels for at least 100 genes in said isolated RNA; determining RNA expression levels for said at least 100 genes in a control sample from a subject not having condition A; and using particle swarm optimization-based algorithms to obtain a biomarker panel that discriminates between a subject having condition A and a subject not having condition A.
It is preferred that the subject having condition A is suffering from a cancer, preferably a lung cancer, or had a known, positive response to a cancer treatment, while a subject not having condition A is not suffering from a cancer, or had a known, negative response to a cancer treatment.
(a) Overview of Non-cancer and NSCLC platelet samples (total of 728) included in this study for thromboSeq. (b) Overview of alternative splicing analyses, the estimated contribution to the TEP signatures, and additional Figures related to these analyses. RBP=HNA-binding protein (c) Schematic representation of the particle-swarm intelligence approach. Light to dark grey colored dots represent AUC-values of 38 samples classified using a thromboSeq classification algorithm, with use of 100 randomly selected parameters (left) or 100 parameters selected by swarm-intelligence (right). Dots were mirrored twice for visualization purposes. Most optimal AUC-value reached by swarm-enhanced thromboSeq is indicated in both plots with an asterisk. (d) ROC analysis of swarm-enhanced thromboSeq classifications using Non-cancer and NSCLC cohorts matched for patient age and blood storage time. Grey dashed line indicates ROC evaluation of the training cohort assessed by LOOCV, red line indicates ROC evaluation of the evaluation cohort (n=40), blue line indicates ROC evaluation of the validation cohort (n=130). Indicated are cohort size, most optimal accuracy, and AUC-value. Acc.=accuracy. (e) Performance of the swarm-enhanced thromboSeq algorithm evaluated in the full 728-samples cohort summarized in a ROC curve. Swarm intelligence made use of the evaluation cohort (red line, n=88 samples) to optimize the classification performance of the 120-samples training cohort by selection of the biomarker gene panel. The swarm-enhanced thromboSeq NSCLC diagnostics algorithm was validated using a patient age and/or blood storage time-unmatched cohort (n=520, blue line). Performance of the training cohort, assessed by LOOCV, is indicated with a grey dashed line. Indicated are cohort, size, most optimal accuracy, and AUC-value. Acc.=accuracy.
(a) Schematic overview of the experimental setup. Blood of patients eligible for treatment with PD-1 inhibitor nivolumab was included from one month before till start of treatment (baseline, t=0). Tumor response read-out based on CT-imaging and according to the RECIST 1.1, criteria were performed at 6-8 weeks, 3 months, and 6 months, after start of nivolumab therapy. Best response was selected as overall tumor response (see Example 1). (b) Heatmap of unsupervised clustering of platelet mRNAs following swarm-intelligence driven gene panel selection of responders (blue, n=44) and non-responders (red, n=60). (c) ROC analysis of the swarm-enhanced thromboSeq nivolumab response prediction algorithm of 104 nivolumab baseline samples. Training cohort performance as measured by LOOCV approach is indicated by a red line, dependent evaluation cohort by a black line, and independent validation cohort by a blue line. Grey solid (upper bound) and dotted (lower bound) lines indicate the ROC curve resulting from a randomly trained algorithm. The black dot indicates a potential clinical threshold of the algorithm for optimal therapy selection and non-responder rule-out. (d) A 2×2 cross-table indicating the classification accuracies of the independent validation cohort, with the thromboSeq classification read-out optimized towards a rule-out value. A 100% sensitivity results in 53% specificity. Indicated are sample numbers and percentages.
(a) Schematic representation of thromboSeq machine learning-based liquid biopsies for cancer diagnostics and therapy monitoring. A library of RNA-seq data generated from platelets of individuals with different diseases and healthy individuals served as input for thromboSeq algorithm development. Following algorithm optimization using the swarm-module and model validation, the platform enables RNA signature-based disease classification and therapy monitoring. (b) Schematic representation and sample cohort details of the training, evaluation, and validation cohorts. Cohorts are used for assessing the analytical performance of swarm-enhanced thromboSeq and to investigate the diagnostic classification power in patient age- and blood storage time-matched cohorts. The patient age and blood storage time-matched cohort was validated on a 130-samples training cohort, optimized using a 40-samples evaluation cohort.
(a) Overview of the demographic characteristics of the platelet sample cohort (n=263) matched for patient age and blood storage time. Characteristics are shown for both Non-cancer (n=104) and NSCLC (n=159) individuals. Indicated per clinical group are number of male individuals and percentage of total, median age (including interquartile range (IQR) and minimal and maximum age, in years), smoking status and percentage of total, and metastasis of the primary NSCLC towards other organs (yes/no). n.a.=not applicable. (b) Overview of platelet activation markers as measured by flow cytometric analysis of n=3 (8 hours time point) or n=6 (other time points) platelet samples collected from healthy donors and isolated using the thromboSeq platelet isolation protocol. Light and dark grey boxes represent average percentage of platelets expressing respectively P-selectin or CD-63 on the surface. The box indicates the interquartile range (IQR), black line represents the median, and the whiskers indicate 1.5×IQR. Dots represent expression of these surface markers after platelet activation with TRAP (see Example 1). Platelet samples are only minimally activated using the thromboSeq platelet isolation protocol. (c) Summary of the platelet total RNA yield in nanograms per microliter isolated from 6 mL whole blood in EDTA-coated Vacutainers tubes. The RNA concentration and quality was measured by Bioanalyzer RNA Picochip analysis. Total RNA yield was summarized in boxplots for both Non-cancer (n=86) and NSCLC (n=151) separately. The box indicates the interquartile range (IQR), black line represents the median, and the whiskers indicate 1.5×IQR. Platelets of NSCLC patients had a significantly higher RNA yield as compared to Non-cancer patients (p=0.0014, two-sided independent Student's t-test). (d) Linearity and efficiency of SMARTer cDNA synthesis and amplification using the thromboSeq protocol. Correlation plot of estimated RNA input (x-axis, in pg/μL) to the output SMARTer cDNA yield (y-axis, in nM, n=177 observations in total). Each dot represents a sample, color-coded by clinical group. An average RNA input, as measured by Bioanalyzer Picochip RNA, of ˜500 pg was used for SMARTer cDNA synthesis and PCR amplification. The RNA input and cDNA output showed a positive correlation (r=0.23, p=0.003, Pearson's correlation). (e) Linearity and efficiency of Truseq cDNA library preparation and PCR amplification using the thromboSeq protocol. Correlation plot of SMARTer cDNA yield used as input (x-axis, in nM) to the outputted Truseq platelet cDNA sequence library yield (y-axis, in nM, n=177 observations in total). Each dot represents a sample, color-coded by clinical group. All SMARTer cDNA output, except a 1.5 μL purification buffer aliquot for Bioanalyzer analysis, was used as input for the Truseq Library Preparation. The SMARTer cDNA yield and Truseq platelet cDNA library output showed a positive correlation (r=0.44, p<0.0001, Pearson's correlation). (f) Bioanalyzer traces of samples with spiked, smooth, and intermediate spiked/smooth profiles. For each example, the total RNA on Picochip Bioanalyzer, the SMARTer amplified cDNA on DNA High Sensitivity Bioanalyzer, and Truseq cDNA library on DNA 7500 Bioanalyzer are shown. X-axes indicate the length of the product (in nucleotides (nt) for RNA, and base pairs (bp) for cDNA), while y-axes indicate the relative fluorescence as measured by the Bioanalyzer 2100. From spiked towards smooth SMARTer cDNA samples, a gradual increase of smoothness of the SMARTer cDNA Bioanalyzer slopes was observed, while the total RNA and Truseq cDNA show non-distinguishable profiles. (g) Overview of the relative cDNA yield in nM resulting from the SMARTer amplification (top), relative cDNA length in bp of the spiked, smooth, and intermediate spiked/smooth SMARTer cDNA groups (middle), and number of intron-spanning spliced RNA reads (bottom). cDNA concentrations were measured by the area-under-the-graph on a Bioanalyzer cDNA High Sensitivity chip. cDNA yield is comparable among the three distinct SMARTer profiles. The relative cDNA length was measured by selection of a 200-9000 bp region in the Bioanalyzer software. The SMARTer cDNA slopes are strongly correlated to the average cDNA length. The contribution of reads mapping to intergenic regions do negatively influence the number of intron-spanning reads eligible for thromboSeq analysis. Number of samples per SMARTer slope and clinical group is shown below the graph. The box indicates the interquartile range (IQR), black line represents the median, and the whiskers indicate 1.5×IQR. (h) Histogram of the average fragment length of reads mapped to intergenic regions for both spiked (upper) and smooth (bottom) samples (n=50 each, randomly sampled). Overlapping reads mapping to intergenic regions were merged (see Online Methods), and total resulting fragment sizes were quantified. Both spiked and smooth samples contain primarily fragments of <250 nt, with a peak in the 100-200 nt region. (i) Selection of intron-spanning spliced RNA reads for thromboSeq analysis. Stackplot indicates the distribution of reads for each sample, subspecified from intron-spanning, exonic, intronic, intergenic, and mitochondrial regions. Of note, the intron-spanning reads were subtracted from the reads mapping to the exonic regions. Samples (n=263) were sorted according to the proportion (y-axis) of intron-spanning reads. (j) Selection of samples with >3000 genes detected for thromboSeq analysis. Plot indicates for 740 platelet RNA samples subjected to thromboSeq the total number of intron-spanning reads (x-axis), and the number of genes detected (y-axis), with at least one intron-spanning read. The number of detected genes is partially correlated to the total number of intron-spanning reads yielded per sample. Samples with less than 3000 genes detected (n=10) were excluded from analyses. (k) Summary of the number of genes detected with confidence (i.e. >30 spliced RNA reads) in the platelet RNA samples using shallow thromboSeq (10-20 million reads on average), shown for both Non-cancer (n=377) and NSCLC (n=353) cohorts. The box indicates the interquartile range (IQR), black line represents the median, and the whiskers indicate 1.5×IQR. The average detection of genes per samples is ˜4500 different RNAs, and slightly decreased on average in platelets of NSCLC patients as compared to Non-cancer individuals. (l) Comparison of shallow versus deep thromboSeq. A total of 12 platelet RNA samples collected from healthy controls were subjected to deep thromboSeq (median 59.7 (min-max: 43.2-96.2) million raw reads counts per sample) and compared with the matched shallow thromboSeq RNA-seq data. For the deep thromboSeq, platelet samples were reprepared for sequencing, starting from platelet total RNA, with comparable input concentrations. Plot indicates the raw read counts for each gene (log-transformed y-axis) sorted by median read counts of all samples (x-axis). The three genes with highest expression in deep thromboSeq are highlighted. (m) Leave-one-sample-out cross-correlation. To investigate the comparability of one sample (test case) to all other sample (reference cohort), we performed cross-correlations, during which counts of each sample was correlated to the median counts of all other samples. This step was included as a quality control step (see Online Methods) following the selection for samples with sufficient number of genes detected (see also (j)). The cross-correlation was calculated 730 times, i.e. all samples were left out of the reference cohort, once. Results indicate that all samples show high inter-sample Pearson's correlations. Samples with a inter-sample-correlation <0.5 (n=2) were excluded from analyses.
(a) Unsupervised hierarchical clustering of differentially spliced RNAs between Non-cancer (n=104) and NSCLC (n=159) individuals. A total of 1625 genes (698 up, 927 down) showed a significance with a False Discovery Rate <0.01 (see Example 3). Columns indicate samples, rows indicate genes, and color intensity represent the z-score transformed RNA expression values (prior to visualization subjected to the RUV-based iteration correction-module). Clustering of samples showed non-random partitioning (p<0.0001, Fisher's exact test). (b) PAGODA gene ontology analysis (see Example 1). Significantly enriched genes were subjected to unbiased gene cluster identification and gene ontology analysis. Most significant results by adjusted Z-score, indicating high statistical significance, were clustered and visualized. Grey code indicates a dark to light (low-to-high) score per sample per gene cluster. The most significant biological group (maximum adjusted Z-score of 13.9) includes gene ontologies related to translation, RNA binding proteins (RBPs), and signaling, with a low splicing score in NSCLC samples compared to Non-cancer samples. The most significantly enriched gene cluster in NSCLC patients compared to Non-cancer individuals is related to signaling and immune response (maximum adjusted Z-score of 5.3). This clustering analysis identified correlations between platelet homeostasis gene signatures in platelets of Non-cancer individuals and specific immune signaling pathways in TEPs of NSCLC patients. RBP=RNA binding proteins.
(a) Schematic figure represents the read distribution analyses procedure. From the patient age- and blood storage time-matched cohort, we mapped 100 bp reads to the human genome and quantified the number of reads mapping to four distinct regions (see Example 3). i.e. exonic, intronic, and intergenic regions (together the ‘genomic regions’) and the mitochondrial genome (abbreviated as mtDNA). Of note, the intron-spanning spliced reads were included in the exonic regions. (b) Boxplots indicate for Non-cancer (light grey, n=104) and NSCLC (dark grey, n=159) the median and spread of reads mapping to mitochondrial (mtDNA), exonic, intronic, or intergenic regions, and the median and spread of intron-spanning and exon boundary reads. The box indicates the interquartile range (IQR), black line represents the median, and the whiskers indicate 1.5×IQR. Intron-spanning reads are defined as reads that start on exon a and end on exon b. Exon boundary reads are defined as reads that overlay a neighbouring exon-intron boundary. The exonic, intronic, intergenic, intron-spanning, and exon boundary reads were normalized to one million total genomic reads. (c) Summary figure of the analysis of alternative RNA isoforms. Schematic figure represents the development of an isoform count matrix. For this, 92 bp trimmed RNA-seq reads were mapped to the human genome and following subjected to the MISO algorithm. The MISO algorithm allows for inferring expressed RNA isoforms from single read RNA-seq data. MISO output data was deconvoluted into a count matrix that contains per sample for each expressed RNA isoform the number of reads supporting that particular isoform. The count matrix of 104 Non-cancer individuals and 159 NSCLC patients was used for differential expression analysis. Isoforms with a significance value (FDR)<0.01 were selected. Piechart of the total number of differentially spliced RNA isoforms (FDR<0.01, n=743, summarized in color codes) per gene (n=571, summarized in the pies of the piechart), indicating the distribution of significantly altered isoforms between Non-cancer and NSCLC per parent gene. In 38% of the significantly altered RNA isoforms multiple isoforms belonged to the same parent gene, supporting the notion that some genes show co-regulation of multiple RNA isoforms. Pie chart of total number of genes (n=571 in total) that shows for all RNA isoforms co-increased expression levels ( 277/571, 49%), co-decreased expression levels ( 281/571, 49%), or alternative splicing ( 13/571, 2%). Additional details are provided in Example 2. (d) Summarizing figure of the exon skipping events analysis. Schematic figure represent the experimental approach for detection of exon skipping events. Reads were mapped and analyzed using the MISO algorithm, which infers reads favouring either inclusion (on top of the schematic figure) or exclusion (below of schematic figure) of the specific exon. For this, the algorithm does also takes reads mapping to neighbouring exons into account. After filtering for average read coverage in the majority of the sample cohort (see Online Methods), a total 230 exons remained eligible for analysis. Percent spliced in (PSI)-values, as outputted by MISO, were used for differential ANOVA statistics. A total of 27 exons were identified as potentially skipped in either Non-cancer or NSCLC samples (FDR<0.01). The histogram shows the direction of the PSI-value, where positive PSI-values favour exclusion in Non-cancer, and negative PSI-values favour exclusion in NSCLC. The gene names of the annotated events, sorted by FDR-value and filtered for unique gene names, are listed in the box. Additional details are provided in Example 2.
(a) Correlation plot of proportion of reads mapping to exonic coordinates (x-axis) versus the log-transformed. RUV-corrected, and counts-per-million of P-selectin. Each dot represent a sample, coded by clinical group (NSCLC, n=159, dark grey, and Non-cancer, n=104, light grey). The exonic reads correlate with the expression levels of P-selectin (r=0.51, p<0.001). (b) Distribution of correlation coefficients of the correlation between log-transformed counts-per-million levels of 4722 genes and the log-transformed counts-per-million of P-selectin. A subset of the genes show a strong correlation with P-selectin (r approximates −1 or 1), whereas other do not (r approximates 0). For the histogram, a bin size of 0.05 was used. (c) Venn-diagram overlay of genes upregulated in the NSCLC TEP signature (698 genes, see also
(a) Schematic biological model highlighting the difference between nucleated cells and anucleated platelets in the context of regulation of translation. Nucleated cells (left) are able to regulate and maintain the transcriptome by transcription factor (TF)-mediated DNA transcription, resulting in protein translation. Anucleated platelets lack genomic DNA, and thus the ability to regulate the RNA content by TFs. Circulating platelets retain the ability to selectively splice the pre-mRNA repertoire, suggesting a key regulatory function during the induction of splicing events. (b) Schematic representation of the RBP-thromboSearch engine algorithm. The algorithm is designed to identify correlations between the presence of RBP motif sequences in specific genomic regions of the genome, here applied to 5′-UTRs and 3′-UTRs. At start, the algorithm extracts the reference sequence of the regions of interest from the human genome (hg19). In addition, the algorithm was complemented with validated RBP binding sites motif sequences that were previously identified (Ray et al., 2013. Nature 499: 172-177). By reduction of the motif sequences, 547 non-redundant oligonucleotide sequences were matched with the UTR reference sequences, and all matched counts (ranging 0 to 460) were summarized in a UTR-to-motif matrix, used for downstream analyses. For further details of the RBP-thromboSearch engine algorithm, see Example 1. (c) UTR-read coverage filter. UTR regions (n=19180, x-axis) included in this analysis were quantified for number of mapping reads (y-axis). UTRs with more than five (5′-UTRs) or three (3′-UTRs) mapped reads were considered present in platelets. Blue dots represent mean counts across all samples, grey shade indicates the respective standard deviations. (d) Enrichment of identified RBP binding sites per UTR region. The x- and y-axes represent the mean binding sites for the 5′ and 3′-UTR per RBP (dots, n=102). Several RBPs are specifically enriched in the 3′-UTR, whereas others are enriched in the 5′-UTR (see also Example 4). (e and f) Heatmap of all RBPs (n=80, rows) and all 5′-UTR (e) and 3′-UTH (f) regions detected with sufficient coverage in platelets (n=3210 for 5′-UTR, and n=3720 for 3′UTR, columns, see Example 4). Number of binding sites is reflected by the heatmaps colors (see grey scale). UTR regulation by RBPs seems to be mediated by presence/absence of RBP binding sites. (g) Correlation analysis between n binding sites of an RBP and the logarithmic fold-change (log FC) of genes (n=4722) in the NSCLC/Non-cancer differential splicing analysis (see also
(a) Schematic overview of the iterative correction module as implemented in thromboSeq. The RNA-seq data correction procedure includes multiple steps, i.e. 1) filtering of low abundant genes, 2) determination of stable genes among confounding variables, 3) raw-read counts Remove Unwanted Variation (RUV)-based factor analysis and correction, and 4) reference group-mediated counts-per-million and TMM-normalisation (see also Example 1). In detail, in step 1 genes with low confidence of detection, i.e. less than 30 intron-spanning spliced RNA reads in more than 90% of the sample cohort, are excluded. In the schematic example, the two upper genes (rows) contain in >90% of the samples (in this schematic example n=10 in total) sufficient numbers of reads, as indicated by the light grey boxes. Thus, these genes will be included for analysis. The lower two boxes indicate insufficient numbers of samples with sufficient numbers of genes, thus prompting the algorithm to remove these particular genes from the downstream analyses. Secondly, the algorithm searches for genes that show a stable expression pattern among all other samples. For this, the algorithm performs multiple Pearson's correlation analyses among a (potential confounding) variable and raw read counts, resulting in a distribution of the correlation coefficients. In the schematic figure, this is shown for intron-spanning reads library size (left) and patient age (right). The correlation distribution is shown below, and the putative thresholds (also subjected to PSO selection, see (e)) are indicated by black lines. Of note, as the raw intron-spanning read counts are normalised by counts-per-million normalization afterwards, stable genes have to approximate a correlation coefficient of one (see also
As used herein, the term “cancer” refers to a disease or disorder resulting from the proliferation of oncogenically transformed cells. “Cancer” shall be taken to include any one or more of a wide range of benign or malignant tumours, including those that are capable of invasive growth and metastasis through a human or animal body or a part thereof, such as, for example, via the lymphatic system and/or the blood stream. As used herein, the term “tumour” includes both benign and malignant tumours or solid growths, notwithstanding that the present invention is particularly directed to the diagnosis or detection of malignant tumours and solid cancers. Cancers further include but are not limited to carcinomas, lymphomas, or sarcomas, such as, for example, ovarian cancer, colon cancer, breast cancer, pancreatic cancer, lung cancer, prostate cancer, urinary tract cancer, uterine cancer, acute lymphatic leukaemia. Hodgkin's disease, small cell carcinoma of the lung, melanoma, neuroblastoma, glioma (e.g. glioblastoma), and soft tissue sarcoma, lymphoma, melanoma, sarcoma, and adenocarcinoma. In preferred embodiments of aspects of the present invention, thrombocyte cancer is disclaimed.
The term “liquid biopsy”, as is used herein, refers to a liquid sample that is obtained from a subject. Said liquid biopsy is preferably selected from blood, urine, milk. cerebrospinal fluid, interstitial fluid, lymph, amniotic fluid, bile, cerumen, feces, female ejaculate, gastric juice, mucus pericardial fluid, pleural fluid, pus, saliva, semen, smegma, sputum, synovial fluid, sweat, tears, vaginal secretion, and vomit. A preferred liquid biopsy is blood.
The term “blood”, as is used herein, refers to whole blood (including plasma and cells) and includes arterial, capillary and venous blood.
The term “anucleated blood cell”, as used herein, refers to a cell that lacks a nucleus. The term includes reference to both erythrocyte and thrombocyte. Preferred embodiments of anucleated cells according to this invention are thrombocytes. The term “anucleated blood cell” preferably does not include reference to cells that lack a nucleus as a result of faulty cell division.
The term “thrombocyte”, as used herein, refers to blood platelets. i.e. small, irregularly-shaped cell fragments that do not have a DNA-containing nucleus and that circulate in the blood of mammals. Thrombocytes are 2-3 μm in diameter, and are derived from fragmentation of precursor megakaryocytes. Platelets or thrombocytes lack nuclear DNA, although they retain some megakaryocyte-derived mRNAs as part of their lineal origin. The average lifespan of a thrombocyte is 5 to 9 days. Thrombocytes are involved and play an essential role in hemostasis, leading to the formation of blood clots.
The present invention describes methods of diagnosing, prognosticating or predicting a response to treatment, based on analyzing gene expression levels in anucleated cells such as thrombocytes extracted from blood. This approach is robust and easy. This is attributed to the rapid and straight forward extraction procedures and the quality of the extracted nucleic acid. Within the clinical setting, thrombocytes extraction from blood samples is implemented in general biological sample collection and therefore it is foreseen that the implementation into the clinic is relatively easy.
The present invention provides general methods of diagnosing, prognosticating or predicting treatment response of a subject using said general methods. When reference is herein made to a method of the invention, any and all of these embodiments are referred to, except if explicitly indicated otherwise.
A method of the invention can be performed on any suitable body sample comprising anucleated blood cells, such as for instance a tissue sample comprising blood, but preferably said sample is whole blood.
A blood sample of a subject can be obtained by any standard method, for instance by venous extraction.
The amount of blood that is required is not limited. Depending on the methods employed, the skilled person will be capable of establishing the amount of sample required to perform the various steps of the methods of the present invention and obtain sufficient nucleic acid for genetic analysis. Generally, such amounts will comprise a volume ranging from 0.01 μl to 100 ml, preferably between 1 μl to 10 ml, more preferably about 1 ml.
The body fluid, preferably blood sample, may be analyzed immediately following collection of the sample. Alternatively, analysis according to the method of the present invention can be performed on a stored body fluid or on a stored fraction of anucleated blood cells thereof, preferably thrombocytes. The body fluid for testing, or the fraction of anucleated blood cells thereof, can be preserved using methods and apparatuses known in the art. In an anucleated blood cell fraction, the thrombocytes are preferably maintained in inactivated state (i.e. in non-activated state). In that way, the cellular integrity and the disease-derived nucleic acids are best preserved. A thrombocyte-containing sample from a body fluid preferably does not include platelet poor plasma or platelet rich plasma (PRP). Further isolation of the platelets is preferred for optimal resolution.
A body fluid, preferably blood sample, may suitably be processed, for instance, it may be purified, or digested, or specific compounds may be extracted therefrom. Anucleated cells may be extracted from the sample by methods known to the skilled person and be transferred to any suitable medium for extraction of nucleic acid. The subject's body fluid may be treated to remove nucleic acid degrading enzymes like RNases and DNases, in order to prevent destruction of the nucleic acids.
Thrombocyte extraction from the body sample of the subject may involve any available method. In transfusion medicine, thrombocytes are often collected by apheresis, a medical technology in which the blood of a donor or patient is passed through an apparatus that separates out one particular constituent and returns the remainder to the circulation. The separation of individual blood components is done with a specialized centrifuge. Plateletpheresis (also called thrombopheresis or thrombocytapheresis) is an apheresis process of collecting thrombocytes. Modern automatic plateletpheresis allows blood donors to give a portion of their thrombocytes, while keeping their red blood cells and at least a portion of blood plasma. Although it is possible to provide the body fluid comprising thrombocytes as envisioned herein by apheresis, it is often easier to collect whole blood and isolate the thrombocyte fraction therefrom by centrifugation. Generally, in such a protocol, the thrombocytes are first separated from other blood cells by a centrifugation step of about 120×g for about 20 minutes at room temperature to obtain a platelet rich plasma (PRP) fraction. The thrombocytes are then washed, for example in phosphate-buffered saline/ethylenediaminetetraacetic acid, to remove plasma proteins and enrich for thrombocytes. Wash steps are generally followed by centrifugation at 850-1000×g for about 10 min at room temperature. Further enrichments can be carried out to yield more pure thrombocyte fractions.
Platelet isolation generally involves blood sample collection in Vacutainer tubes containing anticoagulant citrate dextrose (e.g. 36 ml citric acid, 5 mmol KCl, 90 mmol/l NaCl, 5 mmol/l glucose, 10 mmol/l EDTA pH 6.8). A suitable protocol for platelet isolation is described in Ferretti et al. (Ferretti et al., 2002. J Clin Endocrinol Metab 87: 2180-2184). This method involves a preliminary centrifugation step (1,300 rpm per 10 min) to obtain platelet-rich plasma (PRP). Platelets may then be washed three times in an anti-aggregation buffer (Tris-HCl 10 mmol/l; NaCl 150 mmol/l; EDTA 1 mmol/l; glucose 5 mmol/l; pH 7.4) and centrifuged as above, to avoid any contamination with plasma proteins and to remove any residual erythrocytes. A final centrifugation at 4,000 rpm for 20 min may then be performed to isolate platelets. For quantitative determination, the protein concentration of platelet membranes may be used as internal reference. Such protein concentrations may be determined by the method of Bradford (Bradford, 1976. Anal Biochem 72: 248-254), using serum albumin as a standard.
A sample comprising anucleated cells can be freshly prepared at the moment of harvesting, or can be prepared and stored at −70° C. until processed for sample preparation. Preferably, storage is performed under conditions that preserve the quality of the nucleic acid content of the anucleated cells. Examples of preservative conditions are fixation using e.g. formaline and paraffin embedding, the addition of RNase inhibitors such as RNAsin (Pharmingen) or RNasecure (Ambion), the addition of aqueous solutions such as RNAlater (Assuragen; U.S. Ser. No. 06/204,375), Hepes-Glutamic acid buffer mediated Organic solvent Protection Effect (HOPE; DE10021390), and RCL2 (Alphelys; WO04083369), and the addition of non-aquous solutions such as Universal Molecular Fixative (Sakura Finetek USA Inc.; U.S. Pat. No. 7,138,226).
Methods to determine gene expression levels are known to a skilled person and include, but are not limited to, Northern blotting, quantitative PCR, microarray analysis and RNA sequencing. It is preferred that said gene expression levels are determined simultaneously. Simultaneous analyses can be performed, for example, by multiplex qPCR, RNA sequencing procedures, and microarray analysis. Microarray analysis allow the simultaneous determination of gene expression levels of expression of a large number of genes, such as more than 50 genes, more than 100 genes, more than 1000 genes, more than 10,000 genes, or even whole-genome based, allowing the use of a large set of gene expression data for normalization of the determined gene expression levels in a method of the invention.
Microarray-based analysis involves the use of selected biomolecules that are immobilized on a solid surface, an array. A microarray usually comprises nucleic acid molecules, termed probes, which are able to hybridize to gene expression products. The probes are exposed to labeled sample nucleic acid, hybridized, and the abundance of gene expression products in the sample that are complementary to a probe is determined. The probes on a microarray may comprise DNA sequences. RNA sequences, or copolymer sequences of DNA and RNA. The probes may also comprise DNA and/or RNA analogues such as, for example, nucleotide analogues or peptide nucleic acid molecules (PNA), or combinations thereof. The sequences of the probes may be full or partial fragments of genomic DNA. The sequences may also be in vitro synthesized nucleotide sequences, such as synthetic oligonucleotide sequences.
A probe preferably is specific for a gene expression product of a gene as listed in Tables 1-3. A probe is specific when it comprises a continuous stretch of nucleotides that are completely complementary to a nucleotide sequence of a gene expression product, or a cDNA product thereof. A probe can also be specific when it comprises a continuous stretch of nucleotides that are partially complementary to a nucleotide sequence of a gene expression product of said gene, or a cDNA product thereof. Partially means that a maximum of 5% from the nucleotides in a continuous stretch of at least 20 nucleotides differs from the corresponding nucleotide sequence of a gene expression product of said gene. The term complementary is known in the art and refers to a sequence that is related by base-pairing rules to the sequence that is to be detected. It is preferred that the sequence of the probe is carefully designed to minimize nonspecific hybridization to said probe. It is preferred that the probe is, or mimics, a single stranded nucleic acid molecule. The length of said complementary continuous stretch of nucleotides can vary between 15 bases and several kilo bases, and is preferably between 20 bases and 1 kilobase, more preferred between 40 and 100 bases, and most preferred about 60 nucleotides. A most preferred probe comprises about 60 nucleotides that are identical to a nucleotide sequence of a gene expression product of a gene, or a cDNA product thereof. In a method of the invention, probes comprising probe sequences as indicated in Tables 1-3 and 5-7 can be employed.
To determine the gene expression level by micro-arraying, the gene expression products in the sample are preferably labeled, either directly or indirectly, and contacted with probes on the array under conditions that favor duplex formation between a probe and a complementary molecule in the labeled gene expression product sample. The amount of label that remains associated with a probe after washing of the microarray can be determined and is used as a measure for the gene expression level of a nucleic acid molecule that is complementary to said probe.
A preferred method for determining gene expression levels is by sequencing techniques, preferably next-generation sequencing (NGS) techniques of RNA samples. Sequencing techniques for sequencing RNA have been developed. Such sequencing techniques include, for example, sequencing-by-synthesis. Sequencing-by-synthesis or cycle sequencing can be accomplished by stepwise addition of nucleotides containing, for example, a cleavable or photobleachable dye label as described, for example, in U.S. Pat. Nos. 7,427,673; 7,414,116; WO 04/018497 WO 91/06678; WO 07/123744; and U.S. Pat. No. 7,057,026. Alternatively, pyrosequencing techniques may be employed. Pyrosequencing detects the release of inorganic pyrophosphate (PPi) as particular nucleotides are incorporated into the nascent strand (Ronaghi et al., 1996, Analytical Biochemistry 242: 84-89; Ronaghi, 2001. Genome Res 11: 3-11; Ronaghi et al., 1998. Science 281: 363; U.S. Pat. Nos. 6,210,891; 6,258,568; and 6,274,320. In pyrosequencing, released PPi can be detected as it is immediately converted to adenosine triphosphate (ATP) by ATP sulfurylase, and the level of ATP generated is detected via luciferase-produced photons.
Sequencing techniques also include sequencing by ligation techniques. Such techniques use DNA ligase to incorporate oligonucleotides and identify the incorporation of such oligonucleotides and are inter alia described in U.S. Pat. Nos. 6,969,488; 6,172,218; and 6,306,597. Other sequencing techniques include, for example, fluorescent in situ sequencing (FISSEQ), and Massively Parallel Signature Sequencing (MPSS).
Sequencing techniques can be performed by directly sequencing RNA, or by sequencing a RNA-to-cDNA converted nucleic acid library. Most protocols for sequencing RNA samples employ a sample preparation method that converts the RNA in the sample into a double-stranded cDNA format prior to sequencing.
The determined gene expression levels are preferably normalized. Normalization refers to a method for adjusting or correcting a systematic error in the measurements for determining gene expression levels. Systemic bias may result from variation by differences in overall performance, differences in isolation efficiency of anucleated cells resulting in differences in purity of the isolated anucleated cells, and to variation between RNA samples, which can be due for example to variations in purity. Systemic bias can be introduced during the handling of the sample during the determination of gene expression levels.
The determined levels of expression of genes from Tables 1-3 in a sample are compared to the levels of expression of the same genes in a reference sample. Said comparison may result in an index score indicating a similarity of the determined expression levels in a sample of an individual, subject or patient, with the expression levels in the reference sample. For example, an index can be generated by determining a fold change/ratio between the median value of gene expression across samples that have been typed as being obtained from individuals suffering from cancer and the median value of gene expression across samples that are typed as being obtained from individuals not suffering from cancer. The relevance of this fold change/ratio as being significant between the two respective groups can be tested, for example, in an ANOVA (Analysis of variance) model. Univariate p-values can be calculated in the model and after multiple correction testing (Benjamini & Hochberg, 1995. JRSS, B, 57: 289-300) can be used as a threshold for determining significance that the gene expression shows a clear difference between the groups. Multivariate analysis may also be performed in adding covariates such as tumor stage/grade/size into the ANOVA model.
Similarly, an index can be determined by Pearson correlation between the expression levels of the genes in a sample of a patient and the average or mean of expression levels in one or more cancer samples that are known to respond to immunotherapy that modulates an interaction between PD-1 and its ligand, and the average or mean expression levels in one or more cancer samples that are known not to respond to immunotherapy that modulates an interaction between PD-1 and its ligand. The resultant Pearson scores can be used to provide an index score. Said score may vary between +1, indicating a prefect similarity, and −1, indicating a reverse similarity. Preferably, an arbitrary threshold is used to type samples as being responsive or as not being responsive. More preferably, samples are classified as responsive or as not responsive based on the respective highest similarity measurement. A similarity score is preferably displayed or outputted to a user interface device, a computer readable storage medium, or a local or remote computer system.
For predicting a response to immunotherapy that modulates an interaction between PD-1 and its ligand, said reference sample preferably comprises gene expression products that are obtained from anucleated cells of an individual known to respond positive to said immunotherapy, and/or of an individual known not to respond positive to said immunotherapy. Similarly, for typing of a sample of a subject for the presence or absence of a cancer, said reference sample preferably comprises gene expression products that are obtained from anucleated cells of an individual known to suffer from a cancer, and/or known not to suffer from a cancer.
Said reference sample preferably provides a measure of the average or mean level of expression of genes in anucleated cells of at least 2 independent individuals, more preferred at least 5 independent individuals, more preferred at least 10 independent individuals, such as between 10 and 100 individuals.
Said average or mean level of expression of genes in anucleated cells of the reference sample is preferably presented on a user interface device, a computer readable storage medium, or a local or remote computer system. The storage medium may include, but is not limited to, a floppy disk, an optical disk, a compact disk read-only memory (CD-ROM), a compact disk rewritable (CD-RW), a memory stick, and a magneto-optical disk.
The gene expression level for at least four genes listed in Table 1, more preferred at least five genes listed in Table 1 can be used to predict a response to immunotherapy that modulates an interaction between PD-1 and its ligand, to a cancer patient, prior to administering said therapy.
For this, anucleated cells, preferably thrombocytes, are isolated from a patient known to suffer from a cancer, such as a lung cancer. A sample comprising ribonucleic acid (RNA), preferably messenger RNA (mRNA), is isolated from the isolated anucleated cells. Following reverse transcription of the RNA into copy desoxyribonucleic acid (cDNA) using any method known to the skilled person, the resulting cDNA is labelled and gene expression levels are quantified, for example by next generation sequencing, for example on an Illumina sequencing platform.
Based on the sequencing results, the gene expression level for at least four genes listed in Table 1, more preferred at least five genes listed in Table 1 is determined in the sample comprising ribonucleic acid (RNA), from said cancer patient and preferably normalized. The normalized expression levels are compared to the level of expression of the same at least four genes listed in Table 1, more preferred at least five genes in a reference sample. Said reference sample is obtained from one or more cancer patients with a known, positive response to immunotherapy that modulates an interaction between PD-1 and its ligand, and/or obtained from one or more cancer patients with a known, negative response to immunotherapy that modulates an interaction between PD-1 and its ligand. From said comparison, a predicted response efficacy to administration of immunotherapy that modulates an interaction between PD-1 and its ligand such as, for example, administration of nivolumab, is obtained.
Contemplated herein is a method of typing a sample of a subject known to suffer from a cancer, especially a lung cancer, comprising the steps of providing a sample from the subject, whereby the sample comprises mRNA products that are obtained from anucleated cells of said subject; determining a gene expression level for at least four genes listed in Table 1, more preferred at least five genes listed in Table 1; comparing said determined gene expression level to a reference expression level of said genes in a reference sample; and typing said sample for a likelihood of responding to immunotherapy that modulates an interaction between PD-1 and its ligand such as, for example, administration of nivolumab, on the basis of the comparison between the determined gene expression level and the reference gene expression level.
In a preferred method according to the invention, a level of expression of at least four genes listed in Table 1, more preferred at least five genes from Table 1 is determined, more preferred a level of expression of at least ten genes from Table 1, more preferred a level of expression of at least twenty genes from Table 1, more preferred a level of expression of at least thirty genes from Table 1, more preferred a level of expression of at least forty genes from Table 1, more preferred a level of expression of at least fifty genes from Table 1, more preferred a level of RNA expression of all five hundred thirty two genes from Table 1.
It is further preferred that said at least five genes from Table 1 comprise the first four genes listed in Table 1, more preferred the first five genes with the lowest P-value, as is indicated in Table 1, more preferred the first ten genes with the lowest P-value, as is indicated in Table 1, more preferred the first twenty genes with the lowest P-value, as is indicated in Table 1, more preferred the first thirty genes with the lowest P-value, as is indicated in Table 1, more preferred the first forty genes with the lowest P-value, as is indicated in Table 1, more preferred the first fifty genes with the lowest P-value, as is indicated in Table 1.
In a further preferred embodiment, said at least four genes listed in Table 1, more preferred at least five genes from Table 1 comprise ENSG00000084234 (APLP2), ENSG00000165071 (TMEM71), ENSG00000143515 (ATP8B2), ENSG00000119314 (PTBP3) and ENSG00000126698 (DNAJC8); more preferably ENSG000000084234 (APLP2), ENSG00000165071 (TMEM71), ENSG00000143515 (ATP8B2), ENSG00000119314 (PTBP3), ENSG00000126698 (DNAJC8) and ENSG00000121879 (PIK3CA); more preferably ENSG0000000084234 (APLP2), ENSG00000165071 (TMEM71), ENSG00000143515 (ATP8B2), ENSG00000119314 (PTBP3), ENSG00000126698 (DNAJC8), ENSG00000121879 (PIK3CA) and ENSG00000174238 (PITPNA); more preferably ENSG0000084234 (APLP2), ENSG00000165071 (TMEM71), ENSG00000143515 (ATP8B2), ENSG00000119314 (PTBP3), ENSG00000126698 (DNAJC8), ENSG00000121879 (PIK3CA), ENSG00000174238 (PITPNA) and ENSG0000084754 (HADHA); more preferably ENSG0000084234 (APLP2), ENSG00000165071 (TMEM71), ENSG00000143515 (ATP8B2), ENSG00000119314 (PTBP3), ENSG00000126698 (DNAJC8), ENSG00000121879 (PIK3CA), ENSG00000174238 (PITPNA), ENSG00000084754 (HADHA) and ENSG00000272369); more preferably ENSG0000084234 (APLP2), ENSG00000165071 (TMEM71), ENSG00000143515 (ATP8B2), ENSG0000019314 (PTBP3), ENSG0000126698 (DNAJC8), ENSG00000121879 (PIK3CA), ENSG00000174238 (PITPNA), ENSG00000084754 (HADHA), ENSG00000272369) and ENSG00000073111 (MCM2); more preferably ENSG00000084234 (APLP2), ENSG00000165071 (TMEM71), ENSG00000143515 (ATP8B2), ENSG00000119314 (PTBP3), ENSG00000126698 (DNAJC8), ENSG00000121879 (PIK3CA), ENSG00000174238 (PITPNA), ENSG00000084754 (HADHA), ENSG00000272369), ENSG00000073111 (MCM2), ENSG00000137073 (UBAP2), ENSG00000115866 (DARS), ENSG00000229474 (PATL2), ENSG00000086589 (RBM22), ENSG00000145675 (PIK3R1), ENSG00000088833 (NSFL1C), ENSG00000267243, ENSG00000260661, ENSG00000144747 (TMF1) and ENSG00000158578 (ALAS2), more preferably ENSG00000084234 (APLP2), ENSG00000165071 (TMEM71), ENSG00000143515 (ATP8B2), ENSG00000119314 (PTBP3), ENSG00000126698 (DNAJC8), ENSG00000121879 (PIK3CA), ENSG00000174238 (PITPNA), ENSG00000084754 (HADHA), ENSG00000272369), ENSG00000073111 (MCM2), ENSG00000137073 (UBAP2), ENSG00000115866 (DARS), ENSG0000229474 (PATL2), ENSG00000086589 (RBM22), ENSG00000145675 (PIK3R1), ENSG00000088833 (NSFL1C), ENSG000000267243, ENSG00000260661, ENSG0000144747 (TMF1), ENSG00000158578 (ALAS2), ENSG00000083642 (PDS5B), ENSG00000142089 (IFITM3), ENSG00000107175 (CREB3), ENSG00000162585 (C1orf86), ENSG00000142687 (KIAA0319L), ENSG00000100796 (SMEK1), ENSG00000142856 (ITGB3BP), ENSG00000103479 (RBL2), ENSG00000048471 (SNX29), ENSG00000196233 (LCOR) and ENSG00000068120 (COASY); more preferably ENSG00000084234 (APLP2), ENSG00000165071 (TMEM71), ENSG00000143515 (ATP8B2), ENSG00000119314 (PTBP3), ENSG00000126698 (DNAJC8), ENSG00000121879 (PIK3CA), ENSG00000174238 (PITPNA), ENSG00000084754 (HADHA), ENSG00000272369), ENSG00000073111 (MCM2), ENSG00000137073 (UBAP2), ENSG00000115866 (DARS), ENSG00000229474 (PATL2), ENSG00000086589 (RBM22), ENSG00000145675 (PIK3R1), ENSG00000088833 (NSFL1C), ENSG00000267243, ENSG00000260661, ENSG00000144747 (TMF1), ENSG00000158578 (ALAS2), ENSG000000083642 (PDS5B), ENSG00000142089 (IFITM3), ENSG00000107175 (CREB3), ENSG00000162585 (C1orf86), ENSG000000142687 (KIAA0319L), ENSG00000100796 (SMEK1), ENSG00000142856 (ITGB3BP), ENSG00000103479 (RBL2), ENSG00000048471 (SNX29), ENSG00000196233 (LCOR), ENSG00000068120 (COASY), ENSG00000120868 (APAF1), ENSG00000198265 (HELZ), ENSG00000162688 (AGL), ENSG00000228215, ENSG00000147457 (CHMP7), ENSG00000129187 (DCTD), ENSG00000141644 (MBD1), ENSG00000172172 (MRPL13), ENSG0000011097 (PITPNM1) and ENSG00000102054 (RBBP7); more preferably ENSG00000084234 (APLP2), ENSG00000165071 (TMEM71), ENSG00000143515 (ATP8B2), ENSG00000119314 (PTBP3), ENSG00000126698 (DNAJC8), ENSG00000121879 (PIK3CA), ENSG00000174238 (PITPNA), ENSG0000084754 (HADHA), ENSG00000272369), ENSG00000073111 (MCM2), ENSG00000137073 (UBAP2), ENSG00000115866 (DARS), ENSG00000229474 (PATL2), ENSG00000086589 (RBM22), ENSG00000145675 (PIK3R1), ENSG0000088833 (NSFL1C), ENSG00000267243, ENSG00000260661, ENSG000000144747 (TMF1), ENSG00000158578 (ALAS2), ENSG00000083642 (PDS5B), ENSG00000142089 (IFITM3), ENSG00000107175 (CREB3), ENSG00000162585 (C1orf86), ENSG00000142687 (KIAA0319L), ENSG00000100796 (SMEK1), ENSG00000142856 (ITGB3BP), ENSG00000103479 (RBL2), ENSG00000048471 (SNX29), ENSG00000196233 (LCOR), ENSG00000068120 (COASY), ENSG00000120868 (APAF1), ENSG00000198265 (HELZ), ENSG00000162688 (AGL), ENSG00000228215, ENSG00000147457 (CHMP7), ENSG00000129187 (DCTD), ENSG00000141644 (MBD1), ENSG00000172172 (MRPL13), ENSG00000110697 (PITPNM1), ENSG00000102054 (RBBP7), ENSG000153214 (TMEM87B), ENSG0000150054 (MPP7), ENSG00000122008 (POLK), ENSG00000151150 (ANK3), ENSG00000165970 (SLC6A5), ENSG00000100811 (YY1), ENSG00000152127 (MGAT5), ENSG00000172493 (AFF1), ENSG00000213722 (DDAH2), ENSG00000177425 (PAWR), ENSG00000260017, ENSG0000141429 (GALNT1), ENSG00000119979 (FAM45A), ENSG00000136167 (LCP1), ENSG00000244734 (HBB), ENSG00000143569 (UBAP2L), ENSG00000079459 (FDFT1), ENSG00000197459 (HIST1H2BH) and ENSG00000080371 (RAB21).
In a most preferred embodiment, a set of at least four genes from Table 1 comprises ENSG00000164985 (PSIP1), ENSG00000114316 (USP4), ENSG00000103091 (WDR59) and ENSG00000140564 (FURIN), which resulted in an AUC-value of 0.70 (95%-CI: 0.47-0.94) and an classification accuracy of 73%.
The gene expression level for at least five genes listed in Table 2 can be used to type a sample from a subject for the presence or absence of a cancer in said subject.
For this, anucleated cells, preferably thrombocytes, are isolated from a subject not known to suffer from a cancer, such as a lung cancer. A sample comprising ribonucleic acid (RNA), preferably messenger RNA (mRNA), is isolated from said isolated anucleated cells. Following reverse transcription of the RNA into copy desoxyribonucleic acid (cDNA) using any method known to the skilled person, the resulting cDNA is labelled and gene expression levels are quantified, for example by next generation sequencing, for example on an Illumina sequencing platform.
Based on the sequencing results, the gene expression level for at least five genes listed in Table 2 is determined in the sample comprising ribonucleic acid (RNA), from said subject and preferably normalized. The normalized expression levels are compared to the level of expression of the same at least five genes in a reference sample. Said reference sample is obtained from one or more cancer patients, and/or obtained from one or more subjects that are known not to suffer from a cancer. From said comparison, said subject can be types for a likelihood of having a cancer such as a lung cancer, or not having a cancer.
In a preferred method according to the invention, a level of expression of at least five genes from Table 2 is determined, more preferred a level of expression of at least ten genes from Table 2, more preferred a level of expression of at least twenty genes from Table 2, more preferred a level of expression of at least thirty genes from Table 2, more preferred a level of expression of at least forty genes from Table 2, more preferred a level of expression of at least fifty genes from Table 2, more preferred a level of RNA expression of all thousand genes from Table 2.
It is further preferred that said at least five genes from Table 2 comprise the first five genes with the lowest P-value, as is indicated in Table 2, more preferred the first ten genes with the lowest P-value, as is indicated in Table 2, more preferred the first twenty genes with the lowest P-value, as is indicated in Table 2, more preferred the first thirty genes with the lowest P-value, as is indicated in Table 2, more preferred the first forty genes with the lowest P-value, as is indicated in Table 2, more preferred the first fifty genes with the lowest P-value, as is indicated in Table 2.
In a further preferred embodiment, said at least five genes from Table 2 comprise HBB, EIF1, CAPNS1, NDUFAF3 and OTUD5, more preferred HBB, EIF1, CAPNS1, NDUFAF3, OTUD5, SRSF2, ANP32B, KIFAP3, ATOX1 and BCAP31, more preferred HBB, EF1, CAPNS1, NDUFAF3, OTUD5, SRSF2, ANP32B, KIFAP3, ATOX1, BCAP31, NAP1L1, TIMP1, POLR2E, CD74, POLR2G, RPS5, GPI, GSTM4, IGHM and DSTN, more preferred HBB, EIF1, CAPNS1, NDUFAF3, OTUD5, SRSF2, ANP32B, KIFAP3, ATOX1, BCAP31, NAP1L1, TIMP1, POLR2E, CD74, POLR2G, RPS5, GPI, GSTM4, IGHM, DSTN, ALDH9A1, ZNF346, LMAN1, EEF1B2, AP2S1, HSPB1, HBQ1, HTATIP2, PTMS and TPM2, more preferred HBB, EIF1, CAPNS1, NDUFAF3, OTUD5, SRSF2, ANP32B, KIFAP3, ATOX1, BCAP31, NAP1L1, TIMP1, POLR2E, CD74, POLR2G, RPS5, GPI, GSTM4, IGHM, DSTN, ALDH9A1, ZNF346, LMAN1, EEF1B2, AP2S1, HSPB1, HBQ1, HTATIP2, PTMS, TPM2, DESI1, RHOC, YWHAH, CPQ, FMTPN, ISCU, MRPL37, MGST3, CMTM5 and ACTG1, more preferred HBB, EIF1, CAPNS1, NDUFAF3, OTUD5, SRSF2, ANP32B, KIFAP3, ATOX1, BCAP31, NAP1L1, TIMP1, POLR2E, CD74, POLR2G, RPS5, GPI, GSTM4, IGHM, DSTN, ALDH9A1, ZNF346, LMAN1, EEF1B2, AP2S1, HSPB1, HBQ1, HTATIP2, PTMS, TPM2, DESI1, RHOC, YWHAH, CPQ, MTPN, ISCU, MRPL37, MGST3, CMTM5, ACTG1, ITGA2B, HPSE, KLHDC8B, CDC37, HLA-DRA, KSR1, ACOT7, PRKAR1B, MAOB and ZDHHC12, more preferred HBB, EIF1, CAPNS1, NDUFAF3, OTUD5, SRSF2, ANP32B, KIFAP3, ATOX1, BCAP31, NAP1L1, TIMP1, POLR2E, CD74, POLR2G, RPS5, GPI, GSTM4, IGHM, DSTN, ALDH9A1, ZNF346, LMAN1, EEF1B2, AP2S1, HSPB1, HBQ1, HTATIP2, PTMS, TPM2, DESI1, RHOC, YWHAH, CPQ, MTPN, ISCU, MRPL37, MGST3, CMTM5, ACTG1, ITGA2B, HPSE, KLHDC8B, CDC37, HLA-DRA, KSR1, ACOT7, PRKAR1B, MAOB, ZDHHC12, SNX3, YIF1B, PRDX5, HDAC8, DDX5, TPM1, SVIP, PDAP1, CD79B and PRSS50, more preferred HBB, EIF1, CAPNS1, NDUFAF3, OTUD5, SRSF2, ANP32B, KIFAP3, ATOX1, BCAP31, NAP1L1, TIMP1, POLR2E, CD74, POLR2G, RPS5, GPI, GSTM4, IGHM, DSTN, ALDH9A1, ZNF346, LMAN1, EEF1B2, AP2S1, HSPB1, HBQ1, HTATIP2, PTMS, TPM2, DESI1, RHOC, YWLAH, CPQ, MTPN, ISCU, MRPL37, MGST3, CMTM5, ACTG1, ITGA2B, HPSE, KLHDC8B, CDC37, HLA-DRA, KSR1, ACOT7, PRKAR1B, MAOB, ZDHHC12, SNX3, YIF1B, PRDX5, HDAC8, DDX5, TPM1, SVIP, PDAP1, CD79B, PRSS50, GPX1, IFITM3, SAIMD14, FUNDC2, BRIX1, CFL1, AKIRIN2, NAPSB, GPAA1, TRIM28, CMTM3 and MMP1.
In a most preferred embodiment, said at least 10 genes from Table 2 comprise ENSG00000168765 (GSTM4), ENSG00000206549 (PRSS50), ENSG00000106211 (HSPB1), ENSG00000185909 (IKLHDC8B), ENSG00000097021 (ACOT7), ENSG00000105401 (CDC37), ENSG00000099817 (POLR2E), ENSG00000105220 (GPI), ENSG00000075945 (KIFAP3), ENSG00000100418 (DESI1). The 10 genes resulted in an AUC-value of 0.74 (95%-CI: 0.70-0.77) and a classification accuracy of 68%) in an independent, late stage validation set (n=518 samples). The AUC-value was 0.69 (95%-CI: 0.59-0.79), with a classification accuracy of 65% in an early stage validation set (n=106 samples).
In a most preferred embodiment, a set of at least 45 genes from Table 2 is used to type a sample from a subject for the presence or absence of a cancer, especially a lung cancer, in said subject. Said at least 45 genes comprise ENSG00000023191 (RNH1), ENSG00000142089 (IFITM3), ENSG00000097021 (ACOT7), ENSG00000172757 (CFL1), ENSG00000213465 (ARL2), ENSG00000136938 (ANP32B), ENSG00000067365 (METTL22), ENSG00000130429 (ARPC1B), ENSG00000116221 (MRPL37), ENSG00000177556 (ATOX1), ENSG00000074695 (LMAN1), ENSG00000198467 (TPM2), ENSG00000188191 (PRKAR1B), ENSG00000126247 (CAPNS1), ENSG00000159335 (PTMS), ENSG00000113761 (ZNF346), ENSG00000102265 (TIMP1), ENSG00000168002 (POLR2G), ENSG00000185825 (BCAP31), ENSG00000155366 (RHOC), ENSG00000099817 (POLR2E), ENSG00000125868 (DSTN), ENSG00000160446 (ZDHHC12), ENSG00000100418 (DESI1), ENSG00000109854 (HTATIP2), ENSG00000161547 (SRSF2), ENSG000068308 (OTUD5), ENSG00000206549 (PRSS50), ENSG00000178057 (NDUFAF3), ENSG00000042753 (AP2S1), ENSG00000168765 (GSTM4), ENSG00000075945 (KIFAP3), ENSG00000173812 (EIF1), ENSG00000086506 (HBQ1), ENSG00000106244 (PDAP1), ENSG00000187109 (NAP1L1), ENSG00000106211 (HSPB1), ENSG00000105220 (GPI), ENSG00000105401 (CDC37), ENSG00000128245 (YWHAH), ENSG00000173083 (HPSE), ENSG00000185909 (KLHDC8B), ENSG00000126432 (PRDX5), ENSG00000166091 (CMTM5) and ENSG00000069535 (MAOB). The 45 genes resulted in an AUC-value of 0.77 (95%-CI: 0.73-0.81) and a classification accuracy of 77%) in an independent, late stage validation set (n=518 samples). The AUC-value was 0.74 (95%-CI: 0.65-0.83), with a classification accuracy of 70% in an early stage validation set (n=106 samples)
P selectin protein (SELP, CD62) is stored in platelet alpha-granules and released upon platelet activation. P-selectin levels are enriched in younger, reticulated platelets. The platelet RNA gene panel selected for NSCLC diagnostics depicted in Table 2 contains genes that are co-regulated with p-selectin RNA expression in platelets. Hence, the NSCLC diagnostic signature may be enriched for reticulated platelets, expressing high levels of p-selectin RNA. Said P-selectin signature may have help in predicting therapy response, in case the platelet population of responding patients shifts during treatment from reticulated platelets to mature platelets. This shift might also be observed for other treatment modules including chemotherapy, targeted therapies, radiotherapy, surgery or immunotherapy.
Therefore, the gene expression level for at least five genes listed in Table 3 can be used to assist in predicting a response to immunotherapy that modulates an interaction between PD-1 and its ligand, to a cancer patient, prior to administering said therapy.
Hence, the invention provides a method of administering immunotherapy that modulates an interaction between PD-1 and its ligand, to a cancer patient, comprising the steps of providing a sample from the patient, the sample comprising mRNA products that are obtained from anucleated cells of said patient; determining a gene expression level for at least four genes listed in Table 1, more preferred at least five genes listed in Table 1, and at least five genes listed in Table 3; comparing said determined gene expression levels to reference expression levels of said genes in a reference sample; typing the patient as a positive responder to said immunotherapy, or as a not-positive responder, based on the comparison with the reference; and administering immunotherapy to a cancer patient that is typed as a positive responder.
For this, anucleated cells, preferably thrombocytes, are isolated from a patient known to suffer from a cancer, such as a lung cancer. A sample comprising ribonucleic acid (RNA), preferably messenger RNA (mRNA), is isolated from the isolated anucleated cells. Following reverse transcription of the RNA into copy desoxyribonucleic acid (cDNA) using any method known to the skilled person, the resulting cDNA is labelled and gene expression levels are quantified, for example by next generation sequencing, for example on an Illumina sequencing platform.
Based on the sequencing results, the gene expression level for at least five genes listed in Table 3 is determined in the sample comprising ribonucleic acid (RNA), from said cancer patient and preferably normalized. The normalized expression levels are compared to the level of expression of the same at least five genes in a reference sample. Said reference sample is obtained from one or more cancer patients with a known, positive response to immunotherapy that modulates an interaction between PD-1 and its ligand, and/or obtained from one or more cancer patients with a known, negative response to immunotherapy that modulates an interaction between PD-1 and its ligand. From said comparison, a predicted response efficacy to administration of immunotherapy that modulates an interaction between PD-1 and its ligand such as, for example, administration of nivolumab, is obtained.
In a preferred method according to the invention, a level of expression of at least five genes from Table 3 is determined, more preferred a level of expression of at least ten genes from Table 3, more preferred a level of expression of at least twenty genes from Table 3, more preferred a level of expression of at least thirty genes from Table 3, more preferred a level of expression of at least forty genes from Table 3, more preferred a level of expression of at least fifty genes from Table 3, more preferred a level of RNA expression of all thousand eight hundred twenty genes from Table 3.
It is further preferred that said at least five genes from Table 3 comprise the first five genes with the lowest P-value, as is indicated in Table 3, more preferred the first ten genes with the lowest P-value, as is indicated in Table 3, more preferred the first twenty genes with the lowest P-value, as is indicated in Table 3, more preferred the first thirty genes with the lowest P-value, as is indicated in Table 3, more preferred the first forty genes with the lowest P-value, as is indicated in Table 3, more preferred the first fifty genes with the lowest P-value, as is indicated in Table 3.
In a further preferred embodiment, said at least five genes from Table 3 comprise SELP, ITGA2B, AP2S1, OTUD5 and MAOB from Table 3, more preferred SELP, ITGA2B, AP2S1, OTUD5, MAOB, KIFAP3, HBQ1, ACOT7, POLR2E and DESI1, more preferred SELP, ITGA2B, AP2S, OTUD5, MAOB, KIFAP3, HBQ1, ACOT7, POLR2E, DESI1, TIMP1, CPQ, GPI, CDC37, MTPN, HSPB1, PDAP1, HTATIP2, SNX3 and ZNF346, more preferred SELP, ITGA2B, AP2S1, OTUD5, MAOB, KIFAP3, HBQ1, ACOT7, POLR2E, DESI1, TIMP1, CPQ, GPI, CDC37, MTPN, HSPB1, PDAP1, HTATIP2, SNX3, ZNF346, DSTN, CAPNS1, PRDX5, YWHAH, AKIRIN2, ISCU, TPM1, CMTM3, ALDH9A1 and RHOC, more preferred SELP, ITGA2B, AP2S1, OTUD5, MAOB, KIFAP3, HBQ1, ACOT7, POLR2E, DESI1, TIMP1, CPQ, GPI, CDC37, MTPN, HSPB1, PDAP1, HTATIP2, SNX3, ZNF346, DSTN, CAPNS1, PRDX5, YWHAH, AKIRIN2, ISCU, TPM1, CMTM3, ALDH9A1, RHOC, PTMS, ZDHHC12, SRSF2, FUNDC2, CMTM5, SAMD14, YIF1B, POLR2G, GSTM4 and CFL1, more preferred SELP, ITGA2B, AP2S1, OTUD5, MAOB, KIFAP3, HBQ1, ACOT7, POLR2E, DESI1, TIMP1, CPQ, GPI, CDC37, MTPN, HSPB1, PDAP1, HTATIP2, SNX3, ZNF346, DSTN, CAPNS1, PRDX5, YWHAH, AKIRIN2, ISCU, TPM1, CMTM3, ALDH9A1, RHOC, PTMS, ZDHHC12, SRSF2, FUNDC2, CMTM5, SAMD14, YIF1B, POLR2G, GSTM4, CFL1, HPSE, EIF1, NDUFAF3, ACTG1, BCAP31, KLHDC8B, NAP1L1, PRKAR1B, MMP1, GPAA1, SVIP, TPM2, PRSS50 and GPX1.
A most preferred set of at least five genes from Table 3 comprises ENSG00000161203 (AP2M1), ENSG00000204420 (C6orf25), ENSG00000204592 (HLA-E), ENSG00000064601 (CTSA), and ENSG00000005961 (ITGA2B). Use of this additional set of genes, besides the most preferred set of at least ten genes, resulted in classification of early-stage NSCLC with an AUC-value of 0.66 (95%-CI: 0.55-0.76) and an accuracy of 65% (n=106 samples).
Several bioinformatic optimization algorithms can be exploited for solving mathematical problems regarding parameter selection. These optimization processes iteratively seek most optimal parameter settings of parameters that determine the mathematical problem. This iterative process is guided by the optimization algorithm, effectively and efficiently. We claim the use of particle swarm intelligence optimization (PSO), the mathematical approach for parameter selection, including its subvariants and hybridization/combination with other mathematical optimization algorithms for gene panel selection in liquid biopsies. We define PSO as a meta-algorithm exploiting particle position and particle velocity using iterative repositioning in a high-dimensional space for efficient and optimized parameter selection, i.e. gene panel selection. PSO also includes other optimization meta-algorithms that can be applied for automated and enhanced gene panel selection. We tested the particle swarm optimization algorithm, and demonstrate that PSO-enhanced algorithms enable efficient selection of spliced RNA biomarker panels from platelet RNA-seq libraries (n=728). This resulted in accurate TEP-based detection of stage IV non-small cell lung cancer (NSCLC) (n=520 independent validation cohort, accuracy: 89%, AUC: 0.94, 95%-CI: 0.93-0.96, p<0.001), independent of age of the individuals, whole blood storage time, and various inflammatory conditions. In addition, we employed swarm intelligence to explore spliced RNA biomarker profiles for the blood-based therapeutic response prediction of stage IV NSCLC patients at moment of baseline for anti-PD-1 nivolumab immunotherapy (n=64). The nivolumab response prediction algorithm resulted in an accuracy of 88% (AUC 0.89, 95%-CI: 0.8-1.0, p<0.01). To our knowledge this is the first demonstration of PSO for selection of biomarker gene panels to diagnose cancer and predict therapy response from TEPs. The PSO-algorithm was exploited for optimization of four parameters that determined the gene panel used for support vector machine training. As aside analyzing RNA molecules from TEPs. PSO can also be applied for analysis of small RNAs, RNA rearrangements. DNA single nucleotide alterations, protein levels, metabolomic levels, which constituents are isolated from TEPs, plasma, serum, circulating tumor cells, or extracellular vesicles, by subjecting similar or combined data input to the PSO-algorithm.
For the purpose of clarity and a concise description, features are described herein as part of the same or separate embodiments, however, it will be appreciated that the scope of the invention may include embodiments having combinations of all or some of the features described.
Peripheral whole blood was drawn by venipuncture from cancer patients, patients with inflammatory and other non-cancerous conditions, and asymptomatic individuals at the VU University Medical Center, Amsterdam. The Netherlands, the Dutch Cancer Institute (NKI/AvL), Amsterdam, The Netherlands, the Academical Medical Center, Amsterdam, The Netherlands, the Utrecht Medical Center, Utrecht, The Netherlands, the University Hospital of Umeå, Umeå, Sweden, the Hospital Germans Trias i Pujol, Barcelona, Spain, The University Hospital of Pisa, Pisa, Italy, and Massachusetts General Hospital, Boston, USA. Whole blood was collected in 4-, 6-, or 10-mL EDTA-coated purple-capped BD Vacutainers containing the anti-coagulant EDTA. Cancer patients were diagnosed by clinical, radiological and pathological examination, and were confirmed to have at moment of blood collection detectable tumor load. 106 NSCLC samples included were follow-up samples of the same patient, collected weeks to months after the first blood collection. Age-matching was performed retrospectively using a custom script in MATLAB, iteratively matching samples by excluding and including Non-cancer and NSCLC samples aiming at a similar median age and age-range between both groups. Samples for both training, evaluation, and validation cohorts were collected and processed similarly and simultaneously. A detailed overview of the included samples, demographic characteristics, the hospital of origin, time between blood collection and platelet isolation (blood storage time), and for which analyses and classifiers were used is provided in Table 4. Asymptomatic individuals were at the moment of blood collection, or previously, not diagnosed with cancer, but were not subjected to additional tests confirming the absence of cancer. The patients with inflammatory or other non-cancerous conditions did not have a diagnosed malignant tumor at the moment of blood collection. This study was conducted in accordance with the principles of the Declaration of Helsinki. Approval for this study was obtained from the institutional review board and the ethics committee at each participating hospital. Clinical follow-up of asymptomatic individuals is not available due to anonymization of these samples according to the ethical rules of the hospitals.
For collection and annotation of clinical data, patient records were manually queried for demographic variables. i.e. age, gender, smoking, type of tumor, metastases, details of current and prior treatments, and co-morbidities. In case of transgender individuals, the new gender was stated (n=1). Platelet samples were collected before start of (a new) treatment or during treatment, respectively baseline and follow-up samples. Response assessment of patients treated with nivolumab (
To estimate the contribution of the variables 1) patient age (in years) at moment of blood collection, 2) whole blood storage time, 3) gender, and 4) smoking (current, former, never), we summarized the available data from Supplemental Table S1A-C and Supplemental FIG. S2C of our previous study (Best et al., 2015, Cancer Cell, 28: 666-676), and performed logistic regression analyses in the statistical software module SAS (v.13.0.0; SAS Institute Inc., 100 SAS Campus Drive, Cary, N.C. 27513-2414, USA). Blood storage time was defined as the time between blood collection and the start of platelet isolation by differential centrifugation, stratified into a <12 hours group and a >12 hours group. For variables of samples of which data was missing, that particular value of the particular samples was excluded from the calculation. The joint predictive power of patient age, blood storage time, and the predictive strength of the diagnostics classifier for NSCLC, was assessed using a measure of logistic regression with nominal response, by selecting disease state as the Role Variable Y. and adding patient age, blood storage time, gender, smoking, and predictive strength for NSCLC as the model effects. All additional settings were set at default.
Whole blood samples in 4-, 6-, or 10-mL EDTA-coated Vacutainer tubes were processed using standardized protocols within 48 hours as described previously (Best et al., 2015. Cancer Cell 28: 666-676; Nilsson et al., 2011. Blood 118: 3680-3683). Whole blood collected in the VU University Medical Center, the Dutch Cancer Institute, the Utrecht Medical Center, the University Hospital of Umeå, the Hospital Germans Trias I Pujol, and the University Hospital of Pisa was subjected to platelet isolation within 12 hours after blood collection. Whole blood samples collected at Massachusetts General Hospital Boston and the Academical Medical Center Amsterdam were stored overnight and processed after 24 hours. To isolate platelets, platelet rich plasma (PRP) was separated from nucleated blood cells by a 20-minute 120×g centrifugation step, after which the platelets were pelleted by a 20-minute 360×g centrifugation step. Removal of 9/10th of the PRP has to be performed carefully to reduce the risk of contamination of the platelet preparation with nucleated cells, pelleted in the buffy coat. Centrifugations were performed at room temperature. Platelet pellets were carefully resuspended in RNAlater (Life Technologies) and after overnight incubation at 4° C. frozen at −80° C.
To assess the relative platelet activation during our platelet isolations, we measured the surface protein expression levels of the constitutively expressed platelet marker CD41 (APC anti-human, clone: HIP8) and platelet activation-dependent markers P-selectin (CD62P, PE anti-human, clone: AK4, Biolegend) and CD63 (FITC anti-human, clone: H5C6, Biolegend), using a BD FACSCalibur flow cytometer. We collected five 6-mL EDTA-coated Vacutainers tubes from each of six healthy donors, and determined the platelet activation state at baseline (0 hours), 8 hours, 24 hours, 48 hours, and 72 hours. As a negative control, we isolated at time point zero platelets from whole blood using a standardized platelet isolation protocol from citrate-anticoagulated whole blood that has been validated for inducing minimal platelet activation. This protocol consisted of a step of OptiPrep (Sigma) density gradient centrifugation (350×g, 15 minutes) after collection of platelet rich plasma. This was followed by two washing steps first with Hepes, followed by a washing step in SSP+ (Macopharma) buffer. We used 400 nM prostaglandin I2 (Sigma-Aldrich) before every centrifugation step to prevent platelet activation during the isolation procedure. As a positive control, we included platelets activated by 20 μM TRAP (TRAPtest, Roche). Platelet pellets were after isolation prefixed in 0.5% formaldehyde (Roth), stained, and stored in 1% formaldehyde for flow cytometric analysis. Relative activation and mean fluorescent intensity (MFI) values were analyzed with FlowJo. Hence, absence of platelet activation during blood collection and storage was confirmed by stable levels of P-selectin and CD63 platelet surface markers (
Preparation of samples for sequencing was performed in batches, and included per batch a mixture of clinical conditions. For platelet RNA isolation, frozen platelets were thawed on ice and total RNA was isolated using the mirVana miRNA isolation kit (Ambion. Thermo Scientific, AM1560). Platelet RNA was eluated in 30 μL elution buffer. We evaluated the platelet RNA quality using the RNA 6000 Picochip (Bioanalyzer 2100, Agilent), and included as a quality standard for subsequent experiments only platelet RNA samples with a RIN-value >7 and/or distinctive rRNA curves. All Bioanalyzer 2100 quality and quantity measures were collected from the automatically generated Bioanalyzer result reports using default settings, and after critical assessment of the reference ladder (quantity, appearance, and slope). The Truseq cDNA labelling protocol for Illumina sequencing (see below) requires ˜1 μg of input cDNA. Since a single platelet contains an estimated ˜2 femtogram of RNA (Teruel-Montoya et al., 2014. PLuS ONE 9(7): e102259), assuming an average platelet count of 300×106 per mL of whole blood and highly efficient platelet isolation and RNA extraction, the estimated optimal yield of platelets from clinically relevant blood volumes (6 mL) is about 3.6 microg. The average total RNA obtained from our blood samples is 146 ng (standard deviation: 130 ng, n=237 samples, see
Raw RNA sequence data of platelets encoded in FASTQ-files were subjected to a standardized RNA-seq alignment pipeline, as described previously (Best et al., 2015. Cancer Cell 28: 666-676). In summary, RNA-sequence reads were subjected to trimming and clipping of sequence adapters by Trimmomatic (v. 0.22) (Bolger et al., 2014. Bioinformatics 30: 2114-2120), mapped to the human reference genome (hg19) using STAR (v. 2.3.0) (Dobin et al., 2013. Bioinformatics 29: 15-21), and summarized using HTSeq (v. 0.6.1), which was guided by the Ensembl gene annotation version 75 (Anders et al., 2014. Bioinformatics 31: 166-169). All subsequent statistical and analytical analyses were performed in R (version 3.3.0) and R-studio (version 0.99.902). Of samples that yielded less than 0.2×10E6 intron-spanning reads in total after sequencing, we again sequenced the original Truseq preparation of the sample and merged the read counts generated from the two individual FASTQ-files after HTSeq count summarization (performed for n=52 samples). Genes encoded on the mitochondrial DNA and the Y-chromosome were excluded from downstream analyses, except for the analyses in
Assessment of the Technical Performance of thromboSeq
We observed in the platelet RNA a rich repertoire of spliced RNAs (
Prior to differential splicing analyses the data was subjected to the iterative correction-module as described in the section ‘Data normalisation and RUV-mediated factor correction’ in Example 1 (age correlation threshold 0.2, library size correlation threshold 0.8 (Non-cancer/NSCLC,
Distribution of mapped RNA-seq reads of platelet cDNA, and thus the origin of the RNA fragments, was investigated in samples assigned to the patient age- and blood storage time-matched NSCLC/Non-cancer cohort (training, evaluation, and validation, totaling 263 samples). The mitochondrial genome and human genome, of which the latter includes exonic, intronic, and intergenic regions were quantified separately (
To determine the correlation between p-selection levels and exonic read counts, we compared the P-selectin (SELP, ENSG00000174175) counts-per-million values of 263 patient age- and blood storage time-matched individuals to the number of reads mapping to exons (
We employed the MISO algorithm (Katz et al., 2010. Nature methods 7: 1009-15) for alternative splicing analysis in our 100 bp single read RNA-seq data. Briefly, the MISO algorithm quantifies the number of reads favouring inclusion or exclusion of a particular annotated event, such as exon skipping, or RNA isoforms. By scoring reads supporting either one variant or the other (on/off) and scoring reads supporting both isoforms, the algorithm infers the ratio of inclusion, and thereby the percent spliced in (PSI). A detailed description of the alternative splicing analysis in TEPs and interpretation of the results is provided in Example 3.
Processing of Raw mRNA Sequencing Data for MISO Splicing Analysis
For the MISO RNA splicing analyses (
For alternative isoform analysis, we narrowed the analysis to the 4722 genes identified with confident intron-spanning expression levels in platelets (see also section ‘Processing of raw RNA-sequencing data’ in Example 1). For each annotated Ensemble transcript ID, available in the MISO summary output files, the assigned read counts (reads assigned to the particular RNA isoform) were summarized in a count matrix. A schematic overview of the procedure is presented in
For analysis of exon skipping events, we developed a custom analysis pipeline summarizing reads supporting inclusion or exclusion of annotated exons and scoring the relative contribution in groups of interest, i.e. Non-cancer versus NSCLC. The input for the algorithm is a PSI-values count matrix and an ‘assigned counts’ count matrix, as generated from summary output files generated by MISO. The former count matrix is required to calculate the relative PSI-values and distribution per group, the latter count matrix is required to only include exons with sufficient coverage in the RNA-seq data (i.e. >10 reads in >60% of the samples, which support both inclusion (1.0) and exclusion (0,1) of the variant, see also Katz et al.,). The coverage selector dowvnscaled the available exons for analysis to 230 exons (
To identify RNA-binding protein (RBP) profiles associated with the TEP signatures in NSCLC patients (
(i) The algorithm selects of all inputted genes the annotated RNA isoforms and identifies genomic regions of the annotated RNA isoforms that are associated with either the 5′-UTR or 3′-UTR. The genomic coding sequence is extracted from the human hg19 reference genome using the getfasta function in Bedtools (v. 2.17.0). For this study, we used the Ensembl annotation version 75.
(ii) All characterized motif sequences extracted from literature (102 in total, Supplementary Table 3 of Ray et al., (Ray et al., 2013. Nature 499: 172-177), filtered for Homo Sapiens) are reduced to 547 non-redundant (‘A’, ‘G’, ‘C’, and ‘T’-sequence) annotations according to the IUPAC motif annotation. These non-redundant motif sequences serve as the representative motif sequences for the initial search.
(iii) In an iterative manner, per RBP the associated non-redundant RBP motif sequences are matched with all identified and included UTR sequences (using the str_count function of the seqinr package in R).
(iv) The algorithm identifies the number of reads mapping to each UTR region per sample using Samtools View (q 30, -c enabled.
(v) For all 5′- and 3′-UTRs with sufficient coverage associated with the same parent gene (ensg), all matched UTR-non-redundant motif hits were summed, and summarized in a gene-motif matrix. Non-redundant motifs were converted to RBP-ids by overlaying all possible RBP-motif matches. This matrix is used for downstream analyses, data interpretation, and visualization.
We confirmed 3′- and 5′-UTR enrichment of particular RBPs (
We identified two variables, i.e. blood storage time and patient age, that potentially influence the classifiers predictive strength (Table 4). To reduce the influence of confounding factors participating in the classification model, we applied the following novel approach for iterative RNA-sequencing data correction (see also schematic representation in
(i) Filtering of genes with low abundance, i.e. less than 30 intron-spanning spliced RNA reads in more than 90% of the sample cohort (also included in the general QC-module, see section ‘Processing of raw RNA-sequencing data’).
(ii) Determination of genes showing least variability among confounding variables. For this, the non-normalized raw reads counts of each gene that passed the initial filter in (i) were correlated using Pearson's correlation to either the total intron-spanning library size (as calculated by the DGEList-function of the edgeR package in R) or the age of the individuals. Genes with a high Pearson correlation (towards 1) show the least variability after counts-per-million normalisation (see
(iii) Raw read counts of the training cohort were subjected to the RUVg-function from the RUVSeq-package in R. The stable genes identified among the confounding variables were used as ‘negative control genes’. Following, the individual estimated factors for each sample identified by RUVg are correlated to potential confounding factors (in the current study: library size, age of the individual) or the group of interest (for example Non-cancer versus NSCLC). The continuous (confounding) variables are correlated to the estimated variance of the samples. Dichotomous variables (e.g. group) are compared using a Student's t-test. In both instances, the p-value was used as a significance surrogate between the RUVg variable and the (confounding) variable. Of note, to prevent removal of a variable likely correlated to group, we applied two rules prior to matching a variable to a (confounding) factor. i.e. a) the p-value between RUVg variable and group should be at least >1e−5, and b) the p-value between RUVg variable and the other variable should be at least <0.01. Raw non-normalized reads were corrected for RUVg variable x in case this variable was correlated to a confounding factor. Finally, the total intron-spanning library size per sample was adjusted by calculating the sum of the RUVg-corrected read counts per sample.
(iv) RUVg-normalized read counts are subjected to counts-per-million normalization, log-transformation, and multiplication using a TMM-normalisation factor. The latter normalisation factor was calculated using a custom function, implemented from the calcNornmFactors-function in the edgeR package in R. Here, the eligible samples for TMM-reference sample selection can be narrowed to a subset of the cohort. i.e. for this study the samples assigned to the training cohort, and the selected reference sample was locked. We applied this iterative correction module to all analyses in this work. The estimated RUVg number of factors of unwanted variation (k) was 3. We directly compared the performance of our previous normalisation module and the iterative correction module presented in this study using relative log intensity (RLE) plots (
The swarm-enhanced thromboSeq algorithm implements multiple improvements over the previously published thromboSeq algorithm (Best et at, 2015. Cancer Cell 28: 68-676). An overview of the swarm-enhanced thromboSeq classification algorithm is provided in
Performance Measurement of the Swarm-Enhanced thromboSeq Algorithm
We assessed the performance, stability, and reproducibility of the swarm-enhanced thromboSeq platform using multiple training, evaluation, and validation cohorts. A schematic overview of the cohorts used for assessment of the performance of the platform in patient age- and blood storage-matched cohorts is provided in
For the gene ontology analysis, we investigated co-associated gene clusters using the PAGODA functions implemented in version 1.99 of the scde R-package (http://pklab.med.harvard.edu/scde/). PAGODA allows for clustering of redundant heterogeneity patterns and the identification of de novo gene clusters through pathway and gene set overdispersion analysis (Fan et al., 2016. Nature Methods 13: 241-244). In particular, the ability to identify de novo gene clusters is of interest for the analysis of platelet RNA-seq data, as platelet biological functions are potentially unannotated and can only be inferred by unbiased cluster analysis. Gene IDs as selected by differential splicing analysis (n=1622,
By analysis of platelet RNA samples after SMARTer amplification we observed delicate differences in the SMARTer cDNA profiles (
RNA-seq data offers an opportunity to quantify nearly any region of the transcriptome at high resolution. Hence we investigated the distribution of RNA species in the platelet RNA profiles. The platelets analyzed in this study make up a snapshot of all platelets circulating in the blood stream at moment of blood collection, and may be influenced by variables such as total platelet counts, medication, bleeding disorders, injuries, activities or sports, and circadian rhythm. For the following analyses, in order to reduce the influence of factors highly suspected of confounding the platelets profiles (Table 4), we selected 263 patient age- and blood storage time-matched individuals. Based on intron-spanning read count analysis we identified 1625 spliced platelet genes with significantly differentially spliced levels (FDR<0.01, 698 genes with enhanced splicing in platelets of NSCLC patients and 927 genes with decreased splicing in platelets of NSCLC patients), which is in line with previous findings (Best et al., 2015. Cancer Cell 28: 666-676; Calverley et al., 2010. Clinical and Transl Science 3: 227-232).
Based on unsupervised hierarchical clustering of intron-spanning reads the Non-cancer and NSCLC samples separated into two distinct groups (p<0.0001, Fisher's exact test,
Next, we investigated the contribution of alternative splicing events to the platelet RNA repertoire, since alternative splicing events might influence the number of spliced HNA reads used for the diagnostic classifiers. For characterization of transcriptome-wide alternative isoforms and splicing events, we implemented the previously published MISO algorithm (Katz et al., 2010. Nature Methods 7: 1009-1015) for the quantification and summarization of annotated RNA isoforms. From this, we inferred a count matrix, which contains the number of reads supporting each included RNA isoform per sample (
Next, we investigated alternative splicing events within genes. i.e. exon skipping. Here, we again applied the MISO algorithm (Katz et al., 2010. Nature Methods 7: 1009-1015) to profile 38327 annotated exons, and to infer the fraction of reads supporting either inclusion or exclusion of the particular exon as compared to neighbouring exons (schematic representation in
We also observed multiple variables converging, i.e. 1) platelets of NSCLC patients have an higher RNA yield on average (
Platelets are anucleated cell fragments. They contain, however, a functional spliceosome and several splice factor proteins (Denis et al., 2005. Cell 122: 379-391). Therefore, platelets retain their capacity to initiate pre-mRNA splicing. Several experiments have demonstrated that platelets are able to splice pre-mRNA upon environmental queues (Rondina et al., 2011. Journal Thromb Haemostasis 9: 748-758; Schwertz et al., 2006. J Exp Med 203: 2433-2340; Denis et al., 2005. Cell 122: 379-391), and that they have the ability to translate RNA into proteins (Weyrich et al., 1998. Proceedings of the National Academy of Sciences 95: 5556-5561). As platelets lack a nucleus, but are packaged with ˜20-40 femtograms of RNA (Angénieux et al., 2016. PloS One 11: e0148064) and circulate for 7-10 days, the (pre-)mRNA needs to be properly curated. The inability of platelets to transcribe chromosomal DNA, as opposed to nucleated cells, prevents the platelets from transcription factor-mediated gene regulation, hinting at post-transcriptional regulation of the RNA pool (
Blood platelets act as local and systemic responders during tumorigenesis and cancer metastasis (McAllister and Weinberg 2014. Nature Cell Biol 16: 717-27), thereby being exposed to tumor-mediated platelet education, and resulting in altered platelet behaviour (Labelle et al., 2011. Cancer Cell 20: 576-590; Schumacher et al., 2013. Cancer Cell 24: 130-137; Kerr et al., 2013. Oncogene 32: 4319-4324). We have previously demonstrated that platelet RNA can function as a biomarker trove to detect and classify cancer from blood via self-learning support vector machine (SVM)-based algorithms (Best et al., 2015. Cancer Cell 28: 666-676)(
Next, we investigated the clinical utility of swarm-modulated TEP biomarker signatures for therapy response prediction in patients with NSCLC. For this, we prospectively included patients with NSCLC that were selected for treatment with the PD-1 monoclonal antibody nivolumab that is associated with an objective response rate of approximately 20% in unselected NSCLC cohorts in the second line setting (Borghaei et al., 2015. New England J Med 373: 1627-1639; Brahmer et al., 2015. New England J Med 373: 123-135). Currently, stratification of patients for anti-PD-(L)1 targeted therapy is hampered by limited accuracy and concordance of available biomarkers, including PD-L1 immunohistochemistry of tumor tissue. Studies have identified correlations between tumor tissue mutational load, presence of neo-antigens, infiltration of immune cells, and response to anti-PD-(L)1 immunotherapy (Rizvi et al., 2015. Science 348: 124-128; McGranahan et al., 2016. Science 351: 1463-1469). Identification of patients with a low likelihood of response to anti-PD-(L)1 immunotherapy, while still correctly identifying individuals who most likely benefit from this therapy, might prevent unnecessary treatment and concomitant costs, and potential exposure of patients to serious immunological adverse events. Platelets can behave as immunomodulators in inflammatory conditions (Boilard et al., 2010. Science 327: 580-583), and are therefore potentially also involved in the immune response towards a tumor. To this end, we collected platelet samples before start of nivolumab treatment (n=64). These samples are part of the cohort presented in
Assuming a 20% response rate to nivolumab among an unselected population of NSCLC patients (Borghaei et al., 2015. New Engl J Med 373: 1627-1639; Brahmer et al., 2015. New Engl J Med 373: 123-135), 42% of the full population will safely be withheld nivolumab treatment. We noted that classification of the n28-Follow-up-cohort (collected 2-4 weeks after start of treatment) in the 1246-genes nivolumab response prediction algorithm yielded in random classification (data not shown). However, we observed similar distinctive power in TEP RNA profiles at 2-4 weeks following start of treatment when analyzed separately (
Altogether, we provide evidence that TEPs could potentially serve as a diagnostics platform for cancer detection and therapy selection. The PSO-driven thromboSeq algorithm development approach allowed for efficient biomarker selection and may be applicable to other diagnostics biosources and indications. A further increase in the classification power of swarm-enhanced thromboSeq may be achieved by 1) training of the swarm-enhanced self-learning algorithms on significantly more patient age- and blood storage time-matched samples, 2) including analysis of small RNA-seq (e.g. miRNAs), 3) including non-human RNAs, and/or 4) combining multiple blood-based biosources, such as TEP RNA, exosomal RNA, cell-free RNA, and cell-free DNA. By nature, swarm intelligence allows for self-reorganization and re-evaluation, enabling continuous algorithm optimization (
A 60-years-old male presents at the general practitioner (GP). He complains about sputum mixed with blood, tiredness, shortness of breath, and loss of weight. Upon physical examination the GP notices enlargement of clavicular lymph nodes. The GP suspects the patient of localized or metastasized lung cancer. He orders a platelet-RNA-based diagnostic test (thromboSeq). The patient is subjected to a venipuncture, and whole blood is collected in a EDTA-coated tube. The EDTA-coated tube with blood is send via medical transport to a sequencing facility, compatible with the thromboSeq system. Upon arrival of the blood tube at the sequencing facility the EDTA-coated tube is subjected to the standardized platelet isolation protocol, and from the resulting platelet pellet total RNA isolation is performed. The total RNA is quantified, quality-controlled, and ˜500 pg RNA is subjected to the standardized SMARTer cDNA amplification protocol. Resulting cDNA is labelled for Illumina sequencing, and the sample is sequenced using the Illumina sequencing platform.
Following sequencing, the samples' FASTQ-file is processed using the thromboSeq bioinformatics pipeline, consisting of read mapping, quantification, normalization, and correction, and classified using the swarm-enhanced NSCLC Dx signature-based support vector machine (SVM) classifier. The classification result is send to the GP.
A 66-years-old female is diagnosed with a stage IV non-small cell lung cancer (NSCLC), with multiple metastases to the brain. The medical doctors decide to investigate the sensitivity of the primary tumor for anti-PD(L)1-targeted treatment, especially nivolumab treatment. They draw blood using a regular venipuncture procedure, and collect the whole blood in EDTA-coated vacutainer tubes. The EDTA-coated tube with blood is send via medical transport to a sequencing facility, compatible with the thromboSeq system. Upon arrival of the blood tube at the sequencing facility the EDTA-coated tube is subjected to the standardized platelet isolation protocol, and from the resulting platelet pellet total RNA isolation is performed. The total RNA is quantified, quality-controlled, and ˜500 pg RNA is subjected to the standardized SMARTer cDNA amplification protocol. Resulting cDNA is labelled for Illumina sequencing, and the sample is sequenced using the Illumina sequencing platform. Following sequencing, the samples' FASTQ-file is processed using the thromboSeq bioinformatics pipeline, consisting roughly of read mapping, quantification, normalization, and correction, and classified using the swarm-enhanced nivolumab therapy response signature-based SVM classifier. The classification result, which contains a predicted response efficacy to nivolumab, is send to the medical team.
To select a minimal biomarker gene panel for TEP-RNA NSCLC diagnostics, a NSCLC diagnostics score was calculated. The NSCLC/Non-cancer RNA-sequencing dataset (n=779 samples) was first subjected to the RUV-normalization module (lib-size threshold: 0.418, as determined by PSO). The genes with stable expression levels among the cohort and the factors for RUV-correction were determined using the training cohort only (n=120 samples). Next, ANOVA differential expression analysis using only the samples assigned to the age-, gender-, EDTA-, and smoking-matched NSCLC/Non-cancer training cohort was performed. Following, an iterative biomarker gene panel selection algorithm, which adds per iteration a new gene according to a ranked FDR- or p-value-ranked ANOVA list, was employed. The biomarker gene panel is composed of genes with a positive logarithmic fold change. The NSCLC diagnostics score was calculated per iteration by selecting the median 2-log counts-per-million value for each sample for the genes in the biomarker gene panel. For each biomarker set, the AUC-value of a ROC-curve of the biomarker gene in an evaluation cohort (n=88) was evaluated. This was performed for a biomarker gene panel ranging from 2 genes up to and including 500 genes.
The evaluation cohort (n=88 samples) showed the highest AUC-value in the ROC-curve of the NSCLC diagnostics score in a 60-gene biomarker gene panel (AUC-value: 0.86, classification accuracy: 81%). Subsequent locking of the 60-genes biomarker gene panel and ROC-curve evaluation of an independent NSCLC late-stage validation cohort (n=518, n=245 NSCLC and n=273 non-cancer) resulted in an AUC-value of 0.80 (95%-CI: 0.77-0.84) and an classification accuracy of 73%, and an independent NSCLC locally-advanced validation cohort (n=106, n=53 NSCLC and n=53 non-cancer) resulted in an AUC-value of 0.74 (95%-CI: 0.64-0.84) and an classification accuracy of 69%.
Before the biomarker gene panel was reduced to 10 genes, the 60-genes biomarker gene panel was filtered for genes that were also selected by PSO (see above). 45 out of 60 genes were present in both gene panels and thus selected for further analyses. The 45 genes resulted in an AUC-value of 0.77 (95%-CI: 0.73-0.81) and a classification accuracy of 77% in an independent, late stage validation set (n=518 samples). The AUC-value was 0.74 (95%-CI: 0.65-0.83), with a classification accuracy of 70% in an early stage validation set (n=106 samples). Following, random 10-gene panel biomarker gene panels from these 45 candidate biomarkers were selected (n=1000 iterations), and the classification accuracy in the evaluation cohort (n=88) was determined. The randomly selected biomarker gene panel (n=10 genes) with highest AUC-value and classification accuracy (respectively 0.87 and 81%) was selected for validation in the independent early and late-stage validation cohort (early-stage cohort: n=106, AUC-value: 0.69 (95%-CI: 0.59-0.79), classification accuracy 65%, late-stage cohort: n=518, AUC-value: 0.74 (95%-CI: 0.70-0.77), classification accuracy 68%).
The p-selectin 5-gene signature was selected using a similar approach. First, all genes correlated to the expression level of p-selectin RNA were selected and sorted according to the correlation coefficient and FDR-value. Next, the sorted p-selectin correlating genes were filtered for those with a positive logarithmic fold change in the non-cancer versus NSCLC ANOVA. Again, the p-selectin gene panel was iteratively increased by adding in each iteration one additional gene, according to the FDR-ranked p-selectin co-correlating gene list. This was performed for two up till and including 50 genes. For each biomarker set the samples in the evaluation cohort were evaluated for AUC-value and classification accuracy, and the p-selectin gene panel with the best AUC-value and classification accuracy was selected (n=5 genes, AUC: 0.74, classification accuracy: 70%). The resulting 5 gene panel classified the independent NSCLC late-stage validation samples, resulting in an AUC-value of 0.58 (95%-CI: 0.53-0.62) and classification accuracy of 57% (n=518 samples). The early-stage NSCLC were classified with an AUC-value of 0.66 (95%-CI: 0.55-0.76) and classification accuracy of 65% (n=106 samples).
A minimal gene panel for nivolumab response prediction was selected using a similar approach. Platelet samples were collected up to one month before start of treatment (baseline, n=179 samples). Response assessment of patients treated with nivolumab was performed by CT-imaging at baseline, 6-8 weeks, 3 months and 6 months after start of treatment. Treatment response was assessed according to the updated RECIST version 1.1 criteria (Eisenhauer et al., 2009. Europ J Cancer 45: 228-247; Schwartz et al., 2016. Eur J Cancer 62: 132-7), and scored as progressive disease (PD), stable disease (SD), partial response (PR), or complete response (CR). The main aim was to identify those patients who showed control of disease in response to therapy versus non-responders. Hence, for the nivolumab response prediction analysis, patients were grouped who showed progressive disease as the most optimal response in the non-responding group, totaling 179 samples. Patients with partial response at any response assessment time point as most optimal response or stable disease at 6 months response assessment were annotated as responders, totaling 91 samples. To select and validate the nivolumab biomarker gene panel, 91 responders and 91 non-responders matched for age and gender were selected randomly, to enable for equal group sizes. 55 responders and non-responders were assigned to the training cohort (n=110 in total), 25 responders and non-responders were assigned to the evaluation cohort (n=50 in total), and 11 responders and non-responders remained for independent validation (n=22 in total). We first subjected the cohort to the RUV-normalization module (Jacob et al., 2016. Biostatistics 17: 16-28). For this analysis, genes were selected that showed expression levels which correlated to sample library sizes (calculated by Pearson's correlation) and hospital of sample collection (calculated by ANOVA statistics), and subjected the samples to RUV-correction. This enables for correction of the read counts for confounding factors in the RNA-sequencing data. The stable genes were determined using the training cohort only. Next, we performed trimmed mean of M values-normalization (TMM-normalization; Robinson and Oshlack, 2010. Genome Biol 11: R25) and subjected the TMM-normalized log-2-transformed counts-per-million reads to per-gene wilcoxon differential expression analysis. For this, only the samples assigned to the training cohort were included. The gene list resulting from the wilcoxon differential expression analysis sorted by p-value served as an input for an iterative biomarker gene panel selection algorithm as described above. The direction of the differential expression was calculated by subtracting the median counts from the non-responders from the responders (delta_median-value). The nivolumab response prediction score was determined by subtracting per sample the median counts of genes that showed decreased expression from those that showed increased expression. During each iteration of the iterative biomarker gene panel selection algorithm both an increased and decreased RNA was added. For each biomarker set, the AUC-value of a HOC-curve of the biomarker gene was evaluated in an evaluation cohort (n=50 samples). This was performed for a biomarker gene panel ranging from 4 up till and including 1600 genes. The evaluation cohort reached the highest AUC-value in the ROC-curve of the nivolumab response prediction score in a 4-gene biomarker gene panel (AUC-value: 0.69, classification accuracy: 70%). Subsequent locking of the 4-gene biomarker gene panel and HOC-curve analysis of classification of an independent validation cohort (n=22, n=11 responders, n=11 non-responders) resulted in an AUC-value of 0.70 (95%-CI: 0.47-0.94) and an classification accuracy of 73%. Additional evaluation of a 6-gene biomarker gene panel, selected using the three most significantly increase and the three most significantly decreased differentially expressed RNAs resulted in an classification accuracy of (0% in the evaluation cohort (AUC: 0.60, n=50 samples) and classification accuracy of 64% in the validation cohort (AUC: 0.61, 95%-CI: 0.36-0.86, n=22 samples).
Number | Date | Country | Kind |
---|---|---|---|
2018391 | Feb 2017 | NL | national |
2018567 | Mar 2017 | NL | national |
Filing Document | Filing Date | Country | Kind |
---|---|---|---|
PCT/NL2018/050110 | 2/19/2018 | WO | 00 |