The field of the invention is the profiling of omics data as they relate to cancer, especially as such profiling relates to the reduction of false positive results for polymorphisms in gene panel tumor-only analysis for various cancers.
The background description includes information that may be useful in understanding the present invention. It is not an admission that any of the information provided herein is prior art or relevant to the presently claimed invention, or that any publication specifically or implicitly referenced is prior art.
All publications and patent applications herein are incorporated by reference to the same extent as if each individual publication or patent application were specifically and individually indicated to be incorporated by reference. Where a definition or use of a term in an incorporated reference is inconsistent or contrary to the definition of that term provided herein, the definition of that term provided herein applies and the definition of that term in the reference does not apply.
Commercially available, clinical-grade gene panel tests based on DNA sequencing are widely used in clinical practice. These panel-based tests, which rely on tumor-only analysis, are presently the most common approach for genomic testing in oncology for clinical decision support. Sequencing-based approaches seek to identify the somatically derived genomic variation that drives tumor growth and to precisely distinguish these genetic variants from the large background of inherited germline genomic variation that inevitably predominates in the tumor genome.
In 2016, the Centers for Medicare and Medicaid Services (CMS) authorized coverage of a tumor-only DNA sequencing-based test of 35 genes intended to inform lung cancer treatment. This CMS-approved test is based on tumor-only analysis of a targeted gene panel and specifically excludes comparison with the patient's normal germline tissue. Instead, the approved test uses a reference genome and a filtration technique to distinguish ‘true’ somatic variants from either normal polymorphisms or inherited germline variants. The test (MolDX: L36194) is defined as a “single test using tumor tissue only (i.e., not matched tumor and normal) that does not distinguish between somatic and germline alterations”. However, this tumor-only approach has been reported by others to increase the risk of mistakenly identifying germline mutations as somatically derived genetic changes and potential cancer driver mutations (“false positives”). While it was recently shown that false positive rates associated with tumor-only sequencing can be reduced, at least to some degree, by molecular pathologist review of all putative somatic variants, such individual review is generally time-consuming and still error-prone.
Thus, there remains a need for improved methods of analyzing omics data from cancer patients, especially where false positive test results are likely.
The inventive subject matter is directed to various methods of analyzing and/or identifying tumor-associated single nucleotide variants (SNVs) using genomics and transcriptomics data of tumor DNA, germline DNA, and tumor RNA from a patient, which unexpectedly improves accuracy, and with that, chances of effective treatment.
Thus, in one aspect of the inventive subject matter, the inventors contemplate a method of performing a SNV-based cancer test with increased accuracy. This method includes a step of obtaining DNA sequencing data from a tumor sample and a matched normal sample (i.e., non-tumor sample of the same patient), and a further step of obtaining RNA sequencing data from the tumor sample. The method then further includes a step of determining presence of DNA single nucleotide variants in the tumor sample relative to the matched normal sample and a step of determining expression of the DNA single nucleotide variants using the RNA sequencing data. In some embodiments, the step of determining the presence of the DNA single nucleotide variant is performed using location guided synchronous alignment of the DNA sequencing data from the tumor sample and the matched normal sample. Preferably, the method further includes a step of identifying at least one DNA single nucleotide variant as being associated with cancer status of the patient based on the presence and the expression of the single nucleotide variants.
Most typically, the DNA sequencing data is whole genome DNA sequencing data. Preferably, DNA sequencing data of the tumor tissue have a read depth of at least 50×, and/or the DNA sequencing data of the matched normal tissue have a read depth of at least 30×. In some embodiments, the method further comprises a step of filtering the DNA single nucleotide variants using allele frequencies of the DNA single nucleotide variants.
In another aspect of the inventive subject matter, the inventors contemplate a method of identifying a treatment option for a patient with increased accuracy. This method includes a step of determining presence of DNA single nucleotide variants in the tumor sample relative to the matched normal sample of the patient, and a step of determining expression of the DNA single nucleotide variants using the RNA sequencing data. Then, the method further comprises a step of identifying the treatment option targeting a gene having at least one DNA single nucleotide variant that is expressed as RNA.
Preferably, the step of determining the presence of the DNA single nucleotide variant is performed using location guided synchronous alignment of the DNA sequencing data from the tumor sample and the matched normal sample. In some embodiments, the step of determining the presence of the DNA single nucleotide variant is performed using an in silico gene panel having a plurality of reference sequences of tumor associated genes. In such embodiment, it is preferred that the in silico gene panel is cancer type-specific and/or the tumor associated genes are selected from a group consisting of ABL1, EGFR, GNAS, KRAS, PTPN11, AKT1, ERBB2, GNAQ, MET, RB1, ALK, ERBB4, HNF1A, MLH1, RET, APC, EZH2, HRAS, MPL, SMAD4, ATM, FBXW7, IDH1, NOTCH1, SMARCB1, BRAF, FGFR1, JAK2, NPM1, SMO, CDH1, FGFR2, JAK3, NRAS, SRC, CDKN2A, FGFR3, IDH2, PDGFRA, STK11, CSF1R, FLT3, KDR, PIK3CA, TP53, CTNNB1, GNA11, KIT, PTEN, VHL.
In some embodiments, the method further comprises a step of filtering the DNA single nucleotide variants using allele frequencies of the DNA single nucleotide variants.
In some embodiments, the step of determining the expression of the DNA single nucleotide variants comprises measuring the RNA expression level of the DNA single nucleotide variants and comparing it with a predetermined threshold. In such embodiments, it is contemplated that the method may further comprise a step of ranking the DNA single nucleotide variants based on the RNA expression level and/or a step of classifying the DNA single nucleotide variants into an “expressed” or “non-expressed” group based on the comparison with the predetermined threshold.
In still another aspect of the inventive subject matter, the inventors contemplate a method of testing a patient sample that includes a step of generating or obtaining DNA omics data from tumor and matched normal tissue of the patient, and a further step of generating or obtaining RNA omics data from the tumor tissue of the patient. In yet another step, tumor and patient specific SNVs are identified in the DNA omics data of the tumor using the DNA omics data of the matched normal tissue, and the RNA omics data from the tumor tissue are used to confirm presence and quantity of expression of the SNV.
Preferably, the DNA and/or RNA omics data are in BAM format, and the step of identifying tumor and patient specific SNVs is performed using incremental synchronous alignment (e.g., using BAMBAM, which may use the DNA omics data and the RNA omics data). Most typically, but not necessarily, the RNA omics data are RNAseq data, and/or the SNVs in the DNA omics data of the tumor are in a cancer driver gene or in an inherited cancer risk gene. For example, suitable cancer driver genes include ACT1, ACT2, ACT3, APC, ATM, BRAF, BRCA1, BRCA2, CHEK1, CHEK2, EGFR, ERBB2, ERBB3, ERBB4, FGFR1, FGFR2, FGFR3, HRAS, JAK3, KIT, KRAS, MET, NOTCH1, NRAS, PALB2, PDGFRA, PIK3CA, PTEN, SMO, SRC, and TP53, and suitable inherited cancer risk genes include APC, ATM, AXIN2, BMPR1A, CHD1, CHEK2, EPCAM, GREM1, MLH1, MSH2, MSH6, MUTYH, PMS2, POLD1, POLE, PTEN, SMAD4, STK11, and TP53.
In still another aspect of the inventive subject matter, the inventors contemplate a method of increasing accuracy in identifying a true somatic single nucleotide variant in a patient having a tumor. This method includes steps of obtaining DNA sequencing data from a tumor sample and a matched normal sample of a patient, and further obtaining RNA sequencing data from the tumor sample, determining presence of DNA single nucleotide variants in the tumor sample relative to the matched normal sample, determining expression of the DNA single nucleotide variants using the RNA sequencing data, and identifying at least one DNA single nucleotide variant as being associated with cancer status of the patient based on the presence and the expression of the single nucleotide variants.
Most typically, the DNA sequencing data is whole genome DNA sequencing data. In some embodiments, the DNA sequencing data of the tumor tissue have a read depth of at least 50×, and/or the DNA sequencing data of the matched normal tissue have a read depth of at least 30×.
In some embodiments, the step of determining the presence of the DNA single nucleotide variant is performed using location guided synchronous alignment of the DNA sequencing data from the tumor sample and the matched normal sample. In other embodiments, the method may further comprise a step of filtering the DNA single nucleotide variants using allele frequencies of the DNA single nucleotide variants.
In some embodiments, the step of determining the presence of the DNA single nucleotide variant is performed using an in silico gene panel having a plurality of reference sequences of tumor associated genes. In such embodiments, it is preferred that the in silico gene panel is cancer type-specific, and/or the tumor associated genes are selected from a group consisting of ABL1, EGFR, GNAS, KRAS, PTPN11, AKT1, ERBB2, GNAQ, MET, RB1, ALK, ERBB4, HNF1A, MLH1, RET, APC, EZH2, HRAS, MPL, SMAD4, ATM, FBXW7, IDH1, NOTCH1, SMARCB1, BRAF, FGFR1, JAK2, NPM1, SMO, CDH1, FGFR2, JAK3, NRAS, SRC, CDKN2A, FGFR3, IDH2, PDGFRA, STK11, CSF1R, FLT3, KDR, PIK3CA, TP53, CTNNB1, GNA11, KIT, PTEN, VHL.
In some embodiments, the step of determining the expression of the DNA single nucleotide variants comprises measuring RNA expression level of the DNA single nucleotide variants and comparing with a predetermined threshold. In such embodiments, it is also contemplated that the method may further comprise a step of ranking the DNA single nucleotide variants based on the RNA expression level, and/or classifying the DNA single nucleotide variants into an “expressed group” or a “non-expressed group” based on the comparison with the predetermined threshold.
Various objects, features, aspects and advantages of the inventive subject matter will become more apparent from the following detailed description of preferred embodiments, along with the accompanying drawings.
The inventors have unexpectedly discovered that single nucleotide variants (SNVs) identified by conventional tumor DNA analysis pose a high risk of including false-positive and/or false-negative SNVs, as the majority of such identified SNVs are germline-originated variants. The inventors further discovered that many identified somatic SNVs are not expressed as RNA, such that identifying such non-expressed somatic SNVs as molecular targets for tumor treatment leads to ineffective cancer treatment. Viewed from a different perspective, the inventors have now discovered that the accuracy of a single nucleotide variant-based cancer test can be significantly increased by simultaneous bioinformatics analysis of tumor genomic DNA relative to matched normal to identify somatic SNVs and of tumor RNA expression to identify expressed or non-expressed somatic SNVs. Consequently, the inventors contemplate that such identified somatic SNVs that are expressed in the tumor can be associated with cancer status, and can further be identified as effective targets of the tumor treatment.
As used herein, the term “tumor” refers to, and is interchangeably used with one or more cancer cells, cancer tissues, malignant tumor cells, or malignant tumor tissue, that can be placed or found in one or more anatomical locations in a human body. It should be noted that the term “patient” as used herein includes both individuals that are diagnosed with a condition (e.g., cancer) as well as individuals undergoing examination and/or testing for the purpose of detecting or identifying a condition. Thus, a patient having a tumor refers to both individuals that are diagnosed with a cancer as well as individuals that are suspected to have a cancer. As used herein, the term “provide” or “providing” refers to and includes any acts of manufacturing, generating, placing, enabling to use, transferring, or making ready to use.
Thus, in one especially preferred aspect of the inventive subject matter, the inventors contemplate that the accuracy of a single nucleotide variant-based cancer test can be significantly increased by obtaining DNA and RNA data from a tumor sample and/or a matched normal sample of a patient to so determine DNA single nucleotide variants in the tumor sample relative to the matched normal sample and determine expression of the DNA single nucleotide variants. It is contemplated that DNA single nucleotide variants that are expressed as RNA can be associated with cancer status of the patient with high accuracy.
Any suitable methods of obtaining a tumor sample (tumor cells or tumor tissue) from the patient (or healthy tissue from a patient or a healthy individual as a comparison) are contemplated. Most typically, a tumor sample can be obtained from the patient via a biopsy (including liquid biopsy, or obtained via tissue excision during a surgery or an independent biopsy procedure, etc.), which can be fresh or processed (e.g., frozen, etc.) until further processing to obtain omics data from the tissue. For example, the tumor cells or tumor tissue may be fresh or frozen. In another example, the tumor cells or tumor tissue may be in the form of cell/tissue extracts. In some embodiments, the tumor samples may be obtained from a single tissue or anatomical region or from multiple different tissues or anatomical regions. For example, metastatic breast cancer tissue can be obtained from the patient's breast as well as from other organs (e.g., liver, brain, lymph node, blood, lung, etc.) harboring metastasized breast cancer tissue. Preferably, a healthy tissue of the patient or matched normal tissue (e.g., patient's non-cancerous breast tissue) can be obtained, or a healthy tissue from a healthy individual (other than the patient) can also be obtained in a similar manner as a comparison.
In some embodiments, tumor samples can be obtained from the patient at multiple time points in order to determine any changes in the tumor samples over a relevant time period. For example, tumor samples (or suspected tumor samples) may be obtained before and after the samples are determined or diagnosed as cancerous. In another example, tumor samples (or suspected tumor samples) may be obtained before, during, and/or after (e.g., upon completion, etc.) a one-time anti-tumor treatment or a series of anti-tumor treatments (e.g., radiotherapy, chemotherapy, immunotherapy, etc.). In still another example, the tumor samples (or suspected tumor samples) may be obtained during the progress of the tumor upon identifying newly metastasized tissues or cells.
From the obtained tumor cells or tumor tissue, DNA (e.g., genomic DNA, extrachromosomal DNA, etc.), RNA (e.g., mRNA, miRNA, siRNA, shRNA, etc.), and/or proteins (e.g., membrane protein, cytosolic protein, nuclear protein, etc.) can be isolated and further analyzed to obtain omics data. Alternatively and/or additionally, a step of obtaining omics data may include receiving omics data from a database that stores omics information of one or more patients and/or healthy individuals. For example, omics data of the patient's tumor may be obtained from isolated DNA, RNA, and/or proteins from the patient's tumor tissue, and the obtained omics data may be stored in a database (e.g., cloud database, a server, etc.) with other omics data sets of other patients having the same type of tumor or different types of tumor. Omics data obtained from the healthy individual or the matched normal tissue (or healthy tissue) of the patient can also be stored in the database such that the relevant data set can be retrieved from the database upon analysis. Likewise, where protein data are obtained, these data may also include protein activity, especially where the protein has enzymatic activity (e.g., polymerase, kinase, hydrolase, lyase, ligase, oxidoreductase, etc.).
As used herein, omics data includes but is not limited to information related to genomics, proteomics, and transcriptomics, as well as specific gene expression or transcript analysis, and other characteristics and biological functions of a cell. With respect to genomics data, suitable genomics data includes DNA sequence analysis information that can be obtained by whole genome sequencing and/or exome sequencing (typically at a coverage depth of at least 10×, more typically at least 20×) of both tumor and matched normal sample. Alternatively, DNA data may also be provided from an already established sequence record (e.g., SAM, BAM, FASTA, FASTQ, or VCF file) from a prior sequence determination. Therefore, data sets may include unprocessed or processed data sets, and exemplary data sets include those having BAM format, SAM format, FASTQ format, or FASTA format. However, it is especially preferred that the data sets are provided in BAM format or as BAMBAM diff objects (e.g., US2012/0059670A1 and US2012/0066001A1). Omics data can be derived from whole genome sequencing, exome sequencing, transcriptome sequencing (e.g., RNA-seq), or from gene specific analyses (e.g., PCR, qPCR, hybridization, LCR, etc.). Likewise, computational analysis of the sequence data may be performed in numerous manners. In most preferred methods, however, analysis is performed in silico by location-guided synchronous alignment of tumor and normal samples as, for example, disclosed in US 2012/0059670A1 and US 2012/0066001A1 using BAM files and BAM servers. Such analysis advantageously reduces false positive neoepitopes and significantly reduces demands on memory and computational resources.
It should be noted that any language directed to a computer should be read to include any suitable combination of computing devices, including servers, interfaces, systems, databases, agents, peers, engines, controllers, or other types of computing devices operating individually or collectively. One should appreciate the computing devices comprise a processor configured to execute software instructions stored on a tangible, non-transitory computer readable storage medium (e.g., hard drive, solid state drive, RAM, flash, ROM, etc.). The software instructions preferably configure the computing device to provide the roles, responsibilities, or other functionality as discussed below with respect to the disclosed apparatus. Further, the disclosed technologies can be embodied as a computer program product that includes a non-transitory computer readable medium storing the software instructions that causes a processor to execute the disclosed steps associated with implementations of computer-based algorithms, processes, methods, or other instructions. In especially preferred embodiments, the various servers, systems, databases, or interfaces exchange data using standardized protocols or algorithms, possibly based on HTTP, HTTPS, AES, public-private key exchanges, web service APIs, known financial transaction protocols, or other electronic information exchanging methods. Data exchanges among devices can be conducted over a packet-switched network, the Internet, LAN, WAN, VPN, or other type of packet switched network; a circuit switched network; cell switched network; or other type of network.
It is contemplated that somatic SNVs can be distinguished and identified from germline SNVs by comparing the genomic DNA sequences obtained from tumor tissue and matched normal tissue of a patient (e.g., non-tumor tissue of a patient including liquid biopsy of nontumor blood sample). With respect to the analysis of tumor and matched normal tissue of a patient, numerous manners are deemed suitable for use herein so long as such methods will be able to generate a differential sequence object or other identification of location-specific difference between tumor and matched normal sequences. Exemplary methods include sequence comparison against an external reference sequence (e.g., hg18, or hg19) or sequence comparison against an internal reference sequence (e.g., matched normal), and sequence processing against known common mutational patterns (e.g., SNVs). Therefore, contemplated methods and programs to detect mutations between tumor and matched normal, tumor and liquid biopsy, and matched normal and liquid biopsy include iCallSV (URL: github.com/rhshah/iCallSV), VarScan (URL: varscan.sourceforge.net), MuTect (URL: github.com/broadinstitute/mutect), Strelka (URL: github.com/Illumina/strelka), Somatic Sniper (URL: gmt.genome.wustl.edu/somatic-sniper/), and BAMBAM (US 2012/0059670).
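The comparison described above can be illustrated with a minimal sketch, assuming that variant calls for the tumor and the matched normal sample are already available as (chromosome, position, reference, alternate) records. The helper name `split_somatic_germline` and all coordinates are hypothetical; the tools named above operate on aligned reads rather than on finished call sets.

```python
from typing import NamedTuple


class Variant(NamedTuple):
    chrom: str
    pos: int
    ref: str
    alt: str


def split_somatic_germline(tumor_variants, normal_variants):
    """Partition tumor variants by whether the matched normal also carries them."""
    normal_set = set(normal_variants)
    # Present in tumor only: candidate somatic variant.
    somatic = [v for v in tumor_variants if v not in normal_set]
    # Present in both tumor and matched normal: treated as germline.
    germline = [v for v in tumor_variants if v in normal_set]
    return somatic, germline


# Made-up coordinates, for illustration only.
tumor_calls = [Variant("chr7", 100, "T", "G"), Variant("chr17", 200, "C", "T")]
normal_calls = [Variant("chr17", 200, "C", "T")]
somatic_calls, germline_calls = split_somatic_germline(tumor_calls, normal_calls)
```

Under these assumptions, a variant shared with the matched normal is never reported as somatic, which is the source of the false positive reduction relative to tumor-only analysis.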
However, in especially preferred aspects of the inventive subject matter, the sequence analysis is performed by incremental synchronous alignment of the first sequence data (tumor sample) with the second sequence data (matched normal), for example, using an algorithm as described in Cancer Res 2013 Oct. 1; 73(19):6036-45, US 2012/0059670 and US 2012/0066001, to so generate the patient and tumor specific mutation data. As will be readily appreciated, the sequence analysis may also be performed in such methods comparing omics data from the tumor sample and matched normal omics data to so arrive at an analysis that can not only inform a user of mutations that are genuine to the tumor within a patient, but also of mutations that have newly arisen during treatment (e.g., via comparison of matched normal and matched normal/tumor, or via comparison of tumor). In addition, using such algorithms (and especially BAMBAM), allele frequencies and/or clonal populations for specific mutations can be readily determined, which may advantageously provide an indication of treatment success with respect to a specific tumor cell fraction or population. Thus, omics data analysis may reveal missense and nonsense mutations, changes in copy number, loss of heterozygosity, deletions, insertions, inversions, translocations, changes in microsatellites, etc.
Moreover, it should be noted that the data sets are preferably reflective of a tumor and a matched normal sample of the same patient to so obtain patient and tumor specific information. Thus, genetic germ line alterations not giving rise to the tumor (e.g., silent mutation, SNP, etc.) can be excluded. Of course, it should be recognized that the tumor sample may be from an initial tumor, from the tumor upon start of treatment, from a recurrent tumor or metastatic site, etc. In most cases, the matched normal sample of the patient may be blood, or non-diseased tissue from the same tissue type as the tumor.
In some embodiments, where the whole genome or exome sequencing data of the tumor and matched normal is compared with external reference sequences, it is contemplated that the external reference sequences are organized as an in silico gene panel. Preferably, the in silico gene panel includes a plurality of tumor-associated genes, including tumor-driver gene(s) or cancer-driver gene(s) (e.g., EGFR, KRAS, TP53, APC, etc.) and/or drug-sensitivity or metabolism related genes. It is contemplated that the numbers and types of genes in the in silico gene panel may vary depending on the type of cancer the patient may have or be diagnosed (e.g., cancer type-specific in silico gene panel), and preferably includes at least 20 genes, at least 30 genes, at least 40 genes, or at least 50 genes. For example, the in silico gene panel may include whole genome sequences and/or whole exome sequences of ABL1, EGFR, GNAS, KRAS, PTPN11, AKT1, ERBB2, GNAQ, MET, RB1, ALK, ERBB4, HNF1A, MLH1, RET, APC, EZH2, HRAS, MPL, SMAD4, ATM, FBXW7, IDH1, NOTCH1, SMARCB1, BRAF, FGFR1, JAK2, NPM1, SMO, CDH1, FGFR2, JAK3, NRAS, SRC, CDKN2A, FGFR3, IDH2, PDGFRA, STK11, CSF1R, FLT3, KDR, PIK3CA, TP53, CTNNB1, GNA11, KIT, PTEN, VHL.
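As a minimal sketch of how an in silico gene panel might be applied, the example below restricts a list of called variants to genes on a panel. The panel shown is only a small subset of the genes listed above, and the record layout (a dictionary with a `gene` key), the function name `restrict_to_panel`, and the example calls are assumptions made for illustration.

```python
# Small illustrative subset of the panel genes named in the text.
EXAMPLE_PANEL = frozenset({"EGFR", "KRAS", "TP53", "APC", "BRAF", "PIK3CA"})


def restrict_to_panel(variants, panel=EXAMPLE_PANEL):
    """Keep only variants whose annotated gene is part of the panel."""
    return [v for v in variants if v["gene"] in panel]


# Hypothetical variant records; positions are placeholders.
called = [
    {"gene": "KRAS", "pos": 1, "alt": "T"},
    {"gene": "TTN", "pos": 2, "alt": "A"},   # not on the panel, dropped
    {"gene": "TP53", "pos": 3, "alt": "G"},
]
on_panel = restrict_to_panel(called)
```

A cancer type-specific panel would simply pass a different gene set as the `panel` argument.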
Additionally, it is also contemplated that such identified DNA single nucleotide variants are further filtered using DNA allele frequencies (e.g., using a public database with reported population allele frequencies). In some embodiments, the DNA single nucleotide variants can be filtered with a predetermined frequency threshold, for example, reported allele frequencies ≥0.01 (1%), preferably ≥0.005 (0.5%), or more preferably ≥0.001 (0.1%).
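The allele frequency filter can be sketched as follows. This is an illustration under stated assumptions: variants whose reported population allele frequency is at or above the threshold are removed as likely polymorphisms, variants absent from the database are retained, and the lookup table `population_af` stands in for a public database. The function name and the example identifiers are hypothetical.

```python
def filter_by_population_af(variants, population_af, max_af=0.01):
    """Drop variants reported in the population at a frequency >= max_af."""
    kept = []
    for variant in variants:
        af = population_af.get(variant)  # None if the variant is not reported
        if af is None or af < max_af:
            kept.append(variant)
    return kept


# Placeholder identifiers and frequencies, for illustration only.
candidates = ["var_common", "var_rare", "var_unreported"]
reported = {"var_common": 0.12, "var_rare": 0.0004}

kept_1pct = filter_by_population_af(candidates, reported, max_af=0.01)
kept_strict = filter_by_population_af(candidates, reported, max_af=0.0001)
```

Lowering `max_af` from 0.01 to 0.005 or 0.001 makes the filter stricter, since more reported variants are then treated as polymorphisms.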
Additionally, the significance of the sequence change (DNA single nucleotide variants) can be assessed by variant calling where the genomics data is in BAM file format. Because BamBam keeps the sequence data in the pair of files in sync across the genome, a complex mutation model that requires sequencing data from both BAM files derived from two biological samples as well as the reference can be implemented easily. This model aims to maximize the joint probability of both sequence strings of two biological samples. To find the optimal genotypes of two sequence strings from two biological samples, the inventors aim to maximize the likelihood defined by:
$$P(D_g, D_t, G_g, G_t \mid \alpha, r) = P(D_g \mid G_g)\,P(G_g \mid r)\,P(D_t \mid G_g, G_t, \alpha)\,P(G_t \mid G_g) \tag{1}$$
where r is the observed reference allele, α is the fraction of normal contamination, and the genotypes of sequence strings 2 and 1 (the tumor and germline data, respectively) are defined by $G_t = (t_1, t_2)$ and $G_g = (g_1, g_2)$, respectively, where $t_1, t_2, g_1, g_2 \in \{A, T, C, G\}$. The sequence data of sequence strings 2 and 1 are defined as sets of reads $D_t = \{d_t^1, d_t^2, \ldots, d_t^m\}$ and $D_g = \{d_g^1, d_g^2, \ldots, d_g^n\}$, respectively, with the observed bases $d_t^i, d_g^i \in \{A, T, C, G\}$. All data used in the model must exceed user-defined base and mapping quality thresholds.
The probability of the germline alleles given the germline genotype is modeled as a multinomial over the four nucleotides:
$$P(D_g \mid G_g) = \frac{n!}{n_A!\,n_G!\,n_C!\,n_T!} \prod_{i=1}^{n} P(d_g^i \mid G_g)$$
where n is the total number of germline reads at this position and $n_A, n_G, n_C, n_T$ are the reads supporting each observed allele. The base probabilities, $P(d_g^i \mid G_g)$, are assumed to be independent, coming from either of the two parental alleles represented by the genotype $G_g$, while also incorporating the approximate base error rate of the sequencer. The prior on the sequence string 1 genotype is conditioned on the reference base as:
$$P(G_g \mid r = a) = \{\mu_{aa}, \mu_{ab}, \mu_{bb}\}$$
where $\mu_{aa}$ is the probability that the position is homozygous reference, $\mu_{ab}$ is heterozygous reference, and $\mu_{bb}$ is homozygous non-reference. At this time, the sequence string 1 prior does not incorporate any information on known, inherited SNPs.
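The germline terms of the model can be sketched in a few lines. This is an illustration, not the BamBam implementation: the per-read base probability is assumed to be an equal mixture over the two alleles with a flat sequencing error rate, the prior values are placeholders, and a genotype carrying no reference allele is assumed to receive the homozygous non-reference prior. All function names are hypothetical.

```python
import math
from collections import Counter

BASES = "ACGT"


def base_prob(base, genotype, error=0.01):
    """P(d | G): equal mixture over both alleles, flat error rate (assumed)."""
    def from_allele(allele):
        return 1.0 - error if base == allele else error / 3.0

    a1, a2 = genotype
    return 0.5 * from_allele(a1) + 0.5 * from_allele(a2)


def germline_likelihood(reads, genotype, error=0.01):
    """Multinomial probability of the observed bases given the genotype."""
    counts = Counter(reads)
    # Multinomial coefficient n! / (nA! nG! nC! nT!), computed exactly.
    coeff = math.factorial(len(reads))
    for b in BASES:
        coeff //= math.factorial(counts[b])
    likelihood = float(coeff)
    for b in BASES:
        likelihood *= base_prob(b, genotype, error) ** counts[b]
    return likelihood


def germline_prior(genotype, ref, mu_aa=0.998, mu_ab=0.001, mu_bb=0.001):
    """Prior conditioned on the reference base; values are placeholders."""
    n_ref = sum(1 for allele in genotype if allele == ref)
    return {2: mu_aa, 1: mu_ab, 0: mu_bb}[n_ref]


# Six germline reads at one position, reference base A (made-up data).
reads = "AAAAAG"
score_hom_ref = germline_likelihood(reads, ("A", "A")) * germline_prior(("A", "A"), "A")
score_het = germline_likelihood(reads, ("A", "G")) * germline_prior(("A", "G"), "A")
```

Because the parameters are user-definable in the text, they are exposed here as keyword arguments rather than fixed constants.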
The probability of the set of sequence 2 reads is again defined as a multinomial:
$$P(D_t \mid G_t, G_g, \alpha) = \frac{m!}{m_A!\,m_G!\,m_C!\,m_T!} \prod_{i=1}^{m} P(d_t^i \mid G_t, G_g, \alpha)$$
where m is the total number of sequence 2 reads at this position and $m_A, m_G, m_C, m_T$ are the reads supporting each observed allele in the sequence 2 dataset, and the probability of each sequence 2 read is a mixture of base probabilities derived from both sequence 2 and sequence 1 genotypes that is controlled by the fraction of normal contamination, α, as
$$P(d_t^i \mid G_t, G_g, \alpha) = \alpha\,P(d_t^i \mid G_t) + (1 - \alpha)\,P(d_t^i \mid G_g)$$
and the probability of the sequence 2 genotype is defined by a simple mutation model based on the sequence 1 genotype
$$P(G_t \mid G_g) = \max\left[P(t_1 \mid g_1)\,P(t_2 \mid g_2),\; P(t_1 \mid g_2)\,P(t_2 \mid g_1)\right],$$
where the probability of no mutation (for example, $t_1 = g_1$) is maximal and transitions (that is, A→G, T→C) are four times more likely than transversions (that is, A→T, T→G). All model parameters, α, $\mu_{aa}$, $\mu_{ab}$, $\mu_{bb}$, and base probabilities, $P(d^i \mid G)$, for the multinomial distributions are user-definable.
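The mutation model can be sketched as below. The text fixes only the relative weights (no mutation is most likely, and a transition is four times more likely than a transversion); the per-allele mutation rate `mu` and the way it is split across the one possible transition and the two possible transversions are assumptions of this illustration, and the function names are hypothetical.

```python
import math

# Purine-purine and pyrimidine-pyrimidine substitutions.
TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}


def allele_mutation_prob(t, g, mu=1e-6):
    """P(t | g) for a single allele; mu is an assumed overall mutation rate."""
    if t == g:
        return 1.0 - mu
    # From any base there is 1 transition and 2 transversions; weights 4:1:1.
    if (g, t) in TRANSITIONS:
        return mu * 4.0 / 6.0
    return mu / 6.0


def tumor_genotype_prob(tumor_gt, germline_gt, mu=1e-6):
    """P(Gt | Gg): best of the two ways to pair tumor and germline alleles."""
    t1, t2 = tumor_gt
    g1, g2 = germline_gt
    straight = allele_mutation_prob(t1, g1, mu) * allele_mutation_prob(t2, g2, mu)
    crossed = allele_mutation_prob(t1, g2, mu) * allele_mutation_prob(t2, g1, mu)
    return max(straight, crossed)


p_unchanged = tumor_genotype_prob(("A", "A"), ("A", "A"))
p_transition = tumor_genotype_prob(("A", "G"), ("A", "A"))
p_transversion = tumor_genotype_prob(("A", "T"), ("A", "A"))
```

Taking the maximum over the two pairings makes the result independent of the order in which the alleles of a genotype are written.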
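The sequence 2 and sequence 1 genotypes selected, $G_t^{\max}$ and $G_g^{\max}$, are those that maximize (1), and the posterior probability defined by
$$P(G_g^{\max}, G_t^{\max} \mid D_g, D_t, \alpha, r) = \frac{P(D_g, D_t, G_g^{\max}, G_t^{\max} \mid \alpha, r)}{\sum_{i,j} P(D_g, D_t, G_g = i, G_t = j \mid \alpha, r)}$$
can be used to score the confidence in the pair of inferred genotypes. If the sequence 2 and sequence 1 genotypes differ, the mutations in sequence 2 will be reported along with their respective confidence.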
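The selection and scoring step can be sketched as follows, assuming the joint likelihood (1) has already been evaluated for every candidate pair of genotypes and that the posterior is obtained by normalizing the best joint likelihood by the sum over all candidate pairs. The function name and the numeric values are hypothetical.

```python
def best_pair_with_posterior(joint):
    """Return the genotype pair maximizing the joint likelihood and its posterior.

    `joint` maps (germline_genotype, tumor_genotype) to the joint likelihood.
    """
    total = sum(joint.values())
    best = max(joint, key=joint.get)
    return best, joint[best] / total


# Made-up joint likelihoods for three candidate genotype pairs.
joint_likelihoods = {
    (("A", "A"), ("A", "A")): 2.0e-9,
    (("A", "A"), ("A", "G")): 7.0e-9,
    (("A", "G"), ("A", "G")): 1.0e-9,
}
(best_germline, best_tumor), confidence = best_pair_with_posterior(joint_likelihoods)
# A somatic change is reported only when the two inferred genotypes differ.
is_somatic = best_germline != best_tumor
```

In practice the products of many small probabilities are usually accumulated in log space to avoid underflow; plain probabilities are used here only to keep the sketch short.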
Maximizing the likelihood of one or both sequence 1 and 2 genotypes helps to improve the accuracy of both inferred genotypes, especially in situations where one or both sequence datasets have low coverage of a particular genomic position. Other mutation calling algorithms, such as MAQ and SNVMix, that analyze a single sequencing dataset are more likely to make mistakes when the non-reference or mutant alleles have low support (Li, H., et al. (2008) Mapping short DNA sequencing reads and calling variants using mapping quality scores, Genome Research, 18(11), 1851-1858; Goya, R. et al. (2010) SNVMix: predicting single nucleotide variants from next-generation sequencing of tumors, Bioinformatics, 26, 730-736).
In addition to collecting allele support from all reads at a given genomic position, information on the reads is collected (such as which strand, forward or reverse, the read maps to, the position of the allele within the read, the average quality of the alleles, etc.) and used to selectively filter out false positive calls. We expect a random distribution of strands and allele positions for all of the alleles supporting a variant, and if the distribution is skewed significantly from this random distribution (that is, all variant alleles are found near the tail end of a read), then this suggests that the variant call is suspect.
It is also contemplated that the variant calling for sequence changes can also be performed by other analysis tools, including, but not limited to, MuTect (Nat Biotechnol. 2013 March; 31(3):213-9), MuTect2, HaplotypeCaller, Strelka2 (Bioinformatics, Volume 28, Issue 14, 15 Jul. 2012, Pages 1811-1817), or other genomic artifact detection tools.
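One way such a filter might look is sketched below. It is an illustration only: the text does not specify the statistical test or the cutoffs, so an exact two-sided binomial test for strand balance and a simple cutoff on the fraction of variant alleles near a read end are assumed here, and all names and thresholds are hypothetical.

```python
import math


def binomial_two_sided_p(k, n, p=0.5):
    """Exact two-sided binomial p-value: sum of outcomes no likelier than k."""
    probs = [math.comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(n + 1)]
    observed = probs[k]
    return min(1.0, sum(q for q in probs if q <= observed * (1 + 1e-9)))


def is_suspect_call(forward, reverse, tail_fraction, p_cutoff=0.01, tail_cutoff=0.9):
    """Flag a variant whose supporting reads are skewed by strand or position.

    forward / reverse: variant-supporting reads on each strand.
    tail_fraction: fraction of variant alleles located near a read end.
    """
    n = forward + reverse
    if n == 0:
        return True  # no supporting reads at all
    strand_skewed = binomial_two_sided_p(forward, n) < p_cutoff
    tail_skewed = tail_fraction >= tail_cutoff
    return strand_skewed or tail_skewed


balanced = is_suspect_call(forward=5, reverse=5, tail_fraction=0.2)
one_strand_only = is_suspect_call(forward=10, reverse=0, tail_fraction=0.2)
all_near_tail = is_suspect_call(forward=5, reverse=5, tail_fraction=1.0)
```

Stricter or looser cutoffs trade sensitivity against the number of artifacts that pass the filter.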
In addition, omics data of tumor and/or matched normal comprises transcriptome data set that includes sequence information and expression level (including expression profiling or splice variant analysis) of RNA(s) (preferably cellular mRNAs) that is obtained from the patient. There are numerous methods of transcriptomic analysis known in the art, and all of the known methods are deemed suitable for use herein (e.g., RNAseq, RNA hybridization arrays, qPCR, etc.). Consequently, preferred materials include mRNA and primary transcripts (hnRNA), and RNA sequence information may be obtained from reverse transcribed polyA+-RNA, which is in turn obtained from a tumor sample and a matched normal (healthy) sample of the same patient. Likewise, it should be noted that while polyA+-RNA is typically preferred as a representation of the transcriptome, other forms of RNA (hn-RNA, non-polyadenylated RNA, siRNA, miRNA, etc.) are also deemed suitable for use herein. Preferred methods include quantitative RNA (hnRNA or mRNA) analysis and/or quantitative proteomics analysis, especially including RNAseq. In other aspects, RNA quantification and sequencing is performed using RNA-seq, qPCR and/or rtPCR based methods, although various alternative methods (e.g., solid phase hybridization-based methods) are also deemed suitable. Viewed from another perspective, transcriptomic analysis may be suitable (alone or in combination with genomic analysis) to identify and quantify genes having a cancer- and patient-specific mutation.
Preferably, the transcriptomics data set includes allele-specific sequence information and copy number information. In such embodiment, the transcriptomics data set includes all read information of at least a portion of a gene, preferably at least 10×, at least 20×, or at least 30×. Allele-specific copy numbers, more specifically, majority and minority copy numbers, are calculated using a dynamic windowing approach that expands and contracts the window's genomic width according to the coverage in the germline data, as described in detail in U.S. Pat. No. 9,824,181, which is incorporated by reference herein. As used herein, the majority allele is the allele that has majority copy numbers (>50% of total copy numbers (read support) or most copy numbers) and the minority allele is the allele that has minority copy numbers (<50% of total copy numbers (read support) or least copy numbers).
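The majority/minority definition above can be illustrated with a short sketch. It assumes that per-allele read support at a single site is already available as a mapping from allele to read count; it does not implement the dynamic windowing approach of U.S. Pat. No. 9,824,181, and the function name and return layout are hypothetical.

```python
def majority_minority(read_support):
    """read_support: mapping of allele -> supporting read count at one site.
    Returns the allele with the most read support (majority) and the allele
    with the least (minority), with their fractions of total support."""
    total = sum(read_support.values())
    if len(read_support) < 2 or total == 0:
        raise ValueError("need read support for at least two alleles")
    ranked = sorted(read_support.items(), key=lambda item: item[1], reverse=True)
    (major, major_n), (minor, minor_n) = ranked[0], ranked[-1]
    return {
        "majority": major,
        "majority_fraction": major_n / total,
        "minority": minor,
        "minority_fraction": minor_n / total,
    }
```

For example, a site with 42 reads supporting one allele and 18 supporting another yields a majority fraction of 0.7 and a minority fraction of 0.3.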
The inventors contemplate that in some embodiments, the expression of the gene (or a portion of a gene) having one or more single nucleotide variant(s) can be determined by RNA sequencing data (e.g., RNAseq). In such embodiments, the expression of the one or more single nucleotide variant(s) can be assessed as presence or absence (or existence or non-existence) of the one or more single nucleotide variant(s) in the expressed RNA. Consequently, based on the RNA sequencing data, the single nucleotide variant(s) can be grouped into an “expressed group” or a “non-expressed group”. In other embodiments, the expression of the gene (or a portion of a gene) having one or more single nucleotide variant(s) can be determined by combining RNAseq data and RNA quantification data (e.g., using qPCR and/or rtPCR). In such embodiments, the expression level of the one or more single nucleotide variant(s) can be assessed as presence or absence (or existence or non-existence) by comparing with a predetermined threshold. It is contemplated that the predetermined threshold may vary depending on the genes. For example, the predetermined threshold may be 10%, 5%, or 1% of the average RNA expression level of the gene in the same or similar types of tissue (e.g., liver, lung, etc.) of healthy individuals or the RNA expression level of the gene in the matched normal tissue of the patient. Alternatively, the predetermined threshold may vary depending on the qPCR and/or rtPCR noise level in the given reaction(s). For example, the predetermined threshold may be within 20%, within 10%, or within 5% of the noise level of the qPCR and/or rtPCR reaction. Consequently, based on the RNA expression level, the single nucleotide variant(s) can be grouped into an “expressed group” where the expression level is at or above the predetermined threshold, or a “non-expressed group” where the expression level is below the predetermined threshold.
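The threshold-based grouping can be sketched as below. The sketch assumes that an expression level has already been measured for each variant and that a reference level (for example, the average expression of the gene in healthy tissue of the same type) is supplied by the caller; the function name, the variant labels, and the 5% default are illustrative only.

```python
def group_by_expression(variant_expression, reference_level, fraction=0.05):
    """Split variants into expressed / non-expressed groups.

    variant_expression: mapping of variant label -> measured RNA expression level
    reference_level:    expression level the threshold is derived from
    fraction:           threshold as a fraction of the reference (e.g. 0.10, 0.05, 0.01)
    """
    threshold = fraction * reference_level
    groups = {"expressed": [], "non_expressed": []}
    for variant, level in variant_expression.items():
        # At or above the threshold counts as expressed, below as non-expressed.
        key = "expressed" if level >= threshold else "non_expressed"
        groups[key].append(variant)
    return groups

# Hypothetical variant labels and expression values, for illustration only.
example = group_by_expression(
    {"snv_a": 25.0, "snv_b": 12.0, "snv_c": 3.0},
    reference_level=200.0,
)
```

Here the threshold is 5% of 200.0, so `snv_a` and `snv_b` fall in the expressed group and `snv_c` in the non-expressed group.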
Without wishing to be bound by any specific theory, the inventors contemplate that combining genomics data and transcriptomics data to identify expressed DNA single nucleotide variants significantly reduces the false-positive rate (e.g., mistakenly identifying germline mutations as somatically-derived cancer driver mutations, and/or identifying somatically-derived cancer driver mutations that are not expressed as effective mutations, etc.) and/or the false-negative rate (e.g., true tumor somatic SNVs are excluded, etc.). A reduction in the false-positive and/or false-negative rate in identification of DNA single nucleotide variants in tumor-associated genes further significantly increases the efficiency and accuracy in identifying the genes associated with tumor and/or cancer, and also in identifying any effective treatment regimen with reduced undesired side effects or toxicity, as the number of expressed DNA single nucleotide variants to be analyzed and targeted in association with the tumor or cancer can be significantly reduced at a relatively early stage of analysis or application.
Consequently, the inventors further contemplate that based on the presence/absence and the expression of the single nucleotide variants, such single nucleotide variants can be identified as cancer-associated variants (or mutations) that may be further associated with a cancer status of the patient. As used herein, the term “cancer status” refers to any molecular, physiological, or pathological condition of a cancer or a tumor. Thus, the cancer status may include an anatomical type of cancer (e.g., gastrointestinal cancer, lung cancer, brain tumor, etc.), a metastatic status of the tumor (e.g., metastasized, high-tendency of metastasis, non-metastasized, etc.), tumor clonality, an immune status of the tumor tissue (e.g., immune suppressed, immune-activated, immune-dormant, etc.), or a prognosis of the tumor (e.g., stage of the tumor, grade of the tumor including the morphogenesis of the tumor, etc.). In addition, the cancer status may include the sensitivity or resistance of the tumor to a tumor treatment (e.g., resistance to checkpoint inhibitor administration, sensitivity to cytokine treatment, etc.), or a toxicity by a chemotherapeutic drug (e.g., due to a mutation/single nucleotide variant in an element of CYP2D6 enzyme-mediated pathway, etc.).
In some embodiments, the association of the expressed DNA single nucleotide variants to a status of tumor or cancer may be quantified by providing significance score(s). For example, the significance score can be determined by combining sub-scores for the number of DNA single nucleotide variants (1 score per one nucleic acid change), the type of DNA single nucleotide variants (e.g., nonsense mutation, missense mutation, etc.), the location of DNA single nucleotide variants (e.g., exon 3 of the gene encoding the functional binding domain, etc.), and the physiological impact (e.g., dominant negative factor for signaling pathway B). Also, the significance score can be determined by the expression of the gene including the DNA single nucleotide variants (e.g., −1 for each non-expressed DNA single nucleotide variant, +1 for each expressed DNA single nucleotide variant, or various incremental scores based on the expression levels of the gene including DNA single nucleotide variants, such as 1 score per each 10% increased expression of the gene including DNA single nucleotide variants, etc.). Thus, in such embodiments, the significance of DNA single nucleotide variants can be ranked based on the expression (presence or absence in RNA) or expression level (increase or decrease of the RNA expression level compared to normal tissue or healthy individual). Alternatively and/or additionally, the significance score(s) of genes including DNA single nucleotide variants can be used to further rank the genes or DNA single nucleotide variants.
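One way to combine such sub-scores is sketched below. The weights for mutation type and location are hypothetical (the text above gives only the one-point-per-change and the +1/−1 expression sub-scores as examples), and the class, field, and function names are assumptions made for illustration.

```python
from dataclasses import dataclass

# Hypothetical weights for the mutation-type sub-score.
TYPE_SCORES = {"nonsense": 2, "missense": 1}

@dataclass
class Snv:
    name: str
    mutation_type: str          # e.g. "nonsense" or "missense"
    in_functional_domain: bool  # location sub-score (hypothetical: +1 if True)
    expressed: bool             # presence of the variant in RNA

def significance_score(snv):
    score = 1                                       # 1 score per nucleic acid change
    score += TYPE_SCORES.get(snv.mutation_type, 0)  # type of variant
    score += 1 if snv.in_functional_domain else 0   # location of variant
    score += 1 if snv.expressed else -1             # +1 expressed, -1 non-expressed
    return score

def rank_snvs(snvs):
    """Order variants from highest to lowest significance score."""
    return sorted(snvs, key=significance_score, reverse=True)
```

Under these assumed weights, an expressed nonsense variant in a functional domain scores 5, while a non-expressed missense variant elsewhere scores 1 and ranks lower.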
The inventors further contemplate that such identified and/or ranked DNA single nucleotide variants and/or genes including DNA single nucleotide variants can be further used to identify a treatment option to treat the cancer or tumor of the patient. For example, upon confirmation of the DNA single nucleotide variants (identified by tumor matched-normal sequencing) in the RNA and upon confirmation of the RNA as being expressed (e.g., at least 25% as compared to matched normal, at least 50% as compared to matched normal, at least 75% as compared to matched normal, at least 100% as compared to matched normal, at least 125% as compared to matched normal, or at least 150% as compared to matched normal) in a tumor-associated gene having one or more DNA single nucleotide variants, a drug targeting the tumor-associated gene is administered to the patient in a dose and schedule effective to treat the tumor. As used herein, the drug targeting the tumor-associated gene may include a drug that modulates the expression of the gene (at the transcriptional level or translational level), a drug that modulates the post-translational modification of the gene product (protein), a drug that modulates the activity of the gene product (protein), or a drug that modulates the degradation of the gene product (protein).
As used herein, the term “administering” a drug or a cancer treatment refers to both direct and indirect administration of the drug or the cancer treatment. Direct administration of the drug or the cancer treatment is typically performed by a health care professional (e.g., physician, nurse, etc.), whereas indirect administration includes a step of providing or making available the drug or the cancer treatment to the health care professional for direct administration (e.g., via injection, oral consumption, topical application, etc.).
Currently approved tests for lung cancer are based on tumor-only analysis of a targeted gene panel, with the specific exclusion of the patient's normal germline tissue. However, as is shown in more detail below, the tumor-only approach substantially increases the risk of mistakenly identifying germline mutations as somatically-derived cancer driver mutations (i.e., false positives), and further fails to inform a physician whether a potentially druggable target is even present in meaningful quantities in the tumor.
More specifically, the inventors discovered that 94% of all variants found in a currently approved, gene panel tumor-only analysis for lung cancer patients were indeed false positive polymorphisms, and that 48% remained false positives after stringent filtration. Of true somatic mutations identified in a directly druggable subset of this panel, about 18% were not expressed, compounding the risk of inaccurate treatment decisions and treatment futility. On the backdrop of such diagnostic failure it has become apparent that there is a need for improved identification of true tumor somatic variants. As is described in more detail below, such improved analysis has been accomplished by coordinated analyses of tumor DNA, germline DNA, and tumor RNA.
Based on concerns about false positives in tumor-only gene panel analysis, the inventors sought to demonstrate the enhanced precision afforded by simultaneously sequencing and analyzing both tumor and germline, thereby improving the confidence with which mutations can be identified as potential drivers of disease. As is discussed in more detail below, the inventors undertook a study to demonstrate that i) molecular characterization of tumors for the purpose of treatment decision support is appreciably more precise when bioinformatic analysis uses the patient's normal tissue as control (that is, tumor-normal DNA sequencing), and that the precision of true somatic variants so identified is further enhanced when combined with RNA sequencing, ii) bioinformatic filtration of polymorphisms from tumor-only sequence analysis does not match the precision of tumor-normal genomic analysis, and iii) confirmation that any true somatic mutation is expressed in the mRNA provides the critical second line of evidence that a detected somatic tumor mutation may play a role as an oncogenic driver.
In this example, DNA sequencing of tumor and normal germline genomes of the 35-gene panel authorized for coverage by CMS from 45 lung cancer patients and 621 total cancer patients with 33 cancer types was used to quantify the rate of false positive tumor somatic variants originating from the use of the tumor-only sequencing approach. Potential increase in precision from expression analysis of alterations in these 35 genes by RNA sequencing was also assessed.
Patients and Sequencing Data:
In this example, the inventors focused on mutation analysis in 35 genes that have been previously authorized for Medicare coverage by CMS to enable clinicians to better define therapy for patients with lung cancer. CMS approved the use of this gene panel only when genomic variants were identified through tumor only DNA sequencing and analysis (i.e., not matched tumor and normal). This approach does not directly distinguish between somatic and germline alterations. The panel included 25 genes implicated as somatic tumor drivers (tumor driver gene panel) and 10 genes that are known to affect inherited cancer risk (inherited risk gene panel). The tumor driver gene panel consists of: ALK, BRAF, CDKN2A, CEBPA, DNMT3A, EGFR, ERBB2, EZH2, FLT3, IDH1, IDH2, JAK2, KIT, KMT2A, KRAS, MET, NOTCH1, NPM1, NRAS, PDGFRA, PDGFRB, PGR, PIK3CA, PTEN, RET. The inherited cancer risk panel consisted of: APC, BMPR1A, EPCAM, MLH1, MSH2, MSH6, PMS2, POLD1, POLE, STK11.
Whole genome sequencing data from tumor DNA, tumor RNA, and normal DNA of 621 cancer patients was analyzed to identify somatically-derived single nucleotide variants potentially contributing to cancer growth and expansion. This example included 45 lung cancer patients. All patients provided informed consent for the use of the data described in this study. DNA and RNA were extracted from preserved tissue and sequenced using the Illumina platform in a NantOmics Clinical Laboratory Improvement Amendments (CLIA)- and College of American Pathologists (CAP)-certified sequencing laboratory. Performance characteristics of the test used include >95% sensitivity and >99% specificity to detect SNVs transcribed and expressed as RNA. Normal germline and tumor genomes were sequenced to read depths of approximately 30× and 60×, respectively. Approximately 300 million RNA sequencing reads were generated for each tumor.
Data Analysis:
DNA sequencing data was aligned to GRCh37 (www.ncbi.nlm.nih.gov/assembly/2758/) by BWA and duplicate-marked by samblaster, and indel realignment and base quality recalibration were performed by GATK v2.3. RNA sequencing data was aligned by bowtie and RNA transcript expression was estimated by RSEM. Tumor vs. matched-normal variant analysis was performed using the NantOmics Contraster analysis pipeline to determine somatic and germline SNVs, insertions and deletions, and to identify highly amplified regions of the tumor genome.
Small variants were annotated with base-level PhastCons conservation scores, population allele frequencies from dbSNP (Build 142), and their predicted impact to gene transcripts downloaded from the RefSeq database (e.g., changes in DNA sequence and protein).
Identification of Tumor Somatic Single Nucleotide Variants (SNVs):
Whole-genome DNA sequencing of 45 lung cancer patients' tumor and normal (germline) genomes resulted in the identification of 802 missense or nonsense protein-altering SNVs in the panel of 35 genes associated with lung cancer etiology. The panel included 25 genes considered somatic tumor drivers (tumor driver gene panel), and 10 genes known to affect inherited cancer risk (inherited risk gene panel; Table 1). Among the 45 lung cancer patients, the total of 802 SNVs occurred at 147 unique SNV sites. All 802 variants were present in the tumor genomes. Bioinformatic analysis of tumor and normal germline DNA sequence showed that 701 of the 746 SNVs (94%) originated in the germline, and the remaining 45 SNVs (6%) originated in somatic tissue. Applying the same gene panel to the analysis of 621 cancer patients' with 33 cancer types, tumor-normal sequencing analysis resulted in the identification of 10,704 missense or nonsense protein-altering SNVs. There were 919 unique SNVs sites that contributed to the 10,704 SNVs identified. Analysis of each patient's tumor and normal germline genome determined that 10,149 (95%) of the SNVs were of germline origin, while the remaining 555 (5%) SNVs were of somatic origin.
For lung cancer patients, just 7% and 3% of SNVs were of somatic origin in the tumor driver gene panel and inherited risk gene panels, respectively. Among all cancer patients, the percentage of SNVs representing somatic changes was 6% and 3% for genes in the tumor driver gene panel and inherited risk gene panel, respectively. A greater percentage of somatic variants was expected to be observed among the 25 genes that are known to harbor somatic cancer driver mutations. There was significant variation in the number of SNVs observed in each gene. The number of unique SNV sites was strongly correlated with the size of the gene protein-coding sequence (p-value<10⁻⁹, R²=0.70 for all cancer types). However, there was no correlation between the number of germline, somatic, or total variants and the size of the gene (all p-values>0.40). The degree of association between each gene and the cancer outcomes is a likely determinant of the variation in SNV counts observed between genes as well as the natural population genetic variation present in each gene. Furthermore, specific cancer driver SNVs are enriched among the patients.
The small number of unique variants compared to total variants illustrates the presence of common SNVs that are observed in many genomes in the study population of cancer patients. There were 21 variants that had allele frequencies >0.02 in the sample of 621 cancer patients, 17 of which were common germline SNPs and 4 of which were common somatic driver mutations (2 in KRAS and 2 in PIK3CA). All 21 common variants are archived in the single nucleotide polymorphism database (dbSNP) of genetic polymorphisms. Among all patients, 645 of the 919 total unique variants (70%) were observed only once. Three SNVs were of both germline and somatic origin.
Tumor genome sequencing alone (without comparison to the normal germline genome) of the lung cancer patients would identify 746 missense and nonsense protein-altering SNVs (Table 1). In the context of tumor molecular profiling, any SNV of germline origin that is categorized as of somatic origin constitutes a false positive result. Without any filtering of putative germline variants, false positive rates of approximately 94% are expected, given the data presented in Table 1.
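The expected false positive rate without filtering follows directly from the germline and somatic counts reported above (701 germline and 45 somatic SNVs for lung cancer patients; 10,149 germline and 555 somatic SNVs for all cancer patients). The arithmetic can be reproduced with the short sketch below, in which the function name is an assumption.

```python
def unfiltered_false_positive_rate(germline_snvs, somatic_snvs):
    """Fraction of tumor-only calls that are germline in origin when every
    SNV seen in the tumor genome is treated as a putative somatic variant."""
    return germline_snvs / (germline_snvs + somatic_snvs)

# Counts taken from the text above.
lung_rate = unfiltered_false_positive_rate(701, 45)            # 701 of 746 SNVs
all_cancer_rate = unfiltered_false_positive_rate(10_149, 555)  # 10,149 of 10,704 SNVs
```

Rounded to whole percentages, these ratios correspond to the approximately 94% (lung cancer) and 95% (all cancers) figures stated in the text.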
Excluding all of the SNPs that were present in the dbSNP database resulted in the lowest numbers of false positive SNVs. However, the improved false positive rate came at the cost of an increased false negative rate, as many true tumor somatic SNVs were excluded. Excluding all SNVs present in dbSNP resulted in 17 false negatives among 45 true tumor somatic variants observed in the 45 lung cancer patients (38%), and 245 false negatives out of the 555 true somatic variants among all 621 cancer patients (44%). Using the 0.001 allele frequency threshold filter, there were 41 false positive results (5% of the 746 total SNVs observed and 48% of the 86 SNVs remaining after filtering) and zero false negative results among lung cancer patients. The same filtering threshold resulted in 554 false positive results (5% of the 10,704 total SNVs observed and 50% of the 1,107 SNVs remaining after filtering) and zero false negative results among all 621 cancer patients.
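The trade-off between the two filtering strategies can be sketched on toy data. The example assumes each variant carries a population allele frequency (`None` when the variant is absent from the population database) and a true origin known from tumor-normal comparison; the variant records below are invented for illustration and are not study data, and treating "present in the database" as "has an allele frequency" is a simplification.

```python
def evaluate_filter(variants, max_af=0.001):
    """Remove variants with population allele frequency >= max_af and count
    the resulting false positives (germline variants kept) and false
    negatives (somatic variants removed). Passing max_af=0.0 removes every
    variant that has any database frequency at all."""
    kept, removed = [], []
    for v in variants:
        passes = v["af"] is None or v["af"] < max_af
        (kept if passes else removed).append(v)
    return {
        "kept": len(kept),
        "false_positives": sum(v["origin"] == "germline" for v in kept),
        "false_negatives": sum(v["origin"] == "somatic" for v in removed),
    }

# Invented example records (not from the study).
toy = [
    {"af": 0.25,    "origin": "germline"},  # common polymorphism
    {"af": 0.0004,  "origin": "germline"},  # rare inherited variant
    {"af": None,    "origin": "germline"},  # private inherited variant
    {"af": None,    "origin": "somatic"},   # novel somatic mutation
    {"af": 0.00001, "origin": "somatic"},   # recurrent driver also archived in the database
]
by_frequency = evaluate_filter(toy, max_af=0.001)
by_presence = evaluate_filter(toy, max_af=0.0)
```

On this toy set the frequency threshold keeps both somatic variants but lets two germline variants through, whereas excluding everything present in the database leaves one false positive and discards the archived driver as a false negative.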
Consequences of the Tumor-Only Sequencing Approach:
After filtering to remove all SNVs with a population allele frequency ≥0.001, 37 of the 45 lung cancer patients and 472 of all 621 cancer patients had at least one missense or nonsense protein-altering SNV in the panel of 35 genes. The 7 lung cancer and 149 total patients without SNVs after filtering did not have any true somatic variants, showing that the population allele frequency filter did not produce false negative results.
False positive SNVs can have a direct detrimental impact on patient care. Table 2 shows 12 druggable genes, the specific drugs that target each of the genes when they are somatically mutated, and the number of patients with at least 1 false positive SNV observed in each of the genes. Furthermore, the cost and possible adverse health effects associated with each drug are shown to illustrate the financial and clinical implications of prescribing a drug based on a false positive result. Tumor-only sequence analysis can put patients at unnecessary risk of serious adverse drug effects, along with the negative impact of prescribing a drug treatment that is likely to be non-efficacious.
a Average wholesale price for 30 days unless otherwise noted.
b Drug not given continuously.
c Single cycle based on body surface area of 2.02.
d Based on 21 days on and 7 days off schedule.
e Based on 14 days on and 14 days off schedule.
Expression of Somatic Single Nucleotide Variants:
RNA sequencing data allowing assessment of the expression of the tumor somatic SNVs was available from 26 lung cancer patients and 378 of all patients. Table 3 shows the total number of somatic SNVs assessed, the number of somatic SNVs that were not expressed, and the number of patients with a somatic SNV that was not expressed. A significant percentage of SNVs were not expressed: 18% (7 out of 39 SNVs) for lung cancer patients, and 15% (75 out of 517 SNVs) for all cancer patients. There was substantial variation in the percent of expressed tumor somatic variants between genes. Nearly 80% or more of SNVs in FLT3, PDGFRA, PGR, and RET were not expressed among all cancer patients. In the study population, 9% of lung cancer patients (6 of all 26 patients with tumor RNA sequencing data) and 13% of all cancer patients (51 of 378 total cancer patients with tumor RNA sequencing data) had at least one true tumor somatic SNV that was not expressed in the messenger RNA. There were 4 tumor somatic SNVs in 4 lung cancer patients that were not expressed in the twelve genes that are targets for specific drugs shown in Table 2. There were 33 of all cancer patients with tumor somatic SNVs that were not expressed in the RNA. Treatment decisions based on DNA analysis alone might thus result in administration of ineffective therapies.
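The SNV-level and patient-level rates above use different denominators, which the following sketch makes explicit. It assumes each true somatic SNV is recorded as a (patient, variant, expressed) tuple and that the number of patients with RNA sequencing data is supplied separately; the function name and the toy records are hypothetical.

```python
def non_expression_summary(calls, n_patients_with_rna):
    """calls: list of (patient_id, snv_id, expressed) tuples for true somatic SNVs.
    Returns the fraction of SNVs not expressed and the fraction of patients
    (among those with RNA data) carrying at least one non-expressed SNV."""
    not_expressed = [c for c in calls if not c[2]]
    affected_patients = {patient for patient, _, _ in not_expressed}
    return {
        "snv_rate": len(not_expressed) / len(calls),
        "patient_rate": len(affected_patients) / n_patients_with_rna,
    }

# Invented records for illustration (not study data).
toy_calls = [
    ("p1", "s1", True), ("p1", "s2", False),
    ("p2", "s3", True),
    ("p3", "s4", False), ("p3", "s5", False),
    ("p4", "s6", True),
]
summary = non_expression_summary(toy_calls, n_patients_with_rna=5)
```

In the toy data, three of six SNVs are not expressed (SNV rate 0.5), but they occur in only two of the five patients with RNA data (patient rate 0.4), since one patient carries two of them.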
Currently, two sequencing-based approaches are available to identify a patient's tumor somatic variation. In the first approach, the tumor DNA representing a targeted gene panel, the exome, or whole genome is sequenced, and putative germline variation is filtered based on a reference genome and the characteristics of the individual genomic variants discovered in the tumor (termed tumor-only analysis). Identification of a genomic variant in a population genetic database at an appreciable allele frequency is a common filtering criterion for determining if a variant is of inherited germline origin. The second and, as shown herein, more precise approach is to use the patient's own germline genome as the control (rather than a reference genome for filtration) for distinguishing the inherited germline variants from those that are somatically derived (termed tumor-normal analysis). The test currently approved by CMS for informing lung cancer treatment is based on the former approach and specifically excludes the use of normal tissue (germline information) in determining somatic variants.
In contrasting the two approaches, the inventors analyzed tumor and normal DNA sequencing data from 45 lung cancer and 621 total cancer patients versus a tumor-only gene panel approved for coverage by CMS. The study demonstrated a 94% false positive rate (95% for all cancers) when using tumor-only sequencing to identify somatic variants. Even after utilizing multiple methods for bioinformatically filtering polymorphisms from the putative somatic mutations, the false positive rates still ranged from 38%-94%. Depending on the method used, excessively stringent filtering led to potential false negatives. When focusing on a subset of 12 genes targeted by FDA-approved drugs, where identification of somatic mutations could inform treatment decisions, the percentage of lung cancer patients affected by false positive calls ranged from 29%-51% depending on the method of polymorphism filtration used. Further risk of false positive results stems from variants identified in somatic tissue, i.e., true somatic mutations misidentified as deleterious (inherited) germline variants in such genes as BRCA1, BRCA2, and ATM. In 10 genes associated with germline risk for familial disease (the inherited risk gene panel), true somatic mutations in germline genes were discovered in 10 lung cancer patients (11 variants) and 101 total patients (118 variants) when using the tumor-only sequencing approach.
Sequencing and analysis of data from the patient's normal germline genome and tumor genome eliminates false positive results associated with analysis of tumor genome sequence data alone. The potential for tumor somatic SNVs to fruitfully inform patient treatment depends on expression of the DNA variants as messenger RNA, and then translation into protein. RNA sequencing of the tumor provides valuable information about relative expression levels of cancer driver genes, and the gene expression of specific tumor somatic variants. RNA expression analysis in this study showed that 18% of true somatic mutations identified from tumor/normal sequencing of lung cancer patients, as well as 15% for all cancer patients, were not expressed at the level of messenger RNA. In the study population, these results could impact clinical decision making for 9% of lung cancer patients, and 13% of all cancer patients. The results presented herein provide further evidence of the advantages associated with heightened precision of molecular analysis for drug targeting derived from tumor/normal DNA sequencing plus RNA sequencing.
In view of the above, it should therefore be appreciated that simultaneous sequencing and bioinformatics analysis of the DNA of both the normal germline genome and the tumor genome is necessary for accurate identification of molecular targets for cancer therapy. Analysis of only the tumor genome results in a high false positive rate in SNV identification. Even higher precision is achieved with simultaneous tumor-normal DNA and RNA sequencing analysis. Treatment decisions based on tumor-only DNA analysis or in the absence of RNA analysis might result in administration of ineffective therapies while also increasing risk of negative drug-related side effects. When used to guide clinical decision-making, the approach of tumor-only gene-panel analysis may increase risk to patients, cause potential long-term negative health consequences, and increase healthcare costs.
In this example, the inventors included 204 cancer patients with 11 gastrointestinal (GI) cancer types with whole genome sequencing of both tumor and normal genomes. True positive (true somatic variants) and false positive (true germline variants estimated to be somatic variants) rates were measured for missense and nonsense single nucleotide variants (SNVs) in a 45-gene panel as shown below. The 45-gene panel included 26 known somatic driver genes, 14 inherited cancer risk genes, and 5 genes that can act both as somatic tumor drivers and inherited risk genes. RNA sequencing was available for 139 of the 204 patients. Sequence alignment and SNV variant calling were performed using well-established and published bioinformatics methods. In preferred methods, BAMBAM was used to synchronously and incrementally align and identify SNVs using DNA and RNA sequences.
Results: 92% of SNVs identified from sequencing tumor genomes alone were of germline origin and potential false positives rather than true somatic variants (Somatic=true somatic variants; Germline=true germline variants). See
It should therefore be appreciated that sequencing the tumor genome identified all of the SNVs of inherited germline origin and tumor somatic origin, with the large majority being of germline origin. While population allele frequencies and other parameters could be used to filter SNV data and estimate somatic versus germline origin, such filtering was not accurate enough for clinical use. Further, it should be appreciated that simultaneous sequencing and bioinformatics analysis of DNA of both the normal germline genome and tumor genome is necessary for accurate identification of molecular targets. Analysis of the tumor genome alone results in false-positive results. Higher precision is achieved with simultaneous tumor-normal DNA and tumor RNA sequencing analysis. Treatment decisions based on tumor-only DNA analysis or in the absence of RNA analysis might result in administration of ineffective therapies while also increasing risk of negative drug-related side effects.
In this example, the inventors aimed to compare the accuracy and precision of tumor somatic calling with a commonly used 50-gene hotspot panel when analyzing the tumor tissue alone versus analyzing tumor DNA simultaneously with normal germline DNA and tumor RNA. Specifically, in this example, tumor samples and matched normal samples from 1879 cancer patients with 42 cancer types were obtained and whole genome sequencing data or whole exome sequencing data of those tissues were generated. The demographic overview of the cohort is shown in Table 4 below, and the numbers of analytes sequenced by different cancer types are shown in
From the genomic sequencing data of the tumor tissue, the inventors determined that all patients have at least one germline single nucleotide variant (30955 single nucleotide variants total). Then, the inventors quantified the number of all single nucleotide variants (including those of germline origin and those of tumor somatic origin) identified from comparing the genomic sequencing data of the tumor and matched normal. 1127 out of 1879 patients (65%) had at least 1 somatic single nucleotide variant (308721 total). 741 out of 1135 (65%) of patients whose analytes were analyzed for paired DNA/RNA had at least 1 somatic single nucleotide variant (198844 total), resulting in 1775 unique single nucleotide variants amongst patients of paired DNA/RNA analysis. As shown in
The inventors further filtered the identified single nucleotide variants from sequencing tumor genome alone using population allele frequencies and other parameters (e.g., known germline variants, gnomAD) to determine the ratio of single nucleotide variants (germline origin versus tumor somatic origin). As shown in
Further, the inventors found that not all single nucleotide variants of tumor somatic origin are expressed in RNA, indicating that further filtering using RNA expression analysis is necessary to obtain the true somatic single nucleotide variants among all identified single nucleotide variants. As shown in
As used in the description herein and throughout the claims that follow, the meaning of “a,” “an,” and “the” includes plural reference unless the context clearly dictates otherwise. Also, as used in the description herein, the meaning of “in” includes “in” and “on” unless the context clearly dictates otherwise. Unless the context dictates the contrary, all ranges set forth herein should be interpreted as being inclusive of their endpoints, and open-ended ranges should be interpreted to include commercially practical values. Similarly, all lists of values should be considered as inclusive of intermediate values unless the context indicates the contrary.
Moreover, all methods described herein can be performed in any suitable order unless otherwise indicated herein or otherwise clearly contradicted by context. The use of any and all examples, or exemplary language (e.g. “such as”) provided with respect to certain embodiments herein is intended merely to better illuminate the invention and does not pose a limitation on the scope of the invention otherwise claimed. No language in the specification should be construed as indicating any non-claimed element essential to the practice of the invention.
Groupings of alternative elements or embodiments of the invention disclosed herein are not to be construed as limitations. Each group member can be referred to and claimed individually or in any combination with other members of the group or other elements found herein. One or more members of a group can be included in, or deleted from, a group for reasons of convenience and/or patentability. When any such inclusion or deletion occurs, the specification is herein deemed to contain the group as modified thus fulfilling the written description of all Markush groups used in the appended claims.
It should be apparent to those skilled in the art that many more modifications besides those already described are possible without departing from the inventive concepts herein. The inventive subject matter, therefore, is not to be restricted except in the scope of the appended claims. Moreover, in interpreting both the specification and the claims, all terms should be interpreted in the broadest possible manner consistent with the context. In particular, the terms “comprises” and “comprising” should be interpreted as referring to elements, components, or steps in a non-exclusive manner, indicating that the referenced elements, components, or steps may be present, or utilized, or combined with other elements, components, or steps that are not expressly referenced. Where the specification or claims refer to at least one of something selected from the group consisting of A, B, C . . . and N, the text should be interpreted as requiring only one element from the group, not A plus N, or B plus N, etc.
This application claims priority to our U.S. Provisional Patent Application Ser. No. 62/570,580, which was filed Oct. 10, 2017, and US provisional application with the Ser. No. 62/618,893, which was filed Jan. 18, 2018, both of which are incorporated herein by reference in their entireties.
| Filing Document | Filing Date | Country | Kind |
|---|---|---|---|
| PCT/US2018/055025 | 10/9/2018 | WO | 00 |
| Number | Date | Country | |
|---|---|---|---|
| 62618893 | Jan 2018 | US | |
| 62570580 | Oct 2017 | US | |