Analysis of cancer-derived cell-free DNA (cfDNA) has the potential to revolutionize detection and monitoring of cancer. Noninvasive access to malignant DNA is particularly attractive for solid tumors, which cannot be repeatedly sampled without invasive procedures. In non-small cell lung cancer (NSCLC), PCR-based assays have been used previously to detect recurrent point mutations in genes such as KRAS or EGFR in plasma DNA (Taniguchi et al. (2011) Clin. Cancer Res. 17:7808-7815; Gautschi et al. (2007) Cancer Lett. 254:265-273; Kuang et al. (2009) Clin. Cancer Res. 15:2630-2636; Rosell et al. (2009) N. Engl. J. Med. 361:958-967), but the majority of patients lack mutations in these genes. Other studies have proposed identifying patient-specific chromosomal rearrangements in tumors via whole genome sequencing (WGS), followed by breakpoint qPCR from cfDNA (Leary et al. (2010) Sci. Transl. Med. 2:20ra14; McBride et al. (2010) Genes Chrom. Cancer 49:1062-1069). While sensitive, such methods require optimization of molecular assays for each patient, limiting their widespread clinical application. More recently, several groups have reported amplicon-based deep sequencing methods to detect cfDNA mutations in up to 6 recurrently mutated genes (Forshew et al. (2012) Sci. Transl. Med. 4:136ra168; Narayan et al. (2012) Cancer Res. 72:3492-3498; Kinde et al. (2011) Proc. Natl Acad. Sci. USA 108:9530-9535). While powerful, these approaches are limited by the number of mutations that can be interrogated (Rachlin et al. (2005) BMC Genomics 6:102) and the inability to detect genomic fusions.
PCT International Patent Publication No. 2011/103236 describes methods for identifying personalized tumor markers in a cancer patient using “mate-paired” libraries. The methods are limited to monitoring somatic chromosomal rearrangements, however, and must be personalized for each patient, thus limiting their applicability and increasing their cost.
U.S. Patent Application Publication No. 2010/0041048 A1 describes the quantitation of tumor-specific cell-free DNA in colorectal cancer patients using the “BEAMing” technique (Beads, Emulsion, Amplification, and Magnetics). While this technique provides high sensitivity and specificity, this method is for single mutations and thus any given assay can only be applied to a subset of patients and/or requires patient-specific optimization. U.S. Patent Application Publication No. 2012/0183967 A1 describes additional methods to identify and quantify genetic variations, including the analysis of minor variants in a DNA population, using the “BEAMing” technique.
U.S. Patent Application Publication No. 2012/0214678 A1 describes methods and compositions for detecting fetal nucleic acids and determining the fraction of cell-free fetal nucleic acid circulating in a maternal sample. While sensitive, these methods analyze polymorphisms occurring between maternal and fetal nucleic acids rather than polymorphisms that result from somatic mutations in tumor cells. In addition, methods that detect fetal nucleic acids in maternal circulation require much less sensitivity than methods that detect tumor nucleic acids in cancer patient circulation, because fetal nucleic acids are much more abundant than tumor nucleic acids.
U.S. Patent Application Publication Nos. 2012/0237928 A1 and 2013/0034546 describe methods for determining copy number variations of a sequence of interest in a test sample comprising a mixture of nucleic acids. While potentially applicable to the analysis of cancer, these methods are directed to measuring major structural changes in nucleic acids, such as translocations, deletions, and amplifications, rather than single nucleotide variations.
U.S. Patent Application Publication No. 2012/0264121 A1 describes methods for estimating a genomic fraction, for example, a fetal fraction, from polymorphisms such as small base variations or insertions-deletions. These methods do not, however, make use of optimized libraries of polymorphisms, such as, for example, libraries containing recurrently-mutated genomic regions.
U.S. Patent Application Publication No. 2013/0024127 A1 describes computer-implemented methods for calculating a percent contribution of cell-free nucleic acids from a major source and a minor source in a mixed sample. The methods do not, however, provide any advantages in identifying or making use of optimized libraries of polymorphisms in the analysis.
PCT International Publication No. WO 2010/141955 A2 describes methods of detecting cancer by analyzing panels of genes from a patient-obtained sample and determining the mutational status of the genes in the panel. The methods rely on a relatively small number of known cancer genes, however, and they do not provide any ranking of the genes according to effectiveness in detection of relevant mutations. In addition, the methods were unable to detect the presence of mutations in the majority of serum samples from actual cancer patients.
There is thus a need for new and improved methods to detect and monitor tumor-related nucleic acids in cancer patients.
The present invention addresses these and other problems by providing novel methods and systems relating to the characterization, diagnosis, and monitoring of cancer. In particular, according to one aspect, the invention provides methods for creating a library of recurrently mutated genomic regions comprising:
identifying a plurality of genomic regions from a group of genomic regions that are recurrently mutated in a specific cancer;
wherein the library comprises the plurality of genomic regions;
the plurality of genomic regions comprises at least 10 different genomic regions; and
at least one mutation within the plurality of genomic regions is present in at least 60% of all subjects with the specific cancer.
In specific embodiments of these methods, the plurality of genomic regions comprises at least 25, at least 50, at least 100, at least 150, at least 200, or at least 500 different genomic regions.
In other specific method embodiments, at least two mutations within the plurality of genomic regions or at least three mutations within the plurality of genomic regions are present in at least 60% of all subjects with the specific cancer.
In still other specific method embodiments, at least one mutation within the plurality of genomic regions is present in at least 60%, 70%, 80%, 90%, 95%, 98%, 99%, or 99.9% of all subjects with the specific cancer.
In some embodiments, the identifying step comprises, for each genomic region in the plurality of genomic regions, ranking the genomic region to maximize the number of all subjects with the specific cancer having at least one mutation within the genomic region.
In other embodiments, the identifying step comprises, for each genomic region in the plurality of genomic regions, ranking the genomic region to maximize the ratio between the number of all subjects with the specific cancer having at least one mutation within the genomic region and the length of the genomic region.
In some embodiments, the library comprises a plurality of genomic regions encoding a plurality of driver sequences, more specifically known driver sequences or driver sequences that are recurrently mutated in the specific cancer.
In some embodiments, the library comprises a plurality of genomic regions that are recurrently rearranged in the specific cancer.
In preferred embodiments, the specific cancer is a carcinoma, and in more preferred embodiments, the carcinoma is an adenocarcinoma, a non-small cell lung cancer, or a squamous cell carcinoma.
In specific embodiments, the cumulative length of the plurality of genomic regions is at most 30 Mb, 20 Mb, 10 Mb, 5 Mb, 2 Mb, 1 Mb, 500 kb, 200 kb, 100 kb, 50 kb, 20 kb, or 10 kb.
In another aspect, the invention provides methods for analyzing a cancer-specific genetic alteration in a subject comprising the steps of:
obtaining a tumor nucleic acid sample and a genomic nucleic acid sample from a subject with a specific cancer;
sequencing a plurality of target regions in the tumor nucleic acid sample and in the genomic nucleic acid sample to obtain a plurality of tumor nucleic acid sequences and a plurality of genomic nucleic acid sequences; and
comparing the plurality of tumor nucleic acid sequences to the plurality of genomic nucleic acid sequences to identify a patient-specific genetic alteration in the tumor nucleic acid sample;
wherein the plurality of target regions are selected from a plurality of genomic regions that are recurrently mutated in the specific cancer;
the plurality of genomic regions comprises at least 10 different genomic regions; and
at least one mutation within the plurality of genomic regions is present in at least 60% of all subjects with the specific cancer.
In specific embodiments of this aspect of the invention, the plurality of genomic regions comprises at least 25, at least 50, at least 100, at least 150, at least 200, or at least 500 different genomic regions.
In other specific embodiments, at least two mutations within the plurality of genomic regions or at least three mutations within the plurality of genomic regions are present in at least 60% of all subjects with the specific cancer.
In still other specific embodiments, at least one mutation within the plurality of genomic regions is present in at least 60%, 70%, 80%, 90%, 95%, 98%, 99%, or 99.9% of all subjects with the specific cancer.
In some embodiments, each genomic region in the plurality of genomic regions is identified by ranking the genomic region to maximize the number of all subjects with the specific cancer having at least one mutation within the genomic region.
In other embodiments, each genomic region in the plurality of genomic regions is identified by ranking the genomic region to maximize the ratio between the number of all subjects with the specific cancer having at least one mutation within the genomic region and the length of the genomic region.
In some embodiments, the plurality of genomic regions comprises genomic regions encoding a plurality of driver sequences, more specifically known driver sequences or driver sequences that are recurrently mutated in the specific cancer.
In some embodiments, the plurality of genomic regions comprises genomic regions that are recurrently rearranged in the specific cancer.
In preferred embodiments, the specific cancer is a carcinoma, and in more preferred embodiments, the carcinoma is an adenocarcinoma, a non-small cell lung cancer, or a squamous cell carcinoma.
In specific embodiments, the cumulative length of the plurality of genomic regions is at most 30 Mb, 20 Mb, 10 Mb, 5 Mb, 2 Mb, 1 Mb, 500 kb, 200 kb, 100 kb, 50 kb, 20 kb, or 10 kb.
In some embodiments, the methods further comprise the steps of:
obtaining a cell-free nucleic acid sample from the subject; and
identifying the patient-specific genetic alteration in the cell-free nucleic acid sample.
In specific embodiments, the step of identifying the patient-specific genetic alteration in the cell-free nucleic acid sample comprises sequencing a genomic region comprising the patient-specific genetic alteration in the cell-free sample.
In other specific embodiments, the step of obtaining a tumor nucleic acid sample and a genomic nucleic acid sample comprises the step of enriching the plurality of target regions in the tumor nucleic acid sample and the genomic nucleic acid sample, and in more specific embodiments, the enriching step comprises use of a custom library of biotinylated DNA.
In still other specific embodiments, the step of obtaining a cell-free nucleic acid sample comprises the step of enriching the plurality of target regions in the cell-free nucleic acid sample, and in still more specific embodiments, the enriching step comprises use of a custom library of biotinylated DNA.
In some embodiments, the methods further comprise the step of quantifying the cancer-specific genetic alteration in the cell-free sample.
In yet another aspect, the invention provides methods for screening a cancer-specific genetic alteration in a subject comprising the steps of:
obtaining a cell-free nucleic acid sample from a subject;
sequencing a plurality of target regions in the cell-free sample to obtain a plurality of cell-free nucleic acid sequences; and
identifying a cancer-specific genetic alteration in the cell-free sample;
wherein the plurality of target regions are selected from a plurality of genomic regions that are recurrently mutated in the specific cancer;
the plurality of genomic regions comprises at least 10 different genomic regions; and
at least one mutation within the plurality of genomic regions is present in at least 60% of all subjects with the specific cancer.
In specific embodiments, the plurality of genomic regions comprises at least 25, at least 50, at least 100, at least 150, at least 200, or at least 500 different genomic regions.
In other specific embodiments, at least two mutations within the plurality of genomic regions or at least three mutations within the plurality of genomic regions are present in at least 60% of all subjects with the specific cancer.
In still other specific embodiments, at least one mutation within the plurality of genomic regions is present in at least 60%, 70%, 80%, 90%, 95%, 98%, 99%, or 99.9% of all subjects with the specific cancer.
In particular embodiments, each genomic region in the plurality of genomic regions is identified by ranking the genomic region to maximize the number of all subjects with the specific cancer having at least one mutation within the genomic region.
In other particular embodiments, each genomic region in the plurality of genomic regions is identified by ranking the genomic region to maximize the ratio between the number of all subjects with the specific cancer having at least one mutation within the genomic region and the length of the genomic region.
In still other particular embodiments, the plurality of genomic regions comprises genomic regions encoding a plurality of driver sequences, and, more particularly, the driver sequences are known driver sequences or are recurrently mutated in the specific cancer.
In yet still other particular embodiments, the plurality of genomic regions comprises genomic regions that are recurrently rearranged in the specific cancer.
In some embodiments, the specific cancer is a carcinoma, including, for example, an adenocarcinoma, a non-small cell lung cancer, or a squamous cell carcinoma.
In specific embodiments, the cumulative length of the plurality of genomic regions is at most 30 Mb, 20 Mb, 10 Mb, 5 Mb, 2 Mb, 1 Mb, 500 kb, 200 kb, 100 kb, 50 kb, 20 kb, or 10 kb.
In other specific embodiments, the step of obtaining a cell-free nucleic acid sample comprises the step of enriching the plurality of target regions in the cell-free nucleic acid sample, and, in some embodiments, the enriching step comprises use of a custom library of biotinylated DNA.
Tumors continually shed DNA into the circulation, where it is readily accessible (Stroun et al. (1987) Eur. J. Cancer Clin. Oncol. 23:707-712). Provided herein are methods for the ultrasensitive detection of circulating tumor DNA called CAncer Personalized Profiling by Deep Sequencing (CAPP-Seq). Also provided are methods for creating libraries of recurrently mutated genomic regions used in the CAPP-Seq methods. CAPP-Seq targets hundreds of recurrently mutated genomic regions and simultaneously detects point mutations, insertions/deletions, and rearrangements. CAPP-Seq for non-small cell lung cancer has been demonstrated herein with a design that identified mutations in >95% of tumors. CAPP-Seq accurately quantified circulating tumor DNA from early and advanced stage tumors and identified mutant alleles down to 0.025% with a detection limit of <0.01%. Tumor-derived DNA levels paralleled clinical responses to diverse therapies and CAPP-Seq identified actionable mutations in plasma. Moreover, CAPP-Seq identified significant co-occurrence of ROS1 translocations with U2AF1 splicing factor mutations. Finally, the utility of CAPP-Seq for cancer screening is also described. CAPP-Seq can be routinely applied to noninvasively detect and monitor tumors, thus facilitating personalized cancer therapy.
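By way of illustration only, the quantification of circulating tumor DNA can be expressed as a mutant allele fraction, i.e., the number of sequencing reads supporting a tumor-derived allele divided by the total number of reads covering that position. The following Python sketch is a minimal example under that assumption; the function names and the read counts are hypothetical and are not taken from the disclosed experiments, and the pooling of several alterations into a single fraction is one simple choice among many.

```python
def allele_fraction(mutant_reads, total_reads):
    """Fraction of reads at one position that carry the mutant allele."""
    if total_reads <= 0:
        raise ValueError("total_reads must be positive")
    if not 0 <= mutant_reads <= total_reads:
        raise ValueError("mutant_reads must lie between 0 and total_reads")
    return mutant_reads / total_reads


def pooled_allele_fraction(sites):
    """Pool several alterations: total mutant reads over total reads.

    `sites` is an iterable of (mutant_reads, total_reads) pairs, one pair
    per patient-specific alteration (hypothetical input format).
    """
    sites = list(sites)  # allow generators; the data are traversed twice
    mutant = sum(m for m, _ in sites)
    total = sum(t for _, t in sites)
    return allele_fraction(mutant, total)


# Hypothetical read counts for two alterations in one plasma sample.
example_sites = [(5, 10_000), (3, 10_000)]
example_fraction = pooled_allele_fraction(example_sites)
print(f"pooled mutant allele fraction: {example_fraction:.4%}")
```

Pooling the reads from several alterations, rather than averaging the per-site fractions, weights each site by its sequencing depth; either convention could be substituted without changing the structure of the sketch.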
According to one aspect of the invention, methods for creating a library of recurrently mutated genomic regions are provided. The methods comprise the step of identifying a plurality of genomic regions from a group of genomic regions that are recurrently mutated in a specific cancer, wherein the library comprises the plurality of genomic regions, the plurality of genomic regions comprises at least 10 different genomic regions, and at least one mutation within the plurality of genomic regions is present in at least 60% of all subjects with the specific cancer.
It should be understood that the term “library” represents a compilation or collection of individual components. Thus, a library of recurrently mutated genomic regions is a compilation or collection of recurrently mutated genomic regions. The libraries of the instant disclosure are useful because they include a large number of potentially mutated genomic regions within a minimal length of genomic sequence. Use of these libraries to identify genetic alterations in specific patient samples is particularly advantageous because the libraries do not need to be optimized on a patient-by-patient basis.
The libraries created according to the instant methods comprise genomic regions that are recurrently mutated in a specific cancer. The identification of these recurrent mutations benefits greatly from the availability of databases such as, for example, The Cancer Genome Atlas (TCGA) and its subsets (http://cancergenome.nih.gov/). Such databases serve as the starting point for identifying the recurrently mutated genomic regions of the instant libraries. The databases also provide a sample of mutations occurring within a given percentage of subjects with a specific cancer.
The libraries created according to the instant methods comprise a plurality of genomic regions, wherein the plurality of genomic regions comprises at least 10 different genomic regions. In some embodiments, the plurality of genomic regions comprises at least 25, at least 50, at least 100, at least 150, at least 200, at least 500, or even more different genomic regions.
It should be understood that the inclusion of larger numbers of genomic regions generally increases the likelihood that a unique mutation will be identified to distinguish tumor nucleic acid in a subject from the subject's genomic nucleic acid. Including too many genomic regions in the library is not without a cost, however, since the number of genomic regions is directly related to the length of nucleic acids that must be sequenced in the analysis. At the extreme, the entire genome of a tumor sample and a genomic sample could be sequenced, and the resulting sequences could be compared to note any differences. Such a brute force approach is not possible, however, with the vanishingly small quantities of tumor nucleic acid present in a cell-free sample.
The libraries of the instant disclosure address this problem by identifying genomic regions that are recurrently mutated in a particular cancer, and then ranking those regions to maximize the likelihood that the region will include a distinguishing genetic alteration in a particular tumor. The library of recurrently mutated genomic regions, or “selectors”, can be used across an entire population for a given cancer, and does not need to be optimized for each subject.
The term “mutation”, as used herein, refers to a genetic alteration in the genome of an organism, specifically to a change in the nucleotide sequence of the organism. Examples of mutations include point mutations, where a single nucleotide is changed in the genome, and larger-scale changes in the genome, such as rearrangements, insertions, deletions, and amplifications. A recurrent mutation is a mutation that has been identified in more than one individual.
The terms “patient” and “subject” are used interchangeably. These are typically individuals that suffer from the cancer of interest. While the individuals are typically human individuals, the methods and systems of the instant disclosure could also be applied to other species, in particular, to other animal species, for example, livestock animals and pets.
The libraries of recurrently mutated genomic regions disclosed herein are created for a given type of cancer using one or more of the following design phases:
Phase 1: Identify known “driver” genes, i.e., genes that are known to be mutated frequently in the particular cancer.
Phase 2: Maximize patient coverage by selecting genomic regions that contain recurrent mutations in multiple subjects with the particular cancer and ranking those selections to maximize the number of patients identified by mutations in those regions.
Phases 3 and 4: Further ranking of genomic regions containing recurrent mutations by maximizing the “recurrence index”.
Phase 5: Add genomic regions from genes predicted to harbor “driver” mutations in the particular cancer.
Phase 6: Add genomic regions covering fusions and their flanking regions.
It should be understood, however, that the above-described phases of selector design are independent of one another and may be applied separately or in a different order within the methods of library creation and still achieve the desired result.
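For purposes of illustration, the ranking of genomic regions to maximize patient coverage per unit length can be sketched as a greedy selection. The Python example below is a simplified sketch and not the specific procedure used to generate Table 1: it assumes that the “recurrence index” of a region is the number of not-yet-covered subjects having at least one mutation in the region, divided by the region length in kilobases, and the region names, subject identifiers, and length budget are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Region:
    name: str            # hypothetical label, e.g., an exon of a gene
    length_bp: int       # region length in base pairs (must be positive)
    patients: frozenset  # subjects with at least one mutation in the region


def recurrence_index(region, already_covered=frozenset()):
    """Newly covered subjects per kilobase of region (assumed definition)."""
    new_subjects = region.patients - already_covered
    return len(new_subjects) / (region.length_bp / 1000.0)


def greedy_select(regions, max_total_bp):
    """Repeatedly add the region covering the most new subjects per kb."""
    selected, covered, total_bp = [], set(), 0
    remaining = list(regions)
    while remaining:
        best = max(remaining, key=lambda r: recurrence_index(r, covered))
        if recurrence_index(best, covered) == 0:
            break  # no remaining region covers any additional subject
        remaining.remove(best)
        if total_bp + best.length_bp > max_total_bp:
            continue  # region does not fit in the length budget; skip it
        selected.append(best)
        covered |= best.patients
        total_bp += best.length_bp
    return selected, covered, total_bp


# Hypothetical input: three candidate regions and five subjects.
example_regions = [
    Region("A", 1000, frozenset({"p1", "p2", "p3"})),
    Region("B", 500, frozenset({"p3", "p4"})),
    Region("C", 2000, frozenset({"p1", "p5"})),
]
chosen, covered_subjects, chosen_bp = greedy_select(example_regions, 2000)
print([r.name for r in chosen], sorted(covered_subjects), chosen_bp)
```

In this hypothetical input, region B is chosen first because it covers the most subjects per kilobase, region A is chosen next for the two additional subjects it covers, and region C is skipped because adding it would exceed the length budget.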
Application of the above approaches for recurrently mutated genomic regions in non-small cell lung cancer results in the library shown in Table 1. All genomic regions included in the selector, along with their corresponding HUGO gene symbols and genomic coordinates, as well as patient statistics for NSCLC and a variety of other cancers, are shown, organized by selector design phase. The percentage of coverage of NSCLC patients as the Table 1 library was developed is shown in
Accordingly, the libraries of recurrently mutated genomic regions created using the instant methods comprise a plurality of genomic regions that are recurrently mutated in a specific cancer, and the plurality of genomic regions comprises at least 10 different genomic regions. In some embodiments, the plurality of genomic regions comprises at least 25 different genomic regions. In some embodiments, the plurality of genomic regions comprises at least 50 different genomic regions. In some embodiments, the plurality of genomic regions comprises at least 100 different genomic regions. In some embodiments, the plurality of genomic regions comprises at least 150 different genomic regions. In some embodiments, the plurality of genomic regions comprises at least 200 different genomic regions. In some embodiments, the plurality of genomic regions comprises at least 500 different genomic regions or even more.
In some embodiments, the plurality of genomic regions comprises at most 5000 different genomic regions. In some embodiments, the plurality of genomic regions comprises at most 2000 different genomic regions. In some embodiments, the plurality of genomic regions comprises at most 1000 different genomic regions. In some embodiments, the plurality of genomic regions comprises at most 500 different genomic regions. In some embodiments, the plurality of genomic regions comprises at most 200 different genomic regions. In some embodiments, the plurality of genomic regions comprises at most 150 different genomic regions. In some embodiments, the plurality of genomic regions comprises at most 100 different genomic regions. In some embodiments, the plurality of genomic regions comprises at most 50 different genomic regions or even fewer.
Importantly, the libraries of recurrently mutated genomic regions created according to the instant methods enable the identification of patient- and tumor-specific mutations within the genomic regions in a high percentage of subjects. Specifically, in these libraries, at least one mutation within the plurality of genomic regions is present in at least 60% of all subjects with the specific cancer. In some embodiments, at least two mutations within the plurality of genomic regions are present in at least 60% of all subjects with the specific cancer. In specific embodiments, at least three mutations, or even more, within the plurality of genomic regions are present in at least 60% of all subjects with the specific cancer.
In some embodiments, in the libraries of recurrently mutated genomic regions created according to these methods, at least one mutation within the plurality of genomic regions is present in at least 60%, 70%, 80%, 90%, 95%, 98%, 99%, 99.9% or even higher percentages of all subjects with the specific cancer.
In specific embodiments, at least two mutations within the plurality of genomic regions are present in at least 60%, 70%, 80%, 90%, 95%, 98%, 99%, 99.9% or even higher percentages of all subjects with the specific cancer.
In more specific embodiments, at least three mutations, or even more, within the plurality of genomic regions are present in at least 60%, 70%, 80%, 90%, 95%, 98%, 99%, 99.9% or even higher percentages of all subjects with the specific cancer.
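As a non-limiting illustration, the percentage of subjects having at least one, at least two, or at least three mutations within a library can be computed from a list of observed mutations. The Python sketch below assumes a hypothetical input format in which each observed mutation is recorded as a (subject, region) pair; the subject identifiers and region names are invented for the example.

```python
from collections import Counter


def fraction_with_min_mutations(mutations, all_subjects, min_mutations=1):
    """Fraction of subjects with at least `min_mutations` in the library.

    `mutations` is an iterable of (subject_id, region_name) pairs, one pair
    per mutation observed within the library (hypothetical input format).
    """
    all_subjects = list(all_subjects)
    if not all_subjects:
        raise ValueError("all_subjects must not be empty")
    counts = Counter(subject for subject, _region in mutations)
    hits = sum(1 for s in all_subjects if counts[s] >= min_mutations)
    return hits / len(all_subjects)


# Hypothetical cohort of five subjects and six observed mutations.
example_subjects = ["p1", "p2", "p3", "p4", "p5"]
example_mutations = [
    ("p1", "A"), ("p1", "B"),
    ("p2", "A"),
    ("p3", "C"), ("p3", "C"), ("p3", "A"),
]
for k in (1, 2, 3):
    frac = fraction_with_min_mutations(example_mutations, example_subjects, k)
    print(f"subjects with >= {k} mutation(s): {frac:.0%}")
```

A library would satisfy the 60% threshold for a given number of mutations when the corresponding fraction is at least 0.6 in the cohort used for design.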
As previously noted, the cumulative length of genomic regions in the libraries of recurrently mutated genomic regions created according to the instant methods is relatively short, thus minimizing sequencing costs associated with the analytical methods relying on these libraries and maximizing their sensitivity. In some embodiments, the cumulative length of genomic regions is at most 30 megabases (Mb). In some embodiments, the cumulative length of genomic regions is at most 20 Mb, 10 Mb, 5 Mb, 2 Mb, or 1 Mb. In some embodiments, the cumulative length of genomic regions is at most 500 kilobases (kb), 200 kb, 100 kb, 50 kb, 20 kb, 10 kb, or even less.
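By way of example, the cumulative length of a library can be computed from the genomic coordinates of its regions, counting overlapping regions only once. The following Python sketch assumes that each region is given as a (chromosome, start, end) triple with a half-open coordinate convention; that convention and the example coordinates are assumptions made for illustration.

```python
def cumulative_length(intervals):
    """Total base pairs covered by the intervals, merging any overlaps.

    Each interval is (chromosome, start, end) with start inclusive and end
    exclusive (an assumed convention for this sketch).
    """
    total = 0
    current = None  # the merged interval currently being extended
    for chrom, start, end in sorted(intervals):
        if current and current[0] == chrom and start <= current[2]:
            # Overlaps or abuts the current interval: extend it.
            current = (chrom, current[1], max(current[2], end))
        else:
            if current:
                total += current[2] - current[1]
            current = (chrom, start, end)
    if current:
        total += current[2] - current[1]
    return total


# Hypothetical coordinates: two overlapping regions and one separate region.
example_intervals = [
    ("chr7", 100, 200),
    ("chr7", 150, 300),
    ("chr12", 0, 50),
]
example_total_bp = cumulative_length(example_intervals)
print(f"cumulative length: {example_total_bp} bp")
```

The result can then be compared against any of the length limits recited above, for example by checking that the total does not exceed 500,000 base pairs for a 500 kb library.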
In some embodiments, the library of recurrently mutated genomic regions created according to the instant methods comprises the genomic regions displayed in Table 1, or a subset of those genomic regions.
The instant methods include the step of identifying a plurality of genomic regions from a group of genomic regions that are recurrently mutated in a specific cancer. As noted elsewhere, the libraries are particularly useful in methods for analyzing cancer-specific gene alterations in solid tumors, because those alterations can be detected in cell-free nucleic acids present in blood samples. Accordingly, the libraries created according to these methods include genomic regions that are recurrently mutated in a solid tumor. In some embodiments, the solid tumor is a carcinoma. In specific embodiments, the carcinoma is an adenocarcinoma, a non-small cell lung cancer, or a squamous cell carcinoma. The methods are also applicable to genomic regions that are recurrently mutated in other cancers, however. Specifically, the other cancer may be, for example, a sarcoma, a leukemia, a lymphoma, or a myeloma.
The methods for creating a library of recurrently mutated genomic regions, as disclosed herein, are typically implemented by a programmed computer system. Therefore, according to another aspect, the instant disclosure provides computer systems for creating a library of recurrently mutated genomic regions. Such systems comprise at least one processor and a non-transitory computer-readable medium storing computer-executable instructions that, when executed by the at least one processor, cause the computer system to carry out the above-described methods for creating a library.
The libraries created according to the above-described methods are useful in the analysis of genetic alterations, particularly in comparing tumor and genomic sequences in a patient with cancer. As shown in
Accordingly, in this aspect of the invention, methods are provided for analyzing a cancer-specific genetic alteration in a subject comprising the steps of:
obtaining a tumor nucleic acid sample and a genomic nucleic acid sample from a subject with a specific cancer;
sequencing a plurality of target regions in the tumor nucleic acid sample and in the genomic nucleic acid sample to obtain a plurality of tumor nucleic acid sequences and a plurality of genomic nucleic acid sequences; and
comparing the plurality of tumor nucleic acid sequences to the plurality of genomic nucleic acid sequences to identify a patient-specific genetic alteration in the tumor nucleic acid sample.
In these methods, the plurality of target regions are selected from a plurality of genomic regions that are recurrently mutated in the specific cancer; the plurality of genomic regions comprises at least 10 different genomic regions; and at least one mutation within the plurality of genomic regions is present in at least 60% of all subjects with the specific cancer. More specifically, the plurality of target regions may correspond to the plurality of genomic regions found in the libraries of recurrently mutated genomic regions created using the above-described methods. In other words, in various embodiments, the number of different genomic regions in the plurality of genomic regions, the number of mutations within the plurality of genomic regions that are present in a specific percentage of all subjects with the specific cancer, the percentage of all subjects with the specific cancer with at least one mutation within the plurality of genomic regions, the specific composition of the plurality of genomic regions, the types of cancer, and the cumulative length of the plurality of genomic regions have the values disclosed above for the methods of creating a library.
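For illustration only, the comparing step can be sketched as a set difference between the sequence variants observed in the tumor nucleic acid sample and those observed in the genomic nucleic acid sample, restricted to the target regions. The Python example below assumes that variants have already been called from the sequencing data and are represented by chromosome, position, reference allele, and alternate allele; the data structure, coordinate convention, and example variants are hypothetical.

```python
from typing import NamedTuple


class Variant(NamedTuple):
    chrom: str
    pos: int   # 0-based position (assumed convention)
    ref: str   # reference allele
    alt: str   # alternate allele


def in_targets(variant, targets):
    """True if the variant lies in any (chrom, start, end) target region."""
    return any(
        chrom == variant.chrom and start <= variant.pos < end
        for chrom, start, end in targets
    )


def patient_specific_alterations(tumor_variants, genomic_variants, targets):
    """Variants seen in tumor but not in genomic DNA, within the targets."""
    genomic = set(genomic_variants)
    return sorted(
        v for v in set(tumor_variants)
        if v not in genomic and in_targets(v, targets)
    )


# Hypothetical target region and variant calls for one subject.
example_targets = [("chr7", 100, 200)]
example_tumor = [
    Variant("chr7", 150, "G", "T"),   # tumor only, inside the target
    Variant("chr7", 160, "A", "C"),   # also present in genomic DNA
    Variant("chr1", 10, "C", "T"),    # tumor only, outside the targets
]
example_genomic = [Variant("chr7", 160, "A", "C")]
example_somatic = patient_specific_alterations(
    example_tumor, example_genomic, example_targets
)
print(example_somatic)
```

In this hypothetical input, only the first variant is reported, because the second is shared with the genomic sample and the third falls outside the target regions.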
In some embodiments, the plurality of target regions used in the methods for analyzing a cancer-specific genetic alteration in a subject corresponds to the library of recurrently mutated genomic regions displayed in Table 1, or a subset of those genomic regions.
It should be understood that the step of obtaining a tumor nucleic acid sample and a genomic nucleic acid sample from a subject with a specific cancer may occur in a single step or in separate steps. For example, it may be possible to obtain a single tissue sample from a patient, for example from a biopsy sample, that includes both tumor nucleic acids and genomic nucleic acids. It is also within the scope of this step to obtain the tumor nucleic acid sample and the genomic nucleic acid sample from the subject in separate samples, in separate tissues, or even at separate times.
The step of obtaining a tumor nucleic acid sample and a genomic nucleic acid sample from a subject with a specific cancer may also include the process of extracting a biological fluid or tissue sample from the subject with the specific cancer. These particular steps are well understood by those of ordinary skill in the medical arts, particularly by those working in the medical laboratory arts.
The step of obtaining a tumor nucleic acid sample and a genomic nucleic acid sample from a subject with a specific cancer may additionally include procedures to improve the yield or recovery of the nucleic acids in the sample. For example, the step may include laboratory procedures to separate the nucleic acids from other cellular components and contaminants that may be present in the biological fluid or tissue sample. As noted, such steps may improve the yield and/or may facilitate the sequencing reactions.
It should also be understood that the step of obtaining a tumor nucleic acid sample and a genomic nucleic acid sample from a subject with a specific cancer may be performed by a commercial laboratory that does not even have direct contact with the subject. For example, the commercial laboratory may obtain the nucleic acid samples from a hospital or other clinical facility where, for example, a biopsy or other procedure is performed to obtain tissue from a subject. The commercial laboratory may thus carry out all the steps of the instantly-disclosed methods at the request of, or under the instructions of, the facility where the subject is being treated or diagnosed.
The methods of the instant invention may also be applied to the detection of cancer in a patient, where there is no prior knowledge of the presence of a tumor in the patient. Accordingly, in this aspect of the invention are provided methods for screening a cancer-specific genetic alteration in a subject comprising the steps of:
obtaining a cell-free nucleic acid sample from a subject;
sequencing a plurality of target regions in the cell-free sample to obtain a plurality of cell-free nucleic acid sequences; and
identifying a cancer-specific genetic alteration in the cell-free sample.
In these methods, the plurality of target regions are selected from a plurality of genomic regions that are recurrently mutated in the specific cancer. In some embodiments, the plurality of genomic regions comprises at least 10 different genomic regions, and at least one mutation within the plurality of genomic regions is present in at least 60% of all subjects with the specific cancer. More specifically, the plurality of target regions may correspond to the plurality of genomic regions found in the libraries of recurrently mutated genomic regions created using the above-described methods. In other words, in various embodiments, the number of different genomic regions in the plurality of genomic regions, the number of mutations within the plurality of genomic regions that are present in a specific percentage of all subjects with the specific cancer, the percentage of all subjects with the specific cancer with at least one mutation within the plurality of genomic regions, the specific composition of the plurality of genomic regions, the types of cancer, and the cumulative length of the plurality of genomic regions have the values disclosed above for the methods of creating a library.
In some embodiments, the plurality of target regions used in the methods for screening a cancer-specific genetic alteration in a subject corresponds to the library of recurrently mutated genomic regions displayed in Table 1, or a subset of those genomic regions.
It will be readily apparent to one of ordinary skill in the relevant arts that other suitable modifications and adaptations to the methods and applications described herein may be made without departing from the scope of the invention or any embodiment thereof. Having now described the present invention in detail, the same will be more clearly understood by reference to the following Examples, which are included herewith for purposes of illustration only and are not intended to be limiting of the invention.
To overcome the limitations of prior methods, an ultrasensitive and specific strategy for analysis of cancer-derived cfDNA (CAncer Personalized Profiling by Deep Sequencing (CAPP-Seq)) that can simultaneously detect single nucleotide variants (SNVs), insertions/deletions (indels), and rearrangements, without the need for patient-specific optimization has been developed. CAPP-Seq employs an adaptable “selector” to enrich recurrently mutated regions in the cancer of interest using a custom library of biotinylated DNA oligonucleotides (Ng et al. (2010) Nat. Genetics 42:30-35). To use CAPP-Seq for monitoring circulating tumor DNA, this selector is typically applied first to matched tumor and normal genomic DNA to identify a patient's cancer-specific genetic aberrations and then directly to cfDNA in order to quantify these mutations (
The design of an NSCLC CAPP-Seq selector is shown in
For the initial implementation of CAPP-Seq we focused on NSCLC, although our approach is generalizable to any cancer for which a comprehensive list of recurrent mutations has been identified. We employed a multi-phase approach to design a NSCLC-specific selector, aiming to identify genomic regions recurrently mutated in this disease (
Approximately 8% of NSCLCs contain clinically actionable rearrangements involving the receptor tyrosine kinases, ALK, ROS1 and RET (Bergethon et al. (2012) J. Clin. Oncol. 30:863-870; Kwak et al. (2010) N. Engl. J. Med. 363:1693-1703; Pao & Hutchinson (2012) Nat. Med. 18:349-351). To utilize the personalized nature and low false detection rate of structural rearrangements (Leary et al. (2010) Sci. Transl. Med. 2:20ra14; McBride et al. (2010) Genes Chrom. Cancer 49:1062-1069), introns and exons spanning recurrent fusion breakpoints in these genes were included in the final design phase (
Collectively, the NSCLC CAPP-Seq selector design targets 521 exons and 13 introns from 139 recurrently mutated genes, in total covering ˜125 kb (
Using this CAPP-Seq selector, we profiled a total of 52 samples including NSCLC cell lines, primary tumor specimens, peripheral blood leukocytes (PBLs), and cfDNA isolated from plasma of patients with NSCLC before and after various cancer therapies (Table 2). To assess and optimize the performance of CAPP-Seq, we first applied it to cfDNA purified from healthy control plasma. Approximately 60% of reads mapped within the selector target region (Table 2). Sequenced cfDNA fragments had a median length of 169 bp (
The detection limit of CAPP-Seq is affected by the absolute number of available cfDNA molecules in a given volume of peripheral blood, as well as PCR and sequencing errors (i.e. “technical” background). The latter primarily affects substitutions/SNVs as opposed to other CAPP-Seq reporters (i.e., indels (Minoche et al. (2011) Genome Biol. 12:R112) and rearrangements). Separately, mutant cfDNA could be present in the absence of cancer due to contributions from pre-neoplastic cells from diverse tissues (i.e., “biological” background). The combined background from these sources was measured by assessing the error rate at each nucleotide position across the selector in plasma cfDNA from 6 patients and a healthy individual, excluding tumor-derived mutations. Mean and median background rates of ˜0.007% and ˜0% (not detected, N.D.) were found, respectively (
Next, the allele frequency detection limit and linearity of CAPP-Seq were benchmarked by spiking defined concentrations of fragmented genomic DNA from a NSCLC cell line into cfDNA from a healthy individual (
Having designed, optimized, and benchmarked CAPP-Seq, it was applied to the discovery of somatic mutations in tumors collected from a diverse group of NSCLC patients (n=11;
To explore the potential clinical utility of CAPP-Seq for disease monitoring and minimal residual disease detection, we next applied CAPP-Seq to serial plasma samples collected from a subset of these same 11 patients (N=6), all of whom had pre- and post-treatment samples available (
G  A  chr3  89457148
T  G  chr4  66242868
A  C  chr5  176522747
T  C  chr17  7577551
A  G  chr17  7576275
In addition to its potential clinical utility, CAPP-Seq analysis promises to yield novel biological insights. For example, in one patient's tumor (P9), we identified both a classic EML4-ALK fusion and two previously unreported fusions involving ROS1: FYN-ROS1 and ROS1-MKX (
Finally, we explored whether CAPP-Seq analysis of cfDNA could potentially be used for cancer screening. As proof-of-principle, we blinded ourselves to the mutations present in each patient's tumor and developed a statistical method to test for the presence of cancer DNA in each pre-treatment plasma sample in our cohort (
In conclusion, we have developed a flexible method for ultrasensitive and specific assessment of circulating tumor DNA. CAPP-Seq overcomes limitations of previously proposed methods for cfDNA analysis by simultaneously measuring multiple types of mutations without patient-specific optimization and by covering mutations in the majority of patients. Moreover, due to multiplexing, CAPP-Seq is highly economical, and per sample costs for plasma cfDNA are expected to drop further as NGS costs continue to fall. Our method has the potential to accelerate the personalized detection, therapy, and monitoring of cancer patients. We anticipate that CAPP-Seq will prove valuable in a variety of clinical settings, including the assessment of cancer DNA in alternative biological fluids and specimens with low cancer cell content.
Between April 2010 and June 2012, patients undergoing treatment for newly diagnosed or recurrent NSCLC were enrolled in a study approved by the Stanford University Institutional Review Board. Enrolled patients had not received blood transfusions within 3 months of blood collection. Patient characteristics are in Table 3.
Peripheral blood from consented patients was collected in EDTA Vacutainer tubes (BD). Blood samples were processed within 3 hours of collection. Plasma was separated by centrifugation at 2,500×g for 10 min, transferred to microcentrifuge tubes, and centrifuged at 16,000×g for 10 min to remove cell debris. The cell pellet from the initial spin was used for isolation of germline genomic DNA from PBLs (peripheral blood leukocytes) with the DNeasy Blood & Tissue Kit (Qiagen). Matched tumor DNA was isolated from FFPE specimens or from the cell pellet of pleural effusions. Genomic DNA was quantified by Quant-iT PicoGreen dsDNA Assay Kit (Invitrogen).
Cell-free DNA (cfDNA) was isolated from 1-5 mL plasma with the QIAamp Circulating Nucleic Acid Kit (Qiagen). Absolute quantification of purified cfDNA was performed by quantitative PCR (qPCR) using an 81 bp amplicon on chromosome 1 (Fan et al. (2008) Proc. Natl Acad. Sci. USA 105:16266-16271) and a dilution series of intact male human genomic DNA (Promega) as a standard curve. Power SYBR Green was used for qPCR on a HT7900 Real Time PCR machine (Applied Biosystems). Standard PCR thermal cycling parameters were used.
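The standard-curve quantification described above can be illustrated with a minimal Python sketch. It assumes the usual log-linear relationship between template quantity and threshold cycle (Ct); the function names and the dilution-series values are hypothetical and are not taken from the study.

```python
import math

def fit_standard_curve(quantities_ng, ct_values):
    """Least-squares fit of Ct = slope * log10(quantity) + intercept."""
    xs = [math.log10(q) for q in quantities_ng]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ct_values) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ct_values))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

def quantity_from_ct(ct, slope, intercept):
    """Invert the standard curve to estimate input quantity (ng) from a Ct."""
    return 10 ** ((ct - intercept) / slope)

# Hypothetical dilution series of genomic DNA (illustrative values only).
standards_ng = [10.0, 1.0, 0.1, 0.01]
standard_cts = [21.7, 25.0, 28.3, 31.6]

slope, intercept = fit_standard_curve(standards_ng, standard_cts)
estimated_ng = quantity_from_ct(26.65, slope, intercept)
```

An unknown cfDNA sample is then quantified by reading its Ct against the fitted line; samples whose Ct falls outside the range of the standards would require extrapolation and should be treated with caution.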
Indexed Illumina NGS libraries were prepared from cfDNA and shorn tumor, germline, and cell line genomic DNA. For patient cfDNA, 7-32 ng DNA was used for library construction without additional shearing or fragmentation. For tumor, germline, and cell line genomic DNA, 69-1000 ng DNA was sheared prior to library construction with a Covaris S2 instrument using the recommended settings for 200 bp fragments. See Table 2 for details.
The NGS libraries were constructed using the KAPA Library Preparation Kit (Kapa Biosystems) employing a DNA Polymerase possessing strong 3′-5′ exonuclease (or proofreading) activity and displaying the lowest published error rate (i.e. highest fidelity) of all commercially available B-family DNA polymerases (Quail et al. (2012) Nat. Methods 9:10-11; Oyola et al. (2012) BMC Genomics 13:1). The manufacturer's protocol was modified to incorporate with-bead enzymatic and cleanup steps (Fisher et al. (2011) Genome Biol. 12:R1). Briefly, following the end repair reaction, Agencourt AMPure XP beads (Beckman-Coulter) were added to bind and wash the DNA fragments. The DNA was then eluted directly into 50 μL 1× A-tailing buffer containing the A-tailing enzyme. Following the A-tailing reaction, the DNA fragments were forced to bind to the same AMPure XP beads by adding 90 μL (1.8×) of PEG buffer (20% PEG-8000 in 2.5M NaCl). After washing, the DNA was eluted into 50 μL 1× ligation buffer with ligase and 100-fold molar excess of indexed Illumina TruSeq adapters. Ligation was performed for 16 hours at 16° C. Single-step size selection was performed by adding 40 μL (0.8×) of PEG buffer to enrich for ligated DNA fragments. The ligated fragments were then amplified using 500 nM Illumina backbone oligonucleotides and a variable number of PCR cycles (between 4 and 9) depending on input DNA mass. In order to minimize bias and maximize recovery of GC-rich templates, all PCR reactions were carried out in a BioRad DNA Engine Thermal Cycler with a ramp rate of 2.2° C./sec or an Eppendorf Vapo Protect Mastercycler with the Safe ramp rate setting.
Library purity and concentration were assessed by spectrophotometer (NanoDrop 2000) and qPCR (KAPA Biosystems), respectively. Fragment length was determined on a 2100 Bioanalyzer using the DNA 1000 Kit (Agilent).
Custom hybrid selection was performed with the SeqCap EZ Choice Library, v2.0 (Roche NimbleGen). The custom SeqCap library was designed through the NimbleDesign portal (v1.2.R1) using genome build HG19 NCBI Build 37.1/GRCh37 and with Maximum Close Matches set to 1. Input genomic regions were selected according to the most frequently mutated genes and exons in NSCLC. These regions were identified from the COSMIC database, TCGA, and other published sources as described in the Detailed Materials. Final selector coordinates are provided in Table 1.
NimbleGen SeqCap EZ Choice was used according to the manufacturer's protocol with modifications. Between 9 and 12 indexed Illumina libraries were included in a single capture reaction. Prior to hybrid selection, the libraries were quantified with a NanoDrop 2000 spectrophotometer, and 83-111 ng of each library was added (1 μg total DNA per capture reaction). Following hybrid selection, the captured DNA fragments were amplified with 12-to-14 cycles of PCR using 1× KAPA HiFi Hot Start Ready Mix and 2 μM Illumina backbone oligonucleotides in 4-to-6 separate 50 μL reactions. The reactions were then pooled and processed with the QIAquick PCR Purification Kit (Qiagen). Multiplexed libraries were sequenced using 2×100 bp paired-end runs on an Illumina HiSeq 2000.
Paired-end reads were mapped to the hg19 reference genome with BWA 0.6.2 (default parameters) (Li & Durbin (2009) Bioinformatics 25:1754-1760), and sorted/indexed with SAMtools (Li et al. (2009) Bioinformatics 25:2078-2079). QC was assessed using a custom Perl script to collect a variety of statistics, including mapping characteristics, read quality, and selector on-target rate (i.e., number of unique reads that intersect the selector space divided by all aligned reads), generated respectively by SAMtools flagstat, FastQC (http://www.bioinformatics.babraham.ac.uk/projects/fastqc/), and BEDTools coverageBed (Quinlan & Hall (2010) Bioinformatics 26:841-842). Importantly, we used a custom version of coverageBed modified to count each read at most once. Plots of fragment length distribution and sequence depth/coverage were automatically generated for visual QC assessment. To mitigate the impact of sequencing errors, analyses not involving fusions were restricted to properly paired reads, and high-quality bases with a Phred quality score of at least 30 (≦0.1% probability of a sequencing error) were further analyzed.
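Two of the quantities used in this quality-control step can be sketched in Python: the conversion from a Phred score to an error probability, and the selector on-target rate with each read counted at most once. This is an illustrative sketch only; the interval representation (half-open coordinates) and the function names are assumptions and do not reproduce the custom Perl script or the modified coverageBed.

```python
def phred_to_error_probability(q):
    """Phred score Q corresponds to an error probability of 10^(-Q/10)."""
    return 10 ** (-q / 10)

def on_target_rate(reads, selector):
    """Fraction of aligned reads that intersect the selector.

    reads and selector are lists of (chrom, start, end) tuples with
    half-open coordinates (an assumption of this sketch). A read that
    overlaps several selector intervals is still counted only once.
    """
    if not reads:
        return 0.0
    on_target = 0
    for chrom, start, end in reads:
        if any(c == chrom and start < e and s < end for c, s, e in selector):
            on_target += 1
    return on_target / len(reads)

def high_quality_bases(sequence, qualities, min_q=30):
    """Keep (index, base) pairs whose Phred score is at least min_q."""
    return [(i, b) for i, (b, q) in enumerate(zip(sequence, qualities))
            if q >= min_q]

# Hypothetical toy data.
selector = [("chr1", 100, 200), ("chr1", 150, 300)]
reads = [("chr1", 120, 220), ("chr1", 400, 500), ("chr2", 120, 220)]
rate = on_target_rate(reads, selector)
```

With a threshold of Q30, `phred_to_error_probability(30)` evaluates to 0.001, matching the 0.1% error probability stated above.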
Two dilution series were performed to assess the linearity and accuracy of CAPP-Seq for quantitating tumor-derived cfDNA. In one experiment, shorn genomic DNA from a NSCLC cell line (HCC78) was spiked into cfDNA from a healthy individual, while in a second experiment, shorn genomic DNA from one NSCLC cell line (NCI-H3122) was spiked into shorn genomic DNA from a second NSCLC line (HCC78). A total of 32 ng DNA was used for library construction. Following mapping and quality control, homozygous reporters were identified as alleles unique to each sample with at least 20× sequencing depth at an allelic fraction >80%. Fourteen such reporters were identified between HCC78 genomic DNA and plasma cfDNA (
Details of bioinformatics methods are supplied in the Detailed Methods, and a graphical schematic is provided in
The NSCLC selector was validated in silico using an independent cohort of lung adenocarcinomas (Imielinski et al. (2012) Cell 150:1107-1120) (
We used Monte Carlo sampling to estimate the distribution of background alleles across the NSCLC selector (
To assess the significance of tumor burden estimates in plasma cfDNA, we compared patient-specific SNV frequencies against the null distribution of background SNVs across the selector. Briefly, patient-specific background was quantified using the method described for
Additional details on cell lines, tumor cell sorting, optimizations of library preparation, mutation/translocation validation, CAPP-Seq design and analytical pipelines including FACTERA translocation detection tool, and additional statistical methods are presented in the Detailed Methods.
The lung adenocarcinoma cell lines NCI-H3122 and HCC78 were obtained from ATCC and DSMZ, respectively, and grown in RPMI 1640 with L-glutamine (Gibco) supplemented with 10% fetal bovine serum (Gembio) and 1% penicillin/streptomycin cocktail. Cells were maintained in mid-log-phase growth in a 37° C. incubator with 5% CO2. Genomic DNA was purified from freshly harvested cells with the DNeasy Blood & Tissue Kit (Qiagen).
Cells from pleural fluid from patients P9 and P6 were harvested by centrifugation at 300×g for 5 min at 4° C. and washed in FACS staining buffer (HBSS+2% heat-inactivated calf serum [HICS]). Red blood cells were lysed with ACK Lysing Buffer (Invitrogen), and clumps were removed by passing through a 100 μm nylon filter. Filtered cells were spun down and resuspended in staining buffer. While on ice, the cell suspension was blocked for 20 min with 10 μg/mL rat IgG and then stained for 20 min with APC-conjugated mouse anti-human EpCAM (BioLegend, clone 9C4), PerCP-Cy5.5-conjugated mouse anti-human CD45 (eBioscience, clone 2D1), and PerCP-eFluor710-conjugated mouse anti-human CD31 (eBioscience, clone WM59). After staining, cells were washed and resuspended with staining buffer containing 1 μg/mL DAPI, analyzed, and sorted with a FACSAria II cell sorter (BD Biosciences). Cell doublets and DAPI-positive cells were excluded from analysis and sorting. CD31−CD45−EpCAM+ cells were sorted into staining buffer, spun down, and flash frozen in liquid nitrogen. DNA was isolated with the QIAamp DNA Micro Kit (Qiagen).
A3. Optimization of NGS Library Preparation from Low Input cfDNA
Any method for detecting mutant cfDNA relies on its ability to interrogate each cfDNA molecule in the circulation in order to maximize sensitivity. For this reason, we used the QIAamp Circulating Nucleic Acid kit (Qiagen) with carrier RNA as per the manufacturer's protocol to isolate cfDNA. We also took specific steps to improve the Illumina library preparation workflow.
Protocols for Illumina library construction were compared in a step-wise manner with the goal of (1) optimizing adapter ligation efficiency, (2) reducing the necessary number of PCR cycles following adapter ligation, (3) preserving the naturally occurring size distribution of cfDNA fragments, and (4) minimizing variability in depth of sequencing coverage across all captured genomic regions. Initial optimization was done with NEBNext DNA Library Prep Reagent Set for Illumina (New England BioLabs), which includes reagents for end-repair of the cfDNA fragments, A-tailing, adapter ligation, and amplification of ligated fragments with Phusion High-Fidelity PCR Master Mix. Input was 4 ng cfDNA (obtained from plasma of the same healthy volunteer) for all conditions. Relative allelic abundance in the constructed libraries was assessed by qPCR of 4 genomic loci (Roche NimbleGen: NSC-0237, NSC-0247, NSC-0268, and NSC-0272) and compared by the 2^(−ΔCt) method.
Ligations were performed at 20° C. for 15 min (as per the manufacturer's protocol), at 16° C. for 16 hours, or with temperature cycling for 16 hours as previously described (Lund et al. (1996) Nucl. Acids Res. 24:800-801). Ligation volumes were varied from the standard (50 μL) down to 10 μL while maintaining a constant concentration of DNA ligase, cfDNA fragments, and Illumina adapters. Subsequent optimizations incorporated ligation at 16° C. for 16 hours in 50 μL reaction volumes.
Next, we compared standard SPRI bead processing procedures, in which new AMPure XP beads are added after each enzymatic reaction and DNA is eluted from the beads for the next reaction, to with-bead protocol modifications as previously described (Fisher et al. (2011) Genome Biol. 12:R1). We compared 2 concentrations of Illumina adapters in the ligation reaction: 12 nM (10-fold molar excess to cfDNA fragments) and 120 nM (100-fold molar excess).
Using the optimized library preparation procedures, we next compared the NEBNext DNA Library Prep Reagent Set (with Phusion DNA Polymerase) to the KAPA Library Preparation Kit (with KAPA HiFi DNA Polymerase). The KAPA Library Preparation Kit with our modifications was also compared to the NuGEN SP Ovation Ultralow Library System with automation on Mondrian SP Workstation.
We performed CAPP-Seq on 32 ng cfDNA using standard library preparation procedures with the NEBNext kit, or with optimized procedures using either the NEBNext kit or the KAPA Library Preparation Kit. In parallel we performed CAPP-Seq on 4 ng and 128 ng cfDNA using the KAPA kit with our optimized procedures. Indexed libraries were constructed, and hybrid selection was performed in multiplex. The post-capture multiplexed libraries were amplified with Illumina backbone primers for 14 cycles of PCR and then sequenced on a paired-end 100 bp lane of an Illumina HiSeq 2000.
We also evaluated CAPP-Seq on ultralow input following whole genome amplification (WGA). For WGA we chose not to use multiple displacement amplification with Φ29 DNA polymerase given the small size of cfDNA fragments in plasma (
All structural rearrangements and a subset of tumoral SNVs detected by CAPP-Seq were independently confirmed by qPCR and/or Sanger sequencing of amplified fragments. For HCC78, a 120 bp fragment containing the SLC34A2-ROS1 breakpoint was amplified from genomic DNA using the primers: 5′-AGACGGGAGAAAATAGCACC-3′ and 5′-ACCAAGGGTTGCAGAAATCC-3′. A 141 bp fragment containing exon 2 of U2AF1 was amplified using the primers: 5′-CATGTGTTTGATATCTTCCCAGC-3′ and 5′-CTGGCTAAACGTCGGTTTATTG-3′. For NCI-H3122, a 143 bp fragment containing the EML4-ALK breakpoint was amplified using the primers: 5′-GAGATGGAGTTTCACTCTTGTTGC-3′ and 5′-GAACCTTTCCATCATACTTAGAAATAC-3′. 5 ng of genomic DNA was used as template with 250 nM oligos and 1× Phusion PCR Master Mix (NEB) in 50 μL reactions. Products were resolved on a 2.5% agarose gel and bands of the expected size were excised. The amplified DNA fragments were purified using the QIAquick Gel Extraction Kit (Qiagen) and submitted for Sanger sequencing (Elim Biopharm). For P9, genomic DNA breakpoints were confirmed by qPCR using the primers: 5′-TCCATGGAAGCCAGAAC-3′ and 5′-ATGCTAAGATGTGTCTGTCA-3′ for EML4-ALK; 5′-CCTTAACACAGATGGCTCTTGATGC-3′ and 5′-TCCTCTTTCCACCTTGGCTTTCC-3′ for ROS1-MKX; and 5′-GGTTCAGAACTACCAATAACAAG-3′ and 5′-ACCTGATGTGTGACCTGATTGATG-3′ for FYN-ROS1. For qPCR, 10 ng of pre-amplified genomic DNA was used as template with 250 nM oligos and 1× Power SYBR Green Master Mix in 10 μL reactions performed in triplicate on a HT7900 Real Time PCR machine (Applied Biosystems). Standard PCR thermal cycling parameters were used. Amplification of amplicons spanning all 3 breakpoints detected in P9 was confirmed in tumor genomic DNA as well as plasma cfDNA, and PBL genomic DNA was used as a negative control.
Separately, at least 88% of SNVs and indels detected were bona fide somatic mutations in tumors, as 38 of 46 of them were independently observed above 0.025% allele frequency in plasma cfDNA and/or were independently confirmed by SNaPshot clinical assays.
The CAPP-Seq background rate was estimated by Monte Carlo sampling of allelic frequencies across the NSCLC selector (
We likewise applied Monte Carlo simulation to estimate the probability of finding a background allele in plasma cfDNA at a given fractional abundance (
We included only cases in which both ROS1 fusion status and U2AF1 S34 mutation status were known. There were 163 such cases from TCGA (genotyped for U2AF1 by whole exome sequencing and for ROS1 fusions by RNA-Seq as detailed below), 23 cases from Imielinski et al. (2012) Cell 150:1107-1120, 17 cases from Govindan et al. (2012) Cell 150:1121-1134, and 13 cases from the present study (11 patients and 2 NSCLC cell lines). U2AF1 S34F mutations were detected in 11 cases (5 from TCGA, 3 from Imielinski et al., 1 from Govindan et al., and 2 from the present study), and ROS1 fusions were detected in 6 cases (2 from TCGA, described below, and 4 from the present study). Significance testing was performed using Fisher's exact test, and a two-tailed P-value is reported.
B2.2. Analysis of Whole Transcriptome Sequencing Data from TCGA for ROS1 Fusions
We identified two TCGA lung adenocarcinoma patients, TCGA-05-4426 and TCGA-64-1680, harboring candidate ROS1 fusions (
Most human cancers are relatively heterogeneous for somatic mutations in individual genes. Specifically, in most human tumors, recurrent somatic alterations of single genes account for a minority of patients, and only a minority of tumor types can be defined using a small number of recurrent mutations (<5-10) at predefined positions. Therefore, the design of the selector is vital to the CAPP-Seq method because (1) it dictates which mutations can be detected with high probability for a patient with a given cancer, and (2) the selector size (in kb) directly impacts the cost and depth of sequence coverage. For example, the hybrid selection libraries available in current whole exome capture kits range from 51-71 Mb, providing ˜40-60 fold maximum theoretical enrichment versus whole genome sequencing. The degree of potential enrichment is inversely proportional to the selector size such that for a ˜100 kb selector, >10,000 fold enrichment should be achievable.
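The enrichment figures above follow from dividing the genome size by the size of the captured target. The short Python sketch below reproduces that arithmetic; the genome size of 3.1 Gb is an assumed round value for the human reference, and the function name is illustrative.

```python
# Assumption: approximate size of the human reference genome in base pairs.
GENOME_SIZE_BP = 3.1e9

def max_theoretical_enrichment(target_size_bp, genome_size_bp=GENOME_SIZE_BP):
    """Upper bound on fold enrichment versus whole genome sequencing,
    assuming every sequenced read falls within the target."""
    return genome_size_bp / target_size_bp

exome_small = max_theoretical_enrichment(51e6)    # 51 Mb exome kit
exome_large = max_theoretical_enrichment(71e6)    # 71 Mb exome kit
selector_100kb = max_theoretical_enrichment(100e3)
selector_125kb = max_theoretical_enrichment(125e3)
```

Under this assumption the exome kits give roughly 44- to 61-fold enrichment, while a 100 kb selector gives about 31,000-fold and a 125 kb selector about 25,000-fold, consistent with the >10,000 fold figure. Actual enrichment is lower because not all reads map on target.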
We employed a six-phase design strategy to identify and prioritize genomic regions for the CAPP-Seq NSCLC selector as detailed below. Three phases were used to incorporate known and suspected NSCLC driver genes, as well as genomic regions known to participate in clinically actionable fusions (phases 1, 5, 6), while another three phases employed an algorithmic approach to maximize both the number of patients covered and SNVs per patient (phases 2-4). The latter relied upon a metric that we termed “Recurrence Index” (RI), defined as the number of NSCLC patients with SNVs that occur within a given kilobase of exonic sequence (i.e., No. of patients with mutations/exon length in kb). RI thus serves to measure patient-level recurrence frequency at the exon level, while simultaneously normalizing for gene/exon size. As a source of somatic mutation data uniformly genotyped across a large cohort of patients, in phases 2-4, we analyzed non-silent SNVs identified in TCGA whole exome sequencing data from 178 patients in the Lung Squamous Cell Carcinoma dataset (SCC) (Hammerman et al. (2012) Nature 489:519-525) and from 229 patients in the Lung Adenocarcinoma (LUAD) datasets (TCGA query date was Mar. 13, 2012). Thresholds for each metric (i.e. RI and patients per exon) were selected to statistically enrich for known/suspected drivers in SCC and LUAD data (
The following algorithm was used to design the CAPP-Seq selector (parenthetical descriptions match design phases noted in
Phase 1 (Known Drivers)
Initial seed genes were chosen based on their frequency of mutation in NSCLCs.
Analysis of COSMIC (v57) (Forbes et al. (2010) Nucl. Acids Res. 38:D652-657) identified known driver genes that are recurrently mutated in ≧9% of NSCLC (denominator ≧500 cases). Specific exons from these genes were selected based on the pattern of SNVs previously identified in NSCLC. The seed list also included single exons from genes with recurrent mutations that occurred at low frequency but had strong evidence for being driver mutations, such as BRAF exon 15, which harbors V600E mutations in <2% of NSCLC (Ding et al. (2008) Nature 455:1069-1075; Youn & Simon (2011) Bioinformatics 27:175-181; Okuda et al. (2008) Cancer Sci. 99:2280-2285; Su et al. (2011) J. Mol. Diagn. 13:74-84; Tsao et al. (2007) J. Clin. Oncol. 25:5240-5247; Chaft et al. (2012) Mol. Cancer Ther. 11:485-491; Paik et al. (2011) J. Clin. Oncol. 29:2046-2051; Stephens et al. (2004) Nature 431:525-526; Jin et al. (2010) Lung Cancer 69:279-283; Malanga et al. (2008) Cell Cycle 7:665-669).
Phase 2 (Max. Coverage)
For each exon with SNVs covering ≧5 patients in LUAD and SCC, we selected the exon with highest RI that identified at least 1 new patient when compared to the prior phase. Among exons with equally high RI, we added the exon with minimum overlap among patients already captured by the selector. This was repeated until no further exons met these criteria.
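The Recurrence Index and the greedy selection in this phase can be sketched in Python as follows. This is a simplified illustration under stated assumptions: each exon is represented by its length and the set of patients with SNVs in it, the final tie-break on exon name is added only to make the sketch deterministic, and the toy data are hypothetical.

```python
def recurrence_index(n_patients, exon_length_bp):
    """Number of patients with SNVs per kilobase of exonic sequence."""
    return n_patients * 1000.0 / exon_length_bp

def phase2_select(exons, covered, min_patients=5):
    """Greedy phase 2 sketch.

    exons: dict mapping exon name -> (length_bp, set of patient ids).
    covered: patients already captured by the prior phase.
    Repeatedly adds the highest-RI exon that captures at least one new
    patient; ties go to the exon with the least overlap with patients
    already covered.
    """
    covered = set(covered)
    remaining = {name: value for name, value in exons.items()
                 if len(value[1]) >= min_patients}
    chosen = []
    while True:
        candidates = [name for name, (_, patients) in remaining.items()
                      if patients - covered]
        if not candidates:
            break
        best = min(
            candidates,
            key=lambda name: (
                -recurrence_index(len(remaining[name][1]), remaining[name][0]),
                len(remaining[name][1] & covered),
                name,  # assumption: deterministic tie-break, not in the source
            ),
        )
        chosen.append(best)
        covered |= remaining.pop(best)[1]
    return chosen, covered

# Hypothetical toy input.
toy_exons = {
    "A": (200, {1, 2, 3, 4, 5}),
    "B": (100, {4, 5, 6, 7, 8}),
    "C": (100, {1, 2, 3, 9, 10}),
    "D": (50, {1, 2}),
    "E": (500, {6, 7, 8, 9, 10}),
}
chosen, covered = phase2_select(toy_exons, covered={1})
```

In this toy example exons B and C share the highest RI; B is added first because it overlaps no patient already covered, C follows, and the loop stops once no remaining exon contributes a new patient.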
Phase 3 (RI≧30)
For each remaining exon with an RI≧30 and with SNVs covering ≧3 patients in LUAD and SCC, we identified the exon that would result in the largest reduction in patients with only 1 SNV. To break ties among equally best exons, the exon with highest RI was chosen. This was repeated until no additional exons satisfied these criteria.
Phase 4 (RI≧20)
Same procedure as phase 3, but using RI≧20.
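Phases 3 and 4 can be sketched with the same data model. The Python below is an interpretation rather than the exact procedure: it assumes each exon contributes one SNV per listed patient, measures the "reduction" as the number of patients currently holding exactly one SNV who would gain another, and stops when no eligible exon yields a reduction. All names and toy values are hypothetical.

```python
def recurrence_index(n_patients, exon_length_bp):
    """Number of patients with SNVs per kilobase of exonic sequence."""
    return n_patients * 1000.0 / exon_length_bp

def phase3_select(exons, snv_counts, min_ri=30.0, min_patients=3):
    """Greedy sketch for phase 3 (min_ri=30) or phase 4 (min_ri=20).

    exons: dict mapping exon name -> (length_bp, set of patient ids).
    snv_counts: dict mapping patient id -> SNVs already covered.
    """
    counts = dict(snv_counts)
    remaining = {
        name: (length, patients)
        for name, (length, patients) in exons.items()
        if len(patients) >= min_patients
        and recurrence_index(len(patients), length) >= min_ri
    }
    chosen = []

    def reduction(name):
        # Patients with exactly one covered SNV who would gain a second.
        return sum(1 for p in remaining[name][1] if counts.get(p, 0) == 1)

    def ri(name):
        length, patients = remaining[name]
        return recurrence_index(len(patients), length)

    while remaining:
        best = max(remaining, key=lambda name: (reduction(name), ri(name)))
        if reduction(best) == 0:  # assumption: stop when nothing improves
            break
        chosen.append(best)
        for p in remaining.pop(best)[1]:
            counts[p] = counts.get(p, 0) + 1
    return chosen, counts

# Hypothetical toy input.
toy_exons = {"X": (100, {1, 2, 4}), "Y": (50, {1, 2, 3}), "Z": (120, {3, 4, 5})}
start_counts = {1: 1, 2: 1, 3: 1, 4: 2, 5: 1}
phase3_chosen, counts3 = phase3_select(toy_exons, start_counts, min_ri=30.0)
leftover = {k: v for k, v in toy_exons.items() if k not in phase3_chosen}
phase4_chosen, counts4 = phase3_select(leftover, counts3, min_ri=20.0)
```

Running phase 4 is then a second call with the lower threshold on the exons not yet selected, carrying forward the per-patient SNV counts.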
Phase 5 (Predicted Drivers)
We included all exons from additional genes previously predicted to harbor driver mutations in NSCLC (Ding et al. (2008) Nature 455:1069-1075; Youn & Simon (2011) Bioinformatics 27:175-181).
Phase 6 (Add Fusions)
For recurrent rearrangements in NSCLC involving the receptor tyrosine kinases ALK, ROS1, and RET, the introns most frequently implicated in the fusion event and the flanking exons were included.
All exons included in the selector, along with their corresponding HUGO gene symbols and genomic coordinates, as well as patient statistics for NSCLC and a variety of other cancers, are provided in Table 1, organized by selector design phase.
For detection of somatic SNV and insertion/deletion events, we employed VarScan 2 (Koboldt et al. (2012) Genome Res 22:568-576) (somatic p-value=0.01, minimum variant frequency=5%, and otherwise default parameters). Somatic variant calls (SNV or indel) present at less than 0.5% mutant allelic frequency in the paired normal sample (PBLs), but in a position with at least 1000× overall depth in PBLs and 100× depth in the tumor, and with at least 1× read depth on each strand, were retained (Table 3). While the selector was designed to predominantly capture exons, in practice, it also captures limited sequence content flanking each targeted region. For instance, this phenomenon is the basis for the (thus far) uniformly successful recovery by CAPP-Seq of fusion partners (which are not included within the selector) for kinase genes such as ALK and ROS1 recurrently rearranged in NSCLC. As such, we also considered variant calls detected within 500 bps of defined selector coordinates. These calls were eliminated if present in non-coding repeat regions, since repeats may confound mapping accuracy. Repeat sequence coordinates were obtained using the RepeatMasker track in the UCSC table browser (hg19). Variant annotation was performed using the SeattleSeq Annotation 137 web server (http://snp.gs.washington.edu/SeattleSeqAnnotation137/). Complete details for all identified SNVs and indels are provided in Table 2.
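The post-calling filters described above can be expressed as a small Python predicate. This is a sketch under stated assumptions: the field names are hypothetical, the strand requirement is read as at least one read on each strand, selector intervals use inclusive coordinates, and repeat membership is assumed to have been annotated beforehand (for example from the RepeatMasker track). It does not reproduce VarScan 2 itself.

```python
from dataclasses import dataclass

@dataclass
class VariantCall:
    chrom: str
    pos: int
    normal_depth: int       # total depth in paired PBL sample
    normal_alt_reads: int   # variant-supporting reads in PBLs
    tumor_depth: int        # total depth in tumor sample
    fwd_reads: int          # reads on the forward strand
    rev_reads: int          # reads on the reverse strand
    in_noncoding_repeat: bool

def near_selector(call, selector, flank=500):
    """True if the call lies within `flank` bp of any selector interval."""
    return any(
        chrom == call.chrom and start - flank <= call.pos <= end + flank
        for chrom, start, end in selector
    )

def passes_filters(call, selector):
    """Apply the depth, normal-frequency, strand, and location filters."""
    if call.normal_depth < 1000 or call.tumor_depth < 100:
        return False
    if call.normal_alt_reads / call.normal_depth >= 0.005:
        return False
    if call.fwd_reads < 1 or call.rev_reads < 1:
        return False
    if not near_selector(call, selector):
        return False
    return not call.in_noncoding_repeat

# Hypothetical selector interval and calls.
selector = [("chr7", 55241000, 55242000)]
good = VariantCall("chr7", 55241500, 1200, 2, 300, 20, 15, False)
high_normal = VariantCall("chr7", 55241500, 1200, 10, 300, 20, 15, False)
far_away = VariantCall("chr7", 55242600, 1200, 0, 300, 20, 15, False)
```

Here `good` passes every filter, `high_normal` is rejected because the variant is present in about 0.8% of PBL reads, and `far_away` is rejected because it lies more than 500 bp beyond the selector interval.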
By manual inspection, two patients (P2 and P6) had SNVs with frequencies consistent with potential heterozygous and homozygous alleles. We labeled these alleles accordingly (Table 3), and based on our assumption of zygosity in these two patients, we adjusted measured fractions of heterozygous reporters in plasma cfDNA to better estimate tumor burden (Table 4).
For practical and robust de novo enumeration of genomic fusion events and breakpoints from paired-end next-generation sequencing data, we developed a novel heuristic approach, termed FACTERA (FACile Translocation Enumeration and Recovery Algorithm). FACTERA has minimal external dependencies, works directly on a preexisting .bam alignment file, and produces easily interpretable output. Major steps of the algorithm are summarized below, and are complemented by a graphical schematic to illustrate key elements of the breakpoint identification process (
As input, FACTERA requires a .bam alignment file of paired-end reads produced by BWA (Li & Durbin (2009) Bioinformatics 25:1754-1760), exon coordinates in .bed format (e.g., hg19 RefSeq coordinates), and a .2bit reference genome to enable fast sequence retrieval (e.g., hg19). In addition, the analysis can be optionally restricted to reads that overlap particular genomic regions (.bed file), such as the CAPP-Seq selector used in this work.
FACTERA processes the input in three sequential phases: identification of discordant reads, detection of breakpoints at base pair-resolution, and in silico validation of candidate fusions. Each phase is described in detail below.
To iteratively reduce the sequence space for gene fusion identification, FACTERA, like other algorithms (e.g. BreakDancer (Chen et al. (2009) Nat. Methods 6:677-681)), identifies and classifies discordant read pairs. Such reads indicate a nearby fusion event since they either map to different chromosomes or are separated by an unexpectedly large insert size (i.e. total fragment length), as determined by the BWA mapping algorithm. The bitwise flag accompanying each aligned read encodes a variety of mapping characteristics (e.g., improperly paired, unmapped, wrong orientation, etc.) and is leveraged to rapidly filter the input for discordant pairs. The closest exon of each discordant read is subsequently identified, and used to cluster discordant pairs into distinct gene-gene groups, yielding a list of genomic regions R adjacent to candidate fusion sites. For each member gene of a discordant gene pair, the genomic region Ri is defined by taking the minimum of all 3′ exon/read coordinates in the cluster, and the maximum of all 5′ exon/read coordinates in the cluster. These regions are used to prioritize the search for breakpoints in the next phase (
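As an illustration of flag-based filtering, the sketch below classifies a single alignment record as discordant using the bit values defined by the SAM format. It is not the FACTERA implementation. The function name and the `max_insert` cutoff are hypothetical; in the text, the insert-size judgment is made by the BWA mapping algorithm.

```python
# Bit values from the SAM format FLAG field.
PAIRED = 0x1
PROPER_PAIR = 0x2
UNMAPPED = 0x4
MATE_UNMAPPED = 0x8
SECONDARY = 0x100
SUPPLEMENTARY = 0x800


def is_discordant(flag, rname, rnext, tlen, max_insert=1000):
    """Return True if a mapped read pair looks discordant.

    flag, rname, rnext and tlen correspond to the FLAG, RNAME, RNEXT and
    TLEN columns of a SAM record. max_insert is an assumed, illustrative
    cutoff for "unexpectedly large" fragment length.
    """
    if not flag & PAIRED:
        return False
    # Both mates must be mapped; skip secondary/supplementary alignments.
    if flag & (UNMAPPED | MATE_UNMAPPED | SECONDARY | SUPPLEMENTARY):
        return False
    # Mates on different chromosomes ("=" means same reference as RNAME).
    if rnext not in ("=", rname):
        return True
    # Same chromosome but separated by an unexpectedly large distance.
    if abs(tlen) > max_insert:
        return True
    # Otherwise defer to the aligner's proper-pair judgment.
    return not (flag & PROPER_PAIR)
```

Reads passing this test would then be assigned to their closest exons and clustered by gene pair, as described above.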
Discordant read pairs may be introduced by NGS library preparation and/or sequencing artifacts (e.g., jumping PCR). However, they are also likely to flank the breakpoints of bona fide fusion events. As such, all discordant gene pairs identified in the preceding of one read matches the soft-clipped region of the other, FACTERA records a putative fusion event. To assess inter-read concordance (e.g. see reads 1 and 2 in
In some instances, genomic subsequences flanking the true breakpoint may be nearly or completely identical, causing the aligned portions of soft-clipped reads to overlap. Unfortunately, this prevents an unambiguous determination of the breakpoint. As such, FACTERA incorporates a simple algorithm to arbitrarily adjust the breakpoint in one read (i.e., read 2) to match the other (i.e., read 1). Depending upon read orientation, there are two ways this can occur, both of which are illustrated in
To confirm each candidate breakpoint in silico, FACTERA performs a local realignment of reads against a template fusion sequence (±500 bp around the putative breakpoint) extracted from the .2bit reference genome. BLAST is currently employed for this purpose, although BLAT or other fast aligners could be substituted. A BLAST database is constructed by collecting all reads that map to each candidate fusion sequence, including discordant reads and soft-clipped reads, as well as all unmapped reads in the original input .bam file. All reads that map to a given fusion candidate with at least 95% identity and a minimum length of 90% of the input read length (by default) are retained, and reads that span or flank the breakpoint are counted. As a final step, output redundancies are minimized by removing fusion sequences within a 20 bp interval of any fusion sequence with greater read support and with the same sequence orientation (to avoid removing reciprocal fusions).
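The read-retention thresholds and the final redundancy step can be sketched as follows. This is an illustrative sketch under stated assumptions, not FACTERA's code: the `Fusion` record, its field names, and the interpretation of "within a 20 bp interval" as both breakpoint coordinates lying within 20 bp are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Fusion:
    """Hypothetical candidate-fusion record."""
    gene_pair: str
    chrom1: str
    pos1: int
    chrom2: str
    pos2: int
    orientation: str  # e.g., "forward-forward" or "forward-reverse"
    support: int      # number of supporting reads


def read_passes(identity_pct, aligned_len, read_len,
                min_identity=95.0, min_len_frac=0.90):
    """Retain a realigned read using the default thresholds from the text."""
    return identity_pct >= min_identity and aligned_len >= min_len_frac * read_len


def collapse_redundant(fusions, window=20):
    """Drop fusions lying near a better-supported fusion of the same orientation.

    Assumption: two fusions are "within a 20 bp interval" when both of their
    breakpoint coordinates differ by at most `window` bp.
    """
    def shadows(a, b):
        return (a.support > b.support
                and a.orientation == b.orientation
                and a.chrom1 == b.chrom1 and a.chrom2 == b.chrom2
                and abs(a.pos1 - b.pos1) <= window
                and abs(a.pos2 - b.pos2) <= window)

    return [f for f in fusions if not any(shadows(g, f) for g in fusions)]
```

Requiring matching orientation before removal preserves reciprocal fusions, which share nearby coordinates but differ in orientation.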
FACTERA produces a simple output text file, which includes for each fusion sequence, the gene pair, the chromosomal sequence coordinates of the breakpoint, the fusion orientation (e.g., forward-forward or forward-reverse), the genomic sequences within 50 bp of the breakpoint, and depth statistics for reads spanning and flanking the breakpoint. Fusions identified in patients analyzed in this work are provided in Table 3.
To experimentally evaluate the performance of FACTERA, we generated NGS data from two NSCLC cell lines, HCC78 (21.5M×100 bp paired-end reads) and NCI-H3122 (19.4M×100 bp paired-end reads), each of which has a known rearrangement (ROS1 and ALK, respectively) (Bergethon et al. (2012) J. Clin. Oncol. 30:863-870; McDermott et al. (2008) Cancer Res. 68:3389-3395) with a breakpoint that has, to the best of our knowledge, not been previously published. FACTERA readily revealed evidence for a reciprocal SLC34A2-ROS1 translocation in the former and an EML4-ALK fusion in the latter. Precise breakpoints predicted by FACTERA were experimentally validated by PCR amplification and Sanger sequencing (
We implemented a user-directed option to “hunt” for fusions within expected candidate genes. A fusion could be missed by FACTERA if its detection criteria are incompletely satisfied (for example, if discordant reads, but not soft-clipped reads, are identified), which will most likely occur when fusion allele frequency in the tumor is extremely low. As input, the method is supplied with candidate fusion gene sequences as “baits”. All unmapped and soft-clipped reads in the input .bam file are subsequently aligned to these templates (using blastn) to identify reads that have sufficient similarity to both templates (by default, for each read, 95% identity, e-value <1.0e-5, and at least 30% of the read length must map to the template). Such reads are output as a list to the user for manual analysis.
We tested this simple approach on a low purity tumor sample found to harbor an ALK fusion by FISH, but not FACTERA (i.e., case P9). Using templates for ALK and its common fusion partner, EML4, we identified 4 reads that mapped to both, in a region with an overall depth of ~1900×. The estimated allele frequency of 0.21% is strikingly similar to the 0.22% tumor purity measured by FACS (
Using a custom Perl script, previously identified reporter alleles were intersected with a SAMtools mpileup file generated for each plasma cfDNA sample, and the number and frequency of supporting reads was calculated for each reporter allele. Only reporters in properly paired reads at positions with at least 500× overall depth were considered.
For enumeration of fusion frequency in sequenced plasma DNA, FACTERA executes the last step of the discovery phase (i.e., in silico validation of candidate fusions, above) using the set of previously identified fusion templates. The fusion allele frequency is calculated as α/β, where α is the number of breakpoint-spanning reads, and β is the mean overall depth within a genomic region ±5 bps around the breakpoint. Regarding the NSCLC selector described in this work, the latter calculation was always performed on the single gene contained in the NSCLC selector library. If both fusion genes are targeted within a selector library, overall depth is estimated by taking the mean depth calculated for both genes.
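The α/β calculation can be illustrated with a short sketch. The function names and the position-to-depth dictionary are hypothetical conveniences, not part of FACTERA.

```python
from statistics import mean


def mean_depth_near(depth_by_pos, breakpoint, flank=5):
    """Mean overall depth within +/- `flank` bp of a breakpoint.

    depth_by_pos is a hypothetical mapping of genomic position -> depth
    for a single gene.
    """
    window = [depth_by_pos[p]
              for p in range(breakpoint - flank, breakpoint + flank + 1)
              if p in depth_by_pos]
    if not window:
        raise ValueError("no depth data near breakpoint")
    return mean(window)


def fusion_allele_frequency(spanning_reads, gene_depths):
    """Compute alpha / beta.

    spanning_reads is alpha, the number of breakpoint-spanning reads.
    gene_depths holds the mean depth for each fusion gene targeted by the
    selector (one value, or two if both partners are targeted); beta is
    their mean.
    """
    beta = mean(gene_depths)
    if beta <= 0:
        raise ValueError("depth must be positive")
    return spanning_reads / beta
```

For example, 4 breakpoint-spanning reads at a mean depth of 1900× give 4/1900, or about 0.21%.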
Notably, in some cases we observed lower fusion allele frequencies than would be expected for heterozygous alleles (e.g., see cell line fusions in Table 3). This was seen in cell lines, in an empirical spiking experiment, and in one patient's tumor and plasma samples (i.e., P6), and could potentially result from inefficient “pull-down” of fusions whose partners are not represented in the selector. Regardless, fusions are useful reporters—they possess virtually no background signal and show linear behavior over defined concentrations in a spiking experiment (
C5. Screening Plasma cfDNA without Knowledge of Tumor DNA
We devised the following statistical algorithm as a first step toward non-invasive cancer screening with plasma cfDNA. The method identifies candidate SNVs using iterative models of (i) background noise in paired germline DNA (in this work, PBLs), (ii) base-pair resolution background frequencies in plasma cfDNA across the selector, and (iii) sequencing error in cfDNA. Anecdotal examples are provided in
As input, the algorithm takes allele frequencies from a single plasma cfDNA sample and analyzes high quality background alleles, defined in a first step for each genomic position as the non-dominant base with highest fractional abundance. Only alleles with depth of at least 500× and strand bias <90% (conservative, by default) are analyzed. For consistency with variant calling, we allowed the screening approach to interrogate selector regions within 500 bp of defined coordinates, expanding the effective sequence space from ~125 kb to ~600 kb.
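A minimal sketch of this first step is given below. The function name and the per-strand count dictionaries are hypothetical. The text does not define strand bias, so it is assumed here to be the fraction of allele-supporting reads on the more heavily represented strand.

```python
def background_allele(counts_fwd, counts_rev, min_depth=500, max_strand_bias=0.90):
    """Pick the background allele at one position, or return None.

    counts_fwd and counts_rev map each base ("A", "C", "G", "T") to its read
    count on the forward and reverse strand. Returns (base, fraction) for the
    non-dominant base with the highest fractional abundance.
    """
    totals = {b: counts_fwd.get(b, 0) + counts_rev.get(b, 0) for b in "ACGT"}
    depth = sum(totals.values())
    if depth < min_depth:
        return None
    dominant = max(totals, key=totals.get)
    candidates = {b: n for b, n in totals.items() if b != dominant and n > 0}
    if not candidates:
        return None
    base = max(candidates, key=candidates.get)
    n = candidates[base]
    # Assumed definition: share of supporting reads on the majority strand.
    strand_bias = max(counts_fwd.get(base, 0), counts_rev.get(base, 0)) / n
    if strand_bias >= max_strand_bias:
        return None
    return base, n / depth
```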
Second, the binomial distribution is used to test whether a given input cfDNA allele is significantly different from the corresponding paired germline allele (
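One way such a binomial test could be computed is sketched below. It assumes that the paired germline allele fraction serves as the per-read success probability and that a one-sided upper-tail test is intended; neither detail is stated above, and the function name is hypothetical.

```python
import math


def binom_upper_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p), summed in log space for stability.

    k: reads supporting the allele in cfDNA
    n: overall cfDNA depth at the position
    p: assumed success probability (here, the germline allele fraction)
    """
    if k <= 0:
        return 1.0
    if p <= 0.0:
        return 0.0
    if p >= 1.0:
        return 1.0
    log_p, log_q = math.log(p), math.log1p(-p)
    total = 0.0
    for i in range(k, n + 1):
        log_term = (math.lgamma(n + 1) - math.lgamma(i + 1)
                    - math.lgamma(n - i + 1)
                    + i * log_p + (n - i) * log_q)
        total += math.exp(log_term)
    return min(total, 1.0)
```

A small p-value suggests that the cfDNA allele is more abundant than the germline background alone would explain. In this sketch a germline fraction of zero yields a p-value of zero for any supporting read, so a practical implementation would need some policy for that case.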
Third, a database of cfDNA background allele frequencies is assembled. Here, we used samples analyzed in the present study (i.e., pre-treatment NSCLC samples and 1 sample from a healthy volunteer), except the input sample is left out to avoid bias. Based on the assumption that all background allele fractions follow a normal distribution, a Z-test is employed to test whether a given input allele differs significantly from typical cfDNA background at the same position (
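The leave-one-out Z-test can be sketched as follows. This is an illustration under the normality assumption stated above; the function names, the use of the sample standard deviation, and the one-sided p-value are assumptions.

```python
import math
from statistics import mean, stdev


def leave_one_out(background_db, input_sample):
    """Background fractions at one position, excluding the input sample.

    background_db is a hypothetical mapping of sample name -> allele fraction.
    """
    return [af for name, af in background_db.items() if name != input_sample]


def z_test_upper(x, background):
    """Return (z, one-sided p) for allele fraction x against background values."""
    if len(background) < 2:
        raise ValueError("need at least two background samples")
    mu, sd = mean(background), stdev(background)
    if sd == 0:
        raise ValueError("background has zero variance")
    z = (x - mu) / sd
    # Upper-tail probability of the standard normal distribution.
    p = 0.5 * math.erfc(z / math.sqrt(2))
    return z, p
```

An allele whose fraction lies far above the position-specific background yields a large z and a small p-value, and would be carried forward to the next step.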
Finally, candidate alleles are tested for remaining possible sequencing errors. This step leverages the observation that non-tumor variants (i.e., “errors”) in plasma cfDNA tend to have a higher duplication rate than bona fide variants detectable in the patient's tumor (data not shown). As such, the number of supporting reads is compared for each input allele between nondeduped (all fragments meeting QC criteria; see Methods) and deduped data (only unique fragments meeting QC criteria). An outlier analysis is then used to distinguish candidate tumor-derived SNVs from remaining background noise (
Importantly, this approach positively identified 60% of the cancer samples with tumor-derived SNVs analyzed in this study with no false positive calls (
All patents, patent publications, and other published references mentioned herein are hereby incorporated by reference in their entireties as if each had been individually and specifically incorporated by reference herein.
While specific examples have been provided, the above description is illustrative and not restrictive. Any one or more of the features of the previously described embodiments can be combined in any manner with one or more features of any other embodiments in the present invention. Furthermore, many variations of the invention will become apparent to those skilled in the art upon review of the specification. The scope of the invention should, therefore, be determined by reference to the appended claims, along with their full scope of equivalents.
This invention was made with government support under grant number W81XWH-12-1-0285 awarded by the Department of Defense. The government has certain rights in the invention.
Number | Date | Country
---|---|---
61798925 | Mar 2013 | US