This invention is related to the area of nucleic acid analysis. In particular, it relates to nucleic acid sequence analyses which have increased sensitivity and accuracy.
Extensive knowledge of the genetic alterations that underlie cancer is now available, opening new opportunities for the management of patients (1-3). Some of the most important of these opportunities involve “liquid biopsies,” i.e., the evaluation of blood and other bodily fluids for mutant DNA template molecules that are released from tumor cells into such fluids. Although the potential value of liquid biopsies was recognized more than two decades ago (4-6), more recent advances in sequencing technology have made this approach practical. For example, it has recently been shown that liquid biopsies of blood can detect minimal amounts of disease in patients with early stage colorectal cancers, thereby providing evidence that could substantially affect their survival (7). Other studies have shown that circulating tumor DNA (ctDNA) can be detected in the blood of patients with other malignancies, as well as in other bodily fluids such as pancreatic cysts, Pap smears, and saliva (8-16).
The vast majority of current technologies for detecting rare mutations employ digital approaches, where each template molecule is assessed, one by one, to determine whether it is wild type or mutant (17). The digitalization can be performed in wells (17), in tiny droplets formed by emulsification or microfluidics (18, 19), or in clusters (20). The most powerful of these approaches employs massively parallel sequencing to simultaneously analyze the entire sequences of hundreds of millions of individually amplified template molecules (21). However, all the currently available sequencing instruments have relatively high error rates, limiting sensitivity at many nucleotide positions to one mutant among 100 wild type (WT) template molecules, even with DNA templates that are of optimal quality (21). The DNA quality of clinical samples is often far less than optimal, compounding the problem. Sensitivity can be increased by pre-treating the DNA to remove damaged bases prior to sequencing (22, 23) and by bioinformatics and statistical methods to enhance base-calls after sequencing (24, 25). Although useful for a variety of purposes, the sensitivity obtainable with these improvements is generally not sufficiently high for the most challenging applications, such as liquid biopsies, which can require detection of one mutant molecule among thousands of WT molecules (9).
Another important way to improve sensitivity is with the use of “molecular barcodes,” in which each template is covalently linked to unique identifying sequences (UIDs). Molecular barcodes were originally used to count individual template molecules (26), but were subsequently incorporated into a powerful approach, termed SafeSeqS, for error reduction (27). After incorporation of the UIDs, subsequent amplification steps produce multiple copies of each UID-linked template. Each of the daughter molecules produced by amplification contains the same UID, forming a UID family. To be considered a bona fide mutation, termed a supermutant, every member of the UID family must have the identical sequence at each queried position (27).
There are two general ways to assign molecular barcodes to template DNA molecules. One is used to PCR-amplify specific loci using a set of locus-specific primers, and the other is used to ligate adapters prior to amplification of the entire genome, creating a library. The PCR method uses primers containing a stretch of random (N) bases to distinguish each individual template molecule (exogenous barcodes) (27, 28). The advantage of this approach is that it is applicable to very small amounts of DNA and virtually the only sequences amplified are the desired ones, reducing the amount of sequencing needed to evaluate a specific mutation. The disadvantage is that errors introduced into one strand during the UID-incorporation cycles will create supermutants.
This method therefore still eliminates errors made during sequencing, but not errors made during the initial cycles of PCR. The ligation method either employs random sequences in the adapters used for ligation (27-29) or uses the ends of the randomly sheared template DNA to which the adapters are ligated as “endogenous UIDs” (27, 30). Although errors are still introduced during the PCR steps with the ligation approach, its advantage is that both strands can be identified from the sequencing data (duplex sequencing). The probability that the identical, complementary mutation is introduced into both strands is low (the square of the probability of the mutation appearing in only one strand). The disadvantage of this approach is that it requires library preparation and capture of the sequences to be queried, neither of which is highly efficient.
There is a continuing need in the art to sensitively and specifically assay for sequence variations in an efficient manner.
According to one aspect of the invention a method is provided for detection of rare mutations in a population of DNA molecules. A population of DNA molecules is treated with bisulfite to convert Cytosine bases in the DNA molecules to Uracil bases, forming a population of converted DNA molecules. Molecular barcodes are attached to both strands of the population of converted DNA molecules using an excess of target-specific amplification primers attached to molecular barcodes, forming a population of amplified, barcoded, converted DNA molecules. The amplified, barcoded, converted DNA molecules are amplified in an amplification reaction to form families of amplified, barcoded, converted DNA molecules, wherein amplified, barcoded, converted DNA molecules that share the same molecular barcode form a family of DNA molecules. A plurality of members of the families are subjected to sequencing reactions to obtain nucleotide sequences of both strands of said plurality of members of the families. Nucleotide sequences of a plurality of members of a family are compared and families in which >90% of the members contain a selected mutation are identified. Nucleotide sequences of two complementary strands of an amplified, barcoded, converted DNA molecule are compared and the selected mutation is identified in two complementary strands.
According to another aspect of the invention a method is provided for detecting methylation at a CpG dinucleotide in plus and minus strands simultaneously. A population of DNA molecules is treated with bisulfite to convert Cytosine bases in the DNA molecules to Uracil bases, forming a population of converted DNA molecules. Molecular barcodes are attached to both strands of the population of converted DNA molecules using an excess of target-specific amplification primers attached to molecular barcodes, forming a population of amplified, barcoded, converted DNA molecules. The amplified, barcoded, converted DNA molecules are amplified in an amplification reaction to form families of amplified, barcoded, converted DNA molecules, wherein amplified, barcoded, converted DNA molecules that share the same molecular barcode form a family of DNA molecules. A plurality of members of the families is subjected to sequencing reactions to obtain nucleotide sequences of both strands of said plurality of members of the families. Nucleotide sequences of a plurality of members of a family are compared and families in which >90% of the members contain a selected methylated C at a CpG dinucleotide are identified. Nucleotide sequences of two complementary strands of an amplified, barcoded, converted DNA molecule are compared and a methylated C at the CpG dinucleotide is identified in two complementary strands.
In another aspect of the invention an amplification primer is provided that comprises a sequence selected from the group consisting of: SEQ ID NO: 1-32.
An additional aspect of the invention provides a kit comprising one or more sets of four amplification primers. Each of the primers in one set is complementary to one of four ends of a duplex fragment of bisulfite-converted DNA.
Another aspect of the invention is a method for detection of a polymorphism in a population of DNA molecules. A population of DNA molecules is treated with bisulfite to convert Cytosine bases in the DNA molecules to Uracil bases, forming a population of converted DNA molecules. Molecular barcodes are attached to both strands of the population of converted DNA molecules using an excess of target-specific amplification primers attached to molecular barcodes, forming a population of amplified, barcoded, converted DNA molecules. The amplified, barcoded, converted DNA molecules are amplified in an amplification reaction to form families of amplified, barcoded, converted DNA molecules, wherein amplified, barcoded, converted DNA molecules that share the same molecular barcode form a family of DNA molecules. A plurality of members of the families are subjected to sequencing reactions to obtain nucleotide sequences of both strands of said plurality of members of the families. Nucleotide sequences of a plurality of members of a family are compared and families in which >90% of the members contain a selected polymorphism are identified. Nucleotide sequences of two complementary strands of an amplified, barcoded, converted DNA molecule are compared and the selected polymorphism is identified in two complementary strands.
These and other aspects of the invention, which will be apparent to those of skill in the art upon reading the specification, provide techniques and tools for sensitively and specifically analyzing DNA variations and modifications.
The inventors have developed an approach that incorporates advantages of both the PCR- and ligation-based approaches described above. This approach takes advantage of the fact that bisulfite treatment can efficiently convert dC bases in DNA to U bases. This conversion makes the two strands of DNA distinguishable, and was previously used to distinguish RNA transcripts copied from each of the two possible template strands of DNA (31). Bisulfite conversion has also been extensively used to distinguish methylated C-residues, which do not get converted to T bases, from unmethylated C bases, thereby illuminating epigenetic changes (32). It has also been shown that dC bases can be partially converted to T bases so that each individual template DNA molecule can be distinguished from others by its unique pattern of C to T changes, thereby creating an intrinsic barcode similar to what can be achieved with externally added UIDs (33). DNA in which all C bases have been fully converted to T bases can be used as PCR-templates with specially designed primers linked to exogenous barcodes. This allows individual mutations to be assessed on both strands (duplex sequencing) in a reliable manner, without creation of libraries and with a relatively small number of sequencing reads.
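The way full conversion makes the two strands distinguishable can be illustrated with a minimal in-silico sketch. The sequence below is hypothetical, all C bases are assumed to be unmethylated and fully converted, and the function names are chosen only for illustration.

```python
def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence written 5'->3'."""
    comp = {"A": "T", "C": "G", "G": "C", "T": "A"}
    return "".join(comp[b] for b in reversed(seq))


def bisulfite_convert(strand: str) -> str:
    """Model full conversion: every (unmethylated) C reads as T after PCR."""
    return strand.replace("C", "T")


plus = "ACGTCCGATG"                  # hypothetical plus strand, 5'->3'
minus = reverse_complement(plus)     # its partner strand, 5'->3'

plus_conv = bisulfite_convert(plus)      # "ATGTTTGATG"
minus_conv = bisulfite_convert(minus)    # "TATTGGATGT"

# After conversion the two strands are no longer reverse complements of
# each other, so reads derived from each strand can be told apart.
strands_still_complementary = reverse_complement(plus_conv) == minus_conv
print(plus_conv, minus_conv, strands_still_complementary)
```

Because the converted strands differ in sequence, a separate primer pair is needed for each strand, which is why the primers are used in sets of four.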
The detection of rare mutations in clinical samples is essential to the screening, diagnosis, and treatment of cancer. While next generation sequencing has greatly enhanced the sensitivity of detecting mutations, the relatively high error rate of these platforms limits their overall clinical utility. The elimination of sequencing artifacts could facilitate the detection of early stage cancers and provide improved treatment recommendations tailored to the genetic profile of a tumor. BiSeqS, a bisulfite conversion-based sequencing approach, allows for the strand-specific detection and quantification of rare mutations. BiSeqS eliminates nearly all sequencing artifacts in three common types of mutations and thereby considerably increases the signal-to-noise ratio for diagnostic analyses.
Two types of barcodes are used in BiSeqS. Molecular barcodes serve to identify individual template molecules in an original sample prior to barcoding and amplification. Each individual template molecule will have a unique molecular barcode. Sample barcodes serve to identify a reaction sample or aliquot of an original sample; all template molecules in the reaction sample or aliquot share a barcode that identifies the reaction sample or aliquot. Barcodes may be, for example, randomly generated nucleotide runs or intentionally chosen nucleotide runs. For attaching molecular barcodes in particular, the number of individual molecular barcodes in a reaction mixture will be in excess of the number of template molecules. In the sequence listing which forms part of this application, barcodes are represented as a string of Ns.
Bisulfite conversion will be close to complete conversion. Thus primer design for amplifying bisulfite converted duplex oligonucleotides utilizes complementarity to the converted sequence. Primers are designed to be used in sets of at least four so that both strands of the original duplex template are amplified, sequenced, and identifiable.
Amplification of barcoded sequences generates families of similarly barcoded templates. Each family shares a molecular barcode, denoting that it derives from a single template molecule. Sequencing of the population of amplified templates, including multiple members of a family, permits comparison of nucleotide sequences of multiple members of a single family and assessment of the fraction of members of a family that contain a particular mutation. A high fraction, such as greater than 50, 60, 70, 80, 90, or 95% of families with a particular mutation suggests that the mutation was present in the original sample, prior to amplification. However, some of the identified mutations may still be ones that have been introduced during processing due to in vitro enzymatic errors. Detection of mutations that are due to such errors can be further reduced by comparing sequences obtained from families of two complementary strands. Requiring that a mutation exist on families generated from two strands reduces artifactual apparent mutations significantly.
Fragments of nucleic acids may optionally be obtained using a random fragment forming technique such as mechanical shearing, sonicating, or subjecting nucleic acids to other physical or chemical stresses. Fragments may not be strictly random, as some sites may be more susceptible to stresses than others. Endonucleases that randomly or specifically fragment may also be used to generate fragments. Size of fragments may vary, but desirably will be in ranges between 30 and 5,000 basepairs, between 100 and 2,000 basepairs, between 150 and 1,000 basepairs, or within ranges with different combinations of these endpoints. Nucleic acids may be, for example, RNA or DNA. Modified forms of RNA or DNA may also be used.
Attachment of a molecular barcode to an analyte nucleic acid fragment may be performed by any means known in the art, including enzymatic, chemical, or biologic. One means employs a polymerase chain reaction. Another means employs a ligase enzyme. The enzyme may be mammalian or bacterial, for example. Ends of fragments may be repaired prior to joining using other enzymes such as the Klenow fragment of E. coli DNA polymerase I or T4 DNA polymerase. Other enzymes which may be used for attaching are other polymerase enzymes. A molecular barcode may be added to one or both ends of the fragments, preferably to both ends. A molecular barcode may be contained within a nucleic acid molecule that contains other regions for other intended functionality. For example, a universal priming site may be added to permit later amplification. Another additional site may be a region of complementarity to a particular region or gene in the analyte nucleic acids. A molecular barcode may be, for example, from 2 to 4,000, from 4 to 400, or from 100 to 1,000 bases in length.
Molecular barcodes may be made using random addition of nucleotides to form a short sequence to be used as an identifier. At each position of addition, a selection from one of four deoxyribonucleotides may be used. Alternatively a selection from one of three, two, or one deoxyribonucleotides may be used. Thus the molecular barcodes may be fully random, somewhat random, or non-random in certain positions. Another manner of making molecular barcodes utilizes pre-determined nucleotides assembled on a chip. In this manner of making, complexity is attained in a planned manner.
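The relationship between barcode length and barcode diversity can be sketched as follows. This is an illustrative calculation only: it assumes barcodes are drawn uniformly at random, uses the 14-base random UID described in the Materials and Methods as an example length, and the template count of 10,000 is a hypothetical figure.

```python
def barcode_space(n_random_bases: int, choices_per_position: int = 4) -> int:
    """Number of distinct barcodes for a given length and per-position choice."""
    return choices_per_position ** n_random_bases


def expected_shared_pairs(n_templates: int, n_barcodes: int) -> float:
    """Approximate expected number of template pairs that receive the same
    barcode, assuming uniform random assignment (birthday approximation)."""
    return n_templates * (n_templates - 1) / (2 * n_barcodes)


fully_random = barcode_space(14)          # four choices at each of 14 positions
two_base_only = barcode_space(14, 2)      # only two choices per position

# Hypothetical input of 10,000 template molecules
shared = expected_shared_pairs(10_000, fully_random)
print(fully_random, two_base_only, round(shared, 3))
```

Under these assumptions the barcode space greatly exceeds the number of templates, so two templates rarely share a barcode; restricting the choice at each position shrinks the space sharply.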
A cycle of polymerase chain reaction for adding exogenous molecular barcodes refers to the thermal denaturation of a double stranded molecule, the hybridization of a first primer to a resulting single strand, and the extension of the primer to form a new second strand hybridized to the original single strand. A second cycle refers to the denaturation of the new second strand from the original single strand, the hybridization of a second primer to the new second strand, and the extension of the second primer to form a new third strand, hybridized to the new second strand. Multiple cycles may be required to increase efficiency, for example, when analyte is dilute or inhibitors are present.
Amplification of fragments containing a molecular barcode can be performed according to known techniques to generate families of fragments. Polymerase chain reaction can be used. Other amplification methods can also be used, as is convenient. Inverse PCR may be used, as can rolling circle amplification. Amplification of fragments typically is done using primers that are complementary to priming sites that are attached to the fragments at the same time as the molecular barcodes. The priming sites are distal to the molecular barcodes, so that amplification includes the molecular barcodes. Amplification forms a family of fragments, each member of the family sharing the same molecular barcode. Because the diversity of molecular barcodes is greatly in excess of the diversity of the fragments, each family should derive from a single fragment molecule in the analyte. Primers used for the amplification may be chemically modified to render them more resistant to exonucleases. One such modification is the use of phosphorothioate linkages between one or more 3′ nucleotides. Another employs boranophosphates. Additionally, LNA (locked nucleic acid) bases may be used in the primers; these can increase the Tm of an oligonucleotide containing them.
Family members are sequenced and compared to identify any divergences within a family. Sequencing is preferably performed on a massively parallel sequencing platform, many of which are commercially available. If the sequencing platform requires a sequence for “grafting,” i.e., attachment to the sequencing device, such a sequence can be added during addition of molecular barcodes or separately. A grafting sequence may be part of a molecular barcoded primer, a universal primer, a gene target-specific primer, the amplification primers used for making a family, a sample barcoded primer, or separate. Redundant sequencing refers to the sequencing of a plurality of members of a single family.
A threshold can be set for identifying a mutation in an analyte. If the “mutation” appears in all members of a family, then it derives from the analyte. If it appears in less than all members, then it may have been introduced during the analysis. Thresholds for calling a mutation may be set, for example, at 1%, 5%, 10%, 20%, 30%, 40%, 50%, 60%, 70%, 80%, 90%, 95%, 97%, 98%, or 100%. Thresholds will be set based on the number of members of a family that are sequenced and the particular purpose and situation.
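Threshold-based calling within UID families can be sketched as below. The read representation (a UID paired with a flag for whether the read carries the queried change), the function name, the minimum family size, and the use of "at least" rather than "greater than" the threshold are all assumptions made for illustration.

```python
from collections import defaultdict


def supermutant_uids(reads, threshold=0.9, min_family_size=2):
    """Return the UIDs whose families carry the change in at least
    `threshold` of their sequenced members.

    reads: iterable of (uid, is_mutant) pairs, one per sequenced read.
    """
    counts = defaultdict(lambda: [0, 0])   # uid -> [mutant reads, total reads]
    for uid, is_mutant in reads:
        counts[uid][0] += int(is_mutant)
        counts[uid][1] += 1
    return {
        uid
        for uid, (mutant, total) in counts.items()
        if total >= min_family_size and mutant / total >= threshold
    }


# Hypothetical reads: one uniformly mutant family, one family with a single
# mutant read (likely an error introduced during analysis), one wild type.
reads = (
    [("AAGT", True)] * 10
    + [("CCTA", True)] + [("CCTA", False)] * 9
    + [("GGAC", False)] * 5
)
print(supermutant_uids(reads))                  # strict threshold
print(supermutant_uids(reads, threshold=0.05))  # permissive threshold
```

A strict threshold keeps only the family in which every member carries the change, while a permissive one also admits the family with a single divergent read.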
Mutations which are detected, monitored, and/or analyzed according to the methods disclosed here may be in cancer driver genes or cancer passenger genes. They may be in other disease-causing or disease-related genes. They may simply be somatic mutations or germline polymorphisms that have no known functional consequence. Examples of driver genes which may be analyzed include NRAS, PIK3R1, PTEN, RNF43, and TP53. But the methods are in no way limited to these genes. Similarly, the method can be used to detect methylation on both strands of a duplex nucleic acid molecule.
Polymerases which can be used for amplification steps of the method can be any that have properties that are desirable for a particular amplification. We used ThermoFisher Phusion U Hot Start™ polymerase in the examples, but we also tested other polymerases and combinations of enzymes. These included Enzo AMPIGENE HS TAQ™ Polymerase; BioRad iTAQ Hot Start DNA Polymerase™; ThermoFisher Phusion HotStart II DNA™ Polymerase; and Sigma Aldrich FastStart™ DNA Polymerase and combinations of these mentioned polymerases.
Amplification primers may be packaged separately or in combinations. They may be in a liquid or dried. The package or kit may optionally contain analytic information on the primers and/or instructions for carrying out methods according to the invention. Kits may optionally contain additional components, such as polymerase enzyme(s), amplification buffer(s), reaction vessels, or other tools to facilitate practice of the methods.
The results described in the examples show that BiSeqS can accurately quantify rare mutations in a highly sensitive and specific manner. We envision that its major use will be in the surveillance of patients with cancer whose primary tumors have been sequenced. It has already been shown that liquid biopsies can be used for this purpose and can accurately identify patients who are in clinical remission but are destined to recur (7, 11, 44). Many such patients, particularly when their residual burden of disease is small and therefore most likely to be cured by adjuvant therapy (45), have only one or two mutant DNA molecules in 10 ml of plasma. In such situations, a technique like BiSeqS, which can efficiently use all template molecules while maintaining high specificity, could prove particularly useful.
A disadvantage of BiSeqS is that it cannot be applied to most transition mutations because of the ambiguities caused by the bisulfite conversion of C to U, which mimics such transitions. Although one strand is still amenable to analysis, the power of the technology lies in its ability to detect mutations in both strands, so BiSeqS offers no advantage over conventional molecular barcoding for such mutations. For example, KRAS codons 12, 13, and 61 are commonly affected by single base substitutions in colon, rectal, and pancreatic adenocarcinomas (46). BiSeqS can be used to quantify KRAS mutations in 38.7%, 43.4%, and 47.6% of these cancers, respectively (47). Across all cancers and mutations cataloged in the IARC TP53 database, approximately 44% of all mutations (i.e., SBS and indels) are amenable to BiSeqS analysis (IARC TP53 Database, R18).
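The transition ambiguity can be made concrete with a small sketch that asks, for each single base substitution, on which strands the reference and mutant bases remain distinguishable after conversion. It assumes full conversion of unmethylated C bases and ignores methylated sites; the function names are illustrative only.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def converted(base: str) -> str:
    """Base as read after full bisulfite conversion (unmethylated C -> T)."""
    return "T" if base == "C" else base


def strands_detecting(ref: str, alt: str) -> list:
    """Strands on which ref and alt still differ after conversion.
    ref and alt are given on the plus strand."""
    visible = []
    if converted(ref) != converted(alt):
        visible.append("plus")
    if converted(COMPLEMENT[ref]) != converted(COMPLEMENT[alt]):
        visible.append("minus")
    return visible


substitutions = [(r, a) for r in "ACGT" for a in "ACGT" if r != a]
both_strands = [s for s in substitutions if len(strands_detecting(*s)) == 2]
one_strand = [s for s in substitutions if len(strands_detecting(*s)) == 1]

print("both strands:", both_strands)   # the eight transversions
print("one strand:  ", one_strand)     # the four transitions
```

Under these assumptions every transversion remains visible on both strands, whereas each of the four transitions is visible on only one strand, consistent with the limitation described above.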
Additionally, bisulfite treatment can result in conversion of methylated C bases to U in rare instances, depending upon the incubation time and reagent concentration (48). The protocol used for BiSeqS employs reduced incubation temperatures that appear to minimize this possibility (48), but sequence heterogeneity at methylated CpG sites may raise background and such sites are not preferred for mutation evaluation.
However, for liquid biopsies in surveillance, limitations inherent to a single gene are not a major issue because several different mutations, including transversions and indels, are generally observed upon genome-wide sequencing of cancers (1-3), and any identified mutation could in principle be applied to this clinical scenario. A recent study of 3,281 cancer samples highlighted that 93% had at least one non-synonymous mutation in at least one driver gene (49). While the average number of point mutations and small indels varied across tumor types, most cancers have at least one driver gene mutation that should be amenable to BiSeqS analysis (49). It is also worth noting that passenger gene mutations that are clonal can also be useful for diagnostic evaluation (50). Because there are at least 10-fold as many passenger mutations as driver gene mutations in nearly all cancers, it is likely that the vast majority of cancers will have several somatic mutations that could be assessed by BiSeqS. For example, in a study of 1157 single base substitutions detected in breast cancer, we calculate that 54.7% of substitutions would be amenable to BiSeqS analysis, in addition to the 7.4% of the tumors that contain insertion or deletion mutations, for a total of 62.1% of tumors (51).
The power of BiSeqS lies in its ability to drastically reduce background errors. Thus, BiSeqS may also complement screening for other genomic alterations, such as structural variants (SV), for rare allele detection and monitoring (52). Structural variants (SVs) provide exquisitely specific markers for cancer that can be used for liquid biopsies (9, 50). Simple polymerase errors do not produce structural variants, providing advantages over single base substitutions as diagnostic targets. On the other hand, there are disadvantages to the use of SVs as diagnostic markers. First, SV detection requires whole genome sequencing of tumors, rather than targeted sequencing of tumors, for their initial detection; the latter is currently much less expensive than the former. Second, and more importantly, structural variants are “private,” i.e., generally confined to one or a small number of patients. To be employed as a tumor marker, primers that specifically amplify the translocation junction must be designed and tested on the patient's tumor to ensure that the structural variant is somatic and the amplicon is specific. Although this approach is feasible in a research setting, it is not easily practicable in large scale settings. In contrast, single base substitutions and indels in driver genes are observed in numerous independent tumors, and a small set of “off-the-shelf” primers can be used to assess most patients. For example, we estimate that >98% of patients with colorectal cancer have mutations detectable through amplification with one of 130 pre-designed primer pairs.
In the future, it is possible that chemical treatments of DNA that convert A:T bp (rather than C:G bp) to other bp could substitute for bisulfite when transition mutations must be analyzed. Another avenue for future research is multiplexing, permitting mutations in a variety of amplicons to be assessed simultaneously in screening scenarios. This multiplexing is more difficult than normal because two amplicons must be designed for each region of interest while achieving homogeneous efficiency of every amplicon in all regions of interest.
The above disclosure generally describes the present invention. All references disclosed herein are expressly incorporated by reference. A more complete understanding can be obtained by reference to the following specific examples which are provided herein for purposes of illustration only, and are not intended to limit the scope of the invention.
Materials & Methods
Briefly, DNA from macro-dissected formalin-fixed paraffin-embedded (FFPE) tumor sections was extracted and bisulfite treated with an EZ DNA Methylation Kit (Zymo Research, Cat. # D5001). Custom primers containing a unique identifier (UID) and amplicon-specific sequence were used to amplify both strands of DNA, and the resulting products were sequenced on an Illumina MiSeq instrument. To characterize the specificity of BiSeqS, DNA isolated from one normal tissue was bisulfite-treated and processed through the BiSeqS pipeline to query for single base substitutions and indels. To characterize the sensitivity of BiSeqS, macro-dissected tumor samples with known MAFs were diluted with the DNA from normal WBCs to obtain final neoplastic cell contents ranging from 0.02% to 0.20%, bisulfite-treated and processed through the BiSeqS pipeline. More details are provided below.
Human Tissues
Formalin-fixed paraffin-embedded (FFPE) tumor sections were macro-dissected under a dissecting microscope to ensure a neoplastic cellularity of >30%. DNA was purified with a Qiagen FFPE Kit (Qiagen, Cat. #56494). Tumor samples with known MAFs were diluted with the DNA from normal WBCs to obtain final neoplastic cell contents ranging from 0.02% to 0.20%. To precisely quantify the DNA concentrations of the tumor and normal DNA samples, various mixtures of tumor and normal DNA were amplified with primers that revealed normal single nucleotide polymorphisms within the final amplicons. NGS was then used to quantify the fraction of neoplastic cells within each of the tested mixtures, and the same mixtures were then used as template DNA for BiSeqS, as described below. All tissues were obtained from consented patients at the Johns Hopkins Hospital with the approval of the Johns Hopkins Institutional Review Board.
Bisulfite Treatment and PCR Amplification of Purified DNA for BiSeqS
After extensive testing of various commercially available bisulfite conversion kits, we chose the EZ DNA Methylation Kit (Zymo Research, Cat. # D5001) to bisulfite treat and desulphonate purified DNA samples following the manufacturer's recommended protocol. DNA was eluted in 10 μL of Elution Buffer and stored at −20° C. Custom HPLC-purified PCR Primers (IDT) were designed for each bisulfite-converted strand of the DNA double helix at the amplified loci (sequence listing). Compared to traditional PCR primers, the custom primers were longer to account for the reduced sequence complexity of bisulfite-converted DNA. Each forward primer contained the sequence necessary for well barcode amplification at the 5′ end, followed by a string of 14 random nucleotides that served as the unique identifier (UID), and amplicon-specific primer sequences at the 3′ end (
The molecular barcoding PCR cycles included 12.5 μL of 2× Phusion U Hot Start PCR Master Mix (ThermoFisher, Cat. # F533S) in a 25 μL reaction, and optimized concentrations of each forward and reverse primer, ranging from 0.125 μM to 4 μM of each forward and each reverse primer for a total of four primers per well. The following cycling conditions were used: one cycle of 95° C. for 3 minutes, 20 cycles of 95° C. for 10 seconds, 63° C. for 2 minutes, and 72° C. for 2 minutes.
AMPure XP (Beckman Coulter, Cat. # A63881) was used to remove the primers for UID assignment. 0.025% of the PCR product generated from the UID cycles was used for the well barcoding (WBC) cycles. Primers used for the well barcode step were identical to those described previously and are diagrammed in
Sequencing
Sequencing of all the amplicons described above was performed using an Illumina MiSeq instrument. The total length of the reads used for each instrument varied from 79 to 130 bases. Reads passing Illumina CASAVA Chastity filters were used for subsequent analysis.
BiSeqS Pipeline
High quality reads were processed with the SafeSeqS pipeline (28) to generate aligned data that were then organized into tables for each BiSeqS analysis. Each of the tables contains: (i) strand information, (ii) well barcode and UID sequences, (iii) information listing all differences from the reference amplicon, and (iv) prevalence of each UID family corresponding to a change with respect to all UID families per amplicon. To determine whether a combination of plus and minus strand changes constitute a double strand mutant, the various mutations detected at a specific genomic locus are compared with respect to: (i) sample identity, (ii) chromosome, (iii) genomic position, and (iv) mutation type. Changes were called as true mutations when: (i) the change appeared on both the plus and the minus strands, and (ii) when the MAFs corresponding to the plus and minus strands differed by less than 10-fold.
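The two calling criteria above can be sketched as follows. The dictionary layout, the function names, and the example coordinates and frequencies are hypothetical; only the matching keys (sample, chromosome, position, mutation type) and the 10-fold limit come from the description.

```python
def mafs_agree(plus_maf: float, minus_maf: float, max_fold: float = 10.0) -> bool:
    """True when both strand MAFs are nonzero and differ by less than max_fold."""
    if plus_maf <= 0 or minus_maf <= 0:
        return False
    ratio = max(plus_maf, minus_maf) / min(plus_maf, minus_maf)
    return ratio < max_fold


def double_strand_mutations(plus_calls: dict, minus_calls: dict,
                            max_fold: float = 10.0) -> list:
    """Match calls keyed by (sample, chromosome, position, mutation_type)
    and keep those present on both strands with concordant MAFs."""
    shared = plus_calls.keys() & minus_calls.keys()
    return sorted(
        key for key in shared
        if mafs_agree(plus_calls[key], minus_calls[key], max_fold)
    )


# Hypothetical per-strand calls (key -> MAF)
plus_calls = {
    ("S1", "chr12", 1000, "G>T"): 0.0020,
    ("S1", "chr17", 2000, "delA"): 0.0040,
    ("S1", "chr3", 3000, "A>T"): 0.0010,
}
minus_calls = {
    ("S1", "chr12", 1000, "G>T"): 0.0015,   # concordant
    ("S1", "chr17", 2000, "delA"): 0.0001,  # 40-fold apart, rejected
}
print(double_strand_mutations(plus_calls, minus_calls))
```

Only the first call survives: the second fails the 10-fold filter and the third is absent from the minus strand.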
Characterization of BiSeqS Specificity
To characterize the specificity of BiSeqS, DNA isolated from one normal tissue was bisulfite-treated and processed through the BiSeqS pipeline to query for single base substitutions and indels. Analysis using NGS across 8 amplicons and 608 bases for indels identified 907 unique mutations on the plus strand and 958 unique mutations on the minus strand that were ultimately amenable to analysis by BiSeqS. For each strand of each amplicon, we calculated the mutant allele frequency (MAF) by dividing the number of reads or the number of UIDs containing >2 mutant reads per UID (UID Family Count >2) by the number of total reads or the number of total UIDs, respectively. Using molecular barcodes to group reads into families decreased the number of unique mutations to 92 on the plus strand and 71 on the minus strand (data not shown). After matching the plus and minus strand amplicons and imposing a filter of less than 10 for the ratio of the frequency of mutations observed on the plus strand to that observed on the minus strand (and vice versa), four mutations were identified (data not shown). The number of SDMs was taken to be the minimum of the number of supermutants on the plus or the minus strand that corresponded to a mutation, as this is the limiting number of double stranded supermutant molecules detectable. The total number of double stranded molecules was similarly taken to be the minimum of the number of total UIDs on the plus or the minus strand, as this is the limiting number of double stranded template molecules detected. Standard NGS detected 197 and 167 indels on the plus and minus strands, respectively. Use of molecular barcodes reduced the number of detected indels to 6 and 5 for the plus and minus strand, respectively, while BiSeqS double strand analysis reduced the number of indels to zero.
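The UID-level MAF and the "limiting strand" counts described above can be sketched as follows. The input layout, function names, and all numbers are hypothetical and are not data from the experiments; the only rules taken from the text are the >2 mutant reads per UID cutoff and the use of the minimum across strands.

```python
def uid_level_maf(mutant_reads_by_uid: dict, total_uids: int, cutoff: int = 2):
    """Count UIDs with more than `cutoff` mutant reads and divide by the
    total number of UIDs on that strand."""
    supermutants = sum(1 for n in mutant_reads_by_uid.values() if n > cutoff)
    return supermutants, supermutants / total_uids


def double_strand_counts(plus_supermutants: int, minus_supermutants: int,
                         plus_total_uids: int, minus_total_uids: int):
    """Take the minimum across strands for both the mutant and total counts,
    since the scarcer strand limits how many duplex molecules are detectable."""
    sdm = min(plus_supermutants, minus_supermutants)
    total = min(plus_total_uids, minus_total_uids)
    return sdm, total


# Hypothetical mutant read counts per UID for one mutation
plus_mutant_reads = {"uid1": 5, "uid2": 1, "uid3": 3}
minus_mutant_reads = {"uid7": 4, "uid8": 2}

plus_sm, plus_maf = uid_level_maf(plus_mutant_reads, total_uids=1000)
minus_sm, minus_maf = uid_level_maf(minus_mutant_reads, total_uids=800)
sdm, total_duplex = double_strand_counts(plus_sm, minus_sm, 1000, 800)
print(plus_sm, plus_maf, minus_sm, minus_maf, sdm, total_duplex)
```

In this example the plus strand contributes two supermutants and the minus strand one, so one double stranded supermutant is counted among 800 double stranded molecules.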
BiSeqS Workflow
The principal feature of BiSeqS is the simultaneous detection of a mutation on both the plus and minus strands of DNA templates that were bisulfite treated and molecularly barcoded. We refer to the reference sequence as defined by UCSC as the plus (+) strand, and its reverse complement as the minus (−) strand. Three simple experimental steps (bisulfite conversion, molecular barcoding, and sample barcoding) can be employed prior to a specialized bioinformatics analysis of the sequencing data, as described below (
Step i: Bisulfite Conversion. Incubation of DNA with sodium bisulfite at elevated temperatures and low pH deaminates cytosine to form 5,6-dihydrocytosine-6-sulfonate (34). Subsequent hydrolytic deamination at high pH removes the sulfonate, resulting in uracil (35). Many modifications of this basic reaction have been described and used largely to differentiate between cytosine and 5-methylcytosine (5-mC), the latter of which is not susceptible to bisulfite conversion. In addition to converting C to U, bisulfite treatment denatures DNA and can degrade it. Although this degradation is not limiting for standard applications of bisulfite treatment, it is critical for applications involving mutation detection in clinical samples that are already degraded prior to conversion (36-38). In the current study, we evaluated many ways to convert DNA, and purify the converted strands. The best results were obtained with the reagents, conditions, and incubation times described in the Materials and Methods. As shown in
Step ii: Molecular Barcoding. The goal of bisulfite treatment is to create a code for distinguishing the two strands of DNA. This doubles the number of templates that need to be molecularly barcoded and requires specialized steps compared with standard DNA amplification. First, four primers must be designed to amplify each region of interest, two primers for each strand. Second, the primers must be complementary to the converted form of the DNA, accentuating the importance of full conversion; otherwise, some template molecules will not be amplified because they will not be perfectly complementary to the primers. Third, bisulfite treatment under the conditions we employed converts virtually all non-modified C residues to T, lowering the melting temperature of both the primer annealing sites and the amplicon in general. Because both strands must be amplified equivalently and in the same reaction, the primers must be chosen so that the same PCR cycling conditions can be used for amplifying both strands in a highly specific manner. For regions in which there is already a low C:G base pair content, the primers have to be long enough to allow specific amplification under relatively high-temperature annealing conditions. This proved difficult without yielding large amounts of primer dimers, and to overcome these challenges, several primer designs were evaluated. Eventually, variations in primer length, position, composition and C:G content allowed for specific and robust amplification of both strands of every target region attempted.
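The need for four primers and the drop in C:G content can be illustrated with an in-silico conversion. The sketch below is a simplified model that assumes complete conversion and no methylation; the function names and the eight-base sequence are made up for illustration and are not taken from the amplicons studied.

```python
def bisulfite_convert(strand):
    """Fully convert a 5'->3' strand: every C reads as T (assumes no methylation)."""
    return strand.upper().replace("C", "T")

def reverse_complement(seq):
    comp = {"A": "T", "C": "G", "G": "C", "T": "A"}
    return "".join(comp[b] for b in reversed(seq.upper()))

def gc_fraction(seq):
    """Fraction of bases that are G or C."""
    return sum(b in "GC" for b in seq) / len(seq)

plus = "ACGTCCGA"                      # made-up example sequence
minus = reverse_complement(plus)       # "TCGGACGT"
plus_conv = bisulfite_convert(plus)    # "ATGTTTGA"
minus_conv = bisulfite_convert(minus)  # "TTGGATGT"

# After conversion the two strands are no longer complementary,
# so each strand needs its own primer pair.
still_complementary = reverse_complement(plus_conv) == minus_conv  # False
```

In this toy case the G+C fraction of the plus strand falls from 0.625 to 0.25 on conversion, which is the effect that lowers primer melting temperatures.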
Another issue confronting amplification of bisulfite converted DNA is that many polymerases will not efficiently copy DNA that contains uracil bases. We tested seven commercially available polymerases and various reaction conditions to optimize efficiency of template use and uniformity of amplification of both strands when four primers were used (Table 1). While a combination of AMPIGene Hot Start Taq Polymerase and iTAQ Polymerase amplified the greatest number of template molecules, their lack of 3′→5′ exonuclease activity proved limiting for specificity in that the number of errors during PCR was unacceptably high. Ultimately, we chose Phusion U Hot Start Polymerase, a polymerase that exhibits 3′→5′ exonuclease activity, as the enzyme to amplify uracil-containing templates with the highest specificity while maintaining sensitivity.
Step iii: Sample Barcoding. Part of the power of massively parallel sequencing instruments is that they can be used to analyze many samples at once. To enable this capacity for BiSeqS, we incorporated a sample barcode PCR cycle following the purification of the molecularly barcoded PCR products (
BiSeqS Data Processing Pipeline
High quality base calls were aligned to the bisulfite-converted reference sequence, and the aligned data were organized into tables for each sample, where each observed mutation in each strand of each well was listed in a separate row. The columns in this table included the number of reads, UIDs, and supermutants for each mutation (data not shown). Supermutants were defined as mutations in a UID family in which >90% of the family members contained that mutation. For example, if all three members of a UID family contained the same mutation, it was considered a supermutant. The supermutant allele fraction was defined as the number of supermutants divided by the number of UIDs in an individual well.
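The supermutant definition above can be written out as a short sketch. It assumes, for illustration only, that each read is reduced to a `(uid, mutation)` pair with `None` meaning no change at the position of interest; the function names and this input layout are hypothetical rather than the pipeline's own.

```python
from collections import Counter, defaultdict

def call_supermutants(reads, min_fraction=0.90):
    """Return {uid: mutation} for UID families in which more than
    min_fraction of the members carry the same mutation."""
    families = defaultdict(list)
    for uid, mutation in reads:
        families[uid].append(mutation)
    supermutants = {}
    for uid, members in families.items():
        counts = Counter(m for m in members if m is not None)
        for mutation, n in counts.items():
            if n / len(members) > min_fraction:
                supermutants[uid] = mutation
    return supermutants

def supermutant_allele_fraction(reads, mutation, min_fraction=0.90):
    """Supermutants carrying `mutation` divided by the number of UIDs in the well."""
    reads = list(reads)
    n_uids = len({uid for uid, _ in reads})
    calls = call_supermutants(reads, min_fraction)
    return sum(1 for m in calls.values() if m == mutation) / n_uids
```

With a threshold above 50%, at most one mutation per family can qualify, so a plain dictionary keyed by UID is sufficient.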
Individual mutations in the plus and minus strands were compared to determine whether the identical supermutant was found in both strands. If the mutation was found in both strands, the supermutant allele fractions in each strand were compared. The supermutant allele fractions on each strand provide an additional level of specificity because these fractions are expected to be similar if a mutant base pair existed in the template DNA prior to conversion and amplification. Given that mutations arising during PCR are relatively rare, it would be even rarer for the same mutation to arise at the identical position in both strands. This is especially true after conversion, when the two strands contain markedly different nucleotide contexts. If the supermutant allele fractions in each strand differed by <10-fold, then the mutation was considered to be a super-duper mutant (SDM). The SDM allelic fraction was defined as the number of SDMs divided by the number of UIDs in the strand that contained the fewest UIDs. For example, if the number of SDMs was 10, and the number of UIDs in the plus and minus strands were 10,000 and 20,000, respectively, then the SDM allelic fraction would be 0.1% (i.e., 10 of 10,000).
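The worked example in the preceding paragraph can be reproduced with a few lines of arithmetic. This is a sketch only: the function name is hypothetical, and the supermutant counts of 10 and 12 used in the usage note are made-up inputs chosen so that the limiting count is the 10 SDMs of the example.

```python
def sdm_allelic_fraction(sm_plus, sm_minus, uids_plus, uids_minus, max_fold=10.0):
    """Return (number of SDMs, SDM allelic fraction) for one mutation.

    sm_plus / sm_minus: supermutant counts on each strand
    uids_plus / uids_minus: total UID counts on each strand
    """
    if min(sm_plus, sm_minus) == 0:
        return 0, 0.0  # must be a supermutant on both strands
    frac_plus = sm_plus / uids_plus
    frac_minus = sm_minus / uids_minus
    if max(frac_plus, frac_minus) / min(frac_plus, frac_minus) >= max_fold:
        return 0, 0.0  # strand fractions too dissimilar
    n_sdm = min(sm_plus, sm_minus)                    # limiting strand
    return n_sdm, n_sdm / min(uids_plus, uids_minus)  # fewest UIDs as denominator
```

Calling `sdm_allelic_fraction(10, 12, 10_000, 20_000)` gives 10 SDMs and a fraction of 0.001, i.e. the 0.1% of the example.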
Special features of the analysis of mutations in converted DNA include the following. A transition from C>T noted in the sequencing could have resulted from a single base substitution mutation that changed a C:G bp to a T:A bp or from bisulfite conversion of a C to a T on one strand. In light of this ambiguity, C to T mutations cannot be considered supermutants in the strand containing the C, though a supermutant would still be evident at that position in the strand containing the G. There are a total of six possible single base substitutions in duplex DNA: a C:G bp can be mutated to either A:T, G:C, or T:A bps, and an A:T bp can be mutated to either C:G, G:C, or T:A. Of these six single base pair substitutions, all result in supermutants on at least one strand and four result in supermutants on both strands (i.e., SDMs). In addition, transitions that create a CpG dinucleotide in which the C is methylated can be assessed on both strands. All insertions or deletions within the amplified sequences can form SDMs. Methylation also introduces complexity, as methylated or hydroxymethylated C bases are not converted to U bases by bisulfite treatment. The BiSeqS pipeline takes this into account when it analyzes the data by not assuming that any particular C is methylated or unmethylated (or that every unmethylated C is converted to T by bisulfite treatment). Instead, it considers the possible effects of conversion and methylation and only labels a mutation as a supermutant or SDM if there is no ambiguity. A list of all possible single base substitutions on either strand, within a triplet context and with the mutated base in the middle, is provided in Table 1, below.
For each single base substitution, the capacity of BiSeqS to identify SDMs is also provided in this table. In general terms, all transversions, all insertions and deletions, and a small subset of transitions can be unambiguously scored as SDMs (Table 1). Because the power of BiSeqS lies in SDMs, only mutations that are interpretable in both strands are considered below.
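The count of six substitutions, four of them scorable on both strands, can be checked with a small enumeration. The sketch below is a deliberately simplified model, not the BiSeqS pipeline: it assumes every C is unmethylated and fully converted, so it does not capture the methylated-CpG exception mentioned above, and all names in it are hypothetical.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}

def convert(base):
    """Sequenced read-out of a base after full bisulfite conversion."""
    return "T" if base == "C" else base

def visible(ref_base, mut_base):
    """A substitution is visible on a strand only if the converted
    reference and converted mutant bases differ."""
    return convert(ref_base) != convert(mut_base)

def strands_scorable(ref_plus, mut_plus):
    """Return (visible on plus strand, visible on minus strand)."""
    plus = visible(ref_plus, mut_plus)
    minus = visible(COMPLEMENT[ref_plus], COMPLEMENT[mut_plus])
    return plus, minus

# The six duplex substitutions, named by the plus-strand base of a C:G or A:T pair.
SUBSTITUTIONS = [(ref, mut) for ref in "CA" for mut in "ACGT" if mut != ref]
both = [s for s in SUBSTITUTIONS if all(strands_scorable(*s))]
either = [s for s in SUBSTITUTIONS if any(strands_scorable(*s))]
```

Under this model `either` contains all six substitutions and `both` contains the four transversions; the two transitions (C>T and A>G on the plus strand) are each visible on one strand only.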
BiSeqS Increases the Specificity of Mutation Calling
We selected eight amplicons within prototypic cancer driver genes to assess BiSeqS performance. For each of the eight amplicons, two forward primers and two reverse primers for each strand were synthesized and tested using the principles described above and in the Materials and Methods. For all amplicons, at least one primer pair for each strand was found capable of specifically amplifying the intended strand with high efficiency, as judged by polyacrylamide gel analysis (
For each of the eight amplicons, we compared the specificity of BiSeqS to that of conventional next generation sequencing (NGS) and molecular barcode-assisted sequencing (i.e., SafeSeqS). We considered only those potential mutations that could be discerned in both strands, as described above. There were a total of 608 bp within these amplicons, yielding a total of 1550 single base substitutions possible. Of these 1550 potential SBS, 1252 (80.8%) were scorable as SDMs; the remainder were transitions that were not scorable for the reasons noted above. There were also many possible indels at each position that could have been observed in the sequencing data, all scorable as SDMs.
In the actual experiment, we could distinguish the strand used as template in the sequencing instrument because of the bisulfite conversion. In light of this, there were actually 2504 mutations (2 × the 1252 scorable substitutions, one set per strand) that could be scored for both conventional and molecular-barcode assisted sequencing. Of these 2504 potential SBSs, 1865 (74.5% of the total possible mutations) were actually observed upon conventional sequencing (25), highlighting the relatively large number of errors observed unless error correction by SafeSeqS or BiSeqS is applied (data not shown). There was no discernible difference between the two strands with respect to the number of mutations observed, with 907 and 958 mutations observed on the plus and minus strands, respectively. There were also 298 small insertions or deletions observed by conventional NGS.
Application of the molecular barcoding approach to these data considerably reduced the number of mutations, as evident by comparison of
Application of BiSeqS to these data resulted in a further striking reduction in errors. Only four SDMs were observed over all eight amplicons sequenced, as opposed to 1865 and 163 mutations without and with molecular barcoding, respectively (
BiSeqS also reduced errors at indels; there were 364 mutants, 11 supermutants, and zero SDMs observed in the eight amplicons (
Sensitivity of BiSeqS
Massively parallel sequencing allows billions of amplicons to be assessed simultaneously, resulting in theoretical sensitivities of 1 mutation among >1 billion WT templates for any base within an amplicon. The actual sensitivities in clinical samples are limited only by the amount of input DNA and the specificity. In many types of liquid biopsies, such as those from plasma, pancreatic cysts, CSF, and urine, the total DNA available is often <33 ng (7, 9, 12, 39). A sensitivity of 0.01% is therefore adequate for detecting the one or two mutant molecules that may exist among the 10,000 templates contained in 33 ng of human DNA in such samples. The reliability of this detection is limited by the biological and technical specificities, where the queried mutation must be found at far lower frequencies in the normal control samples used for comparison to the tumor. Although the biological issues that might lead to mutations in normal samples cannot be circumvented (40), technical issues can be addressed and overcome through methodological advances such as BiSeqS.
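The relation between DNA mass, template number, and required sensitivity can be made explicit. The sketch assumes a mass of roughly 3.3 pg per haploid human genome, a commonly used approximation that is not stated in the text above; the function names are hypothetical.

```python
PG_PER_HAPLOID_GENOME = 3.3  # approximate mass of one haploid human genome (assumption)

def template_count(nanograms):
    """Approximate number of haploid genome templates in a DNA sample."""
    return nanograms * 1000.0 / PG_PER_HAPLOID_GENOME

def expected_mutants(nanograms, mutant_fraction):
    """Expected number of mutant templates at a given mutant allele fraction."""
    return template_count(nanograms) * mutant_fraction
```

Under this assumption, `template_count(33)` is about 10,000 and `expected_mutants(33, 0.0001)` is about 1, consistent with a 0.01% sensitivity corresponding to a single mutant template in 33 ng.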
To address the sensitivity of BiSeqS, we evaluated tumor samples containing ten double-stranded mutations (20 mutations if each strand is counted separately) within the eight amplicons described above (data not shown). The proportion of mutations in each of the tumor samples was defined through NGS. We used the DNA from these tumors to create the scenario characteristic of liquid biopsies, wherein a small amount of DNA from neoplastic cells is mixed with a much larger amount of DNA from normal cells in the patient. More specifically, we diluted this tumor DNA with normal leukocytes to achieve minor allele fractions of 0.02% and 0.20% and then used bisulfite treatment to convert the mixtures. We determined the mutant allele fractions of each of the tumor-derived mutations when analyzed with standard NGS, with molecular barcodes, or with BiSeqS, in all cases holding the input DNA to 5,000 template molecules per well, and performing each experiment in six wells. We found that each of the three methods of analysis yielded mutant allele fractions that were similar to those expected from the dilutions (examples in
Although the efficiency of amplification was therefore always high enough to detect the mutant templates, the MAFs of the normal controls limited the interpretation of the sequencing data. We scored a call as a true mutation when the signal-to-noise ratio (SNR), defined as the MAF in the tumor specimen divided by the MAF in normal cells, was >10. We averaged the MAF in both strands for this calculation when considering standard NGS or molecular barcode-assisted NGS.
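The signal-to-noise criterion can be sketched as follows. The function names are hypothetical, and treating a normal-control MAF of zero as an infinite SNR is an assumption made here for illustration; the text does not say how that case was handled.

```python
def strand_averaged_maf(maf_plus, maf_minus):
    """Average of the two strand MAFs, as used for standard and
    molecular barcode-assisted NGS."""
    return (maf_plus + maf_minus) / 2

def signal_to_noise(maf_tumor, maf_normal):
    """MAF in the tumor specimen divided by MAF in normal cells."""
    if maf_normal == 0:
        return float("inf")  # assumption: no background means unbounded SNR
    return maf_tumor / maf_normal

def is_true_mutation(maf_tumor, maf_normal, min_snr=10.0):
    """Score a call as a true mutation when the SNR exceeds min_snr."""
    return signal_to_noise(maf_tumor, maf_normal) > min_snr
```

For example, a tumor MAF of 0.2% against a normal MAF of 0.01% gives an SNR of 20 and passes, whereas the same tumor MAF against a normal MAF of 0.1% gives an SNR of 2 and does not.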
BiSeqS Simultaneously Detects Methylation Status on Both Strands
Cytosine bases in 5′-CpG dinucleotides that are methylated are protected from conversion to uracil during bisulfite treatment, allowing BiSeqS to detect the methylation status of the plus and minus strands simultaneously. Although not the primary purpose of BiSeqS, this discrimination could prove useful for the analysis of methylation that occurs at low levels, either for basic research or clinical purposes. Although bisulfite treatment and specially-designed primers have often been used to evaluate methylation in the past for a variety of clinical purposes (41-43), the combination of molecular barcoding with simultaneous amplification of both strands provides unprecedented sensitivity in this type of analysis.
To demonstrate the ability of BiSeqS to discriminate the methylation status on both strands simultaneously, we evaluated a region of the TP53 gene that contains a known methylated CpG at hg19 positions 7,572,973 to 7,572,974. Greater than 90% of the UIDs on both strands were found to be methylated at the C on the plus strand at position 7,572,973 and at the C opposite the G on the minus strand at position 7,572,974. Greater than 99.8% of the C residues that were not at 5′-CpG dinucleotides within this amplicon were found to be converted to Ts, providing an essential control for interpreting the extent of methylation. We then searched for evidence of double-stranded methylation within all eight amplicons evaluated in this study in normal WBCs. There were two 5′-CpG dinucleotides within the 608 bp that could be evaluated. Of these, we found that both CpGs were methylated on both strands, with the fraction of methylated alleles ranging from 92.10% to 96.10% (data not shown).
The disclosure of each reference cited is expressly incorporated herein.
Circulating Cell-Free DNA from Ovarian Cancers. Sci Rep 6:29831.
This application claims the benefit of U.S. Provisional Application Ser. No. 62/476,234, filed Mar. 24, 2017, the disclosure of which is incorporated herein by reference in its entirety.
This invention was made with government support under CA62924 awarded by the U.S. National Institutes of Health. The government has certain rights in the invention.
Filing Document | Filing Date | Country | Kind |
---|---|---|---|
PCT/US2018/022664 | 3/15/2018 | WO | 00 |
Number | Date | Country | |
---|---|---|---|
62476234 | Mar 2017 | US |