The instant application contains a Sequence Listing which has been submitted in ASCII format via EFS-Web and is hereby incorporated by reference in its entirety. Said ASCII copy, created on Oct. 26, 2012, is named 381596US.txt and is 828,319 bytes in size.
1. Field of the Invention
The present invention relates to the field of nucleic acid analysis, and, more particularly, to methods for contacting fragmented nucleic acids, such as genomic DNA, with probes and enzymes whereby selected portions of the genomic DNA are amplified and assayed.
2. Related Art
Presented below is background information on certain aspects of the present invention as they may relate to technical features referred to in the detailed description, but not necessarily described in detail. That is, individual parts or methods used in the present invention may be described in greater detail in the materials discussed below, which materials may provide further guidance to those skilled in the art for making or using certain aspects of the present invention as claimed. The discussion below should not be construed as an admission as to the relevance of the information to any claims herein or the prior art effect of the material described.
Next generation DNA sequencing (NGS) has revolutionized genetics by enabling one to routinely sequence human genomes, either in their entirety or specific subsets. While NGS advances have dramatically increased our ability to identify disease-related genetic variants, the widespread application of NGS-based approaches to clinical populations faces some limitations. For example, NGS-based discovery of cancer mutations for large translational and clinical studies is severely restricted by the availability of clinical samples from which one can extract high quality genomic DNA. The vast majority of cancer samples, such as gastric and colorectal cancer samples, are processed by formalin fixation and paraffin embedding (FFPE) of tissues. For clinical pathology laboratories, this is a preferred preservation method because (1) it maintains morphological features of the tumor, (2) it enables histopathologic examination with a number of staining processes and (3) the resulting samples can be stored indefinitely at room temperature. However, the fixation process causes irreversible damage to the sample genomic DNA via cross linkages and increased fragmentation. As a result, genomic DNA extracted from FFPE material is often of poor quality. Furthermore, FFPE-extracted genomic DNA is generally in a single stranded form because of the need for high temperature incubations to melt the paraffin. Therefore, the analysis of FFPE-derived genomic DNA using PCR-based assays is difficult. Overall, these issues restrict our ability to conduct clinical population genetic studies and genetic diagnostic development using these valuable samples.
A variety of methods have been developed to enrich specific regions of the human genome. These include in-solution hybridization enrichment, multiplexed-PCR and targeted circularization approaches. Hybrid selection methods apply immobilized oligonucleotides on either microarrays [1-3] or beads [4] to enrich genomic targets from a modified DNA sample. In multiplex-PCR [5], complex primer sets can be utilized to selectively amplify targeted regions prior to modifying DNA for the sequencer. Highly parallel simplex PCR reactions can be conducted with microdroplet technology [6]. In-solution oligonucleotide-based approaches such as molecular inversion probes (MIPs) capture targets by DNA synthesis across the target and ligation that result in circularization of the capture oligonucleotides [7, 8]. In another in-solution approach, targeted genomic circularization (TGC) directly captures a genomic DNA target by converting it into a target specific circle using in-solution capture oligonucleotides [9].
There are limitations with all of the previously described capture methods on genomic DNA from FFPE samples. For example, hybridization enrichment has been applied to cancer samples for single nucleotide variation (SNV) detection [10]. Kerick et al. used the Agilent in-solution hybridization method to investigate reproducibility of SNV detection comparing genomic DNA from FFPE to flash-frozen samples. They demonstrated a false positive rate of approximately 1% when using sequencing coverage greater than 20×. This translates into 1 false mutation call for every 100 variants identified. In addition, hybridization-based methods have high levels of off-target capture and involve complex workflows that require additional PCR amplification and sample preparation steps. MIP technology has potential advantages for degraded genomic DNA from FFPE samples, but the capture reaction is inefficient for larger targets beyond 200 bps and the assay is extremely complicated in its implementation [11]. Furthermore, with MIPs, the captured regions contain 20 bps of the oligonucleotide-derived sequences and the rest is the reverse-complement of the template DNA, not the original DNA strand. This requires some degree of bioinformatic processing to eliminate synthetic sequence. Capture with the targeted genomic circularization relies on the presence of existing restriction sites in double stranded DNA and requires multiple restriction enzymes, which increases the number of reactions needed for a given sample [9]. The absence of a suitable restriction site can therefore limit capture coverage. Furthermore, TGC-capture requires double stranded DNA for restriction enzyme fragmentation while FFPE-derived genomic DNA is generally single stranded.
Whole genome amplification using random primers followed by an end-repair step can be used to sequence FFPE-derived genomic DNA, but these amplification steps can skew the representation of certain regions even before the capture reaction.
Dahl et al., “Multiplex amplification enabled by selective circularization of large sets of genomic DNA fragments,” Nucleic Acids Res. 33 e71 (2005), discloses a method for multiplex amplification which uses a general primer pair motif and a vector oligonucleotide selector probe, where the circularization procedure starts with digestion of the DNA to generate targets.
US patent publication 2008/0199916, by Zheng et al., published Aug. 21, 2008, entitled “Multiplex targeted amplification using flap nuclease,” discloses the use of UDG (uracil-DNA glycosylase) and a flap exonuclease.
PG Pub 2007/0128635 by Macevicz, entitled “Selected Amplification of Polynucleotides,” discloses a method in which fragments and selection oligonucleotides are combined in a reaction mixture comprising the following enzymatic activities: (i) a 5′ flap endonuclease activity, (ii) a DNA polymerase lacking strand displacement activity, (iii) a 3′ single stranded exonuclease activity, and (iv) a ligase activity.
WO 2008/033442 A2, “Methods And Compositions For Performing Low Background Multiplex Nucleic Acid Amplification Reactions,” by Fredriksson et al., discloses a method of amplifying target nucleic acids involving circularizing target amplicons in an amplified composition; and selecting for said circularized target amplicons in said amplified composition.
The following brief summary is not intended to include all features and aspects of the present invention, nor does it imply that the invention must include all features and aspects discussed in this summary.
The present invention comprises, in certain aspects, methods and materials for detection and analysis of a large number of random fragments of DNA in a sample. The methods can be used for targeted resequencing of DNA. In certain aspects, the present methods employ a mixture of single-stranded polynucleotide capture probes, a number of universal single stranded oligonucleotides (second polynucleotides) each having the same sequence and hybridizing to a portion of the various capture probes; and a mixture comprising exonucleases and a ligase.
In certain aspects, the present invention comprises a composition in the form of a reaction mixture useful for preparing a population of double stranded DNA molecules from a sample containing single stranded polynucleic acids, comprising, preferably in a suitable buffer: (a) a plurality of single stranded capture probes, each capture probe containing (i) 5′ and 3′ end capture arms complementary to specific portions of a polynucleic acid in the sample and (ii) an invariant sequence between the capture arms, whereby a circular structure comprising a specific capture probe and a polynucleic acid sample molecule having regions complementary to the capture arms is formed in the buffer; (b) a plurality of second (“universal”) single stranded polynucleotides having a sequence complementary to the invariant sequence and having amplification sites for amplification of a polynucleic acid in a circular structure; and (c) a 5′ exonuclease, a 3′ exonuclease, and a ligase. While “each” capture probe will contain the defined features, it is not to be implied that “every” capture probe in a composition must have these features.
The single stranded polynucleic acids in the composition may comprise random fragments of human genomic DNA. The fragments may be fixed by crosslinking and embedded in a wax, which makes the composition well suited for dealing with degraded DNA from FFPE samples.
The composition may also comprise at least one of amplification primers and a polymerase for amplification. The amplification sites of the composition may comprise PCR primer sites, which may be spaced on the universal polynucleotides about 120 to 250 bases apart.
In certain embodiments, the composition (reaction mixture) comprises capture probes having a three part construction: two capture arms on the flanks which are able to capture specific single-stranded genomic DNA and a sequence between the two capture arms which is termed a “universal” sequence in that it is essentially the same (“invariant”) among the different probes. The capture probes may be present in the composition as a set of at least 500 different probes, at least 600 different probes, at least 700 different probes, or at least 1000 different probes, each probe having capture arms complementary to different portions of a single stranded polynucleic acid in the sample and having the same universal probe sequence between the two capture arms.
In certain aspects, the present invention also comprises a method for analyzing single stranded polynucleotides from a sample, comprising the steps of: (a) adding to the sample a plurality of capture probes, each capture probe containing capture arms designed to be complementary to specific portions of a polynucleic acid in the sample and a universal probe sequence between the arms, whereby a circular structure comprising a specific capture probe and a polynucleic acid sample molecule is formed in the buffer; (b) adding to the sample a plurality of universal polynucleotides having a sequence complementary to the universal probe sequence and having amplification sites for amplification of a polynucleic acid in a circular structure; and (c) adding to the sample containing capture probes and universal polynucleotides a mixture of a 5′ exonuclease, a 3′ exonuclease, and a ligase under conditions whereby exonucleases remove bases from the single stranded polynucleotides to form a new 5′ end thereof and a new 3′ end thereof, and the ligase ligates the new 5′ end to the new 3′ end.
The composition and method described above may also comprise a 5′ exonuclease, which may be a polymerase or a thermostable polymerase; a 3′ exonuclease, which may be Exonuclease I; and a ligase, which may be a thermostable DNA ligase. As described below, the capture arms may hybridize to various portions of the DNA in the sample, leaving “flaps”, which are removed by the exonucleases.
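By way of illustration only, the flap removal described above can be modeled in a few lines of Python. The sketch below is not part of the disclosed method: the function name `trim_flaps` and the toy sequences are hypothetical, and hybridization is simplified to exact string matching of the two fragment regions bound by the capture arms.

```python
from typing import Optional, Tuple


def trim_flaps(fragment: str, site_5p: str, site_3p: str) -> Optional[Tuple[str, str, str]]:
    """Split a 5'->3' fragment into (5' flap, captured region, 3' flap).

    site_5p and site_3p are the fragment sequences bound by the two capture
    arms (hypothetical inputs). Returns None if either site is missing.
    """
    start = fragment.find(site_5p)
    if start == -1:
        return None
    # The 3' site must lie downstream of the 5' site.
    site_3p_start = fragment.find(site_3p, start + len(site_5p))
    if site_3p_start == -1:
        return None
    end = site_3p_start + len(site_3p)
    # Everything outside [start, end) is an unhybridized flap that the
    # exonucleases would remove, leaving new 5' and 3' ends for ligation.
    return fragment[:start], fragment[start:end], fragment[end:]


if __name__ == "__main__":
    toy_fragment = "TTT" + "ACGTAC" + "GGGG" + "CATGCA" + "AA"
    print(trim_flaps(toy_fragment, "ACGTAC", "CATGCA"))
```

In this simplified model, the middle element of the returned tuple corresponds to the portion of the sample molecule retained in the circular structure.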
In certain aspects, the present invention further contemplates a method for analyzing single stranded polynucleotides from a sample, comprising the steps of: (a) adding to the sample a plurality of capture probes, each capture probe containing capture arms complementary to specific portions of a polynucleic acid in the sample and a universal probe sequence between the arms, whereby a circular structure comprising a specific capture probe and a polynucleic acid sample molecule is formed in the buffer; (b) adding to the sample a plurality of universal polynucleotides having a sequence complementary to the universal probe sequence and having amplification sites for amplification of a polynucleic acid in a circular structure; and (c) adding to the sample containing capture probes and universal polynucleotides a mixture of a 5′ exonuclease, a 3′ exonuclease, and a ligase under conditions whereby exonucleases remove bases from the single stranded polynucleotides to form a new 5′ end thereof and a new 3′ end thereof, and the ligase ligates the new 5′ end to the new 3′ end; (d) adding to the sample a polymerase and polymerase primers; and (e) conducting a polymerase chain reaction using the polymerase primers for amplification of a portion of a single stranded polynucleotide captured by a corresponding capture probe.
The above method may further comprise the step of sequencing amplified polynucleotides from step (e). The polymerase chain reaction conducted in step (e) may utilize an annealing temperature of between about 45 degrees Celsius and 55 degrees Celsius.
The analyzing of the single stranded polynucleotides from a sample may comprise analyzing polynucleotides from a preserved tissue sample or analyzing polynucleotides from a preserved tissue sample and analyzing polynucleotides from a fresh sample from the same individual.
In certain aspects, the present invention also comprises the preparation of a composition as described herein using a kit. The kit may comprise a set of capture probes and universal oligos. Other reagents, such as enzymes may also be included in the kit. An exemplary set of 628 capture polynucleotides is described in the accompanying sequence listing.
Described herein is a novel DNA targeting and enrichment method particularly suited for analysis of samples containing fragmented single stranded nucleic acids, such as genomic DNA fragments in a biopsy sample. The method results in highly multiplexed amplification of selected portions of the sample nucleic acid, i.e., the reaction mixture may contain hundreds or thousands of different capture probes for amplification of sample DNA regions spanned by the capture probes. The amplified portions from the reaction may be further analyzed, e.g. by sequencing the amplified portions.
The present method is an improvement of a previously described technique that required double stranded DNA as input and required that the targeting oligonucleotide probes be placed adjacent to certain restriction sites. For the present approach, the hybridization arms of the capture oligonucleotides do not require a restriction site and the input DNA can be single stranded. This improves the flexibility and the coverage of the design. An important feature of the present capture approach involves using single stranded DNA as input material. Given the need for high heat during processing, the majority of formalin fixed and paraffin embedded (FFPE) derived genomic DNA molecules are generally single stranded. The present approach has a major advantage compared to other methods that rely exclusively on enzymatic manipulations of double stranded genomic DNA. The capture performance is comparable when using genomic DNA derived from flash-frozen versus FFPE processed tissue. Eighty five percent of the heterozygous SNVs detected from high quality genomic DNA extracted from flash-frozen samples were also detected in targeted resequencing data from the matched FFPE samples. The number of false positive FFPE-specific SNV calls is exceptionally low at one per every 12 Kb of targeted genomic sequence.
While multiplexed capture assays for hundreds of genomic regions are described in the present examples, it is believed the reaction could be scaled to thousands. As published, efficient capture using pools of 5,000 oligonucleotides for restriction enzyme-based targeted circularization has been achieved and it is believed that this new method will scale similarly. For most of the results presented here, we used 4 indexed samples per lane of sequencing (2 flash frozen and 2 FFPE samples). Targeted resequencing projects involving hundreds of exons in hundreds of FFPE samples are therefore achievable and may be implemented with minimal additional steps in a next generation sequencer such as the Illumina HiSeq or GAIIx. In addition, the application of the present approach is demonstrated using the Illumina MiSeq system which is designed for rapid analysis.
An innovative approach to capture genomic targets from archival genomic DNA with in-solution polynucleotides is described. This approach is fundamentally different from other methods given that it only requires random fragments of single stranded genomic DNA as commonly seen in FFPE samples, is highly scalable for multiplexed target coverage, and does not rely on any whole genome amplification. The capture assay is straightforward, relatively fast and can be implemented with standard molecular biology equipment. The robust performance of the capture assay and comparisons of SNV detection using genomic DNA derived from matched flash-frozen and FFPE samples are demonstrated.
The technology described utilizes oligonucleotide-mediated genomic capture without the need for double stranded template or reliance on existing restriction sites. It also alleviates the need to synthesize the complementary strand of the template DNA, which can impose significant limits, such as on the target size.
Another novel aspect of this capture process is its ability to add desired sequences (such as the adapter sequences required for cluster generation on the Illumina® sequencing system) to DNA fragments without the need for the multi-step process normally associated with such manipulation. This can greatly simplify and accelerate the construction of sequencing libraries. That is, the original
Denatured single-stranded genomic DNA 102 having a 5′ end and a 3′ end is combined with a pool of polynucleotides, termed “capture probes,” that mediate targeted circularization of the regions of interest. Since the size of DNA 102 is unknown and variable (“random”), portions of the DNA 102 will extend 5′ and 3′ from the hybridization sites, as shown in step 1. The capture probes are single stranded DNA molecules that may be e.g. 80 bases long, or in the range of 40 to 300 bases long. A single capture probe will have 5′ capture arm 104, a middle portion 105 (“universal probe sequence”) and a 3′ capture arm 106 (
Genomic DNA in the sample can come from either flash-frozen or FFPE processed tissue samples. Each capture arm 104, 106 from a single capture probe anneals to a predetermined sequence in a specific genomic DNA fragment 102 containing the complementary sequences. After hybridization, a single-stranded target-specific structure is formed which has 5′ single stranded extension 111 and 3′ single stranded extension 112 of the original genomic target single stranded DNA (
Once the circle is complete, universal PCR primers 110 can be used to amplify the intervening target genomic DNA fragment, creating a pool of linear amplicons that can be sequenced (Step 3). The primers are oriented, as shown in
As shown by arrows 110 in
A variety of buffers can be used with the present compositions. They can contain, e.g. 100 mM Tris-Cl, 500 mM KCl; 600 mM Tris-Cl, 170 mM (NH4)2SO4, 0.1% Tween-20; 375 mM Tris-Cl, 200 mM (NH4)2SO4, 0.1% Tween-20, etc.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by those of ordinary skill in the art to which this invention belongs. Although any methods and materials similar or equivalent to those described herein can be used in the practice or testing of the present invention, the preferred methods and materials are described. Generally, nomenclatures utilized in connection with, and techniques of, cell and molecular biology and chemistry are those well-known and commonly used in the art. Certain experimental techniques, not specifically defined, are generally performed according to conventional methods well known in the art and as described in various general and more specific references that are cited and discussed throughout the present specification. For purposes of clarity, the following terms are defined below.
Ranges:
For conciseness, any range set forth is intended to include any sub-range within the stated range, unless otherwise stated. A sub-range is to be included within a range even though no sub-range is explicitly stated in connection with the range. As a nonlimiting example, a range of 120 to 250 includes a range of 120-121, 120-130, 200-225, 121-250 etc. The term “about” has its ordinary meaning of approximately and may be determined in context by experimental variability. In case of doubt, “about” means a variation within 5% of a stated numerical value.
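As a purely illustrative aid, the two conventions above (sub-ranges and the fallback meaning of “about”) can be expressed as simple checks. The helper names `is_subrange` and `is_about` are hypothetical, and the 5% tolerance is applied relative to the stated value, which is one reasonable reading of the definition.

```python
def is_subrange(lo: float, hi: float, range_lo: float, range_hi: float) -> bool:
    """True if [lo, hi] lies entirely within the stated range [range_lo, range_hi]."""
    return range_lo <= lo <= hi <= range_hi


def is_about(value: float, stated: float, tolerance: float = 0.05) -> bool:
    """True if value is within the fallback tolerance (5%) of the stated value."""
    return abs(value - stated) <= tolerance * abs(stated)


if __name__ == "__main__":
    # Sub-ranges of the 120 to 250 example given in the text.
    print(is_subrange(200, 225, 120, 250))
    print(is_subrange(100, 130, 120, 250))
    # "about 45" under the 5% fallback covers values from 42.75 to 47.25.
    print(is_about(47, 45))
    print(is_about(48, 45))
```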
The term “polynucleotide” corresponds to either double-stranded or single-stranded cDNA or genomic DNA or RNA, containing at least 10 contiguous nucleotides. Single stranded polynucleic acid sequences are always represented in the current invention from the 5′ end to the 3′ end. Polynucleic acids according to the invention may be prepared by any method known in the art for preparing polynucleic acids (e.g. the phosphodiester method for synthesizing oligonucleotides as described by Agarwal et al. (1972), the phosphotriester method of Hsiung et al. (1979), or the automated diethylphosphoramidite method of Beaucage et al. (1981)). Alternatively, the polynucleic acids of the invention may be isolated fragments of naturally occurring or cloned DNA or RNA.
The term “oligonucleotide” refers to a single stranded nucleic acid comprising two or more nucleotides, and less than 300 nucleotides. The exact size of an oligonucleotide depends on the ultimate function or use of said oligonucleotide. For use as a probe or primer the oligonucleotides are preferably about 5-50 nucleotides long.
The oligonucleotides and polynucleotides according to the present invention can be formed by cloning of recombinant plasmids containing inserts including the corresponding nucleotide sequences, if need be by cleaving the latter out from the cloned plasmids upon using the adequate nucleases and recovering them, e.g. by fractionation according to molecular weight. The probes according to the present invention can also be synthesized chemically, e.g. by automatic synthesis on commercial instruments sold by a variety of manufacturers.
The nucleotides as used in the present invention may, in certain aspects, be ribonucleotides, deoxyribonucleotides and modified nucleotides such as inosine or nucleotides containing modified groups which do not essentially alter their hybridisation characteristics. Moreover, it is obvious to the man skilled in the art that any of the below-specified probes can be used as such, or in their complementary form, or in their RNA form (wherein T is replaced by U).
The oligonucleotides used as primers or probes may also comprise or consist of nucleotide analogues such as phosphorothioates (Matsukura et al., 1987), alkylphosphorothioates (Miller et al., 1979) or peptide nucleic acids (Nielsen et al., 1991; Nielsen et al., 1993) or may contain intercalating agents (Asseline et al., 1984).
The term “probe” refers to single stranded sequence-specific oligonucleotides which have a sequence which is sufficiently complementary to hybridize to the target sequence to be detected. Preferably said probes are 70%, 80%, 90%, or more than 95% homologous to the exact complement of the target sequence to be detected. These target sequences are either genomic DNA or messenger RNA, or amplified versions thereof. Preferably, these probes are about 5 to 50 nucleotides long, more preferably from about 10 to 30 nucleotides.
The term “hybridizes to” refers to preferably stringent hybridization conditions, allowing hybridization between complementary nucleic acid sequences showing at least 90%, 95% or more homology with each other.
The term “primer” refers to a single stranded DNA oligonucleotide sequence capable of acting as a point of initiation for synthesis of a primer extension product which is complementary to the nucleic acid strand to be copied. The length and the sequence of the primer must be such that they allow priming of the synthesis of the extension products. Preferably the primer is about 5-50 nucleotides long. Specific length and sequence will depend on the complexity of the required DNA or RNA targets, as well as on the conditions of primer use such as temperature and ionic strength. The fact that amplification primers do not have to match exactly with the corresponding template sequence to warrant proper amplification is amply documented in the literature. The amplification method used can be either polymerase chain reaction, target polynucleotide amplification methods such as self-sustained sequence replication (3SR) and strand-displacement amplification (SDA); methods based on amplification of a signal attached to the target polynucleotide, such as “branched chain” DNA amplification; methods based on amplification of probe DNA, such as ligase chain reaction (LCR) and Qβ replicase amplification (QBR); transcription-based methods, such as ligation activated transcription (LAT), nucleic acid sequence-based amplification (NASBA), amplification under the trade name INVADER, and transcription-mediated amplification (TMA); and various other amplification methods, such as repair chain reaction (RCR) and cycling probe reaction (CPR). Preferred methods can be multiplexed, i.e. a number of amplifications of different sequences can be run in the same reaction mixture at the same time.
The term “complementary” nucleic acids as used in the current invention means that the nucleic acid sequences can form a perfect base paired double helix with each other.
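For illustration, perfect complementarity and the percentage figures used in the probe definition can be computed as sketched below. This is a simplified model rather than a description of hybridization chemistry: “homology” is treated as position-by-position identity between equal-length DNA sequences, and the function names are hypothetical.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the exact complement of a DNA sequence, written 5'->3'."""
    return seq.upper().translate(_COMPLEMENT)[::-1]


def is_complementary(a: str, b: str) -> bool:
    """True if a and b can form a perfectly base paired duplex."""
    return a.upper() == reverse_complement(b)


def percent_identity_to_complement(probe: str, target: str) -> float:
    """Percent of probe positions that match the exact complement of the target."""
    if len(probe) != len(target):
        raise ValueError("this sketch compares equal-length sequences only")
    exact = reverse_complement(target)
    matches = sum(p == e for p, e in zip(probe.upper(), exact))
    return 100.0 * matches / len(probe)


if __name__ == "__main__":
    target = "AAACCCGGGT"
    print(reverse_complement(target))
    print(is_complementary("ACCCGGGTTT", target))
    # One mismatch at the last of ten positions.
    print(percent_identity_to_complement("ACCCGGGTTA", target))
```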
The term “FFPE” refers to formalin-fixed, paraffin-embedded (FFPE) tissue samples. Commercial solutions of formaldehyde in water are commonly called formalin. Formalin preserves or fixes tissue or cells by irreversibly cross-linking primary amino groups in proteins with other nearby nitrogen atoms in protein or DNA through a —CH2— linkage.
Tissue samples are typically placed into molds along with liquid embedding material (such as agar, gelatine, or wax) which is then hardened. This is achieved by cooling in the case of paraffin wax and heating (curing) in the case of the epoxy resins. The acrylic resins are polymerised by heat, ultraviolet light, or chemical catalysts. The hardened blocks containing the tissue samples are then ready to be sectioned.
Another aldehyde that can be used for fixation is glutaraldehyde. It operates in a similar way to formaldehyde by causing deformation of the alpha-helix structures in proteins. However, glutaraldehyde is a larger molecule, and so its rate of diffusion across membranes is slower than formaldehyde.
Samples that may be used in the present invention include medical samples, forensic samples, museum or archeological samples, and other archival collections, which need not be FFPE preserved. There are many preservation methods that have been applied to tissues, including alcohol preservation, formalin treatment, freezing and sequestration in waxes and other materials. In addition, forensic or archeological samples may contain degraded ssDNA that has not been consciously preserved at all.
The term “5′ exonuclease” or “5′ end nuclease” refers to an enzyme that has activity in the 5′ to 3′ direction to remove a single stranded DNA portion having a 5′ end. It may do this through exonuclease or endonuclease activity, i.e. cleavage at a point where the ssDNA separates from its complementary strand. The 5′ exonuclease enzymes used herein preferably degrade single stranded DNA, not double stranded DNA. The preferred 5′ exonuclease is a DNA polymerase that has the ability to cleave a DNA hairpin where a 5′ end of DNA to be cleaved is a single strand adjacent to a double strand, which may result from formation of an exogenous duplex, such as hybridization to a primer. For details, see Lyamichev et al. “Structure-Specific Endonucleolytic Cleavage of Nucleic Acids by Eubacterial DNA Polymerases,” Science 260:778-783 (1993), describing this activity in DNAP-Ecl and DNAP-Taq (from Thermus aquaticus) polymerases.
The term “3′ exonuclease” or “3′ end nuclease” refers to an enzyme having activity in the 3′ to 5′ direction to remove a single stranded DNA portion having a 3′ end. As with the 5′ exonuclease, the enzyme will only act on ssDNA and may do this by either exonuclease or endonuclease activity. This activity is found as DNA proofreading in certain DNA polymerases. It allows the enzyme to check each nucleotide during DNA synthesis, and excise mismatched nucleotides in the 3′ to 5′ direction. The proofreading domain also enables a polymerase to remove unpaired 3′ overhanging nucleotides to create blunt ends. Protocols such as high-fidelity PCR, 3′ overhang polishing and high-fidelity second strand synthesis require the presence of a 3′→5′ exonuclease.
The preferred 3′ exonuclease is Exo I. Exonuclease I (Exo I), the product of the sbcB gene of E. coli, is an exodeoxyribonuclease that hydrolyzes single-stranded (ss)DNA stepwise in a 3′ to 5′ direction. Hydrolysis generates deoxyribonucleoside 5′-monophosphates and a terminal dinucleotide diphosphate. The enzyme requires magnesium (optimal Mg++ concentration is 10 mM) and the presence of a free 3′-hydroxyl terminus. Exonuclease I is active under a wide variety of buffer conditions, allowing addition of the enzyme directly into most reaction mixes. Heat inactivation results from incubation at 80° C. for 15 minutes.
The term “ligase” refers to an enzyme that catalyzes formation of a phosphodiester bond between the 5′ phosphate of one strand of DNA and the 3′ hydroxyl of the other. This enzyme is used to covalently link or ligate fragments of DNA together. An example of a DNA ligase is one derived from the T4 bacteriophage. T4 DNA ligase requires ATP as a cofactor. The presently preferred ligase is Ampligase® ligase (registered trademark of Epicentre Technologies), a thermostable DNA ligase that catalyzes NAD-dependent ligation of adjacent 3′-hydroxylated and 5′-phosphorylated termini in duplex DNA structures that are stable at high temperatures.
For convenience, certain polynucleotides are referred to herein as “capture probes,” meaning single stranded polynucleotides of relatively small size, e.g. 40-4000 bases, which are prepared (e.g. synthetically) to contain defined features. These include certain “universal” sequences, which are so designated because they are essentially identical as between different polynucleotides designed for the stated purpose, whereas other sequences in the capture probes will vary among a number of different possibilities to capture different targets. That is, the capture probes contain a “universal probe sequence” which contains a single sequence common to all capture probes. In this way, the “universal polynucleotides” may have a single sequence that is complementary to the universal sequence in the capture probes.
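The three part construction of a capture probe, and its relationship to the universal polynucleotide, can be sketched as a small data structure. This is an illustrative model only: the class and function names are hypothetical, the sequences are toy values far shorter than real probes, and the universal polynucleotide is modeled simply as the reverse complement of the universal probe sequence.

```python
from dataclasses import dataclass
from typing import Iterable

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the exact complement of a DNA sequence, written 5'->3'."""
    return seq.upper().translate(_COMPLEMENT)[::-1]


@dataclass(frozen=True)
class CaptureProbe:
    arm_5p: str      # target-specific 5' capture arm
    universal: str   # invariant ("universal") probe sequence
    arm_3p: str      # target-specific 3' capture arm

    @property
    def sequence(self) -> str:
        """Full probe sequence, written 5'->3'."""
        return self.arm_5p + self.universal + self.arm_3p


def shares_universal(probes: Iterable[CaptureProbe]) -> bool:
    """True if every probe in the pool carries the same universal sequence."""
    return len({p.universal for p in probes}) <= 1


if __name__ == "__main__":
    universal = "GATTACAGATTACA"  # toy invariant sequence
    pool = [
        CaptureProbe("ACGTACGT", universal, "TTGGCCAA"),
        CaptureProbe("CCCCAAAA", universal, "GGGGTTTT"),
    ]
    print(pool[0].sequence)
    print(shares_universal(pool))
    # A single universal polynucleotide pairs with every probe in the pool.
    print(reverse_complement(universal))
```

Because the universal sequence is shared, a single universal polynucleotide species serves every probe in the pool, while only the arms vary from target to target.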
Genomic DNA from NA18507 was obtained from Coriell Cell Repositories. Intestinal tissue samples were obtained under an IRB protocol approved by Stanford University. These samples were either immediately snap frozen in liquid nitrogen and stored at −80° C. or preserved as formalin-fixed, paraffin-embedded (FFPE) blocks. Total nucleic acids were extracted from the flash-frozen tissue using the SQ DNA/RNA/Protein Kit from Omega Bio-Tek. Following complete RNase A digestion, the DNA (herein referred to as dsDNA) was analyzed by agarose gel electrophoresis and quantified by a fluorescence assay using SYBR Gold (Invitrogen). For FFPE samples, DNA was isolated using the BiOstic® FFPE Tissue DNA Isolation Kit from Mo Bio Laboratories. The quantity and quality of the preparations were assessed by OD260 and qPCR analysis across 3 different genomic loci. Only single stranded DNA (ssDNA) samples with a difference in Ct values of equal to or less than 4.0 or approximately 15% genome equivalence between the flash-frozen and FFPE samples were used for subsequent analysis.
Capture polynucleotides with the properties optimal for FFPE capture were chosen from a larger, previously described set (Natsoulis et al. 2011, Ref. 9). As disclosed there, the oligonucleotide sequences can be downloaded from the Human OligoExome, a database which provides gene exons annotated by the Consensus Coding Sequence (CCDS) project. The database is available at oligoexome.Stanford.edu. From this set, 628 capture oligonucleotides resulting in amplicons ranging from 150 to 250 bp were chosen. A total of 2,512 sequences, comprising the 5′ targeting arm, 3′ targeting arm, amplicon, and target oligonucleotide for each of the 628 capture oligonucleotides, were compiled. Targeting arms were positioned in regions without SNPs per dbSNP. Details on the design parameters and on the capture characteristics of the targeting arms are provided by Natsoulis et al. [9].
The accompanying sequence listing sets forth the sequences of the 5′ targeting arm, the 3′ targeting arm, the amplicon sequence and the universal oligonucleotide used, including uridine substitutions, for the 628 capture probes used in the examples. In the table below, column 1 is the chromosome number targeted; column 2 is the position of the 5′ end of the targeted sequence; column 3 is the polarity of the targeted strand; column 4 (SEQ ID NOs) is the sequence of the 20 bp 5′ targeting arm; column 5 (SEQ ID NOs) is the sequence of the 3′ 20 bp selector; column 6 (SEQ ID NOs) lists the sequences of the amplicons; column 7 (SEQ ID NOs) lists the sequences of the targeting oligonucleotides (“universal probes”) including uridine substitutions; and column 8 is the identifier (which may also be checked at the Stanford OligoExome web site).
The 5′ end and the 3′ end of the capture oligonucleotides were blocked and did not contain phosphate or hydroxyl groups, and 10 thymines were substituted with uracils to facilitate fragmentation and purification of the splint oligonucleotides after circularization. All oligonucleotides were synthesized at the Stanford Genome Technology Center (Stanford, Calif.). In an alternative design we substituted the central 40 bp of the capture oligonucleotide with a sequence comprising the Illumina® sequencer adapter sequence. This has the advantage of creating amplicons ready for sequencing in a single amplification reaction, thus greatly facilitating the workflow. Illumina® adapter sequences are available to anyone using their products; they are approximately 35 bases long and are designed to allow attachment of the DNA to be sequenced to the surface of the flow cells used. Other sequencing systems would use other adapters.
High quality genomic DNA from flash-frozen tissues was first sonicated for 10 minutes in the Bioruptor to a size of 500-1000 bp. The hybridization reactions contained 0.5 μg dsDNA or 3-4 μg ssDNA and 50 pM of each of the capture oligonucleotides. After a brief denaturation step, the mixture was incubated in the PCR machine using a touchdown protocol ranging from 70-50° C. with 30-60 minutes for each step. Then a mixture of the cleavage enzymes (ExoI and Taq) and circularization enzyme (Ampligase or Taq ligase) was added to each tube and the reactions were incubated for 1 hour at 37° C. followed by a touchup protocol from 50-72° C. for 30 minutes at each step. Excess oligonucleotides in the reactions were cleaved by uracil excision. After a brief purification using the Spin-20 columns, the captured DNA fragments were amplified using the high-fidelity Phusion polymerase and either the generic primer (e.g. ID 102) [9] or Illumina PE-primers for 38-39 cycles. The PCR products were purified using the Fermentas kit.
The captured target DNA amplified with the generic PCR primers was ligated to PE-adapters after “A-tailing” and gel purified. It was then amplified for 10-12 cycles using the PE primers and re-purified from agarose gel. DNA fragments captured with built-in PE primer sites were first purified away from the primer-dimers by gel electrophoresis and re-amplified for 5 cycles using the short PE primers. After quantitation by the SYBR based fluorescence assay, the libraries were sequenced on Illumina HiSeq or GAIIx using standard conditions.
10 pM of PCR amplified library and 1.5 pM of circularized DNA were sequenced using the Illumina Genome Analyzer IIx. Circular library obtained from 1 μg of starting material was introduced to the sequencing experiment. After sample dilution using hybridization buffer, 20% of the prepared sample (representing 200 ng of starting material) was hybridized in the flow cell.
Sequence reads were aligned to the human genome version hg19 using ELAND software. The target regions were defined as the ranges from each target specific site to 41 bases upstream or downstream of it (depending on the orientation of the capture oligonucleotide). The interval of 41 bases was selected because the read length in these experiments was 42. In a paired-end experiment the target region contained both ends of the circularized fragments, while single-read sequencing targeted only the 3′ ends of the circularized fragments. To assess the specificity of the capture, the numbers of sequence reads mapping inside and outside the target region were compared. To illustrate the uniformity of the assay, the reads that aligned perfectly with the specific capture sequences were counted. Read counts were then sorted and normalized using the median sequence yield value from each experiment. The genomic distance between the target specific sites indicates the circle size. In addition, guanine and cytosine proportions within the target sites were determined. The present capture oligonucleotide contains two target specific sites and each site was analyzed separately. To analyze the annealing properties during the circularization-hybridization reaction, the target specific sites within a single capture oligonucleotide were classified as high or low G+C. Circle sizes and G+C proportions were then plotted against the sequence yields for each oligonucleotide.
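By way of illustration only, the read-level analysis described above might be sketched in Python as follows. This is a minimal sketch rather than the software actually used: the function names, the coordinate convention (inclusive positions), and the assumption that a "+" orientation extends the region downstream while a "-" orientation extends it upstream are all hypothetical choices made for the example.

```python
from statistics import median

READ_LENGTH = 42          # read length stated in the text
FLANK = READ_LENGTH - 1   # 41-base interval stated in the text


def target_region(site_pos, orientation):
    """Inclusive (start, end) of the target region for one target specific site.

    Assumption: "+" extends 41 bases downstream, "-" extends 41 bases upstream.
    """
    if orientation == "+":
        return (site_pos, site_pos + FLANK)
    return (site_pos - FLANK, site_pos)


def on_target_fraction(read_positions, regions):
    """Fraction of reads whose mapped position lies inside any target region."""
    if not read_positions:
        return 0.0
    hits = sum(any(start <= pos <= end for start, end in regions)
               for pos in read_positions)
    return hits / len(read_positions)


def normalize_by_median(read_counts):
    """Sort per-oligonucleotide read counts and divide each by the median yield."""
    med = median(read_counts)
    if med == 0:
        raise ValueError("median yield is zero; cannot normalize")
    return [count / med for count in sorted(read_counts, reverse=True)]


def gc_fraction(sequence):
    """Proportion of G and C bases in a target specific site."""
    sequence = sequence.upper()
    if not sequence:
        return 0.0
    return (sequence.count("G") + sequence.count("C")) / len(sequence)
```

In such a sketch, specificity would be read from `on_target_fraction`, and uniformity from the spread of the values returned by `normalize_by_median`.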
In a proof of principle experiment, we used a set of previously described capture oligonucleotides [9]. Because we had determined that amplicon size was an important parameter for this type of selective circularization, we chose a subset of 628 capture oligonucleotides, each targeting a 150-250 base region. The assay targets a total of 123,982 bases. We compared the yield and the reproducibility of targeting reactions using DNA extracted from either fresh frozen tissue or FFPE blocks of three individuals. Both fresh frozen and FFPE samples are derived from normal colon according to the pathology reports.
The resulting capture amplicons from matched genomic DNA samples derived from either flash-frozen or FFPE material were concatenated using T4 DNA ligase and mechanically fragmented prior to library preparation. Replicate sequencing was conducted in triplicate to identify sequencing specific errors. The fragmented amplicons were ligated to 4-plex paired-end indexing adapters for two samples from individuals 751 and 761 [13]. The four libraries were combined and sequenced in three separate lanes of an Illumina GAIIx sequencer. For matched samples from individual 780, paired end sequencing was conducted on both the flash-frozen tissue and FFPE derived material in separate full sized lanes. Sequence reads were aligned to the human genome reference. Given the replicate sequencing and matched samples, there were a total of 14 separate sequencing data sets. Each was analyzed separately (Table 1).
Overall, the sequence coverage is very reproducible among the replicates for each individual's samples. As noted in Table 1, the sequence coverage at 10× depth ranges from 79% to 92% and is 5 to 10% lower for the FFPE derived samples than for the flash-frozen tissue derived samples. The uniformity of capture between the two types of starting material and for all three patients' DNA was compared (
The sensitivity of detection of heterozygote SNVs in the targeted resequencing from FFPE versus flash-frozen derived DNA was determined. As described previously, SNV calling from each dataset was conducted [9]. The results of previously published analysis were advantageously used, demonstrating that the variant calling accuracy improves when relying on calls that can be established from both the forward and reverse strands (i.e. double-stranded) [9]. Of the 83 heterozygotes in high quality genomic DNA from flash-frozen tissue, 71 were also called from the FFPE-derived DNA for individual 751 (85%). Similar sensitivity values for the other two patients (84% and 85% respectively for individuals 761 and 780) were obtained.
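For illustration, the sensitivity comparison and the double-stranded calling requirement described above can be sketched as simple set operations. This is a hedged sketch, not the published calling method of Ref. 9: the function names and the representation of each call as a hashable position are assumptions made for the example.

```python
def double_stranded_calls(forward_calls, reverse_calls):
    """Keep only variant positions called from both forward and reverse reads."""
    return set(forward_calls) & set(reverse_calls)


def sensitivity(reference_calls, test_calls):
    """Fraction of reference (flash-frozen) calls also found in the test (FFPE) set."""
    reference = set(reference_calls)
    if not reference:
        raise ValueError("reference call set is empty")
    return len(reference & set(test_calls)) / len(reference)


# Hypothetical positions standing in for the 83 heterozygotes from flash-frozen
# DNA, of which 71 are also called from the FFPE-derived DNA.
frozen_calls = set(range(83))
ffpe_calls = set(range(71))
ratio = sensitivity(frozen_calls, ffpe_calls)  # 71 / 83
```

With the counts given for individual 751, the ratio is 71/83, in line with the approximately 85% sensitivity reported.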
Given that matched samples from normal tissue of the same individual are used, differences in the SNV-calling results between FFPE and flash-frozen derived DNA are attributable to FFPE-induced damage. Sequencing-related errors were eliminated based on the triplicate resequencing of each sample. As previously published, a straightforward statistical method to identify differences between matched samples, which was previously applied to normal-tumor pairs [9], was used. At any given sequence position, the present method requires that the difference in the second most frequent bases between the two samples exceed 10% for both forward and reverse strand aligning reads. The 14 datasets were analyzed as seven matched pairs comparing sequence data from matched FFPE versus flash-frozen derived genomic DNA samples. The analysis yielded an average of 10.2 FFPE-specific calls (standard deviation being 4.2) per pair within the 102 Kb target (N=73 total positions for all pairs representing 45 unique positions). This results in one false positive call per every 12 Kb of targeted DNA. The FFPE-specific calls are replicated amongst the datasets that were sequenced in triplicate (patients 751 and 761), indicating that these errors were not attributable to the sequencing chemistry or processing but were inherently found in the FFPE-derived DNA. There was no overlap between patients amongst these FFPE specific calls.
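The matched-sample criterion described above may be sketched as follows. This is one possible reading of the criterion, offered as an assumption: the frequency of the second most frequent base is computed per strand in each sample, and a position is flagged only when the two samples differ by more than 10% on both strands. The data layout (per-strand dictionaries of base counts) and the function names are hypothetical.

```python
def second_base_fraction(base_counts):
    """Fraction of reads supporting the second most frequent base at a position."""
    total = sum(base_counts.values())
    if total == 0:
        return 0.0
    ranked = sorted(base_counts.values(), reverse=True)
    second = ranked[1] if len(ranked) > 1 else 0
    return second / total


def is_sample_specific(sample_a, sample_b, min_diff=0.10):
    """True if the second-base fraction differs by more than min_diff on BOTH strands.

    Each sample is {"fwd": {base: count}, "rev": {base: count}} (assumed layout).
    """
    for strand in ("fwd", "rev"):
        diff = abs(second_base_fraction(sample_a[strand])
                   - second_base_fraction(sample_b[strand]))
        if diff <= min_diff:
            return False
    return True


# Hypothetical counts at one position in a matched FFPE / flash-frozen pair.
ffpe = {"fwd": {"C": 70, "T": 30}, "rev": {"G": 75, "A": 25}}
frozen = {"fwd": {"C": 99, "T": 1}, "rev": {"G": 100}}
flagged = is_sample_specific(ffpe, frozen)
```

A position supported on only one strand would not be flagged under this sketch, which mirrors the double-stranded requirement stated in the text.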
The pattern of FFPE-specific substitution errors was examined (Table 2). For substitutions, there are twelve combinations when considering all possibilities. Thirty-one changes were transitions and 14 were transversions. Only 4 categories of substitutions among the 12 different substitutions were observed. This represented 44 out of the 45 observed cases. Nearly all of the observed changes obey the consensus G or C→A or T. The C→T and G→A transitions are compatible with cytosine deamination, which is a common FFPE processing artifact [10].
Table 2. FFPE-specific substitution errors (values recovered from the table: 0, 12, 18, 1).
The above table shows that the chemical treatment involved in the FFPE process causes far fewer single base changes than are normally observed between individuals in the form of SNPs. Further, these chemical modifications are predictable, most likely being G→A or C→T. This means that the present methodology can be useful in an SNP analysis of genomic DNA from an FFPE sample.
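As an illustrative aside, the classification of single base substitutions into transitions and transversions, and the test for compatibility with cytosine deamination, can be sketched as below. The function names are hypothetical; the purine/pyrimidine definitions are standard.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}
BASES = PURINES | PYRIMIDINES

# All twelve possible single base substitutions.
ALL_SUBSTITUTIONS = [(ref, alt) for ref in "ACGT" for alt in "ACGT" if ref != alt]


def substitution_class(ref, alt):
    """Return "transition" (purine<->purine or pyrimidine<->pyrimidine) or "transversion"."""
    ref, alt = ref.upper(), alt.upper()
    if ref not in BASES or alt not in BASES or ref == alt:
        raise ValueError("expected two different bases from A, C, G, T")
    same_class = (ref in PURINES) == (alt in PURINES)
    return "transition" if same_class else "transversion"


def is_deamination_compatible(ref, alt):
    """True for C->T and its reverse-strand equivalent G->A."""
    return (ref.upper(), alt.upper()) in {("C", "T"), ("G", "A")}


transitions = [s for s in ALL_SUBSTITUTIONS if substitution_class(*s) == "transition"]
```

Of the twelve possible substitutions, four are transitions and eight are transversions, and only C→T and G→A are deamination-compatible in this sketch.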
It is noted that just one position per 12 kb of targeted sequence resulted in an FFPE-specific call that passed the statistical significance cutoff and was found in both the forward and reverse strands of the captured sequence. From either FFPE or flash-frozen derived genomic DNA, a number of positions had suggestions of a variant but were typically seen on only the forward or the reverse strand. Using the variant calling method which imposes double-stranded representation, these positions were effectively eliminated as false positive calls (
Having obtained promising results from the initial capture oligonucleotides, an improved bioinformatic pipeline for in silico capture oligonucleotide design was developed. The present design process optimizes the placement of the targeting arms according to the following considerations: (1) placing the 20 bp targeting arms in positions that are unique over the genome and that have no single mismatch neighbor, (2) identifying capture arms with GC content between 30% and 60%, and (3) keeping the size distribution of the target genomic regions at approximately 220 bases in length. The new design process was applied to targeting 80 exons from six cancer genes. A total of 288 capture oligonucleotides were synthesized for this six gene capture assay and these pooled oligonucleotides were used on three matched normal and tumor samples from the same individual. One DNA sample was obtained from flash-frozen tumor tissue, one sample was obtained from an FFPE section and a third, normal DNA sample was obtained from peripheral lymphocytes. Significantly improved performance metrics were noted using these optimized capture parameters.
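The targeting arm criteria listed above could be checked along the following lines. This is a simplified sketch under stated assumptions, not the design pipeline itself: the function names are hypothetical, and the genome-wide uniqueness search is reduced to a comparison against a caller-supplied collection of other 20-mers, which a real pipeline would obtain from an indexed genome.

```python
ARM_LENGTH = 20              # targeting arm length stated in the text
MIN_GC, MAX_GC = 0.30, 0.60  # GC content window stated in the text


def gc_content(seq):
    """Proportion of G and C bases in a sequence."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def hamming(a, b):
    """Number of mismatching positions between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a.upper(), b.upper()))


def has_close_neighbor(arm, other_kmers):
    """True if any other k-mer is identical to the arm or differs by one base."""
    return any(len(k) == len(arm) and hamming(arm, k) <= 1 for k in other_kmers)


def arm_passes(arm, other_kmers=()):
    """Apply the length, GC window, and uniqueness / single-mismatch checks."""
    if len(arm) != ARM_LENGTH:
        return False
    if not MIN_GC <= gc_content(arm) <= MAX_GC:
        return False
    return not has_close_neighbor(arm, other_kmers)
```

For example, `arm_passes("ACGTACGTACGTACGTACGT")` is accepted on GC content (50%), whereas the same arm is rejected when `other_kmers` contains a sequence differing from it at a single position.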
Further optimization of the present process was carried out to show the amplicon length obtained at different temperatures with the 628 capture oligonucleotides used. Annealing temperatures ranging from 50 deg. to 60 deg. showed no size bias across amplicon lengths of 150-250 bp. An annealing temperature of 50 deg. was shown to yield a higher number of amplified targets. Also, consistent coverage across the amplicon lengths between 150 and 250 bp was shown. It was also shown that the process was tolerant of hairpin structures that can form in ssDNA that is being captured by the present capture probes.
As another novel feature, the sequencing library adapter sequences were incorporated into the universal vector sequence. This enabled a sequencing-ready library to be generated with a single amplification step, thus significantly reducing the complexity of the workflow used for next generation sequencing instruments such as the Illumina HiSeq, GAIIx, MiSeq, Life Technologies SOLiD, Ion Torrent, Pacific Biosciences system and the Roche 454 sequencer, among others.
The present compositions may be provided in kit form, comprising a set of capture probes and universal oligonucleotides. Primers and a polymerase for amplification may also be included in the kit.
The above specific description is meant to exemplify and illustrate the invention and should not be seen as limiting the scope of the invention, which is defined by the literal and equivalent scope of the appended claims. Any patents or publications mentioned in this specification are intended to convey details of methods and materials useful in carrying out certain aspects of the invention which may not be explicitly set out but which would be understood by workers in the field. Such patents or publications are hereby incorporated by reference to the same extent as if each was specifically and individually incorporated by reference and contained herein, as needed for the purpose of describing and enabling the method or material referred to.
This application claims priority from U.S. Provisional Patent Application No. 61/560,412 filed on Nov. 16, 2011, which is hereby incorporated by reference in its entirety.
This invention was made with government support under contracts 2P01HG000205 and R21CA 140089-01A1 awarded by the National Institutes of Health. The government has certain rights in this invention.
Number | Date | Country
---|---|---
61560412 | Nov 2011 | US