Allele profiling (see US Publ'n No. 2015/0376693; U.S. Pat. No. 9,133,567; U.S. Ser. No. 15/202,035, all incorporated by reference) employs two reactions: the first a PCR (or other) amplification (“A”) reaction, and the second a discrimination (“D”) reaction. By incorporating source tags in first reaction products, allele profiling permits the pooling of these products as inputs to the second reaction to form a multiplicity of allele-specific second products that also incorporate marker tags identifying alleles. Allele profiling permits the simultaneous analysis of multiple alleles, over multiple exons, for multiple samples.
Sieving of nucleic acids (see US Publ'n No. 2015/0315568; U.S. Pat. No. 8,932,989, both incorporated by reference), rather than delivering results for all samples in a given sample set, permits the selection, from that set, of nucleic acids having (or not having) desired alleles or allele patterns. As with allele profiling, sieving relies on the incorporation of source and marker tags, into respective first and second reaction products.
Allele profiling and sieving, in addition to providing for the pooling of multiple first products into pooled second reactions, also provide for combining multiple samples in pooled first reactions. As the resulting pooled pools may leave unresolved ambiguities, these methods also provide a disambiguation step to identify allele configurations of samples in pools, or pools of pools, with unresolved allele configurations of interest (see US Publ'n Nos. 2015/0376693; 2015/0315568; U.S. Pat. Nos. 9,133,567; 8,932,989).
Sanger sequencing, in its prevalent commercial embodiment, comprises a PCR amplification reaction, to produce a single reaction product for each of usually two amplified alleles, a cleanup step to remove unused primers and dNTPs, followed by a “labeling” reaction to produce second products, in the form of fragments of all lengths, each terminated by (random) incorporation of base-specifically labeled, di-deoxy nucleotides. These second reaction products are then analyzed by capillary electrophoresis such that heterozygous samples produce two distinct peaks at any variable sequence position in which alleles differ in composition.
The present invention discloses an embodiment of allele profiling, or nucleic acid sieving, with pooled Sanger sequencing as a first (aka “screening”) stage. As with Sanger sequencing generally, this embodiment comprises amplifying a single sequence, delineated by forward and reverse primers which may represent a single exon, or a segment thereof, or a contiguous stretch of multiple exons and introns. The amplification products (“amplicons”) produced from a pool of samples each comprise the amplified sequence, and these are next converted into fragments in the standard Sanger labeling reaction. Ambiguities will arise in the subsequent analysis in the form of superposed peaks at any heterozygous position of interest, as the origin of the variant signal is not uniquely attributable to a specific sample, or samples, in the pool. These ambiguities may be resolved by the allele profiling process described in US Publ'n No. 2015/0376693; U.S. Pat. No. 9,133,567; and U.S. Ser. No. 15/202,035, with a smaller number, d, of samples per pool of first products, wherein d is an integer and d≥1, and wherein the pool of origin is identified by incorporating source tags into the first reaction products, to permit pooling of these first reaction products (as described for allele profiling in the foregoing references).
This present embodiment of allele profiling (and sieving) using pooled Sanger sequencing has the advantage that it can detect unanticipated variable sites in the amplified sequence. It retains the advantage of efficient disambiguation by the use of source tags enabling the pooling of first reaction products as inputs for the second (“discrimination”) reaction.
In a further embodiment, the invention provides for pooled Sanger sequencing of amplicons incorporating source tags to generate shifted sequence traces from pooled samples that permit the identification of alleles for at least two combined Sanger labeling reaction products. Source tags may differ in length or in base composition so as to induce known shifts in the predicted peak positions of interest, thereby permitting disambiguation, as shown herein.
The patent or application file contains at least one drawing executed in color. Copies of this patent or patent application publication with color drawing(s) will be provided by the Office upon request and payment of the necessary fee.
The term “variable site of interest” is defined as a polymorphic site or SNP; or an insertion or deletion mutation.
The term “disambiguating” means resolving an ambiguity, which occurs if, based on the results in question, at least one sample in a particular pool cannot be identified as either normal or variant at a variable site of interest. Disambiguation encompasses any method of sequencing or genotyping or allele identification, including but not limited to using allele-specific primers.
The term “reaction steps” refers to steps involved in amplifying, labeling, or extending a primer, amplicon or oligonucleotide chain.
The present invention provides for the use of pooled Sanger sequencing as the first (“screening”) stage in allele profiling or sieving which may comprise a second (“disambiguation”) stage, as illustrated, for two samples, in
In a first embodiment, the present invention provides for combining two or more amplicons in a Sanger labeling (aka “cycle sequencing”) reaction using differentially fluorescently labeled dideoxy-nucleotides, and subsequent analysis of these pooled labeled products, preferably by capillary electrophoresis. As with standard Sanger sequencing, a single (contiguous) sequence (typically comprising a single exon) is analyzed. As with allele profiling and nucleic acid sieving, samples that contain at least one variant allele in one or more (known or unknown) positions in the sequence generally will introduce an ambiguity associated with two or more superimposed peaks at variable positions in the sequence.
However, contrary to the general practice of discarding as contaminated “double sequence” data (see “DNA Sequencing Troubleshooting Guide” by Eurofins Genomics, available on its website; “DNA SEQUENCING SANGER: TECHNICALS SOLUTIONS GUIDE” by Secugen, available on its website) the superimposed sequence traces generated by pooled samples may be decoded, so as to achieve disambiguation, as elaborated herein.
The process of the present invention will be a useful complement to the previously disclosed processes of allele profiling and nucleic acid sieving for the analysis of alleles and mutations, especially if all or the most prevalent mutations or alleles of interest reside on a single exon, as they do, in the case of mutations, for example, for β-thalassemia or for cystic fibrosis, or, in the case of alleles, for the polymorphic genes encoding the human leukocyte antigens (“HLA”) or the Rh antigens. More generally, the present invention also will be useful for analyzing a single amplicon (or other construct or reaction products) comprising multiple exons or sections thereof. Such amplicons may be generated by amplification with primers flanking the region in the sequence comprising a variable site, or multiple variable sites, of interest. Either of these flanking primers may be used in the subsequent Sanger labeling reaction of the combined samples, as illustrated in
Thus, pooled Sanger sequencing of the β-thalassemia gene for a pool of two samples would produce, for IVSI-5 (G>C), by far the most commonly observed β-thalassemia mutation in Pakistanis (Ansari2011 “Molecular epidemiology of β-thalassemia in Pakistan: far reaching implications” Int J Mol Epidemiol Genet. 2(4): 403-408), the following expected “read” (see also,
The presence of a mutation in the pool will be readily detected, in the form of a “het” signal, characterized here by peaks in two color channels; however, ambiguity generally will remain as to the identity of the sample(s) carrying the mutation, as the following separately pooled configurations: (GG|GC), (GC|GG), (GG|CC), and (CC|GG) all will produce the same heterozygous signature in the “compound” sequence trace (though peak intensities may provide additional information). Similarly, insertions and deletions are readily detected, as illustrated in
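The ambiguity among these configurations can be illustrated with a minimal sketch. It assumes an idealized trace in which every base present anywhere in the pool produces a peak; the function names are hypothetical and are not part of the disclosed process.

```python
def pooled_signature(*genotypes):
    """Set of bases that would show a peak at one position for a pool of diploid genotypes."""
    return frozenset(base for genotype in genotypes for base in genotype)


def allele_fractions(*genotypes):
    """Fraction of each base among all allele copies in the pool (idealized peak intensities)."""
    alleles = [base for genotype in genotypes for base in genotype]
    return {base: alleles.count(base) / len(alleles) for base in sorted(set(alleles))}


# The four two-sample configurations listed in the text
configurations = [("GG", "GC"), ("GC", "GG"), ("GG", "CC"), ("CC", "GG")]

for first, second in configurations:
    signature = "".join(sorted(pooled_signature(first, second)))
    print(first, second, signature, allele_fractions(first, second))
```

All four configurations give the same two-channel signature. The idealized allele fractions differ between pools holding one and two copies of the variant, which corresponds to the remark that peak intensities may provide additional information, but they do not indicate which sample carries the variant.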
Gain in Operational Efficiency—
The probability of encountering at least one heterozygous configuration in a pool of d samples may be estimated from the (assumed known) population frequencies of anticipated variant alleles (as discussed in the allele profiling and sieving references) so as to determine the optimal d, subject to the constraint that the d-fold dilution of samples incurred by pooling will set an upper limit, d≤dmax, to the extent of practical pooling.
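A minimal sketch of this choice of d follows. It assumes the per-pool probability 1−(1−f)^{2d} and the bound on the number of runs given in the next paragraph; the function names and the example values of f and dmax are hypothetical.

```python
def runs_per_sample(d, f):
    """Upper bound on sequencing runs per sample when pooling d samples.

    One pooled run per d samples, plus d individual re-runs for every pool
    expected to show at least one variant among its 2*d alleles.
    """
    if d == 1:
        return 1.0  # no pooling: exactly one run per sample
    return 1.0 / d + (1.0 - (1.0 - f) ** (2 * d))


def best_pool_size(f, d_max):
    """Pool size in 1..d_max that minimizes the bound on runs per sample."""
    return min(range(1, d_max + 1), key=lambda d: runs_per_sample(d, f))


if __name__ == "__main__":
    for f, d_max in [(1 / 30, 4), (0.01, 20)]:
        d = best_pool_size(f, d_max)
        print(f"f={f:.4f}, d_max={d_max}: best d={d}, runs/sample={runs_per_sample(d, f):.3f}")
```

The sketch treats dmax as a given input; in practice it would be set by the dilution and detection limits of the assay.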
In comparison to standard Sanger sequencing for mutation analysis, pooled Sanger sequencing will produce a gain reflecting the reduction in the number, N, of sequencing runs in the single sample format to a number not greater than N/d + d*(N/d)*prob(at least one mutation in d samples), where d denotes the number of samples in a pool. Assuming bi-allelic genes, prob(at least one mutation in 2*d alleles) = 1−(1−f)^{2d}, where f represents the probability that a sample comprises at least one of a set of variant alleles or mutations of interest.
For example, taking the carrier frequency for β-thalassemia to be f=1/30, reflecting the combined abundances of the most commonly observed mutations in South-Asian populations (Ansari2011), the expression yields 0.13, 0.24, 0.33 and 0.42, respectively, for d=2, 4, 6 and 8. Thus, for d=4, 96 β-thalassemia samples, combined into 96/4=24 pools, would yield roughly 0.24*24, or about 6, pools requiring disambiguation; if performed on individual samples (d=1), this scenario would entail performing an additional 0.2375*24*4=22.8 or roughly 23 runs (using the unrounded probability, and assuming, in the worst case, that no two ambiguities are encountered in the same pool); thus, the total number of wells processed would be no greater than 96/4+23=47, a gain of roughly 2 (=96/47).
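The arithmetic of this example can be checked with a short sketch, assuming only the expression 1−(1−f)^{2d} and the bound on runs stated above; the function names are hypothetical.

```python
def prob_pool_ambiguous(f, d):
    """Probability of at least one variant among the 2*d alleles of a pool of d samples."""
    return 1.0 - (1.0 - f) ** (2 * d)


def max_total_runs(n_samples, d, f):
    """Bound on total runs: one per pool, plus d individual re-runs per ambiguous pool."""
    pools = n_samples / d
    return pools + d * pools * prob_pool_ambiguous(f, d)


f = 1 / 30  # carrier frequency assumed in the example
for d in (2, 4, 6, 8):
    print(f"d={d}: prob={prob_pool_ambiguous(f, d):.4f}")

total = max_total_runs(96, 4, f)
print(f"96 samples, d=4: at most {total:.1f} runs, gain {96 / total:.2f}")
```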
Allele profiling (including the use of source tags) would further reduce the number of requisite additional runs, by another factor of 4, to roughly 6 (=23/4), the factor of 4 reflecting the pooling of source-tagged first reaction products (as described in the allele profiling references).
A more detailed comparison would break out gains in first and second reactions. Standard Sanger sequencing would require 96 first reactions (namely: PCR amplification) plus 96 second reactions (namely: labeling), for a total of 192 reactions, the resulting fragments requiring 96 capillaries for analysis. In contrast, assuming d=4, allele profiling with pooled Sanger sequencing as the first stage would require 96/4=24 first reactions and 96/4=24 second reactions, analyzed in 24 capillaries, plus, for disambiguation, 23 first reactions and 6 second reactions, analyzed in 6 capillaries.
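The tally can be reproduced with the following sketch. It assumes that the disambiguation products are pooled d at a time and that the count is rounded up to whole reactions; the rounding and the function names are assumptions made for illustration.

```python
import math


def standard_sanger(n_samples):
    """One amplification, one labeling reaction and one capillary per sample."""
    return {"first": n_samples, "second": n_samples, "capillaries": n_samples}


def pooled_with_profiling(n_samples, d, extra_runs):
    """Pooled screening stage plus disambiguation with source-tagged products.

    extra_runs is the number of individual first reactions needed for
    disambiguation; their products are pooled d at a time (rounded up,
    an assumption) for the second reaction and for electrophoresis.
    """
    pools = n_samples // d
    pooled_second = math.ceil(extra_runs / d)
    return {
        "first": pools + extra_runs,
        "second": pools + pooled_second,
        "capillaries": pools + pooled_second,
    }


standard = standard_sanger(96)
pooled = pooled_with_profiling(96, d=4, extra_runs=23)
print("standard:", standard, "reactions:", standard["first"] + standard["second"])
print("pooled:  ", pooled, "reactions:", pooled["first"] + pooled["second"])
```

Under these assumptions the pooled format uses 77 reactions and 30 capillaries, against 192 reactions and 96 capillaries in the single-sample format.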
Thus, the invention yields a substantial gain in process efficiency for carrier screening given the typically low carrier frequencies for inherited disorders. An example showing the application of pooled Sanger sequencing to molecular sieving for RHCE alleles is given below.
Disambiguation by Allele Profiling—
Ambiguities may be resolved, in accordance with the allele profiling process previously disclosed, by using allele-specific amplification at heterozygous positions of interest, either one at a time or several at a time. In a preferred embodiment, DNA from the constituent samples of ambiguous pools is amplified using one or more pairs of fluorescently labeled primers directed to the alleles at heterozygous positions, paired with source-tagged primers, as illustrated in
These allele-specific amplification reactions may be performed using genomic DNA from individual samples, or a set of amplicons independently generated by random priming of the genome or selected genomic regions of these samples. In the latter case, the amplification of a specific sequence of interest will be accomplished in a small number of cycles generating source-and-marker-tagged products.
Disambiguation by Pooled Sanger Sequencing of Source-Tagged Products—
In a further embodiment of the invention, the Sanger labeling reaction is performed with pools of first reaction products comprising source (“S”) tags, wherein source tags identifying the first reaction products produce predetermined relative shifts in the expected sequence traces, either by changing fragment length or (by one of several methods well known in the art) electrophoretic mobility.
In one embodiment, to change fragment length, the source tags have a common 5′ subsequence which may comprise the entire sequence of the shortest tag, and the “universal” labeling primer is complementary to that subsequence, as illustrated in
In another embodiment of the invention, the source tags differ in composition, by one or more base(s), at the end forming the junction with the gene- or exon-specific primer(s) as in
Thus, given a post-PCR pool of first reaction products, the choice of labeling primer, in accordance with these embodiments, permits either screening of multiple such first products for ambiguities reflecting heterozygous configurations, or disambiguating such configurations, by introducing peak shifts and labeling only a subset of pooled first reaction products.
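The effect of the choice of labeling primer can be sketched in a toy model. The model ignores strand complementarity and treats a primer as annealing when it matches the start of a product; all sequences, tag names and function names are hypothetical and serve only to illustrate the two embodiments.

```python
def read_from(product, primer):
    """Bases read downstream of the primer, or None if the primer does not match.

    Simplification: the primer "anneals" when it equals the 5' start of the product.
    """
    if not product.startswith(primer):
        return None
    return product[len(primer):]


def compound_read(products, primer):
    """Superposed base calls, position by position, for all products the primer labels."""
    reads = [r for r in (read_from(p, primer) for p in products) if r is not None]
    length = max((len(r) for r in reads), default=0)
    return ["".join(sorted({r[i] for r in reads if i < len(r)})) for i in range(length)]


COMMON = "ACGTAC"            # hypothetical common 5' subsequence (the shortest tag)
TAG_LONG = COMMON + "TT"     # hypothetical longer tag: two extra bases
products = [
    COMMON + "GGATCC",       # first sample, toy normal sequence
    TAG_LONG + "GGCTCC",     # second sample, toy variant (A>C at index 2)
]

# Universal primer: both products are labeled; the second read is shifted by two bases
print(compound_read(products, COMMON))
# Tag-specific primer: only the product carrying the longer tag is labeled
print(compound_read(products, TAG_LONG))
```

With the universal primer, the variant base of the second sample appears two positions downstream of its position in the unshifted read; with the tag-specific primer, the trace contains the second sample alone.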
The process of the invention is in contrast to the analysis of Sanger sequence traces comprising signals from multiple samples by decomposition into constituents, with reference to a “dictionary” holding constituent peak patterns (or representations thereof). Superpositions of two or more such peak patterns are then compared to the observed pattern to infer the composition of the mixture, as shown for sequencing of the 16S rRNA gene for mixtures of bacterial pathogens (the Pathogenomix website; see also Kommedal2008 https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2576573/pdf/0213-08.pdf). This approach is akin to that of decomposing genotypes into constituent alleles, where alleles are defined over multiple variable sites, a step in the standard analysis of genotypes for highly polymorphic genes such as HLA.
Non-Integer Peak Shifts—
Ambiguity remains when the mutation or variable position in the shifted peak pattern of a second sample is superimposed on an identical base in the sequence of the first sample. For example, if, in
However, by constructing source-tags so as to introduce “non-integer” peak shifts (that is: shifts by non-integer multiples of the nominal peak-to-peak spacing of 1 base), this ambiguity is avoided: the presence of a second “C” peak, shifted by a non-integer displacement from the first, would unambiguously indicate the presence of a “het” configuration.
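The difference between integer and non-integer shifts can be illustrated with a toy peak model, assuming ideal peaks located at multiples of the base spacing and a hypothetical resolution tolerance; the sequences and function name are likewise hypothetical.

```python
def masked_positions(seq_first, seq_second, shift, tol=0.25):
    """Indices of seq_second whose shifted peak coincides with a same-base peak of seq_first.

    Peaks of the first sample sit at positions 0, 1, 2, ... (units of one base spacing);
    peaks of the second sample sit at index + shift. Two peaks in the same color channel
    are treated as unresolved when they lie within tol of each other.
    """
    masked = []
    for j, base in enumerate(seq_second):
        position = j + shift
        if any(b == base and abs(i - position) <= tol for i, b in enumerate(seq_first)):
            masked.append(j)
    return masked


normal = "ACGTCCAGT"    # hypothetical toy sequence of the first sample
variant = "ACGCCCAGT"   # second sample: toy T>C substitution at index 3

print(masked_positions(normal, variant, shift=1.0))   # integer shift
print(masked_positions(normal, variant, shift=1.5))   # non-integer shift
```

With a shift of one base, the variant “C” at index 3 lands on a “C” of the first sample and is hidden; with a shift of 1.5 bases, no peak of the second sample coincides with a peak of the first.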
Non-integer peak shifts may be produced by using source tags that, for given length, differ in base composition. Thus, it has long been known that tag composition alters the electrophoretic mobilities of oligonucleotides (Frank 1979—“DNA chain length markers and the influence of base composition on electrophoretic mobility of oligodeoxyribonucleotides in polyacrylamide-gels” Nucleic Acids Research Vol. 8 pp. 2069-87). Alternatively, the fluorescent dyes used in commercial Sanger sequencing kits are well known to introduce differential peak shifts (requiring correction as a pre-processing step when aligning the traces recorded in the different color channels and normalization with respect to a “size ladder” included in each reaction). In addition, chemical modifications with “drag tags” also have been described. Other chemical modifications including methylation also are available to introduce peak shifts.
Key Process Steps—
The analysis of sequence traces recorded from a pool of source-tagged samples would proceed as follows:
1—Detecting Variants by Pooled Sanger Sequencing: Sieving for RHCE Exon 5 Alleles
Exon 5 of this gene comprises several important alleles including (ISBT—https://tinyurl.com/vca97t93)
The administration of red cells that are not properly matched for the phenotype determined at this locus contributes to the risk of alloantibody formation. The rapid determination of the alleles at this locus in particular, for recipients and donors of red cells, therefore has substantial clinical significance.
To apply the “screening” (or “sieving”) method of the invention using pooled Sanger sequencing, combine DNA samples from at least 2 individuals (d≥2) for amplification using standard primers flanking exon 5, where d is determined as disclosed; then, commit the resulting amplicons to the Sanger labeling reaction performed with either of the PCR primers, and analyze the resulting labeled products by capillary electrophoresis. As with molecular sieving generally, the abundance of the variant alleles in the table, for the population of interest, determines the probability of a variant and an associated ambiguity, and thus determines the expected number of pools that are unambiguous for one or more of the listed alleles. Constituent samples of these pools may be selected in accordance with desired allele patterns; for example, pools comprising candidate donor samples that are homozygous E− may be selected for immediate assignment to recipients with existing anti-E antibodies.
2—Detecting Variants by Pooled Sanger Sequencing: Screening for β-Thalassemia Mutations
Exon 1 of this gene comprises several of the most commonly observed mutations including substitutions, insertions and deletions. Illustrated here is the detection of an insertion in a pool of two samples of which one is normal, and the other is homozygous for the codon 8/9 (+G) mutation, producing the following expected “read” (see also
A “G-T” het configuration in the expected position of the insert, highlighted here by a bold-faced “G”, along with additional predictable downstream hets, indicates the presence of the insert in at least one allele. The corresponding sequence traces, in
Analogously, deletions, such as the 4-base deletion in codon 41/42-CTTT, another common β-thalassemia mutation, would be readily detected by the appearance of predictable het configurations. A simple substitution will produce a characteristic het at the expected position.
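The pattern of hets expected for each type of mutation can be sketched for a pool of one normal and one homozygous variant sample. The sequence below is a hypothetical toy sequence, not the β-globin sequence, and the function names are illustrative.

```python
def compound_read(seq_a, seq_b):
    """Superposed base calls over the positions covered by both reads."""
    n = min(len(seq_a), len(seq_b))
    return ["".join(sorted({seq_a[i], seq_b[i]})) for i in range(n)]


def het_positions(seq_a, seq_b):
    """Positions at which the two reads differ, i.e. show peaks in two color channels."""
    return [i for i, call in enumerate(compound_read(seq_a, seq_b)) if len(call) == 2]


normal = "ATGCATCTGACC"                     # hypothetical toy sequence
insertion = normal[:5] + "G" + normal[5:]   # +G inserted before index 5
deletion = normal[:5] + normal[7:]          # two bases deleted at indices 5 and 6
substitution = normal[:5] + "G" + normal[6:]  # T>G at index 5

for name, seq in [("insertion", insertion), ("deletion", deletion), ("substitution", substitution)]:
    print(name, het_positions(normal, seq), "".join(compound_read(normal, seq)[5:6]))
```

In this toy case the insertion produces a “G-T” het at the insert position followed by a run of downstream hets, the deletion produces a run of hets beginning where the shifted read first differs from the normal read, and the substitution produces a single het.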
The example illustrates the case d=2, with 2 copies of the variant allele in the pool. The value of d is limited only by detection sensitivity, which must be such that 1 copy of a variant allele is reliably detected in a pool comprising 2d copies: thus, it is the detection sensitivity that ultimately sets the upper limit, dmax, on d.
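This limit can be expressed in a short sketch, assuming that sensitivity is characterized by a minimum detectable allele fraction; that parameter, its example values and the function names are hypothetical.

```python
def variant_fraction(d, copies=1):
    """Fraction of variant alleles when `copies` variant copies sit among 2*d alleles."""
    return copies / (2 * d)


def d_max(min_detectable_fraction, limit=1000):
    """Largest pool size for which a single variant copy remains detectable."""
    d = 0
    while d < limit and variant_fraction(d + 1) >= min_detectable_fraction:
        d += 1
    return d


for threshold in (0.25, 0.125):
    print(f"min detectable fraction {threshold}: d_max = {d_max(threshold)}")
```

For example, an assay that reliably detects a variant present in one of eight allele copies would support pools of up to four samples under this model.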
3—Disambiguation by Using Source Tags: β-Thalassemia Mutations—
As illustrated in
4—Disambiguation by Using Source Tags: Cystic Fibrosis Mutations—
A further example is that of assigning the G542X (G>T) mutation in exon 11 of the cystic fibrosis gene. Of the expected sequence “reads” for four possible configurations, shown in
It is worth pointing out that the use of a labeling primer directed to the S22 tag sequence permits the labeling of only the S22-tagged sample, even when both samples are in the pool: in fact, this is how the trace in the middle panel of
Number | Date | Country
---|---|---
62580784 | Nov 2017 | US