QUANTITATIVE AMPLICON SEQUENCING FOR MULTIPLEXED COPY NUMBER VARIATION DETECTION AND ALLELE RATIO QUANTITATION

Information

  • Patent Application
  • 20220098642
  • Publication Number
    20220098642
  • Date Filed
    January 02, 2020
  • Date Published
    March 31, 2022
Abstract
Provided herein are methods of quantitative amplicon sequencing, for labeling each strand of targeted genomic loci in a DNA sample with an oligonucleotide barcode sequence by polymerase chain reaction, and amplifying the genomic region(s) for high-throughput sequencing. The methods can be used for the simultaneous detection of copy number variation (CNV) in a set of genes of interest, by quantifying the frequency of extra copies of each gene. In addition, these methods provide for the quantitation of the allele ratio of different genetic identities for targeted genomic loci using multiplexed PCR. In addition, these methods provide for the detection of mutations and quantitation of the variant allele frequency.
Description
REFERENCE TO A SEQUENCE LISTING

The instant application contains a Sequence Listing, which has been submitted in ASCII format via EFS-Web and is hereby incorporated by reference in its entirety. Said ASCII copy, created on Nov. 26, 2019, is named RICEP0058WO_ST25.txt and is 145.6 kilobytes in size.


BACKGROUND
1. Field

The present invention relates generally to the fields of molecular biology and medicine. More particularly, it concerns compositions and methods for multiplexed copy number variation detection and allele ratio quantitation using quantitative amplicon sequencing.


2. Description of Related Art

Copy number variations (CNVs) are important cancer biomarkers contributing to cancer formation and progression. They are present in a significant percentage of tumors, between 3% and 98% depending on the cancer type. Many CNVs confer sensitivity or resistance to targeted therapies, for example, MET amplification confers increased sensitivity to MET TKIs in non-small cell lung cancer, and PTEN deletion confers BRAF inhibitor resistance in melanoma. In tumor samples, CNV of a specific gene may exist only in a small fraction (<10%) of cells, due to tumor heterogeneity and normal cell contamination.


Unlike mutations and indels, CNVs have no unique sequence; thus, detection of CNV requires accurate quantitation. This quantitation is difficult due to stochasticity in sampling of DNA molecules. For example, the standard deviation (σ) of sampling 1200 molecules per locus (i.e., 1200 haploid genomic copies from 600 normal cells, 4 ng of genomic DNA) can be estimated from the Poisson distribution: σ = √1200 ≈ 35, corresponding to 3% of the molecule number. In this case, detecting 1% of extra copies is not possible. Theoretically, increasing the number of input molecules or analyzing more loci can equally decrease the relative variation, and σ can be estimated as σ = √(haploid genomic copies × loci number). If the genomic copy number or loci number increases by 100×, σ relative to the molecule number decreases to 0.3%, and 1% of extra copies becomes detectable.
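The sampling arithmetic above can be checked with a short script. This is a minimal sketch, assuming only that sampled molecule counts follow a Poisson distribution as stated; the function name `relative_sigma` is illustrative and does not come from the application.

```python
import math


def relative_sigma(haploid_copies: int, loci: int = 1) -> float:
    """Poisson standard deviation divided by the mean, for copies x loci sampled molecules."""
    total_molecules = haploid_copies * loci
    return math.sqrt(total_molecules) / total_molecules


# One locus, 1200 haploid genomic copies: sigma is about 35 molecules (about 3%).
print(round(math.sqrt(1200)))
print(f"{relative_sigma(1200):.1%}")

# 100x more loci (or copies): the relative sigma drops tenfold, to about 0.3%.
print(f"{relative_sigma(1200, loci=100):.1%}")
```

Because the relative σ scales as one over the square root of the total molecule count, a 100× increase in copies or loci is needed for a 10× improvement in precision.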


The current standard method for CNV detection in molecular diagnostics is in situ hybridization (ISH), which can determine CNV status based on observation of a small number of cells. However, ISH technologies cannot analyze multiple genomic regions simultaneously, due to the limited number of distinguishable colors in both fluorescence and bright-field microscopy. Additionally, ISH is a complex process that needs to be performed by specialized labs, which prevents it from being widely adopted.


Another method for CNV detection is droplet digital PCR (ddPCR), which is a PCR-based method for absolute quantitation of DNA molecules. However, its limit of detection (LoD) for CNV is about 20% extra copies, even with a large number of replicate experiments. Like ISH, ddPCR cannot be highly multiplexed due to the limited number of fluorescence channels. Microarray-based methods, including array comparative genomic hybridization and SNP arrays, are highly multiplexed methods used for screening of large CNVs and aneuploidies. However, they are less effective at detecting smaller CNVs (<40 kb) or low-frequency CNVs (<30% extra copies).


Next-generation sequencing (NGS) is a high-throughput technology that has seen rapidly decreasing costs over the past decade. NGS is popular in the field of cancer molecular diagnostics. Highly multiplexed mutation detection with an LoD of <0.1% variant allele frequency has been achieved and commercialized on NGS platforms. However, the current LoD of NGS methods for CNV detection is not as good: whole-exome sequencing (WES) has been used for CNV discovery at a level of ≈30% extra copies, but it is expensive and requires even more NGS reads (with a proportional increase in cost) to achieve a lower LoD. Smaller hybrid-capture panels, such as the FoundationOne commercial panel, can reach an LoD of ≈30% extra copies at lower cost.


In NGS panels for diagnostics, target enrichment is needed to reduce NGS reads wasted on unrelated genomic regions. Two popular methods for target enrichment are hybrid-capture and multiplexed PCR. Current NGS-based CNV panels are mostly hybrid-capture-based, which means target regions are captured by biotinylated nucleic acid probes and separated from the rest of the genome using streptavidin magnetic beads. Hybrid-capture panels have low on-target rates when the panel size is small, so most panels are >100 kb (i.e., >1000 probes or loci); this is due to nonspecific binding of unwanted DNA on bead surfaces, probes, and captured targets. Due to the large number of loci, the coverage of hybrid-capture panels is not uniform: the 95th- and 5th-percentile loci differ by at least 30-fold, which introduces another layer of bias in quantitation. Hybrid-capture panels also suffer from low conversion rates (i.e., the percentage of input molecules sequenced) caused by imperfect end-repair and ligation, which biases sampling and contributes to variation.


SUMMARY

Provided herein are methods of quantitative amplicon sequencing, for labeling each strand of targeted genomic loci in a DNA sample with an oligonucleotide barcode sequence by polymerase chain reaction, and amplifying the genomic region(s) for high-throughput sequencing. The methods can be used for the simultaneous detection of copy number variation (CNV) in a set of genes of interest, by quantitating the frequency of extra copies of each gene. In addition, these methods provide for the quantitation of the allele ratio of different genetic identities for targeted genomic loci using multiplexed PCR.


In one embodiment, provided herein are methods for preparing targeted regions of genomic DNA for high-throughput sequencing, the method comprising: (a) obtaining a genomic DNA sample; (b) amplifying at least a portion of the genomic DNA sample by performing two cycles of PCR using: (i) a first oligonucleotide comprising, from 5′ to 3′, a first region, a second region having a length between 0 and 50 nucleotides (e.g., 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, or 50 nucleotides), a third region comprising at least four degenerate nucleotides (e.g., 4, 5, 6, 7, 8, 9, 10, 11, or 12 degenerate nucleotides), and a fourth region comprising a sequence that is complementary to a first target genomic DNA region; and (ii) a second oligonucleotide comprising, from 5′ to 3′, a fifth region, a sixth region having a length between 0 and 50 nucleotides (e.g., 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, or 50 nucleotides), and a seventh region comprising a sequence that is complementary to a second target genomic DNA region; (c) amplifying a product of step (b) by performing at least three cycles of PCR with an annealing temperature that is 0-10° C. (e.g., 1-10, 2-10, 3-10, 4-10, 5-10, 1-9, 1-8, 1-7, 1-6, 1-5, 2-9, 2-8, 2-7° C. or any range or value derivable therein) higher than an annealing temperature used in step (b) and using: (i) a third oligonucleotide comprising a sequence that is able to hybridize to the reverse complement of at least a portion of the first region; and (ii) a fourth oligonucleotide comprising a sequence that is able to hybridize to the reverse complement of at least a portion of the fifth region; and (d) amplifying a product of step (c) by performing at least one cycle of PCR using a fifth oligonucleotide comprising, from 5′ to 3′, an eighth region, a ninth region having a length between 0 and 50 nucleotides (e.g., 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, or 50 nucleotides), and a tenth region comprising a sequence that is complementary to a third target genomic DNA region, wherein the third target genomic DNA region is at least one nucleotide closer to the first target genomic DNA region than the second target genomic DNA region.


In some aspects, methods are methods for preparing between 1 and 10,000 targeted regions (e.g., at least 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 15, 20, 25, 30, 35, 40, 45, 50, 75, 100, 250, 500, 750, 1,000, 2,000, 3,000, 4,000 or 5,000 and at most 10,000, 9,000, 8,000, 7,000, 6,000, 5,000, 4,000, 3,000, 2,000, 1,000, 750, 500, 250, 100, 75, or 50 targeted regions, or any range or value derivable therein) of genomic DNA for high-throughput sequencing. In some aspects, the third region is a unique molecular identifier (UMI). In some aspects, the third target genomic DNA region is 1-10 (e.g., 1, 2, 3, 4, 5, 6, 7, 8, 9, or 10) bases closer to the first target genomic DNA region than the second target genomic DNA region. In some aspects, the first region and the eighth region are universal primer binding sites. In some aspects, the first region and the eighth region comprise a full or partial NGS adapter sequence. In some aspects, the fifth region comprises a sequence that cannot be found in the human genome. In some aspects, the fifth region comprises a sequence that is different from an NGS adapter sequence. In some aspects, the melting temperatures of the first region and the fifth region are 0-10° C. (e.g., 1-10, 2-10, 3-10, 4-10, 5-10, 1-9, 1-8, 1-7, 1-6, 1-5, 2-9, 2-8, 2-7° C. or any range or value derivable therein) higher than the melting temperatures of the fourth region and the seventh region. In some aspects, the degenerate nucleotides in the third region each independently are one of A, T, or C. In some aspects, none of the degenerate nucleotides in the third region are G. In some aspects, there is a population of first oligonucleotides each having a unique third region.


In some aspects, the methods further comprise purifying the product of step (c). In some aspects, purifying comprises SPRI purification or column purification. In some aspects, the methods further comprise purifying the product of step (d). In some aspects, purifying comprises SPRI purification or column purification. In some aspects, the methods further comprise (e) amplifying the product of step (d) by PCR using primers that hybridize to the first region and the eighth region, wherein the primers comprise an index sequence for next-generation sequencing. In some aspects, the methods further comprise purifying the product of step (e). In some aspects, purifying comprises SPRI purification or column purification. In some aspects, the methods further comprise (f) performing high-throughput DNA sequencing of the product of step (e). In some aspects, high-throughput DNA sequencing comprises next-generation sequencing.


In some aspects, the first target genomic DNA region and the second target genomic DNA region are on opposite strands of the genomic DNA. In some aspects, the first target genomic DNA region and the second target genomic DNA region are separated by between 40 nucleotides and 500 nucleotides (e.g., by 40, 45, 50, 55, 60, 65, 70, 75, 80, 90, 100, 125, 150, 175, 200, 225, 250, 275, 300, 325, 350, 375, 400, 425, 450, 475, or 500 nucleotides, or any value derivable therein). In some aspects, step (b) comprises an extension time of about 30 minutes (e.g., 27, 28, 29, 30, 31, 32, or 33 minutes). In some aspects, step (c) comprises an extension time of about 30 seconds (e.g., 27, 28, 29, 30, 31, 32, or 33 seconds). In some aspects, step (d) comprises an extension time of about 30 minutes (e.g., 27, 28, 29, 30, 31, 32, or 33 minutes).


In one embodiment, provided herein are methods for quantifying the frequency of extra copies (FEC) of at least one target gene, the method comprising: (a) obtaining a genomic DNA sample; (b) preparing the genomic DNA for high-throughput sequencing according to a method of any one of the present embodiments, wherein the sequences of the fourth region, the seventh region, and the tenth region hybridize to the at least one target gene; (c) performing high-throughput sequencing according to a method of any one of the present embodiments; and (d) calculating the FEC for the at least one target gene based on the sequencing information obtained in step (c).


In some aspects, the methods are methods for quantifying the FEC for a set of target genes, wherein the set of target genes comprises between 2 and 1000 target genes (e.g., at least 2, 3, 4, 5, 6, 7, 8, 9, 10, 15, 20, 25, 30, 35, 40, 45, 50, 75, 100, 250, 500, or 750, and at most 1,000, 900, 800, 750, 700, 650, 600, 550, 500, 450, 400, 350, 300, 250, 200, 150, 100, 75, 50, 25, 20, 15, 10, 9, 8, 7, 6, 5, 4, or 3 targeted regions, or any range or value derivable therein). In some aspects, step (b) is performed using a population of first oligonucleotides, a population of second oligonucleotides, and a population of fifth oligonucleotides, wherein a portion of each of the populations of first, second, and fifth oligonucleotides comprise fourth, seventh, and tenth regions, respectively, that are complementary to one of the set of target genes. In some aspects, each of the fourth, seventh, and tenth regions comprises sequences that are only found once in the human genome. In some aspects, each first oligonucleotide that hybridizes to one target gene has a unique third region compared to each other first oligonucleotide that hybridizes to the same target gene. In some aspects, step (b) is performed using a first oligonucleotide, a second oligonucleotide, and a fifth oligonucleotide comprising fourth, seventh, and tenth regions, respectively, that are complementary to a reference gene. In some aspects, step (b) prepares a portion of each target gene or reference gene for high-throughput sequencing, wherein the portion is between 40 nucleotides and 500 nucleotides (e.g., by 40, 45, 50, 55, 60, 65, 70, 75, 80, 90, 100, 125, 150, 175, 200, 225, 250, 275, 300, 325, 350, 375, 400, 425, 450, 475, or 500 nucleotides, or any value derivable therein) long. In some aspects, FEC is defined as:








\[
\mathrm{FEC} = \frac{\text{Copies of the target genomic region} - \text{Haploid genomic copies}}{\text{Haploid genomic copies}}.
\]
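As a worked instance of the FEC definition above, the following sketch computes FEC from absolute copy numbers. The function name and the example numbers (1212 target copies against 1200 haploid genomic copies) are illustrative assumptions, not values from the application.

```python
def frequency_of_extra_copies(target_copies: float, haploid_copies: float) -> float:
    """FEC = (copies of the target genomic region - haploid genomic copies) / haploid genomic copies."""
    if haploid_copies <= 0:
        raise ValueError("haploid_copies must be positive")
    return (target_copies - haploid_copies) / haploid_copies


# Hypothetical sample: 12 extra target copies on a background of 1200 haploid genomes.
print(frequency_of_extra_copies(1212, 1200))  # 0.01, i.e. 1% extra copies
```

A sample with no amplification or deletion gives FEC = 0, and a deletion gives a negative FEC.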










In some aspects, step (d) comprises: (i) aligning NGS reads to the targeted portions of each target gene and grouping the NGS reads into subgroups based on the loci to which they align; (ii) dividing the NGS reads at each locus based on their UMI sequences such that all NGS reads carrying the same UMI sequence are grouped as one UMI family; (iii) removing UMI families resulting from PCR errors or NGS errors; (iv) counting the number of unique UMI sequences at each locus; and (v) calculating the FEC based on the number of unique UMI sequences for each locus in each target gene and reference gene. In some aspects, step (d)(iii) comprises removing UMI sequences that do not meet the UMI degenerate base design. In some aspects, step (d)(iii) comprises removing UMI families with a UMI family size less than Fmin, wherein the UMI family size is the number of reads carrying the same UMI, wherein Fmin is between 2 and 20 (e.g., 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, or 20). In some aspects, step (d)(iv) comprises removing UMI sequences that differ by only one or two bases from another UMI sequence with a larger family size.
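The UMI grouping and filtering steps above can be sketched as follows. This is an illustration under stated assumptions, not the applicants' software: reads are assumed to be already aligned and reduced to (locus, UMI) pairs, the UMI design is assumed to be poly(H) (A, C, or T only), Fmin is set to 3, near-duplicates are defined as one mismatch, and the filters are applied in the order shown. All function and variable names are hypothetical.

```python
from collections import Counter, defaultdict

H_BASES = set("ACT")  # assumed poly(H) UMI design: no G allowed


def hamming(a: str, b: str) -> int:
    """Number of mismatching positions between two equal-length UMIs."""
    return sum(x != y for x, y in zip(a, b))


def count_unique_umis(reads, f_min=3, max_mismatch=1):
    """Return {locus: unique UMI number} from an iterable of (locus, umi) pairs."""
    families = defaultdict(Counter)  # locus -> {umi: family size}
    for locus, umi in reads:
        families[locus][umi] += 1

    unique_counts = {}
    for locus, sizes in families.items():
        # Drop UMIs that violate the degenerate-base design or have small families.
        kept = {u: n for u, n in sizes.items() if set(u) <= H_BASES and n >= f_min}
        # Drop UMIs within max_mismatch of another UMI with a larger family.
        accepted = [
            u for u in kept
            if not any(
                v != u and len(v) == len(u)
                and hamming(u, v) <= max_mismatch and kept[v] > kept[u]
                for v in kept
            )
        ]
        unique_counts[locus] = len(accepted)
    return unique_counts


example_reads = (
    [("ERBB2_1", "ACTACT")] * 5    # accepted
    + [("ERBB2_1", "ACTACC")] * 3  # one mismatch from a larger family: removed
    + [("ERBB2_1", "ACGACT")] * 4  # contains G: removed
    + [("ERBB2_1", "TTTCCC")] * 2  # family size below f_min: removed
    + [("ERBB2_1", "CCCTTT")] * 3  # accepted
    + [("Ref_1", "ATATAT")] * 4    # accepted
)
print(count_unique_umis(example_reads))
```

The pairwise comparison is quadratic in the number of UMIs per locus, which is acceptable for a sketch but would likely need indexing for real panels.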


In some aspects, FEC is defined as:








\[
\mathrm{FEC} = k \, \frac{\sum_{i=1}^{u} N_{\mathrm{Tar},i}}{\sum_{j=1}^{w} \sum_{i=1}^{v} N_{\mathrm{Ref},i,j}} - 1,
\]

where $\sum_{i=1}^{u} N_{\mathrm{Tar},i}$ is the sum of the unique UMI numbers for all or part of the target gene loci, u is the number of loci to consider, and u is no more than the total number of loci in the target gene; $\sum_{j=1}^{w} \sum_{i=1}^{v} N_{\mathrm{Ref},i,j}$ is the sum of the unique UMI numbers for all or part of the reference loci, v is the number of loci to consider for one reference, and v is no more than the total number of loci in the reference; w is the number of references to consider, and w is no more than the total number of references; and k is determined by experimental calibration. In some aspects, the FEC is used to identify the copy number variation (CNV) status of the target gene.
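The UMI-based FEC formula above can be sketched in a few lines. The text says only that k is determined by experimental calibration; the `calibrate_k` helper below assumes, for illustration, that k is chosen so a known normal sample yields FEC = 0. That procedure, the function names, and the example counts are all assumptions.

```python
def fec_from_umi_counts(target_counts, reference_counts, k):
    """FEC = k * sum_i N_Tar,i / sum_j sum_i N_Ref,i,j - 1.

    target_counts: unique UMI numbers for the u target loci.
    reference_counts: one list of unique UMI numbers (v loci) per reference (w lists).
    """
    target_sum = sum(target_counts)
    reference_sum = sum(sum(loci) for loci in reference_counts)
    if reference_sum == 0:
        raise ValueError("reference UMI counts sum to zero")
    return k * target_sum / reference_sum - 1


def calibrate_k(normal_target_counts, normal_reference_counts):
    """Assumed calibration: pick k so that a normal sample gives FEC = 0."""
    reference_sum = sum(sum(loci) for loci in normal_reference_counts)
    return reference_sum / sum(normal_target_counts)


# Hypothetical counts: a normal calibration sample, then a test sample with 20% more target UMIs.
k = calibrate_k([100, 110, 90], [[95, 105], [100, 100]])
print(fec_from_umi_counts([120, 132, 108], [[95, 105], [100, 100]], k))
```

Summing over several loci and references before taking the ratio pools the sampled molecules, which is what reduces the Poisson noise discussed in the background.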


In one embodiment, provided herein are methods for quantifying the allele ratio of different genetic identities for at least one target genomic locus, the method comprising: (a) obtaining a genomic DNA sample; (b) preparing the genomic DNA for high-throughput sequencing according to a method of any one of the present embodiments, wherein the sequences of the fourth region, the seventh region, and the tenth region hybridize to the genomic DNA near the at least one target genomic locus; (c) performing high-throughput sequencing according to a method of any one of the present embodiments; and (d) calculating the allele ratio of different genetic identities for the at least one target genomic locus based on the sequencing information obtained in step (c).


In some aspects, the methods are methods for quantifying the allele ratio of different genetic identities for a set of target genomic loci, wherein the set of target genomic loci comprises between 2 and 10,000 target genomic loci (e.g., at least 2, 3, 4, 5, 6, 7, 8, 9, 10, 15, 20, 25, 30, 35, 40, 45, 50, 75, 100, 250, 500, 750, 1,000, 2,000, 3,000, 4,000 or 5,000 and at most 10,000, 9,000, 8,000, 7,000, 6,000, 5,000, 4,000, 3,000, 2,000, 1,000, 750, 500, 250, 100, 75, or 50 target genomic loci, or any range or value derivable therein). In some aspects, step (b) is performed using a population of first oligonucleotides, a population of second oligonucleotides, and a population of fifth oligonucleotides, wherein a portion of each of the populations of first, second, and fifth oligonucleotides comprise fourth, seventh, and tenth regions, respectively, that are complementary to the genomic DNA near the at least one of the set of target genomic loci. In some aspects, each of the fourth, seventh, and tenth regions comprises sequences that are not able to hybridize with non-target regions of the genomic DNA under the conditions of step (b). In some aspects, each first oligonucleotide that hybridizes to the genomic DNA near one target genomic locus has a unique third region compared to each other first oligonucleotide that hybridizes to the genomic DNA near the same target genomic locus. In some aspects, each target genomic locus is between 40 nucleotides and 500 nucleotides (e.g., by 40, 45, 50, 55, 60, 65, 70, 75, 80, 90, 100, 125, 150, 175, 200, 225, 250, 275, 300, 325, 350, 375, 400, 425, 450, 475, or 500 nucleotides, or any value derivable therein) long.


In some aspects, step (d) comprises: (i) aligning NGS reads to the targeted genomic loci and grouping the NGS reads into subgroups based on the loci to which they align; (ii) dividing the NGS reads at each locus based on their UMI sequences such that all NGS reads carrying the same UMI sequence are grouped as one UMI family; (iii) removing UMI families resulting from PCR errors or NGS errors; (iv) calling the genetic identity for each remaining UMI family; (v) counting the number of unique UMI sequences at each locus; and (vi) calculating the allele ratio. In some aspects, step (d)(iii) comprises removing UMI sequences that do not meet the UMI degenerate base design. In some aspects, step (d)(iii) comprises removing UMI families with a UMI family size less than Fmin, wherein the UMI family size is the number of reads carrying the same UMI, wherein Fmin is between 2 and 20 (e.g., 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, or 20). In some aspects, step (d)(iii) comprises removing UMI sequences that differ by only one or two bases from another UMI sequence with a larger family size. In some aspects, step (d)(iv) comprises calling the genetic identity only if at least 70% (e.g., 70%, 75%, 80%, 85%, 90%, 95%, or 98%) of the reads in a UMI family are the same at the genetic locus of interest. In some aspects, the allele ratio is defined as Rallele=N1/N2, where N1 is the unique UMI number for the first genetic identity, and N2 is the unique UMI number for the second genetic identity.
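The identity calling and allele ratio described above might look like the following sketch. It assumes each UMI family has already passed the error filters and has been reduced to the list of alleles its reads show at the locus of interest; the 70% threshold is the lowest value named in the text, and the function names and example data are hypothetical.

```python
from collections import Counter


def call_identity(family_alleles, min_fraction=0.7):
    """Return the majority allele of one UMI family, or None if no allele reaches min_fraction."""
    if not family_alleles:
        return None
    allele, count = Counter(family_alleles).most_common(1)[0]
    return allele if count / len(family_alleles) >= min_fraction else None


def allele_ratio(families, first, second, min_fraction=0.7):
    """R_allele = N1 / N2, counting one call per UMI family."""
    calls = Counter(call_identity(alleles, min_fraction) for alleles in families.values())
    n1, n2 = calls[first], calls[second]
    if n2 == 0:
        raise ValueError("no UMI family was called for the second genetic identity")
    return n1 / n2


# Hypothetical families at one locus, keyed by UMI.
example_families = {
    "ACTACT": ["A"] * 5,
    "CCTTAC": ["A"] * 4 + ["G"],   # 80% A: called A
    "TTACCA": ["G"] * 3,
    "CATCAT": ["A", "A", "G", "G"],  # 50/50: no call
    "TCTCTA": ["A"] * 3,
}
print(allele_ratio(example_families, "A", "G"))  # 3.0
```

Counting families rather than reads is what makes the ratio insensitive to uneven PCR amplification of individual molecules.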


In some aspects, step (d)(iv) comprises identifying the consensus sequence of each UMI family. In some aspects, the consensus sequence is the sequence appearing the highest number of times in the UMI family. In some aspects, step (d)(iv) further comprises comparing the consensus sequence to the wild-type sequence for that locus, thereby identifying mutations in the consensus sequence. In some aspects, the methods further comprise calculating the variant allele frequency (VAF) of the identified mutation. In some aspects, the VAF of the identified mutation is defined as Number of UMI families with the mutation/Total number of UMI families.
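The consensus and VAF definitions above can be illustrated with a short sketch. It assumes each UMI family is a list of equal-length read sequences covering the locus and handles substitutions only; positions are 0-based, and every name and sequence here is a hypothetical example.

```python
from collections import Counter


def consensus(family_reads):
    """Sequence appearing the highest number of times in one UMI family."""
    return Counter(family_reads).most_common(1)[0][0]


def substitutions(sequence, wild_type):
    """Substitutions versus wild type as (0-based position, ref, alt); assumes equal lengths."""
    return {(i, ref, alt) for i, (ref, alt) in enumerate(zip(wild_type, sequence)) if ref != alt}


def variant_allele_frequency(families, wild_type, mutation):
    """VAF = UMI families whose consensus carries the mutation / total UMI families."""
    calls = [substitutions(consensus(reads), wild_type) for reads in families.values() if reads]
    if not calls:
        raise ValueError("no UMI families to evaluate")
    return sum(mutation in found for found in calls) / len(calls)


# Hypothetical locus with wild-type sequence ACGT and four UMI families.
example_families = {
    "ACTACT": ["ACGT"] * 3,
    "CCTTAC": ["ACTT", "ACTT", "ACGT"],  # consensus ACTT: G>T at position 2
    "TTACCA": ["ACGT"] * 4,
    "TCTCTA": ["ACGT"] * 3,
}
print(variant_allele_frequency(example_families, "ACGT", (2, "G", "T")))  # 0.25
```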


As used herein, “essentially free,” in terms of a specified component, means that none of the specified component has been purposefully formulated into a composition and/or that the component is present only as a contaminant or in trace amounts. The total amount of the specified component resulting from any unintended contamination of a composition is therefore well below 0.05%, preferably below 0.01%. Most preferred is a composition in which no amount of the specified component can be detected with standard analytical methods.


As used herein in the specification, “a” or “an” may mean one or more. As used herein in the claim(s), when used in conjunction with the word “comprising,” the words “a” or “an” may mean one or more than one.


The use of the term “or” in the claims is used to mean “and/or” unless explicitly indicated to refer to alternatives only or the alternatives are mutually exclusive, although the disclosure supports a definition that refers to only alternatives and “and/or.” As used herein “another” may mean at least a second or more.


Throughout this application, the term “about” is used to indicate that a value includes the inherent variation of error for the device, the method being employed to determine the value, or the variation that exists among the study subjects.


Other objects, features and advantages of the present invention will become apparent from the following detailed description. It should be understood, however, that the detailed description and the specific examples, while indicating preferred embodiments of the invention, are given by way of illustration only, since various changes and modifications within the spirit and scope of the invention will become apparent to those skilled in the art from this detailed description.





BRIEF DESCRIPTION OF THE DRAWINGS

The following drawings form part of the present specification and are included to further demonstrate certain aspects of the present invention. The invention may be better understood by reference to one or more of these drawings in combination with the detailed description of specific embodiments presented herein.



FIG. 1. Schematic of QASeq primer design and experimental workflow. Each primer set contains 3 different oligos: a Specific Forward Primer (SfP), a Specific Reverse Primer A (SrPA), and a Specific Reverse Primer B (SrPB). Each QASeq panel only needs one Universal Forward Primer (UfP) and one Universal Reverse Primer (UrP). There can be additional bases at the 5′-end of region 1 or region 5 in UfP or UrP. For one recommended workflow, the DNA sample is first mixed with all the SfP, SrPA, DNA polymerase, dNTPs, and PCR buffer. 2 cycles of long-extension PCR are performed for addition of UMI on all target loci. Next, in order to amplify the molecules while preventing addition of multiple UMIs onto the same original molecule, the annealing temperature is raised by about 8° C. in PCR amplification for about 7 cycles using UfP and UrP (short-extension, about 30 s); note that addition of UfP and UrP into the reaction is an open-tube step on the thermocycler. After purification using SPRI magnetic beads or columns, SrPB primers, DNA polymerase, dNTPs, and PCR buffer are mixed with the PCR product for adapter replacement; after 2 cycles of long extension (about 30 min), the NGS adapters are only added onto the correct PCR products, not the primer dimers or non-specific products. After another purification using SPRI magnetic beads or columns, standard NGS index PCR is performed; libraries are normalized and loaded onto an Illumina sequencer.



FIG. 2. Simulation of UMI cross-binding energy. Using (H)20 instead of (N)20 or (SWW)6SW as UMI sequences reduces the mean cross-binding energy, indicating fewer primer-dimer interactions. Here 500 simulations were performed for each UMI pattern; in each simulation, 2 sequences that are consistent with the pattern were randomly generated, and the cross-binding ΔG° between these sequences was calculated assuming 60° C. and 0.18 M K+.



FIGS. 3A-B. Spacer between primer and UMI reduces PCR bias. (FIG. 3A) Workflow for evaluating the significance of the spacer between primer and UMI. Three sets of primers, with no spacer (set 1), with a 5 nt spacer between the forward primer and UMI and a 5 nt spacer between the reverse primer and UMI (set 2), or with a 12 nt spacer between the forward primer and UMI and an 11 nt spacer between the reverse primer and UMI (set 3) were used to amplify input molecules separately. Indices were added before NGS analysis by Illumina MiSeq. (FIG. 3B) Experimental UMI family size distribution histograms for the three sets of primers. UMI sequences that did not match the UMI design pattern were removed.



FIGS. 4A-B. Data analysis for UMI-based absolute quantitation for CNV. (FIG. 4A) Data analysis workflow for CNV detection. NGS reads in the FASTQ output file are analyzed to generate CNV status as results. FEC of a target gene will be calculated as

\[
\mathrm{FEC} = k \, \frac{\sum_{i=1}^{u} N_{\mathrm{Tar},i}}{\sum_{j=1}^{w} \sum_{i=1}^{v} N_{\mathrm{Ref},i,j}} - 1,
\]

where $\sum_{i=1}^{u} N_{\mathrm{Tar},i}$ is the sum of the unique UMI numbers for all or part of the target gene loci, u is the number of loci to consider; $\sum_{j=1}^{w} \sum_{i=1}^{v} N_{\mathrm{Ref},i,j}$ is the sum of the unique UMI numbers for all or part of the reference loci, v is the number of loci to consider for one reference; w is the number of references to consider; and k is determined by experimental calibration. CNV status is determined based on FEC. (FIG. 4B) Definition of UMI family size and unique UMI number in data analysis: UMI family size is the number of reads carrying the same UMI sequence, and unique UMI number is the total count of different UMI sequences at one locus.



FIG. 5. Example of experimental UMI family size distribution. Example UMI family size distribution of 10 ERBB2 amplicons and 10 Reference amplicons in the same NGS library. We used a normal cell line gDNA NA18562 (purchased from Coriell) as template input for the 20-plex QASeq experiment; input sample contains 2500 haploid genomic copies. The prepared NGS library was sequenced using 1.5 million reads, by Illumina MiSeq Reagent Kit v3 (150-cycle). The fractions of accepted and discarded UMIs are shown as pie charts. Among all the UMIs, about 20% are discarded due to PCR or sequencing errors (i.e. G bases are found in poly(H) UMIs); about 40% are discarded due to small family size (<3).



FIG. 6. Example of experimental unique UMI number for different loci. Example unique UMI number of each locus, corresponding to data shown in FIG. 5; white bars are ERBB2 amplicons, and grey bars are Reference amplicons. Input sample contains 2500 haploid genomic copies. The prepared NGS library was sequenced using 1.5 million reads, by Illumina MiSeq Reagent Kit v3 (150-cycle).



FIG. 7. Experimental calibration results on normal cell line gDNA NA18562 and simulated theoretical standard deviation limit. Standard deviation of CNV ratio (σCNV ratio) is plotted against input molecule number. LoD can be approximated as 3σCNV ratio. We performed 5 replicated experiments for each different input amount (75, 250, 750, and 2500 haploid genomic copies); experimental results are plotted as cross symbols. A simulation was performed assuming Poisson distribution of sampled molecule number; the simulated σCNV ratio (plotted as dashed line) is the theoretical lower limit due to stochasticity of sampling.



FIGS. 8A-C. Example of experimental results of CNV detection on FFPE samples. We tested 2 lung cancer FFPE slides from the same tumor, in which ERBB2 CNVs are not likely to occur. Input extracted DNA samples contain 2500 haploid genomic copies for each NGS library. The prepared NGS library was sequenced using 1.5 million reads, by Illumina MiSeq Reagent Kit v3 (150-cycle). (FIG. 8A) Example distribution of UMI family size is plotted for amplicons ERBB2_1 and Reference_1; the fractions of accepted and discarded UMIs are shown as pie charts. (FIG. 8B) Example unique UMI numbers for each amplicon region. White bars are ERBB2 amplicons; grey bars are Reference amplicons. (FIG. 8C) CNV ratios are plotted for 2 FFPE slides from the same lung cancer tumor. CNV of ERBB2 is not detected in these FFPE slides using QASeq based on previous calibration data. Mean and LoD=3σCNV ratio are calculated based on the data of 750-genomic copy input cell line gDNA libraries (see FIG. 7), which have similar unique UMI numbers to the FFPE samples.



FIGS. 9A-E. Primer dimer reduction using primary experimental workflow. (FIG. 9A) The simplest workflow we have tested was a one-pot reaction: after UMI addition, index primers were directly added into the reaction as an open-tube step on the thermocycler, and index PCR (i.e., universal PCR) was performed afterwards. On-target rate was low (0.5%) for this workflow; off-target NGS reads were mostly primer dimers. (FIG. 9B) A SPRI purification step was added after 6 cycles of universal PCR to reduce primer dimers; on-target rate was improved to 20%. (FIG. 9C) A size selection step using agarose gel was added after index PCR to further reduce primer dimers; on-target rate was improved compared to FIG. 9B but still lower than 50%. (FIG. 9D) The primary experimental workflow, which includes both adapter replacement and purification after universal PCR, has a high average on-target rate of 66%. (FIG. 9E) Source of primer dimers in the workflows of FIGS. 9A-D.



FIGS. 10A-C. Example workflows that do not require NGS index PCR. (FIG. 10A) The index and P5 sequences are added onto the 5′ end of UfP; the other index and P7 sequences are added onto the 5′ end of SrPB. The amplicons obtained from adapter replacement contain P5, P7, and dual index, and thus are ready for sequencing. (FIG. 10B) The index and P7 sequences are added onto the 5′ end of SrPB, and the index primer is added together with SrPB in the adapter replacement step. The amplicons are ready for sequencing. (FIG. 10C) The index and P5 sequences are added onto the 5′ end of SfP; a primer bearing the P5 sequence is used as UfP in the universal PCR step. The other index and P7 sequences are added onto the 5′ end of SrPB. The amplicons are ready for sequencing.



FIG. 11. A variant of QASeq primer design and workflow. Each primer set contains 3 different oligos: a Specific Forward Primer (SfP), a Specific Reverse Primer A (SrPA), and a Specific Reverse Primer B (SrPB). Compared to the original design, SrPA only needs the template-binding region, and Universal Reverse Primer (UrP) is not necessary. Each QASeq panel only needs one Universal Forward Primer (UfP); there can be additional bases at 5′-end of region 1 in UfP. Compared to the original experimental workflow, more cycles of PCR are needed in the universal PCR step; ≥10 cycles are recommended.



FIGS. 12A-B. Data analysis for QASeq-based allele ratio quantitation. (FIG. 12A) Data analysis workflow for allele ratio quantitation. NGS reads in the FASTQ output file are analyzed to generate allele ratio between different genetic identities. Allele ratio for each targeted locus is calculated as Rallele=N1/N2, where N1 is unique UMI number for the first genetic identity, and N2 is unique UMI number for the second genetic identity. (FIG. 12B) Genetic identity calling for each UMI family based on majority vote.



FIG. 13. Example of experimental results of CNV detection on spike-in clinical FFPE samples. Two previously characterized FFPE DNA samples (1 “normal” sample and 1 “ERBB2 amplified abnormal” sample) were mixed to generate 2.5%, 5%, and 10% ERBB2 FEC samples. The “normal” sample has an ERBB2 FEC of 0%, and the “ERBB2 amplified abnormal” sample has an ERBB2 FEC of 78%. The experimental normalized FEC values were plotted against the expected ERBB2 FEC. The “normal” sample was tested in 5 replicates, and the LoD of the 100-plex CNV panel was estimated as 3 standard deviations of the “normal” sample. CNVs in the 2.5%, 5%, and 10% ERBB2 FEC samples were successfully detected, because their calculated FEC values were outside the 3-standard-deviation range.



FIG. 14. Bioinformatics workflow for mutation quantitation using QASeq. Shown is a summary of the data processing workflow for mutation quantitation.



FIG. 15. Observed molecule number for the 179-plex comprehensive panel. The input was 8.3 ng (5000 expected molecule number) of 100% Multiplex I Wild Type cfDNA Reference Standard (Horizon Discovery). The conversion rate has an average of 62%; 97% of the plexes have >10% conversion rate.



FIG. 16. Error rates for the 179-plex comprehensive panel. The input was 8.3 ng of 100% Multiplex I Wild Type cfDNA Reference Standard (Horizon Discovery); the same sample was tested in triplicate. Error rates in 3840 different loci (after error correction using UMI) were plotted. Highest error rates were 0.23%, 0.20%, and 0.23%, and average error rates were 0.006%, 0.005%, and 0.005% for the 3 replicates.



FIG. 17. Mutation quantitation results for the 179-plex comprehensive panel. The sample used was a 0.3% cfDNA Reference Standard (created by mixing 0.1% Multiplex I cfDNA Reference Standard and 1% Multiplex I cfDNA Reference Standard from Horizon Discovery), tested in triplicate. The experimental VAFs of 6 mutations were generally consistent with the expected VAFs; the difference was mostly due to stochasticity in sampling a small number (≤9) of mutation molecules.





DETAILED DESCRIPTION

Provided herein are methods of quantitative amplicon sequencing, for labeling each strand of targeted genomic loci in the original DNA sample with an oligonucleotide barcode sequence by polymerase chain reaction, and amplifying the genomic region(s) for high-throughput sequencing. Also provided herein are methods to allow the simultaneous detection of copy number variation (CNV) in a set of genes of interest, by quantitating the frequency of extra copies of each gene. Quantitation of the allele ratio of different genetic identities for targeted genomic loci using multiplexed PCR is also provided by the disclosed methods. These methods can be applied to the detection of CNV for gene(s) of interest in tumor samples, guiding the choice of targeted therapy, and helping the understanding of cancer formation and progression.


The current standard method for prenatal diagnosis of monogenic diseases is to sequence fetal genetic material obtained by chorionic villus sampling or amniocentesis, both of which are invasive and carry risk. Genetic noninvasive prenatal testing (NIPT) of monogenic disease is based on the circulation of fetal-derived cell-free DNA (cfDNA) in maternal plasma. Due to the presence of background maternal DNA, it is challenging to confidently detect the allele ratio change arising from fetal cfDNA, especially when the maternal DNA is heterozygous at the locus of interest. Droplet digital PCR (ddPCR) has been used to quantify the allele ratio between mutant alleles carrying disease-causing mutations and wild type alleles for NIPT (Lun et al., 2008), but its practical feasibility is limited by the precision and reliability of the technology. QASeq enables absolute quantitation of DNA molecules by adding a unique molecular identifier (UMI) to each strand of the original input molecules, and can therefore be applied to allele ratio quantitation, which aims to quantify the ratio of DNA molecules with different genetic identities. Accurate allele ratio quantitation is key to NIPT of monogenic diseases, such as β-thalassemia and cystic fibrosis.
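The allele ratio calculation described for FIGS. 12A-B (Rallele=N1/N2, with the genetic identity of each UMI family called by majority vote) can be illustrated with a minimal Python sketch. The function name, the input format (one UMI and one allele call per read), and the choice to discard tied families are assumptions made for illustration only; they are not taken from the disclosed data analysis workflow.

```python
from collections import Counter, defaultdict

def allele_ratio(reads, allele_1, allele_2):
    """Count unique UMIs per genetic identity and return (N1, N2, N1/N2).

    reads: iterable of (umi, allele_call) pairs for one targeted locus
    (hypothetical input format).
    """
    # Group reads into UMI families and tally the allele calls in each family.
    families = defaultdict(Counter)
    for umi, call in reads:
        families[umi][call] += 1

    unique = Counter()
    for calls in families.values():
        ranked = calls.most_common(2)
        # Assumption: a family with a tied vote has no majority and is skipped.
        if len(ranked) == 2 and ranked[0][1] == ranked[1][1]:
            continue
        unique[ranked[0][0]] += 1  # one unique molecule for the majority call

    n1, n2 = unique[allele_1], unique[allele_2]
    ratio = n1 / n2 if n2 else float("nan")
    return n1, n2, ratio

example_reads = [
    ("U1", "A"), ("U1", "A"), ("U1", "G"),  # family U1: majority A
    ("U2", "G"),                            # family U2: G
    ("U3", "A"), ("U3", "G"),               # family U3: tie, skipped
    ("U4", "A"),                            # family U4: A
]
print(allele_ratio(example_reads, "A", "G"))
```

Because each UMI family is counted once regardless of how many reads it contains, the ratio in this sketch reflects input molecules rather than PCR-amplified read counts.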


I. FREQUENCY OF EXTRA COPIES OF CNVS

The frequency of extra copies (FEC) of a CNV in a genomic DNA sample is defined herein as:






\[
\mathrm{FEC} = \frac{\text{Copies of the target genomic region} - \text{Haploid genomic copies}}{\text{Haploid genomic copies}}
\]






A positive value of FEC indicates amplification of the target genomic region in the sample, and a negative value of FEC indicates deletion of the target genomic region in the sample.


While QASeq can be used to quantitate FEC, it does not provide information on the percentage of cells containing a CNV in the tumor tissue sample. For example, if 1% of cells in a tumor sample contain 4 copies of ERBB2 and the remaining 99% of cells contain 2 copies, the FEC is 1%; if 0.5% of cells in the sample contain 6 copies of ERBB2 and the remaining 99.5% of cells contain 2 copies, the FEC is still 1%. Additionally, QASeq does not provide information on the genomic locations of the extra copies.
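The two ERBB2 examples above can be checked numerically. The following Python sketch applies the FEC definition to a mixture of cell populations; the function name and input format are illustrative, and it assumes a diploid background (2 haploid genomic copies per cell).

```python
def fec_from_cell_mixture(populations):
    """Frequency of extra copies for a mixture of cell populations.

    populations: list of (fraction_of_cells, target_copies_per_cell).
    Assumption: every cell contributes 2 haploid genomic copies (diploid).
    """
    target_copies = sum(frac * copies for frac, copies in populations)
    haploid_copies = 2 * sum(frac for frac, _ in populations)
    return (target_copies - haploid_copies) / haploid_copies

# 1% of cells with 4 copies, 99% with 2 copies
print(fec_from_cell_mixture([(0.01, 4), (0.99, 2)]))
# 0.5% of cells with 6 copies, 99.5% with 2 copies
print(fec_from_cell_mixture([(0.005, 6), (0.995, 2)]))
# 10% of cells with a single-copy deletion gives a negative FEC
print(fec_from_cell_mixture([(0.10, 1), (0.90, 2)]))
```

The first two calls both evaluate to approximately 0.01 (1%), which is why FEC alone cannot distinguish the two scenarios.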


II. MULTIPLEXED PCR PANEL DESIGN

In a QASeq multiplexed PCR panel, each target gene needs M sets of primers (M = 1 to 1000), each amplifying a non-overlapping small region (40 nt to 500 nt, usually ≤200 nt) in the target gene region. If the panel has multiple target genes, the number of primer sets used for each gene is similar (≈M). The panel also contains a similar number (≈M) of primer sets amplifying reference genomic regions. The reference loci serve as internal standards for the amount of genomic DNA (gDNA) loaded, so that accurate quantitation of the DNA concentration in the sample is not needed. At least one reference primer set may be used for each panel. Because increasing either the number of input molecules or the number of loci in the target gene decreases random sampling variation, a greater number of primer sets per gene can be used to improve the limit of detection (LoD) for sample types containing smaller amounts of DNA; the number of reference primer sets needs to be increased proportionally in this case.
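The effect of the number of primer sets on random sampling variation can be approximated with a rough Python sketch. It rests on assumptions that are not part of the disclosure: molecule counts at each locus are treated as independent Poisson variables, every locus captures the same number of molecules, and the target-to-reference ratio is formed from summed counts, so its relative standard deviation is approximately sqrt(1/N_target + 1/N_reference).

```python
import math

def ratio_relative_sd(molecules_per_locus, n_target_sets, n_reference_sets):
    """Approximate relative SD of the target/reference molecule-count ratio.

    Assumes independent Poisson sampling at every locus and equal capture
    at each locus (illustrative simplification).
    """
    target_total = molecules_per_locus * n_target_sets
    reference_total = molecules_per_locus * n_reference_sets
    return math.sqrt(1.0 / target_total + 1.0 / reference_total)

# Same input amount, 10 versus 40 primer sets per gene (and per reference)
for m in (10, 40):
    print(m, ratio_relative_sd(2500, m, m))
```

Under these assumptions, quadrupling the number of target and reference primer sets halves the relative standard deviation, which has the same effect as quadrupling the number of input molecules.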


Each primer set contains three different oligos: a Specific Forward Primer (SfP), a Specific Reverse Primer A (SrPA), and a Specific Reverse Primer B (SrPB) (see FIG. 1). SfP comprises, from 5′ to 3′, regions 1, 2, 3, and 4. Region 4 is the template-binding region; region 3 is the UMI; region 1 is a full or partial NGS adapter; region 2 is an optional spacer region (typically 0-15 nt) added for uniform amplification of UMIs. SrPA comprises, from 5′ to 3′, regions 5, 6, and 7. Region 7 is the template-binding region; region 5 is the custom adapter (i.e., a sequence that is different from the NGS adapters and cannot be found in the human genome) for universal amplification; region 6 is an optional spacer region (typically 0-15 nt) added for uniform amplification of different loci. SrPB comprises, from 5′ to 3′, regions 8, 9, and 10. Region 10 is the template-binding region, the 3′-end of which is closer to region 4 than that of region 7 by at least 1 base; region 8 is a full or partial NGS adapter; region 9 is an optional spacer region (typically 0-15 nt) added for uniform amplification of different loci. Each QASeq panel only needs one Universal Forward Primer (UfP) and one Universal Reverse Primer (UrP). UfP comprises region 1, and UrP comprises region 5; there can be additional bases at the 5′-end of region 1 or region 5 in UfP or UrP. The melting temperatures (Tm) of template-binding regions 4, 7, and 10 are about the same as the PCR annealing temperature, and the Tm values of UfP and UrP are not lower than those of regions 4, 7, and 10 in the experimental PCR conditions.
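As an illustration of the SfP layout, the Python sketch below assembles regions 1 to 4 and counts how many distinct UMIs a degenerate UMI pattern can encode. The split of the SfP-ERBB2-1 sequence (SEQ ID NO: 1) into a 20-nt region 1, a 4-nt spacer "ATCA", a 15-nt "H" UMI, and the remaining template-binding region is an assumption inferred from the listed sequences, not a boundary stated in the text. "H" is the IUPAC code for A, C, or T.

```python
# IUPAC degenerate base codes used in this sketch
IUPAC = {"A": "A", "C": "C", "G": "G", "T": "T", "H": "ACT", "N": "ACGT"}

def assemble_sfp(region1, region2, region3, region4):
    """Concatenate SfP regions 5' to 3': adapter, spacer, UMI, template-binding."""
    return region1 + region2 + region3 + region4

def umi_diversity(umi_pattern):
    """Number of distinct sequences a degenerate UMI pattern can take."""
    count = 1
    for base in umi_pattern:
        count *= len(IUPAC[base])
    return count

# Assumed region boundaries for SfP-ERBB2-1 (illustrative only)
region1 = "ACACGACGCTCTTCCGATCT"          # partial NGS adapter
region2 = "ATCA"                          # spacer
region3 = "H" * 15                        # UMI
region4 = "AATCATAAAAGCTAACATATAGCCTGGG"  # template-binding region

sfp = assemble_sfp(region1, region2, region3, region4)
ufp = "ACACTCTTTCCCTACACGACGCTCTTCCGATCT"  # SEQ ID NO: 61

print(sfp)
print(umi_diversity(region3))   # 3**15 possible UMIs
print(ufp.endswith(region1))    # UfP carries region 1 at its 3' end
```

With a 15-nt "H" UMI there are 3^15 (about 14.3 million) possible UMI sequences, far more than the few thousand input molecules per locus in the examples.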


When designing primers, single nucleotide polymorphisms (SNPs) with significant minor allele frequency (MAF) should be avoided in the primer-binding regions, so that the primers' binding affinities will not likely be affected by nucleotide sequence variations in different patient samples. In addition, whole human genome nucleotide sequences should be searched to ensure that the primers are not prone to nonspecific amplification of non-target regions.


In an example panel targeting CNVs of ERBB2 in Formalin-Fixed Paraffin-Embedded (FFPE) specimens of tumor samples, 10 sets of primers, each amplifying a 60 to 70 nt amplicon, were designed in the ERBB2 gene region. In addition, 10 sets of reference primers were designed, each amplifying a region in a different housekeeping gene on a different chromosome (Table 1). Primers were designed automatically using MATLAB code to satisfy the above-mentioned design principles while minimizing primer interactions. In addition, non-pathogenic SNPs with >0.2% MAF in the population were avoided. The online tool Primer-BLAST was used to ensure that each primer set has only one amplicon in the human genome. Primer sequences are shown in Table 2.
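The SNP-avoidance rule can be expressed as a simple filter. The Python sketch below is not the MATLAB design code mentioned above; the function name, the coordinate convention (inclusive genomic positions), and the SNP input format are hypothetical and serve only to illustrate the >0.2% MAF criterion.

```python
def primer_avoids_snps(primer_start, primer_end, snps, maf_cutoff=0.002):
    """Return True if no SNP above the MAF cutoff lies in the primer-binding region.

    primer_start, primer_end: inclusive coordinates of the binding region.
    snps: iterable of (position, minor_allele_frequency) pairs (hypothetical format).
    """
    return not any(
        primer_start <= position <= primer_end and maf > maf_cutoff
        for position, maf in snps
    )

# A rare SNP inside the region and a common SNP outside it: acceptable
print(primer_avoids_snps(100, 120, [(110, 0.001), (130, 0.30)]))
# A common SNP inside the region: rejected
print(primer_avoids_snps(100, 120, [(105, 0.05)]))
```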









TABLE 1
Locations of amplicons

Amplicon name | Chromosome | Gene
ERBB2 1~10 | Chr. 17 | ERBB2
Reference 1 | Chr. 1 | PSMB2
Reference 2 | Chr. 3 | RPL32
Reference 3 | Chr. 5 | RACK1
Reference 4 | Chr. 6 | TBP
Reference 5 | Chr. 9 | VCP
Reference 6 | Chr. 11 | HMBS
Reference 7 | Chr. 12 | NACA
Reference 8 | Chr. 15 | B2M
Reference 9 | Chr. 19 | GPI
Reference 10 | Chr. 20 | TOP1

















TABLE 2
Primer sequences in an exemplary QASeq panel

Name | Sequence | SEQ ID NO:
SfP-ERBB2-1 | ACACGACGCTCTTCCGATCTATCAHHHHHHHHHHHHHHHAATCATAAAAGCTAACATATAGCCTGGG | 1
SfP-ERBB2-2 | ACACGACGCTCTTCCGATCTATCAHHHHHHHHHHHHHHHGCTGACTTGGGGACACAGG | 2
SfP-ERBB2-3 | ACACGACGCTCTTCCGATCTATCAHHHHHHHHHHHHHHHCTTTGCAAGATGGAGGTTGCA | 3
SfP-ERBB2-4 | ACACGACGCTCTTCCGATCTATCAHHHHHHHHHHHHHHHCTTGCCCTACCAGCCTCTC | 4
SfP-ERBB2-5 | ACACGACGCTCTTCCGATCTATCAHHHHHHHHHHHHHHHCCACAACTGGAATCTGACGC | 5
SfP-ERBB2-6 | ACACGACGCTCTTCCGATCTATCAHHHHHHHHHHHHHHHGGCTGCGGATTGTGCG | 6
SfP-ERBB2-7 | ACACGACGCTCTTCCGATCTATCAHHHHHHHHHHHHHHHCAGATATAAGGGCCAAAAGTTACACA | 7
SfP-ERBB2-8 | ACACGACGCTCTTCCGATCTATCAHHHHHHHHHHHHHHHTGCTTTGGTCTCCCTTTTTGC | 8
SfP-ERBB2-9 | ACACGACGCTCTTCCGATCTATCAHHHHHHHHHHHHHHHGAATGAAATTAAACAGGGCTTGGC | 9
SfP-ERBB2-10 | ACACGACGCTCTTCCGATCTATCAHHHHHHHHHHHHHHHAAAGAAAAAAAAAAAGAATATGGGTCCAGA | 10
SfP-Ref-1 | ACACGACGCTCTTCCGATCTATCAHHHHHHHHHHHHHHHGCACAACATTTTGTCTCCGGAAAATA | 11
SfP-Ref-2 | ACACGACGCTCTTCCGATCTATCAHHHHHHHHHHHHHHHGACAAATGCCCAGAAATGGAACTTA | 12
SfP-Ref-3 | ACACGACGCTCTTCCGATCTATCAHHHHHHHHHHHHHHHATGCGGTTTCACCATTGGC | 13
SfP-Ref-4 | ACACGACGCTCTTCCGATCTATCAHHHHHHHHHHHHHHHCCCAAGGAATTGAGGAAGTTGCT | 14
SfP-Ref-5 | ACACGACGCTCTTCCGATCTATCAHHHHHHHHHHHHHHHACTGGAATGCTGTTCCTTACAATCA | 15
SfP-Ref-6 | ACACGACGCTCTTCCGATCTATCAHHHHHHHHHHHHHHHGGCCTGGATAGGCAGCTTG | 16
SfP-Ref-7 | ACACGACGCTCTTCCGATCTATCAHHHHHHHHHHHHHHHCAACTCAGACTATTCAGGAATACGTTT | 17
SfP-Ref-8 | ACACGACGCTCTTCCGATCTATCAHHHHHHHHHHHHHHHTCCATCCGACATTGAAGTTGACTTA | 18
SfP-Ref-9 | ACACGACGCTCTTCCGATCTATCAHHHHHHHHHHHHHHHGAATGAAGCCCTAATCCCTTAAGC | 19
SfP-Ref-10 | ACACGACGCTCTTCCGATCTATCAHHHHHHHHHHHHHHHCAGGCAGAGGAAATATCGTTGAC | 20
SrPA-ERBB2-1 | GGATATTCCTTTCTACTCTTTGACATCATCTATGCATGCAAAACACCACAAAC | 21
SrPA-ERBB2-2 | GGATATTCCTTTCTACTCTTTGACATCATCTCAGATCTGGCCCAGCACC | 22
SrPA-ERBB2-3 | GGATATTCCTTTCTACTCTTTGACATCATCTCCTGGCAGGCACTCTCG | 23
SrPA-ERBB2-4 | GGATATTCCTTTCTACTCTTTGACATCATCTCCTAAGGTCAAATCCTAGGGGG | 24
SrPA-ERBB2-5 | GGATATTCCTTTCTACTCTTTGACATCATCTCGGGGCTCTGGTCATTGC | 25
SrPA-ERBB2-6 | GGATATTCCTTTCTACTCTTTGACATCATCTTCAGCGGGTCTCCATTGTCTA | 26
SrPA-ERBB2-7 | GGATATTCCTTTCTACTCTTTGACATCATCTGCTTGGTGGTTAAGAGACTGTGG | 27
SrPA-ERBB2-8 | GGATATTCCTTTCTACTCTTTGACATCATCTCCATTTACCCCTCACAACAACCA | 28
SrPA-ERBB2-9 | GGATATTCCTTTCTACTCTTTGACATCATCTCGAGTAACAACAGTCACTGCTC | 29
SrPA-ERBB2-10 | GGATATTCCTTTCTACTCTTTGACATCATCTATGTTTTTCCATGTTCTAACACCGT | 30
SrPA-Ref-1 | GGATATTCCTTTCTACTCTTTGACATCATCTGCTCCAGATGGGCAGCAC | 31
SrPA-Ref-2 | GGATATTCCTTTCTACTCTTTGACATCATCTTTGGCAGTCTTTAAGATCCATAGAAATAC | 32
SrPA-Ref-3 | GGATATTCCTTTCTACTCTTTGACATCATCTACTTTGGAAGGCAGAGGCG | 33
SrPA-Ref-4 | GGATATTCCTTTCTACTCTTTGACATCATCTTGGAACTCGTCTCACTATTCAATTTTT | 34
SrPA-Ref-5 | GGATATTCCTTTCTACTCTTTGACATCATCTCTGCTTGTGGATGAGGCCATA | 35
SrPA-Ref-6 | GGATATTCCTTTCTACTCTTTGACATCATCTAGGCAGTCACTGTTCCTTTCC | 36
SrPA-Ref-7 | GGATATTCCTTTCTACTCTTTGACATCATCTATGCATTTACTTCTGAAACAGTCCTT | 37
SrPA-Ref-8 | GGATATTCCTTTCTACTCTTTGACATCATCTCAAGTCTGAATGCTCCACTTTTTCA | 38
SrPA-Ref-9 | GGATATTCCTTTCTACTCTTTGACATCATCTGTCTCATTCTAGAAAGAAGTTAACTCATTATACA | 39
SrPA-Ref-10 | GGATATTCCTTTCTACTCTTTGACATCATCTAGGAATCAACAAATGACAAGGCAAAT | 40
SrPB-ERBB2-1 | AGACGTGTGCTCTTCCGATCTCATGCAAAACACCACAAACAGTTC | 41
SrPB-ERBB2-2 | AGACGTGTGCTCTTCCGATCTGATCTGGCCCAGCACCTTAA | 42
SrPB-ERBB2-3 | AGACGTGTGCTCTTCCGATCTCTCTCGGTGGATCTGCATAACAT | 43
SrPB-ERBB2-4 | AGACGTGTGCTCTTCCGATCTGTCAAATCCTAGGGGGTAATACGA | 44
SrPB-ERBB2-5 | AGACGTGTGCTCTTCCGATCTCTGGTCATTGCAGAGACCTCT | 45
SrPB-ERBB2-6 | AGACGTGTGCTCTTCCGATCTTCTCCATTGTCTAGCACGGC | 46
SrPB-ERBB2-7 | AGACGTGTGCTCTTCCGATCTAGACTGTGGAGTCTGAAACTCAG | 47
SrPB-ERBB2-8 | AGACGTGTGCTCTTCCGATCTCCCCTCACAACAACCAGACG | 48
SrPB-ERBB2-9 | AGACGTGTGCTCTTCCGATCTGTCACTGCTCTGTAGAAAGCCT | 49
SrPB-ERBB2-10 | AGACGTGTGCTCTTCCGATCTATGTTCTAACACCGTGATCTGGAT | 50
SrPB-Ref-1 | AGACGTGTGCTCTTCCGATCTATGGGCAGCACAGTGGG | 51
SrPB-Ref-2 | AGACGTGTGCTCTTCCGATCTGCAGTCTTTAAGATCCATAGAAATACTCTT | 52
SrPB-Ref-3 | AGACGTGTGCTCTTCCGATCTCAGAGGCGAGTGGATCACTT | 53
SrPB-Ref-4 | AGACGTGTGCTCTTCCGATCTACTATTCAATTTTTTCCTAGAGCATCTCC | 54
SrPB-Ref-5 | AGACGTGTGCTCTTCCGATCTGGCCATAGAAAGGGTAGTGTTGAA | 55
SrPB-Ref-6 | AGACGTGTGCTCTTCCGATCTTTCCTTTCCTCCTCCTCCCAT | 56
SrPB-Ref-7 | AGACGTGTGCTCTTCCGATCTGCATTTACTTCTGAAACAGTCCTTAATG | 57
SrPB-Ref-8 | AGACGTGTGCTCTTCCGATCTAATGCTCCACTTTTTCAATTCTCTCT | 58
SrPB-Ref-9 | AGACGTGTGCTCTTCCGATCTATTCTAGAAAGAAGTTAACTCATTATACACAGT | 59
SrPB-Ref-10 | AGACGTGTGCTCTTCCGATCTACAAATGACAAGGCAAATGAGACAT | 60
UfP | ACACTCTTTCCCTACACGACGCTCTTCCGATCT | 61
UrP | CCTATGGTAGTTAAATGTACATTGGATATTCCTTTCTACTCTTTGACATCATCT | 62
















TABLE 3
Primer sequences in the 179-plex comprehensive panel

Name | Primer sequence | SEQ ID NO:
UfP | ACACTCTTTCCCTACACGACGCTCTTCCGATCT | 61
UrP | CCTATGGTAGTTAAATGTACATTGGATATTCCTTTCTACTCTTTGACATCATCT | 62
fP-ERBB2-1 | ACACGACGCTCTTCCGATCTATCAHHHHHHHHHHHHHHHCCTTAGACAACTACCTTTCTACGGAC | 63
fP-ERBB2-10 | ACACGACGCTCTTCCGATCTATCAHHHHHHHHHHHHHHHTGCTTTGGTCTCCCTTTTTGC | 64
fP-ERBB2-11 | ACACGACGCTCTTCCGATCTATCAHHHHHHHHHHHHHHHAAAGAAAAAAAAAAAGAATATGGGTCCAGA | 65
fP-ERBB2-12 | ACACGACGCTCTTCCGATCTATCAHHHHHHHHHHHHHHHCGAGGCGATAGGGTTAAGGG | 66
fP-ERBB2-13 | ACACGACGCTCTTCCGATCTATCAHHHHHHHHHHHHHHHCTTCTAGTCGCAATTGAAGTACCAC | 67
fP-ERBB2-14 | ACACGACGCTCTTCCGATCTATCAHHHHHHHHHHHHHHHCCTCACCCCTTGTCAACTTTTC | 68
fP-ERBB2-15 | ACACGACGCTCTTCCGATCTATCAHHHHHHHHHHHHHHHGTCTGGTGCTTTAGCCCAAAG | 69
fP-ERBB2-16 | ACACGACGCTCTTCCGATCTATCAHHHHHHHHHHHHHHHAAAGCAAAGCTATATTCAAGACCACAT | 70
fP-ERBB2-17 | ACACGACGCTCTTCCGATCTATCAHHHHHHHHHHHHHHHGGCATTGTCTGCCAGTCCG | 71
fP-ERBB2-18 | ACACGACGCTCTTCCGATCTATCAHHHHHHHHHHHHHHHTCCTTTAGCTCGTGGAATCTCAAG | 72
fP-ERBB2-19 | ACACGACGCTCTTCCGATCTATCAHHHHHHHHHHHHHHHCTGGGGCATTCCAACTAGAACT | 73
fP-ERBB2-2 | ACACGACGCTCTTCCGATCTATCAHHHHHHHHHHHHHHHATTCCAGTGGCCATCAAAGTGT | 74
fP-ERBB2-20 | ACACGACGCTCTTCCGATCTATCAHHHHHHHHHHHHHHHGGGAAAACCATTATTTGATATTAAAACAAATAGG | 75
fP-ERBB2-21 | ACACGACGCTCTTCCGATCTATCAHHHHHHHHHHHHHHHAGGAAGTATAAGAATGAAGTTGTGAAGC | 76
fP-ERBB2-22 | ACACGACGCTCTTCCGATCTATCAHHHHHHHHHHHHHHHCTCCCCGCTCCCCTTCA | 77
fP-ERBB2-23 | ACACGACGCTCTTCCGATCTATCAHHHHHHHHHHHHHHHAGCCTGGGCCAGGTATACT | 78
fP-ERBB2-24 | ACACGACGCTCTTCCGATCTATCAHHHHHHHHHHHHHHHACTCTGTCCTCTGCAGGAACT | 79
fP-ERBB2-25 | ACACGACGCTCTTCCGATCTATCAHHHHHHHHHHHHHHHGTATGGGTTTTACAAATTGCAGCAAATA | 80
fP-ERBB2-26 | ACACGACGCTCTTCCGATCTATCAHHHHHHHHHHHHHHHCAAAGCATGTTTAATTTTCTCGTGGTT | 81
fP-ERBB2-27 | ACACGACGCTCTTCCGATCTATCAHHHHHHHHHHHHHHHGCGTGAGGGGCCAGTGT | 82
fP-ERBB2-28 | ACACGACGCTCTTCCGATCTATCAHHHHHHHHHHHHHHHGGACACAGGTCATTTTACTGTAGTATTC | 83
fP-ERBB2-29 | ACACGACGCTCTTCCGATCTATCAHHHHHHHHHHHHHHHCCACCCGTTCTGACCCTC | 84
fP-ERBB2-3 | ACACGACGCTCTTCCGATCTATCAHHHHHHHHHHHHHHHCAGGAAGCATACGTGATGGCT | 85
fP-ERBB2-30 | ACACGACGCTCTTCCGATCTATCAHHHHHHHHHHHHHHHACCTGCAGTGTGCAAGGG | 86
fP-ERBB2-31 | ACACGACGCTCTTCCGATCTATCAHHHHHHHHHHHHHHHGCGTCTGTGTTTCCGCTAAATC | 87
fP-ERBB2-32 | ACACGACGCTCTTCCGATCTATCAHHHHHHHHHHHHHHHAAGATCTCCAAGTACTGGGGAAC | 88
fP-ERBB2-33 | ACACGACGCTCTTCCGATCTATCAHHHHHHHHHHHHHHHTGGCCTTCACCGTCATTGAAA | 89
fP-ERBB2-34 | ACACGACGCTCTTCCGATCTATCAHHHHHHHHHHHHHHHGCAGATATAAGGGCCAAAAGTTACAC | 90
fP-ERBB2-35 | ACACGACGCTCTTCCGATCTATCAHHHHHHHHHHHHHHHCAGCTGGCTCTCACACTGAT | 91
fP-ERBB2-36 | ACACGACGCTCTTCCGATCTATCAHHHHHHHHHHHHHHHCCACCCCTGTTCTCCGATG | 92
fP-ERBB2-37 | ACACGACGCTCTTCCGATCTATCAHHHHHHHHHHHHHHHCTCCTAAATGTTAGCTTTTATTCTATAGCCT | 93
fP-ERBB2-38 | ACACGACGCTCTTCCGATCTATCAHHHHHHHHHHHHHHHAGTCTCTGCCTTCTACTCTCTACC | 94
fP-ERBB2-39 | ACACGACGCTCTTCCGATCTATCAHHHHHHHHHHHHHHHGCCTTTGGTGGGTGGGG | 95
fP-ERBB2-4 | ACACGACGCTCTTCCGATCTATCAHHHHHHHHHHHHHHHGATGAGCTACCTGGAGGATGTG | 96
fP-ERBB2-40 | ACACGACGCTCTTCCGATCTATCAHHHHHHHHHHHHHHHCAGCCAGTTCCCTGGTTCA | 97
fP-ERBB2-41 | ACACGACGCTCTTCCGATCTATCAHHHHHHHHHHHHHHHCCCTTCAGACTATGAAAAGGTTCTAAG | 98
fP-ERBB2-42 | ACACGACGCTCTTCCGATCTATCAHHHHHHHHHHHHHHHACAGTGCTGGCAATGTTTATCAC | 99
fP-ERBB2-43 | ACACGACGCTCTTCCGATCTATCAHHHHHHHHHHHHHHHGGGTGGTTCCCAGAATTGTTG | 100
fP-ERBB2-44 | ACACGACGCTCTTCCGATCTATCAHHHHHHHHHHHHHHHCTTCAAAGTTCTGGTGTCGGG | 101
fP-ERBB2-45 | ACACGACGCTCTTCCGATCTATCAHHHHHHHHHHHHHHHTGACCTGTGGGTGGAAATTTTG | 102
fP-ERBB2-46 | ACACGACGCTCTTCCGATCTATCAHHHHHHHHHHHHHHHAGAGGGTTCTGATTGCCTACAAG | 103
fP-ERBB2-47 | ACACGACGCTCTTCCGATCTATCAHHHHHHHHHHHHHHHGGGATCCTCATCAAGCGACG | 104
fP-ERBB2-48 | ACACGACGCTCTTCCGATCTATCAHHHHHHHHHHHHHHHCCCTTTTACAGTCAAAGTCCAAAGC | 105
fP-ERBB2-49 | ACACGACGCTCTTCCGATCTATCAHHHHHHHHHHHHHHHGGGTCGTCAAAGACGTTTTTGC | 106
fP-ERBB2-5 | ACACGACGCTCTTCCGATCTATCAHHHHHHHHHHHHHHHGATGGCGCTGGAGTCCATT | 107
fP-ERBB2-50 | ACACGACGCTCTTCCGATCTATCAHHHHHHHHHHHHHHHACCTGTCCTAAGGAACCTTCCT | 108
fP-ERBB2-6 | ACACGACGCTCTTCCGATCTATCAHHHHHHHHHHHHHHHCTTTGCAAGATGGAGGTTGCA | 109
fP-ERBB2-7 | ACACGACGCTCTTCCGATCTATCAHHHHHHHHHHHHHHHCTTGCCCTACCAGCCTCTC | 110
fP-ERBB2-8 | ACACGACGCTCTTCCGATCTATCAHHHHHHHHHHHHHHHCCACAACTGGAATCTGACGC | 111
fP-ERBB2-9 | ACACGACGCTCTTCCGATCTATCAHHHHHHHHHHHHHHHGGCTGCGGATTGTGCG | 112
fP-Mut1 | ACACGACGCTCTTCCGATCTATCAHHHHHHHHHHHHHHHGGGACCCACTCCATCGAGA | 113
fP-Mut2 | ACACGACGCTCTTCCGATCTATCAHHHHHHHHHHHHHHHGGAGTATTTCATGAAACAAATGAATGATGC | 114
fP-Mut4 | ACACGACGCTCTTCCGATCTATCAHHHHHHHHHHHHHHHGCCGCCAGGTCTTGATGTACT | 115
fP-Mut5 | ACACGACGCTCTTCCGATCTATCAHHHHHHHHHHHHHHHCTCACCATCGCTATCTGAGCAG | 116
fP-Mut6 | ACACGACGCTCTTCCGATCTATCAHHHHHHHHHHHHHHHTCGTCAAGGCACTCTTGCCTA | 117
fP-mut10 | ACACGACGCTCTTCCGATCTTCTTHHHHHHHHHHHHHHHGAGTATTTGGATGACAGAAACACTTT | 118
fP-mut11 | ACACGACGCTCTTCCGATCTTCTTHHHHHHHHHHHHHHHCACACGCAAATTTCCTTCCAC | 119
fP-mut12 | ACACGACGCTCTTCCGATCTTCTTHHHHHHHHHHHHHHHCCGCTCATGATCAAACGCTCTAA | 120
fP-mut13 | ACACGACGCTCTTCCGATCTTCTTHHHHHHHHHHHHHHHTCCATGATCAGGTCCACCTTCT | 121
fP-mut14 | ACACGACGCTCTTCCGATCTTCTTHHHHHHHHHHHHHHHTCACTCTCTCTCTGCGCATTC | 122
fP-mut15 | ACACGACGCTCTTCCGATCTTCTTHHHHHHHHHHHHHHHAGTAACAAAGGCATGGAGCATCT | 123
fP-mut16 | ACACGACGCTCTTCCGATCTTCTTHHHHHHHHHHHHHHHGTGGGGTGAGATTTTTGTCAACTT | 124
fP-mut17 | ACACGACGCTCTTCCGATCTTCTTHHHHHHHHHHHHHHHAGCAGTATCAGTAGTATGAGCAGC | 125
fP-mut18 | ACACGACGCTCTTCCGATCTTCTTHHHHHHHHHHHHHHHTTCTGATGTGCTTTGTTCTGGATTT | 126
fP-mut19 | ACACGACGCTCTTCCGATCTTCTTHHHHHHHHHHHHHHHTGAGCCAAATGTGTATGGGTGA | 127
fP-mut20 | ACACGACGCTCTTCCGATCTTCTTHHHHHHHHHHHHHHHGTTGCACATTCCTCTTCTGCATTT | 128
fP-mut21 | ACACGACGCTCTTCCGATCTTCTTHHHHHHHHHHHHHHHGGTGCATTTGTTAACTTCAGCTCTG | 129
fP-mut22 | ACACGACGCTCTTCCGATCTTCTTHHHHHHHHHHHHHHHTAGGTTTCTGCTGTGCCTGAC | 130
fP-mut23 | ACACGACGCTCTTCCGATCTTCTTHHHHHHHHHHHHHHHGGTGGGCTTAGATTTCTACTGACTACTA | 131
fP-mut24 | ACACGACGCTCTTCCGATCTTCTTHHHHHHHHHHHHHHHTCTAGGATTCTCTGAGCATGGC | 132
fP-mut25 | ACACGACGCTCTTCCGATCTTCTTHHHHHHHHHHHHHHHACTAAATAGGAAAATACCAGCTTCATAGAC | 133
fP-mut26 | ACACGACGCTCTTCCGATCTTCTTHHHHHHHHHHHHHHHACACTCTTGTGCTGACTTACCA | 134
fP-mut27 | ACACGACGCTCTTCCGATCTTCTTHHHHHHHHHHHHHHHAACATCAGGGAATTCATTTAAAGTAAATAGC | 135
fP-mut28 | ACACGACGCTCTTCCGATCTTCTTHHHHHHHHHHHHHHHGGAACCAAATGATACTGATCCATTAGATTC | 136
fP-mut29 | ACACGACGCTCTTCCGATCTTCTTHHHHHHHHHHHHHHHAGATTCTAAACTGCCAAGTCATGC | 137
fP-mut30 | ACACGACGCTCTTCCGATCTTCTTHHHHHHHHHHHHHHHTGCCTGTAGTAATCAAGTGTCTCATTT | 138
fP-mut31 | ACACGACGCTCTTCCGATCTTCTTHHHHHHHHHHHHHHHCAACCAAAGTCTTTGTTCCACCTT | 139
fP-mut32 | ACACGACGCTCTTCCGATCTTCTTHHHHHHHHHHHHHHHAGACATCATCTGGATTATACATATTTCGC | 140
fP-mut33 | ACACGACGCTCTTCCGATCTTCTTHHHHHHHHHHHHHHHAAAGAGCTAACATACAGTTAGCAGC | 141
fP-mut34 | ACACGACGCTCTTCCGATCTTCTTHHHHHHHHHHHHHHHTTCCATTCTAGGACTTGCCCC | 142
fP-mut35 | ACACGACGCTCTTCCGATCTTCTTHHHHHHHHHHHHHHHACTGAATTCTCCTCAGATGACTCC | 143
fP-mut36 | ACACGACGCTCTTCCGATCTTCTTHHHHHHHHHHHHHHHACAGACACTCCTTGTTCAGCA | 144
fP-mut37 | ACACGACGCTCTTCCGATCTTCTTHHHHHHHHHHHHHHHTGTGTATATAATTATTTCTTACCCTATTCGAGTC | 145
fP-mut38 | ACACGACGCTCTTCCGATCTTCTTHHHHHHHHHHHHHHHTTGCTGTCATTTGGACTGGGAA | 146
fP-mut39 | ACACGACGCTCTTCCGATCTTCTTHHHHHHHHHHHHHHHCAAAAAGATACCCACCTTTCCTCCA | 147
fP-mut3new | ACACGACGCTCTTCCGATCTTCTTHHHHHHHHHHHHHHHCAATTTCTACACGAGATCCTCTCTCT | 148
fP-mut40 | ACACGACGCTCTTCCGATCTTCTTHHHHHHHHHHHHHHHGACATTACGGGCTGCCAAATC | 149
fP-mut41 | ACACGACGCTCTTCCGATCTTCTTHHHHHHHHHHHHHHHCTCAGACACACACCCAGCAA | 150
fP-mut42 | ACACGACGCTCTTCCGATCTTCTTHHHHHHHHHHHHHHHACTTACTTTATAAACCGTTCCAAAAGCA | 151
fP-mut43 | ACACGACGCTCTTCCGATCTTCTTHHHHHHHHHHHHHHHTGAGTAATGTACTTACTACAATTTTCAGCTT | 152
fP-mut44 | ACACGACGCTCTTCCGATCTTCTTHHHHHHHHHHHHHHHGCTGTTGTCAGTAATATAGATGTTTCCTG | 153
fP-mut45 | ACACGACGCTCTTCCGATCTTCTTHHHHHHHHHHHHHHHAACTAGGGCAGGCACGC | 154
fP-mut46 | ACACGACGCTCTTCCGATCTTCTTHHHHHHHHHHHHHHHGGATAATAAAAGAGAGAAATCACAGACATACAA | 155
fP-mut47 | ACACGACGCTCTTCCGATCTTCTTHHHHHHHHHHHHHHHAGGCATATCGATCCTCATAAAGTTTTG | 156
fP-mut48 | ACACGACGCTCTTCCGATCTTCTTHHHHHHHHHHHHHHHTCCAGGTTGCCCATGACAAC | 157
fP-mut49 | ACACGACGCTCTTCCGATCTTCTTHHHHHHHHHHHHHHHAGTGCCAGAAGGAACCCAC | 158
fP-mut50 | ACACGACGCTCTTCCGATCTTCTTHHHHHHHHHHHHHHHAAGTGTTACTCAAGAAGCAGAAAGG | 159
fP-mut51 | ACACGACGCTCTTCCGATCTTCTTHHHHHHHHHHHHHHHAAAATCCCTTTGGGTTATAAATAGTGCA | 160
fP-mut52 | ACACGACGCTCTTCCGATCTTCTTHHHHHHHHHHHHHHHATGTGTTTTATAATTTAGACTAGTGAATATTTTTCTTTG | 161
fP-mut53 | ACACGACGCTCTTCCGATCTTCTTHHHHHHHHHHHHHHHCTGGAAAAATGGCTTTGAATCTTTGG | 162
fP-mut54 | ACACGACGCTCTTCCGATCTTCTTHHHHHHHHHHHHHHHTGGAAAAGCTCATTAACTTAACTGACAT | 163
fP-mut55 | ACACGACGCTCTTCCGATCTTCTTHHHHHHHHHHHHHHHTCCTTGGGATTACGCTCCCT | 164
fP-mut56 | ACACGACGCTCTTCCGATCTTCTTHHHHHHHHHHHHHHHACCCAGTGGAGAAGCTCCC | 165
fP-mut57 | ACACGACGCTCTTCCGATCTTCTTHHHHHHHHHHHHHHHAGGTGAGAAAGTTAAAATTCCCGTC | 166
fP-mut58 | ACACGACGCTCTTCCGATCTTCTTHHHHHHHHHHHHHHHAGGCAGATGCCCAGCAGG | 167
fP-mut59 | ACACGACGCTCTTCCGATCTTCTTHHHHHHHHHHHHHHHCCTCCACCGTGCAGCTCAT | 168
fP-mut60 | ACACGACGCTCTTCCGATCTTCTTHHHHHHHHHHHHHHHAGCCAGGAACGTACTGGTG | 169
fP-mut61 | ACACGACGCTCTTCCGATCTTCTTHHHHHHHHHHHHHHHACAATGTCACCACATTACATACTTACC | 170
fP-mut62 | ACACGACGCTCTTCCGATCTTCTTHHHHHHHHHHHHHHHACAGGCTCCCAGACATGACA | 171
fP-mut63 | ACACGACGCTCTTCCGATCTTCTTHHHHHHHHHHHHHHHTTCAGATATTTCTTTCCTTAACTAAAGTACTCA | 172
fP-mut64 | ACACGACGCTCTTCCGATCTTCTTHHHHHHHHHHHHHHHTGTTTGTTTTGTTTTAAGGTTTTTGGATTC | 173
fP-mut65 | ACACGACGCTCTTCCGATCTTCTTHHHHHHHHHHHHHHHTCCTAAGTGCAAAAGATAACTTTATATCACTT | 174
fP-mut66 | ACACGACGCTCTTCCGATCTTCTTHHHHHHHHHHHHHHHGTTGCAGCAATTCACTGTAAAGCT | 175
fP-mut67 | ACACGACGCTCTTCCGATCTTCTTHHHHHHHHHHHHHHHCACAAGAGGCCCTAGATTTCTATGG | 176
fP-mut68 | ACACGACGCTCTTCCGATCTTCTTHHHHHHHHHHHHHHHTTGAGTTCCCTCAGCCGTTAC | 177
fP-mut69 | ACACGACGCTCTTCCGATCTTCTTHHHHHHHHHHHHHHHTCTTCATACCAGGACCAGAGGAA | 178
fP-mut7 | ACACGACGCTCTTCCGATCTTCTTHHHHHHHHHHHHHHHAGTGAGCCCTGCTCCCC | 179
fP-mut70 | ACACGACGCTCTTCCGATCTTCTTHHHHHHHHHHHHHHHGCGTGCAGATAATGACAAGGAATATCT | 180
fP-mut71 | ACACGACGCTCTTCCGATCTTCTTHHHHHHHHHHHHHHHGGTTTTCATTTTAAATTTTCTTTCTCTAGGTGAA | 181
fP-mut72 | ACACGACGCTCTTCCGATCTTCTTHHHHHHHHHHHHHHHTCGTGGCCATGAATGAATTCTCTA | 182
fP-mut73 | ACACGACGCTCTTCCGATCTTCTTHHHHHHHHHHHHHHHTCTACAACAAGCTAACTTTCCAGCT | 183
fP-mut74 | ACACGACGCTCTTCCGATCTTCTTHHHHHHHHHHHHHHHGTGATGTTCCTCCCTCATCTCTAA | 184
fP-mut75 | ACACGACGCTCTTCCGATCTTCTTHHHHHHHHHHHHHHHAGCAACATTGATGGATTTGTGAACT | 185
fP-mut76 | ACACGACGCTCTTCCGATCTTCTTHHHHHHHHHHHHHHHACGATTGGCTGAAGTACCAGAC | 186
fP-mut77 | ACACGACGCTCTTCCGATCTTCTTHHHHHHHHHHHHHHHTGGACACGACAACAACCAGC | 187
fP-mut78 | ACACGACGCTCTTCCGATCTTCTTHHHHHHHHHHHHHHHGCAACTTACACGTGGACGAC | 188
fP-mut79 | ACACGACGCTCTTCCGATCTTCTTHHHHHHHHHHHHHHHTCCCTCTTATTGTTCCCTACAGATTG | 189
fP-mut8 | ACACGACGCTCTTCCGATCTTCTTHHHHHHHHHHHHHHHGTGCGCCGGTCTCTCC | 190
fP-mut9 | ACACGACGCTCTTCCGATCTTCTTHHHHHHHHHHHHHHHTGACCTGGAGTCTTCCAGTGT | 191
fP-Ref-1 | ACACGACGCTCTTCCGATCTATCAHHHHHHHHHHHHHHHGCTTGTCTAAGGAAAAAACTTGATTATTTTGTAA | 192
fP-Ref-10 | ACACGACGCTCTTCCGATCTATCAHHHHHHHHHHHHHHHAGGAAGACGCTTGGTTGGG | 193
fP-Ref-11 | ACACGACGCTCTTCCGATCTATCAHHHHHHHHHHHHHHHTGAGACCTATAATGCTAAGGAAATTTCTTTAC | 194
fP-Ref-12 | ACACGACGCTCTTCCGATCTATCAHHHHHHHHHHHHHHHTGGAAGCGTTCGTTCCATCC | 195
fP-Ref-13 | ACACGACGCTCTTCCGATCTATCAHHHHHHHHHHHHHHHGCTTGGCATCTGTTCTTGCTTTAA | 196
fP-Ref-14 | ACACGACGCTCTTCCGATCTATCAHHHHHHHHHHHHHHHAAAATTCTGCAAAAATAAAGGCCAAGA | 197
fP-Ref-15 | ACACGACGCTCTTCCGATCTATCAHHHHHHHHHHHHHHHTTTATTGCATGTCCTCATCCACAG | 198
fP-Ref-16 | ACACGACGCTCTTCCGATCTATCAHHHHHHHHHHHHHHHATTGGGGATGTCCAGAATAAATTCAG | 199
fP-Ref-17 | ACACGACGCTCTTCCGATCTATCAHHHHHHHHHHHHHHHCAAACAGTTCAGTGACTTGCCC | 200
fP-Ref-18 | ACACGACGCTCTTCCGATCTATCAHHHHHHHHHHHHHHHCAGGCATCTCACCTCTCTTCC | 201
fP-Ref-19 | ACACGACGCTCTTCCGATCTATCAHHHHHHHHHHHHHHHGGCATTATTCCAGTATTGTAGAAGAAGAA | 202
fP-Ref-2 | ACACGACGCTCTTCCGATCTATCAHHHHHHHHHHHHHHHCTACATTACCAGTAGAACAGAACTAGTCTA | 203
fP-Ref-20 | ACACGACGCTCTTCCGATCTATCAHHHHHHHHHHHHHHHGCTAACCGTGCTTTCCTCTTTCAT | 204
fP-Ref-21 | ACACGACGCTCTTCCGATCTATCAHHHHHHHHHHHHHHHCCCTAGCAGAAGCCGACCA | 205
fP-Ref-22 | ACACGACGCTCTTCCGATCTATCAHHHHHHHHHHHHHHHCAAACCACCACTTATTTCTTTATTTTATCCT | 206
fP-Ref-23 | ACACGACGCTCTTCCGATCTATCAHHHHHHHHHHHHHHHTCATTTTGTAGTCATTGTAAAACTCTTATGC | 207
fP-Ref-24 | ACACGACGCTCTTCCGATCTATCAHHHHHHHHHHHHHHHAGAGTAGCGACATGCAAATGATCT | 208
fP-Ref-25 | ACACGACGCTCTTCCGATCTATCAHHHHHHHHHHHHHHHCCCAGCATTTGTTATATAGGCATCTT | 209
fP-Ref-26 | ACACGACGCTCTTCCGATCTATCAHHHHHHHHHHHHHHHGAGTAGACAGGGAAATATAGAAGCCT | 210
fP-Ref-27 | ACACGACGCTCTTCCGATCTATCAHHHHHHHHHHHHHHHCCATCCTTTTTAGTGCTGTCCTCA | 211
fP-Ref-28 | ACACGACGCTCTTCCGATCTATCAHHHHHHHHHHHHHHHTCAGCTGGCCTAGCAGTTC | 212
fP-Ref-29 | ACACGACGCTCTTCCGATCTATCAHHHHHHHHHHHHHHHGCATAATTTATAATGAAAACAAATACATTCTCACAG | 213
fP-Ref-3 | ACACGACGCTCTTCCGATCTATCAHHHHHHHHHHHHHHHCTTATTTGCATTTGTGGCATAATATGAAAC | 214
fP-Ref-30 | ACACGACGCTCTTCCGATCTATCAHHHHHHHHHHHHHHHCTAGGGTTAGTCAGGTGGTTCAA | 215
fP-Ref-31 | ACACGACGCTCTTCCGATCTATCAHHHHHHHHHHHHHHHCAGAAGGGCTCTCACTGGG | 216
fP-Ref-32 | ACACGACGCTCTTCCGATCTATCAHHHHHHHHHHHHHHHGCCTTATATTATTCCCTTTGAACCTTACAATAAT | 217
fP-Ref-33 | ACACGACGCTCTTCCGATCTATCAHHHHHHHHHHHHHHHCCTGCAGCGGGAGTTTTCA | 218
fP-Ref-34 | ACACGACGCTCTTCCGATCTATCAHHHHHHHHHHHHHHHCGGCGCCACGTGTTCA | 219
fP-Ref-35 | ACACGACGCTCTTCCGATCTATCAHHHHHHHHHHHHHHHTGGCCCATTTTAACCTTTTTTTTTTAAAGTA | 220
fP-Ref-36 | ACACGACGCTCTTCCGATCTATCAHHHHHHHHHHHHHHHTGTGAACAGCCAGAAGCGAT | 221
fP-Ref-37 | ACACGACGCTCTTCCGATCTATCAHHHHHHHHHHHHHHHCCAGCCCCTGATCCTACCAG | 222
fP-Ref-38 | ACACGACGCTCTTCCGATCTATCAHHHHHHHHHHHHHHHAGTAACTGAACGACGAATTCTTTGTAA | 223
fP-Ref-39 | ACACGACGCTCTTCCGATCTATCAHHHHHHHHHHHHHHHCAGCTCCCACCACAGTGC | 224
fP-Ref-4 | ACACGACGCTCTTCCGATCTATCAHHHHHHHHHHHHHHHCACACAGCGGGCTCTCA | 225
fP-Ref-40 | ACACGACGCTCTTCCGATCTATCAHHHHHHHHHHHHHHHTCCCTTCTCCTACACTTCCTCC | 226
fP-Ref-41 | ACACGACGCTCTTCCGATCTATCAHHHHHHHHHHHHHHHACTACAGGAGCAACTGCCAC | 227
fP-Ref-42 | ACACGACGCTCTTCCGATCTATCAHHHHHHHHHHHHHHHAGAGGTAGGGATTATTAGCCCCAT | 228
fP-Ref-43 | ACACGACGCTCTTCCGATCTATCAHHHHHHHHHHHHHHHTGTGTATCCAACAGGAACTCCAAA | 229
fP-Ref-44 | ACACGACGCTCTTCCGATCTATCAHHHHHHHHHHHHHHHCTGGTCTTAAAATGTCCTGGGGA | 230
fP-Ref-45 | ACACGACGCTCTTCCGATCTATCAHHHHHHHHHHHHHHHACGCCCGGCCATCTCA | 231
fP-Ref-46 | ACACGACGCTCTTCCGATCTATCAHHHHHHHHHHHHHHHTTTTCACTGTTTCCTACAAGAAAATGC | 232
fP-Ref-47 | ACACGACGCTCTTCCGATCTATCAHHHHHHHHHHHHHHHGAAACCTGGATTTTTGAAATCTAGTGTTTAA | 233
fP-Ref-48 | ACACGACGCTCTTCCGATCTATCAHHHHHHHHHHHHHHHGCAAGGACGGAAATAGGTAAATGT | 234
fP-Ref-49 | ACACGACGCTCTTCCGATCTATCAHHHHHHHHHHHHHHHCGAGGCACTGCGTTTGG | 235
fP-Ref-5 | ACACGACGCTCTTCCGATCTATCAHHHHHHHHHHHHHHHAGCAGATGGGTTGAGAGTTGG | 236
fP-Ref-50 | ACACGACGCTCTTCCGATCTATCAHHHHHHHHHHHHHHHTCCAATCTCTATCTGTTAGAAGTCTCC | 237
fP-Ref-6 | ACACGACGCTCTTCCGATCTATCAHHHHHHHHHHHHHHHGGCTCTGATTTCCGCCCAAT | 238
fP-Ref-7 | ACACGACGCTCTTCCGATCTATCAHHHHHHHHHHHHHHHTGCAAAGATTGTAGGAGCTCTGTA | 239
fP-Ref-8 | ACACGACGCTCTTCCGATCTATCAHHHHHHHHHHHHHHHAATTAGATAAAAAGCATCCACAGAGGAG | 240
fP-Ref-9 | ACACGACGCTCTTCCGATCTATCAHHHHHHHHHHHHHHHAGTCTTTAACAATGAGAGTCAAACCATT | 241
rPin-ERBB2-1 | AGACGTGTGCTCTTCCGATCTGTGCAGGGGGCAGACGA | 242
rPin-ERBB2-10 | AGACGTGTGCTCTTCCGATCTCCCCTCACAACAACCAGACG | 243
rPin-ERBB2-11 | AGACGTGTGCTCTTCCGATCTATGTTCTAACACCGTGATCTGGAT | 244
rPin-ERBB2-12 | AGACGTGTGCTCTTCCGATCTTGGAAAACACTTCAGTTTGCTCATTAA | 245
rPin-ERBB2-13 | AGACGTGTGCTCTTCCGATCTGCAAAGGTTCTACCCCGCAT | 246
rPin-ERBB2-14 | AGACGTGTGCTCTTCCGATCTGGCTACTTCTTACTCATTCCAACCC | 247
rPin-ERBB2-15 | AGACGTGTGCTCTTCCGATCTCCATCACCAGCTAGTCTGAGTC | 248
rPin-ERBB2-16 | AGACGTGTGCTCTTCCGATCTCCCCGTTTTATCTGTGACTCTTTG | 249
rPin-ERBB2-17 | AGACGTGTGCTCTTCCGATCTCCATCCTCTCTGCATCCCAAATC | 250
rPin-ERBB2-18 | AGACGTGTGCTCTTCCGATCTGGCAGGTGTTATCATTCCCCATTT | 251
rPin-ERBB2-19 | AGACGTGTGCTCTTCCGATCTGGGCCTCCTTATTTTTATGTGCTAAAT | 252
rPin-ERBB2-2 | AGACGTGTGCTCTTCCGATCTAGGGTGGAGGGGCTTACG | 253
rPin-ERBB2-20 | AGACGTGTGCTCTTCCGATCTAGCTTGCATCCTACTCCATCC | 254
rPin-ERBB2-21 | AGACGTGTGCTCTTCCGATCTTCCCCTGGTTTCTCCGGT | 255
rPin-ERBB2-22 | AGACGTGTGCTCTTCCGATCTCGACCCCGCCAGAAGC | 256
rPin-ERBB2-23 | AGACGTGTGCTCTTCCGATCTGCATGCAAAACACCACAAACAGTT | 257
rPin-ERBB2-24 | AGACGTGTGCTCTTCCGATCTGGCTACCTCCCTCTGTTTATGG | 258
rPin-ERBB2-25 | AGACGTGTGCTCTTCCGATCTAAAATATGAAGGAGTTCTGCAAGATTAAAAG | 259
rPin-ERBB2-26 | AGACGTGTGCTCTTCCGATCTGGTTCATACAGCAGGAATATGGGTAAT | 260
rPin-ERBB2-27 | AGACGTGTGCTCTTCCGATCTAGGACAGGCACAACTACCCT | 261
rPin-ERBB2-28 | AGACGTGTGCTCTTCCGATCTAGCAGAAAAGCCAATACTTCCCT | 262
rPin-ERBB2-29 | AGACGTGTGCTCTTCCGATCTAACACCACAGGCTCTACGG | 263
rPin-ERBB2-3 | AGACGTGTGCTCTTCCGATCTCCCAGAAGGCGGGAGACATA | 264
rPin-ERBB2-30 | AGACGTGTGCTCTTCCGATCTCAGGGAGAAGCCTGACTGAAG | 265
rPin-ERBB2-31 | AGACGTGTGCTCTTCCGATCTGGTGGACAGGGGACATGATCA | 266
rPin-ERBB2-32 | AGACGTGTGCTCTTCCGATCTGGAACACTGCCACCCCC | 267
rPin-ERBB2-33 | AGACGTGTGCTCTTCCGATCTCCCCCTGGTTAGCAGTGG | 268
rPin-ERBB2-34 | AGACGTGTGCTCTTCCGATCTAACTCAGCCCCATCACTCAC | 269
rPin-ERBB2-35 | AGACGTGTGCTCTTCCGATCTGGAGGGGCATGGCTTACAG | 270
rPin-ERBB2-36 | AGACGTGTGCTCTTCCGATCTCGGCTCTGACAATCCTCAGAA | 271
rPin-ERBB2-37 | AGACGTGTGCTCTTCCGATCTGGTCTCAAAAACAAAACGAAAGGTAAA | 272
rPin-ERBB2-38 | AGACGTGTGCTCTTCCGATCTACTGACAGGGGATATAGGGACA | 273
rPin-ERBB2-39 | AGACGTGTGCTCTTCCGATCTAGTCCTTGTTCACGGATAGCAT | 274
rPin-ERBB2-4 | AGACGTGTGCTCTTCCGATCTGTTCCGAGCGGCCAAGTC | 275
rPin-ERBB2-40 | AGACGTGTGCTCTTCCGATCTCCGCAGGGGACTTTTAGGG | 276
rPin-ERBB2-41 | AGACGTGTGCTCTTCCGATCTCTAGCACAGCCACAGTCACA | 277






rPin-ERBB2-42
AGACGTGTGCTCTTCCGATCTCATTTAGTTGTCTTTAAA
278



TTGAAATGCATGAA






rPin-ERBB2-43
AGACGTGTGCTCTTCCGATCTCCTTGTCATCCAGGTCCA
279



CA






rPin-ERBB2-44
AGACGTGTGCTCTTCCGATCTACTCTAACTTGACCCCCT
280



TATTCCT






rPin-ERBB2-45
AGACGTGTGCTCTTCCGATCTACAGGAATGTACACCTGA
281



TGATTTTG






rPin-ERBB2-46
AGACGTGTGCTCTTCCGATCTCTGCCTTGGCTCCCCG
282





rPin-ERBB2-47
AGACGTGTGCTCTTCCGATCTCAGTCTCCGCATCGTGTA
283



CT






rPin-ERBB2-48
AGACGTGTGCTCTTCCGATCTCTGTGCCCAGCTTAATTT
284



TGTACA






rPin-ERBB2-49
AGACGTGTGCTCTTCCGATCTGGGGTGTCAAGTACTCGG
285



G






rPin-ERBB2-5
AGACGTGTGCTCTTCCGATCTACACATCACTCTGGTGGG
286



TGAA






rPin-ERBB2-50
AGACGTGTGCTCTTCCGATCTTGGACCCCTTCCAGCCA
287





rPin-ERBB2-6
AGACGTGTGCTCTTCCGATCTCTCTCGGTGGATCTGCAT
288



AACAT






rPin-ERBB2-7
AGACGTGTGCTCTTCCGATCTGTCAAATCCTAGGGGGTA
289



ATACGA






rPin-ERBB2-8
AGACGTGTGCTCTTCCGATCTCTGGTCATTGCAGAGACC
290



TCT






rPin-ERBB2-9
AGACGTGTGCTCTTCCGATCTTCTCCATTGTCTAGCACG
291



GC






rPin-Mut1
AGACGTGTGCTCTTCCGATCTCTCACAGTAAAAATAGGT
292



GATTTTGGTCT






rPin-Mut2
AGACGTGTGCTCTTCCGATCTAAGATCCAATCCATTTTT
293



GTTGTCCAG






rPin-Mut4
AGACGTGTGCTCTTCCGATCTGTCTGACGGGTAGAGTGT
294



GC






rPin-Mut5
AGACGTGTGCTCTTCCGATCTCACATGACGGAGGTTGTG
295



AGG






rPin-Mut6
AGACGTGTGCTCTTCCGATCTCTGAAAATGACTGAATAT
296



AAACTTGTGGTAGT






rPin-mut10
AGACGTGTGCTCTTCCGATCTCCAGTTGCAAACCAGACC
297



TCA






rPin-mut11
AGACGTGTGCTCTTCCGATCTTCCTCACTGATTGCTCTT
298



AGGTCT






rPin-mut12
AGACGTGTGCTCTTCCGATCTCCAACAAGGCACTGACCA
299



TC






rPin-mut13
AGACGTGTGCTCTTCCGATCTGAGCGCCAGACGAGACC
300





rPin-mut14
AGACGTGTGCTCTTCCGATCTCGGTGGATATGGTCCTTC
301



TCTTC






rPin-mut15
AGACGTGTGCTCTTCCGATCTCGGTGGGCGTCCAGCA
302





rPin-mut16
AGACGTGTGCTCTTCCGATCTTGGTCAATGGAAGAAACC
303



ACCA






rPin-mut17
AGACGTGTGCTCTTCCGATCTCATCTTCAACCTCTGCAT
304



TGAAAGT






rPin-mut18
AGACGTGTGCTCTTCCGATCTAACAGCTACCCTTCCATC
305



ATAAGT






rPin-mut19
AGACGTGTGCTCTTCCGATCTCTGTTTTTAGCAAAAGCG
306



TCCAG






rPin-mut20
AGACGTGTGCTCTTCCGATCTAGGTTTCAAAGCGCCAGT
307



CA






rPin-mut21
AGACGTGTGCTCTTCCGATCTGTAACAAGCCAAATGAAC
308



AGACAAGT






rPin-mut22
AGACGTGTGCTCTTCCGATCTAGTTGTTCTAGCAGTGAA
309



GAGATAAAGA






rPin-mut23
AGACGTGTGCTCTTCCGATCTAAAGCACCTAAAAAGAAT
310



AGGCTGAG






rPin-mut24
AGACGTGTGCTCTTCCGATCTAGGTAGATCTGAATGCTG
311



ATCCC






rPin-mut25
AGACGTGTGCTCTTCCGATCTGGATCTGATTCTTCTGAA
312



GATACCGTTAA






rPin-mut26
AGACGTGTGCTCTTCCGATCTGATTTATCTGCTCTTCGC
313



GTTGAA






rPin-mut27
AGACGTGTGCTCTTCCGATCTACTGTTTCATATACTTCA
314



TCTTCTAGGACA






rPin-mut28
AGACGTGTGCTCTTCCGATCTGGAGATTTTGTCACTTCC
315



ACTCTC






rPin-mut29
AGACGTGTGCTCTTCCGATCTTTGAATTTGACAAAACCA
316



TTTCCTCATTT






rPin-mut30
AGACGTGTGCTCTTCCGATCTGTTTCAGGACATCCATTT
317



TATCAAGTTTC






rPin-mut31
AGACGTGTGCTCTTCCGATCTGTTAATATTCCTAACACA
318



CTGTTCAACTCT






rPin-mut32
AGACGTGTGCTCTTCCGATCTTTCTAGTCTCTTTTGTTG
319



GGCCT






rPin-mut33
AGACGTGTGCTCTTCCGATCTCTTATCAAAACTGAAAAA
320



TTACAATGAAAGGTTT






rPin-mut34
AGACGTGTGCTCTTCCGATCTCTTTATTGCCAGTAAATT
321



GTAACATTCGT






rPin-mut35
AGACGTGTGCTCTTCCGATCTCAAGTTCTTCGTCAGCTA
322



TTGAATTACT






rPin-mut36
AGACGTGTGCTCTTCCGATCTCCTTCTCTCCACATATGT
323



TTCTCTTATTAA






rPin-mut37
AGACGTGTGCTCTTCCGATCTCATCCCACCTCCCATCTA
324



TACTTC






rPin-mut38
AGACGTGTGCTCTTCCGATCTTTTGTGTCTGATGGGCAA
325



TCTTTC






rPin-mut39
AGACGTGTGCTCTTCCGATCTTTTTGGGCTAGCCAGACT
326



CTTG






rPin-mut3new
AGACGTGTGCTCTTCCGATCTGAATCTCCATTTTAGCAC
327



TTACCTGTG






rPin-mut40
AGACGTGTGCTCTTCCGATCTTTGATATTTTTCAGGGAA
328



TGATGTACCTG






rPin-mut41
AGACGTGTGCTCTTCCGATCTGAAATCATGGTATTGCAT
329



TTTTTTCTTACAG






rPin-mut42
AGACGTGTGCTCTTCCGATCTAGCACCCAATCAAGCTCA
330



ACT






rPin-mut43
AGACGTGTGCTCTTCCGATCTACTCTTCAGCACAATCAA
331



CCAGA






rPin-mut44
AGACGTGTGCTCTTCCGATCTGCATCACCTCTCTACAGT
332



TCCAGT






rPin-mut45
AGACGTGTGCTCTTCCGATCTGCTGAATGTTAACATTAA
333



TGCTTATTTTACC






rPin-mut46
AGACGTGTGCTCTTCCGATCTAGTGACTGCTGCCATCGA
334



G






rPin-mut47
AGACGTGTGCTCTTCCGATCTGCTACGTGTTAGTGGCTC
335



TTAATCA






rPin-mut48
AGACGTGTGCTCTTCCGATCTATAAACTGAGCTCTCTCT
336



CTGACC






rPin-mut49
AGACGTGTGCTCTTCCGATCTTGAAGCCGGCGACAGG
337





rPin-mut50
AGACGTGTGCTCTTCCGATCTGGTTCAATTACTTTTAAA
338



AAGGGTTGAAAAAG






rPin-mut51
AGACGTGTGCTCTTCCGATCTATTTGACTTTACCTTATC
339



AATGTCTCGAA






rPin-mut52
AGACGTGTGCTCTTCCGATCTGTGTCTGTGTAATCAAAC
340



AAGTTTATATTTCC






rPin-mut53
AGACGTGTGCTCTTCCGATCTAGTAACACCAATAGGGTT
341



CAGCAA






rPin-mut54
AGACGTGTGCTCTTCCGATCTAAAGAGTCTCAAACACAA
342



ACTAGAGTC






rPin-mut55
AGACGTGTGCTCTTCCGATCTTTTATTGTATTTGCATAG
343



CACAAATTTTTGTT






rPin-mut56
AGACGTGTGCTCTTCCGATCTCCGTGCCGAACGCACC
344





rPin-mut57
AGACGTGTGCTCTTCCGATCTGCAAAGCAGAAACTCACA
345



TCGA






rPin-mut58
AGACGTGTGCTCTTCCGATCTCTCCAGGAAGCCTACGTG
346



ATG






rPin-mut59
AGACGTGTGCTCTTCCGATCTCGGACATAGTCCAGGAGG
347



CAG






rPin-mut60
AGACGTGTGCTCTTCCGATCTGCATGGTATTCTTTCTCT
348



TCCGCA






rPin-mut61
AGACGTGTGCTCTTCCGATCTGGGCAGATTACAGTGGGA
349



CAA






rPin-mut62
AGACGTGTGCTCTTCCGATCTGGATACAGGTCAAGTCTA
350



AGTCGAATC






rPin-mut63
AGACGTGTGCTCTTCCGATCTCCTGTATACGCCTTCAAG
351



TCTTTCT






rPin-mut64
AGACGTGTGCTCTTCCGATCTGCAAGCATACAAATAAGA
352



AAACATACTTACAG






rPin-mut65
AGACGTGTGCTCTTCCGATCTTCTGCAATTAAATTTGGC
353



GGTGT






rPin-mut66
AGACGTGTGCTCTTCCGATCTCGATGTAATAAATATGCA
354



CATATCATTACACC






rPin-mut67
AGACGTGTGCTCTTCCGATCTCAGGAAGAGGAAAGGAAA
355



AACATCAA






rPin-mut68
AGACGTGTGCTCTTCCGATCTGATATTTCTCCCAATGAA
356



AGTAAAGTACAAAC






rPin-mut69
AGACGTGTGCTCTTCCGATCTTCGATTTCTTGATCACAT
357



AGACTTCCAT






rPin-mut7
AGACGTGTGCTCTTCCGATCTGGGCGTGAGCGCTTCG
358





rPin-mut70
AGACGTGTGCTCTTCCGATCTCTTAAAATTTGGAGAAAA
359



GTATCGGTTGG






rPin-mut71
AGACGTGTGCTCTTCCGATCTAGCCTCTGGATTTGACGG
360



C






rPin-mut72
AGACGTGTGCTCTTCCGATCTGATGGCAAACTTCCCATC
361



GTAG






rPin-mut73
AGACGTGTGCTCTTCCGATCTGGGACAGCTGGCTACACA
362



A






rPin-mut74
AGACGTGTGCTCTTCCGATCTAGGCCCTGACACAGGATG
363



T






rPin-mut75
AGACGTGTGCTCTTCCGATCTCCCATTGAGGCCGGTGAT
364





rPin-mut76
AGACGTGTGCTCTTCCGATCTTTGACCATCACCATGTAG
365



ACATCA






rPin-mut77
AGACGTGTGCTCTTCCGATCTAGCTGTCTCTCTCCCAGT
366



TCATT






rPin-mut78
AGACGTGTGCTCTTCCGATCTCCCATGGCAAACACCATG
367



AG






rPin-mut79
AGACGTGTGCTCTTCCGATCTCACCATGTGTGACTTGAT
368



TAGCAG






rPin-mut8
AGACGTGTGCTCTTCCGATCTGTGGTAATCTACTGGGAC
369



GGAAC






rPin-mut9
AGACGTGTGCTCTTCCGATCTTCCACTACAACTACATGT
370



GTAACAGTT






rPin-Ref-1
AGACGTGTGCTCTTCCGATCTGTAACAGTAGGTGTTTCA
371



ATATGACTTTTATT






rPin-Ref-10
AGACGTGTGCTCTTCCGATCTCTCCCCTCCTCCATAGGA
372



ACTT






rPin-Ref-11
AGACGTGTGCTCTTCCGATCTACATACCAGGTTCTGCGC
373



TT






rPin-Ref-12
AGACGTGTGCTCTTCCGATCTATCAAGGCACCGCTCTAA
374



CTT






rPin-Ref-13
AGACGTGTGCTCTTCCGATCTATCCCGGTGTGCATTTGA
375



GA






rPin-Ref-14
AGACGTGTGCTCTTCCGATCTGGGCTATGGGGGCTTCCT
376





rPin-Ref-15
AGACGTGTGCTCTTCCGATCTGATGTGCCCTGACATCAG
377



AAATATAC






rPin-Ref-16
AGACGTGTGCTCTTCCGATCTAGTGTTGATCTGAAGGAA
378



CTTCCT






rPin-Ref-17
AGACGTGTGCTCTTCCGATCTTGGGACCATGTTTGGCCA
379



T






rPin-Ref-18
AGACGTGTGCTCTTCCGATCTTCCCATCATTGCTGCTGT
380



CA






rPin-Ref-19
AGACGTGTGCTCTTCCGATCTCAAACACGTGTGATCAAT
381



AGTACCAT






rPin-Ref-2
AGACGTGTGCTCTTCCGATCTTCTCATATCAGAACTTAA
382



ATACATAGCAGTAG






rPin-Ref-20
AGACGTGTGCTCTTCCGATCTGGGGAAGGAAGATGTCAC
383



ATTATGA






rPin-Ref-21
AGACGTGTGCTCTTCCGATCTGCATGCGCAAGAGCTACC
384



C






rPin-Ref-22
AGACGTGTGCTCTTCCGATCTACGATAAAATTCTCTTAT
385



CTTGAAGGATTGAT






rPin-Ref-23
AGACGTGTGCTCTTCCGATCTAGTGTTTCTGATATTGAA
386



AAATTTTAAGTGCT






rPin-Ref-24
AGACGTGTGCTCTTCCGATCTTTTCATCCTTCGCACATG
387



TATACTG






rPin-Ref-25
AGACGTGTGCTCTTCCGATCTCTGGAGCAGATGACTCAC
388



ATTTC






rPin-Ref-26
AGACGTGTGCTCTTCCGATCTAGGGGGCTTGGTCTTTTT
389



TCT






rPin-Ref-27
AGACGTGTGCTCTTCCGATCTCACCTTTTTTAACAACCG
390



GATCTAGT






rPin-Ref-28
AGACGTGTGCTCTTCCGATCTGAGGCCCTGTAATCTGTA
391



TTTTAACC






rPin-Ref-29
AGACGTGTGCTCTTCCGATCTCCTTAATATCAGACTTCC
392



CAGCCTTC






rPin-Ref-3
AGACGTGTGCTCTTCCGATCTGGAGCTCTGAGACAGGAA
393



CC






rPin-Ref-30
AGACGTGTGCTCTTCCGATCTTGGCAAAGCAGAAGACAA
394



TAGTAGA






rPin-Ref-31
AGACGTGTGCTCTTCCGATCTCCCTTTCAGGGAGTCCTG
395



TACA






rPin-Ref-32
AGACGTGTGCTCTTCCGATCTTTTTCGTTACTGTAAAAT
396



GGGAATGTTC






rPin-Ref-33
AGACGTGTGCTCTTCCGATCTCGGTGAACTTTCGGGAAA
397



GG






rPin-Ref-34
AGACGTGTGCTCTTCCGATCTCCCACGTACAAGAGGATT
398



TCAAAGT






rPin-Ref-35
AGACGTGTGCTCTTCCGATCTAGTGTGAATGTACTTAAT
399



GACACTTAGC






rPin-Ref-36
AGACGTGTGCTCTTCCGATCTGTGAGGCAGGTGCTCACT
400



T






rPin-Ref-37
AGACGTGTGCTCTTCCGATCTCTGGTGTTCTTTTATACC
401



CATTTTTTCTTTA






rPin-Ref-38
AGACGTGTGCTCTTCCGATCTCTGTTGCTCTTGACTCTG
402



AGCT






rPin-Ref-39
AGACGTGTGCTCTTCCGATCTCCTCAGGTCCTTGTGGCT
403



AAC






rPin-Ref-4
AGACGTGTGCTCTTCCGATCTAGGAGCCGTGGGAATCAA
404



AA






rPin-Ref-40
AGACGTGTGCTCTTCCGATCTCAGCATGGCAAGGCAACT
405



T






rPin-Ref-41
AGACGTGTGCTCTTCCGATCTTGAGGGACAGAAAATCAG
406



GTCG






rPin-Ref-42
AGACGTGTGCTCTTCCGATCTGGCTAATGAGTTGATCTC
407



TCTGAGC






rPin-Ref-43
AGACGTGTGCTCTTCCGATCTAAAAGAAAACAAAGGACA
408



TAGATTTTCCC






rPin-Ref-44
AGACGTGTGCTCTTCCGATCTAGAGTGCTCAAACCTTGG
409



GAA






rPin-Ref-45
AGACGTGTGCTCTTCCGATCTGCTATCATGCCATGAAGA
410



ATATTCATATATTCATA






rPin-Ref-46
AGACGTGTGCTCTTCCGATCTAGAGAACCCACTTGGGAC
411



CA






rPin-Ref-47
AGACGTGTGCTCTTCCGATCTACCATATTCTTAATTTTT
412



AAAATTCACAGCCA






rPin-Ref-48
AGACGTGTGCTCTTCCGATCTCTCTGTCGTAAGTCAAGT
413



CTTTGTG






rPin-Ref-49
AGACGTGTGCTCTTCCGATCTCTATCGAATCAGAATGCA
414



AAGCAAATT






rPin-Ref-5
AGACGTGTGCTCTTCCGATCTCGTTTCGGATACTCAGTC
415



TCTGAA






rPin-Ref-50
AGACGTGTGCTCTTCCGATCTACAAATTACCTAAACTGA
416



CTCAAGAAGAA






rPin-Ref-6
AGACGTGTGCTCTTCCGATCTGGCTCCTTTCGTGAGCGA
417



AG






rPin-Ref-7
AGACGTGTGCTCTTCCGATCTAGAGGTAGTGGAGGTCAA
418



GGT






rPin-Ref-8
AGACGTGTGCTCTTCCGATCTTGACTTGCGTTCATCTTG
419



TTATTTAAAC






rPin-Ref-9
AGACGTGTGCTCTTCCGATCTCCTGAAAAGGTAGGTTGG
420



TGC






rPout-ERBB2-1
GGATATTCCTTTCTACTCTTTGACATCATCTTCACCTCT
421



TGGTTGTGCAGG






rPout-ERBB2-10
GGATATTCCTTTCTACTCTTTGACATCATCTCATTTACC
422



CCTCACAACAACCAG






rPout-ERBB2-11
GGATATTCCTTTCTACTCTTTGACATCATCTCCATGTTC
423



TAACACCGTGATCTG






rPout-ERBB2-12
GGATATTCCTTTCTACTCTTTGACATCATCTCATGGAAA
424



ACACTTCAGTTTGCTC






rPout-ERBB2-13
GGATATTCCTTTCTACTCTTTGACATCATCTACAGCAAA
425



GGTTCTACCCCG






rPout-ERBB2-14
GGATATTCCTTTCTACTCTTTGACATCATCTCCAGGCTA
426



CTTCTTACTCATTCCAA






rPout-ERBB2-15
GGATATTCCTTTCTACTCTTTGACATCATCTACCTCCAT
427



CACCAGCTAGTCT






rPout-ERBB2-16
GGATATTCCTTTCTACTCTTTGACATCATCTGGTGCCCC
428



CGTTTTATCTGT






rPout-ERBB2-17
GGATATTCCTTTCTACTCTTTGACATCATCTCCAAGCAA
429



ACCCATCCTCTCTG






rPout-ERBB2-18
GGATATTCCTTTCTACTCTTTGACATCATCTAGGAGGCA
430



GGTGTTATCATTCC






rPout-ERBB2-19
GGATATTCCTTTCTACTCTTTGACATCATCTTTTATCTG
431



AAATTCAAATTTAACTGGGCC






rPout-ERBB2-2
GGATATTCCTTTCTACTCTTTGACATCATCTGGAGAGGG
432



TGGAGGGGCT






rPout-ERBB2-20
GGATATTCCTTTCTACTCTTTGACATCATCTGGGGAGCT
433



TGCATCCTACTC






rPout-ERBB2-21
GGATATTCCTTTCTACTCTTTGACATCATCTGGCTCCCC
434



TGGTTTCTCC






rPout-ERBB2-22
GGATATTCCTTTCTACTCTTTGACATCATCTACACCCGA
435



CCCCGCC






rPout-ERBB2-23
GGATATTCCTTTCTACTCTTTGACATCATCTTGTTCTAG
436



GATTAAAGGAGAATGCATG






rPout-ERBB2-24
GGATATTCCTTTCTACTCTTTGACATCATCTCCATAGAA
437



GGCTACCTCCCTCT






rPout-ERBB2-25
GGATATTCCTTTCTACTCTTTGACATCATCTGGAATTAA
438



AATATGAAGGAGTTCTGCAAG






rPout-ERBB2-26
GGATATTCCTTTCTACTCTTTGACATCATCTTTAAAAGT
439



TAAGACAAGACAGGTTCATACA






rPout-ERBB2-27
GGATATTCCTTTCTACTCTTTGACATCATCTCCCAAGGA
440



CAGGCACAACTAC






rPout-ERBB2-28
GGATATTCCTTTCTACTCTTTGACATCATCTCTCACAGC
441



AGAAAAGCCAATACTT






rPout-ERBB2-29
GGATATTCCTTTCTACTCTTTGACATCATCTCCGATAAA
442



CACCACAGGCTCTA






rPout-ERBB2-3
GGATATTCCTTTCTACTCTTTGACATCATCTGTCAGGCA
443



GATGCCCAGAAG






rPout-ERBB2-30
GGATATTCCTTTCTACTCTTTGACATCATCTCTCCAAGT
444



CATGCCACCTCA






rPout-ERBB2-31
GGATATTCCTTTCTACTCTTTGACATCATCTAGGTGGAC
445



AGGGGACATGA






rPout-ERBB2-32
GGATATTCCTTTCTACTCTTTGACATCATCTCTGAAATA
446



GGAACACTGCCACC






rPout-ERBB2-33
GGATATTCCTTTCTACTCTTTGACATCATCTCAAAGCCT
447



CCCCCTGGTTAG






rPout-ERBB2-34
GGATATTCCTTTCTACTCTTTGACATCATCTTGTGGAGT
448



CTGAAACTCAGCC






rPout-ERBB2-35
GGATATTCCTTTCTACTCTTTGACATCATCTCAGGGAGG
449



GGCATGGC






rPout-ERBB2-36
GGATATTCCTTTCTACTCTTTGACATCATCTCTGAGACT
450



CACGGCTCTGAC






rPout-ERBB2-37
GGATATTCCTTTCTACTCTTTGACATCATCTCTAAATTC
451



GGTCTCAAAAACAAAACGAA






rPout-ERBB2-38
GGATATTCCTTTCTACTCTTTGACATCATCTCCACACTG
452



ACAGGGGATATAGG






rPout-ERBB2-39
GGATATTCCTTTCTACTCTTTGACATCATCTCATACAAG
453



TCCTTGTTCACGGATAG






rPout-ERBB2-4
GGATATTCCTTTCTACTCTTTGACATCATCTGACCAGCA
454



CGTTCCGAGC






rPout-ERBB2-40
GGATATTCCTTTCTACTCTTTGACATCATCTAGGGACCG
455



CAGGGGAC






rPout-ERBB2-41
GGATATTCCTTTCTACTCTTTGACATCATCTCCCTAGCA
456



CAGCCACAGTC






rPout-ERBB2-42
GGATATTCCTTTCTACTCTTTGACATCATCTTTTTCTCA
457



TTTAGTTGTCTTTAAATTGAAATGC






rPout-ERBB2-43
GGATATTCCTTTCTACTCTTTGACATCATCTGGCAGCCC
458



TTGTCATCCAG






rPout-ERBB2-44
GGATATTCCTTTCTACTCTTTGACATCATCTCACCCTGA
459



CTCTAACTTGACCC






rPout-ERBB2-45
GGATATTCCTTTCTACTCTTTGACATCATCTCATGGGTA
460



CAGGAATGTACACCT






rPout-ERBB2-46
GGATATTCCTTTCTACTCTTTGACATCATCTTCTAAAAC
461



CTGCCTTGGCTCC






rPout-ERBB2-47
GGATATTCCTTTCTACTCTTTGACATCATCTCAGCAGTC
462



TCCGCATCGT






rPout-ERBB2-48
GGATATTCCTTTCTACTCTTTGACATCATCTCACTGTGC
463



CCAGCTTAATTTTGT






rPout-ERBB2-49
GGATATTCCTTTCTACTCTTTGACATCATCTCCCTGGGG
464



TGTCAAGTACTC






rPout-ERBB2-5
GGATATTCCTTTCTACTCTTTGACATCATCTCATAACTC
465



CACACATCACTCTGGT






rPout-ERBB2-50
GGATATTCCTTTCTACTCTTTGACATCATCTTGTTCCTC
466



TTCCAACGAGGC






rPout-ERBB2-6
GGATATTCCTTTCTACTCTTTGACATCATCTCAGGCACT
467



CTCGGTGGATC






rPout-ERBB2-7
GGATATTCCTTTCTACTCTTTGACATCATCTCCTAAGGT
468



CAAATCCTAGGGGGTAATA






rPout-ERBB2-8
GGATATTCCTTTCTACTCTTTGACATCATCTCCGGGGCT
469



CTGGTCATTG






rPout-ERBB2-9
GGATATTCCTTTCTACTCTTTGACATCATCTGTTCAGCG
470



GGTCTCCATTGT






rPout-Mut1
GGATATTCCTTTCTACTCTTTGACATCATCTTGAAGACC
471



TCACAGTAAAAATAGGTGATT






rPout-Mut2
GGATATTCCTTTCTACTCTTTGACATCATCTTGTGGAAG
472



ATCCAATCCATTTTTGTTG






rPout-Mut4
GGATATTCCTTTCTACTCTTTGACATCATCTGAGGGTCT
473



GACGGGTAGAGT






rPout-Mut5
GGATATTCCTTTCTACTCTTTGACATCATCTACAGCACA
474



TGACGGAGGTTG






rPout-Mut6
GGATATTCCTTTCTACTCTTTGACATCATCTTTATAAGG
475



CCTGCTGAAAATGACTGAA



rPout-mut10
GGATATTCCTTTCTACTCTTTGACATCATCTGACCCCAG
476



TTGCAAACCAGAC






rPout-mut11
GGATATTCCTTTCTACTCTTTGACATCATCTTCTGATTC
477



CTCACTGATTGCTCTTAG






rPout-mut12
GGATATTCCTTTCTACTCTTTGACATCATCTCGGGGGCT
478



CAGCATCCA






rPout-mut13
GGATATTCCTTTCTACTCTTTGACATCATCTCAAACAGT
479



AGCTTCCCTGGGT






rPout-mut14
GGATATTCCTTTCTACTCTTTGACATCATCTCAGGACTC
480



GGTGGATATGGTC






rPout-mut15
GGATATTCCTTTCTACTCTTTGACATCATCTGGCGCATG
481



TAGGCGGTG






rPout-mut16
GGATATTCCTTTCTACTCTTTGACATCATCTGGAGATGT
482



GGTCAATGGAAGAAAC






rPout-mut17
GGATATTCCTTTCTACTCTTTGACATCATCTTCGTGTTG
483



GCAACATACCATCT






rPout-mut18
GGATATTCCTTTCTACTCTTTGACATCATCTCTTCTAAC
484



AGCTACCCTTCCATCAT






rPout-mut19
GGATATTCCTTTCTACTCTTTGACATCATCTAGGAAAGT
485



TCTGCTGTTTTTAGCAAA






rPout-mut20
GGATATTCCTTTCTACTCTTTGACATCATCTCTCAGTAT
486



TTGCAGAATACATTCAAGGT






rPout-mut21
GGATATTCCTTTCTACTCTTTGACATCATCTGAAGAGTA
487



ACAAGCCAAATGAACAGA






rPout-mut22
GGATATTCCTTTCTACTCTTTGACATCATCTTTGATAGT
488



TGTTCTAGCAGTGAAGAGA






rPout-mut23
GGATATTCCTTTCTACTCTTTGACATCATCTACAATTCA
489



AAAGCACCTAAAAAGAATAGG






rPout-mut24
GGATATTCCTTTCTACTCTTTGACATCATCTACAGAAAA
490



AAAGGTAGATCTGAATGCT






rPout-mut25
GGATATTCCTTTCTACTCTTTGACATCATCTAGGATCTG
491



ATTCTTCTGAAGATACCG






rPout-mut26
GGATATTCCTTTCTACTCTTTGACATCATCTTGGATTTA
492



TCTGCTCTTCGCGT






rPout-mut27
GGATATTCCTTTCTACTCTTTGACATCATCTGTATCTAC
493



AACTGTTTCATATACTTCATCTTCT






rPout-mut28
GGATATTCCTTTCTACTCTTTGACATCATCTCAGGCCAA
494



AGACGGTACAACT






rPout-mut29
GGATATTCCTTTCTACTCTTTGACATCATCTCTCTTCTT
495



TTTCCAATTCTTGAATTTGACA






rPout-mut30
GGATATTCCTTTCTACTCTTTGACATCATCTGCAGTTTC
496



AGGACATCCATTTTATCAA






rPout-mut31
GGATATTCCTTTCTACTCTTTGACATCATCTCCAAGTTA
497



ATATTCCTAACACACTGTTCA






rPout-mut32
GGATATTCCTTTCTACTCTTTGACATCATCTAATAAGGC
498



TTCTAGTCTCTTTTGTTGG






rPout-mut33
GGATATTCCTTTCTACTCTTTGACATCATCTACAAGCAC
499



TTATCAAAACTGAAAAATTACAAT






rPout-mut34
GGATATTCCTTTCTACTCTTTGACATCATCTGGCTTAAT
500



AATGTCCTCATTAAGGTCTATC






rPout-mut35
GGATATTCCTTTCTACTCTTTGACATCATCTCAATGCAA
501



GTTCTTCGTCAGCTA






rPout-mut36
GGATATTCCTTTCTACTCTTTGACATCATCTTTTAAACT
502



ATTTCTAACAACGCCTTCTCT






rPout-mut37
GGATATTCCTTTCTACTCTTTGACATCATCTTCAACATC
503



CCACCTCCCATCTA






rPout-mut38
GGATATTCCTTTCTACTCTTTGACATCATCTGAATCATA
504



TTTGTGTCTGATGGGCAAT






rPout-mut39
GGATATTCCTTTCTACTCTTTGACATCATCTAAACCATG
505



TGAAAATCACAGATTTTGG






rPout-mut3new
GGATATTCCTTTCTACTCTTTGACATCATCTCAGAGAAT
506



CTCCATTTTAGCACTTACC






rPout-mut40
GGATATTCCTTTCTACTCTTTGACATCATCTTGAATATC
507



ATTAAGGAACTTGATATTTTTCAGG






rPout-mut41
GGATATTCCTTTCTACTCTTTGACATCATCTAAATTTGA
508



GTTGAAATCATGGTATTGCAT






rPout-mut42
GGATATTCCTTTCTACTCTTTGACATCATCTCACAGCAC
509



CCAATCAAGCTC






rPout-mut43
GGATATTCCTTTCTACTCTTTGACATCATCTGACAACAC
510



TCTTCAGCACAATCAA






rPout-mut44
GGATATTCCTTTCTACTCTTTGACATCATCTAGGGCATC
511



ACCTCTCTACAGTT






rPout-mut45
GGATATTCCTTTCTACTCTTTGACATCATCTGTTTGCTG
512



AATGTTAACATTAATGCTTATTT






rPout-mut46
GGATATTCCTTTCTACTCTTTGACATCATCTACGGACCT
513



TACGTCAGTGACT






rPout-mut47
GGATATTCCTTTCTACTCTTTGACATCATCTGGCTACGT
514



GTTAGTGGCTCTTA






rPout-mut48
GGATATTCCTTTCTACTCTTTGACATCATCTACGGAGAA
515



TAAACTGAGCTCTCTC






rPout-mut49
GGATATTCCTTTCTACTCTTTGACATCATCTCCAAAAAA
516



TGAAGCCGGCGA






rPout-mut50
GGATATTCCTTTCTACTCTTTGACATCATCTTGCCTACT
517



GGTTCAATTACTTTTAAAAAG






rPout-mut51
GGATATTCCTTTCTACTCTTTGACATCATCTCATCAGCA
518



TTTGACTTTACCTTATCAATG






rPout-mut52
GGATATTCCTTTCTACTCTTTGACATCATCTCTAGAGTG
519



TCTGTGTAATCAAACAAGTTT






rPout-mut53
GGATATTCCTTTCTACTCTTTGACATCATCTTGATCCAG
520



TAACACCAATAGGGTTC






rPout-mut54
GGATATTCCTTTCTACTCTTTGACATCATCTAGTGAAAA
521



GAGTCTCAAACACAAACTAG






rPout-mut55
GGATATTCCTTTCTACTCTTTGACATCATCTTTTTTTCC
522



AGTTTATTGTATTTGCATAGCA






rPout-mut56
GGATATTCCTTTCTACTCTTTGACATCATCTCCTTATAC
523



ACCGTGCCGAACG






rPout-mut57
GGATATTCCTTTCTACTCTTTGACATCATCTCACAGCAA
524



AGCAGAAACTCACA






rPout-mut58
GGATATTCCTTTCTACTCTTTGACATCATCTCCCTCCCT
525



CCAGGAAGCCTA






rPout-mut59
GGATATTCCTTTCTACTCTTTGACATCATCTGGAGCCAA
526



TATTGTCTTTGTGTTCC






rPout-mut60
GGATATTCCTTTCTACTCTTTGACATCATCTCTCCTTCT
527



GCATGGTATTCTTTCTC






rPout-mut61
GGATATTCCTTTCTACTCTTTGACATCATCTTGATGGGC
528



AGATTACAGTGGG






rPout-mut62
GGATATTCCTTTCTACTCTTTGACATCATCTTGGATACA
529



GGTCAAGTCTAAGTCG






rPout-mut63
GGATATTCCTTTCTACTCTTTGACATCATCTCAATATTG
530



TTCCTGTATACGCCTTCA






rPout-mut64
GGATATTCCTTTCTACTCTTTGACATCATCTTTGCAAGC
531



ATACAAATAAGAAAACATACTT






rPout-mut65
GGATATTCCTTTCTACTCTTTGACATCATCTTCATACCT
532



ACCTCTGCAATTAAATTTGG






rPout-mut66
GGATATTCCTTTCTACTCTTTGACATCATCTCCCCGATG
533



TAATAAATATGCACATATCA






rPout-mut67
GGATATTCCTTTCTACTCTTTGACATCATCTTGTTTTCC
534



AATAAATTCTCAGATCCAGG






rPout-mut68
GGATATTCCTTTCTACTCTTTGACATCATCTTGGATATT
535



TCTCCCAATGAAAGTAAAGTAC






rPout-mut69
GGATATTCCTTTCTACTCTTTGACATCATCTTGCTATCG
536



ATTTCTTGATCACATAGACT






rPout-mut7
GGATATTCCTTTCTACTCTTTGACATCATCTCGTGGGCG
537



TGAGCGC






rPout-mut70
GGATATTCCTTTCTACTCTTTGACATCATCTCTGACCTT
538



AAAATTTGGAGAAAAGTATCG






rPout-mut71
GGATATTCCTTTCTACTCTTTGACATCATCTCATCTGGT
539



GTTACAGAAGTTGAACTG






rPout-mut72
GGATATTCCTTTCTACTCTTTGACATCATCTCGAAGATG
540



GCAAACTTCCCATC






rPout-mut73
GGATATTCCTTTCTACTCTTTGACATCATCTCTCAGACA
541



CTTACGGGGACAG






rPout-mut74
GGATATTCCTTTCTACTCTTTGACATCATCTGACAGGCC
542



CTGACACAGG






rPout-mut75
GGATATTCCTTTCTACTCTTTGACATCATCTATCTCTAA
543



CCCATTGAGGCCG






rPout-mut76
GGATATTCCTTTCTACTCTTTGACATCATCTACACTTGA
544



CCATCACCATGTAGAC






rPout-mut77
GGATATTCCTTTCTACTCTTTGACATCATCTTACAAGCT
545



GTCTCTCTCCCAGT






rPout-mut78
GGATATTCCTTTCTACTCTTTGACATCATCTCCAGCCCA
546



TGGCAAACAC






rPout-mut79
GGATATTCCTTTCTACTCTTTGACATCATCTCGCTCACC
547



ATGTGTGACTTGAT






rPout-mut8
GGATATTCCTTTCTACTCTTTGACATCATCTTCCTATCC
548



TGAGTAGTGGTAATCTACT






rPout-mut9
GGATATTCCTTTCTACTCTTTGACATCATCTTCTGACTG
549



TACCACCATCCACT






rPout-Ref-1
GGATATTCCTTTCTACTCTTTGACATCATCTAGTAACAG
550



TAGGTGTTTCAATATGACTTTT






rPout-Ref-10
GGATATTCCTTTCTACTCTTTGACATCATCTCCCTCCAG
551



GAGCCCACC






rPout-Ref-11
GGATATTCCTTTCTACTCTTTGACATCATCTACTGCTAC
552



TACATACCAGGTTCTG






rPout-Ref-12
GGATATTCCTTTCTACTCTTTGACATCATCTCTGATCAA
553



GGCACCGCTCTAA






rPout-Ref-13
GGATATTCCTTTCTACTCTTTGACATCATCTCTCCATCC
554



CGGTGTGCAT






rPout-Ref-14
GGATATTCCTTTCTACTCTTTGACATCATCTTCAAGGGC
555



TATGGGGGCTT






rPout-Ref-15
GGATATTCCTTTCTACTCTTTGACATCATCTAGATGTGC
556



CCTGACATCAGAAATA






rPout-Ref-16
GGATATTCCTTTCTACTCTTTGACATCATCTTCACTTAA
557



CCTTCAGTGTTGATCTGA






rPout-Ref-17
GGATATTCCTTTCTACTCTTTGACATCATCTAGGAGTGG
558



GACCATGTTTGG






rPout-Ref-18
GGATATTCCTTTCTACTCTTTGACATCATCTCCATCGCT
559



CCCATCATTGCT






rPout-Ref-19
GGATATTCCTTTCTACTCTTTGACATCATCTCTTTCAAA
560



CACGTGTGATCAATAGTAC






rPout-Ref-2
GGATATTCCTTTCTACTCTTTGACATCATCTCATTCTCA
561



TATCAGAACTTAAATACATAGCAG






rPout-Ref-20
GGATATTCCTTTCTACTCTTTGACATCATCTACAAGTCC
562



ATCTTATAGGGGAAGGA






rPout-Ref-21
GGATATTCCTTTCTACTCTTTGACATCATCTAAATGCAT
563



GAGCATGCGCAA






rPout-Ref-22
GGATATTCCTTTCTACTCTTTGACATCATCTCCACGATA
564



AAATTCTCTTATCTTGAAGGATT






rPout-Ref-23
GGATATTCCTTTCTACTCTTTGACATCATCTGTTCAAAG
565



TGTTTCTGATATTGAAAAATTTTAAGT






rPout-Ref-24
GGATATTCCTTTCTACTCTTTGACATCATCTCCTTTTTC
566



ATCCTTCGCACATGTATA






rPout-Ref-25
GGATATTCCTTTCTACTCTTTGACATCATCTCCCTGGAG
567



CAGATGACTCACA






rPout-Ref-26
GGATATTCCTTTCTACTCTTTGACATCATCTAGGCAGGG
568



GGCTTGGT






rPout-Ref-27
GGATATTCCTTTCTACTCTTTGACATCATCTCACACCTT
569



TTTTAACAACCGGATCT






rPout-Ref-28
GGATATTCCTTTCTACTCTTTGACATCATCTAGGTGAGG
570



CCCTGTAATCTGTA






rPout-Ref-29
GGATATTCCTTTCTACTCTTTGACATCATCTACCTTAAT
571



ATCAGACTTCCCAGCC






rPout-Ref-3
GGATATTCCTTTCTACTCTTTGACATCATCTTTTTGGGA
572



GCTCTGAGACAGG






rPout-Ref-30
GGATATTCCTTTCTACTCTTTGACATCATCTCTGATGGC
573



AAAGCAGAAGACAATA






rPout-Ref-31
GGATATTCCTTTCTACTCTTTGACATCATCTAGCCCTTT
574



CAGGGAGTCCT






rPout-Ref-32
GGATATTCCTTTCTACTCTTTGACATCATCTTCTCTTAA
575



TCTCAGTTTTCGTTACTGTAAAAT






rPout-Ref-33
GGATATTCCTTTCTACTCTTTGACATCATCTCATAGCAC
576



CACTCGGTGAACTT






rPout-Ref-34
GGATATTCCTTTCTACTCTTTGACATCATCTAGGAGTGA
577



GAACCCACGTACA






rPout-Ref-35
GGATATTCCTTTCTACTCTTTGACATCATCTCAACAGTG
578



TGAATGTACTTAATGACACT






rPout-Ref-36
GGATATTCCTTTCTACTCTTTGACATCATCTCCTGTCCT
579



GTGAGGCAGG






rPout-Ref-37
GGATATTCCTTTCTACTCTTTGACATCATCTTGCCCTGC
580



TGGTGTTCTTTTATA






rPout-Ref-38
GGATATTCCTTTCTACTCTTTGACATCATCTTCGCTGCT
581



GCTGTTGCT






rPout-Ref-39
GGATATTCCTTTCTACTCTTTGACATCATCTGGGTCCTC
582



AGGTCCTTGTG






rPout-Ref-4
GGATATTCCTTTCTACTCTTTGACATCATCTGCGTTGGG
583



AACTTCAACTGG






rPout-Ref-40
GGATATTCCTTTCTACTCTTTGACATCATCTTTGTAGCC
584



CAGCATGGCAA






rPout-Ref-41
GGATATTCCTTTCTACTCTTTGACATCATCTTTTGCTTT
585



TGAGGGACAGAAAATCA






rPout-Ref-42
GGATATTCCTTTCTACTCTTTGACATCATCTACTTGGCT
586



AATGAGTTGATCTCTCT






rPout-Ref-43
GGATATTCCTTTCTACTCTTTGACATCATCTAGTTATTT
587



TCAAAAGAAAACAAAGGACATAGATT






rPout-Ref-44
GGATATTCCTTTCTACTCTTTGACATCATCTGCAAAGAG
588



TGCTCAAACCTTGG






rPout-Ref-45
GGATATTCCTTTCTACTCTTTGACATCATCTGTTACTAA
589



TTTTTTTGGCTATCATGCCA






rPout-Ref-46
GGATATTCCTTTCTACTCTTTGACATCATCTGTAGCAGA
590



GAACCCACTTGGG






rPout-Ref-47
GGATATTCCTTTCTACTCTTTGACATCATCTAGCTCAAA
591



CCATATTCTTAATTTTTAAAATTCAC






rPout-Ref-48
GGATATTCCTTTCTACTCTTTGACATCATCTCTCCTCTG
592



TCGTAAGTCAAGTCTTT






rPout-Ref-49
GGATATTCCTTTCTACTCTTTGACATCATCTCAGTCTGG
593



TAAAGTGCTATCGAATC






rPout-Ref-5
GGATATTCCTTTCTACTCTTTGACATCATCTCTGAATAG
594



TCCGTTTCGGATACTCA






rPout-Ref-50
GGATATTCCTTTCTACTCTTTGACATCATCTAAAAACAC
595



AAATTACCTAAACTGACTCAAG






rPout-Ref-6
GGATATTCCTTTCTACTCTTTGACATCATCTAGCGCCTC
596



CCGGCT






rPout-Ref-7
GGATATTCCTTTCTACTCTTTGACATCATCTTGCCAGAG
597



GTAGTGGAGGTC






rPout-Ref-8
GGATATTCCTTTCTACTCTTTGACATCATCTAGTGACTT
598



GCGTTCATCTTGTTATTTA






rPout-Ref-9
GGATATTCCTTTCTACTCTTTGACATCATCTGGAGCCTG
599



AAAAGGTAGGTTGG









III. UMI DESIGN

In the NGS library preparation process, PCR amplification steps can significantly increase the quantitation variation, making it difficult to differentiate small changes in original molecule number. UMI technology may be used to reduce PCR bias and achieve absolute quantitation of original DNA molecules. The concept of UMI is to give every original DNA molecule a different DNA sequence as a “barcode,” so that the origin of each NGS read can be tracked based on the barcode sequence. Given enough NGS reads, the number of unique UMIs found in the NGS output reflects the number of original DNA molecules. Previously, UMI technology was mostly used for error correction in NGS-based detection of low-frequency mutations; it has also been applied to quantitation. Labeling each original molecule uniquely is achieved by using a large number of different UMI sequences; for example, using 10^9 different UMI sequences for 100,000 original molecules will result in <0.006% of molecules carrying repeated UMIs.
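The repeated-UMI figure quoted above can be checked with a short calculation. The following is a minimal sketch, not part of the described method: it assumes each molecule draws its UMI uniformly at random from the pool, and it counts a molecule as carrying a repeated UMI when its UMI is already in use by another molecule. Both the uniform-draw model and this counting convention are assumptions, and the function names are illustrative.

```python
import math


def expected_distinct_umis(n_molecules: int, n_umis: int) -> float:
    """Expected number of distinct UMIs when each molecule draws one UMI
    uniformly at random (assumed model) from n_umis possibilities."""
    # E[distinct] = N * (1 - (1 - 1/N)**n), evaluated with log1p/expm1
    # so the result stays accurate when 1/N is very small.
    return -n_umis * math.expm1(n_molecules * math.log1p(-1.0 / n_umis))


def repeated_umi_fraction(n_molecules: int, n_umis: int) -> float:
    """Fraction of molecules whose UMI duplicates one already in use."""
    return 1.0 - expected_distinct_umis(n_molecules, n_umis) / n_molecules


if __name__ == "__main__":
    frac = repeated_umi_fraction(100_000, 10**9)
    print(f"{frac:.4%} of molecules carry a repeated UMI")
```

Under these assumptions, 100,000 molecules labeled from 10^9 UMIs give a repeated fraction of roughly 0.005%, consistent with the <0.006% bound stated above.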


DNA sequences containing degenerate bases, such as poly(N) (i.e., a mix of A, T, C, or G at each position), are often used as UMI sequences. In QASeq, poly(H) (A, T, or C at each position) is used as the UMI because it has weaker cross-binding energy than poly(N) or a mix of S (C or G) and W (A or T) bases, as indicated by simulation (FIG. 2). (H)20 contains 3.5×10^9 different sequences, which is enough for 100,000 molecules as input; (H)15 contains 1.4×10^7 different sequences, which is enough for 6,000 molecules as input.
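The sequence counts above follow directly from the number of bases each degenerate code allows. As an illustrative sketch (the helper names are hypothetical; the code table uses the standard IUPAC meanings of N, H, D, S, and W):

```python
# Standard IUPAC degenerate-base codes referred to in this section.
IUPAC = {
    "N": "ACGT",
    "H": "ACT",
    "D": "AGT",
    "S": "CG",
    "W": "AT",
}


def umi_space_size(code: str, length: int) -> int:
    """Number of distinct sequences in a homopolymer of one degenerate code,
    e.g. (H)20 -> 3**20."""
    return len(IUPAC[code]) ** length


def matches_pattern(umi: str, code: str) -> bool:
    """True if every base of a concrete UMI is allowed by the given code."""
    allowed = set(IUPAC[code])
    return all(base in allowed for base in umi)


if __name__ == "__main__":
    for code, length in (("H", 20), ("H", 15), ("D", 15), ("N", 15)):
        print(f"({code}){length}: {umi_space_size(code, length):.2e} sequences")
```

A check such as `matches_pattern` could also be used to discard reads whose UMI region contains a base that the design does not allow (for example, a G in a poly(H) UMI), although the text above does not state whether such filtering is applied.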


IV. SPACER TO REDUCE PCR BIAS

PCR efficiency varies for amplicons with different sequences. Because UMIs consist of many different sequences, a spacer between the primer and the variable UMI region may be used to achieve more uniform PCR efficiency.


NGS was carried out to evaluate the influence of the spacer on PCR bias (FIG. 3A). The template molecules have two adaptors, one at the 5′ end and one at the 3′ end, for amplification, and a UMI region consisting of (D)15 in the middle. Three sets of primers were used to amplify the template separately: without any spacer (set 1); with a 5 nt spacer between the forward primer and the UMI and a 5 nt spacer between the reverse primer and the UMI (set 2); or with a 12 nt spacer between the forward primer and the UMI and an 11 nt spacer between the reverse primer and the UMI (set 3). Indices were added via PCR before NGS analysis. (D)15 contains 1.4×10^7 different sequences. Because the input template molecule number is far below the number of possible sequences, each unique UMI sequence has only 1 copy before amplification. All NGS reads carrying the same UMI are presumably derived from the same molecule. As such, UMI family size (i.e., the number of reads carrying the same UMI) is an indication of the PCR efficiency.


UMI family size distribution was compared to evaluate the significance of spacers on PCR bias (FIG. 3B). More uniform distribution was observed when the spacer between primers and UMI was longer. In primer set 3, wherein the spacer length was longer than 10 nt at both ends, a significantly improved distribution was achieved.
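UMI family size, as defined above, is simply a per-UMI read count. The sketch below shows one way it could be tabulated once the UMI has been extracted from each read. The coefficient of variation used to summarize uniformity is an illustrative choice and is not stated to be the metric behind FIG. 3B; the toy UMI strings are invented.

```python
from collections import Counter
from statistics import mean, pstdev


def umi_family_sizes(umis):
    """Map each UMI to the number of reads carrying it (its family size)."""
    return Counter(umis)


def family_size_cv(umis):
    """Coefficient of variation of family sizes; lower means more uniform
    amplification across UMIs (illustrative summary statistic)."""
    sizes = list(umi_family_sizes(umis).values())
    return pstdev(sizes) / mean(sizes)


if __name__ == "__main__":
    # Toy data: one UMI string per read.
    uniform_reads = ["AAT"] * 4 + ["TTA"] * 4 + ["ATT"] * 4
    skewed_reads = ["AAT"] * 10 + ["TTA"] * 1 + ["ATT"] * 1
    print("uniform CV:", family_size_cv(uniform_reads))
    print("skewed CV:", family_size_cv(skewed_reads))
```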


V. QASeq WORKFLOW

A schematic of the QASeq NGS library preparation workflow is shown in FIG. 1. First, a DNA sample is mixed with all the SfP, SrPA, DNA polymerase, dNTPs, and PCR buffer. Two cycles of long-extension (about 30 min) PCR are performed to add UMIs at all target loci. Afterwards, each strand in one DNA molecule will carry a different UMI. Next, in order to amplify the molecules while preventing addition of multiple UMIs onto the same original molecule, the annealing temperature is raised by about 8° C. and amplification is performed for at least two cycles (e.g., for about seven cycles) using UfP and UrP with a short extension (about 30 s). Addition of UfP and UrP into the reaction is an open-tube step on the thermocycler. After purification using SPRI magnetic beads or columns, SrPB primers, DNA polymerase, dNTPs, and PCR buffer are mixed with the PCR product for adapter replacement; after at least one cycle (e.g., two cycles) of long extension (about 30 min), the NGS adapters are added only onto the correct PCR products, not the primer dimers or non-specific products. After another purification using SPRI magnetic beads or columns, standard NGS index PCR is performed; libraries are normalized and loaded onto an Illumina sequencer.
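The stages described above can be summarized as a small data structure. This is only a sketch of the example values given in the text (two, about seven, and two cycles): the class and field names are hypothetical, SfP denotes the specific forward primer, and the index PCR is left out because its cycle number is not stated here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PcrStage:
    name: str
    primers: tuple
    cycles: int
    extension_s: int          # approximate extension time per cycle, in seconds
    anneal_offset_c: int = 0  # shift relative to the polymerase's standard annealing temperature
    purify_after: bool = False


# Example values taken from the workflow description; index PCR follows the
# last stage but is omitted because its cycle number is not given.
QASEQ_STAGES = (
    PcrStage("UMI addition", ("SfP", "SrPA"), cycles=2, extension_s=30 * 60),
    PcrStage("universal PCR", ("UfP", "UrP"), cycles=7, extension_s=30,
             anneal_offset_c=8, purify_after=True),
    PcrStage("adapter replacement", ("SrPB",), cycles=2, extension_s=30 * 60,
             purify_after=True),
)


def total_extension_minutes(stages):
    """Total extension time across all cycles, ignoring ramping, denaturation,
    and annealing holds."""
    return sum(s.cycles * s.extension_s for s in stages) / 60


if __name__ == "__main__":
    for stage in QASEQ_STAGES:
        print(stage)
    print("extension time (min):", total_extension_minutes(QASEQ_STAGES))
```

With these example values, the two long-extension stages account for almost all of the extension time, while the universal PCR contributes only a few minutes.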


All types of DNA polymerases and PCR supermixes can be used. The standard annealing, extension, and denaturation temperatures for the specific polymerase used should be followed (except for the universal PCR step, in which the annealing temperature is raised).


VI. ALTERNATIVE QASeq WORKFLOWS

The workflow may be performed using SfP and SrPB to add UMIs using two cycles of PCR, and then directly adding index primers for index PCR. To test this, twenty sets of SfP and SrPB were used in the same reaction. The experimental on-target rate of this method is very low (0.5%), and thus this method may not be useful in an NGS assay for diagnostics (FIG. 9A). The off-target NGS reads were mostly primer dimers. In a second alternative workflow, six cycles of universal PCR are performed using UfP and UrP, followed by a purification step. These additional steps improved the on-target rate to 12-28% (mean on-target rate=20%) for different libraries (FIG. 9B). A third alternative workflow based on the second alternative workflow was tested. For this, a size selection step using agarose gel was added after index PCR to further reduce primer dimers. The experimental mean on-target rate was improved to 42%, but was still lower than 50% (FIG. 9C). Primer dimer reduction was achieved using the primary experimental workflow, which includes both adapter replacement and purification after universal PCR, and results in a high mean on-target rate of 66% (FIG. 9D). One source of primer dimers in the above-mentioned workflows is shown in FIG. 9E. If the 3′ part of SfP binds to SrPB, or the 3′ part of SrPB binds to SfP, a dimer strand with universal regions at both 5′ and 3′ ends can be generated and thus amplified in the universal or index PCR step.
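The dimer mechanism described above depends on the 3′ end of one primer annealing to another primer. A crude way to screen a primer pair for this is to look for perfect complementarity at the 3′ end, as sketched below. This is an illustrative check only: it ignores mismatches, degenerate bases, and thermodynamics, it is not stated to be the method used to design the panel, and the example sequences are invented.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Reverse complement of an unambiguous DNA sequence (A, C, G, T only)."""
    return seq.translate(_COMPLEMENT)[::-1]


def three_prime_anneal_length(primer_a: str, primer_b: str, max_len: int = 12) -> int:
    """Longest k (up to max_len) such that the last k bases of primer_a are
    perfectly complementary to some stretch of primer_b."""
    best = 0
    for k in range(1, min(max_len, len(primer_a), len(primer_b)) + 1):
        tail = primer_a[-k:]
        if reverse_complement(tail) in primer_b:
            best = k
        else:
            # A longer tail cannot match once a shorter one has failed.
            break
    return best


if __name__ == "__main__":
    # Invented toy primers: the last 6 bases of the first can pair with the second.
    print(three_prime_anneal_length("TTTTTTGGATCA", "CCCCTGATCCGGGG"))
    # No complementarity at the 3' end.
    print(three_prime_anneal_length("ACACACAC", "ACACACAC"))
```

Pairs with a long 3′ match would be candidates for redesign, since an extended dimer carries universal regions at both ends and is then amplified in the later PCR steps.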


The primary workflow includes a final index PCR step to add index sequences and the sequencer's P5/P7 sequences to the ends of the amplicon; however, there are alternative workflows that add the abovementioned sequences during the UMI addition, universal PCR, or adapter replacement steps, and thus do not require the index PCR step. FIGS. 10A-C show three examples. First, the index and P5 sequences are added onto the 5′ end of UfP; the other index and P7 sequences are added onto the 5′ end of SrPB. The amplicons obtained from adapter replacement contain P5, P7, and dual indexes, and thus are ready for sequencing (FIG. 10A). Second, the index and P7 sequences are added onto the 5′ end of SrPB, and this modified SrPB is mixed with the normal P5 index primer in the adapter replacement step (FIG. 10B). Third, the index and P5 sequences are added onto the 5′ end of SfP; a primer bearing the P5 sequence is used as UfP in the universal PCR step. The other index and P7 sequences are added onto the 5′ end of SrPB (FIG. 10C).


An alternative QASeq primer design and workflow is shown in FIG. 11. Each primer set comprises three different oligos: a Specific Forward Primer (SfP), a Specific Reverse Primer A (SrPA), and a Specific Reverse Primer B (SrPB). SfP comprises, from 5′ to 3′, regions 1, 2, 3, and 4. Region 4 is the template-binding region; region 3 is the UMI; region 1 is the full or partial NGS adapter; region 2 is an optional spacer region (0-15 nt) added for uniform amplification of UMIs. SrPA comprises region 5, which is the template-binding region. SrPB comprises, from 5′ to 3′, regions 6, 7, and 8. Region 8 is the template-binding region, the 3′-end of which is closer to region 4 than region 5 by at least 1 base; region 6 is the full or partial NGS adapter; region 7 is an optional spacer region (0-15 nt) added for uniform amplification of different loci. Each QASeq panel needs only one Universal Forward Primer (UfP), which comprises region 1; there can be additional bases at the 5′-end of region 1 in UfP. The melting temperatures (Tm) of template-binding regions 4, 5, and 8 are about the same as the PCR annealing temperature, and the Tm of UfP is not lower than that of regions 4, 5, and 8 in the experimental PCR conditions. Compared to the original design, SrPA needs only the template-binding region, and a Universal Reverse Primer (UrP) is not necessary. In the experimental workflow, more cycles of PCR (e.g., at least 10 cycles) are needed in the universal PCR step under this alternative primer design.


VII. DATA ANALYSIS WORKFLOW

A schematic of the data analysis workflow for CNV detection is shown in FIG. 4A. First, raw NGS reads are aligned to the amplicon regions; an optional adapter trimming can be performed before alignment. Unaligned reads are discarded, and the aligned reads are grouped by the loci they aligned to.


Then, all the reads aligned to the same locus are further divided by their UMI sequences, i.e., reads carrying the same UMI are grouped as one UMI family. UMI family size is the number of reads carrying the same UMI, and unique UMI number is the total count of different UMI sequences at one locus (FIG. 4B). Next, all unique UMI families that are likely results of PCR or NGS errors are removed. For example, a UMI sequence that is not consistent with the designed UMI pattern (e.g., G bases found in a poly(H) UMI sequence) is an error and should be removed. Additionally, if two UMI sequences differ by only 1-2 bases, the one with the smaller UMI family size is likely mutated from the other, and thus can be optionally removed. After removal of UMI errors, the UMI families with family sizes <Fmin are also removed. Fmin is determined based on the distribution of UMI family size, and Fmin=4 may be used in most cases. The unique UMI number (N) after UMI removal is used for the next step.
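The UMI grouping and filtering steps above can be sketched in Python as follows. This is a minimal illustration, not the implementation used in the experiments: the function names, the `allowed` alphabet (H taken as A, C, or T), and the use of Hamming distance for the optional 1-2 base comparison are assumptions made for the sketch, and UMIs are assumed to have been extracted from the reads already.

```python
from collections import Counter

def hamming(a, b):
    """Number of mismatched positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))

def filter_umi_families(umis, allowed="ACT", f_min=4, max_dist=0):
    """Return {umi: family_size} for one locus after error removal.

    umis     -- list of UMI strings, one per aligned read at this locus
    allowed  -- bases permitted by the UMI design (assumed: H = A, C, or T)
    f_min    -- minimum UMI family size (Fmin in the text)
    max_dist -- if > 0, drop a UMI within this Hamming distance of a larger
                family (the optional step in the text); 0 disables the step
    """
    # UMI family size = number of reads sharing the same UMI.
    families = dict(Counter(umis))
    # 1. Remove UMIs inconsistent with the designed pattern (e.g., G in poly(H)).
    families = {u: n for u, n in families.items() if set(u) <= set(allowed)}
    # 2. Optionally remove UMIs that look like mutated copies of a larger family.
    if max_dist > 0:
        kept = []
        for u, n in sorted(families.items(), key=lambda kv: -kv[1]):
            is_error = any(
                len(u) == len(k) and hamming(u, k) <= max_dist and families[k] > n
                for k in kept
            )
            if not is_error:
                kept.append(u)
        families = {u: families[u] for u in kept}
    # 3. Remove families smaller than Fmin.
    return {u: n for u, n in families.items() if n >= f_min}

# Hypothetical reads at one locus; the unique UMI number N is the dict length.
example = ["ACTCA"] * 6 + ["ACTCT"] * 4 + ["AGTCA"] * 5 + ["CCTTA"] * 4 + ["TTTAA"] * 3
n_unique = len(filter_umi_families(example, max_dist=1))
```

Hamming distance only captures substitutions; a sketch that also handles insertions or deletions in the UMI would need an edit-distance comparison instead.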


FEC of a target gene may be calculated as:

$$\mathrm{FEC} = k \cdot \frac{\sum_{i=1}^{u} N_{\mathrm{Tar},i}}{\sum_{j=1}^{w} \sum_{i=1}^{v} N_{\mathrm{Ref},i,j}} - 1,$$




where $\sum_{i=1}^{u} N_{\mathrm{Tar},i}$ is the sum of unique UMI numbers for all or part of the target gene loci, u is the number of loci considered and is no more than the total number of loci in the target gene; $\sum_{j=1}^{w} \sum_{i=1}^{v} N_{\mathrm{Ref},i,j}$ is the sum of unique UMI numbers for all or part of the Reference loci, v is the number of loci considered for one reference and is no more than the total number of loci in that reference; w is the number of references considered and is no more than the total number of references; and k is determined by experimental calibration. Before testing the QASeq panel on a clinical sample, calibration experiments are performed on DNA samples with well-characterized CNV status of the target gene. gDNA extracted from normal and cancer cell lines with CNV status characterized by ddPCR can be used for calibration. The FEC of normal calibration samples should be 0. The LoD of the assay is also determined by the calibration experiments; LoD is the smallest frequency of extra copies detectable by the assay. When testing a clinical sample, the FEC for a gene of interest is used to infer the CNV status; if FEC>LoD, the sample is inferred to contain amplification of the target gene; if FEC<−LoD, the sample is inferred to contain deletion of the target gene.
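As an illustration of the FEC calculation, the following Python sketch computes FEC from per-locus unique UMI numbers and shows one possible way to obtain k. The text states only that k is determined by experimental calibration and that normal samples should give FEC=0; choosing k as the reciprocal of the mean target/reference ratio over normal samples is an assumption of this sketch, as are the function names and the numbers in the example.

```python
def fec(n_target, n_reference, k):
    """FEC = k * sum_i N_Tar,i / (sum_j sum_i N_Ref,i,j) - 1.

    n_target    -- unique UMI numbers for the target-gene loci considered
    n_reference -- list of lists: unique UMI numbers per locus, one list per reference
    k           -- calibration constant
    """
    tar = sum(n_target)
    ref = sum(sum(loci) for loci in n_reference)
    return k * tar / ref - 1.0

def calibrate_k(normal_samples):
    """Pick k so that the mean FEC over normal calibration samples is 0.

    normal_samples -- list of (n_target, n_reference) tuples.
    Assumed scheme: k is the reciprocal of the mean target/reference ratio.
    """
    ratios = [sum(t) / sum(sum(loci) for loci in r) for t, r in normal_samples]
    return len(ratios) / sum(ratios)

def has_amplification(fec_value, lod):
    """Amplification is inferred when FEC exceeds the LoD."""
    return fec_value > lod

# Hypothetical unique UMI numbers, for illustration only.
normal = ([1000, 1000], [[1000, 1000], [2000]])
k = calibrate_k([normal])
sample_fec = fec([1200, 1200], [[1000, 1000], [2000]], k)
```

In this made-up example the test sample has 20% more target UMIs than the normal sample for the same reference counts, so the computed FEC is approximately 0.2.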


VIII. ALLELE RATIO QUANTITATION

QASeq can be applied to quantifying the allele ratio of different genetic identities for 1-10,000 genomic loci using multiplexed PCR. The multiplexed PCR panel design for targeted genomic loci, and the experimental workflow for labeling each strand of targeted genomic loci with an oligonucleotide barcode sequence by PCR, followed by amplification of the genomic regions for high-throughput sequencing, are similar to those used for CNV detection.


A schematic of the data analysis workflow for allele ratio quantitation is shown in FIG. 12A. First, raw NGS reads are aligned to the amplicon regions; an optional adapter trimming can be performed before alignment. Unaligned reads are discarded, and the aligned reads are grouped by the loci they aligned to. At each locus, the NGS reads are divided by UMI sequence; all NGS reads carrying the same UMI sequence are grouped as one UMI family. The unique UMI families with errors in the UMI, which are likely results of PCR or NGS errors, are removed as described in the “Data Analysis Workflow” section.


The genetic identity (wild type or mutation) for each remaining UMI family is called based on majority vote; the genetic identity needs to be supported by at least 70% of the members (reads) in the same UMI family. As an example, in FIG. 12B, for a UMI family with UMI family size=7, all 7 reads share the same UMI sequence (displayed as a 2D barcode). The genetic identity at the locus of interest is ‘A’ for 6 reads and ‘G’ for 1 read. Because more than 70% of the reads in the UMI family support ‘A’, the genetic identity for this UMI family is called as ‘A’. The 1 read corresponding to ‘G’ is a result of PCR or NGS error. UMI families in which no single genetic identity is supported by at least 70% of the reads are discarded.
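The majority-vote call for a single UMI family can be sketched as below. The function name and the representation of reads as a list of observed bases are assumptions for illustration; the threshold is applied as "at least 70%", and the comparison operator can be changed if a strict "more than 70%" rule is preferred.

```python
from collections import Counter

def call_family(bases, min_fraction=0.70):
    """Return the consensus genetic identity of one UMI family, or None.

    bases        -- the base (or allele label) observed at the locus of
                    interest in each read of the family
    min_fraction -- minimum fraction of reads that must support the call
    """
    if not bases:
        return None
    identity, count = Counter(bases).most_common(1)[0]
    # Discard the family when no identity reaches the support threshold.
    return identity if count / len(bases) >= min_fraction else None

# The FIG. 12B example: 6 reads support 'A' and 1 read supports 'G'.
fig12b_call = call_family(["A", "A", "A", "A", "A", "A", "G"])
```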


Next, the unique UMI number N (the total count of different UMI sequences at one locus) is counted for each different genetic identity at the targeted locus; N indicates the number of original strands. The allele ratio of a target locus is calculated as $R_{\mathrm{allele}} = N_1/N_2$, where $N_1$ is the unique UMI number for the first genetic identity, and $N_2$ is the unique UMI number for the second genetic identity.
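A short sketch of the allele ratio calculation at one locus follows. It assumes that each surviving UMI family has already been assigned a consensus genetic identity (with discarded families marked as None); the function name, the input layout, and the example values are hypothetical.

```python
from collections import Counter

def allele_ratio(family_calls, first, second):
    """Allele ratio N1 / N2 from per-family consensus calls at one locus.

    family_calls -- dict mapping UMI sequence -> consensus identity,
                    with None for families that were discarded
    first        -- the first genetic identity (counted as N1)
    second       -- the second genetic identity (counted as N2)
    """
    # Each UMI family counts once, regardless of its family size.
    counts = Counter(c for c in family_calls.values() if c is not None)
    n1, n2 = counts[first], counts[second]
    if n2 == 0:
        raise ValueError("no UMI families support the second genetic identity")
    return n1 / n2

# Hypothetical calls: three families support 'A', one supports 'G', one is discarded.
calls = {"U1": "A", "U2": "A", "U3": "G", "U4": None, "U5": "A"}
ratio = allele_ratio(calls, "A", "G")
```

Counting families rather than reads is what makes the ratio reflect original strands instead of amplification depth.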


IX. DEFINITIONS

“Amplification,” as used herein, refers to any in vitro process for increasing the number of copies of a nucleotide sequence or sequences. Nucleic acid amplification results in the incorporation of nucleotides into DNA or RNA. As used herein, one amplification reaction may consist of many rounds of DNA replication. For example, one PCR reaction may consist of 30-100 “cycles” of denaturation and replication.


“Polymerase chain reaction,” or “PCR,” means a reaction for the in vitro amplification of specific DNA sequences by the simultaneous primer extension of complementary strands of DNA. In other words, PCR is a reaction for making multiple copies or replicates of a target nucleic acid flanked by primer binding sites, such reaction comprising one or more repetitions of the following steps: (i) denaturing the target nucleic acid, (ii) annealing primers to the primer binding sites, and (iii) extending the primers by a nucleic acid polymerase in the presence of nucleoside triphosphates. Usually, the reaction is cycled through different temperatures optimized for each step in a thermal cycler instrument. Particular temperatures, durations at each step, and rates of change between steps depend on many factors well-known to those of ordinary skill in the art, e.g., exemplified by the references: McPherson et al., editors, PCR: A Practical Approach and PCR2: A Practical Approach (IRL Press, Oxford, 1991 and 1995, respectively).


“Primer” means an oligonucleotide, either natural or synthetic, that is capable, upon forming a duplex with a polynucleotide template, of acting as a point of initiation of nucleic acid synthesis and being extended from its 3′ end along the template so that an extended duplex is formed. The sequence of nucleotides added during the extension process is determined by the sequence of the template polynucleotide. Usually primers are extended by a DNA polymerase. Primers are generally of a length compatible with their use in synthesis of primer extension products, and are usually in the range of between 8 to 100 nucleotides in length, such as 10 to 75, 15 to 60, 15 to 40, 18 to 30, 20 to 40, 21 to 50, 22 to 45, 25 to 40, and so on, more typically in the range of between 18-40, 20-35, 21-30 nucleotides long, and any length between the stated ranges. Typical primers can be in the range of between 10-50 nucleotides long, such as 15-45, 18-40, 20-30, 21-25 and so on, and any length between the stated ranges. In some embodiments, the primers are usually not more than about 10, 12, 15, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 35, 40, 45, 50, 55, 60, 65, or 70 nucleotides in length.


“Incorporating,” as used herein, means becoming part of a nucleic acid polymer.


The term “in the absence of exogenous manipulation” as used herein refers to there being modification of a nucleic acid molecule without changing the solution in which the nucleic acid molecule is being modified. In specific embodiments, it occurs in the absence of the hand of man or in the absence of a machine that changes solution conditions, which may also be referred to as buffer conditions. However, changes in temperature may occur during the modification.


A “nucleoside” is a base-sugar combination, i.e., a nucleotide lacking a phosphate. It is recognized in the art that there is a certain inter-changeability in usage of the terms nucleoside and nucleotide. For example, the nucleotide deoxyuridine triphosphate, dUTP, is a deoxyribonucleoside triphosphate. After incorporation into DNA, it serves as a DNA monomer, formally being deoxyuridylate, i.e., dUMP or deoxyuridine monophosphate. One may say that one incorporates dUTP into DNA even though there is no dUTP moiety in the resultant DNA. Similarly, one may say that one incorporates deoxyuridine into DNA even though that is only a part of the substrate molecule.


“Nucleotide,” as used herein, is a term of art that refers to a base-sugar-phosphate combination. Nucleotides are the monomeric units of nucleic acid polymers, i.e., of DNA and RNA. The term includes ribonucleotide triphosphates, such as rATP, rCTP, rGTP, or rUTP, and deoxyribonucleotide triphosphates, such as dATP, dCTP, dUTP, dGTP, or dTTP.


The term “nucleic acid” or “polynucleotide” will generally refer to at least one molecule or strand of DNA, RNA, DNA-RNA chimera or a derivative or analog thereof, comprising at least one nucleobase, such as, for example, a naturally occurring purine or pyrimidine base found in DNA (e.g., adenine “A,” guanine “G,” thymine “T” and cytosine “C”) or RNA (e.g. A, G, uracil “U” and C). The term “nucleic acid” encompasses the terms “oligonucleotide” and “polynucleotide.” “Oligonucleotide,” as used herein, refers collectively and interchangeably to two terms of art, “oligonucleotide” and “polynucleotide.” Note that although oligonucleotide and polynucleotide are distinct terms of art, there is no exact dividing line between them and they are used interchangeably herein. The term “adaptor” may also be used interchangeably with the terms “oligonucleotide” and “polynucleotide.” In addition, the term “adaptor” can indicate a linear adaptor (either single stranded or double stranded) or a stem-loop adaptor. These definitions generally refer to at least one single-stranded molecule, but in specific embodiments will also encompass at least one additional strand that is partially, substantially, or fully complementary to at least one single-stranded molecule. Thus, a nucleic acid may encompass at least one double-stranded molecule or at least one triple-stranded molecule that comprises one or more complementary strand(s) or “complement(s)” of a particular sequence comprising a strand of the molecule. As used herein, a single stranded nucleic acid may be denoted by the prefix “ss,” a double-stranded nucleic acid by the prefix “ds,” and a triple stranded nucleic acid by the prefix “ts.”


A “nucleic acid molecule” or “nucleic acid target molecule” refers to any single-stranded or double-stranded nucleic acid molecule including standard canonical bases, hypermodified bases, non-natural bases, or any combination of the bases thereof. For example and without limitation, the nucleic acid molecule contains the four canonical DNA bases (adenine, cytosine, guanine, and thymine) and/or the four canonical RNA bases (adenine, cytosine, guanine, and uracil). Uracil can be substituted for thymine when the nucleoside contains a 2′-deoxyribose group. The nucleic acid molecule can be transformed from RNA into DNA and from DNA into RNA. For example, and without limitation, mRNA can be converted into complementary DNA (cDNA) using reverse transcriptase, and DNA can be transcribed into RNA using RNA polymerase. A nucleic acid molecule can be of biological or synthetic origin. Examples of nucleic acid molecules include genomic DNA, cDNA, RNA, a DNA/RNA hybrid, amplified DNA, a pre-existing nucleic acid library, etc. A nucleic acid may be obtained from a human sample, such as blood, serum, plasma, cerebrospinal fluid, cheek scrapings, biopsy, semen, urine, feces, saliva, sweat, etc. A nucleic acid molecule may be subjected to various treatments, such as repair treatments and fragmenting treatments. Fragmenting treatments include mechanical, sonic, and hydrodynamic shearing. Repair treatments include nick repair via extension and/or ligation, polishing to create blunt ends, removal of damaged bases, such as deaminated, derivatized, abasic, or crosslinked nucleotides, etc. A nucleic acid molecule of interest may also be subjected to chemical modification (e.g., bisulfite conversion, methylation/demethylation), extension, amplification (e.g., PCR, isothermal, etc.), etc.


Nucleic acid(s) that are “complementary” or “complement(s)” are those that are capable of base-pairing according to the standard Watson-Crick, Hoogsteen or reverse Hoogsteen binding complementarity rules. As used herein, the term “complementary” or “complement(s)” may refer to nucleic acid(s) that are substantially complementary, as may be assessed by the same nucleotide comparison set forth above. The term “substantially complementary” may refer to a nucleic acid comprising at least one sequence of consecutive nucleobases, or semiconsecutive nucleobases if one or more nucleobase moieties are not present in the molecule, that is capable of hybridizing to at least one nucleic acid strand or duplex even if less than all nucleobases base pair with a counterpart nucleobase. In certain embodiments, a “substantially complementary” nucleic acid contains at least one sequence in which about 70%, about 71%, about 72%, about 73%, about 74%, about 75%, about 76%, about 77%, about 78%, about 79%, about 80%, about 81%, about 82%, about 83%, about 84%, about 85%, about 86%, about 87%, about 88%, about 89%, about 90%, about 91%, about 92%, about 93%, about 94%, about 95%, about 96%, about 97%, about 98%, about 99%, to about 100%, and any range therein, of the nucleobase sequence is capable of base-pairing with at least one single or double-stranded nucleic acid molecule during hybridization. In certain embodiments, the term “substantially complementary” refers to at least one nucleic acid that may hybridize to at least one nucleic acid strand or duplex in stringent conditions. In certain embodiments, a “partially complementary” nucleic acid comprises at least one sequence that may hybridize in low stringency conditions to at least one single or double-stranded nucleic acid, or contains at least one sequence in which less than about 70% of the nucleobase sequence is capable of base-pairing with at least one single or double-stranded nucleic acid molecule during hybridization.


The term “non-complementary” refers to nucleic acid sequence that lacks the ability to form at least one Watson-Crick base pair through specific hydrogen bonds.


The term “degenerate” as used herein refers to a nucleotide or series of nucleotides wherein the identity can be selected from a variety of choices of nucleotides, as opposed to a defined sequence. In specific embodiments, there can be a choice from two or more different nucleotides. In further specific embodiments, the selection of a nucleotide at one particular position comprises selection from only purines, only pyrimidines, or from non-pairing purines and pyrimidines.


“Sample” means a material obtained or isolated from a fresh or preserved biological sample or synthetically-created source that contains nucleic acids of interest. Samples can include at least one cell, fetal cell, cell culture, tissue specimen, blood, serum, plasma, saliva, urine, tear, vaginal secretion, sweat, lymph fluid, cerebrospinal fluid, mucosa secretion, peritoneal fluid, ascites fluid, fecal matter, body exudates, umbilical cord blood, chorionic villi, amniotic fluid, embryonic tissue, multicellular embryo, lysate, extract, solution, or reaction mixture suspected of containing immune nucleic acids of interest. Samples can also include non-human sources, such as non-human primates, rodents and other mammals, other animals, plants, fungi, bacteria, and viruses.


As used herein in relation to a nucleotide sequence, “substantially known” refers to having sufficient sequence information in order to permit preparation of a nucleic acid molecule, including its amplification. This will typically be about 100%, although in some embodiments some portion of an adaptor sequence is random or degenerate. Thus, in specific embodiments, substantially known refers to about 50% to about 100%, about 60% to about 100%, about 70% to about 100%, about 80% to about 100%, about 90% to about 100%, about 95% to about 100%, about 97% to about 100%, about 98% to about 100%, or about 99% to about 100%.


X. FURTHER PROCESSING OF TARGET NUCLEIC ACIDS

A. Amplification of DNA


A number of template-dependent processes are available to amplify the nucleic acids present in a given template sample. One of the best known amplification methods is the polymerase chain reaction (referred to as PCR™) which is described in detail in U.S. Pat. Nos. 4,683,195, 4,683,202, and 4,800,159 and in Innis et al., 1990, each of which is incorporated herein by reference in their entirety. Briefly, two synthetic oligonucleotide primers, which are complementary to two regions of the template DNA (one for each strand) to be amplified, are added to the template DNA (that need not be pure), in the presence of excess deoxynucleotides (dNTP's) and a thermostable polymerase, such as, for example, Taq (Thermus aquaticus) DNA polymerase. In a series (typically 30-35) of temperature cycles, the target DNA is repeatedly denatured (around 90° C.), annealed to the primers (typically at 50-60° C.) and a daughter strand extended from the primers (72° C.). As the daughter strands are created they act as templates in subsequent cycles. Thus, the template region between the two primers is amplified exponentially, rather than linearly.


B. Sequencing of DNA


Methods are also provided for the sequencing of the library of adaptor-linked fragments. Any technique for sequencing nucleic acids known to those skilled in the art can be used in the methods of the present disclosure. DNA sequencing techniques include classic dideoxy sequencing reactions (Sanger method) using labeled terminators or primers and gel separation in slab or capillary, sequencing-by-synthesis using reversibly terminated labeled nucleotides, pyrosequencing, 454 sequencing, allele specific hybridization to a library of labeled oligonucleotide probes, sequencing-by-synthesis using allele specific hybridization to a library of labeled clones that is followed by ligation, real time monitoring of the incorporation of labeled nucleotides during a polymerization step, and SOLiD sequencing.


The nucleic acid library may be generated with an approach compatible with Illumina sequencing such as a Nextera™ DNA sample prep kit, and additional approaches for Illumina next-generation sequencing library preparation are described, e.g., in Oyola et al. (2012). In other embodiments, a nucleic acid library is generated with a method compatible with a SOLiD™ or Ion Torrent sequencing method (e.g., a SOLiD® Fragment Library Construction Kit, a SOLiD® Mate-Paired Library Construction Kit, a SOLiD® ChIP-Seq Kit, a SOLiD® Total RNA-Seq Kit, a SOLiD® SAGE™ Kit, an Ambion® RNA-Seq Library Construction Kit, etc.). Additional next-generation sequencing methods, including various methods for library construction that may be used with embodiments of the present invention, are described, e.g., in Pareek (2011) and Thudi (2012).


In particular aspects, the sequencing technologies used in the methods of the present disclosure include the HiSeq™ system (e.g., HiSeq™ 2000 and HiSeq™ 1000), the NextSeq™ 500, and the MiSeq™ system from Illumina, Inc. The HiSeq™ system is based on massively parallel sequencing of millions of fragments using attachment of randomly fragmented genomic DNA to a planar, optically transparent surface and solid phase amplification to create a high density sequencing flow cell with millions of clusters, each containing about 1,000 copies of template per sq. cm. These templates are sequenced using four-color DNA sequencing-by-synthesis technology. The MiSeq™ system uses TruSeq™, Illumina's reversible terminator-based sequencing-by-synthesis.


Another example of a DNA sequencing technique that can be used in the methods of the present disclosure is 454 sequencing (Roche) (Margulies et al., 2005). 454 sequencing involves two steps. In the first step, DNA is sheared into fragments of approximately 300-800 base pairs, and the fragments are blunt ended. Oligonucleotide adaptors are then ligated to the ends of the fragments. The adaptors serve as primers for amplification and sequencing of the fragments. The fragments can be attached to DNA capture beads, e.g., streptavidin-coated beads using, e.g., Adaptor B, which contains 5′-biotin tag. The fragments attached to the beads are PCR amplified within droplets of an oil-water emulsion. The result is multiple copies of clonally amplified DNA fragments on each bead. In the second step, the beads are captured in wells (pico-liter sized). Pyrosequencing is performed on each DNA fragment in parallel. Addition of one or more nucleotides generates a light signal that is recorded by a CCD camera in a sequencing instrument. The signal strength is proportional to the number of nucleotides incorporated.


Another example of a DNA sequencing technique that can be used in the methods of the present disclosure is SOLiD technology (Life Technologies, Inc.). In SOLiD sequencing, genomic DNA is sheared into fragments, and adaptors are attached to the 5′ and 3′ ends of the fragments to generate a fragment library. Alternatively, internal adaptors can be introduced by ligating adaptors to the 5′ and 3′ ends of the fragments, circularizing the fragments, digesting the circularized fragment to generate an internal adaptor, and attaching adaptors to the 5′ and 3′ ends of the resulting fragments to generate a mate-paired library. Next, clonal bead populations are prepared in microreactors containing beads, primers, template, and PCR components. Following PCR, the templates are denatured and beads are enriched to separate the beads with extended templates. Templates on the selected beads are subjected to a 3′ modification that permits bonding to a glass slide.


Another example of a DNA sequencing technique that can be used in the methods of the present disclosure is the Ion Torrent system (Life Technologies, Inc.). Ion Torrent uses a high-density array of micro-machined wells to perform this biochemical process in a massively parallel way. Each well holds a different DNA template. Beneath the wells is an ion-sensitive layer and beneath that a proprietary ion sensor. If a nucleotide, for example a C, is added to a DNA template and is then incorporated into a strand of DNA, a hydrogen ion will be released. The charge from that ion will change the pH of the solution, which can be detected by the proprietary ion sensor. The sequencer will call the base, going directly from chemical information to digital information. The Ion Personal Genome Machine (PGM™) sequencer then sequentially floods the chip with one nucleotide after another. If the next nucleotide that floods the chip is not a match, no voltage change will be recorded and no base will be called. If there are two identical bases on the DNA strand, the voltage will be double, and the chip will record two identical bases called. Because this is direct detection (no scanning, no cameras, no light), each nucleotide incorporation is recorded in seconds.


Another example of a sequencing technology that can be used in the methods of the present disclosure includes the single molecule, real-time (SMRT™) technology of Pacific Biosciences. In SMRT™, each of the four DNA bases is attached to one of four different fluorescent dyes. These dyes are phospholinked. A single DNA polymerase is immobilized with a single molecule of template single stranded DNA at the bottom of a zero-mode waveguide (ZMW). A ZMW is a confinement structure which enables observation of incorporation of a single nucleotide by DNA polymerase against the background of fluorescent nucleotides that rapidly diffuse in and out of the ZMW (in microseconds). It takes several milliseconds to incorporate a nucleotide into a growing strand. During this time, the fluorescent label is excited and produces a fluorescent signal, and the fluorescent tag is cleaved off. Detection of the corresponding fluorescence of the dye indicates which base was incorporated. The process is repeated.


A further sequencing platform includes the CGA Platform (Complete Genomics). The CGA technology is based on preparation of circular DNA libraries and rolling circle amplification (RCA) to generate DNA nanoballs that are arrayed on a solid support (Drmanac et al. 2009). Complete Genomics' CGA Platform uses a novel strategy called combinatorial probe anchor ligation (cPAL) for sequencing. The process begins by hybridization between an anchor molecule and one of the unique adapters. Four degenerate 9-mer oligonucleotides are labeled with specific fluorophores that correspond to a specific nucleotide (A, C, G, or T) in the first position of the probe. Sequence determination occurs in a reaction where the correct matching probe is hybridized to a template and ligated to the anchor using T4 DNA ligase. After imaging of the ligated products, the ligated anchor-probe molecules are denatured. The process of hybridization, ligation, imaging, and denaturing is repeated five times using new sets of fluorescently labeled 9-mer probes that contain known bases at the n+1, n+2, n+3, and n+4 positions.


XI. KITS

The technology herein includes kits for analyzing copy number variation or allele frequencies in a DNA sample. A “kit” refers to a combination of physical elements. For example, a kit may include one or more components such as nucleic acid primers, enzymes, reaction buffers, an instruction sheet, and other elements useful to practice the technology described herein. These physical elements can be arranged in any way suitable for carrying out the invention.


The components of the kits may be packaged either in aqueous media or in lyophilized form. The container means of the kits will generally include at least one vial, test tube, flask, bottle, syringe or other container means, into which a component may be placed, and preferably, suitably aliquoted (e.g., aliquoted into the wells of a microtiter plate). Where there is more than one component in the kit, the kit also will generally contain a second, third or other additional container into which the additional components may be separately placed. However, various combinations of components may be comprised in a single vial. The kits of the present invention also will typically include a means for containing the nucleic acids, and any other reagent containers in close confinement for commercial sale. Such containers may include injection or blow molded plastic containers into which the desired vials are retained. A kit will also include instructions for employing the kit components as well as the use of any other reagent not included in the kit. Instructions may include variations that can be implemented.


XII. EXAMPLES

The following examples are included to demonstrate preferred embodiments of the invention. It should be appreciated by those of skill in the art that the techniques disclosed in the examples which follow represent techniques discovered by the inventor to function well in the practice of the invention, and thus can be considered to constitute preferred modes for its practice. However, those of skill in the art should, in light of the present disclosure, appreciate that many changes can be made in the specific embodiments which are disclosed and still obtain a like or similar result without departing from the spirit and scope of the invention.


Example 1—Calibration Results

An exemplary calibration experiment of the ERBB2 QASeq panel was performed on a normal cell line gDNA sample NA18562, which should not contain ERBB2 amplifications, to analyze the quantitation variability and potential LoD. The workflow was as described in the “QASeq Workflow” section. Taq polymerase was used in all the PCR steps. Denaturation was performed at 95° C., and annealing/extension was performed at 60° C. (except for the universal PCR step, in which annealing/extension was performed at 68° C.). Because all original molecules with UMIs attached need to be present in the NGS output, 15 reads were reserved for each molecule/UMI. For an input of 2500 haploid genomic copies and a 20-amplicon panel, the total number of reads needed is about 2×2500×20×15=1,500,000. Note that each of the strands in one DNA duplex carries a different UMI in this workflow, so 2500 haploid genomic copies correspond to 5000 molecules (strands) per locus, or 8.3 ng gDNA. This experiment was performed on an Illumina MiSeq instrument.
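The read budget arithmetic in this example can be reproduced with the short Python sketch below. The function names are hypothetical, and the mass conversion assumes about 3.3 pg of DNA per haploid human genome, a commonly used approximation that is consistent with the 8.3 ng figure given in the text.

```python
def reads_needed(haploid_copies, n_amplicons, reads_per_umi=15):
    """Total NGS reads to reserve for one library.

    Each strand of a duplex carries its own UMI, so the number of
    UMI-labeled molecules per locus is twice the haploid copy number.
    """
    strands_per_locus = 2 * haploid_copies
    return strands_per_locus * n_amplicons * reads_per_umi

def gdna_mass_ng(haploid_copies, pg_per_haploid_genome=3.3):
    """Approximate gDNA input mass in ng (assumes ~3.3 pg per haploid genome)."""
    return haploid_copies * pg_per_haploid_genome / 1000.0

total_reads = reads_needed(2500, 20)   # 2 x 2500 x 20 x 15
input_ng = gdna_mass_ng(2500)          # about 8.3 ng
```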


Exact string match was used to align NGS reads to the amplicon sequences; the alignment rate was between 50% and 70% for different libraries. Next, the UMI family sizes and unique UMI numbers were analyzed. The distribution of UMI family size peaked at ≈20 for most loci (FIG. 5). UMI families containing obvious PCR errors (i.e., G bases found in the poly(H) UMI sequence) and UMIs with family size <4 were removed (FIG. 5). If the UMI attachment rate were perfect, the unique UMI number would equal the original molecule number in the sample. For an input of 2500 haploid genomic copies (5000 molecules), unique UMI numbers between 632 and 3065 were obtained depending on the locus (FIG. 6).


In order to estimate the LoD of this assay, libraries were prepared for four different DNA inputs: 75, 250, 750, and 2500 haploid genomic copies; each condition was replicated five times. The CNV ratio of the sample was calculated as described in the “Data Analysis Workflow” section. The standard deviation of CNV ratio (σCNV ratio) across five replicates was used to evaluate quantitation variability; the LoD of the assay can be estimated as 3σCNV ratio. Simulations were also performed to calculate the theoretical σCNV ratio; note that the σCNV ratio and LoD should decrease as the input molecule number increases. The experimental σCNV ratio was higher than the theoretical value (FIG. 7), which was expected because the UMI attachment bias and amplification bias cannot be eliminated. The current best σCNV ratio is 1% at 2500 haploid genomic copies input; to be conservative, a linear approximation based on all four data points was used, and a σCNV ratio=2% was obtained; therefore, the estimated LoD was about 6% of extra copies. Based on an extrapolation to 50,000 haploid genomic copies input, the potential σCNV ratio was 0.3%, and the LoD was about 1%. Another way to evaluate LoD is by testing a series of calibration samples containing different frequencies of extra copies; the lowest detectable frequency of extra copies is the LoD.
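The 3σ estimate of the LoD can be illustrated with a few lines of code. This is a hedged sketch: the replicate values are hypothetical, the function name is made up for illustration, and the use of the sample (n−1) standard deviation is an assumption, since the text does not state which estimator was used.

```python
import statistics


def lod_from_replicates(cnv_ratios, k=3.0):
    """Return (sigma, LoD), where LoD = k * sigma of the replicate CNV ratios.

    Assumption: sigma is the sample standard deviation (n - 1 denominator).
    """
    sigma = statistics.stdev(cnv_ratios)
    return sigma, k * sigma


if __name__ == "__main__":
    # Hypothetical CNV ratios from five replicate libraries of a normal sample.
    replicates = [1.00, 1.02, 0.99, 1.01, 0.98]
    sigma, lod = lod_from_replicates(replicates)
    print(f"sigma = {sigma:.4f}, LoD = {lod:.4f}")
```

Applied to the conservative σCNV ratio of 2% quoted above, the same rule gives 3×2%=6% of extra copies.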


Example 2—CNV Detection Results in FFPE Samples

Two FFPE slides were analyzed using the example ERBB2 panel described in the “Multiplexed PCR Panel Design” section and Example 1. The FFPE slides (purchased from Asterand) were from the same lung cancer tumor, which is not expected to contain ERBB2 CNV. First, DNA was extracted using a QIAamp DNA FFPE Tissue Kit (Qiagen), and >6 μg of DNA per sample was obtained. The libraries were prepared using the same methods as described in Example 1. For each library, 8.3 ng of extracted DNA was used, which is equivalent to 2500 haploid genomic copies (5000 molecules). The number of NGS reads reserved for each library (1,500,000 reads) was the same as for the cell line gDNA libraries with 2500 haploid genomic copies input.


Data analysis was performed using the same methods as described in Example 1. The UMI family size distribution followed a pattern similar to that of the cell line gDNA libraries (FIG. 8A). The unique UMI numbers were smaller than those of the cell line gDNA libraries with 2500 haploid genomic copies input. The UMI attachment yield of FFPE samples was about ¼ of that of cell line gDNA on average, which indicates that 300% more FFPE DNA (i.e., four times the input) needs to be loaded to achieve the same LoD as the cell line gDNA sample (FIG. 8B).
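The relationship between relative UMI attachment yield and the extra input required can be written out explicitly. The sketch below assumes that the number of unique UMIs recovered scales linearly with the amount of DNA loaded; the function name is hypothetical.

```python
def extra_input_percent(relative_yield):
    """Percent additional input needed to match a reference unique-UMI count.

    relative_yield: UMI attachment yield of the sample divided by that of the
    reference (e.g. 0.25 for FFPE DNA versus cell line gDNA).
    Assumption: unique UMI count is proportional to input amount.
    """
    if relative_yield <= 0:
        raise ValueError("relative_yield must be positive")
    fold_input = 1.0 / relative_yield
    return (fold_input - 1.0) * 100.0


if __name__ == "__main__":
    print(extra_input_percent(0.25))
```

A relative yield of ¼ corresponds to a fourfold input, i.e. 300% more DNA, matching the figure given above.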


The calculated CNV ratios of the FFPE samples are shown in FIG. 8C. The inferred LoD of 15% for this assay was based on calibration results for cell line gDNA with 750 haploid genomic copies input, which yielded unique UMI numbers similar to those of the FFPE libraries. Based on current results, CNV of ERBB2 was not detected in these FFPE slides. Because LoD decreases as the input molecule number increases, an LoD of 6% can be achieved with higher input, based on the calibration results for cell line gDNA with 2500 haploid genomic copies input.


Example 3—CNV Quantitation Results in Spike-In Clinical FFPE Samples

A 100-plex QASeq panel was used to quantitate the ploidy of ERBB2 in breast cancer FFPE samples. 50-plex were in the ERBB2 gene region (see Table 3 for primer sequences; primer names have “ERBB2” in them), and 50-plex were in the short arm of Chromosome 17 as the Reference (see Table 3 for primer sequences; primer names have “Ref” in them).


Two previously characterized FFPE DNA samples (1 “normal” sample and 1 “ERBB2 amplified abnormal” sample) were mixed to generate 2.5%, 5%, and 10% ERBB2 FEC samples. The “normal” sample DNA was extracted from an FFPE lung cancer sample (purchased from Asterand), which should not have ERBB2 amplification (FEC=0%); the “ERBB2 amplified abnormal” sample DNA was extracted from an FFPE breast cancer sample (purchased from OriGene), which has an ERBB2 FEC of 78%. The sample input was 8.3 ng DNA per library (quantitated by qPCR). The “normal” sample was tested with 5 replicate NGS libraries prepared separately, each with 8.3 ng DNA input. The experimental normalized FEC values are shown in FIG. 13. Normalized FEC was calculated as:





Normalized FECsample=(1+FECsample)/(1+FECnormal sample)−1


The FECnormal sample was the average of 5 replicates. The LoD of the CNV panel was estimated as:





FECLoD=3×σnormal sample/(1+FECnormal sample)=0.85%


Here, σnormal sample was the standard deviation of the 5 replicates. CNV was successfully detected in the 2.5%, 5%, and 10% ERBB2 FEC samples, because their calculated FEC values fall outside the 3 standard deviation range (see FIG. 13). The experimental normalized FEC of ERBB2 correlates well with the expected value.
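The normalization and LoD formulas of this example can be combined into a short calculation. This is a sketch under stated assumptions: the replicate FEC values are hypothetical (they are not the data behind FIG. 13), the function names are made up for illustration, and the sample standard deviation is assumed.

```python
import statistics


def normalized_fec(fec_sample, fec_normal):
    """Normalized FEC = (1 + FEC_sample) / (1 + FEC_normal) - 1."""
    return (1.0 + fec_sample) / (1.0 + fec_normal) - 1.0


def fec_lod(normal_replicate_fecs, k=3.0):
    """FEC LoD = k * sigma_normal / (1 + mean FEC of the normal replicates).

    Assumption: sigma_normal is the sample standard deviation.
    """
    mean_normal = statistics.mean(normal_replicate_fecs)
    sigma_normal = statistics.stdev(normal_replicate_fecs)
    return k * sigma_normal / (1.0 + mean_normal)


if __name__ == "__main__":
    # Hypothetical raw FEC values for 5 replicate libraries of a normal sample.
    normal_fecs = [0.010, 0.014, 0.008, 0.012, 0.011]
    fec_normal = statistics.mean(normal_fecs)
    lod = fec_lod(normal_fecs)
    sample = normalized_fec(0.037, fec_normal)  # hypothetical test sample
    print(f"LoD = {lod:.4%}, normalized FEC = {sample:.4%}, detected = {sample > lod}")
```

A sample is treated as CNV-positive in this sketch when its normalized FEC exceeds the LoD, which mirrors the 3 standard deviation criterion described above.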


Example 4—Comprehensive Panel for Both Mutation and CNV Quantitation

The method presented (QASeq) can be used not only for CNV quantitation, but also for NGS error correction and mutation quantitation. In each QASeq amplicon, the region between the 3′ end of fP and the 3′ end of rPin is the Mutation Detection Region (MDR); any small variations (including base substitutions, deletions, and insertions smaller than 500 bp) in the MDR can be detected with an LoD of 0.1%-0.3%. This is much better than standard non-UMI NGS methods for mutation detection, which have an LoD of 1%.


A 179-plex comprehensive panel was developed and tested for both mutation and CNV quantitation in breast cancer samples. Every plex contains 3 primers: fP (a.k.a. SW), rPin (a.k.a. SrPB), and rPout (a.k.a. SrPA), as stated in previous sections. 95 primer sets were used solely for CNV quantitation, including 45 in the ERBB2 gene and 50 in the short arm of Chromosome 17 as the reference. 5 primer sets in the ERBB2 gene were used for both CNV and mutation quantitation. Another 79 primer sets were used for mutation quantitation only. UfP and UrP were used for universal amplification (see Table 3 for sequences).


CNV quantitation was done the same way as described in previous sections; data processing workflow for mutation quantitation is summarized in FIG. 14. After optional adapter trimming, NGS reads were aligned to the amplicon sequences. At each locus, reads were divided into UMI families; the UMI families with errors in the UMI sequence were removed, and the UMI families with small UMI family sizes (≤3) were also removed. Next, the consensus MDR sequence of each UMI family was found, which is usually the MDR sequence appearing the highest number of times in the UMI family. The last step was comparing the consensus sequence to the wildtype MDR sequence, and performing de novo mutation calling. The VAF of one mutation can be calculated as: VAF=Number of UMI families with the mutation/Total number of UMI families.
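The mutation quantitation steps above (group reads by UMI, filter families, take the most frequent MDR sequence as the consensus, compare with wildtype, compute VAF) can be sketched as follows. This is a simplified illustration, not the pipeline of FIG. 14: the function name and input format are assumptions, and each distinct non-wildtype consensus sequence is treated as one "mutation" rather than being decomposed into individual variants.

```python
from collections import Counter, defaultdict


def call_vaf(reads, wildtype_mdr, min_family_size=4, allowed_umi_bases="ACT"):
    """Return {variant MDR sequence: VAF} for one locus.

    reads: iterable of (umi, mdr_sequence) pairs from aligned reads.
    VAF = UMI families with the mutation / total UMI families kept.
    """
    families = defaultdict(list)
    for umi, mdr in reads:
        families[umi].append(mdr)

    allowed = set(allowed_umi_bases)
    consensus = {}
    for umi, mdrs in families.items():
        # Drop families with UMI errors or with family size <= 3.
        if not set(umi) <= allowed or len(mdrs) < min_family_size:
            continue
        # Consensus = MDR sequence seen most often within the family.
        consensus[umi] = Counter(mdrs).most_common(1)[0][0]

    total = len(consensus)
    if total == 0:
        return {}
    variants = Counter(seq for seq in consensus.values() if seq != wildtype_mdr)
    return {seq: n / total for seq, n in variants.items()}


if __name__ == "__main__":
    wt = "ACGT"  # hypothetical wildtype MDR
    reads = (
        [("AAAA", wt)] * 5
        + [("CCCC", "ACTT")] * 3 + [("CCCC", wt)]   # mutant family, one read error
        + [("ACAC", wt)] * 4
        + [("TTTT", "ACTT")] * 2                     # too small, removed
    )
    print(call_vaf(reads, wt))
```

In the toy input, three families pass the filters and one of them carries the variant, so the reported VAF is 1/3.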


This 179-plex panel was tested on the Multiplex I cfDNA Reference Standard Set from Horizon Discovery. Three replicated NGS libraries of the Wild Type cfDNA Reference Standard and three replicates of the 0.3% cfDNA Reference Standard (created by mixing 0.1% cfDNA Reference Standard and 1% cfDNA Reference Standard) were tested. The sample input was 8.3 ng DNA per library (quantitated by qPCR).


The overall on-target rate was greater than 50% (i.e., >50% of the NGS reads could be aligned to the amplicons) for all the libraries; the conversion rate (i.e., the percentage of input molecules sequenced) averaged 62%, and 97% of the plexes had a conversion rate >10% (see FIG. 15). Error rates after UMI correction vary across nucleotide positions; in the three replicate libraries of the Horizon Discovery Multiplex I Wild Type cfDNA Reference Standard, the highest error rates were 0.23%, 0.20%, and 0.23%, and the average error rates were 0.006%, 0.005%, and 0.005% (see FIG. 16). The mutation quantitation capability was validated using the 0.3% cfDNA Reference Standard. The experimental VAFs of 6 mutations were generally consistent with the expected VAFs; the difference was mostly due to stochasticity in sampling a small number (≤9) of mutation molecules (see FIG. 17).
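The size of the sampling effect mentioned above can be estimated with a simple model. The sketch below assumes that the number of mutant molecules captured follows a Poisson distribution, which is an assumption made here for illustration and is not stated in the text; the function name is hypothetical.

```python
import math


def poisson_relative_sd(expected_mutant_molecules):
    """Relative standard deviation of a Poisson count with the given mean.

    For a Poisson variable, SD = sqrt(mean), so SD / mean = 1 / sqrt(mean).
    """
    if expected_mutant_molecules <= 0:
        raise ValueError("expected count must be positive")
    return 1.0 / math.sqrt(expected_mutant_molecules)


if __name__ == "__main__":
    for n in (3, 6, 9):
        print(n, f"{poisson_relative_sd(n):.0%}")
```

Under this model, even the upper bound of 9 mutant molecules gives a relative standard deviation of about one third, so deviations of that order between observed and expected VAF would not be surprising.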


All of the methods disclosed and claimed herein can be made and executed without undue experimentation in light of the present disclosure. While the compositions and methods of this invention have been described in terms of preferred embodiments, it will be apparent to those of skill in the art that variations may be applied to the methods and in the steps or in the sequence of steps of the method described herein without departing from the concept, spirit and scope of the invention. More specifically, it will be apparent that certain agents which are both chemically and physiologically related may be substituted for the agents described herein while the same or similar results would be achieved. All such similar substitutes and modifications apparent to those skilled in the art are deemed to be within the spirit, scope and concept of the invention as defined by the appended claims.


REFERENCES

The following references, to the extent that they provide exemplary procedural or other details supplementary to those set forth herein, are specifically incorporated herein by reference.

  • Lun et al., “Noninvasive prenatal diagnosis of monogenic diseases by digital size selection and relative mutation dosage on DNA in maternal plasma,” Proc. Natl. Acad. Sci. U.S.A., 105:19920-19925, 2008.

Claims
  • 1. A method for preparing targeted regions of genomic DNA for high-throughput sequencing, the method comprising: (a) obtaining a genomic DNA sample; (b) amplifying at least a portion of the genomic DNA sample by performing two cycles of PCR using: (i) a first oligonucleotide comprising, from 5′ to 3′, a first region, a second region having a length between 0 and 50 nucleotides, a third region comprising at least four degenerate nucleotides, and a fourth region comprising a sequence that is complementary to a first target genomic DNA region; and (ii) a second oligonucleotide comprising, from 5′ to 3′, a fifth region, a sixth region having a length between 0 and 50 nucleotides, and a seventh region comprising a sequence that is complementary to a second target genomic DNA region; (c) amplifying a product of step (b) by performing at least three cycles of PCR with an annealing temperature that is 0-10° C. higher than an annealing temperature used in step (b) and using: (i) a third oligonucleotide comprising a sequence that is able to hybridize to the reverse complement of at least a portion of the first region; and (ii) a fourth oligonucleotide comprising a sequence that is able to hybridize to the reverse complement of at least a portion of the fifth region; and (d) amplifying a product of step (c) by performing at least one cycle of PCR using a fifth oligonucleotide comprising, from 5′ to 3′, an eighth region, a ninth region having a length between 0 and 50 nucleotides, and a tenth region comprising a sequence that is complementary to a third target genomic DNA region, wherein the third target genomic DNA region is at least one nucleotide closer to the first target genomic DNA region than the second target genomic DNA region.
  • 2. The method of claim 1, wherein the method is a method for preparing between 1 and 10,000 targeted regions of genomic DNA for high-throughput sequencing.
  • 3. The method of claim 1 or 2, wherein the third region is a unique molecular identifier (UMI).
  • 4. The method of any one of claims 1-3, wherein the third target genomic DNA region is 1-10 bases closer to the first target genomic DNA region than the second target genomic DNA region.
  • 5. The method of any one of claims 1-4, wherein the first region and the eighth region are universal primer binding sites.
  • 6. The method of any one of claims 1-5, wherein the first region and the eighth region comprise a full or partial NGS adapter sequence.
  • 7. The method of any one of claims 1-6, wherein the fifth region comprises a sequence that cannot be found in the human genome.
  • 8. The method of any one of claims 1-7, wherein the fifth region comprises a sequence that is different from an NGS adapter sequence.
  • 9. The method of any one of claims 1-8, wherein the melting temperatures of the first region and the fifth region are 0-10° C. higher than the melting temperatures of the fourth region and the seventh region.
  • 10. The method of any one of claims 1-9, wherein the degenerate nucleotides in the third region each independently are one of A, T, or C.
  • 11. The method of any one of claims 1-10, wherein none of the degenerate nucleotides in the third region are G.
  • 12. The method of any one of claims 1-11, wherein there is a population of first oligonucleotides each having a unique third region.
  • 13. The method of any one of claims 1-12, further comprising purifying the product of step (c).
  • 14. The method of claim 13, wherein purifying comprises SPRI purification or column purification.
  • 15. The method of any one of claims 1-14, further comprising purifying the product of step (d).
  • 16. The method of claim 15, wherein purifying comprises SPRI purification or column purification.
  • 17. The method of any one of claims 1-16, further comprising: (e) amplifying the product of step (d) by PCR using primers that hybridize to the first region and the eighth region, wherein the primers comprise an index sequence for next-generation sequencing.
  • 18. The method of claim 17, further comprising purifying the product of step (e).
  • 19. The method of claim 18, wherein purifying comprises SPRI purification or column purification.
  • 20. The method of any one of claims 17-19, further comprising: (f) performing high-throughput DNA sequencing of the product of step (e).
  • 21. The method of claim 20, wherein high-throughput DNA sequencing comprises next-generation sequencing.
  • 22. The method of any one of claims 1-21, wherein the first target genomic DNA region and the second target genomic DNA region are on opposite strands of the genomic DNA.
  • 23. The method of any one of claims 1-22, wherein the first target genomic DNA region and the second target genomic DNA region are separated by between 40 nucleotides and 500 nucleotides.
  • 24. The method of any one of claims 1-23, wherein step (b) comprises an extension time of about 30 minutes.
  • 25. The method of any one of claims 1-24, wherein step (c) comprises an extension time of about 30 seconds.
  • 26. The method of any one of claims 1-25, wherein step (d) comprises an extension time of about 30 minutes.
  • 27. A method for quantifying the frequency of extra copies (FEC) of at least one target gene, the method comprising: (a) obtaining a genomic DNA sample; (b) preparing the genomic DNA for high-throughput sequencing according to a method of any one of claims 1-26, wherein the sequences of the fourth region, the seventh region, and the tenth region hybridize to the at least one target gene; (c) performing high-throughput sequencing according to a method of claim 20; and (d) calculating the FEC for the at least one target gene based on the sequencing information obtained in step (c).
  • 28. The method of claim 27, wherein the method is a method for quantifying the FEC for a set of target genes, wherein the set of target genes comprises between 2 and 1000 target genes.
  • 29. The method of claim 27 or 28, wherein step (b) is performed using a population of first oligonucleotides, a population of second oligonucleotides, and a population of fifth oligonucleotides, wherein a portion of each of the populations of first, second, and fifth oligonucleotides comprise fourth, seventh, and tenth regions, respectively, that are complementary to one of the set of target genes.
  • 30. The method of any one of claims 27-29, wherein each of the fourth, seventh, and tenth regions comprises sequences that are only found once in the human genome.
  • 31. The method of any one of claims 27-30, wherein each first oligonucleotide that hybridizes to one target gene has a unique third region compared to each other first oligonucleotide that hybridizes to the same target gene.
  • 32. The method of any one of claims 27-31, wherein step (b) is performed using a first oligonucleotide, a second oligonucleotide, and a fifth oligonucleotide comprising fourth, seventh, and tenth regions, respectively, that are complementary to a reference gene.
  • 33. The method of any one of claims 27-32, wherein step (b) prepares a portion of each target gene or reference gene for high-throughput sequencing, wherein the portion is between 40 nucleotides and 500 nucleotides long.
  • 34. The method of any one of claims 27-33, wherein FEC is defined as:
  • 35. The method of any one of claims 27-34, wherein step (d) comprises: (i) aligning NGS reads to the targeted portions of each target gene and grouping the NGS reads into subgroups based on the loci to which they align; (ii) dividing the NGS reads at each locus based on their UMI sequences such that all NGS reads carrying the same UMI sequence are grouped as one UMI family; (iii) removing UMI families resulting from PCR errors or NGS errors; (iv) counting the number of unique UMI sequences at each locus; and (v) calculating the FEC based on the number of unique UMI sequences for each locus in each target gene and reference gene.
  • 36. The method of claim 35, wherein step (d)(iii) comprises removing UMI sequences that do not meet the UMI degenerate base design.
  • 37. The method of claim 35 or 36, wherein step (d)(iii) comprises removing UMI families with a UMI family size less than Fmin, wherein the UMI family size is the number of reads carrying the same UMI, wherein Fmin is between 2 and 20.
  • 38. The method of any one of claims 35-37, wherein step (d)(iv) comprises removing UMI sequences that differ by only one or two bases from another UMI sequence with a larger family size.
  • 39. The method of any one of claims 27-38, wherein FEC is defined as:
  • 40. The method of any one of claims 27-39, wherein the FEC is used to identify the copy number variation (CNV) status of the target gene.
  • 41. A method for quantifying the allele ratio of different genetic identities for at least one target genomic locus, the method comprising: (a) obtaining a genomic DNA sample; (b) preparing the genomic DNA for high-throughput sequencing according to a method of any one of claims 1-26, wherein the sequences of the fourth region, the seventh region, and the tenth region hybridize to the genomic DNA near the at least one target genomic locus; (c) performing high-throughput sequencing according to a method of claim 20; and (d) calculating the allele ratio of different genetic identities for the at least one target genomic locus based on the sequencing information obtained in step (c).
  • 42. The method of claim 41, wherein the method is a method for quantifying the allele ratio of different genetic identities for a set of target genomic loci, wherein the set of target genomic loci comprises between 2 and 10,000 target genomic loci.
  • 43. The method of claim 41 or 42, wherein step (b) is performed using a population of first oligonucleotides, a population of second oligonucleotides, and a population of fifth oligonucleotides, wherein a portion of each of the populations of first, second, and fifth oligonucleotides comprise fourth, seventh, and tenth regions, respectively, that are complementary to the genomic DNA near the at least one of the set of target genomic loci.
  • 44. The method of any one of claims 41-43, wherein each of the fourth, seventh, and tenth regions comprises sequences that are not able to hybridize with non-target regions of the genomic DNA under the conditions of step (b).
  • 45. The method of any one of claims 41-44, wherein each first oligonucleotide that hybridizes to the genomic DNA near one target genomic locus has a unique third region compared to each other first oligonucleotide that hybridizes to the genomic DNA near the same target genomic locus.
  • 46. The method of any one of claims 41-45, wherein each target genomic locus is between 40 nucleotides and 500 nucleotides long.
  • 47. The method of any one of claims 41-46, wherein step (d) comprises: (i) aligning NGS reads to the targeted genomic loci and grouping the NGS reads into subgroups based on the loci to which they align; (ii) dividing the NGS reads at each locus based on their UMI sequences such that all NGS reads carrying the same UMI sequence are grouped as one UMI family; (iii) removing UMI families resulting from PCR errors or NGS errors; (iv) calling the genetic identity for each remaining UMI family; (v) counting the number of unique UMI sequences at each locus; and (vi) calculating the allele ratio.
  • 48. The method of claim 47, wherein step (d)(iii) comprises removing UMI sequences that do not meet the UMI degenerate base design.
  • 49. The method of claim 47 or 48, wherein step (d)(iii) comprises removing UMI families with a UMI family size less than Fmin, wherein the UMI family size is the number of reads carrying the same UMI, wherein Fmin is between 2 and 20.
  • 50. The method of any one of claims 47-49, wherein step (d)(iii) comprises removing UMI sequences that differ by only one or two bases from another UMI sequence with a larger family size.
  • 51. The method of any one of claims 47-50, wherein step (d)(iv) comprises calling the genetic identity only if at least 70% of the reads in a UMI family are the same on the genetic locus of interest.
  • 52. The method of any one of claims 41-51, wherein the allele ratio is defined as Rallele=N1/N2, where N1 is unique UMI number for the first genetic identity, and N2 is unique UMI number for the second genetic identity.
  • 53. The method of any one of claims 47-51, wherein step (d)(iv) comprises identifying the consensus sequence of each UMI family.
  • 54. The method of claim 53, wherein the consensus sequence is the sequence appearing the highest number of times in the UMI family.
  • 55. The method of claim 53 or 54, further comprising comparing the consensus sequence to the wild-type sequence for that locus, thereby identifying mutations in the consensus sequence.
  • 56. The method of claim 55, further comprising calculating the variant allele frequency (VAF) of the identified mutation.
  • 57. The method of claim 56, wherein the VAF of the identified mutation is defined as Number of UMI families with the mutation/Total number of UMI families.
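By way of illustration only, the UMI clean-up recited in claims 38 and 50 (removing UMI sequences that differ by one or two bases from another UMI sequence with a larger family size) can be sketched as below. The function names, the pairwise comparison strategy, and the example family sizes are assumptions made for this sketch and are not limiting.

```python
def hamming(a, b):
    """Number of mismatched positions between two UMI sequences.

    Sequences of different length are treated as far apart.
    """
    if len(a) != len(b):
        return max(len(a), len(b))
    return sum(x != y for x, y in zip(a, b))


def drop_near_duplicate_umis(family_sizes, max_distance=2):
    """Keep UMIs that have no larger-family neighbor within max_distance bases.

    family_sizes: dict mapping UMI sequence -> number of reads in its family.
    Uses a simple O(n^2) all-pairs comparison, adequate for a sketch.
    """
    kept = []
    for umi, size in family_sizes.items():
        has_larger_neighbor = any(
            other_size > size and 1 <= hamming(umi, other) <= max_distance
            for other, other_size in family_sizes.items()
        )
        if not has_larger_neighbor:
            kept.append(umi)
    return kept


if __name__ == "__main__":
    # Hypothetical family sizes at one locus.
    sizes = {"ACTTCA": 20, "ACTTCC": 4, "CCATTA": 15, "TTTTTT": 6}
    print(drop_near_duplicate_umis(sizes))
```

In the example, ACTTCC differs from the larger family ACTTCA at a single position and is therefore removed, while the other three UMIs are kept.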
REFERENCE TO RELATED APPLICATIONS

The present application claims the priority benefit of U.S. provisional application No. 62/788,375, filed Jan. 4, 2019, the entire contents of which is incorporated herein by reference.

STATEMENT REGARDING FEDERALLY SPONSORED RESEARCH

This invention was made with government support under Grant No. R01 HG008752 awarded by the National Institutes of Health. The government has certain rights in the invention.

PCT Information
Filing Document Filing Date Country Kind
PCT/US2020/012089 1/2/2020 WO 00
Provisional Applications (1)
Number Date Country
62788375 Jan 2019 US