This document provides methods and materials for identifying chromosomal anomalies that can be used in cancer diagnostics, non-invasive prenatal testing (NIPT), preimplantation genetic diagnosis, and evaluation of congenital abnormalities. For example, this document provides methods and materials for evaluating sequencing data to identify a mammal as having a disease associated with one or more chromosomal anomalies (e.g., cancer or congenital abnormality). Additionally or alternatively, this document provides methods and materials for evaluating sequencing data that can be used in cancer diagnostics, non-invasive prenatal testing (NIPT), preimplantation genetic diagnosis, and evaluation of congenital abnormalities.
Aneuploidy is defined as an abnormal chromosome number. It was the first genomic abnormality identified in cancers (Boveri 2008 Journal of cell science 121 (Supplement 1):1-84; and Nowell 1976 Science 194(4260):23-28), and it has been estimated to be present in >90% of cancers of most histopathologic types (Knouse et al. 2017 Annual Review of Cancer Biology 1:335-354). Aneuploidy in cancers was first detected by karyotypic studies and was later evaluated through microarrays, Sanger sequencing, and, most recently, massively parallel sequencing methods (Wang et al. 2002 Proceedings of the National Academy of Sciences 99(25):16156-16161). Recent sequencing methods include those employing circular binary segmentation, hidden Markov models, expectation maximization, and mean-shift (as reviewed in Zhao et al. 2013 BMC bioinformatics 14(11):S1). In addition to their application to cancer genomes, these technologies form the basis for the non-invasive prenatal detection of fetuses with Down syndrome and other trisomies (Bianchi et al. 2015 JAMA 314(2):162-169; Zhao et al. 2015 Clinical chemistry 61(4):608-616).
This disclosure relates to methods and materials for identifying one or more chromosomal anomalies (e.g., aneuploidy). In some embodiments, this disclosure provides methods and materials for using amplicon-based sequencing data to identify a mammal as having a disease or disorder associated with one or more chromosomal anomalies. For example, methods and materials described herein can be applied to a sample obtained from a mammal to identify the mammal as having one or more chromosomal anomalies. For example, a mammal can be identified as having a disease or disorder based, at least in part, on the presence of one or more aneuploidies. In some embodiments, a single primer pair is used to amplify genomic elements throughout the genome. For example, a single primer pair described herein can be used to amplify ~1,000,000 unique repetitive elements (e.g., amplicons). In some embodiments, the amplified unique repetitive elements average less than 100 basepairs (bp) in size. In some embodiments, an approach (called WALDO for Within-Sample-AneupLoidy-DetectiOn) can be used to evaluate the sequencing data obtained from amplicons to identify the presence of one or more chromosomal anomalies (e.g., aneuploidy). As described herein, assessment of aneuploidy in 1,348 plasma samples from healthy people and 883 plasma samples from cancer patients detected aneuploidy in 49% of the plasma samples from cancer patients.
In one aspect, provided herein is a method of testing for the presence of aneuploidy in a genome of a mammal. The method comprises amplifying a plurality of chromosomal sequences in a DNA sample with a pair of primers complementary to the chromosomal sequences to form a plurality of amplicons; determining at least a portion of the nucleic acid sequence of one or more of the plurality of amplicons; mapping the sequenced amplicons to a reference genome; dividing the DNA sample into a plurality of genomic intervals; quantifying a plurality of features for the amplicons mapped to the genomic intervals; and comparing the plurality of features of amplicons in a first genomic interval with the plurality of features of amplicons in one or more different genomic intervals; wherein at least 100,000 amplicons are formed in the step of amplifying (e.g., the plurality of amplicons can include ~745,000 amplicons).
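The dividing and quantifying steps recited above can be illustrated with a minimal sketch. It assumes that amplicons have already been mapped to a reference genome and are available as (chromosome, start, length) tuples; that representation, the function name interval_features, and the toy data are hypothetical and are not part of the disclosure. The 500,000-nucleotide interval size is one example value given herein, and the two features computed (amplicon count and average amplicon length per interval) are two of the shared amplicon features named herein.

```python
from collections import defaultdict

# Example interval size mentioned in the text; other sizes are possible.
INTERVAL_SIZE = 500_000


def interval_features(amplicons, interval_size=INTERVAL_SIZE):
    """Return {(chrom, interval_index): (count, mean_length)} for mapped amplicons.

    `amplicons` is an iterable of (chrom, start, length) tuples, a hypothetical
    representation of amplicons already mapped to a reference genome.
    """
    counts = defaultdict(int)
    total_length = defaultdict(int)
    for chrom, start, length in amplicons:
        key = (chrom, start // interval_size)  # fixed-width, non-overlapping bins
        counts[key] += 1
        total_length[key] += length
    return {key: (counts[key], total_length[key] / counts[key]) for key in counts}


# Toy data for illustration only.
mapped = [
    ("chr1", 120_000, 95),
    ("chr1", 480_000, 105),
    ("chr1", 510_000, 88),
    ("chr2", 10_000, 90),
]
features = interval_features(mapped)
```

In this sketch each interval is keyed by chromosome and bin index, so the per-interval features can be passed directly to a later comparison step.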
In some embodiments, the method is performed in vitro. In some embodiments, the plurality of amplicons comprise about 1,000,000 amplicons, e.g., about 1,000,000-10,000 amplicons; about 1,000,000-50,000 amplicons; about 1,000,000-100,000 amplicons; about 1,000,000-200,000 amplicons; about 1,000,000-300,000 amplicons; about 1,000,000-400,000 amplicons; about 1,000,000-500,000 amplicons; about 1,000,000-600,000 amplicons; about 1,000,000-700,000 amplicons; about 1,000,000-800,000 amplicons; about 1,000,000-900,000 amplicons; about 900,000-10,000 amplicons; about 800,000-10,000 amplicons; about 700,000-10,000 amplicons; about 600,000-10,000 amplicons; about 500,000-10,000 amplicons; about 400,000-10,000 amplicons; about 300,000-10,000 amplicons; about 200,000-10,000 amplicons; about 100,000-10,000 amplicons or about 50,000-10,000 amplicons.
In some embodiments, the plurality of amplicons comprises about 50,000 amplicons; about 100,000 amplicons; about 150,000 amplicons; about 200,000 amplicons; about 250,000 amplicons; about 300,000 amplicons; about 350,000 amplicons; about 400,000 amplicons; about 450,000 amplicons; about 500,000 amplicons; about 550,000 amplicons; about 600,000 amplicons; about 650,000 amplicons; about 700,000 amplicons; about 750,000 amplicons; about 800,000 amplicons; about 850,000 amplicons; about 900,000 amplicons; about 950,000 amplicons; or about 1,000,000 amplicons.
In some embodiments, the plurality of amplicons comprises about 750,000 amplicons.
In some embodiments, the plurality of amplicons comprises about 350,000 amplicons.
In some embodiments, the number of repetitive elements, e.g., amplicons, amplified by the single primer pair disclosed herein is a function of: the number of repetitive elements present in a sample and/or the length of a repetitive element present in a sample. For example, in some samples, the number of repetitive elements, e.g., amplicons, that can be detected with the single primer pair is about 750,000 amplicons. In some embodiments, in other samples, the number of repetitive elements, e.g., amplicons, that can be detected with the single primer pair is about 350,000 amplicons.
In some embodiments, the DNA sample is a plurality of euploid DNA samples. In some embodiments, the DNA sample is a plurality of test DNA samples. In some embodiments, the DNA sample is from plasma. In some embodiments, the DNA sample is from serum. In some embodiments, the DNA sample comprises cell-free fetal DNA. In some embodiments, the DNA sample comprises at least 3 picograms of DNA. In some embodiments, the mammal is a human. In some embodiments, the pair of primers comprises a first primer comprising SEQ ID NO: 1 and a second primer comprising SEQ ID NO: 10. In some embodiments, the methods provided herein include one or more additional pairs of primers. In some embodiments, the amplicons include repetitive elements (e.g., one or more types of repetitive elements shown in Table 1). In some embodiments, the amplicons include unique short interspersed nucleotide elements (SINEs). In some embodiments, the amplicons include unique long interspersed nucleotide elements (LINEs).
In some embodiments, the average length of the amplicons is about 100 basepairs or less. In some embodiments, the average length of the amplicons is less than about 110 bp, e.g., about 10-110 bp, about 10-105 bp, about 10-100 bp, about 10-99 bp, about 10-98 bp, about 10-97 bp, about 10-96 bp, about 10-95 bp, about 10-94 bp, about 10-93 bp, about 10-92 bp, about 10-91 bp, about 10-90 bp, about 10-89 bp, about 10-87 bp, about 10-86 bp, about 10-85 bp, about 10-84 bp, about 10-83 bp, about 10-82 bp, about 10-81 bp, about 10-80 bp, about 10-79 bp, about 10-78 bp, about 10-77 bp, about 10-76 bp, about 10-75 bp, about 10-74 bp, about 10-73 bp, about 10-72 bp, about 10-71 bp, about 10-70 bp, about 10-65 bp, about 10-60 bp, about 10-55 bp, about 10-50 bp, about 10-40 bp, about 10-30 bp, about 10-20 bp, about 15-110 bp, about 20-110 bp, about 25-110 bp, about 30-110 bp, about 35-110 bp, about 40-110 bp, about 45-110 bp, about 50-110 bp, about 55-110 bp, about 60-110 bp, about 65-110 bp, about 70-110 bp, about 75-110 bp, about 80-110 bp, about 85-110 bp, about 90-110 bp, about 95-110 bp, about 100-110 bp, or about 105-110 bp.
In some embodiments, the average length of the amplicons is about 10 bp; about 20 bp; about 30 bp; about 40 bp; about 45 bp; about 50 bp; about 60 bp; about 65 bp; about 70 bp; about 75 bp; about 80 bp; about 85 bp; about 90 bp; about 95 bp; about 100 bp; about 105 bp or about 110 bp.
In some embodiments, the amplicons comprise one or more long amplicons where the average length is 1000 basepairs or greater. In some embodiments, the long amplicons comprise DNA from a contaminating cell. In some embodiments, the contaminating cell is a leukocyte. In some embodiments, the genomic intervals comprise from about 100 nucleotides to about 125,000,000 nucleotides (e.g., the genomic intervals can include about 500,000 nucleotides).
In another aspect, the disclosure provides a method of evaluating a subject for the presence of, or the risk of developing, any of a plurality of, e.g., any of at least four, cancers in the subject comprising:
(i) acquiring, e.g., directly acquiring or indirectly acquiring, a value for, e.g., detecting, the presence of one or more genetic biomarkers, e.g., one or more mutations (e.g., one or more driver gene mutations), in each of one or more genes (e.g., one or more driver genes, e.g., in at least four driver genes), and optionally wherein, each gene, e.g., driver gene, is associated with the presence, or risk, of a cancer of the plurality of cancers;
(ii) acquiring, e.g., directly acquiring or indirectly acquiring, a value for, e.g., detecting, the level of each of a plurality of, e.g., at least four, protein biomarkers, and optionally wherein, the level of each protein biomarker of the plurality is associated with the presence, or risk, of a cancer of the plurality of cancers; or
(iii) acquiring, e.g., directly acquiring or indirectly acquiring, a value for, e.g., detecting, aneuploidy, wherein the aneuploidy value is a function of the copy number or length of a genomic sequence disposed between at least two terminal repeated elements of a repeated element family (RE Family), wherein the RE family comprises:
(a) a RE Family other than a long interspersed nucleotide element (LINE);
(b) a RE Family which when amplified with a primer moiety complementary to its repeated terminal elements, provides amplicons having an average length of less than X nts, wherein X is 100, 105, or 110,
(c) a RE family which is less than about 700 bp long; or
(d) a RE family which is present in at least 100 copies per genome;
and optionally wherein, the aneuploidy is associated with the presence, or risk, of a cancer of the plurality of cancers;
thereby evaluating the subject for the presence of or risk of developing, any of the plurality of, e.g., any of at least four, cancers.
In an embodiment, one of (i), (ii) and (iii) is directly acquired. In an embodiment, (i) and (ii) are directly acquired. In an embodiment, (i) and (iii) are directly acquired. In an embodiment, (ii) and (iii) are directly acquired. In an embodiment, all of (i), (ii) and (iii) are directly acquired.
In an embodiment, one of (i), (ii) and (iii) is indirectly acquired. In an embodiment, (i) and (ii) are indirectly acquired. In an embodiment, (i) and (iii) are indirectly acquired. In an embodiment, (ii) and (iii) are indirectly acquired. In an embodiment, all of (i), (ii) and (iii) are indirectly acquired.
In an embodiment, the method comprises sequencing one or more subgenomic intervals or amplicons comprising the genetic biomarkers. In an embodiment, the method comprises analyzing one or more genomic sequences for aneuploidy. In an embodiment, the method comprises contacting a protein biomarker with a detection reagent. In an embodiment, the method comprises: (1) sequencing one or more subgenomic intervals or amplicons comprising the genetic biomarkers; (2) analyzing one or more genomic sequences for aneuploidy; and/or (3) contacting a protein biomarker with a detection reagent.
In an embodiment, the aneuploidy value is a function of the copy number of the genomic sequence disposed between at least two terminal repeated elements of a RE Family. In an embodiment, the aneuploidy value is a function of the length of the genomic sequence disposed between at least two terminal repeated elements of a repeated element family (RE Family).
In some embodiments, the method is performed in vitro.
In an embodiment, a sample, e.g., a biological sample, obtained from the subject is evaluated for one, two or all of (i)-(iii). In an embodiment, the biological sample comprises a liquid sample, e.g., a blood sample. In an embodiment, the biological sample comprises a cell-free DNA sample, a plasma sample or a serum sample. In an embodiment, the biological sample comprises cell-free DNA, e.g., circulating tumor DNA. In an embodiment, the biological sample comprises cells and/or tissue. In an embodiment, the biological sample comprises cells (e.g., normal or cancer cells) and cell-free DNA.
In an embodiment of any of the methods disclosed herein, specificity of detection of the cancer in the plurality of cancers with (i), (ii) and (iii) is substantially the same as, e.g., not substantially lower than, the specificity of detection of the cancer in the plurality of cancers with: (i); (ii); (iii); (i) and (ii); (i) and (iii); or (ii) and (iii).
In an embodiment of any of the methods disclosed herein, sensitivity of detection of the cancer in the plurality of cancers with (i), (ii) and (iii) is higher, e.g., about 1.1, 1.2, 1.3, 1.4, 1.5, 2, 2.5, 3, 3.5, 4, 4.5, 5, 5.5, 6, 6.5, 7, 7.5, 8, 8.5, 9, 9.5, or 10 fold higher, than the sensitivity of detection of the cancer in the plurality of cancers with: (i); (ii); (iii); (i) and (ii); (i) and (iii); or (ii) and (iii). In an embodiment, the increased sensitivity of detection, e.g., an about 1.1, 1.2, 1.3, 1.4, 1.5, 2, 2.5, 3, 3.5, 4, 4.5, 5, 5.5, 6, 6.5, 7, 7.5, 8, 8.5, 9, 9.5, or 10 fold increase in sensitivity of detection, is at a specified specificity, e.g., at a predetermined specificity, e.g., at least about 90%, 91%, 92%, 93%, 94%, 95%, 96%, 97%, 98%, 99% or 100% specificity.
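A fold increase in sensitivity at a predetermined specificity can be made concrete with a short sketch. It assumes that each test produces a single numeric score per sample, with higher scores indicating cancer, and that the cutoff is set from control (non-cancer) scores so that the target specificity is met; the scoring convention, function names, and toy scores are hypothetical and are not taken from the disclosure.

```python
import math


def threshold_at_specificity(control_scores, specificity):
    """Smallest observed cutoff such that at least `specificity` of controls score at or below it."""
    ordered = sorted(control_scores)
    k = math.ceil(specificity * len(ordered))  # controls that must test negative
    k = min(max(k, 1), len(ordered))
    return ordered[k - 1]


def sensitivity_at_specificity(control_scores, case_scores, specificity):
    """Fraction of cases scoring above the cutoff fixed by the controls."""
    cutoff = threshold_at_specificity(control_scores, specificity)
    return sum(score > cutoff for score in case_scores) / len(case_scores)


# Toy scores for illustration only.
controls = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
single_marker_cases = [5, 8, 12, 15]     # hypothetical scores from one marker class
combined_cases = [10, 11, 12, 15]        # hypothetical scores from combined marker classes

sens_single = sensitivity_at_specificity(controls, single_marker_cases, 0.9)
sens_combined = sensitivity_at_specificity(controls, combined_cases, 0.9)
fold_increase = sens_combined / sens_single
```

Fixing the cutoff from the controls first, and only then scoring the cases, keeps the comparison between tests at the same specificity.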
In some embodiments, the plurality of amplicons comprise about 1,000,000 amplicons, e.g., about 1,000,000-10,000 amplicons; about 1,000,000-50,000 amplicons; about 1,000,000-100,000 amplicons; about 1,000,000-200,000 amplicons; about 1,000,000-300,000 amplicons; about 1,000,000-400,000 amplicons; about 1,000,000-500,000 amplicons; about 1,000,000-600,000 amplicons; about 1,000,000-700,000 amplicons; about 1,000,000-800,000 amplicons; about 1,000,000-900,000 amplicons; about 900,000-10,000 amplicons; about 800,000-10,000 amplicons; about 700,000-10,000 amplicons; about 600,000-10,000 amplicons; about 500,000-10,000 amplicons; about 400,000-10,000 amplicons; about 300,000-10,000 amplicons; about 200,000-10,000 amplicons; about 100,000-10,000 amplicons or about 50,000-10,000 amplicons.
In some embodiments, the plurality of amplicons comprises about 50,000 amplicons; about 100,000 amplicons; about 150,000 amplicons; about 200,000 amplicons; about 250,000 amplicons; about 300,000 amplicons; about 350,000 amplicons; about 400,000 amplicons; about 450,000 amplicons; about 500,000 amplicons; about 550,000 amplicons; about 600,000 amplicons; about 650,000 amplicons; about 700,000 amplicons; about 750,000 amplicons; about 800,000 amplicons; about 850,000 amplicons; about 900,000 amplicons; about 950,000 amplicons; or about 1,000,000 amplicons.
In some embodiments, the plurality of amplicons comprises about 750,000 amplicons.
In some embodiments, the plurality of amplicons comprises about 350,000 amplicons.
In some embodiments, the number of repetitive elements, e.g., amplicons, amplified by the single primer pair disclosed herein is a function of: the number of repetitive elements present in a sample and/or the length of a repetitive element present in a sample. For example, in some samples, the number of repetitive elements, e.g., amplicons, that can be detected with the single primer pair is about 750,000 amplicons. In some embodiments, in other samples, the number of repetitive elements, e.g., amplicons, that can be detected with the single primer pair is about 350,000 amplicons.
In some embodiments, the average length of the amplicons is about 100 basepairs or less. In some embodiments, the average length of the amplicons is less than about 110 bp, e.g., about 10-110 bp, about 10-105 bp, about 10-100 bp, about 10-99 bp, about 10-98 bp, about 10-97 bp, about 10-96 bp, about 10-95 bp, about 10-94 bp, about 10-93 bp, about 10-92 bp, about 10-91 bp, about 10-90 bp, about 10-89 bp, about 10-87 bp, about 10-86 bp, about 10-85 bp, about 10-84 bp, about 10-83 bp, about 10-82 bp, about 10-81 bp, about 10-80 bp, about 10-79 bp, about 10-78 bp, about 10-77 bp, about 10-76 bp, about 10-75 bp, about 10-74 bp, about 10-73 bp, about 10-72 bp, about 10-71 bp, about 10-70 bp, about 10-65 bp, about 10-60 bp, about 10-55 bp, about 10-50 bp, about 10-40 bp, about 10-30 bp, about 10-20 bp, about 15-110 bp, about 20-110 bp, about 25-110 bp, about 30-110 bp, about 35-110 bp, about 40-110 bp, about 45-110 bp, about 50-110 bp, about 55-110 bp, about 60-110 bp, about 65-110 bp, about 70-110 bp, about 75-110 bp, about 80-110 bp, about 85-110 bp, about 90-110 bp, about 95-110 bp, about 100-110 bp, or about 105-110 bp.
In some embodiments, the average length of the amplicons is about 10 bp; about 20 bp; about 30 bp; about 40 bp; about 45 bp; about 50 bp; about 60 bp; about 65 bp; about 70 bp; about 75 bp; about 80 bp; about 85 bp; about 90 bp; about 95 bp; about 100 bp; about 105 bp or about 110 bp.
In some embodiments, the method further comprises subjecting the subject to a radiologic scan, e.g., a PET-CT scan, of an organ or body region. In some embodiments, the radiologic scanning of an organ or body region characterizes the cancer. In some embodiments, the radiologic scanning of an organ or body region identifies the location of the cancer. In some embodiments, the radiologic scan is a PET-CT scan. In some embodiments, the radiologic scanning is performed after the subject is evaluated for the presence of each of a plurality of cancers.
In another aspect, the disclosure provides a method of testing for the presence of aneuploidy in a genome of a mammal. The method comprises:
In some embodiments, the method is performed in vitro.
In an embodiment of any of the methods disclosed herein, an increase in sensitivity of detection of the cancer in the plurality of cancers does not affect, e.g., reduce or substantially reduce, the specificity of detection of the cancer in the plurality of cancers. In an embodiment, the specificity of detection of the cancer in the plurality of cancers is at a plateau, e.g., the specificity of detection is not altered by detection of additional biomarkers.
In another aspect, provided herein is a method of detecting aneuploidy in a sample comprising low input DNA, using any of the methods disclosed herein.
In some embodiments, the sample comprises about 0.01 picogram (pg) to 500 pg of DNA. In some embodiments, the sample comprises about 0.01-500 pg, 0.05-400 pg, 0.1-300 pg, 0.5-200 pg, 1-100 pg, 10-90 pg, or 20-50 pg DNA. In some embodiments, the sample comprises at least 0.01 pg, at least 0.1 pg, at least 1 pg, at least 2 pg, at least 3 pg, at least 4 pg, at least 5 pg, at least 6 pg, at least 7 pg, at least 8 pg, at least 9 pg, at least 10 pg, at least 11 pg, at least 12 pg, at least 13 pg, at least 14 pg, at least 15 pg, at least 16 pg, at least 17 pg, at least 18 pg, at least 19 pg, at least 20 pg, at least 21 pg, at least 22 pg, at least 23 pg, at least 24 pg, at least 25 pg, at least 26 pg, at least 27 pg, at least 28 pg, at least 29 pg, at least 30 pg, at least 31 pg, at least 32 pg, at least 33 pg, at least 34 pg, at least 35 pg, at least 36 pg, at least 37 pg, at least 38 pg, at least 39 pg, at least 40 pg, at least 50 pg, at least 60 pg, at least 70 pg, at least 80 pg, at least 90 pg, at least 100 pg, at least 150 pg, at least 200 pg, at least 300 pg, at least 350 pg, at least 400 pg, at least 450 pg, or at least 500 pg DNA.
In some embodiments, the sample comprises 1 pg DNA. In some embodiments, the sample comprises 2 pg DNA. In some embodiments, the sample comprises 3 pg DNA. In some embodiments, the sample comprises 4 pg DNA. In some embodiments, the sample comprises 5 pg DNA. In some embodiments, the sample comprises 10 pg DNA.
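To give a sense of scale for these low-input amounts, a DNA mass can be converted to an approximate number of haploid human genome equivalents. The sketch below assumes a mass of roughly 3.3 pg per haploid human genome; that figure is a commonly used approximation and is not stated in this disclosure, and the function name is hypothetical.

```python
# Approximate mass of one haploid human genome in picograms.
# This value is an assumption for illustration and is not taken from the text.
HAPLOID_GENOME_PG = 3.3


def haploid_genome_equivalents(dna_pg, genome_pg=HAPLOID_GENOME_PG):
    """Approximate number of haploid genome copies in `dna_pg` picograms of DNA."""
    if dna_pg < 0:
        raise ValueError("DNA mass cannot be negative")
    return dna_pg / genome_pg


# Input amounts (pg) taken from the ranges recited above.
for mass in (1, 3, 10, 500):
    print(f"{mass} pg is roughly {haploid_genome_equivalents(mass):.1f} haploid genome equivalents")
```

Under this assumption, inputs of a few picograms correspond to about one haploid genome copy or fewer, which is why such samples are described as low input.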
In some embodiments, the sample is a biological sample from a subject. In an embodiment, the biological sample comprises a liquid sample, e.g., a blood sample. In an embodiment, the biological sample comprises a cell-free DNA sample, a plasma sample or a serum sample. In an embodiment, the biological sample comprises cell-free DNA, e.g., circulating tumor DNA. In an embodiment, the biological sample comprises cells and/or tissue. In an embodiment, the biological sample comprises cells (e.g., normal or cancer cells) and cell-free DNA.
In some embodiments, the sample is a trisomy 21 sample. In some embodiments, the sample is a forensic sample. In some embodiments, the sample is from an embryo, e.g., a preimplantation embryo.
In some embodiments, the sample is a biobank sample, e.g., as described in Example 3.
In some embodiments, the method is used for diagnostics, e.g., preimplantation diagnostics.
In some embodiments, the method is used for forensics.
In some embodiments, the method is an in vitro method.
In another aspect, provided herein is a method of identifying or distinguishing a sample using any of the methods disclosed herein.
In some embodiments, the sample, e.g., first sample, from a subject (e.g., first subject) is distinguished from a second sample from a second subject. In some embodiments, the sample, e.g., first sample, is identified as being from the first subject based on a polymorphism (e.g., a plurality of polymorphisms, e.g., common polymorphisms). In some embodiments, the second sample is identified as being from the second subject based on a polymorphism (e.g., a plurality of polymorphisms, e.g., common polymorphisms). In some embodiments, a common polymorphism is present in a repetitive element, e.g., as described herein. In some embodiments, methods disclosed in Example 8 can be used to identify and/or distinguish the sample.
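One simple way to picture identifying or distinguishing samples by polymorphisms is to compare genotype calls at shared polymorphic sites. The sketch below is an illustration under stated assumptions and is not the procedure of Example 8: genotype profiles are represented as dictionaries keyed by (chromosome, position), and the 0.9 concordance cutoff, function names, and toy genotypes are all hypothetical.

```python
def genotype_concordance(profile_a, profile_b):
    """Fraction of shared polymorphic sites at which two samples have the same genotype."""
    shared = profile_a.keys() & profile_b.keys()
    if not shared:
        raise ValueError("no polymorphic sites in common")
    agree = sum(profile_a[site] == profile_b[site] for site in shared)
    return agree / len(shared)


def same_subject(profile_a, profile_b, min_concordance=0.9):
    """Call two samples as coming from the same subject if concordance meets the (hypothetical) cutoff."""
    return genotype_concordance(profile_a, profile_b) >= min_concordance


# Toy genotype calls for illustration only.
first_sample = {("chr1", 1000): "AA", ("chr2", 2000): "AG", ("chr3", 3000): "GG", ("chr4", 4000): "CT"}
first_repeat = {("chr1", 1000): "AA", ("chr2", 2000): "AG", ("chr3", 3000): "GG", ("chr4", 4000): "CT"}
second_sample = {("chr1", 1000): "AG", ("chr2", 2000): "GG", ("chr3", 3000): "GG", ("chr4", 4000): "CC"}
```

With these toy profiles, same_subject(first_sample, first_repeat) is true and same_subject(first_sample, second_sample) is false; real data would need a cutoff chosen with sequencing error and coverage in mind.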
In another aspect, provided herein is a reaction mixture comprising: at least 2, 3, 4, 5, 6, 7, 8, 9 or 10 detection reagents, wherein a detection reagent mediates a readout that is a value of the level or presence of: (i) one or more genetic biomarkers referred to herein; (ii) one or more protein biomarkers referred to herein; and/or (iii) the copy number or length, e.g., aneuploidy, of a genomic sequence disposed between at least two terminal repeated elements of a repeated element family (RE Family) referred to herein.
In yet another aspect, the disclosure provides a kit comprising: (a) at least 2, 3, 4, 5, 6, 7, 8, 9 or 10 detection reagents, wherein a detection reagent mediates a readout that is a value of the level or presence of: (i) one or more genetic biomarkers referred to herein; (ii) one or more protein biomarkers referred to herein; and/or (iii) the copy number or length, e.g., aneuploidy, of a genomic sequence disposed between at least two terminal repeated elements of a repeated element family (RE Family) referred to herein; and (b) instructions for using said kit.
In some embodiments of any of the methods disclosed herein, quantifying amplicons mapped to genomic intervals comprises identifying a plurality of genomic intervals with one or more shared amplicon features. In some embodiments, the shared amplicon feature is the number of the mapped amplicons.
In some embodiments of any of the methods disclosed herein, the shared amplicon feature is the average length of the mapped amplicons. In some embodiments, the plurality of genomic intervals with shared amplicon features are grouped into clusters. In some embodiments, each cluster includes about two hundred genomic intervals. In some embodiments, the clusters comprise predefined clusters. In some embodiments, the comparison of the genomic intervals further comprises matching one or more genomic intervals from test samples to predefined clusters. In some embodiments, matching genomic intervals from test samples to predefined clusters further comprises identifying one or more genomic intervals with shared amplicon features outside a predetermined significance threshold for a predefined cluster. In some embodiments, the method comprises supervised machine learning. In some embodiments, the supervised machine learning employs a support vector machine model.
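The comparison of one genomic interval against a cluster of intervals with shared amplicon features can be sketched as follows. The sketch uses a plain z-score on amplicon counts as a stand-in for the significance test; that choice, the threshold of 3.0, the function names, and the toy counts are assumptions for illustration and are not the statistical model of the disclosure. The toy clusters also hold only a handful of intervals, whereas the text describes clusters of about two hundred.

```python
from statistics import mean, stdev


def interval_z_score(test_count, cluster_counts):
    """Z-score of one interval's amplicon count against the intervals in its cluster."""
    mu = mean(cluster_counts)
    sigma = stdev(cluster_counts)
    if sigma == 0:
        raise ValueError("cluster counts have zero variance")
    return (test_count - mu) / sigma


def flag_intervals(counts_by_interval, clusters, threshold=3.0):
    """Return intervals whose count deviates from their cluster by more than `threshold`.

    `clusters` maps an interval to the list of intervals it was grouped with
    (hypothetical predefined clusters).
    """
    flagged = []
    for interval, members in clusters.items():
        reference = [counts_by_interval[m] for m in members]
        z = interval_z_score(counts_by_interval[interval], reference)
        if abs(z) > threshold:
            flagged.append(interval)
    return flagged


# Toy counts for illustration only: interval "x" carries an apparent gain.
counts = {"a": 100, "b": 102, "c": 98, "d": 101, "e": 99, "x": 150}
clusters = {"x": ["a", "b", "c", "d", "e"], "a": ["b", "c", "d", "e"]}
flagged = flag_intervals(counts, clusters)
```

Because each interval is judged against other intervals from the same sample, this kind of comparison does not depend on the absolute sequencing depth of the sample.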
In some embodiments of any of the methods disclosed herein, a single pair of primers is used for the amplification of a plurality of amplicons from a DNA sample, the pair comprising a first primer comprising a sequence that is at least 80% identical to SEQ ID NO: 1 and a second primer comprising a sequence that is at least 80% identical to SEQ ID NO: 10. In some embodiments, the sequence of the first primer is at least 90% identical to SEQ ID NO: 1. In some embodiments, the sequence of the first primer is at least 95% identical to SEQ ID NO: 1. In some embodiments, the sequence of the first primer is 100% identical to SEQ ID NO: 1. In some embodiments, the sequence of the second primer is at least 90% identical to SEQ ID NO: 10. In some embodiments, the sequence of the second primer is at least 95% identical to SEQ ID NO: 10. In some embodiments, the sequence of the second primer is 100% identical to SEQ ID NO: 10. In some embodiments, a kit comprising a pair of primers is used to amplify a plurality of amplicons from a DNA sample, wherein a first primer of the primer pair comprises SEQ ID NO: 1 or a sequence at least 80% identical thereto, and a second primer of the primer pair comprises SEQ ID NO: 10, or a sequence at least 80% identical thereto.
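The percent identity thresholds above can be illustrated with a minimal calculation. The sketch assumes an ungapped, position-by-position comparison of two equal-length sequences, which is only one of several ways identity can be computed and is not specified by the disclosure. The sequences shown are placeholders and are not SEQ ID NO: 1 or SEQ ID NO: 10; the function name is hypothetical.

```python
def percent_identity(candidate, reference):
    """Ungapped percent identity between two equal-length nucleotide sequences."""
    if len(candidate) != len(reference):
        raise ValueError("this sketch only compares equal-length sequences")
    if not reference:
        raise ValueError("sequences must be non-empty")
    matches = sum(a == b for a, b in zip(candidate.upper(), reference.upper()))
    return 100.0 * matches / len(reference)


# Placeholder sequences for illustration; NOT the primer sequences of the disclosure.
REFERENCE = "ACGTACGTACGTACGTACGT"
CANDIDATE = "ACGTACGTACTTACGTACGA"  # differs from REFERENCE at two positions

identity = percent_identity(CANDIDATE, REFERENCE)
meets_80 = identity >= 80.0
meets_95 = identity >= 95.0
```

For sequences of different lengths or with insertions and deletions, an alignment-based identity calculation would be needed instead of this position-wise comparison.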
In another aspect, the disclosure provides a method of testing for the presence of cancer in a mammal. The method includes: a) amplifying a plurality of chromosomal sequences in a DNA sample with a pair of primers complementary to the chromosomal sequences to form a plurality of amplicons; b) determining at least a portion of the nucleic acid sequence of one or more of the plurality of amplicons; c) mapping the sequenced amplicons to a reference genome; d) dividing the DNA sample into a plurality of genomic intervals; e) quantifying a plurality of features for the amplicons mapped to the genomic intervals; f) comparing the plurality of features of amplicons in a first genomic interval with the plurality of features of amplicons in one or more different genomic intervals; and g) determining the presence of cancer in the mammal when the plurality of features of amplicons in a first genomic interval is different from the plurality of features of amplicons in one or more different genomic intervals. In some embodiments, at least 100,000 amplicons are formed in the step of amplifying. In some embodiments, the cancer can be a Stage I cancer. In some embodiments, the cancer can be a liver cancer, an ovarian cancer, an esophageal cancer, a stomach cancer, a pancreatic cancer, a colorectal cancer, a lung cancer, a breast cancer, or a prostate cancer.
In some embodiments, the method is an in vitro method.
In some embodiments of any of the methods, reaction mixtures or kits disclosed herein, the plurality of amplicons comprise about 1,000,000 amplicons, e.g., about 1,000,000-10,000 amplicons; about 1,000,000-50,000 amplicons; about 1,000,000-100,000 amplicons; about 1,000,000-200,000 amplicons; about 1,000,000-300,000 amplicons; about 1,000,000-400,000 amplicons; about 1,000,000-500,000 amplicons; about 1,000,000-600,000 amplicons; about 1,000,000-700,000 amplicons; about 1,000,000-800,000 amplicons; about 1,000,000-900,000 amplicons; about 900,000-10,000 amplicons; about 800,000-10,000 amplicons; about 700,000-10,000 amplicons; about 600,000-10,000 amplicons; about 500,000-10,000 amplicons; about 400,000-10,000 amplicons; about 300,000-10,000 amplicons; about 200,000-10,000 amplicons; about 100,000-10,000 amplicons or about 50,000-10,000 amplicons.
In some embodiments, the plurality of amplicons comprises about 50,000 amplicons; about 100,000 amplicons; about 150,000 amplicons; about 200,000 amplicons; about 250,000 amplicons; about 300,000 amplicons; about 350,000 amplicons; about 400,000 amplicons; about 450,000 amplicons; about 500,000 amplicons; about 550,000 amplicons; about 600,000 amplicons; about 650,000 amplicons; about 700,000 amplicons; about 750,000 amplicons; about 800,000 amplicons; about 850,000 amplicons; about 900,000 amplicons; about 950,000 amplicons; or about 1,000,000 amplicons.
In some embodiments, the plurality of amplicons comprises about 750,000 amplicons.
In some embodiments, the plurality of amplicons comprises about 350,000 amplicons.
In some embodiments of any of the methods disclosed herein, the number of repetitive elements, e.g., amplicons, amplified by the single primer pair disclosed herein is a function of: the number of repetitive elements present in a sample and/or the length of a repetitive element present in a sample. For example, in some samples, the number of repetitive elements, e.g., amplicons, that can be detected with the single primer pair is about 750,000 amplicons. In some embodiments, in other samples, the number of repetitive elements, e.g., amplicons, that can be detected with the single primer pair is about 350,000 amplicons.
In some embodiments of any of the methods, reaction mixtures or kits disclosed herein, the average length of the amplicons is about 100 basepairs or less. In some embodiments, the average length of the amplicons is less than about 110 bp, e.g., about 10-110 bp, about 10-105 bp, about 10-100 bp, about 10-99 bp, about 10-98 bp, about 10-97 bp, about 10-96 bp, about 10-95 bp, about 10-94 bp, about 10-93 bp, about 10-92 bp, about 10-91 bp, about 10-90 bp, about 10-89 bp, about 10-87 bp, about 10-86 bp, about 10-85 bp, about 10-84 bp, about 10-83 bp, about 10-82 bp, about 10-81 bp, about 10-80 bp, about 10-79 bp, about 10-78 bp, about 10-77 bp, about 10-76 bp, about 10-75 bp, about 10-74 bp, about 10-73 bp, about 10-72 bp, about 10-71 bp, about 10-70 bp, about 10-65 bp, about 10-60 bp, about 10-55 bp, about 10-50 bp, about 10-40 bp, about 10-30 bp, about 10-20 bp, about 15-110 bp, about 20-110 bp, about 25-110 bp, about 30-110 bp, about 35-110 bp, about 40-110 bp, about 45-110 bp, about 50-110 bp, about 55-110 bp, about 60-110 bp, about 65-110 bp, about 70-110 bp, about 75-110 bp, about 80-110 bp, about 85-110 bp, about 90-110 bp, about 95-110 bp, about 100-110 bp, or about 105-110 bp.
In some embodiments, the average length of the amplicons is about 10 bp; about 20 bp; about 30 bp; about 40 bp; about 45 bp; about 50 bp; about 60 bp; about 65 bp; about 70 bp; about 75 bp; about 80 bp; about 85 bp; about 90 bp; about 95 bp; about 100 bp; about 105 bp or about 110 bp.
Additional features of any of the methods disclosed herein include one or more of the following enumerated embodiments.
Those skilled in the art will recognize, or be able to ascertain using no more than routine experimentation, many equivalents to the specific embodiments of the invention described herein. Such equivalents are intended to be encompassed by the following enumerated embodiments.
E1. A method of evaluating a subject for the presence of, or the risk of developing, any of a plurality of, e.g., any of at least four, cancers in the subject comprising:
(i) acquiring, e.g., directly acquiring or indirectly acquiring, a value for, e.g., detecting, the presence of one or more genetic biomarkers, e.g., one or more mutations (e.g., one or more driver gene mutations), in each of one or more genes (e.g., one or more driver genes, e.g., in at least four driver genes), and optionally wherein, each gene, e.g., driver gene, is associated with the presence, or risk, of a cancer of the plurality of cancers;
(ii) acquiring, e.g., directly acquiring or indirectly acquiring, a value for, e.g., detecting, the level of each of a plurality of, e.g., at least four, protein biomarkers, and optionally wherein, the level of each protein biomarker of the plurality is associated with the presence, or risk, of a cancer of the plurality of cancers; or
(iii) acquiring, e.g., directly acquiring or indirectly acquiring, a value for, e.g., detecting, aneuploidy, wherein the aneuploidy value is a function of the copy number or length of a genomic sequence disposed between at least two terminal repeated elements of a repeated element family (RE Family), wherein the RE family comprises:
(a) a RE Family other than a long interspersed nucleotide element (LINE);
(b) a RE Family which when amplified with a primer moiety complementary to its repeated terminal elements, provides a plurality of amplicons having an average length of less than X nts, wherein X is 100, 105, or 110;
(c) a RE family which is less than about 700 bp long; or
(d) a RE family which is present in at least 100 copies per genome;
and optionally wherein, the aneuploidy is associated with the presence, or risk, of a cancer of the plurality of cancers;
thereby evaluating the subject for the presence of or risk of developing, any of the plurality of, e.g., any of at least four, cancers.
E2. The method of embodiment E1, wherein:
(a) one of (i), (ii) and (iii) is directly acquired;
(b) (i) and (ii) are directly acquired;
(c) (i) and (iii) are directly acquired;
(d) (ii) and (iii) are directly acquired; or
(e) all of (i), (ii) and (iii) are directly acquired.
E3. The method of embodiment E1, wherein:
(a) one of (i), (ii) and (iii) is indirectly acquired;
(b) (i) and (ii) are indirectly acquired;
(c) (i) and (iii) are indirectly acquired;
(d) (ii) and (iii) are indirectly acquired; or
(e) all of (i), (ii) and (iii) are indirectly acquired.
E4. The method of any one of embodiments E1-E3, comprising:
(1) sequencing one or more subgenomic intervals or amplicons comprising the genetic biomarkers;
(2) analyzing one or more genomic sequences for aneuploidy, and/or
(3) contacting a protein biomarker with a detection reagent.
E5. The method of any one of embodiments E1-E4, wherein the aneuploidy value is a function of:
(a) the copy number of the genomic sequence disposed between at least two terminal repeated elements of a RE Family; and/or
(b) the length of the genomic sequence disposed between at least two terminal repeated elements of a repeated element family (RE Family).
E6. The method of any one of embodiments E1-E5, wherein a biological sample obtained from the subject is evaluated for one, two or all of (i)-(iii).
E7. The method of embodiment E6, wherein the biological sample comprises a liquid sample, e.g., a blood sample.
E8. The method of embodiment E6 or E7, wherein the biological sample comprises a cell-free DNA sample, a plasma sample or a serum sample.
E9. The method of any one of embodiments E6-E8, wherein the biological sample comprises cell-free DNA, e.g., circulating tumor DNA.
E10. The method of any one of embodiments E1-E9, further comprising:
(i) acquiring a sequence for a subgenomic interval from cell-free DNA from a sample;
(ii) acquiring a leukocyte parameter, e.g., sequence of the subgenomic interval, from leukocyte DNA from the sample.
E11. The method of any one of embodiments E1-E10 further comprising:
(i) acquiring a sequence for a subgenomic interval for aneuploidy analysis from cell-free DNA from a sample;
(ii) acquiring a leukocyte parameter, e.g., a sequence for the subgenomic interval for aneuploidy analysis, from leukocyte DNA from the sample.
E12. The method of embodiment E10 or E11 further comprising comparing (i) with (ii) to evaluate a genomic event, e.g., a mutation, found in the cell-free DNA subgenomic interval or cell-free DNA aneuploidy analysis sample.
E13. The method of any one of embodiments E10-E12, further comprising classifying a genomic event, e.g., a mutation, in the subgenomic interval from cell-free DNA or from aneuploidy analysis of cell-free DNA, e.g., assigning the mutation to a first class or a second class.
E14. The method of any one of embodiments E10-E13, further comprising classifying a genomic event, e.g., a mutation, in the subgenomic interval from cell-free DNA or from aneuploidy analysis of cell-free DNA, as growth-deregulating, e.g., cancerous.
E15. The method of any one of embodiments E10-E13, further comprising classifying a genomic event, e.g., a mutation, in the subgenomic interval from cell-free DNA or from aneuploidy analysis of cell-free DNA, as other than growth-deregulating, e.g., as other than cancerous.
E16. The method of any one of embodiments E10-E14, wherein a genomic event, e.g., a mutation, in the subgenomic interval from cell-free DNA or from aneuploidy analysis of cell-free DNA, is classified as cancerous when:
(a) the subgenomic interval is aneuploid in cell-free DNA, and the subgenomic interval is not aneuploid in leukocytes; or
(b) the genomic event is present in the subgenomic interval of cell-free DNA, and the genomic event is not present in the subgenomic interval of leukocytes.
E17. The method of any one of embodiments E10-E13 or E15, wherein a genomic event, e.g., a mutation, in the subgenomic interval from cell-free DNA or from aneuploidy analysis of cell-free DNA, is classified as other than growth-deregulating when:
(a) the subgenomic interval is aneuploid in cell-free DNA, and the subgenomic interval is aneuploid in leukocytes; or
(b) the genomic event is present in the subgenomic interval of cell-free DNA and the genomic event is present in the subgenomic interval of leukocytes.
E18. The method of embodiment E17, wherein the genomic event is associated with clonal expansion of leukocytes, e.g., age-associated clonal hematopoiesis, e.g., clonal hematopoiesis of indeterminate potential (CHIP).
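For illustration only, the comparison of cell-free DNA with leukocyte DNA recited in embodiments E16 and E17 can be sketched in Python as follows. The sketch assumes that the presence of a genomic event (a mutation or an aneuploid interval) has already been determined separately for each DNA source; the function name and the returned labels are hypothetical.

```python
def classify_genomic_event(in_cell_free_dna: bool, in_leukocytes: bool) -> str:
    """Classify an event from its presence in cell-free DNA and leukocyte DNA.

    Present in cell-free DNA but absent from leukocytes: treated as
    growth-deregulating (e.g., cancerous).
    Present in both: treated as other than growth-deregulating, e.g.,
    consistent with clonal expansion of leukocytes.
    """
    if in_cell_free_dna and not in_leukocytes:
        return "growth-deregulating"
    if in_cell_free_dna and in_leukocytes:
        return "other than growth-deregulating"
    # The embodiments only address events found in cell-free DNA.
    return "not detected in cell-free DNA"


if __name__ == "__main__":
    print(classify_genomic_event(True, False))
    print(classify_genomic_event(True, True))
```

The third return value is an assumption added so that the function is total; the embodiments themselves do not assign a class to events absent from cell-free DNA.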
E19. The method of any one of embodiments E1-E18, wherein specificity of detection of the cancer in the plurality of cancers with (i), (ii) and (iii) is substantially the same as, e.g., not substantially lower than, the specificity of detection of the cancer in the plurality of cancers with: (i); (ii); (iii); (i) and (ii); (i) and (iii); or (ii) and (iii).
E20. The method of any one of embodiments E1-E19, wherein sensitivity of detection of the cancer in the plurality of cancers with (i), (ii) and (iii) is higher, e.g., about 1.1, 1.2, 1.3, 1.4, 1.5, 2, 2.5, 3, 3.5, 4, 4.5, 5, 5.5, 6, 6.5, 7, 7.5, 8, 8.5, 9, 9.5, or 10 fold higher, than the sensitivity of detection of the cancer in the plurality of cancers with: (i); (ii); (iii); (i) and (ii); (i) and (iii); or (ii) and (iii).
E21. The method of any one of embodiments E1-E20, wherein (i), (ii) and (iii) result in an increased sensitivity of detection, e.g., about 1.1, 1.2, 1.3, 1.4, 1.5, 2, 2.5, 3, 3.5, 4, 4.5, 5, 5.5, 6, 6.5, 7, 7.5, 8, 8.5, 9, 9.5, or 10 fold increase in sensitivity of detection at a specified specificity, e.g., at a predetermined specificity, e.g., at least about 90%, 91%, 92%, 93%, 94%, 95%, 96%, 97%, 98%, 99% or 100% specificity.
E22. The method of any one of embodiments E20-E21, wherein the increase in sensitivity of detection of the cancer in the plurality of cancers does not affect, e.g., reduce or substantially reduce, the specificity of detection of the cancer in the plurality of cancers.
E23. The method of embodiment E22, wherein the specificity of detection of the cancer in the plurality of cancers is at a plateau.
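As a non-limiting sketch of how sensitivity at a predetermined specificity, and a fold increase in sensitivity, might be computed, the following Python example assumes that each sample has been reduced to a single numeric score, that higher scores indicate cancer, and that a sample is called positive when its score exceeds a threshold chosen from control samples. The function names, the scoring convention, and the example values are hypothetical and are not taken from the disclosure.

```python
import math


def threshold_at_specificity(control_scores, specificity):
    """Pick a threshold so that at least `specificity` of controls score at or below it."""
    ordered = sorted(control_scores)
    n = len(ordered)
    # Round before ceil to avoid floating-point artifacts such as 9.000000001.
    k = math.ceil(round(specificity * n, 9))
    k = max(1, min(k, n))
    return ordered[k - 1]


def sensitivity(case_scores, threshold):
    """Fraction of cancer cases whose score exceeds the threshold."""
    return sum(score > threshold for score in case_scores) / len(case_scores)


def fold_increase(combined_sensitivity, single_sensitivity):
    """Ratio of the sensitivity of a combined test to that of a single test."""
    return combined_sensitivity / single_sensitivity


if __name__ == "__main__":
    controls = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]  # made-up scores
    cases = [0.95, 1.2, 0.5, 2.0]  # made-up scores
    cutoff = threshold_at_specificity(controls, 0.9)
    print(cutoff, sensitivity(cases, cutoff))
```

Fixing the threshold on controls first and only then measuring sensitivity keeps the comparison between single and combined tests at the same specificity.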
E24. The method of any one of embodiments E1-E23, wherein the RE family is other than a LINE.
E25. The method of any one of embodiments E1-E24, wherein the RE family comprises a repeated element which when amplified with a primer to its repeated terminal elements, provides a plurality of amplicons having an average length of less than about 110 bp, e.g., about 10-110 bp, about 10-105 bp, about 10-100 bp, about 10-99 bp, about 10-98 bp, about 10-97 bp, about 10-96 bp, about 10-95 bp, about 10-94 bp, about 10-93 bp, about 10-92 bp, about 10-91 bp, about 10-90 bp, about 10-89 bp, about 10-87 bp, about 10-86 bp, about 10-85 bp, about 10-84 bp, about 10-83 bp, about 10-82 bp, about 10-81 bp, about 10-80 bp, about 10-79 bp, about 10-78 bp, about 10-77 bp, about 10-76 bp, about 10-75 bp, about 10-74 bp, about 10-73 bp, about 10-72 bp, about 10-71 bp, about 10-70 bp, about 10-65 bp, about 10-60 bp, about 10-55 bp, about 10-50 bp, about 10-40 bp, about 10-30 bp, about 10-20 bp, about 15-110 bp, about 20-110 bp, about 25-110 bp, about 30-110 bp, about 35-110 bp, about 40-110 bp, about 45-110 bp, about 50-110 bp, about 55-110 bp, about 60-110 bp, about 65-110 bp, about 70-110 bp, about 75-110 bp, about 80-110 bp, about 85-110 bp, about 90-110 bp, about 95-110 bp, about 100-110 bp, or about 105-110 bp.
E26. The method of any one of embodiments E1-E25, wherein the RE family comprises one or more repetitive elements shown in Table 1.
E27. The method of any one of embodiments E1-E26, wherein the RE family comprises a SINE or a tandem repeat (e.g., microsatellite DNA, mini-satellite DNA, satellite DNA or DNA of genes with multiple copies (e.g., DNA encoding ribosomal RNA)).
E28. The method of embodiment E27, wherein the RE family is a SINE, e.g., an Alu family, a MIR or a MIR3, or a SINE described in Vassetzky and Kramerov (2013) Nucleic Acids Res. 41: D83-89.
E29. The method of any one of embodiments E1-E28, wherein the value for aneuploidy is further a function of the copy number or length of a genomic sequence disposed between the terminal repeated elements of a LINE repeated element.
E30. The method of any one of embodiments E1-E29, wherein the value for aneuploidy is further a function of the copy number or length of a plurality of genomic sequences disposed between the terminal repeated elements of a repeated element family which when amplified with a primer complementary to its repeated terminal elements, provides amplicons having an average length of more than 100 bp.
E31. The method of any one of embodiments E1-E30, wherein the value for aneuploidy is further a function of:
E32. The method of any one of embodiments E1-E31, comprising providing a value for aneuploidy, wherein the value is a function of the copy number of at least about 5, 10, 20, 30, 50, 100, 200, 500, or 1000 different genomic sequences disposed between the terminal repeated elements of a RE family.
E33. The method of any one of embodiments E1-E32, wherein the copy number is greater than 2 or is less than 2.
E34. The method of any one of embodiments E31-E33, wherein at least about 100,000 amplicons, about 150,000 amplicons, about 200,000 amplicons, about 250,000 amplicons, about 300,000 amplicons, about 350,000 amplicons, about 400,000 amplicons, about 450,000 amplicons, about 500,000 amplicons, about 550,000 amplicons, about 600,000 amplicons, about 650,000 amplicons, about 700,000 amplicons, about 750,000 amplicons, about 800,000 amplicons, about 850,000 amplicons, about 900,000 amplicons, about 950,000 amplicons, or about 1,000,000 amplicons are formed.
E35. The method of any one of embodiments E1-E34, comprising providing a value for aneuploidy, wherein the value is a function of:
(i) the copy number or length of a first genomic sequence disposed between the terminal repeated elements of a RE family, on a first segment of genomic DNA; and
(ii) the copy number or length of a second genomic sequence disposed between the terminal repeated elements of a (e.g., the same or a different) RE family, on a second segment of genomic DNA.
E36. The method of embodiment E35, wherein:
(i) the first segment of genomic DNA and the second segment of genomic DNA are on different arms of the same chromosome, e.g., the first segment is on the q arm and the second segment is on the p arm of the same chromosome; or the first segment is on the p arm and the second segment is on the q arm of the same chromosome;
(ii) the first segment of genomic DNA and the second segment of genomic DNA are on the same arm of the same chromosome, e.g., the first segment and the second segment are both on the p arm, or q arm of a chromosome; and/or
(iii) the first segment of genomic DNA and the second segment of genomic DNA are on different chromosomes, e.g., non-homologous chromosomes.
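Purely as an illustrative sketch, and not as a description of the WALDO approach, a value that is a function of the copy number of genomic sequences on different segments (e.g., chromosome arms) could be computed by comparing per-arm amplicon read counts in a test sample with those in a euploid reference. The Python example below assumes read counts per arm are already available as dictionaries, assumes a baseline copy number of 2, and uses hypothetical names and made-up counts.

```python
def arm_copy_number_estimates(sample_counts, reference_counts):
    """Estimate per-arm copy number from read counts, assuming a diploid baseline.

    Each arm's share of reads in the sample is divided by its share in the
    reference and scaled by 2, so an arm matching the reference yields 2.0.
    """
    sample_total = sum(sample_counts.values())
    reference_total = sum(reference_counts.values())
    estimates = {}
    for arm, reference_reads in reference_counts.items():
        sample_fraction = sample_counts.get(arm, 0) / sample_total
        reference_fraction = reference_reads / reference_total
        estimates[arm] = 2.0 * sample_fraction / reference_fraction
    return estimates


if __name__ == "__main__":
    reference = {"1p": 100, "1q": 100, "2p": 100, "2q": 100}  # made-up counts
    sample = {"1p": 150, "1q": 100, "2p": 100, "2q": 100}     # made-up gain on 1p
    for arm, value in arm_copy_number_estimates(sample, reference).items():
        print(arm, round(value, 2))
```

Because this simple normalization divides by the total read count, a gain on one arm slightly lowers the estimates for the remaining arms; a practical analysis would need a more robust normalization and a statistical test rather than a raw ratio.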
E37. The method of any one of embodiments E1-E36, comprising providing a value for aneuploidy, wherein the value is a function of:
the copy number or length of a third genomic sequence disposed between the terminal repeated elements of a RE family, on a third chromosome.
E38. The method of any one of embodiments E1-E37, comprising providing a value for aneuploidy, wherein the value is a function of:
the copy number or length of an Nth genomic sequence disposed between the terminal repeated elements of a RE family, on an Nth chromosome, wherein N is 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, or 23.
E39. The method of any one of embodiments E1-E38, comprising contacting subject genomic nucleic acid with a primer moiety which amplifies a sequence comprising a genomic sequence disposed between the terminal repeated elements of a RE family.
E40. The method of embodiment E39, wherein the primer moiety is complementary to a terminal element of the RE family.
E41. The method of embodiment E39 or E40, wherein the primer moiety comprises a pair of primers.
E42. The method of any one of embodiments E39-E41, wherein the primer moiety comprises a single primer, and e.g., is used with isothermal amplification.
E43. The method of any one of embodiments E1-E42, wherein the number of biomarkers (e.g., number of driver gene mutations) detected is sufficient such that the sensitivity of detection of the cancer in the plurality of cancers with which each gene, e.g., driver gene, is associated is not substantially increased by the detection of one or more additional genetic biomarkers.
E44. The method of any one of embodiments E1-E42, wherein detecting the genetic biomarker comprises providing, e.g., by sequencing, the sequence (e.g., nucleotide sequence) of the genetic biomarker.
E45. The method of embodiment E44, wherein the number of genetic biomarker sequences provided is sufficient such that the sensitivity of detection of the cancer in the plurality of cancers with which each gene, e.g., driver gene, is associated is not substantially increased by the provision of one or more sequences of additional genetic biomarkers.
E46. The method of any one of embodiments E1-E42, wherein detecting the biomarker comprises providing the sequence (e.g., nucleotide sequence) of one or more subgenomic intervals comprising the genetic biomarker.
E47. The method of embodiment E46, wherein the number of subgenomic interval sequences provided is sufficient such that the sensitivity of detection of the cancer in the plurality of cancers with which each gene, e.g., driver gene, is associated is not substantially increased by the provision of one or more sequences (e.g., nucleotide sequences) of additional subgenomic intervals.
E48. The method of any one of embodiments E1-E42, wherein detecting the genetic biomarker comprises providing the sequence of an amplicon comprising the genetic biomarker.
E49. The method of embodiment E48, wherein the number of amplicon sequences provided is sufficient such that the sensitivity of detection of the cancer in the plurality of cancers with which each gene, e.g., driver gene, is associated is not substantially increased by the provision of one or more sequences of additional amplicons.
E50. The method of embodiment E46, wherein the number of subgenomic interval sequences provided is sufficient such that the specificity of detection of the cancer in the plurality of cancers with which each gene, e.g., driver gene, is associated is not substantially decreased by the provision of one or more sequences of additional subgenomic intervals.
E51. The method of embodiment E48, wherein the number of amplicons provided is sufficient such that the specificity of detection of the cancer in the plurality of cancers with which each gene, e.g., driver gene, of the plurality is associated is not substantially decreased by the provision of one or more sequences of additional amplicons.
E52. The method of any of the preceding embodiments, wherein the plurality of cancers comprises 4, 5, 6, 7 or 8 cancers.
E53. The method of any of the preceding embodiments, wherein the plurality of cancers is chosen from solid tumors such as: mesothelioma (e.g., malignant pleural mesothelioma), lung cancer (e.g., non-small cell lung cancer, small cell lung cancer, squamous cell lung cancer, or large cell lung cancer), pancreatic cancer (e.g., pancreatic ductal adenocarcinoma), liver cancer (e.g., hepatocellular carcinoma, or cholangiocarcinoma), esophageal cancer (e.g., esophageal adenocarcinoma or squamous cell carcinoma), head and neck cancer, ovarian cancer, colorectal cancer, bladder cancer, cervical cancer, uterine cancer (endometrial cancer), kidney cancer, breast cancer, prostate cancer, brain cancer (e.g., medulloblastoma, or glioblastoma), or sarcoma (e.g., Ewing sarcoma, osteosarcoma, rhabdomyosarcoma), or a combination thereof.
E54. The method of any of the preceding embodiments, wherein the plurality of cancers is chosen from liver cancer, ovarian cancer, esophageal cancer, stomach cancer, pancreatic cancer, colorectal cancer, lung cancer, breast cancer, or prostate cancer, or a combination thereof.
E55. The method of any of the preceding embodiments, wherein one or more of the plurality of cancers is chosen from liver cancer, ovarian cancer, esophageal cancer, stomach cancer, pancreatic cancer, colorectal cancer, lung cancer, or breast cancer.
E56. The method of any of the preceding embodiments, wherein one or more of the plurality of cancers is a hematological cancer.
E57. The method of any of the preceding embodiments, wherein no more than 60, 100, 150, 200, 300 or 400 subgenomic intervals or amplicons from the one or more genes, e.g., one or more driver genes, e.g., genes listed in Tables 60 and 61 of US2019/0256924A1, e.g., ABL1, ACVR1B, AKT1, ALK, APC, AR, ARID1A, ARID1B, ARID2, ASXL1, ATM, ATRX, AXIN1, B2M, BAP1, BCL2, BCOR, BRAF, BRCA1, BRCA2, CARD11, CASP8, CBL, CDC73, CDH1, CDKN2A, CEBPA, CIC, CREBBP, CRLF2, CSF1R, CTNNB1, CYLD, DAXX, DNMT1, DNMT3A, EGFR, EP300, ERBB2, EZH2, FAM123B, FBXW7, FGFR2, FGFR3, FLT3, FOXL2, FUBP1, GATA1, GATA2, GATA3, GNA11, GNAQ, GNAS, H3F3A, HIST1H3B, HNF1A, HRAS, IDH1, IDH2, JAK1, JAK2, JAK3, KDM5C, KDM6A, KIT, KLF4, KRAS, MAP2K1, MAP3K1, MED12, MEN1, MET, MLH1, MLL2, MLL3, MPL, MSH2, MSH6, MYD88, NCOR1, NF1, NF2, NFE2L2, NOTCH1, NOTCH2, NPM1, NRAS, PAX5, PBRM1, PDGFRA, PHF6, PIK3CA, PIK3R1, PPP2R1A, PRDM1, PTCH1, PTEN, PTPN11, RB1, RET, RNF43, RUNX1, SETD2, SETBP1, SF3B1, SMAD2, SMAD4, SMARCA4, SMARCB1, SMO, SOCS1, SOX9, SPOP, SRSF2, STAG2, STK11, TET2, TNFAIP3, TRAF7, TP53, TSC1, TSHR, U2AF1, VHL, WT1, CCND1, CDKN2C, IKZF1, LMO1, MAP2K4, MDM2, MDM4, MYC, MYCL1, MYCN, NCOA3, NKX2-1, or SKP2, are sequenced.
E58. The method of any of the preceding embodiments, wherein at least 30, 40, 50 or 60 subgenomic intervals or amplicons from the one or more genes, e.g., one or more driver genes, e.g., genes listed in Tables 60 and 61 of US2019/0256924A1, e.g., ABL1, ACVR1B, AKT1, ALK, APC, AR, ARID1A, ARID1B, ARID2, ASXL1, ATM, ATRX, AXIN1, B2M, BAP1, BCL2, BCOR, BRAF, BRCA1, BRCA2, CARD11, CASP8, CBL, CDC73, CDH1, CDKN2A, CEBPA, CIC, CREBBP, CRLF2, CSF1R, CTNNB1, CYLD, DAXX, DNMT1, DNMT3A, EGFR, EP300, ERBB2, EZH2, FAM123B, FBXW7, FGFR2, FGFR3, FLT3, FOXL2, FUBP1, GATA1, GATA2, GATA3, GNA11, GNAQ, GNAS, H3F3A, HIST1H3B, HNF1A, HRAS, IDH1, IDH2, JAK1, JAK2, JAK3, KDM5C, KDM6A, KIT, KLF4, KRAS, MAP2K1, MAP3K1, MED12, MEN1, MET, MLH1, MLL2, MLL3, MPL, MSH2, MSH6, MYD88, NCOR1, NF1, NF2, NFE2L2, NOTCH1, NOTCH2, NPM1, NRAS, PAX5, PBRM1, PDGFRA, PHF6, PIK3CA, PIK3R1, PPP2R1A, PRDM1, PTCH1, PTEN, PTPN11, RB1, RET, RNF43, RUNX1, SETD2, SETBP1, SF3B1, SMAD2, SMAD4, SMARCA4, SMARCB1, SMO, SOCS1, SOX9, SPOP, SRSF2, STAG2, STK11, TET2, TNFAIP3, TRAF7, TP53, TSC1, TSHR, U2AF1, VHL, WT1, CCND1, CDKN2C, IKZF1, LMO1, MAP2K4, MDM2, MDM4, MYC, MYCL1, MYCN, NCOA3, NKX2-1, or SKP2, are sequenced.
E59. The method of any of the preceding embodiments, wherein at least 30 and not more than 400, at least 40 and not more than 300, at least 50 and no more than 200, at least 60 and no more than 150, or at least 60 and no more than 100, subgenomic intervals or amplicons from the one or more genes, e.g., one or more driver genes, e.g., one or more genes listed in Tables 60 and 61 of US2019/0256924A1, e.g., ABL1, ACVR1B, AKT1, ALK, APC, AR, ARID1A, ARID1B, ARID2, ASXL1, ATM, ATRX, AXIN1, B2M, BAP1, BCL2, BCOR, BRAF, BRCA1, BRCA2, CARD11, CASP8, CBL, CDC73, CDH1, CDKN2A, CEBPA, CIC, CREBBP, CRLF2, CSF1R, CTNNB1, CYLD, DAXX, DNMT1, DNMT3A, EGFR, EP300, ERBB2, EZH2, FAM123B, FBXW7, FGFR2, FGFR3, FLT3, FOXL2, FUBP1, GATA1, GATA2, GATA3, GNA11, GNAQ, GNAS, H3F3A, HIST1H3B, HNF1A, HRAS, IDH1, IDH2, JAK1, JAK2, JAK3, KDM5C, KDM6A, KIT, KLF4, KRAS, MAP2K1, MAP3K1, MED12, MEN1, MET, MLH1, MLL2, MLL3, MPL, MSH2, MSH6, MYD88, NCOR1, NF1, NF2, NFE2L2, NOTCH1, NOTCH2, NPM1, NRAS, PAX5, PBRM1, PDGFRA, PHF6, PIK3CA, PIK3R1, PPP2R1A, PRDM1, PTCH1, PTEN, PTPN11, RB1, RET, RNF43, RUNX1, SETD2, SETBP1, SF3B1, SMAD2, SMAD4, SMARCA4, SMARCB1, SMO, SOCS1, SOX9, SPOP, SRSF2, STAG2, STK11, TET2, TNFAIP3, TRAF7, TP53, TSC1, TSHR, U2AF1, VHL, WT1, CCND1, CDKN2C, IKZF1, LMO1, MAP2K4, MDM2, MDM4, MYC, MYCL1, MYCN, NCOA3, NKX2-1, or SKP2, are sequenced.
E60. The method of any of the preceding embodiments, wherein the number of subgenomic intervals or amplicons sequenced for a gene is no greater than 125, 150, 200, or 300% of the lowest number that achieves plateau for sensitivity of detection of the cancer.
E61. The method of any of the preceding embodiments, wherein each subgenomic interval or amplicon of the genetic biomarker comprises 6-800 bp, e.g., 6-750 bp, 6-700 bp, 6-650 bp, 6-600 bp, 6-550 bp, 6-500 bp, 6-450 bp, 6-400 bp, 6-350 bp, 6-300 bp, 6-250 bp, 6-200 bp, 6-150 bp, 6-100 bp, 10-800 bp, 15-800 bp, 20-800 bp, 25-800 bp, 30-800 bp, 35-800 bp, 40-800 bp, 45-800 bp, 50-800 bp, 55-800 bp, 60-800 bp, 65-800 bp, 70-800 bp, 75-800 bp, 80-800 bp, 85-800 bp, 90-800 bp, 95-800 bp, 100-800 bp, 200-800 bp, 300-800 bp, 400-800 bp, 500-800 bp, 600-800 bp, 700-800 bp, 10-700 bp, 20-600 bp, 30-500 bp, 40-400 bp, 50-300 bp, 60-200 bp, 61-150 bp, 62-140 bp, 63-130 bp, 64-120 bp, or 65-100 bp, e.g., 66-80 bp.
E62. The method of any of the preceding embodiments, wherein each subgenomic interval or amplicon of the genetic biomarker comprises about 35, 40, 45, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 100, or 110 bp.
E63. The method of any of the preceding embodiments, wherein each subgenomic interval or amplicon of the genetic biomarker comprises no more than 50, 55, 60, 65, 70, 75, 80, 85, 90, 95, 100, 200, 300, 400, 500, 600, 700, or 800 bp.
E64. The method of any of the preceding embodiments, wherein each subgenomic interval or amplicon of the genetic biomarker comprises at least 6, 10, 15, 20, 25, 30, 35, 40, 45, or 50 bp.
E65. The method of any of the preceding embodiments, wherein each subgenomic interval or amplicon of the genetic biomarker comprises at least 6 bp and no more than 800 bp, at least 10 bp and no more than 700 bp, at least 15 bp and no more than 600 bp, at least 20 bp and no more than 600 bp, at least 25 bp and no more than 500 bp, at least 30 bp and no more than 400 bp, at least 35 bp and no more than 300 bp, at least 40 bp and no more than 200 bp, at least 45 bp and no more than 100 bp, at least 50 bp and no more than 95 bp, or at least 55 bp and no more than 90 bp.
E66. The method of any of the preceding embodiments, wherein each subgenomic interval or amplicon of the genetic biomarker comprises 66-80 bp.
E67. The method of any of the preceding embodiments, wherein the number of subgenomic intervals or amplicons of the genetic biomarker comprises no more than 2000, 2500, 3000, 3500, 4000, 5000, 6000, 7000, 8000, 9000, 10,000, 15,000, or 20,000 bp.
E68. The method of any of the preceding embodiments, wherein the number of subgenomic intervals or amplicons of the genetic biomarker comprises at least 200, 300, 400, 500, 600, 700, 800, 900, 1000, 1100, 1200, 1300, 1400, 1500, 1600, 1700, 1800, 1900 or 2000 bp.
E69. The method of any of the preceding embodiments, wherein the number of subgenomic intervals or amplicons of the genetic biomarker comprises at least 200 bp and no more than 20,000 bp, at least 300 bp and no more than 15,000 bp, at least 400 bp and no more than 10,000 bp, at least 500 bp and no more than 9000 bp, at least 600 bp and no more than 8000 bp, at least 700 bp and no more than 7000 bp, at least 800 bp and no more than 6000 bp, at least 900 bp and no more than 5000 bp, at least 1000 bp and no more than 4000 bp, at least 1100 bp and no more than 3500 bp, at least 1200 bp and no more than 3000 bp, at least 1300 bp and no more than 2500 bp, or at least 1500 bp and no more than 2000 bp.
E70. The method of any of the preceding embodiments, wherein the number of subgenomic intervals or amplicons of the genetic biomarker comprises 200±15%, 300±15%, 400±15%, 500±15%, 600±15%, 700±15%, 800±15%, 900±15%, 1000±15%, 1100±15%, 1200±15%, 1300±15%, 1400±15%, 1500±15%, 1600±15%, 1700±15%, 1800±15%, 1900±15%, 2000±15%, 2500±15%, 3000±15%, 3500±15%, 4000±15%, 5000±15%, 6000±15%, 7000±15%, 8000±15%, 9000±15%, 10,000±15%, 15,000±15%, or 20,000 bp±15%, e.g., 2000 bp±15%.
E71. The method of any of the preceding embodiments, wherein the number of subgenomic intervals or amplicons of the genetic biomarker comprises 2000 bp.
E72. The method of any of the preceding embodiments, wherein the average depth to which the number of subgenomic intervals or amplicons of the genetic biomarker is sequenced is at least 5× sequencing depth.
E73. The method of any of the preceding embodiments, wherein the average depth to which the number of subgenomic intervals or amplicons of the genetic biomarker is sequenced is no more than 500× sequencing depth.
E74. The method of any of the preceding embodiments, wherein the average depth to which the number of subgenomic intervals or amplicons of the genetic biomarker is sequenced is between 5× and 500× sequencing depth.
E75. The method of any of the preceding embodiments, wherein said detecting step comprises sequencing each subgenomic interval to a depth of at least 50,000 reads per base.
E76. The method of any of the preceding embodiments, wherein said detecting step comprises sequencing each subgenomic interval to a depth of no more than 150,000 reads per base.
E77. The method of any of the preceding embodiments, wherein said detecting step comprises sequencing each subgenomic interval to a depth of from 50,000 reads per base to 150,000 reads per base.
E78. The method of any of the preceding embodiments, wherein said detecting step comprises sequencing each subgenomic interval at a depth sufficient to detect a mutation in said region of interest at a frequency as low as 0.0005%.
E79. The method of any of the preceding embodiments, wherein no more than 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 40, 45, 50, 55, 60, 100, 200 or 300 bp, is sequenced for each biomarker, e.g., each gene, e.g., each driver gene, e.g., each gene disclosed in Table 60 or 61 in US2019/0256924A1, e.g., ABL1, ACVR1B, AKT1, ALK, APC, AR, ARID1A, ARID1B, ARID2, ASXL1, ATM, ATRX, AXIN1, B2M, BAP1, BCL2, BCOR, BRAF, BRCA1, BRCA2, CARD11, CASP8, CBL, CDC73, CDH1, CDKN2A, CEBPA, CIC, CREBBP, CRLF2, CSF1R, CTNNB1, CYLD, DAXX, DNMT1, DNMT3A, EGFR, EP300, ERBB2, EZH2, FAM123B, FBXW7, FGFR2, FGFR3, FLT3, FOXL2, FUBP1, GATA1, GATA2, GATA3, GNA11, GNAQ, GNAS, H3F3A, HIST1H3B, HNF1A, HRAS, IDH1, IDH2, JAK1, JAK2, JAK3, KDM5C, KDM6A, KIT, KLF4, KRAS, MAP2K1, MAP3K1, MED12, MEN1, MET, MLH1, MLL2, MLL3, MPL, MSH2, MSH6, MYD88, NCOR1, NF1, NF2, NFE2L2, NOTCH1, NOTCH2, NPM1, NRAS, PAX5, PBRM1, PDGFRA, PHF6, PIK3CA, PIK3R1, PPP2R1A, PRDM1, PTCH1, PTEN, PTPN11, RB1, RET, RNF43, RUNX1, SETD2, SETBP1, SF3B1, SMAD2, SMAD4, SMARCA4, SMARCB1, SMO, SOCS1, SOX9, SPOP, SRSF2, STAG2, STK11, TET2, TNFAIP3, TRAF7, TP53, TSC1, TSHR, U2AF1, VHL, WT1, CCND1, CDKN2C, IKZF1, LMO1, MAP2K4, MDM2, MDM4, MYC, MYCL1, MYCN, NCOA3, NKX2-1, or SKP2.
E80. The method of any of the preceding embodiments, wherein at least 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, or 20 bp, is sequenced in each biomarker, e.g., each gene, e.g., each driver gene, e.g., each gene disclosed in Table 60 or 61 in US2019/0256924A1, e.g., ABL1, ACVR1B, AKT1, ALK, APC, AR, ARID1A, ARID1B, ARID2, ASXL1, ATM, ATRX, AXIN1, B2M, BAP1, BCL2, BCOR, BRAF, BRCA1, BRCA2, CARD11, CASP8, CBL, CDC73, CDH1, CDKN2A, CEBPA, CIC, CREBBP, CRLF2, CSF1R, CTNNB1, CYLD, DAXX, DNMT1, DNMT3A, EGFR, EP300, ERBB2, EZH2, FAM123B, FBXW7, FGFR2, FGFR3, FLT3, FOXL2, FUBP1, GATA1, GATA2, GATA3, GNA11, GNAQ, GNAS, H3F3A, HIST1H3B, HNF1A, HRAS, IDH1, IDH2, JAK1, JAK2, JAK3, KDM5C, KDM6A, KIT, KLF4, KRAS, MAP2K1, MAP3K1, MED12, MEN1, MET, MLH1, MLL2, MLL3, MPL, MSH2, MSH6, MYD88, NCOR1, NF1, NF2, NFE2L2, NOTCH1, NOTCH2, NPM1, NRAS, PAX5, PBRM1, PDGFRA, PHF6, PIK3CA, PIK3R1, PPP2R1A, PRDM1, PTCH1, PTEN, PTPN11, RB1, RET, RNF43, RUNX1, SETD2, SETBP1, SF3B1, SMAD2, SMAD4, SMARCA4, SMARCB1, SMO, SOCS1, SOX9, SPOP, SRSF2, STAG2, STK11, TET2, TNFAIP3, TRAF7, TP53, TSC1, TSHR, U2AF1, VHL, WT1, CCND1, CDKN2C, IKZF1, LMO1, MAP2K4, MDM2, MDM4, MYC, MYCL1, MYCN, NCOA3, NKX2-1, or SKP2.
E81. The method of any of the preceding embodiments, wherein at least 6 and no more than 300 bp, at least 7 and no more than 200 bp, at least 8 bp and no more than 100 bp, at least 9 bp and no more than 60 bp, at least 10 bp and no more than 55 bp, at least 11 bp and no more than 50 bp, at least 12 bp and no more than 45 bp, at least 13 bp and no more than 40 bp, at least 14 bp and no more than 35 bp, at least 15 bp and no more than 34 bp, at least 14 bp and no more than 33 bp, at least 15 bp and no more than 32 bp, at least 16 bp and no more than 31 bp, at least 17 bp and no more than 30 bp, at least 18 bp and no more than 29 bp, at least 19 bp and no more than 28 bp, at least 20 bp and no more than 27 bp, is sequenced in each biomarker, e.g., each gene, e.g., each driver gene, e.g., each gene disclosed in Table 60 or 61 in US2019/0256924A1, e.g., ABL1, ACVR1B, AKT1, ALK, APC, AR, ARID1A, ARID1B, ARID2, ASXL1, ATM, ATRX, AXIN1, B2M, BAP1, BCL2, BCOR, BRAF, BRCA1, BRCA2, CARD11, CASP8, CBL, CDC73, CDH1, CDKN2A, CEBPA, CIC, CREBBP, CRLF2, CSF1R, CTNNB1, CYLD, DAXX, DNMT1, DNMT3A, EGFR, EP300, ERBB2, EZH2, FAM123B, FBXW7, FGFR2, FGFR3, FLT3, FOXL2, FUBP1, GATA1, GATA2, GATA3, GNA11, GNAQ, GNAS, H3F3A, HIST1H3B, HNF1A, HRAS, IDH1, IDH2, JAK1, JAK2, JAK3, KDM5C, KDM6A, KIT, KLF4, KRAS, MAP2K1, MAP3K1, MED12, MEN1, MET, MLH1, MLL2, MLL3, MPL, MSH2, MSH6, MYD88, NCOR1, NF1, NF2, NFE2L2, NOTCH1, NOTCH2, NPM1, NRAS, PAX5, PBRM1, PDGFRA, PHF6, PIK3CA, PIK3R1, PPP2R1A, PRDM1, PTCH1, PTEN, PTPN11, RB1, RET, RNF43, RUNX1, SETD2, SETBP1, SF3B1, SMAD2, SMAD4, SMARCA4, SMARCB1, SMO, SOCS1, SOX9, SPOP, SRSF2, STAG2, STK11, TET2, TNFAIP3, TRAF7, TP53, TSC1, TSHR, U2AF1, VHL, WT1, CCND1, CDKN2C, IKZF1, LMO1, MAP2K4, MDM2, MDM4, MYC, MYCL1, MYCN, NCOA3, NKX2-1, or SKP2.
E82. The method of any of the preceding embodiments, wherein about 33 bp is sequenced in each biomarker, e.g., each gene, e.g., each driver gene, e.g., each gene disclosed in Table 60 or 61 in US2019/0256924A1, e.g., ABL1, ACVR1B, AKT1, ALK, APC, AR, ARID1A, ARID1B, ARID2, ASXL1, ATM, ATRX, AXIN1, B2M, BAP1, BCL2, BCOR, BRAF, BRCA1, BRCA2, CARD11, CASP8, CBL, CDC73, CDH1, CDKN2A, CEBPA, CIC, CREBBP, CRLF2, CSF1R, CTNNB1, CYLD, DAXX, DNMT1, DNMT3A, EGFR, EP300, ERBB2, EZH2, FAM123B, FBXW7, FGFR2, FGFR3, FLT3, FOXL2, FUBP1, GATA1, GATA2, GATA3, GNA11, GNAQ, GNAS, H3F3A, HIST1H3B, HNF1A, HRAS, IDH1, IDH2, JAK1, JAK2, JAK3, KDM5C, KDM6A, KIT, KLF4, KRAS, MAP2K1, MAP3K1, MED12, MEN1, MET, MLH1, MLL2, MLL3, MPL, MSH2, MSH6, MYD88, NCOR1, NF1, NF2, NFE2L2, NOTCH1, NOTCH2, NPM1, NRAS, PAX5, PBRM1, PDGFRA, PHF6, PIK3CA, PIK3R1, PPP2R1A, PRDM1, PTCH1, PTEN, PTPN11, RB1, RET, RNF43, RUNX1, SETD2, SETBP1, SF3B1, SMAD2, SMAD4, SMARCA4, SMARCB1, SMO, SOCS1, SOX9, SPOP, SRSF2, STAG2, STK11, TET2, TNFAIP3, TRAF7, TP53, TSC1, TSHR, U2AF1, VHL, WT1, CCND1, CDKN2C, IKZF1, LMO1, MAP2K4, MDM2, MDM4, MYC, MYCL1, MYCN, NCOA3, NKX2-1, or SKP2.
E83. The method of any of the preceding embodiments, wherein detecting the biomarker comprises providing the sequence of the subgenomic interval or amplicon of no more than 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 40, 45, 50, 55, 60, 100, 200, or 300 bp in length, and wherein the subgenomic interval or the amplicon comprises the biomarker, e.g., a driver gene comprising a driver mutation.
E84. The method of any of the preceding embodiments, wherein detecting the biomarker comprises providing the sequence of the subgenomic interval or the amplicon of at least 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, or 20 bp, in length and wherein the subgenomic interval or the amplicon comprises the biomarker, e.g., a driver gene comprising a driver mutation.
E85. The method of any of the preceding embodiments, wherein detecting the biomarker comprises providing the sequence of a subgenomic interval or amplicon of at least 6 and no more than 300 bp, at least 7 and no more than 200 bp, at least 8 bp and no more than 100 bp, at least 9 bp and no more than 60 bp, at least 10 bp and no more than 55 bp, at least 11 bp and no more than 50 bp, at least 12 bp and no more than 45 bp, at least 13 bp and no more than 40 bp, at least 14 bp and no more than 35 bp, at least 15 bp and no more than 34 bp, at least 14 bp and no more than 33 bp, at least 15 bp and no more than 32 bp, at least 16 bp and no more than 31 bp, at least 17 bp and no more than 30 bp, at least 18 bp and no more than 29 bp, at least 19 bp and no more than 28 bp, at least 20 bp and no more than 27 bp, in length and wherein the subgenomic interval or amplicon comprises the biomarker, e.g., driver gene comprising a driver mutation.
E86. The method of any of the preceding embodiments, wherein detecting the biomarker comprises providing the sequence of a subgenomic interval or amplicon of between 6 bp and 300 bp, 7 bp and 200 bp, 8 bp and 100 bp, 9 bp and 60 bp, 10 bp and 50 bp, 15 bp and 40 bp, or 20 bp and 35 bp in length, and wherein the subgenomic interval or amplicon comprises the biomarker, e.g., driver gene comprising a driver mutation.
E87. The method of any of the preceding embodiments, wherein detecting the biomarker comprises providing the sequence of a subgenomic interval or amplicon of about 33 bp in length and wherein the subgenomic interval or amplicon comprises the biomarker, e.g., driver gene comprising a driver mutation.
E88. The method of any of the preceding embodiments, further comprising:
b) detecting the level of each of a plurality of, e.g., at least four, protein biomarkers in a biological sample, wherein the level of each protein biomarker of the plurality is associated with the presence of a cancer of the plurality of cancers;
(optionally) (c) comparing the detected levels of each protein biomarker of the plurality of protein biomarkers to a reference level for the protein biomarker; and d) identifying the presence of a cancer of the plurality of cancers in the subject when the presence of one or more genetic biomarkers and the level of one of the protein biomarkers of the plurality of protein biomarkers is detected.
E89. The method of any of the preceding embodiments, wherein:
(i) the subject has not yet been determined to have a cancer, e.g., a cancer selected from the plurality of cancers,
(ii) the subject has not yet been determined to harbor a cancer cell, e.g., a cancer cell selected from the plurality of cancers, or
(iii) the subject does not exhibit, or has not exhibited a symptom associated with a cancer, e.g., a cancer selected from the plurality of cancers.
E90. The method of any of the preceding embodiments, wherein the subject:
(i) is a pediatric subject or a young adult; e.g., aged 6 months-21 years; or
(ii) is an adult, e.g., aged 18 years or older.
E91. The method of any of the preceding embodiments, wherein the sample comprises a tumor sample, e.g., a biopsy sample (e.g., a liquid biopsy sample (e.g., a circulating tumor DNA sample, or a cell-free DNA sample) or a solid tumor biopsy sample); a blood sample (e.g., a circulating tumor DNA sample, or a cell-free DNA sample), an apheresis sample, a urine sample, a cyst fluid sample (e.g., a pancreatic cyst fluid sample), a Papanicolaou (Pap) sample, or a fixed tumor sample (e.g., a formalin fixed sample or a paraffin embedded sample (FFPE)).
E92. The method of any of the preceding embodiments, wherein the one or more, e.g., plurality of, genes comprises 1, 2, 3, or 4 genes from Tables 60 and 61 of US2019/0256924A1, e.g., ABL1, ACVR1B, AKT1, ALK, APC, AR, ARID1A, ARID1B, ARID2, ASXL1, ATM, ATRX, AXIN1, B2M, BAP1, BCL2, BCOR, BRAF, BRCA1, BRCA2, CARD11, CASP8, CBL, CDC73, CDH1, CDKN2A, CEBPA, CIC, CREBBP, CRLF2, CSF1R, CTNNB1, CYLD, DAXX, DNMT1, DNMT3A, EGFR, EP300, ERBB2, EZH2, FAM123B, FBXW7, FGFR2, FGFR3, FLT3, FOXL2, FUBP1, GATA1, GATA2, GATA3, GNA11, GNAQ, GNAS, H3F3A, HIST1H3B, HNF1A, HRAS, IDH1, IDH2, JAK1, JAK2, JAK3, KDM5C, KDM6A, KIT, KLF4, KRAS, MAP2K1, MAP3K1, MED12, MEN1, MET, MLH1, MLL2, MLL3, MPL, MSH2, MSH6, MYD88, NCOR1, NF1, NF2, NFE2L2, NOTCH1, NOTCH2, NPM1, NRAS, PAX5, PBRM1, PDGFRA, PHF6, PIK3CA, PIK3R1, PPP2R1A, PRDM1, PTCH1, PTEN, PTPN11, RB1, RET, RNF43, RUNX1, SETD2, SETBP1, SF3B1, SMAD2, SMAD4, SMARCA4, SMARCB1, SMO, SOCS1, SOX9, SPOP, SRSF2, STAG2, STK11, TET2, TNFAIP3, TRAF7, TP53, TSC1, TSHR, U2AF1, VHL, WT1, CCND1, CDKN2C, IKZF1, LMO1, MAP2K4, MDM2, MDM4, MYC, MYCL1, MYCN, NCOA3, NKX2-1, or SKP2.
E93. The method of any of the preceding embodiments, wherein the one or more, e.g., plurality of, genes comprises 5, 6, 7, or 8 genes, chosen from Tables 60 and 61 of US2019/0256924A1, e.g., ABL1, ACVR1B, AKT1, ALK, APC, AR, ARID1A, ARID1B, ARID2, ASXL1, ATM, ATRX, AXIN1, B2M, BAP1, BCL2, BCOR, BRAF, BRCA1, BRCA2, CARD11, CASP8, CBL, CDC73, CDH1, CDKN2A, CEBPA, CIC, CREBBP, CRLF2, CSF1R, CTNNB1, CYLD, DAXX, DNMT1, DNMT3A, EGFR, EP300, ERBB2, EZH2, FAM123B, FBXW7, FGFR2, FGFR3, FLT3, FOXL2, FUBP1, GATA1, GATA2, GATA3, GNA11, GNAQ, GNAS, H3F3A, HIST1H3B, HNF1A, HRAS, IDH1, IDH2, JAK1, JAK2, JAK3, KDM5C, KDM6A, KIT, KLF4, KRAS, MAP2K1, MAP3K1, MED12, MEN1, MET, MLH1, MLL2, MLL3, MPL, MSH2, MSH6, MYD88, NCOR1, NF1, NF2, NFE2L2, NOTCH1, NOTCH2, NPM1, NRAS, PAX5, PBRM1, PDGFRA, PHF6, PIK3CA, PIK3R1, PPP2R1A, PRDM1, PTCH1, PTEN, PTPN11, RB1, RET, RNF43, RUNX1, SETD2, SETBP1, SF3B1, SMAD2, SMAD4, SMARCA4, SMARCB1, SMO, SOCS1, SOX9, SPOP, SRSF2, STAG2, STK11, TET2, TNFAIP3, TRAF7, TP53, TSC1, TSHR, U2AF1, VHL, WT1, CCND1, CDKN2C, IKZF1, LMO1, MAP2K4, MDM2, MDM4, MYC, MYCL1, MYCN, NCOA3, NKX2-1, or SKP2.
E94. The method of any of the preceding embodiments, wherein the one or more, e.g., plurality of, genes is a gene selected from: NRAS, CTNNB1, PIK3CA, FBXW7, APC, EGFR, BRAF, CDKN2A, PTEN, FGFR2, HRAS, KRAS, AKT1, TP53, PPP2R1A, or GNAS.
E95. The method of any of the preceding embodiments, wherein the one or more, e.g., plurality of, biomarkers (e.g., one or more genes) is chosen from KRAS, PIK3CA, HRAS, CDKN2A, TP53, AKT1, CTNNB1, APC, EGFR, GNAS, PPP2R1A, BRAF, FBXW7, PTEN, or FGFR2, or a combination thereof, and the cancer is chosen from: liver cancer, ovarian cancer, esophageal cancer, stomach cancer, pancreatic cancer, colorectal cancer, lung cancer, breast cancer, or prostate cancer.
E96. The method of any of the preceding embodiments, wherein the one or more, e.g., plurality of, biomarkers (e.g., one or more genes) is chosen from KRAS, PIK3CA, HRAS, CDKN2A, TP53, TERT, ERBB2, FGFR3, MET, MLL, or VHL, or a combination thereof, and the cancer is chosen from a bladder cancer or upper tract urothelial carcinoma (UTUC).
E97. The method of any of the preceding embodiments, wherein the one or more, e.g., plurality of, biomarkers (e.g., one or more genes) is chosen from KRAS, PIK3CA, CDKN2A, TP53, CTNNB1, PPP2R1A, BRAF, PTEN, CSMD3, FAT3, BRCA, or ARID1A, or a combination thereof, and the cancer is an ovarian cancer or an endometrial cancer.
E98. The method of any of the preceding embodiments, wherein the one or more, e.g., plurality of, biomarkers (e.g., one or more genes) is chosen from KRAS, PIK3CA, CDKN2A, TP53, CTNNB1, GNAS, BRAF, NRAS, VHL, RNF43, or SMAD4, or a combination thereof, and the cancer is a pancreatic cancer, e.g., a pancreatic ductal adenocarcinoma (PDAC).
E99. The method of any of the preceding embodiments, wherein the one or more, e.g., plurality of biomarkers, comprises 5, 6, 7, or 8 protein biomarkers.
E100. The method of any of the preceding embodiments, wherein the one or more, e.g., plurality of biomarkers, comprises a protein biomarker selected from: CA19-9, CEA, HGF, OPN, CA125, prolactin (PRL), TIMP-1, CA15-3, AFP or MPO.
E101. The method of any of the preceding embodiments, wherein detecting the presence of one or more genetic biomarkers comprises:
a. assigning a unique identifier (UID) to each of a plurality of template molecules present in the sample;
b. amplifying each uniquely tagged template molecule to create UID-families; and
c. redundantly sequencing the amplification products.
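The three steps of embodiment E101 can be illustrated with a minimal sketch. It assumes that sequencing reads arrive as hypothetical (UID, sequence) pairs and that a UID-family is called only when it has enough members and enough of them agree. The function names and the `min_size` and `min_fraction` thresholds are illustrative assumptions and are not taken from this disclosure.

```python
from collections import Counter, defaultdict


def uid_families(reads):
    """Group redundantly sequenced reads by the UID of their template molecule.

    reads: iterable of (uid, sequence) pairs (hypothetical input format).
    """
    families = defaultdict(list)
    for uid, seq in reads:
        families[uid].append(seq)
    return dict(families)


def family_consensus(seqs, min_size=2, min_fraction=0.9):
    """Return the majority sequence of one UID-family, or None if unsupported.

    min_size and min_fraction are assumed thresholds for illustration only.
    """
    if len(seqs) < min_size:
        return None
    seq, count = Counter(seqs).most_common(1)[0]
    return seq if count / len(seqs) >= min_fraction else None


reads = [
    ("AAT", "ACGT"), ("AAT", "ACGT"), ("AAT", "ACGT"),  # concordant family
    ("CCG", "ACTT"), ("CCG", "ACGT"),                   # discordant family
    ("GGA", "ACGT"),                                    # single-read family
]
calls = {uid: family_consensus(seqs) for uid, seqs in uid_families(reads).items()}
```

In this sketch only the concordant family yields a consensus sequence; a variant seen in a minority of reads within a family is treated as a likely amplification or sequencing error rather than a property of the template.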
E102. The method of any of the preceding embodiments, further comprising detecting the presence of aneuploidy in the sample, e.g., detecting gain or loss in one or more chromosomes, e.g., using the WALDO method as described in Example 6.
E103. The method of embodiment E102, wherein the method comprises: (i) estimating somatic mutation load; (ii) estimating carcinogen signature; and/or (iii) detecting microsatellite instability (MSI).
E104. The method of embodiment E102 or E103, wherein the method can be used to compare two samples, e.g., two unrelated samples, to evaluate genetic similarities between the samples or to find somatic mutations within the samples, e.g., within the LINE elements in the sample.
E105. The method of embodiment E102 or E103, wherein the method results in an increase in specificity and/or sensitivity of aneuploidy detection.
E106. The method of embodiment E102, wherein the presence of aneuploidy is detected on one or more chromosome arms.
E107. The method of any of the preceding embodiments, further comprising responsive to a value of: a genetic marker, a protein biomarker and/or aneuploidy status, assigning an origin or cancer type to the cancer.
E108. The method of any one of the preceding embodiments, wherein responsive to a value of: a genetic marker, a protein biomarker and/or aneuploidy status, the method comprises identifying the subject as having a cancer, or having a risk of developing a cancer.
E109. The method of embodiment E108, further comprising administering to the subject a therapeutic agent to treat the cancer, or selecting a therapeutic agent for treating the cancer in the subject.
E110. The method of embodiment E109, wherein the subject is administered the therapeutic agent in combination with one or more additional therapeutic agents.
E111. A reaction mixture comprising:
at least 2, 3, 4, 5, 6, 7, 8, 9 or 10 detection reagents, wherein a detection reagent mediates a readout that is a value of the level or presence of:
(i) one or more genetic biomarkers referred to herein;
(ii) one or more protein biomarkers referred to herein; and/or
(iii) the copy number or length, e.g., aneuploidy, of a genomic sequence disposed between at least two terminal repeated elements of a repeated element family (RE Family) referred to herein.
E112. The reaction mixture of embodiment E111, comprising a plurality of detection reagents for (i).
E113. The reaction mixture of any of embodiments E111-E112, comprising a plurality of detection reagents for (ii).
E114. The reaction mixture of any of embodiments E111-E113, comprising a plurality of detection reagents for (iii).
E115. The reaction mixture of any of embodiments E111-E114, comprising a sample from a subject, e.g., a subject sample.
E116. A kit comprising:
(a) at least 2, 3, 4, 5, 6, 7, 8, 9 or 10 detection reagents, wherein a detection reagent mediates a readout that is a value of the level or presence of:
(i) one or more genetic biomarkers referred to herein;
(ii) one or more protein biomarkers referred to herein; and/or
(iii) the copy number or length, e.g., aneuploidy, of a genomic sequence disposed between at least two terminal repeated elements of a repeated element family (RE Family) referred to herein; and
(b) instructions for using said kit.
E117. The kit of embodiment E116, comprising a plurality of detection reagents for (i).
E118. The kit of any of embodiments E116-E117, comprising a plurality of detection reagents for (ii).
E119. The kit of any of embodiments E116-E118, comprising a plurality of detection reagents for (iii).
E120. The method of any one of embodiments E1-E110, wherein aneuploidy status is evaluated, e.g., determined, using a first primer and a second primer.
E121. The method of embodiment E120, wherein the first primer comprises a sequence that is at least 80%, 85%, 90%, 95%, 96%, 97%, 98%, 99%, or 100% identical to SEQ ID NO: 1.
E122. The method of embodiment E121, wherein the first primer comprises the sequence of SEQ ID NO: 1.
E123. The method of embodiment E120, wherein the second primer comprises a sequence that is at least 80%, 85%, 90%, 95%, 96%, 97%, 98%, 99%, or 100% identical to SEQ ID NO: 10.
E124. The method of embodiment E123, wherein the second primer comprises the sequence of SEQ ID NO: 10.
E125. The method of any one of embodiments E1-E110, or E120-E124, further comprising subjecting the subject to a radiologic scan, e.g., a PET-CT scan, of an organ or body region.
E126. The method of embodiment E125, wherein the radiologic scanning of an organ or body region characterizes the cancer.
E127. The method of embodiment E125, wherein the radiologic scanning of an organ or body region identifies the location of the cancer.
E128. The method of any one of embodiments E125-E127, wherein the radiologic scan is a PET-CT scan.
E129. The method of any one of embodiments E125-E128, wherein the radiologic scanning is performed after the subject is evaluated for the presence of each of a plurality of cancers.
E130. The method of any one of embodiments E1-E110, or E120-E129, comprising administering to the subject one or more therapeutic interventions (e.g., surgery, adjuvant chemotherapy, neoadjuvant chemotherapy, radiation therapy, immunotherapy, targeted therapy, and/or an immune checkpoint inhibitor).
E131. The method of any one of embodiments E1-E110, or E120-E130, wherein the evaluation comprises evaluating a sample from the subject at one time point or at different time points.
E132. The method of any one of embodiments E1-E110, or E120-E131, comprising evaluating one or more samples, e.g., multiple samples, obtained from the subject.
E133. The method of E132, wherein the one or more samples, e.g., multiple samples, are obtained yearly, e.g., within 1 year of one another.
E134. The method of any of embodiments E1-E110, or E120-E133, wherein the subject is evaluated simultaneously for the presence or absence of each of a plurality of cancers.
E135. The method of any of embodiments E1-E110, or E120-E134, wherein the subject is co-evaluated for the presence or absence of each of a plurality of cancers.
E136. The method of any of embodiments E1-E110, or E120-E135, comprising evaluating the presence of each of a plurality of cancers in a subject at one or more time points within a predetermined interval, e.g., at the same or substantially the same clinical stage of at least one of the cancers in the subject.
E137. The method of any of embodiments E1-E110, or E120-E136, comprising evaluating a sample, e.g., a single sample or multiple samples, obtained from the subject.
E138. The method of any of embodiments E1-E110, or E120-E137, wherein co-evaluation is performed on a single sample, aliquots of a single sample, or a plurality of samples taken, e.g., within 1, 5, 24 or 48 hours, of one another.
E139. The method of any of embodiments E1-E110, or E120-E138, wherein the subject is asymptomatic for cancer.
E140. The method of any of embodiments E1-E110, or E120-E139, wherein the subject is asymptomatic for a cancer of the plurality.
E141. The method of any of embodiments E1-E110, or E120-E140, wherein the subject is not known or determined to harbor a cancer cell.
E142. The method of any of embodiments E1-E110, or E120-E141, wherein the subject has not been determined to have or diagnosed with a cancer.
E143. The method of any of embodiments E1-E110, or E120-E142, wherein the subject has an early stage cancer, e.g., Stage I or Stage II.
E144. The method of any of embodiments E1-E110, or E120-E143, wherein the subject is pre-metastatic.
E145. The method of any of embodiments E1-E110, or E120-E144, wherein the subject has no detectable metastasis.
E146. The method of any of embodiments E1-E110, or E120-E145, wherein the subject has not exhibited a symptom associated with a cancer.
E147. The method of any of embodiments E1-E110, or E120-E146, wherein the subject does not display one, two or more symptoms clinically associated with the cancer.
E148. The method of any of embodiments E1-E110, or E120-E147, wherein when the aneuploidy status is positive, the subject has an early stage cancer, e.g., Stage I or Stage II, e.g., as provided in Table 3.
E149. The method of any of embodiments E1-E110, or E120-E147, wherein when the aneuploidy status is negative, the subject has an early stage cancer, e.g., Stage I or Stage II, e.g., as provided in Table 3.
E150. A method of detecting aneuploidy in a sample comprising low input DNA.
E151. The method of any of embodiments E1-E110, or E120-E150, wherein the sample comprises about 0.01 picogram (pg) to 500 pg of DNA.
E152. The method of embodiment E151, wherein the sample comprises about 0.01-500 pg, 0.05-400 pg, 0.1-300 pg, 0.5-200 pg, 1-100 pg, 10-90 pg, or 20-50 pg DNA.
E153. The method of embodiment E151, wherein the sample comprises at least 0.01 pg, at least 0.1 pg, at least 1 pg, at least 2 pg, at least 3 pg, at least 4 pg, at least 5 pg, at least 6 pg, at least 7 pg, at least 8 pg, at least 9 pg, at least 10 pg, at least 11 pg, at least 12 pg, at least 13 pg, at least 14 pg, at least 15 pg, at least 16 pg, at least 17 pg, at least 18 pg, at least 19 pg, at least 20 pg, at least 21 pg, at least 22 pg, at least 23 pg, at least 24 pg, at least 25 pg, at least 26 pg, at least 27 pg, at least 28 pg, at least 29 pg, at least 30 pg, at least 31 pg, at least 32 pg, at least 33 pg, at least 34 pg, at least 35 pg, at least 36 pg, at least 37 pg, at least 38 pg, at least 39 pg, at least 40 pg, at least 50 pg, at least 60 pg, at least 70 pg, at least 80 pg, at least 90 pg, at least 100 pg, at least 150 pg, at least 200 pg, at least 300 pg, at least 350 pg, at least 400 pg, at least 450 pg, or at least 500 pg DNA.
E154. A method of identifying or distinguishing a sample, e.g., using any of the methods disclosed herein.
E155. The method of embodiment E154, wherein a sample, e.g., a first sample, from a subject, e.g., a first subject, is distinguished from a second sample from a second subject.
E156. The method of embodiment E154, wherein a sample is identified as being from a subject based on a polymorphism (e.g., a plurality of polymorphisms, e.g., common polymorphisms).
E157. The method of embodiment E156, wherein a polymorphism, e.g., a common polymorphism, is present in a repetitive element, e.g., as described herein.
E158. The method of embodiment E154, wherein a method disclosed in Example 8 is used to identify and/or distinguish the sample.
E159. The method of any of embodiments E1-E110, or E120-E158, wherein the method is an in vitro method.
Unless otherwise defined, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention pertains. Although methods and materials similar or equivalent to those described herein can be used to practice the invention, suitable methods and materials are described below. All publications, patent applications, patents, and other references mentioned herein are incorporated by reference in their entirety. In case of conflict, the present specification, including definitions, will control. In addition, the materials, methods, and examples are illustrative only and not intended to be limiting.
The details of one or more embodiments of the invention are set forth in the accompanying drawings and the description below. Other features, objects, and advantages of the invention will be apparent from the description and drawings, and from the claims.
The term “driver gene mutation” or “driver mutation” as used herein, refers to a mutation that (i) occurs in a driver gene; and (ii) provides a growth advantage to the cell in which it occurs. A growth advantage for a cell can include:
a) an increase in the rate of cell division in a cell having a driver gene mutation, e.g., an increase in rate of cell division as compared to a reference cell, e.g., to an otherwise similar cell, e.g., an otherwise similar cell adjacent to the cell, e.g., as compared to a cell of the same type not having the driver gene mutation;
b) an increase in the rate of clonal expansion in a cell having a driver gene mutation, e.g., an increase in rate of clonal expansion as compared to a reference cell, e.g., to an otherwise similar cell, e.g., an otherwise similar cell adjacent to the cell, e.g., as compared to a cell of the same type not having the driver mutation;
c) an increase in the number of cells that are progeny, e.g., a daughter cell, of the cell that has the driver gene mutation, e.g., an increase in number of progeny cells compared to the number of progeny cells expected if the cell did not have the driver gene mutation;
d) an increase in the ability to form tumors or promote tumor growth, e.g., tumor progression, e.g., as compared to a reference cell, e.g., to an otherwise similar cell not having the driver gene mutation; or
e) presence or appearance at a second or subsequent site or location in the subject.
In an embodiment, a driver gene mutation provides a 0.1-5%, e.g., a 0.1-4.5%, 0.1-4%, 0.1-3.5%, 0.1-3%, 0.1-2.5%, 0.1-2%, 0.1-1.5%, 0.1-1%, 0.1-0.5%, 0.5-5%, 1-5%, 1.5-5%, 2-5%, 2.5-5%, 3-5%, 3.5-5%, 4-5%, 4.5-5%, 0.5-4.5%, 1-4%, 1.5-3.5%, or 2-3%, growth advantage, e.g., increase in the difference between cell birth and cell death. In an embodiment, a driver gene mutation provides at least 0.1%, 0.2%, 0.3%, 0.4%, 0.5%, 0.6%, 0.7%, 0.8%, 0.9%, 1%, 1.5%, 2%, 2.5%, 3%, 3.5%, 4%, or 4.5%, e.g., about a 0.4%, growth advantage, e.g., increase in the difference between cell birth and cell death. In an embodiment, a driver gene mutation provides a proliferative capacity to the cell in which it occurs, e.g., allows for cell expansion, e.g., clonal expansion.
In some embodiments, the driver gene mutation can be causally linked to cancer progression.
In an embodiment, the driver gene mutation affects, e.g., alters the regulation, expression or function of, a protein coding gene. In an embodiment, a driver gene mutation affects, e.g., alters the function of, a noncoding region, e.g., non-protein coding region. In an embodiment, a driver gene mutation includes: a translocation, a deletion (e.g., a homozygous deletion), an insertion (e.g., an intragenic insertion), a small insertion and deletion (indels), a single base substitution (e.g., a synonymous mutation, non-synonymous mutation, nonsense mutation or a frameshift mutation), a copy number variation (CNV) (e.g., an amplification), or a single nucleotide variation (SNV) (e.g., a single nucleotide polymorphism (SNP)). Exemplary driver mutations are disclosed in Tables 60 and 61 of US2019/0256924A1.
In some embodiments, the presence of a driver gene mutation in a cell can alter (e.g., increase or decrease) the expression of the gene product in that cell. In some embodiments, the presence of a driver gene mutation in a cell can alter the function of the gene product. In some cases, the presence of a driver gene mutation in a cell can provide that cell with a growth advantage. For example, the presence of a driver gene mutation in a cell can cause an increase in the rate of proliferation (e.g., as compared to a reference cell). For example, the presence of a driver gene mutation in a cell can cause an increase in the rate of clonal expansion in a cell having a driver gene mutation (e.g., as compared to a reference cell). For example, the presence of a driver gene mutation in a cell can cause an increase in the number of progeny cells derived from the cell having the driver gene mutation (e.g., as compared to a reference cell). For example, the presence of a driver gene mutation in a cell can cause an increase in the ability of the cell to form a tumor (e.g., as compared to a reference cell). In some cases, a growth advantage can be measured as an increase in the difference between cytogenesis (e.g., the formation of new cells) and cell death. For example, the presence of a driver gene mutation in a cell can provide that cell with a growth advantage of at least about 0.1% (e.g., about 0.2%, about 0.3%, about 0.4%, about 0.5%, about 0.6%, about 0.7%, about 0.8%, about 0.9%, about 1%, about 1.5%, about 2%, about 2.5%, about 3%, about 3.5%, about 4%, about 4.5%, or more).
For example, the presence of a driver gene mutation in a cell can provide that cell with a growth advantage of from about 0.1% to about 5% (e.g., from about 0.1 to about 5%, from about 0.1 to about 4.5%, from about 0.1 to about 4%, from about 0.1 to about 3.5%, from about 0.1 to about 3%, from about 0.1 to about 2.5%, from about 0.1 to about 2%, from about 0.1 to about 1.5%, from about 0.1 to about 1%, from about 0.1 to about 0.5%, from about 0.5 to about 5%, from about 1 to about 5%, from about 1.5 to about 5%, from about 2 to about 5%, from about 2.5 to about 5%, from about 3 to about 5%, from about 3.5 to about 5%, from about 4 to about 5%, from about 4.5 to about 5%, from about 0.5 to about 4.5%, from about 1 to about 4%, from about 1.5 to about 3.5%, or from about 2 to about 3%).
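As a worked illustration of a growth advantage measured as an increase in the difference between cell birth and cell death, the following minimal sketch assumes that birth and death are expressed as per-generation probabilities for a cell carrying the mutation and for a reference cell. This parameterization, the function names, and the example values are assumptions made for illustration; the disclosure does not prescribe a particular calculation.

```python
def growth_advantage_percent(birth, death, ref_birth, ref_death):
    """Change in (birth - death) relative to a reference cell, in percentage points.

    All four arguments are hypothetical per-generation probabilities (0 to 1).
    """
    return ((birth - death) - (ref_birth - ref_death)) * 100.0


def in_stated_range(advantage_percent, low=0.1, high=5.0):
    """True when the advantage falls within the 0.1% to 5% range given in the text."""
    return low <= advantage_percent <= high


# Hypothetical cell: birth probability raised from 0.500 to 0.504, death unchanged.
adv = growth_advantage_percent(0.504, 0.500, 0.500, 0.500)
print(round(adv, 3), in_stated_range(adv))
```

With these assumed inputs the advantage is about 0.4 percentage points, which corresponds to the "about a 0.4%" example value mentioned in the text.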
In some cases, a driver gene can include more than one (e.g., two, three, four, five, six, seven, eight, nine, ten, or more) driver gene mutations. In some cases, a driver gene including one or more driver gene mutations also can include one or more additional mutations (e.g., passenger gene mutations (somatic mutations which are not a driver mutation)).
The term “driver gene” as used herein, refers to a gene which includes a driver gene mutation. In one embodiment, the driver gene is a gene in which one or more (e.g., one, two, three, four, five, six, seven, eight, nine, ten, or more) acquired mutations, e.g., driver gene mutations, can be causally linked to cancer progression. In an embodiment, a driver gene modulates one or more cellular processes including: cell fate determination, cell survival and genome maintenance. A driver gene can be associated with (e.g., can modulate) one or more signaling pathways. Examples of signaling pathways include, without limitation, a TGF-beta pathway, a MAPK pathway, a STAT pathway, a PI3K pathway, a RAS pathway, a cell cycle pathway, an apoptosis pathway, a NOTCH pathway, a Hedgehog (HH) pathway, an APC pathway, a chromatin modification pathway, a transcriptional regulation pathway, and a DNA damage control pathway. Examples of driver genes include, without limitation, ABL1, ACVR1B, AKT1, ALK, APC, AR, ARID1A, ARID1B, ARID2, ASXL1, ATM, ATRX, AXIN1, B2M, BAP1, BCL2, BCOR, BRAF, BRCA1, BRCA2, CARD11, CASP8, CBL, CDC73, CDH1, CDKN2A, CEBPA, CIC, CREBBP, CRLF2, CSF1R, CTNNB1, CYLD, DAXX, DNMT1, DNMT3A, EGFR, EP300, ERBB2, EZH2, FAM123B, FBXW7, FGFR2, FGFR3, FLT3, FOXL2, FUBP1, GATA1, GATA2, GATA3, GNA11, GNAQ, GNAS, H3F3A, HIST1H3B, HNF1A, HRAS, IDH1, IDH2, JAK1, JAK2, JAK3, KDM5C, KDM6A, KIT, KLF4, KRAS, MAP2K1, MAP3K1, MED12, MEN1, MET, MLH1, MLL2, MLL3, MPL, MSH2, MSH6, MYD88, NCOR1, NF1, NF2, NFE2L2, NOTCH1, NOTCH2, NPM1, NRAS, PAX5, PBRM1, PDGFRA, PHF6, PIK3CA, PIK3R1, PPP2R1A, PRDM1, PTCH1, PTEN, PTPN11, RB1, RET, RNF43, RUNX1, SETD2, SETBP1, SF3B1, SMAD2, SMAD4, SMARCA4, SMARCB1, SMO, SOCS1, SOX9, SPOP, SRSF2, STAG2, STK11, TET2, TNFAIP3, TRAF7, TP53, TSC1, TSHR, U2AF1, VHL, WT1, CCND1, CDKN2C, IKZF1, LMO1, MAP2K4, MDM2, MDM4, MYC, MYCL1, MYCN, NCOA3, NKX2-1, and SKP2. Exemplary driver genes include oncogenes and tumor suppressors. 
In an embodiment, a driver gene has one or more driver gene mutations, e.g., as described herein. In an embodiment, a driver gene is a gene listed in Tables 60 or 61 in US2019/0256924A1. In an embodiment, a driver gene is a gene that modulates one or more cellular processes described in Tables 60 or 61 in US2019/0256924A1, e.g., cell fate determination, cell survival and genome maintenance. In an embodiment, a driver gene is a gene that modulates one or more pathways described in Tables 60 or 61 in US2019/0256924A1. In an embodiment, a driver gene is a gene that modulates one or more signaling pathways described in Table 62 of US2019/0256924A1.
In an embodiment, a driver gene includes more than one driver mutation, and the first driver gene mutation provides a selective growth advantage to the cell in which it occurs. In an embodiment, the subsequent mutation, e.g., second, third, fourth, fifth or later mutation, e.g., driver mutation in the driver gene, provides a proliferative capacity to the cell in which it occurs, e.g., allows for cell expansion, e.g., clonal expansion. In an embodiment, a driver gene has one or more passenger gene mutations, e.g., a somatic mutation that arises in the development of a cancer but which is not a driver mutation. In an embodiment, a driver gene can be present, e.g., expressed, in any cell type, e.g., a cell type derived from any one of the three germ cell layers: ectoderm, endoderm or mesoderm. In an embodiment, a driver gene is present, e.g., expressed, in a somatic cell. In an embodiment, a driver gene is present, e.g., expressed, in a germ cell. In an embodiment, a driver gene can be present in a large number of cancers, e.g., in more than 5% of cancers. In an embodiment, a driver gene can be present in a small number of cancers, e.g., in less than 5% of cancers. In an embodiment, a driver gene has a mutation pattern that is non-random and/or recurrent, i.e., the location at which a driver mutation occurs in the driver gene is the same in different cancer types. Exemplary recurrent driver gene mutations include mutations in the IDH1 gene at the substrate binding site, e.g., at codon 132, and mutations in the PIK3CA gene in the helical domain or the kinase domain, as depicted in Vogelstein et al. (2013) Science 339: 1546-1558.
In an embodiment, a driver gene having a driver gene mutation is an oncogene. In an embodiment, an oncogene is a gene with an oncogene score of at least 20%, e.g., at least 25%, 30%, 35%, 40%, 45%, 50%, 55%, 60%, 65%, 70%, 75%, 80%, 85%, 90%, 95%, 99% or 100%. In an embodiment, an oncogene score is defined as the number of mutations, e.g., clustered mutations (e.g., missense mutations at the same amino acid, or identical in-frame insertions or deletions) divided by the total number of mutations. In an embodiment, a driver gene having an amplification, e.g., as described herein, is an oncogene. In an embodiment, a driver gene having a driver gene mutation is a tumor suppressor gene (TSG). In an embodiment, a tumor suppressor gene is a gene with a tumor suppressor gene score of at least 20%, e.g., at least 25%, 30%, 35%, 40%, 45%, 50%, 55%, 60%, 65%, 70%, 75%, 80%, 85%, 90%, 95%, 99% or 100%. In an embodiment, a tumor suppressor gene score is defined as the number of inactivating mutations divided by the total number of mutations. In an embodiment, a driver gene having a deletion, e.g., as described herein, is a tumor suppressor gene.
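The oncogene score and tumor suppressor gene score defined above can be sketched as follows. The sketch assumes that each mutation observed in a gene is recorded as a hypothetical (kind, site) pair, that a mutation counts as clustered when an identical missense or in-frame change is seen at least twice, and that nonsense, frameshift, and splice-site mutations count as inactivating. The kind labels, the recurrence rule, and the set of inactivating classes are assumptions for illustration and are not specified in this disclosure.

```python
from collections import Counter

# Assumed labels; the text does not enumerate mutation classes in this form.
CLUSTERABLE = {"missense", "inframe_indel"}
INACTIVATING = {"nonsense", "frameshift", "splice_site"}


def gene_scores(mutations):
    """Return (oncogene_score, tsg_score) for one gene.

    mutations: list of (kind, site) pairs, where site identifies the affected
    amino acid or the exact in-frame change (hypothetical input format).
    """
    total = len(mutations)
    if total == 0:
        return 0.0, 0.0
    counts = Counter(m for m in mutations if m[0] in CLUSTERABLE)
    # A mutation is treated as clustered when the same change recurs.
    clustered = sum(n for n in counts.values() if n >= 2)
    inactivating = sum(1 for kind, _ in mutations if kind in INACTIVATING)
    return clustered / total, inactivating / total


def classify(mutations, threshold=0.20):
    """Apply the 20% minimum score from the text to label a gene."""
    oncogene_score, tsg_score = gene_scores(mutations)
    labels = []
    if oncogene_score >= threshold:
        labels.append("oncogene")
    if tsg_score >= threshold:
        labels.append("tumor suppressor gene")
    return labels


# Hypothetical gene with eight identical missense mutations out of ten.
example = [("missense", "R132")] * 8 + [("missense", "G70"), ("nonsense", "Q50")]
scores = gene_scores(example)
```

For the hypothetical example, eight of ten mutations are clustered and one of ten is inactivating, so only the oncogene score meets the 20% minimum.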
The phrase “repeated element family” or “RE family” as used herein, refers to a family of repeat DNA elements (also known as repetitive DNA elements or repeating units or DNA repeats) which are present in the genome of an organism. A DNA repeat element can be interspersed throughout the genome of an organism or can be present in select chromosomes. An RE family can include one or more repeat DNA elements. Exemplary RE families in the human genome include: interspersed repeats (e.g., long interspersed nucleotide elements (LINE); short interspersed nucleotide elements (SINE)); and tandem repeats (e.g., microsatellites, mini-satellites, satellite DNA or multiple copy genes (e.g., ribosomal RNA)). In some embodiments, an RE family includes one or more repeat elements listed in Table 1, e.g., SINE.
“Acquire” or “acquiring” as the terms are used herein, refer to obtaining possession of a physical entity, or a value, e.g., a numerical value, by “directly acquiring” or “indirectly acquiring” the physical entity or value. “Directly acquiring” as the term is used herein refers to performing a process (e.g., performing a synthetic or analytical method) to obtain the physical entity or value. “Indirectly acquiring” as the term is used herein refers to receiving the physical entity or value from another party or source (e.g., a third party laboratory that directly acquired the physical entity or value). Directly acquiring a physical entity includes performing a process that includes a physical change in a physical substance, e.g., a starting material. Directly acquiring a value includes performing a process that includes a physical change in a sample or another substance, e.g., performing an analytical process which includes a physical change in a substance, e.g., a sample, analyte, or reagent (sometimes referred to herein as “physical analysis”), performing an analytical method, e.g., a method which includes one or more of the following: separating or purifying a substance, e.g., an analyte, or a fragment or other derivative thereof, from another substance; combining an analyte, or fragment or other derivative thereof, with another substance, e.g., a buffer, solvent, or reactant; or changing the structure of an analyte, or a fragment or other derivative thereof.
“Biological sample,” “sample,” “patient sample,” or “specimen” as the terms are used herein, each refer to a sample obtained from a subject or a patient. The source of the sample can be a biopsy (e.g., a liquid biopsy); an aspirate; blood or any blood constituents; or bodily fluids (e.g., cerebral spinal fluid, amniotic fluid, peritoneal fluid or interstitial fluid). The sample can comprise cells (e.g., any cell from a human body, e.g., normal cells and/or cancer cells) and/or cell-free DNA, e.g., circulating tumor DNA or circulating DNA from a normal cell. In an embodiment, the sample, e.g., the tumor sample, includes tissue or cells from a surgical margin. In another embodiment, the sample, e.g., tumor sample, includes one or more circulating tumor cells (CTC) (e.g., a CTC acquired from a blood sample).
As used herein, the term “sensitivity” refers to the ability of a method to detect or identify the presence of a disease in a subject. For example, when used in reference to any of the variety of methods described herein that can detect the presence of cancer in a subject, a high sensitivity means that the method correctly identifies the presence of cancer in the subject a large percentage of the time. For example, a method described herein that correctly detects the presence of cancer in a subject 95% of the time the method is performed is said to have a sensitivity of 95%. In some embodiments, a method described herein that can detect the presence of cancer in a subject provides a sensitivity of at least 70% (e.g., about 70%, about 72%, about 75%, about 80%, about 85%, about 90%, about 91%, about 92%, about 93%, about 94%, about 95%, about 96%, about 97%, about 98%, about 99%, about 99.5%, or about 100%). In some embodiments, methods provided herein that include detecting the presence of one or more members of two or more classes of biomarkers (e.g., genetic biomarkers and/or protein biomarkers) provide a higher sensitivity than methods that include detecting the presence of one or more members of only one class of biomarkers.
In some embodiments, sensitivity provides a measure of the ability of a method to detect a sequence variant in a heterogeneous population of sequences. A method has a sensitivity of S % for variants of F % if, given a sample in which the sequence variant is present as at least F % of the sequences in the sample, the method can detect the sequence at a confidence of C %, S % of the time. By way of example, a method has a sensitivity of 90% for variants of 5% if, given a sample in which the variant sequence is present as at least 5% of the sequences in the sample, the method can detect the sequence at a confidence of 99%, 9 out of 10 times (F=5%; C=99%; S=90%). Exemplary sensitivities include those of S=90%, 95%, 99%, 99.9% for sequence variants at F=0.5%, 1%, 5%, 10%, 20%, 50%, 100% at confidence levels of C=90%, 95%, 99%, and 99.9%.
As discussed above, in embodiments, sensitivity is the ability of a test method to make an assignment of a first state identity to all first state samples, in other words, to find or identify all first state samples. (Sensitivity does not address the propensity of a method to mis-assign a first state sample as a second state sample). In an embodiment the first state is negativity, and sensitivity is the ability to identify all negative samples. In an embodiment the first state is positivity, and sensitivity is the ability to identify all positive samples.
As used herein, the term “specificity” refers to the ability of a method to correctly identify the absence of a disease in a subject (e.g., the specificity of a method can be described as the ability of the method to identify true negatives and/or to distinguish a truly occurring sequence variant from a sequencing artifact or other closely related sequences). For example, when used in reference to any of the variety of methods described herein that can detect the presence of cancer in a subject, a high specificity means that the method correctly identifies the absence of cancer in the subject a large percentage of the time (e.g., the method does not incorrectly identify the presence of cancer in the subject a large percentage of the time). A method has a specificity of X % if, when applied to a sample set of N_Total sequences, in which X_True sequences are truly variant and X_Not true are not truly variant, the method can select at least X % of the not truly variant as not variant. For example, a method has a specificity of 90% if, when applied to a sample set of 1,000 sequences, in which 500 sequences are truly variant and 500 are not truly variant, the method selects 90% of the 500 not truly variant sequences as not variant. For example, a method described herein that correctly detects the absence of cancer in a subject 95% of the time the method is performed is said to have a specificity of 95%. In some embodiments, a method described herein that can detect the absence of cancer in a subject provides a specificity of at least 80% (e.g., at least 80%, 85%, 90%, 91%, 92%, 93%, 94%, 95%, 96%, 97%, 98%, 99%, 99.5%, or higher). A method having high specificity results in minimal or no false positive results (e.g., as compared to other methods). False positive results can arise from any source. 
For example, in various methods described herein that correctly detect the absence of cancer and include sequencing a nucleic acid, false positives can result from errors introduced into the sequence of interest during sample preparation, sequencing errors, and/or inadvertent sequencing of closely related sequences such as pseudo-genes or members of a gene family. In some embodiments, methods provided herein that include detecting the presence of one or more members of two or more classes of biomarkers (e.g., genetic biomarkers and/or protein biomarkers) provide a higher specificity than methods that include detecting the presence of one or more members of only one class of biomarkers.
As discussed above, in embodiments, specificity is the ability of a test method to make a true assignment of a first state identity to a sample. (Specificity does not address the ability of the method to find all true first state samples, that is sensitivity). In an embodiment the first state is negativity, and specificity is the ability to make true (as opposed to incorrect) assignments of negativity (and not mis-assign second state (e.g., positive) samples as first state (negative) sample). In an embodiment the first state is positivity, and specificity is the ability to make true (as opposed to incorrect) assignments of positivity (and not mis-assign second state (e.g., negative) samples as first state (positive) samples).
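The percentage-based definitions of sensitivity and specificity given above can be illustrated with a short Python sketch. It assumes the conventional count-based formulas (true positives over all truly positive samples, and true negatives over all truly negative samples); the function names are hypothetical, and the second example reuses the 1,000-sequence illustration from the text.

```python
def sensitivity(true_positives: int, false_negatives: int) -> float:
    """Fraction of truly positive samples that the method calls positive."""
    positives = true_positives + false_negatives
    if positives <= 0:
        raise ValueError("need at least one truly positive sample")
    return true_positives / positives


def specificity(true_negatives: int, false_positives: int) -> float:
    """Fraction of truly negative samples that the method calls negative."""
    negatives = true_negatives + false_positives
    if negatives <= 0:
        raise ValueError("need at least one truly negative sample")
    return true_negatives / negatives


# A method that detects cancer in 95 of 100 subjects who have cancer.
print(f"sensitivity = {sensitivity(95, 5):.0%}")

# Of 500 sequences that are not truly variant, 450 are called not variant
# and the remaining 50 are false positives.
print(f"specificity = {specificity(450, 50):.0%}")
```

The first call corresponds to the 95% sensitivity example and the second to the 90% specificity example described above.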
As used herein, the phrase “subgenomic interval” refers to a portion of a genomic sequence. A subgenomic interval can be any appropriate size (e.g., can include any appropriate number of nucleotides). In some embodiments, a subgenomic interval can include a single nucleotide (e.g., single nucleotide for which variants thereof are associated (positively or negatively) with a tumor phenotype). In some embodiments, a subgenomic interval can include more than one nucleotide. For example, a subgenomic interval can include at least about 2 (e.g., about 5, about 10, about 50, about 100, about 150, about 250, or about 300) nucleotides. In some cases, a subgenomic interval can include an entire gene. In some cases, a subgenomic interval can include a portion of gene (e.g., a coding region such as an exon, a non-coding region such as an intron, or a regulatory region such as a promoter, enhancer, 5′ untranslated region (5′ UTR), or 3′ untranslated region (3′ UTR)). In some cases, a subgenomic interval can include all or part of a naturally occurring (e.g., genomic) nucleotide sequence. For example, a subgenomic interval can correspond to a fragment of genomic DNA which can be subjected to a sequencing reaction. In some cases, a subgenomic interval can be a continuous nucleotide sequence from a genomic source. In some cases, a subgenomic interval can include nucleotide sequences that are not contiguous within the genome. For example, a subgenomic interval can include a nucleotide sequence that includes an exon-exon junction (e.g., in cDNA reverse transcribed from the subgenomic interval). 
In some cases, a subgenomic interval can include a mutation (e.g., an SNV, an SNP, a somatic mutation, a germ line mutation, a point mutation, a rearrangement, a deletion mutation (e.g., an in-frame deletion, an intragenic deletion, or a full gene deletion), an insertion mutation (e.g., an intragenic insertion), an inversion mutation (e.g., an intra-chromosomal inversion), an inverted duplication mutation, a tandem duplication (e.g., an intrachromosomal tandem duplication), a translocation (e.g., a chromosomal translocation, or a non-reciprocal translocation), a change in gene copy number, or any combination thereof).
As used herein, the phrase “leukocyte parameter,” refers to the sequence of a leukocyte nucleic acid, e.g., a chromosomal nucleic acid.
As used herein, the phrase “genomic event,” refers to a sequence of a subgenomic interval that differs from the sequence of a reference sequence. A genomic event can be, e.g., a mutation, e.g., a point mutation or a rearrangement, e.g., a translocation.
This document provides methods and materials for identifying one or more chromosomal anomalies (e.g., aneuploidies) in a sample. In some embodiments, methods and materials described herein are used to identify one or more chromosomal anomalies (e.g., aneuploidies) in an embryo. In some embodiments, methods and materials described herein are used to identify one or more chromosomal anomalies (e.g., aneuploidies) in a mammal (e.g., a juvenile mammal or an adult mammal). For example, a mammal (e.g., a sample obtained from a mammal) can be assessed for the presence or absence of one or more chromosomal anomalies. In some cases, this document provides methods and materials for using amplicon-based sequencing data to identify a mammal as having a disease associated with one or more chromosomal anomalies (e.g., cancer). For example, methods and materials described herein can be applied to a sample obtained from a mammal to identify the mammal as having one or more chromosomal anomalies. For example, methods and materials described herein can be applied to a sample obtained from a mammal to identify the mammal as having a disease associated with one or more chromosomal anomalies (e.g., cancer). This document also provides methods and materials for identifying and/or treating a disease or disorder associated with one or more chromosomal anomalies (e.g., one or more chromosomal anomalies identified as described herein). In some cases, one or more chromosomal anomalies can be identified in DNA (e.g., genomic DNA) obtained from a sample obtained from a mammal. For example, a prenatal mammal (e.g., prenatal human) can be identified as having a disease or disorder based, at least in part, on the presence of one or more chromosomal anomalies. In some embodiments, a mammalian embryo identified as having a disease or disorder based, at least in part, on one or more chromosomal abnormalities can be assessed for the purposes of in vitro fertilization. 
In some embodiments, a mammal identified as having cancer based, at least in part, on the presence of one or more chromosomal anomalies can be treated with one or more cancer treatments. In some embodiments, a mammal can be identified as having congenital abnormalities based, at least in part, on the presence of one or more chromosomal abnormalities. In some embodiments, methods and materials provided herein are used to test an embryo (e.g., an embryo generated by in vitro fertilization) for chromosomal abnormalities prior to transfer to the uterus (e.g., a human uterus) for implantation.
Disclosed herein, inter alia, is a method of increasing the sensitivity of detecting one or more cancers, or a plurality of cancers, without altering the specificity of detecting said cancer or plurality of cancers. In an embodiment, the sensitivity of detection of a cancer by evaluating (i) a genetic biomarker, e.g., a somatic mutation; (ii) a protein biomarker; and (iii) aneuploidy status, is higher, e.g., about 1.1, 1.2, 1.3, 1.4, 1.5, 2, 2.5, 3, 3.5, 4, 4.5, 5, 5.5, 6, 6.5, 7, 7.5, 8, 8.5, 9, 9.5, or 10 fold higher, than the sensitivity of detection of the cancer by evaluating (i) alone; (ii) alone; (iii) alone; (i) and (ii) only; (i) and (iii) only; or (ii) and (iii) only. The increase in sensitivity by a method comprising (i), (ii) and (iii) does not alter, e.g., reduce, the specificity of detecting the cancer, or plurality of cancers. An exemplary increase in sensitivity of cancer detection using the method of the disclosure is demonstrated in Example 6 of this disclosure.
Any appropriate mammal can be assessed as described herein. A mammal can be a prenatal mammal (e.g., prenatal human). A mammal can be a mammal suspected of having a disease associated with one or more chromosomal anomalies (e.g., cancer or a congenital abnormality). In some cases, humans or other primates such as monkeys can be assessed for the presence of one or more chromosomal anomalies as described herein. In some cases, dogs, cats, horses, cows, pigs, sheep, mice, and rats can be assessed for the presence of one or more chromosomal anomalies as described herein. For example, a human can be assessed for the presence of one or more chromosomal anomalies as described herein.
Any appropriate sample from a mammal can be assessed as described herein (e.g., assessed for the presence of one or more chromosomal anomalies). A sample can include genomic DNA. In some cases, a sample can include cell-free circulating DNA (e.g., cell-free circulating fetal DNA). In some cases, a sample can include circulating tumor DNA (ctDNA). Examples of samples that can contain DNA (e.g., ctDNA) include, without limitation, blood (e.g., whole blood, serum, or plasma), amnion, tissue, urine, cerebrospinal fluid, saliva, sputum, broncho-alveolar lavage, bile, lymphatic fluid, cyst fluid, stool, ascites, pap smears, and endo-cervical, endometrial, and fallopian samples. For example, a sample can be a plasma sample. For example, a sample can be a urine sample. For example, a sample can be a saliva sample. For example, a sample can be a cyst fluid sample. For example, a sample can be a sputum sample. In some cases, a sample can include a neoplastic cell fraction (e.g., a low neoplastic cell fraction).
In some embodiments, a sample can be processed to isolate and/or purify DNA from the sample. In some embodiments, DNA isolation and/or purification can include cell lysis (e.g., using detergents and/or surfactants). In some embodiments, further processing of DNA (e.g., an amplification reaction) is performed without purifying DNA from the cell lysate. In such cases, additional reagents are added to facilitate further processing including, without limitation, protease inhibitors. In some embodiments, DNA isolation and/or purification can include removing proteins (e.g., using a protease). In some cases, DNA isolation and/or purification can include removing RNA (e.g., using an RNase). In some embodiments, DNA isolation is performed using commercially available kits (for example, without limitation, Qiagen DNeasy kit) or buffers known in the art (e.g., detergents in Tris-buffer).
In some embodiments, the amount of DNA inputted (“input DNA”) into the isolation and/or purification reaction may vary depending on a variety of factors including, without limitation, average length of DNA fragments, overall DNA quality, and/or type of DNA (e.g., gDNA, mitochondrial DNA, cfDNA). In some embodiments, any suitable amount of input DNA can be used in the methods described herein. In some embodiments, the amount of input DNA can be any amount from 1 picogram (pg) to 500 pg. In some embodiments, the amount of input DNA can be at least 0.01 pg, at least 0.1 pg or at least 1 pg. In some embodiments, the amount of input DNA can be at least 1 picogram (pg), at least 2 pg, at least 3 pg, at least 4 pg, at least 5 pg, at least 6 pg, at least 7 pg, at least 8 pg, at least 9 pg, at least 10 pg, at least 11 pg, at least 12 pg, at least 13 pg, at least 14 pg, at least 15 pg, at least 16 pg, at least 17 pg, at least 18 pg, at least 19 pg, at least 20 pg, at least 21 pg, at least 22 pg, at least 23 pg, at least 24 pg, at least 25 pg, at least 26 pg, at least 27 pg, at least 28 pg, at least 29 pg, at least 30 pg, at least 31 pg, at least 32 pg, at least 33 pg, at least 34 pg, at least 35 pg, at least 36 pg, at least 37 pg, at least 38 pg, at least 39 pg or at least 40 pg. In some embodiments, the amount of input DNA is 3 pg.
In some embodiments, methods and materials for identifying one or more chromosomal anomalies (e.g., aneuploidies) as described herein can include amplification of a plurality of amplicons. In some embodiments, the plurality of amplicons is amplified from a plurality of chromosomal sequences in a DNA sample. In some embodiments, the plurality of amplicons can be amplified from any variety of repetitive elements (see e.g., Table 1 for a list of repetitive elements). In some embodiments, the plurality of amplicons is amplified from a plurality of short interspersed nucleotide elements (SINEs). In some embodiments, the plurality of amplicons is amplified from a plurality of long interspersed nucleotide elements (LINEs). Methods of amplifying a plurality of amplicons include, without limitation, the polymerase chain reaction (PCR) and isothermal amplification methods (e.g., rolling circle amplification or bridge amplification). In some embodiments, a second amplification step is performed. In some embodiments, the amplified DNA from a first amplification reaction is used as a template in a second amplification reaction. In some embodiments, the amplified DNA is purified before the second amplification reaction (e.g., PCR purification using methods known in the art).
In some embodiments, an amplification reaction includes using a single pair of primers comprising a first primer having or including SEQ ID NO: 1, SEQ ID NO: 2, SEQ ID NO: 3, SEQ ID NO: 4, SEQ ID NO: 5, SEQ ID NO: 6, SEQ ID NO: 7, SEQ ID NO: 8 or SEQ ID NO: 9. In some embodiments, an amplification reaction includes using a single pair of primers comprising a first primer having at least 80% (e.g., at least 85%, at least 90%, at least 95%, at least 96%, at least 97%, at least 98%, or at least 99%) sequence identity to SEQ ID NO: 1, SEQ ID NO: 2, SEQ ID NO: 3, SEQ ID NO: 4, SEQ ID NO: 5, SEQ ID NO: 6, SEQ ID NO: 7, SEQ ID NO: 8 or SEQ ID NO: 9. In some embodiments, an amplification reaction includes using a single pair of primers comprising a second primer having or including SEQ ID NO: 10, SEQ ID NO: 11, SEQ ID NO: 12, SEQ ID NO: 14, SEQ ID NO: 15, SEQ ID NO: 16, SEQ ID NO: 17, SEQ ID NO: 18 or SEQ ID NO: 19. In some embodiments, an amplification reaction includes using a single pair of primers comprising a second primer having at least 80% (e.g., at least 85%, at least 90%, at least 95%, at least 96%, at least 97%, at least 98%, or at least 99%) sequence identity to SEQ ID NO: 10, SEQ ID NO: 11, SEQ ID NO: 12, SEQ ID NO: 14, SEQ ID NO: 15, SEQ ID NO: 16, SEQ ID NO: 17, SEQ ID NO: 18 or SEQ ID NO: 19.
In some embodiments, the first primer has a sequence that is at least 80% identical (e.g., at least 85%, at least 90%, at least 95%, at least 99%, or 100% identical) to CGACGTAAAACGACGGCCAGTNNNNNNNNNNNNNNNNGGTGAAACCCCGTCTCTACA (SEQ ID NO: 1). In some embodiments, the second primer has a sequence that is at least 80% identical (e.g., at least 85%, at least 90%, at least 95%, at least 99%, or 100% identical) to CACACAGGAAACAGCTATGACCATGCCTCCTAAGTAGCTGGGACTACAG (SEQ ID NO: 10). In some embodiments, an amplification reaction includes using a single pair of primers comprising a first primer having SEQ ID NO: 1 and a second primer having SEQ ID NO: 10. In some embodiments, an amplification reaction includes using a single pair of primers comprising a first primer having at least 80% (e.g., at least 85%, at least 90%, at least 95%, at least 96%, at least 97%, at least 98%, or at least 99%) sequence identity to SEQ ID NO: 1 and a second primer having at least 80% (e.g., at least 85%, at least 90%, at least 95%, at least 96%, at least 97%, at least 98%, or at least 99%) sequence identity to SEQ ID NO: 10.
In some embodiments, the first primer comprises from the 5′ to 3′ end: a universal primer sequence (UPS), a unique identifier DNA sequence (UID), and an amplification sequence. In some embodiments, the first primer comprises from the 5′ to 3′ end: a UPS sequence and an amplification sequence. In some embodiments, the first primer comprises from the 5′ to 3′ end: an amplification sequence. In such cases in which the first primer comprises at least an amplification sequence, any variety of library generation techniques known in the art can be used to generate a next generation sequencing library from the amplified amplicons.
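The 5′ to 3′ layout described above can be illustrated with a minimal Python sketch that splits a first primer into its three parts. It assumes, as an interpretation rather than a statement from the disclosure, that the run of N positions in SEQ ID NO: 1 is the UID, that the bases 5′ of that run are the UPS, and that the bases 3′ of it are the amplification sequence. The names `PrimerParts` and `split_primer` are hypothetical.

```python
import re
from typing import NamedTuple


class PrimerParts(NamedTuple):
    ups: str            # universal primer sequence (5' end)
    uid: str            # unique identifier: run of degenerate N positions
    amplification: str  # sequence that anneals to the repetitive element (3' end)


# Fixed bases, then one run of N, then fixed bases (assumed layout).
_LAYOUT = re.compile(r"^([ACGT]*)(N+)([ACGT]+)$")


def split_primer(primer: str) -> PrimerParts:
    """Split a primer written 5'->3' into UPS, UID, and amplification sequence."""
    cleaned = "".join(primer.split()).upper()  # drop stray whitespace
    match = _LAYOUT.match(cleaned)
    if match is None:
        raise ValueError("primer does not contain a single run of N positions")
    return PrimerParts(*match.groups())


# SEQ ID NO: 1 as given in the text.
SEQ_ID_NO_1 = "CGACGTAAAACGACGGCCAGTNNNNNNNNNNNNNNNNGGTGAAACCCCGTCTCTACA"

parts = split_primer(SEQ_ID_NO_1)
print("UPS:          ", parts.ups)
print("UID length:   ", len(parts.uid))
print("Amplification:", parts.amplification)
```

A primer that lacks a UID (for example, one with only a UPS and an amplification sequence) cannot be split this way without knowing the UPS in advance, so the sketch raises an error in that case.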
In some embodiments, the universal primer sequence (UPS) facilitates the generation of a library of amplicons ready for next generation sequencing. For example, an amplicon generated during the amplification reaction using a first primer (SEQ ID NO: 1) and a second primer (SEQ ID NO: 10) is used as a template for a second amplification reaction. In such cases, a second set of primers designed to bind to the UPS includes the 5′ grafting sequences necessary for hybridization to an Illumina flow cell.
In some embodiments, the UID comprises a sequence of 16-20 degenerate bases. In some embodiments, a degenerate sequence is a sequence in which some positions of a nucleotide sequence contain a number of possible bases. In some embodiments of any of the methods described herein, a degenerate sequence can be a degenerate nucleotide sequence comprising about or at least 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 25, 30, 35, 40, 45, or 50 nucleotides. In some embodiments, a nucleotide sequence contains 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 15, 20, 25, or more degenerate positions within the nucleotide sequence. In some embodiments, the degenerate sequence is used as a unique identifier DNA sequence (UID). In some embodiments, the degenerate sequence is used to improve the amplification of an amplicon. For example, a degenerate sequence may contain bases complementary to a chromosomal sequence being amplified. In such cases, the increased complementarity may increase a primer's affinity for the chromosomal sequence. In some embodiments, the UID (e.g., degenerate bases) is designed to increase a primer's affinity to a plurality of chromosomal sequences.
In some embodiments, an amplification reaction includes one or more pairs of primers (e.g., one or more pairs of primers selected from Table 2). In some embodiments, an amplification reaction includes at least 1, at least 2, at least 3, at least 4, at least 5, at least 6, at least 7, at least 8, or at least 9 pairs of primers. In some embodiments, when an amplification reaction includes more than one pair of primers, at least one pair of primers includes a primer having SEQ ID NO: 1 as a first primer and a primer having SEQ ID NO: 10 as a second primer. In some embodiments, when an amplification reaction includes more than one pair of primers, at least one pair of primers includes a first primer with a sequence having at least 80% (e.g., at least 85%, at least 90%, at least 95%, at least 96%, at least 97%, at least 98%, or at least 99%) sequence identity to SEQ ID NO: 1 and a second primer with a sequence having at least 80% (e.g., at least 85%, at least 90%, at least 95%, at least 96%, at least 97%, at least 98%, or at least 99%) sequence identity to SEQ ID NO: 10.
In some embodiments when an amplification reaction includes one or more pairs of primers, any variety of combinations of primers or pairs of primers can be selected from Table 2. For example, an amplification reaction containing 2 pairs of primers (e.g., 4 primers selected from Table 2) can include a first pair of primers (e.g., a first primer pair 1 from Table 2) that includes a first primer (e.g., a first primer having SEQ ID NO: 1) and a second primer (e.g., a second primer having SEQ ID NO: 10) and a second pair of primers (e.g., a second primer pair 2 from Table 2) that includes a third primer (e.g., a third primer having SEQ ID NO: 2) and a fourth primer (e.g., a fourth primer having SEQ ID NO: 11). Combining any of the forward primers listed in Table 2 (e.g., a “FP” having SEQ ID NO: 1, SEQ ID NO: 2, SEQ ID NO: 3, SEQ ID NO: 4, SEQ ID NO: 5, SEQ ID NO: 6, SEQ ID NO: 7, SEQ ID NO: 8 or SEQ ID NO: 9) with any of the reverse primers listed in Table 2 (e.g., a “RP” having SEQ ID NO: 10, SEQ ID NO: 11, SEQ ID NO: 12, SEQ ID NO: 14, SEQ ID NO: 15, SEQ ID NO: 16, SEQ ID NO: 17, SEQ ID NO: 18 or SEQ ID NO: 19) will generate amplicons from the repetitive elements as described herein (see e.g., Table 1 for a list of exemplary repetitive elements). For example, an amplification reaction containing 2 pairs of primers (e.g., 4 primers selected from Table 2) can include a first pair of primers (e.g., a first primer pair 1 from Table 2) that includes a first primer (e.g., a first primer having SEQ ID NO: 1) and a second primer (e.g., a second primer having SEQ ID NO: 10) and a second pair of primers (e.g., not listed as a primer pair in Table 2) that includes a third primer (e.g., a third primer having SEQ ID NO: 2) and a fourth primer (e.g., a fourth primer having SEQ ID NO: 12). In some embodiments, an amplification reaction includes one or more pairs of primers where a first primer is included in both pairs of primers. 
For example, an amplification reaction can include a first pair of primers (e.g., a first primer pair 1 from Table 2) that includes a first primer (e.g., a first primer having SEQ ID NO: 1) and a second primer (e.g., a second primer having SEQ ID NO: 10) and a second pair of primers that includes a third primer (e.g., a third primer having SEQ ID NO: 1) and a fourth primer (e.g., a fourth primer having SEQ ID NO: 11).
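The statement that any forward primer can be combined with any reverse primer can be illustrated by enumerating the combinations. The Python sketch below uses only the SEQ ID NO identifiers as they are listed in the text (forward 1 through 9; reverse 10, 11, 12, and 14 through 19); the variable names are hypothetical, and the sketch does not reproduce Table 2 or its listed pairings.

```python
from itertools import product

# SEQ ID NOs of the forward ("FP") and reverse ("RP") primers as listed in the text.
FORWARD_SEQ_IDS = [1, 2, 3, 4, 5, 6, 7, 8, 9]
REVERSE_SEQ_IDS = [10, 11, 12, 14, 15, 16, 17, 18, 19]

# Every forward primer paired with every reverse primer.
all_pairs = list(product(FORWARD_SEQ_IDS, REVERSE_SEQ_IDS))
print(f"{len(all_pairs)} possible forward/reverse combinations")

# The pairings used as examples in the text are all among the combinations,
# including two pairs that share the same forward primer (SEQ ID NO: 1).
for pair in [(1, 10), (2, 11), (2, 12), (1, 11)]:
    print(pair, pair in all_pairs)
```

Nine forward primers and nine reverse primers give 81 possible single pairs; a reaction containing several pairs would draw a subset of these.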
In some embodiments, a pair of primers is complementary to a plurality of chromosomal sequences. As used herein, the term “complementary” or “complementarity” refers to nucleic acid residues that are capable of participating in Watson-Crick type or analogous base pair interactions sufficient to support amplification. In some embodiments, an amplification sequence of a first primer is designed to amplify one or more chromosomal sequences. In some embodiments, the one or more chromosomal sequences include any of a variety of repetitive elements as described herein (see e.g., Table 1 for a list of exemplary repetitive elements). In some embodiments, the chromosomal sequences are SINEs. In some embodiments, the chromosomal sequences are LINEs. In some embodiments, the chromosomal sequences are a mixture of different types of repetitive elements (e.g., SINEs, LINEs and/or other exemplary repetitive elements listed in Table 1). In some embodiments when an amplification reaction includes two or more pairs of primers, each pair of primers amplifies a different type of repetitive element (see, e.g., Table 1 for a list of exemplary repetitive elements). For example, a first pair of primers can amplify SINEs, and a second pair of primers can amplify LINEs. Optionally, a third, fourth, fifth, etc. pair of primers can amplify a third, fourth, fifth, etc. type of repetitive element (see, e.g., Table 1 for a list of additional exemplary repetitive elements). In some embodiments when an amplification reaction includes two or more pairs of primers, each pair of primers generates amplicons from the same type of repetitive element (see, e.g., Table 1 for a list of exemplary repetitive elements). For example, a first pair of primers can amplify SINEs, and a second pair of primers can also amplify SINEs. Optionally, a third, fourth, fifth, etc. pair of primers can amplify SINEs. 
In some embodiments when an amplification reaction includes two or more primer pairs, each pair of primers generates amplicons from a mixture of different types of repetitive elements (see e.g., Table 1 for a list of exemplary repetitive elements).
In some embodiments, one or both primers of a primer pair described herein include primer modifications. Examples of primer modifications include, without limitation, a spacer (e.g., C3 spacer, PC spacer, hexanediol, spacer 9, spacer 18, 1′,2′-dideoxyribose (dspacer)), phosphorylation, phosphorothioate bond modifications, modified nucleic acids, attachment chemistry and/or linker modifications. Examples of modified nucleic acids include, without limitation, 2-Aminopurine, 2,6-Diaminopurine (2-Amino-dA), 5-Bromo dU, deoxyUridine, Inverted dT, Inverted Dideoxy-T, Dideoxy-C, 5-Methyl dC, deoxyInosine, Super T®, Super G®, Locked Nucleic Acids (LNAs), 5-Nitroindole, 2′-O-Methyl RNA Bases, Hydroxymethyl dC, Iso-dG, Iso-dC, Fluoro C, Fluoro U, Fluoro A, Fluoro G, 2-MethoxyEthoxy A, 2-MethoxyEthoxy MeC, 2-MethoxyEthoxy G and/or 2-MethoxyEthoxy T. Examples of attachment chemistries and linker modifications include, without limitation, Acrydite™, Adenylation, Azide (NHS Ester), Digoxigenin (NHS Ester), Cholesterol-TEG, I-Linker, Amino Modifiers (e.g., amino modifier C6, amino modifier C12, amino modifier C6 dT, amino modifier, and/or Uni-Link™ amino modifier), Alkynes (e.g., 5′ Hexynyl and/or 5-Octadiynyl dU), Biotinylation (e.g., biotin, biotin (Azide), biotin dT, biotin-TEG, dual biotin, pC biotin, and/or desthiobiotin-TEG), and/or Thiol Modifications (e.g., thiol modifier C3 S—S, dithiol, and/or thiol modifier C6 S—S). In some embodiments, any primer as described herein includes synthetic nucleic acids.
In some embodiments, one or both primers of a primer pair described herein include primer modifications that enhance processing of amplified DNA. In some embodiments, any primer as described herein includes primer modifications that facilitate elimination of primers (e.g., elimination of primers following an amplification reaction). In some embodiments, primer modifications are conveyed to a product of an amplification reaction (e.g., an amplification product contains modified bases). In such cases, the amplification product includes the modification and the inherent properties of the modification (e.g., the ability to select the amplification product containing the modification).
In some embodiments, methods for identifying one or more chromosomal anomalies as described herein include using amplicon-based sequencing reads. In some embodiments, a plurality of amplicons (e.g., amplicons obtained from a DNA sample) are sequenced. In some embodiments, each amplicon is sequenced at least 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, or more times. In some embodiments, each amplicon can be sequenced between about 1 and about 20 (e.g., between about 1 and about 15, between about 1 and about 12, between about 1 and about 10, between about 1 and about 8, between about 1 and about 5, between about 5 and about 20, between about 7 and about 20, between about 10 and about 20, between about 13 and about 20, between about 3 and about 18, between about 5 and about 16, or between about 8 and about 12) times. In some cases, amplicon-based sequencing reads can include continuous sequencing reads. In some cases, amplicons include short interspersed nuclear elements (SINEs). In some cases, amplicon-based sequencing reads can include from about 100,000 to about 25 million (e.g., from about 100,000 to about 20 million, from about 100,000 to about 15 million, from about 100,000 to about 12 million, from about 100,000 to about 10 million, from about 100,000 to about 5 million, from about 100,000 to about 1 million, from about 100,000 to about 750,000, from about 100,000 to about 500,000, from about 100,000 to about 250,000, from about 250,000 to about 25 million, from about 500,000 to about 25 million, from about 750,000 to about 25 million, from about 1 million to about 25 million, from about 5 million to about 25 million, from about 10 million to about 25 million, from about 15 million to about 25 million, from about 200,000 to about 20 million, from about 250,000 to about 15 million, from about 500,000 to about 10 million, from about 750,000 to about 5 million, or from about 1 million to about 2 million) sequencing reads.
For example, sequencing a plurality of amplicons can include assigning a unique identifier (UID) to each template molecule (e.g., to each amplicon), amplifying each uniquely tagged template molecule to create UID-families, and redundantly sequencing the amplification products. For example, sequencing a plurality of amplicons can include calculating a Z-score of a variant on a selected chromosome arm using the equation
$$Z_w = \frac{\sum_{i=1}^{k} w_i Z_i}{\sqrt{\sum_{i=1}^{k} w_i^2}}$$
where $w_i$ is the UID depth at variant $i$, $Z_i$ is the Z-score of variant $i$, and $k$ is the number of variants observed on the chromosome arm. In some embodiments, methods of sequencing amplicons include methods known in the art (see, e.g., U.S. Pat. No. 2015/0051085; and Kinde et al. 2012 PloS ONE 7:e41162, which are herein incorporated by reference in their entireties). In some embodiments, amplicons are aligned to a reference genome (e.g., GRCh37).
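As one non-limiting illustration, the depth-weighted combination of per-variant Z-scores described above can be sketched as follows. The sketch assumes the combination takes the form of Stouffer's weighted Z-method (a weighted sum of Z-scores divided by the square root of the sum of squared weights); the function name `weighted_arm_z` and its arguments are hypothetical and are not taken from this disclosure.

```python
import math


def weighted_arm_z(z_scores, uid_depths):
    """Combine per-variant Z-scores on one chromosome arm into a single Z-score.

    ASSUMPTION: Stouffer's weighted Z-method, with the UID depth at each
    variant used as that variant's weight.
    """
    if len(z_scores) != len(uid_depths) or not z_scores:
        raise ValueError("need one UID depth per variant and at least one variant")
    numerator = sum(w * z for w, z in zip(uid_depths, z_scores))
    denominator = math.sqrt(sum(w * w for w in uid_depths))
    if denominator == 0:
        raise ValueError("all UID depths are zero")
    return numerator / denominator
```

Under this weighting, variants covered by more unique template molecules contribute more to the arm-level score than variants supported by few templates.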
In some embodiments, a plurality of amplicons generated by methods described herein includes from about 10,000 to about 1,000,000 (e.g., from about 15,000 to about 1,000,000, from about 25,000 to about 1,000,000, from about 35,000 to about 1,000,000, from about 50,000 to about 1,000,000, from about 75,000 to about 1,000,000, from about 100,000 to about 1,000,000, from about 125,000 to about 1,000,000, from about 160,000 to about 1,000,000, from about 180,000 to about 1,000,000, from about 200,000 to about 1,000,000, from about 300,000 to about 1,000,000, from about 500,000 to about 1,000,000, from about 750,000 to about 1,000,000, from about 10,000 to about 800,000, from about 10,000 to about 500,000, from about 10,000 to about 250,000, from about 10,000 to about 150,000, from about 10,000 to about 100,000, from about 10,000 to about 75,000, from about 10,000 to about 50,000, from about 10,000 to about 40,000, from about 10,000 to about 30,000, or from about 10,000 to about 20,000) amplicons (e.g., unique amplicons). As one non-limiting example, a plurality of amplicons can include about 745,000 amplicons (e.g., 745,000 unique amplicons). Amplicons in a plurality of amplicons can include from about 50 to about 140 (e.g., from about 60 to about 140, from about 76 to about 140, from about 90 to about 140, from about 100 to about 140, from about 130 to about 140, from about 50 to about 130, from about 50 to about 120, from about 50 to about 110, from about 50 to about 100, from about 50 to about 90, from about 50 to about 80, from about 60 to about 130, from about 70 to about 125, from about 80 to about 120, or from about 90 to about 100) nucleotides. As one non-limiting example, an amplicon can include about 100 nucleotides.
In some embodiments, one or more amplicons in a plurality of amplicons generated by methods described herein can be greater than 1000 basepairs (bp) in length (“long amplicons”). In some embodiments, one or more long amplicons make up at least 4.0% of all amplicons within the total plurality of amplicons. In some embodiments, methods and materials described herein can detect long amplicons when the long amplicons make up at least 4.0% of all the amplicons within the total plurality of amplicons. In some embodiments, methods and materials described herein can detect long amplicons when the long amplicons make up between 0.01% and 3.9% of all amplicons within the total plurality of amplicons.
In some embodiments, one or more amplicons with a length >1000 bp originate from amplification of DNA from cells that do not contain a chromosomal abnormality. In some embodiments, cells that do not contain chromosomal abnormalities are considered contaminating cells. In some embodiments, cells that do not contain chromosomal abnormalities are used as control cells or samples. In some embodiments, contaminating cells can be any of a variety of cells that might be found in a plasma sample and that may dilute amplification of the intended target. In some embodiments, contaminating cells are white blood cells (e.g., leukocytes, granulocytes, eosinophils, basophils, B-cells, T-cells, or Natural Killer cells). For example, contaminating cells can be leukocytes.
In some embodiments, methods and materials for identifying one or more chromosomal anomalies as described herein include grouping sequencing reads (e.g., from a plurality of amplicons) into clusters (e.g., unique clusters) of genomic intervals. In some embodiments, a genomic interval is included in one or more clusters. In some embodiments, a genomic interval can belong to from about 100 to about 252 (e.g., from about 125 to about 252, from about 150 to about 252, from about 175 to about 252, from about 200 to about 252, from about 225 to about 252, from about 100 to about 250, from about 100 to about 225, from about 100 to about 200, from about 100 to about 175, from about 100 to about 150, from about 125 to about 225, from about 150 to about 200, or from about 160 to about 180) clusters. As one non-limiting example, a genomic interval can belong to about 176 clusters. In some embodiments, each cluster includes any appropriate number of genomic intervals. In some embodiments, each cluster includes the same number of genomic intervals. In some embodiments, different clusters include varying numbers of genomic intervals. As one non-limiting example, each cluster can include about 200 genomic intervals.
In some embodiments, genomic intervals are identified as having shared amplicon features. As used herein, the term “shared amplicon feature” refers to amplicons with one or more features that are similar. In some embodiments, a plurality of genomic intervals are grouped into a cluster based on one or more shared amplicon features of the sequencing reads mapped to a genomic interval. In some embodiments, the shared amplicon feature is the number of amplicons mapped to a genomic interval (e.g., sums of the distributions of the sequencing reads in each genomic interval). In some embodiments, the shared amplicon feature is the average length of the mapped amplicons.
In some embodiments, a cluster of genomic intervals includes from about 5000 to about 6000 (e.g., from about 5100 to about 6000, from about 5200 to about 6000, from about 5300 to about 6000, from about 5400 to about 6000, from about 5500 to about 6000, from about 5600 to about 6000, from about 5700 to about 6000, from about 5800 to about 6000, from about 5900 to about 6000, from about 5000 to about 5900, from about 5000 to about 5800, from about 5000 to about 5700, from about 5000 to about 5600, from about 5000 to about 5500, from about 5000 to about 5400, from about 5000 to about 5300, from about 5000 to about 5200, from about 5000 to about 5100, from about 5100 to about 5800, from about 5100 to about 5700, from about 5100 to about 5600, from about 5100 to about 5500, from about 5100 to about 5400, from about 5100 to about 5300, from about 5100 to about 5200, from about 5200 to about 5600, from about 5200 to about 5500, from about 5200 to about 5400, from about 5200 to about 5300, from about 5300 to about 5500, from about 5300 to about 5400, from about 5400 to about 5500, or from about 5200 to about 5700) genomic intervals. As one non-limiting example, a cluster of genomic intervals can include about 5344 genomic intervals. A genomic interval can be any appropriate length. For example, a genomic interval can be the length of an amplicon sequenced as described herein. For example, a genomic interval can be the length of a chromosome arm.
In some cases, a genomic interval can include from about 100 to about 125,000,000 (e.g., from about 250 to about 125,000,000, from about 500 to about 125,000,000, from about 750 to about 125,000,000, from about 1,000 to about 125,000,000, from about 1,500 to about 125,000,000, from about 2,000 to about 125,000,000, from about 5,000 to about 125,000,000, from about 7,500 to about 125,000,000, from about 10,000 to about 125,000,000, from about 25,000 to about 125,000,000, from about 50,000 to about 125,000,000, from about 100,000 to about 125,000,000, from about 250,000 to about 125,000,000, from about 500,000 to about 125,000,000, from about 100 to about 1,000,000, from about 100 to about 750,000, from about 100 to about 500,000, from about 100 to about 250,000, from about 100 to about 100,000, from about 100 to about 50,000, from about 100 to about 25,000, from about 100 to about 10,000, from about 100 to about 5,000, from about 100 to about 2,500, from about 100 to about 1,000, from about 100 to about 750, from about 100 to about 500, from about 100 to about 250, from about 500 to about 1,000,000, from about 5,000 to about 900,000, from about 50,000 to about 800,000, or from about 100,000 to about 750,000) nucleotides. As one non-limiting example, a genomic interval can include about 500,000 nucleotides. In some embodiments, clusters of genomic intervals are formed using any appropriate method known in the art. In some embodiments, clusters of genomic intervals are formed based on shared amplicon features of the genomic intervals (see, e.g., Douville et al. 2018 PNAS 115(8):1871-1876, which is herein incorporated by reference in its entirety).
In some embodiments, methods and materials described herein for identifying one or more chromosomal anomalies include assessing a genome (e.g., a genome of a mammal) for the presence or absence of one or more chromosomal anomalies (e.g., aneuploidies). The presence or absence of one or more chromosomal anomalies in the genome of a mammal can, for example, be determined by sequencing a plurality of amplicons obtained from a sample (e.g., a test sample) obtained from the mammal to obtain sequencing reads, and grouping the sequencing reads into clusters of genomic intervals. In some cases, read counts of genomic intervals can be compared to read counts of other genomic intervals within the same sample. In some cases where read counts of genomic intervals are compared to read counts of other genomic intervals within the same sample, a second (e.g., control or reference) sample is not assayed. In some cases, read counts of genomic intervals can be compared to read counts of genomic intervals in another sample. For example, when using methods and materials described herein to identify genetic relatedness, polymorphisms (e.g., somatic mutations), and/or microsatellite instability, read counts of genomic intervals can be compared to read counts of genomic intervals in a reference sample. A reference sample can be a synthetic sample. A reference sample can be from a database. In some cases where methods and materials described herein are used to identify anomalies (e.g., aneuploidies), a reference sample can be a normal sample obtained from the same cancer patient (e.g., a sample from the cancer patient that does not harbor cancer cells) or a normal sample from another source (e.g., a patient that does not have cancer). In some cases where methods and materials described herein are used to identify anomalies (e.g., aneuploidies), a reference sample can be a normal sample obtained from the same patient (e.g., a sample from a prenatal human that contains only maternal cells).
In some embodiments, methods and materials described herein are used for detecting aneuploidy in a preimplantation embryo (e.g., an embryo generated via in vitro fertilization). In some embodiments, the presence or absence of one or more chromosomal anomalies in a preimplantation embryo is determined by sequencing a plurality of amplicons obtained from a sample taken from the preimplantation embryo (e.g., a test sample such as, without limitation, one or more cells obtained from a blastocyst) to obtain sequencing reads, and grouping the sequencing reads into clusters of genomic intervals. In some cases, read counts of genomic intervals can be compared to read counts of other genomic intervals within the same sample. In some cases where read counts of genomic intervals are compared to read counts of other genomic intervals within the same sample, a second (e.g., control or reference) sample is not assayed. In some cases, read counts of genomic intervals can be compared to read counts of genomic intervals in another sample (e.g., a reference sample). In some embodiments, a reference sample is a sample obtained from a reference mammal. In some embodiments, a reference sample is obtained from a database (e.g., the reference sample is an in silico sample having a known sequence and/or ploidy at the genomic position of interest). Exemplary aneuploidies that can be detected in preimplantation embryos include trisomies at chromosome 21 (e.g., resulting in Down's Syndrome), trisomies at chromosome 13, trisomies at chromosome 18, Turner Syndrome (e.g., women with only one X chromosome) and Klinefelter Syndrome (e.g., men with two or more X chromosomes). In some embodiments, methods and materials described herein are used for detecting aneuploidy in a genome of a mammal.
For example, a plurality of amplicons obtained from a sample obtained from a mammal can be sequenced, the sequencing reads can be grouped into clusters of genomic intervals, the sums of the distributions of the sequencing reads in each genomic interval can be calculated, a Z-score of a chromosome arm can be calculated, and the presence or absence of an aneuploidy in the genome of the mammal can be identified.
The distributions of the sequencing reads in each genomic interval can be summed. For example, sums of distributions of the sequencing reads in each genomic interval can be calculated using the equation $\sum_{i=1}^{I} R_i \sim N\left(\sum_{i=1}^{I} \mu_i,\ \sum_{i=1}^{I} \sigma_i^2\right)$, where $R_i$ is the number of sequencing reads, $I$ is the number of clusters on a chromosome arm, $N$ is a Gaussian distribution with parameters $\mu_i$ and $\sigma_i^2$, $\mu_i$ is the mean number of sequencing reads in each genomic interval, and $\sigma_i^2$ is the variance of sequencing reads in each genomic interval. A Z-score of a chromosome arm can be calculated using any appropriate technique. For example, a Z-score of a chromosome arm can be calculated using the quantile function $1 - \mathrm{CDF}\left(\sum_{i=1}^{I} \mu_i,\ \sum_{i=1}^{I} \sigma_i^2\right)$. The presence of an aneuploidy in the genome of the mammal can be identified when the Z-score is outside a predetermined significance threshold, and the absence of an aneuploidy in the genome of the mammal can be identified when the Z-score is within the predetermined significance threshold. The predetermined threshold can correspond to the confidence in the test and the acceptable number of false positives. For example, a significance threshold can be ±1.96, ±3, or ±5. In some embodiments, methods and materials described herein employ supervised machine learning. In some embodiments, supervised machine learning can detect small changes in one or more chromosome arms. For example, supervised machine learning can detect changes such as chromosome arm gains or losses that are often present in a disease or disorder associated with chromosomal anomalies, such as cancer or congenital anomalies. In some embodiments, supervised machine learning can detect changes such as chromosome arm gains or losses that are present in a preimplantation embryo (e.g., a preimplantation embryo generated by in vitro fertilization methods).
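As one non-limiting illustration, the summation of per-cluster Gaussian distributions and the thresholding of the resulting chromosome arm Z-score can be sketched as follows. The sketch assumes the clusters are treated as independent, so that means and variances add, and it computes the Z-score by standardizing the observed arm total, which corresponds to the upper-tail probability 1 − CDF of the summed Gaussian. The function names and the default threshold of 3 are hypothetical choices made for illustration.

```python
from statistics import NormalDist


def arm_statistics(observed_reads, cluster_means, cluster_variances):
    """Return (Z-score, upper-tail probability) for one chromosome arm.

    ASSUMPTION: clusters are independent, so the arm total follows
    N(sum of means, sum of variances).
    """
    if not (len(observed_reads) == len(cluster_means) == len(cluster_variances)):
        raise ValueError("one mean and one variance are needed per cluster")
    observed_total = sum(observed_reads)
    expected_total = sum(cluster_means)
    total_sd = sum(cluster_variances) ** 0.5
    if total_sd == 0:
        raise ValueError("summed variance must be positive")
    z = (observed_total - expected_total) / total_sd
    # 1 - CDF of the summed Gaussian, evaluated at the observed arm total.
    upper_tail = 1.0 - NormalDist(expected_total, total_sd).cdf(observed_total)
    return z, upper_tail


def aneuploidy_present(z, threshold=3.0):
    """Call an aneuploidy when the Z-score lies outside +/- threshold."""
    return abs(z) > threshold
```

A stricter threshold (e.g., ±5) reduces false positives at the cost of sensitivity, consistent with the trade-off described above.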
In some cases, supervised machine learning can be used to classify samples according to aneuploidy status. For example, supervised machine learning can be employed to make genome-wide aneuploidy calls. In some cases, a support vector machine model can include obtaining an SVM score. An SVM score can be obtained using any appropriate technique. In some cases, an SVM score can be obtained as described elsewhere (see, e.g., Cortes 1995 Machine learning 20:273-297; and Meyer et al. 2015 R package version:1.6-3). At lower read depths, a sample will typically have a higher raw SVM score. Thus, in some cases, raw SVM probabilities can be corrected based on the read depth of a sample using the equation
where $r$ is the ratio of the SVM score at a particular read depth to the minimum SVM score of a particular sample given sufficient read depth. $A$ and $B$ can be determined as described in Example 1. For example, $A = -7.076 \times 10^{-7}$, $x$ = the number of unique template molecules for the given sample, and $B = -1.946 \times 10^{-1}$.
Also provided herein are new methods of normalization that reduce the amount of variability between samples. In some embodiments, a principal component analysis (PCA) can be used for normalization. In some embodiments, a PCA is performed on sequencing data from the controls. For example, a PCA may reduce the number of 500 kb genomic intervals from n=5,344 to a more manageable number of dimensions. Using the PCA coordinates of the controls, a model can be generated that predicts whether a particular 500 kb interval will be amplified more or less efficiently in future samples based on their PCA coordinates.
$$\text{Correction Factor for 500 kb Interval}_i = \beta_{0i} + \beta_{1i}\,\mathrm{PCA}_1 + \beta_{2i}\,\mathrm{PCA}_2 + \beta_{3i}\,\mathrm{PCA}_3 + \beta_{4i}\,\mathrm{PCA}_4 + \beta_{5i}\,\mathrm{PCA}_5$$
For example, each test sample can be projected into PCA space, and the correction factor can be calculated for each 500 kb interval as a function of the sample's PCA coordinates. After applying the correction factor to each 500 kb genomic interval, the test sample may be matched to one or more control samples based on the closest Euclidean distance of the 500 kb intervals.
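As one non-limiting illustration, the per-interval correction factor and the matching of a test sample to its closest controls can be sketched as follows. The sketch assumes the sample's first five PCA coordinates and the per-interval coefficients have already been obtained from the controls. It further assumes, purely for illustration, that the correction factor is applied by subtraction; the manner of application is not specified above. All function and variable names are hypothetical.

```python
import math


def correction_factor(betas, pca_coords):
    """Linear correction for one 500 kb interval.

    betas: [b0, b1, ..., b5] fitted for this interval on the controls.
    pca_coords: the sample's first five PCA coordinates.
    """
    if len(betas) != len(pca_coords) + 1:
        raise ValueError("expected one intercept plus one coefficient per PCA coordinate")
    return betas[0] + sum(b * c for b, c in zip(betas[1:], pca_coords))


def correct_profile(profile, betas_per_interval, pca_coords):
    """Apply the correction to every interval of a sample.

    ASSUMPTION: the correction is subtracted from the interval value.
    """
    return [value - correction_factor(betas, pca_coords)
            for value, betas in zip(profile, betas_per_interval)]


def closest_controls(corrected_test, corrected_controls, n=1):
    """Return the names of the n controls nearest in Euclidean distance."""
    ranked = sorted(corrected_controls.items(),
                    key=lambda item: math.dist(corrected_test, item[1]))
    return [name for name, _ in ranked[:n]]
```

Matching on corrected profiles is intended to pair each test sample with controls that amplified with similar efficiency across the 500 kb intervals.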
In some embodiments, samples are excluded in order to ensure the quality of the data. In some embodiments, samples are excluded before, contemporaneously with, and/or after data analysis. In some embodiments, a list of factors can be applied to the data in order to exclude data that does not meet the criteria set forth in the list of factors. In some embodiments, the list of factors may include any reasonable number of factors. For example, a list of five factors can be used to exclude samples. Any combination of factors can be used to determine that a sample should be excluded. In some embodiments, samples with fewer than 2.5 million reads may be excluded. In some embodiments, samples with sufficient evidence of contamination may be excluded. For example, a sample may be considered contaminated if the sample has at least ten chromosome arms with significant allelic imbalance (Z-score ≥2.5) and fewer than ten significant chromosome arm gains or losses (Z-score ≥2.5 or Z-score ≤−2.5). In some embodiments, allelic imbalance can be determined from SNPs, while gains or losses can be assessed through WALDO. In some embodiments, when examining the quality of the plasma samples, samples may be excluded in which more than 8.5% of the amplicons are larger than 94 bp (50 bp between the forward and reverse primers). Without wishing to be bound by theory, such samples may be contaminated with leukocyte DNA. In some embodiments, samples outside the dynamic range of the assay, as defined by the equation below, may be excluded.
For example, the distribution of this metric has long tails. Values of >0.2450 and <0.2320 may be selected as cutoffs identifying samples outside the dynamic range. In some embodiments, plasma samples with known aneuploidy in the leukocytes of the same patients may be excluded. For example, such patients may have Clonal Hematopoiesis of Indeterminate Potential (CHIP) or congenital disorders.
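As one non-limiting illustration, three of the exclusion factors described above (read count, evidence of contamination, and fraction of long amplicons) can be sketched as a simple filter. The dynamic-range metric and the leukocyte aneuploidy factor are omitted because they depend on inputs not reproduced here. The function name, argument names, and reason labels are hypothetical.

```python
def exclusion_reasons(total_reads, allelic_imbalance_z, arm_gain_loss_z,
                      fraction_long_amplicons):
    """Return the list of reasons a sample would be excluded (empty if none).

    allelic_imbalance_z: per-arm Z-scores for allelic imbalance (from SNPs).
    arm_gain_loss_z: per-arm Z-scores for gains or losses.
    fraction_long_amplicons: fraction of amplicons longer than 94 bp.
    """
    reasons = []
    if total_reads < 2_500_000:
        reasons.append("low_read_count")
    imbalanced_arms = sum(1 for z in allelic_imbalance_z if z >= 2.5)
    gained_or_lost_arms = sum(1 for z in arm_gain_loss_z if abs(z) >= 2.5)
    # Many imbalanced arms without matching gains or losses suggests contamination.
    if imbalanced_arms >= 10 and gained_or_lost_arms < 10:
        reasons.append("possible_contamination")
    if fraction_long_amplicons > 0.085:
        reasons.append("possible_leukocyte_dna")
    return reasons
```

Returning the reasons rather than a single flag keeps a record of which factor triggered each exclusion, which can be useful when factors are applied in combination.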
In some embodiments, provided herein are methods to detect copy number variants (CNVs) of indeterminate length. In some embodiments, provided herein are methods to detect copy number variation of near-fixed length. In some embodiments, detecting copy number variation includes calculating the values of one or more variables. In some embodiments, using a log ratio of the observed test sample values and the WALDO-predicted values from every 500 kb interval across each chromosome arm, a circular binary segmentation algorithm can be applied to determine copy number variants throughout each chromosome arm. For example, copy number variants ≤5 Mb in size can be flagged. In some embodiments, the flagged CNVs can be removed before, contemporaneously with, and/or after the analysis. In some embodiments, small CNVs may be used to assess microdeletions or microamplifications. For example, microdeletions or microamplifications occur in DiGeorge Syndrome (chromosome 22q11.2) or in breast cancers (chromosome 17q12).
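As one non-limiting illustration, the log-ratio computation and the flagging of small copy number variants can be sketched as follows. The circular binary segmentation step itself is not implemented here; the sketch assumes its output is available as a list of (start, end, mean log ratio) tuples for segments that deviate from baseline. The use of a base-2 logarithm is an assumption, and all names are hypothetical.

```python
import math

FIVE_MB = 5_000_000


def log_ratios(observed, predicted):
    """Per-interval log ratio of observed to predicted values.

    ASSUMPTION: base-2 logarithm; the base is not specified above.
    """
    return [math.log2(o / p) for o, p in zip(observed, predicted)]


def flag_small_cnvs(segments, max_size=FIVE_MB):
    """Keep segments no longer than max_size base pairs.

    segments: (start, end, mean_log_ratio) tuples produced by a
    segmentation algorithm applied to the log ratios.
    """
    return [segment for segment in segments
            if segment[1] - segment[0] <= max_size]
```

Flagged segments can then be removed from the arm-level analysis or examined separately as candidate microdeletions or microamplifications.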
In some embodiments, provided herein are methods of using synthetic aneuploid samples. In some embodiments, synthetic aneuploid samples can be created by adding (or subtracting) reads from several chromosome arms to the reads from normal DNA samples. For example, reads can be added to or subtracted from 1, 10, 15, or 20 chromosome arms in each sample. The additions and subtractions can be designed to represent neoplastic cell fractions ranging from 0.5% to 1.5% and can result in synthetic samples containing exactly ten million reads. The reads from each chromosome arm can be added or subtracted uniformly. In some embodiments, provided herein are methods of generating synthetic aneuploid samples using exemplary pseudocode (
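As one non-limiting illustration, and not as a reproduction of the exemplary pseudocode referenced above, the generation of a synthetic aneuploid sample from arm-level read counts of a normal sample can be sketched as follows. The sketch assumes a single-copy gain or loss in a diploid genome, so that an arm altered in a neoplastic cell fraction f has its reads scaled by 1 ± f/2, and it rescales the result to a fixed total using largest-remainder rounding. The function name, arguments, and random seed handling are hypothetical.

```python
import random


def make_synthetic_sample(arm_reads, n_arms, cell_fraction,
                          total_reads=10_000_000, seed=0):
    """Return (synthetic arm read counts, {altered arm: +1 gain or -1 loss})."""
    rng = random.Random(seed)
    altered = rng.sample(sorted(arm_reads), n_arms)
    changes = {arm: rng.choice((1, -1)) for arm in altered}

    # ASSUMPTION: one-copy gain/loss in a diploid genome scales reads by 1 +/- f/2.
    expected = {arm: count * (1 + changes.get(arm, 0) * cell_fraction / 2)
                for arm, count in arm_reads.items()}

    # Rescale so the synthetic sample has exactly total_reads reads.
    scale = total_reads / sum(expected.values())
    scaled = {arm: value * scale for arm, value in expected.items()}
    counts = {arm: int(value) for arm, value in scaled.items()}

    # Hand the reads lost to rounding down to the arms with the largest remainders.
    shortfall = total_reads - sum(counts.values())
    by_remainder = sorted(scaled, key=lambda arm: scaled[arm] - counts[arm],
                          reverse=True)
    for arm in by_remainder[:shortfall]:
        counts[arm] += 1
    return counts, changes
```

For example, calling `make_synthetic_sample(arm_reads, n_arms=10, cell_fraction=0.01)` on a normal sample's arm-level counts would yield a ten-million-read synthetic sample with ten altered arms representing a 1% neoplastic cell fraction.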
Examples of chromosomal anomalies that can be detected using methods and materials described herein include, without limitation, numerical disorders, structural abnormalities, allelic imbalances, and microsatellite instabilities. A chromosomal anomaly can include a numerical disorder. For example, a chromosomal anomaly can include an aneuploidy (e.g., an abnormal number of chromosomes). In some cases, an aneuploidy can include an entire chromosome. In some cases, an aneuploidy can include part of a chromosome (e.g., a chromosome arm gain or a chromosome arm loss). Examples of aneuploidies include, without limitation, monosomy, trisomy, tetrasomy, and pentasomy. A chromosomal anomaly can include a structural abnormality. Examples of structural abnormalities include, without limitation, deletions, duplications, translocations (e.g., reciprocal translocations and Robertsonian translocations), inversions, insertions, rings, and isochromosomes. Chromosomal anomalies can occur on any chromosome pair (e.g., chromosome 1, chromosome 2, chromosome 3, chromosome 4, chromosome 5, chromosome 6, chromosome 7, chromosome 8, chromosome 9, chromosome 10, chromosome 11, chromosome 12, chromosome 13, chromosome 14, chromosome 15, chromosome 16, chromosome 17, chromosome 18, chromosome 19, chromosome 20, chromosome 21, chromosome 22, and/or one of the sex chromosomes (e.g., an X chromosome or a Y chromosome)). For example, aneuploidy can occur, without limitation, in chromosome 13 (e.g., trisomy 13), chromosome 16 (e.g., trisomy 16), chromosome 18 (e.g., trisomy 18), chromosome 21 (e.g., trisomy 21), and/or the sex chromosomes (e.g., X chromosome monosomy; sex chromosome trisomy such as XXX, XXY, and XYY; sex chromosome tetrasomy such as XXXX and XXYY; and sex chromosome pentasomy such as XXXXX, XXXXY, and XYYYY).
For example, structural abnormalities can occur, without limitation, in chromosome 4 (e.g., partial deletion of the short arm of chromosome 4), chromosome 11 (e.g., a terminal 11q deletion), chromosome 13 (e.g., Robertsonian translocation at chromosome 13), chromosome 14 (e.g., Robertsonian translocation at chromosome 14), chromosome 15 (e.g., Robertsonian translocation at chromosome 15), chromosome 17 (e.g., duplication of the gene encoding peripheral myelin protein 22), chromosome 21 (e.g., Robertsonian translocation at chromosome 21), and chromosome 22 (e.g., Robertsonian translocation at chromosome 22).
In some embodiments, methods and materials as described herein are used for identifying and/or treating a disease associated with one or more chromosomal anomalies (e.g., one or more chromosomal anomalies identified as described herein, such as, without limitation, an aneuploidy). In some cases, a DNA sample (e.g., a genomic DNA sample) obtained from a mammal can be assessed for the presence or absence of one or more chromosomal anomalies. For example, a mammal (e.g., a human) identified as having a disease based, at least in part, on the presence of one or more chromosomal anomalies can be treated with one or more cancer treatments. In some embodiments, a mammal identified as having cancer based, at least in part, on the presence of one or more chromosomal anomalies is treated with one or more cancer treatments. In some embodiments, a mammal (e.g., a prenatal human) can be identified as having a disease or disorder based, at least in part, on the presence of one or more chromosomal anomalies. In some embodiments, an embryo (e.g., an embryo generated by in vitro fertilization) can be identified as being unsuitable for transfer to the uterus (e.g., a human uterus) for implantation based, at least in part, on the presence of one or more chromosomal anomalies. In some embodiments, an embryo (e.g., an embryo generated by in vitro fertilization) can be identified as being suitable for transfer to the uterus (e.g., a human uterus) for implantation based, at least in part, on the absence of one or more chromosomal anomalies.
In some embodiments, a mammal identified as having a disease or disorder associated with one or more chromosomal anomalies as described herein (e.g., based at least in part on the presence of one or more chromosomal anomalies, such as, without limitation, an aneuploidy) can have the disease or disorder diagnosis confirmed using any appropriate method. Examples of methods that can be used to confirm the presence of one or more chromosomal anomalies include, without limitation, karyotyping, fluorescence in situ hybridization (FISH), quantitative PCR of short tandem repeats, quantitative fluorescence PCR (QF-PCR), quantitative PCR dosage analysis, quantitative mass spectrometry of SNPs, comparative genomic hybridization (CGH), whole genome sequencing, and exome sequencing.
In some embodiments, detection of aneuploidy is used to identify a mammal as having cancer (e.g., any of the exemplary cancers described herein). In some embodiments, detection of one or more genetic biomarkers is used to confirm or identify a mammal as having cancer (e.g., any of the exemplary cancers described herein). In some embodiments, an elevated level of one or more peptide biomarkers is used to confirm or identify a mammal as having cancer (e.g., any of the exemplary cancers described herein). In some embodiments, a mammal identified as having cancer as described herein (e.g., based on detection of aneuploidy, and/or at least in part on the presence or absence of one or more genetic biomarkers (e.g., mutations) and/or an elevated level of one or more protein biomarkers (e.g., peptides)) can have the cancer diagnosis confirmed using any appropriate method. Examples of methods that can be used to diagnose or confirm diagnosis of a cancer include, without limitation, physical examinations (e.g., pelvic examination), imaging tests (e.g., ultrasound or CT scans), cytology, and tissue tests (e.g., biopsy).
In some embodiments, methods for identifying one or more chromosomal anomalies (e.g., aneuploidy) provided herein are used to identify a mammal as having a distinct stage of cancer. In some embodiments, a cancer can be a Stage I cancer. In some embodiments, a cancer can be a Stage II cancer. In some embodiments, a cancer can be a Stage III cancer. In some embodiments, a cancer can be a Stage IV cancer. In some embodiments, methods for identifying one or more chromosomal anomalies (e.g., aneuploidy) provided herein are used to identify a mammal as having a stage of cancer that conventional methods of detecting cancer cannot reliably detect. For example, methods for identifying one or more chromosomal anomalies (e.g., aneuploidy) provided herein can be used to identify a mammal as having a Stage I cancer that conventional methods of detecting cancer cannot reliably detect. In some embodiments, methods provided herein for identifying: 1) one or more chromosomal anomalies (e.g., aneuploidy), and 2) one or more genetic biomarkers (e.g., any of the genetic biomarkers provided herein) are used to identify a mammal as having a stage of cancer that conventional methods of detecting cancer cannot reliably detect. In some embodiments, methods provided herein for identifying: 1) one or more chromosomal anomalies (e.g., aneuploidy), and 2) one or more protein biomarkers (e.g., any of the protein biomarkers provided herein) are used to identify a mammal as having a stage of cancer that conventional methods of detecting cancer cannot reliably detect. 
Non-limiting examples of cancers that can be identified as described herein (e.g., based on detection of aneuploidy, and/or at least in part on the presence or absence of one or more genetic biomarkers (e.g., mutations) and/or an elevated level of one or more protein biomarkers (e.g., peptides)) include liver cancer, ovarian cancer, esophageal cancer, stomach cancer, pancreatic cancer, colorectal cancer, lung cancer, breast cancer, and prostate cancer.
In some embodiments, a subject in which the presence of one or more chromosomal anomalies (e.g., aneuploidies) is detected may be selected for further diagnostic testing. In some embodiments, methods provided herein can be used to select a subject for further diagnostic testing at a time period prior to the time period when conventional techniques are capable of diagnosing the subject with an early-stage cancer. For example, methods provided herein for selecting a subject for further diagnostic testing can be used when a subject has not been diagnosed with cancer by conventional methods and/or when a subject is not known to harbor a cancer. In some embodiments, a subject selected for further diagnostic testing can be administered a diagnostic test (e.g., any of the diagnostic tests described herein) at an increased frequency compared to a subject that has not been selected for further diagnostic testing. For example, a subject selected for further diagnostic testing can be administered a diagnostic test at a frequency of twice daily, daily, bi-weekly, weekly, bi-monthly, monthly, quarterly, semi-annually, annually, or at any frequency therein. In some embodiments, a subject selected for further diagnostic testing can be administered one or more additional diagnostic tests compared to a subject that has not been selected for further diagnostic testing. For example, a subject selected for further diagnostic testing can be administered two or more diagnostic tests, whereas a subject that has not been selected for further diagnostic testing is administered only a single diagnostic test (or no diagnostic tests). In some embodiments, the diagnostic testing method can determine the presence of the same type of cancer as the originally detected cancer. Additionally or alternatively, the diagnostic testing method can determine the presence of a different type of cancer from the originally detected cancer.
In some embodiments, the diagnostic testing method is a scan. In some embodiments, the scan is a bone scan, a computed tomography (CT), a CT angiography (CTA), an esophagram (a Barium swallow), a Barium enema, a gallium scan, a magnetic resonance imaging (MRI), a mammography, a monoclonal antibody scan (e.g., ProstaScint® scan for prostate cancer, OncoScint® scan for ovarian cancer, and CEA-Scan® for colon cancer), a multigated acquisition (MUGA) scan, a PET scan, a PET/CT scan, a thyroid scan, an ultrasound (e.g., a breast ultrasound, an endobronchial ultrasound, an endoscopic ultrasound, a transvaginal ultrasound), an X-ray, or a DEXA scan.
In some embodiments, the diagnostic testing method is a physical examination, such as, without limitation, an anoscopy, a biopsy, a bronchoscopy (e.g., an autofluorescence bronchoscopy, a white-light bronchoscopy, a navigational bronchoscopy), a digital breast tomosynthesis, a digital rectal exam, an endoscopy, including but not limited to a capsule endoscopy, virtual endoscopy, an arthroscopy, a bronchoscopy, a colonoscopy, a colposcopy, a cystoscopy, an esophagoscopy, a gastroscopy, a laparoscopy, a laryngoscopy, a neuroendoscopy, a proctoscopy, a sigmoidoscopy, a skin cancer exam, a thoracoscopy, an endoscopic retrograde cholangiopancreatography (ERCP), an esophagogastroduodenoscopy, or a pelvic exam.
In some embodiments, the diagnostic testing method is a biopsy (e.g., a bone marrow aspiration, a tissue biopsy). In some embodiments, the biopsy is performed by fine needle aspiration or by surgical excision. In some embodiments, the diagnostic testing method(s) further include obtaining a biological sample (e.g., a tissue sample, a urine sample, a blood sample, a cheek swab, a saliva sample, a mucosal sample (e.g., sputum, bronchial secretion), a nipple aspirate, a secretion or an excretion). In some embodiments, the diagnostic testing method(s) include determining exosomal proteins (e.g., an exosomal surface protein (e.g., CD24, CD147, PCA-3)) (Soung et al. (2017) Cancers 9(1):pii:E8). In some embodiments, the diagnostic testing method is an Oncotype DX® test (Baehner (2016) Ecancermedicalscience 10:675).
In some embodiments, the diagnostic testing method is a test, such as without limitation, an alpha-fetoprotein blood test, a bone marrow test, a fecal occult blood test, a human papillomavirus test, low-dose helical computed tomography, a lumbar puncture, a prostate specific antigen (PSA) test, a pap smear, or a tumor marker test.
In some embodiments, the diagnostic testing method includes determining the level of a known protein biomarker (e.g., CA-125 or prostate specific antigen (PSA)). For example, a high amount of CA-125 can be found in the blood of a subject who has ovarian cancer, endometrial cancer, fallopian tube cancer, pancreatic cancer, stomach cancer, esophageal cancer, colon cancer, liver cancer, breast cancer, or lung cancer. The term “biomarker” as used herein refers to “a biological molecule found in blood, other bodily fluids, or tissues that is a sign of a normal or abnormal process, or of a condition or disease”, e.g., as defined by the National Cancer Institute (see, e.g., the URL www.cancer.gov/publications/dictionaries/cancer-terms?CdrID=45618). A biomarker can include a genetic biomarker such as, without limitation, a nucleic acid (e.g., a DNA molecule or an RNA molecule (e.g., a microRNA, a long non-coding RNA (lncRNA), or other non-coding RNA)). A biomarker can include a protein biomarker such as, without limitation, a peptide, a protein, or a fragment thereof.
In some embodiments, the biomarker is FLT3, NPM1, CEBPA, PRAM1, ALK, BRAF, KRAS, EGFR, Kit, NRAS, JAK2, HPV, ERBB2, BCR-ABL, BRCA1, BRCA2, CEA, AFP, and/or LDH. See, e.g., Easton et al. (1995) Am. J. Hum. Genet. 56: 265-271, Hall et al. (1990) Science 250: 1684-1689, Lin et al. (2008) Ann. Intern. Med. 149: 192-199, Allegra et al. (2009) J. Clin. Oncol. 27: 2091-2096, Paik et al. (2004) N. Engl. J. Med. 351: 2817-2826, Bang et al. (2010) Lancet 376: 687-697, Piccart-Gebhart et al. (2005) N. Engl. J. Med. 353: 1659-1672, Romond et al. (2005) N. Engl. J. Med. 353: 1673-1684, Locker et al. (2006) J. Clin. Oncol. 24: 5313-5327, Gilligan et al. (2010) J. Clin. Oncol. 28: 3388-3404, Harris et al. (2007) J. Clin. Oncol. 25: 5287-5312; Henry and Hayes (2012) Mol. Oncol. 6: 140-146. In some embodiments, the biomarker is a biomarker for detection of breast cancer in a subject, such as, without limitation, MUC-1, CEA, p53, urokinase plasminogen activator, BRCA1, BRCA2, and/or HER2 (Gam (2012) World J. Exp. Med. 2(5): 86-91). In some embodiments, the biomarker is a biomarker for detection of lung cancer in a subject, such as, without limitation, KRAS, EGFR, ALK, MET, and/or ROS1 (Mao (2002) Oncogene 21: 6960-6969; Korpanty et al. (2014) Front Oncol. 4: 204). In some embodiments, the biomarker is a biomarker for detection of ovarian cancer in a subject, such as, without limitation, HPV, CA-125, HE4, CEA, VCAM-1, KLK6/7, GST1, PRSS8, FOLR1, and/or ALDH1 (Nolen and Lokshin (2012) Future Oncol. 8(1): 55-71; Sarojini et al. (2012) J. Oncol. 2012:709049). In some embodiments, the biomarker is a biomarker for detection of colorectal cancer in a subject, such as, without limitation, MLH1, MSH2, MSH6, PMS2, KRAS, and/or BRAF (Gonzalez-Pons and Cruz-Correa (2015) Biomed. Res. Int. 2015: 149014; Alvarez-Chaver et al. (2014) World J. Gastroenterol. 20(14): 3804-3824).
In some embodiments, the diagnostic testing method determines the presence and/or expression level of a nucleic acid (e.g., microRNA (Sethi et al. (2011) J. Carcinog. Mutag. S1-005), RNA, a SNP (Hosein et al. (2013) Lab. Invest. doi: 10.1038/labinvest.2013.54; Falzoi et al. (2010) Pharmacogenomics 11: 559-571), methylation status (Castelo-Branco et al. (2013) Lancet Oncol 14: 534-542), a hotspot cancer mutation (Yousem et al. (2013) Chest 143: 1679-1684)). Non-limiting examples of methods of detecting a nucleic acid in a sample include: PCR, RT-PCR, sequencing (e.g., next generation sequencing methods, deep sequencing), a DNA microarray, a microRNA microarray, a SNP microarray, fluorescent in situ hybridization (FISH), restriction fragment length polymorphism (RFLP), gel electrophoresis, Northern blot analysis, Southern blot analysis, chromogenic in situ hybridization (CISH), chromatin immunoprecipitation (ChIP), SNP genotyping, and DNA methylation assay. See, e.g., Meldrum et al. (2011) Clin. Biochem. Rev. 32(4): 177-195; Sidransky (1997) Science 278(5340): 1054-9.
In some embodiments, the diagnostic testing method includes determining the presence of a protein biomarker in a sample (e.g., a plasma biomarker (Mirus et al. (2015) Clin. Cancer Res. 21(7): 1764-1771)). Non-limiting examples of methods of determining the presence of a protein biomarker include: western blot analysis, immunohistochemistry (IHC), immunofluorescence, mass spectrometry (MS) (e.g., matrix assisted laser desorption/ionization (MALDI)-MS, surface enhanced laser desorption/ionization time-of-flight (SELDI-TOF)-MS), enzyme-linked immunosorbent assay (ELISA), flow cytometry, proximity assay (e.g., VeraTag proximity assay (Shi et al. (2009) Diagnostic molecular pathology: the American journal of surgical pathology, part B: 18: 11-21, Huang et al. (2010) Am. J. Clin. Pathol. 134: 303-11)), and a protein microarray (e.g., an antibody microarray (Ingvarsson et al. (2008) Proteomics 8: 2211-9, Woodbury et al. (2002) J. Proteome Res. 1: 233-237), an IHC-based microarray (Stromberg et al. (2007) Proteomics 7: 2142-50), or a microarray ELISA (Schroder et al. (2010) Mol. Cell. Proteomics 9: 1271-80)). In some embodiments, the method of determining the presence of a protein biomarker is a functional assay. In some embodiments, the functional assay is a kinase assay (Ghosh et al. (2010) Biosensors & Bioelectronics 26: 424-31, Mizutani et al. (2010) Clin. Cancer Res. 16: 3964-75, Lee et al. (2012) Biomed. Microdevices 14: 247-57) or a protease assay (Lowe et al. (2012) ACS nano. 6: 851-7, Fujiwara et al. (2006) Breast cancer 13: 272-8, Darragh et al. (2010) Cancer Res 70: 1505-12). See, e.g., Powers and Palecek (2015) J. Healthc. Eng. 3(4): 503-534, for a review of protein analytical assays for diagnosing cancer patients.
In some embodiments, any appropriate disease or condition associated with one or more chromosomal anomalies as described herein (e.g., based at least in part on the presence of one or more chromosomal anomalies, such as, without limitation, an aneuploidy) is identified as described herein. In some embodiments, the disease is cancer. Examples of cancers that can be associated with one or more chromosomal anomalies include, without limitation, lung cancer (e.g., small cell lung carcinoma or non-small cell lung carcinoma), papillary thyroid cancer, medullary thyroid cancer, differentiated thyroid cancer, recurrent thyroid cancer, refractory differentiated thyroid cancer, lung adenocarcinoma, bronchioles lung cell carcinoma, multiple endocrine neoplasia type 2A or 2B (MEN2A or MEN2B, respectively), pheochromocytoma, parathyroid hyperplasia, breast cancer, colorectal cancer (e.g., metastatic colorectal cancer), papillary renal cell carcinoma, ganglioneuromatosis of the gastroenteric mucosa, inflammatory myofibroblastic tumor, cervical cancer, acute lymphoblastic leukemia (ALL), acute myeloid leukemia (AML), cancer in adolescents, adrenal cancer, adrenocortical carcinoma, anal cancer, appendix cancer, astrocytoma, atypical teratoid/rhabdoid tumor, basal cell carcinoma, bile duct cancer, bladder cancer, bone cancer, brain stem glioma, brain tumor, breast cancer, bronchial tumor, Burkitt lymphoma, carcinoid tumor, unknown primary carcinoma, cardiac tumors, cervical cancer, childhood cancers, chordoma, chronic lymphocytic leukemia (CLL), chronic myelogenous leukemia (CML), chronic myeloproliferative neoplasms, colon cancer, colorectal cancer, craniopharyngioma, cutaneous T-cell lymphoma, bile duct cancer, ductal carcinoma in situ, embryonal tumors, endometrial cancer, ependymoma, esophageal cancer, esthesioneuroblastoma, Ewing sarcoma, extracranial germ cell tumor, extragonadal germ cell tumor, extrahepatic bile duct cancer, eye cancer, fallopian tube cancer, fibrous histiocytoma of bone, gallbladder cancer, gastric cancer, gastrointestinal carcinoid tumor, gastrointestinal stromal tumors (GIST), germ cell tumor, gestational trophoblastic disease, glioma, hairy cell tumor, hairy cell leukemia, head and neck cancer, heart cancer, hepatocellular cancer, histiocytosis, Hodgkin's lymphoma, hypopharyngeal cancer, intraocular melanoma, islet cell tumors, pancreatic neuroendocrine tumors, Kaposi sarcoma, kidney cancer, Langerhans cell histiocytosis, laryngeal cancer, leukemia, lip and oral cavity cancer, liver cancer, lung cancer, lymphoma, macroglobulinemia, malignant fibrous histiocytoma of bone, osteocarcinoma, melanoma, Merkel cell carcinoma, mesothelioma, metastatic squamous neck cancer, midline tract carcinoma, mouth cancer, multiple endocrine neoplasia syndromes, multiple myeloma, mycosis fungoides, myelodysplastic syndromes, myelodysplastic/myeloproliferative neoplasms, myelogenous leukemia, myeloid leukemia, multiple myeloma, myeloproliferative neoplasms, nasal cavity and paranasal sinus cancer, nasopharyngeal cancer, neuroblastoma, non-Hodgkin's lymphoma, non-small cell lung cancer, oral cancer, oral cavity cancer, lip cancer, oropharyngeal cancer, osteosarcoma, ovarian cancer, pancreatic cancer, hepatobiliary cancer, upper urinary tract cancer, papillomatosis, paraganglioma, paranasal sinus and nasal cavity cancer, parathyroid cancer, penile cancer, pharyngeal cancer, pheochromocytoma, pituitary cancer, plasma cell neoplasm, pleuropulmonary blastoma, pregnancy and breast cancer, primary central nervous system lymphoma, primary peritoneal cancer, prostate cancer, rectal cancer, renal cell cancer, retinoblastoma, rhabdomyosarcoma, salivary gland cancer, sarcoma, Sezary syndrome, skin cancer, small cell lung cancer, small intestine cancer, soft tissue sarcoma, squamous cell carcinoma, squamous neck cancer, stomach cancer, T-cell lymphoma, testicular cancer, throat cancer, thymoma and thymic carcinoma, thyroid cancer, transitional cell cancer of the renal pelvis and ureter, unknown primary carcinoma, urethral cancer, uterine cancer, uterine sarcoma, vaginal cancer, vulvar cancer, Waldenstrom Macroglobulinemia, Wilms' tumor, 1p36 deletion syndrome, 1q21.1 deletion syndrome, 2q37 deletion syndrome, Wolf-Hirschhorn syndrome, Cri du chat, 5q deletion syndrome, Williams syndrome, Monosomy 8p, Monosomy 8q, Alfi's syndrome, Kleefstra syndrome, Monosomy 10p, Monosomy 10q, Jacobsen syndrome, Patau syndrome, Angelman syndrome, Prader-Willi syndrome, Miller-Dieker syndrome, Smith-Magenis syndrome, Edwards syndrome, Down syndrome, DiGeorge syndrome, Phelan-McDermid syndrome, 22q11.2 distal deletion syndrome, Cat eye syndrome, XYY syndrome, Triple X syndrome, Klinefelter syndrome, Wolf-Hirschhorn syndrome, Jacobsen syndrome, Charcot-Marie-Tooth disease type 1A, and Lynch Syndrome.
Once identified as having a disease associated with one or more chromosomal anomalies as described herein (e.g., based at least in part on the presence of one or more chromosomal anomalies, such as, without limitation, an aneuploidy), a mammal (e.g., a human) can be treated accordingly. For example, when a mammal is identified as having a cancer associated with one or more chromosomal anomalies as described herein, the mammal can be treated with one or more cancer treatments. The one or more cancer treatments can include any appropriate cancer treatments. A cancer treatment can include surgery. A cancer treatment can include radiation therapy. A cancer treatment can include administration of a pharmacotherapy such as chemotherapy, hormone therapy, targeted therapy, and/or cytotoxic therapy. Examples of cancer treatments include, without limitation, platinum compounds (such as cisplatin or carboplatin), taxanes (such as paclitaxel or docetaxel), albumin bound paclitaxel (nab-paclitaxel), altretamine, capecitabine, cyclophosphamide, etoposide (vp-16), gemcitabine, ifosfamide, irinotecan (cpt-11), liposomal doxorubicin, melphalan, pemetrexed, topotecan, vinorelbine, luteinizing-hormone-releasing hormone (LHRH) agonists (such as goserelin and leuprolide), anti-estrogen therapy (such as tamoxifen), aromatase inhibitors (such as letrozole, anastrozole, and exemestane), angiogenesis inhibitors (such as bevacizumab), poly(ADP)-ribose polymerase (PARP) inhibitors (such as olaparib, rucaparib, and niraparib), external beam radiation therapy, brachytherapy, radioactive phosphorus, and any combinations thereof.
In some embodiments, methods provided herein to detect aneuploidy (e.g., using the analysis of chromosomal sequences (see e.g., Table 1 for an exemplary list of repetitive elements that can be analyzed)) increase sensitivity of cancer detection compared to cancer detection using the presence of one or more genetic biomarkers as indicators of cancer. In some embodiments, methods provided herein to detect aneuploidy (e.g., using the analysis of chromosomal sequences (see e.g., Table 1 for an exemplary list of repetitive elements that can be analyzed)) increase sensitivity of cancer detection compared to cancer detection using the presence of one or more protein biomarkers as indicators of cancer.
In some embodiments, methods provided herein to detect aneuploidy (e.g., using the analysis of chromosomal sequences (see e.g., Table 1 for an exemplary list of repetitive elements that can be analyzed)) are combined with one or more methods to detect the presence of one or more genetic biomarkers (e.g., mutations). In some embodiments, the combination of aneuploidy detection with genetic biomarker detection increases the specificity and/or sensitivity of detecting cancer. In some embodiments, methods provided herein to detect aneuploidy (e.g., using the analysis of chromosomal sequences (see e.g., Table 1 for an exemplary list of repetitive elements that can be analyzed)) are combined with one or more methods to detect the presence of one or more members of a panel of protein biomarkers (e.g., peptides). In some embodiments, the combination of aneuploidy detection with protein biomarker detection increases the specificity and/or sensitivity of detecting cancer. In some embodiments, methods provided herein to detect aneuploidy (e.g., using the analysis of chromosomal sequences (see e.g., Table 1 for an exemplary list of repetitive elements that can be analyzed)) are combined with methods to detect the presence of one or more genetic biomarkers (e.g., mutations) and/or methods to detect the presence of one or more members of a panel of protein biomarkers (e.g., peptide). In some embodiments, the combination of aneuploidy detection with genetic and/or protein biomarker detection increases the specificity and/or sensitivity of detecting cancer.
In some embodiments, methods provided herein to detect aneuploidy are combined with methods to detect the presence of one or more genetic biomarkers (e.g., mutations) in one or more genes selected from the group consisting of: NRAS, PTEN, FGFR2, KRAS, POLE, AKT1, TP53, RNF43, PPP2R1A, MAPK1, CTNNB1, PIK3CA, FBXW7, PIK3R1, APC, EGFR, and BRAF. In some embodiments, methods provided herein to detect aneuploidy are combined with methods to detect the presence of one or more genetic biomarkers (e.g., mutations) in one or more genes selected from the group consisting of: PTEN, TP53, PIK3CA, PIK3R1, CTNNB1, KRAS, FGFR2, POLE, APC, FBXW7, RNF43, and PPP2R1A. In some embodiments, an assay includes detection of genetic biomarkers (e.g., mutations) in one or more of any of the genes disclosed herein including, without limitation, CDKN2A, FGF2, GNAS, ABL1, EVI1, MYC, APC, IL2, TNFAIP3, ABL2, EWSR1, MYCL1, ARHGEF12, JAK2, TP53, AKT1, FEV, MYCN, ATM, MAP2K4, TSC1, AKT2, FGFR1, NCOA4, BCL11B, MDM4, TSC2, ATF1, FGFR1OP, NFKB2, BLM, MEN1, VHL, BCL11A, FGFR2, NRAS, BMPR1A, MLH1, WRN, BCL2, FUS, NTRK1, BRCA1, MSH2, WT1, BCL3, GOLGA5, NUP214, BRCA2, NF1, BCL6, GOPC, PAX8, CARS, NF2, BCR, HMGA1, PDGFB, CBFA2T3, NOTCH1, BRAF, HMGA2, PIK3CA, CDH1, NPM1, CARD11, HRAS, PIM1, CDH11, NR4A3, CBLB, IRF4, PLAG1, CDK6, NUP98, CBLC, JUN, PPARG, SMAD4, PALB2, CCND1, KIT, PTPN11, CEBPA, PML, CCND2, KRAS, RAF1, CHEK2, PTEN, CCND3, LCK, REL, CREB1, RB1, CDX2, LMO2, RET, CREBBP, RUNX1, CTNNB1, MAF, ROS1, CYLD, SDHB, DDB2, MAFB, SMO, DDX5, SDHD, DDIT3, MAML2, SS18, EXT1, SMARCA4, DDX6, MDM2, TCL1A, EXT2, SMARCB1, DEK, MET, TET2, FBXW7, SOCS1, EGFR, MITF, TFG, FH, STK11, ELK4, MLL, TLX1, FLT3, SUFU, ERBB2, MPL, TPR, FOXP1, SUZ12, ETV4, MYB, USP6, GPC3, SYK, ETV6, IDH1, and/or TCF3. In some embodiments, combining the detection of aneuploidy with the detection of one or more genetic biomarkers (e.g., mutations) increases the specificity and/or sensitivity of detecting cancer.
In some embodiments, detection of a genetic biomarker (e.g., one or more genetic biomarkers) includes any of the variety of methods described in U.S. Pat. No. 7,700,286, which is hereby incorporated by reference in its entirety. Any of the variety of methods of messenger RNA (“mRNA”) isolation known in the art may be used to isolate RNA from a sample (e.g., Qiagen RNeasy Kit). Any of the variety of methods of genomic DNA (“gDNA”) isolation known in the art may be used to isolate gDNA from the sample (e.g., Qiagen DNeasy Kit). In some embodiments, detection of a genetic biomarker includes a cancer detection assay. In some embodiments, the amount of gDNA and/or mRNA in a sample is measured for any of the genetic biomarkers disclosed herein. Changes in the amount of gDNA and/or mRNA may indicate cancer. For example, when measuring gDNA, gene amplification (e.g., increased copy number of chromosomal sequences (e.g., coding regions of genes or non-coding DNA (see e.g., Table 1 for an exemplary list of repetitive elements that can be measured)))) may indicate cancer. For example, when measuring mRNA, increases in the amount of RNA (e.g., increased expression of a genetic biomarker) may indicate cancer. In some cases, changes in DNA and RNA may correlate.
In some embodiments, methods provided herein to detect aneuploidy can be combined with methods to detect the presence of one or more protein biomarkers (e.g., peptides) in one or more proteins selected from the group consisting of: AFP, CA19-9, CEA, HGF, OPN, CA-125, CA15-3, MPO, prolactin (PRL) and/or TIMP-1 to determine the presence of cancer (e.g., ovarian or endometrial). In some embodiments, a protein biomarker can be any appropriate peptide biomarker. In some embodiments, a peptide biomarker can be a peptide biomarker associated with cancer. For example, a peptide biomarker can be a peptide having elevated levels in a cancer (e.g., as compared to a reference level of the peptide).
Exemplary and non-limiting threshold levels for certain protein biomarkers include: CA19-9 (>92 U/ml), CEA (>7,507 pg/ml), CA125 (>577 U/ml), AFP (>21,321 pg/ml), Prolactin (>145,345 pg/ml), HGF (>899 pg/ml), OPN (>157,772 pg/ml), TIMP-1 (>176,989 pg/ml), Follistatin (>1,970 pg/ml), and CA15-3 (>98 U/ml). In some embodiments, threshold levels for protein biomarkers can be higher (e.g., about 10%, about 20%, about 30%, about 40%, about 50%, about 60%, about 70%, about 80%, about 90%, about 100%, or higher) than the exemplary threshold levels described herein. In some embodiments, threshold levels for protein biomarkers can be lower (e.g., about 10%, about 20%, about 30%, about 40%, about 50%, or lower) than the exemplary threshold levels described herein.
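As an illustration only, the exemplary thresholds listed above can be held in a lookup table and compared against measured levels. The following is a minimal sketch, not the method of this disclosure: the function name `elevated_biomarkers`, the `scale` parameter (a hypothetical multiplier for raising or lowering every threshold by a percentage), and the input format are assumptions introduced for this example. Measured levels are assumed to be supplied in the same units as the listed thresholds.

```python
# Exemplary, non-limiting thresholds copied from the paragraph above.
# Units are kept exactly as listed there (U/ml or pg/ml).
EXEMPLARY_THRESHOLDS = {
    "CA19-9": (92.0, "U/ml"),
    "CEA": (7507.0, "pg/ml"),
    "CA125": (577.0, "U/ml"),
    "AFP": (21321.0, "pg/ml"),
    "Prolactin": (145345.0, "pg/ml"),
    "HGF": (899.0, "pg/ml"),
    "OPN": (157772.0, "pg/ml"),
    "TIMP-1": (176989.0, "pg/ml"),
    "Follistatin": (1970.0, "pg/ml"),
    "CA15-3": (98.0, "U/ml"),
}


def elevated_biomarkers(levels, scale=1.0):
    """Return the sorted names of biomarkers whose level exceeds its threshold.

    levels: dict mapping biomarker name -> measured level, in the units
            used by EXEMPLARY_THRESHOLDS (an assumption of this sketch).
    scale:  hypothetical multiplier applied to every threshold, e.g. 1.1
            for thresholds about 10% higher, 0.9 for about 10% lower.
    """
    flagged = []
    for name, value in levels.items():
        if name not in EXEMPLARY_THRESHOLDS:
            continue  # no exemplary threshold is listed for this biomarker
        threshold, _unit = EXEMPLARY_THRESHOLDS[name]
        if value > threshold * scale:
            flagged.append(name)
    return sorted(flagged)


if __name__ == "__main__":
    sample = {"CA19-9": 100.0, "CEA": 5000.0, "HGF": 900.0}
    print(elevated_biomarkers(sample))
    print(elevated_biomarkers(sample, scale=1.1))
```

Keeping the threshold and its unit together in one table makes it harder to compare a value in ng/mL against a threshold in pg/ml by accident.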
In some embodiments, a threshold level of CA19-9 can be at least about 92 U/mL (e.g., about 92 U/mL). In some embodiments, a threshold level of CA19-9 can be 92 U/mL. In some embodiments, a threshold level of CEA can be at least about 7,507 pg/ml (e.g., about 7,507 pg/ml). In some embodiments, a threshold level of CEA can be 7.5 ng/mL. In some embodiments, a threshold level of HGF can be at least about 899 pg/ml (e.g., about 899 pg/ml). In some embodiments, a threshold level of HGF can be 0.92 ng/mL. In some embodiments, a threshold level of OPN can be at least about 157,772 pg/ml (e.g., about 157,772 pg/ml). In some embodiments, a threshold level of OPN can be 158 ng/mL. In some embodiments, a threshold level of CA125 can be at least about 577 U/ml (e.g., about 577 U/ml). In some embodiments, a threshold level of CA125 can be 577 U/mL. In some embodiments, a threshold level of AFP can be at least about 21,321 pg/ml (e.g., about 21,321 pg/ml). In some embodiments, a threshold level of AFP can be 21,321 pg/ml. In some embodiments, a threshold level of prolactin can be at least about 145,345 pg/ml (e.g., about 145,345 pg/ml). In some embodiments, a threshold level of prolactin can be 145,345 pg/ml. In some embodiments, a threshold level of TIMP-1 can be at least about 176,989 pg/ml (e.g., about 176,989 pg/ml). In some embodiments, a threshold level of TIMP-1 can be 176,989 pg/ml. In some embodiments, a threshold level of follistatin can be at least about 1,970 pg/ml (e.g., about 1,970 pg/ml). In some embodiments, a threshold level of CA15-3 can be at least about 98 U/ml (e.g., about 98 U/ml). In some embodiments, a threshold level of CA15-3 can be 98 U/ml. 
In some embodiments, a threshold level of CA19-9, CEA, and/or OPN can be 5%, 10%, 15%, 20%, 25%, 30%, 35%, 40%, 45%, 50%, 55%, 60%, 65%, 70%, 75%, 80%, 85%, 90%, 95%, 100% or more greater than the threshold levels listed above (e.g., greater than a threshold level of 92 U/mL for CA19-9, 7,507 pg/ml for CEA, 899 pg/ml for HGF, 157,772 pg/ml for OPN, 577 U/ml for CA125, 21,321 pg/ml for AFP, 145,345 pg/ml for prolactin, 176,989 pg/ml for TIMP-1, 1,970 pg/ml for follistatin, and/or 98 U/ml for CA15-3).
In some embodiments, a threshold level of protein biomarker can be greater than the levels that are typically tested for diagnostic or clinical purposes. For example, the threshold level of CA19-9 can be greater than about 37 U/ml (e.g., greater than about 40, 45, 50, 55, 60, 65, 70, 75, 80, 85, 90, 95 or more U/mL). Additionally or alternatively, the threshold level of CEA can be greater than about 2.5 ug/L (e.g., greater than about 3.0, 3.5, 4.0, 4.5, 5.0, 5.5, 6.0, 6.5, 7.0, 7.5 or more ug/L). Additionally or alternatively, the threshold level of CA125 can be greater than about 35 U/mL (e.g., greater than about 40, 45, 50, 55, 60, 65, 70, 75, 80, 85, 90, 95, 100, 150, 200, 250, 300, 350, 400, 450, 500, 550 or more U/mL). Additionally or alternatively, the threshold level of AFP can be greater than about 21 ng/mL (e.g., greater than about 25, 30, 40, 50, 60, 70, 80, 90, 100, 150, 200, 250, 300, 350, 400 or more ng/L). Additionally or alternatively, the threshold level of TIMP-1 can be greater than about 2300 ng/mL (e.g., greater than about 2,500, 3,000, 4,000, 5,000, 6,000, 7,000, 8,000, 9,000, 10,000, 15,000, 20,000, 25,000, 30,000, 35,000, 40,000 or more ng/L). Additionally or alternatively, the threshold level of follistatin can be greater than about 2 ug/mL (e.g., greater than about 2.5, 3.0, 3.5, 4.0, 4.5, 5.0, 5.5, 6.0, 6.5, 7.0, 7.5 or more ug/L). Additionally or alternatively, the threshold level of CA15-3 can be greater than about 30 U/mL (e.g., greater than about 35, 40, 45, 50, 55, 60, 65, 70, 75, 80, 85, 90, 95 or more U/mL). In some embodiments, detecting one or more protein biomarkers at threshold levels that are higher than are typically tested for during traditional diagnostic or clinical assays can improve the sensitivity of cancer detection.
Examples of peptide biomarkers include, without limitation, AFP, Angiopoietin-2, AXL, CA125, CA 15-3, CA19-9, CD44, CEA, CYFRA 21-1, DKK1, Endoglin, FGF2, Follistatin, Galectin-3, G-CSF, GDF15, HE4, HGF, IL-6, IL-8, Kallikrein-6, Leptin, LRG-1, Mesothelin, Midkine, Myeloperoxidase, NSE, OPG, OPN, PAR, Prolactin, sEGFR, sFas, SHBG, sHER2/sEGFR2/sErbB2, sPECAM-1, TGFa, Thrombospondin-2, TIMP-1, TIMP-2, and Vitronectin. For example, a peptide biomarker can include one or more of OPN, IL-6, CEA, CA125, HGF, Myeloperoxidase, CA19-9, Midkine and/or TIMP-1. In some embodiments, combining the detection of aneuploidy with the detection of one or more protein biomarkers (e.g., peptides) increases the specificity and/or sensitivity of detecting cancer.
In some embodiments, the presence of a genetic and/or protein biomarker may be detected in any of a variety of biological samples isolated or obtained from a subject (e.g., a human subject) including, but not limited to blood, plasma, serum, urine, cerebrospinal fluid, saliva, sputum, broncho-alveolar lavage, bile, lymphatic fluid, cyst fluid, stool, ascites, and combinations thereof. Any protein biomarker known in the art may be detected when a threshold value is obtained above which normal, healthy human subjects do not fall, but human subjects with cancer do fall. Any appropriate method can be used to detect the level of one or more protein biomarkers as described herein. In some embodiments, the level of one or more protein biomarkers is compared to a predetermined threshold. In some embodiments, the predetermined threshold is a general or global threshold. In some embodiments, the predetermined threshold is a threshold that is relevant to a particular protein biomarker. In some embodiments, the level of the one or more protein biomarkers is compared to an absolute amount of a reference protein biomarker. In some embodiments, the level of the one or more protein biomarkers is relative to an amount of a reference protein biomarker. In some embodiments, the level of the one or more protein biomarkers is an elevated level. In some embodiments, the level of the one or more protein biomarkers is above a predetermined threshold. In some embodiments, the level of the one or more protein biomarkers is within a predetermined threshold range. In some embodiments, the level of the one or more protein biomarkers is or approximates a predetermined threshold. In some embodiments, the level of the one or more protein biomarkers is below a predetermined threshold. In some embodiments, the level of the one or more protein biomarkers from a biological sample is lower than a particular threshold. 
In some embodiments, the level of the one or more protein biomarkers from a biological sample is depressed compared to a predetermined threshold.
In some embodiments, methods and materials described herein can be used for detecting one or more polymorphisms (e.g., somatic mutations) in a genome of a mammal. For example, a plurality of amplicons obtained from a sample obtained from a first mammal (e.g., a test mammal or a mammal suspected of harboring one or more polymorphisms) can be sequenced, a plurality of amplicons obtained from a sample obtained from a second mammal (e.g., a reference mammal) can be sequenced, variant sequencing reads from the sample obtained from the first mammal can be grouped into clusters of genomic intervals, reference sequencing reads from the sample obtained from the second mammal can be grouped into clusters of genomic intervals, a chromosome arm having a sum of the variant sequencing reads and the reference sequencing reads on both alleles that is greater than about 3 (e.g., greater than about 4, greater than about 5, greater than about 6, greater than about 7, greater than about 8, greater than about 9, greater than about 10, greater than about 12, greater than about 15, greater than about 18, greater than about 20, greater than about 22, greater than about 25, or greater than about 30) can be selected, a variant-allele frequency (VAF) of the selected chromosome arm can be determined, and the presence or absence of one or more polymorphisms on the selected chromosome arm can be identified. A VAF of the selected chromosome arm can be determined using any appropriate technique. For example, a VAF of the selected chromosome arm can be the number of variant sequencing reads/total number of sequencing reads. 
The presence of one or more polymorphisms in the genome of the mammal can be identified when the VAF is between about 0.2 and about 0.8 (e.g., between about 0.3 and about 0.8, between about 0.4 and about 0.8, between about 0.5 and about 0.8, between about 0.6 and about 0.8, between about 0.2 and about 0.7, between about 0.2 and about 0.6, between about 0.2 and about 0.5, or between about 0.2 and about 0.4), and the absence of one or more polymorphisms in the genome of the mammal can be identified when the VAF is within a predetermined significance threshold. For example, without limitation, the presence of one or more polymorphisms in the genome of the mammal can be identified when the VAF is between about 0.4 and about 0.6.
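The VAF calculation and the range check described above can be sketched as follows. This is an illustrative sketch under stated assumptions: the function names, the default depth cutoff of 3 reads, and the default VAF window of 0.2 to 0.8 are taken from the exemplary values in the text, while the decision to return `None` for low-depth regions and `False` for a VAF outside the window is a simplification introduced here, since the text leaves the "predetermined significance threshold" for absence unspecified.

```python
def variant_allele_frequency(variant_reads, reference_reads):
    """VAF = number of variant reads / total number of reads."""
    total = variant_reads + reference_reads
    if total <= 0:
        raise ValueError("at least one sequencing read is required")
    return variant_reads / total


def call_polymorphism(variant_reads, reference_reads,
                      min_total=3, low=0.2, high=0.8):
    """Return True/False for presence of a polymorphism, or None if not assessed.

    min_total: regions whose summed variant and reference reads are not
               greater than this value are skipped (returns None).
    low, high: VAF window in which a polymorphism is called present.
    Treating a VAF outside the window as "not called" (False) is an
    assumption of this sketch, not a rule stated in the text.
    """
    total = variant_reads + reference_reads
    if total <= min_total:
        return None  # insufficient depth to evaluate this region
    vaf = variant_allele_frequency(variant_reads, reference_reads)
    return low <= vaf <= high


if __name__ == "__main__":
    print(variant_allele_frequency(5, 5))
    print(call_polymorphism(5, 5))
    print(call_polymorphism(1, 19))
    print(call_polymorphism(1, 1))
```

A narrower window (for example, `low=0.4, high=0.6`) corresponds to the stricter exemplary range given above.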
In some embodiments, methods and materials described herein can be used for sample identification. The repetitive elements amplified by the methods described herein include common polymorphisms that can be used to establish or refute sample identity among samples (e.g., plasma, tumor, and blood). For example, the genotype at each polymorphic location can be identified and compared across samples. Overall similarities between samples at polymorphic locations can be used to determine sample identity.
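One simple way to compare genotypes across samples is to compute the fraction of shared polymorphic locations at which two samples have the same genotype. The sketch below is illustrative only: the locus keys, the genotype strings, the function names, and the 0.95 concordance cutoff are all hypothetical values chosen for the example, and the text does not specify a particular similarity measure.

```python
def genotype_concordance(sample_a, sample_b):
    """Fraction of shared polymorphic loci with identical genotypes.

    sample_a, sample_b: dicts mapping a locus identifier (hypothetical
    format, e.g. "chr1:1000") to a genotype string (e.g. "AG").
    """
    shared = set(sample_a) & set(sample_b)
    if not shared:
        raise ValueError("samples share no genotyped loci")
    matches = sum(1 for locus in shared if sample_a[locus] == sample_b[locus])
    return matches / len(shared)


def same_source(sample_a, sample_b, min_concordance=0.95):
    """Hypothetical decision rule: call identity at or above the cutoff."""
    return genotype_concordance(sample_a, sample_b) >= min_concordance


if __name__ == "__main__":
    plasma = {"chr1:1000": "AA", "chr2:2000": "AG",
              "chr3:3000": "GG", "chr4:4000": "CT"}
    tumor = {"chr1:1000": "AA", "chr2:2000": "AG",
             "chr3:3000": "GG", "chr4:4000": "CC"}
    print(genotype_concordance(plasma, tumor))
    print(same_source(plasma, plasma))
```

In practice a cutoff would need to tolerate genotype differences caused by sequencing error or by somatic changes in tumor-derived samples, which is why the threshold is exposed as a parameter.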
In some cases the diseases associated with one or more chromosomal anomalies as described herein (e.g., based at least in part on the presence of one or more chromosomal anomalies, such as, without limitation, an aneuploidy) are also associated with increased mutation rates (e.g., increased mutation rates can be associated with stage of disease) when compared to a control (e.g., non-disease sample). In such cases, the materials and methods described herein can be used to (a) identify the presence of one or more chromosomal anomalies (e.g., aneuploidy) and (b) identify the stage (e.g., cancer stages I, II, III, and IV) of the disease based on a determination of the mutation rate (e.g., number of mutations) compared to a control.
The invention will be further described in the following examples, which do not limit the scope of the invention described in the claims.
This example describes a novel adaptation of amplicon-based aneuploidy detection. An approach called WALDO (Within-Sample-AneupLoidy-DetectiOn), which employs supervised machine learning to detect changes in chromosome arms, improved the sensitivity of aneuploidy detection compared to previous methods. It is shown here that using WALDO to analyze amplicons of short interspersed nuclear elements (SINEs) from a DNA sample increases the sensitivity of aneuploidy detection. In addition, the 1,000,000 SINE amplicons, with an average length of about 100 bp, reduce the amount of cell-free DNA required as input while also increasing the sensitivity of detection.
To generate a list of candidate primers, the frequencies of all possible 6-mers (4^6=4,096) within the RepeatMasker track of hg19 were calculated. Next, the frequencies of all possible 4-mers (4^4=256) within 75 bp upstream or downstream from the 6-mers were calculated. Joining the 6-mers with the 4-mers generated 2,097,152 candidate pairs. These pairs were selected for further assessment based on the number of unique genomic loci expected from their PCR-mediated amplification, the average size between the 6-mer and its corresponding 4-mers, and the distribution of these sizes, aiming for a unimodal distribution. These filtering criteria generated 16 potential k-mer pairs, leading to the design of 16 primer pairs that incorporated these k-mer pairs at their 3′ ends. A k-mer is understood in the art to refer to a subsequence of length k which is contained within a sequence.
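The enumeration described above can be illustrated with a short Python sketch. The function names are hypothetical. Reading the factor of two in the candidate count as the upstream or downstream placement of the 4-mer is an interpretation that matches the stated total (4,096 × 256 × 2 = 2,097,152) but is not stated explicitly in the text.

```python
from itertools import product

BASES = "ACGT"


def all_kmers(k: int) -> list:
    """Enumerate every possible k-mer over the four DNA bases."""
    return ["".join(p) for p in product(BASES, repeat=k)]


def candidate_pair_count(k_anchor: int = 6, k_partner: int = 4, orientations: int = 2) -> int:
    """Number of (6-mer, 4-mer, orientation) combinations; orientations=2 is an assumption."""
    return len(BASES) ** k_anchor * len(BASES) ** k_partner * orientations


def count_kmers(sequence: str, k: int) -> dict:
    """Count occurrences of each k-mer in a sequence, skipping windows with non-ACGT bases."""
    counts = {}
    for i in range(len(sequence) - k + 1):
        kmer = sequence[i:i + k]
        if set(kmer) <= set(BASES):
            counts[kmer] = counts.get(kmer, 0) + 1
    return counts
```

In practice the counting would be restricted to repeat-annotated regions of the reference genome; that step is omitted here.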
In total, 16 primers were initially designed and tested (Table 2). One primer (SEQ ID NO: 1) consistently had fewer primer dimers and was selected for use in testing a cohort. A primer pair having SEQ ID NO: 1 as one of the primers uniquely amplified 745,184 amplicons, which amplicons had an average amplicon size of ˜88 bp (
The first primer having SEQ ID NO: 1 included from the 5′ to 3′ end: a universal primer sequence (UPS), a unique identifier DNA sequence (UID), and an amplification sequence. Polymerase chain reaction (PCR) was performed in 25 uL reactions containing 7.25 uL of water, 0.125 uL of each primer, 12.5 uL of NEBNext Ultra II Q5 Master Mix (New England Biolabs cat #M0544S), and 5 uL of DNA. The cycling conditions were: one cycle of 98° C. for 120 s, then 15 cycles of 98° C. for 10 s, 57° C. for 120 s, and 72° C. for 120 s. For experiments with plasma, the amount of DNA in 5 uL was 0.14 ng. A second round of PCR was then performed to add dual indexes (barcodes) to each PCR prior to sequencing. The forward and reverse primers used for the second round of PCR are listed in Table 2. The initial amplification primers were not removed and the amplification product from the first reaction was diluted 1:20. The dilution was used directly for a second round of amplification using primers that annealed to the UPS site introduced by the first round primers and that additionally contained the 5′ grafting sequences necessary for hybridization to the Illumina flow cell.
Indexes (e.g., sequences used to differentiate between samples) were introduced to each sample using the second reverse primer to later allow multiplexed sequencing. The second round of PCR was performed in 25 uL reactions containing 7.25 uL of water, 0.125 uL of each primer, 12.5 uL of NEBNext Ultra II Q5 Master Mix (New England Biolabs cat #M0544S), and 5 uL of DNA containing 5% of the PCR product from the first round. The cycling conditions were: one cycle of 98° C. for 120 s, then 15 cycles of 98° C. for 10 s, 65° C. for 15 s, and 72° C. for 120 s. Amplification products were run on agarose gels to check for amplification. Amplification products were purified with AMPure XP beads at 1.2× and were quantified by spectrophotometry, real-time PCR, an Agilent 2100 Bioanalyzer, or automated electrophoresis using an Agilent TapeStation. All oligonucleotides were purchased from Integrated DNA Technologies (Coralville, Iowa).
Bowtie2 was used to align reads of the amplicons generated with each of the 7 primer pairs to the human reference genome assembly GRCh37 (Langmead et al. 2012). With primer pair 1 (the primer having SEQ ID NO: 1 and the primer having SEQ ID NO: 10), an average of 51.1% of the total reads could be uniquely aligned and the average amplicon size was 88 bp (
Read-depth-based analytical methods have been widely applied to whole-genome sequencing (WGS) protocols. Under the assumption that reads are uniformly and independently distributed, read counts in regions of normal copy number are expected to follow a Poisson or normal distribution (Zhao et al. 2013 and Pirooznia et al. 2015). Amplicon-based protocols achieve high coverage depth at relatively low cost and are an attractive alternative to WGS, but aligned reads from amplicon sequencing, such as those resulting from the above-described assay, have properties different from those resulting from WGS and WES. Because these reads are limited to a relatively small number of discrete loci, they are discontinuous. The reads are also not randomly distributed, which makes it difficult to use the statistical models of read depth coverage designed for WGS and WES. Within-Sample AneupLoidy DetectiOn (WALDO) is an algorithm specifically designed for amplicon-based aneuploidy detection (see, e.g., Douville et al. PNAS 2018 115(8):1871-1876). WALDO was applied to sequencing reads that mapped to the above-described genomic loci (e.g., SINEs). The genome-wide aneuploidy score was used to identify whether aneuploidy was present in a sample.
Unlike most conventional approaches for assessing copy number changes, WALDO does not compare normalized read counts from each chromosome arm in a test sample to the fraction of reads in each chromosome arm in other samples. Such conventional comparisons are subject to batch effects and other artifacts associated with variables that are difficult to control. To evaluate whole genome sequencing data, aneuploidy was detected by comparing the read counts within 5344 genomic intervals each containing 500-kb of sequence. The read counts within the 500-kb genomic intervals within a sample were only compared to the read counts of other genomic intervals within the same sample—hence the “Within-Sample” designation in WALDO. The previously described WALDO protocol was tailored in this Example, which resulted in several analytical changes (see
In euploid samples, the number of reads within each 500-kb genomic interval should track with the number of reads in certain other genomic regions. Genomic intervals that track together do so because the amplicons within them amplify to similar extents. Here, such genomic regions that track together are called “clusters”. It is possible to identify clusters from sequencing data on euploid samples. In a test sample, it is determined whether the number of reads in each genomic interval in each pre-defined cluster is within the expected bound of the other clusters from that same sample. If the reads within a genomic interval are outside the statistically expected bound, and there are many such outliers on the same chromosome arm, then that chromosome arm is classified as aneuploid. The statistical basis of this test is described elsewhere (e.g., Douville et al. PNAS 2018 115(8):1871-1876). In brief, while the number of reads is not randomly distributed across the genome, the distribution of scaled reads within each cluster is approximately Normal. A convenient property of Normal distributions is that the sum of independent Normally distributed variables is also Normally distributed. It is thus possible to compute the theoretical mean and variance of the summed reads on each chromosome arm simply by summing the means and variances of all the clusters represented on that chromosome arm.
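The summation property described above can be sketched in a few lines of Python. This is an illustrative sketch rather than the WALDO implementation: the function name is hypothetical, and expressing the result as a Z-score of the observed arm total against the summed mean and variance is an assumption consistent with the Z-score features mentioned elsewhere in this Example.

```python
import math


def arm_z_score(observed_sum: float, cluster_means: list, cluster_variances: list) -> float:
    """Z-score of the summed reads on one chromosome arm.

    The expected mean and variance of the arm total are obtained by summing
    the per-cluster means and variances, which is valid when the per-cluster
    terms are independent and approximately Normal.
    """
    expected = sum(cluster_means)
    variance = sum(cluster_variances)
    if variance <= 0:
        raise ValueError("summed variance must be positive")
    return (observed_sum - expected) / math.sqrt(variance)
```

For instance, with cluster means of 40 and 60 and variances of 9 and 16, an observed arm total of 110 lies (110 − 100)/√25 = 2 standard deviations above expectation.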
WALDO also employs several other innovations that make it applicable to the analysis of PCR-generated amplicons from clinical samples. One of these innovations is controlling amplification bias stemming from the strong dependence of the data on the size of the initial template. Another is the use of a machine learning algorithm (e.g., a Support Vector Machine (SVM)) to enable the detection of aneuploidy in samples containing low neoplastic fractions.
The improved WALDO methods described in this Example include a new method of normalization that reduced the amount of variability between samples. In this normalization, a principal component analysis (PCA) was first performed on sequencing data from the controls. PCA reduced the number of 500 kb genomic intervals from n=5,344 to a more manageable number of dimensions. Using the PCA coordinates of the controls, a model was created to predict whether a particular 500 kb interval will be amplified more or less efficiently in future samples based on their PCA coordinates.
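$$\text{Correction Factor for 500 kb Interval}_i = \beta_{0i} + \beta_{1i}\,\mathrm{PCA}_1 + \beta_{2i}\,\mathrm{PCA}_2 + \beta_{3i}\,\mathrm{PCA}_3 + \beta_{4i}\,\mathrm{PCA}_4 + \beta_{5i}\,\mathrm{PCA}_5$$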
For each test sample, the sample was projected into PCA space and the correction factor was calculated for each 500 kb interval as a function of its PCA coordinates. After applying the correction factor to each 500 kb genomic interval, the test sample was matched to 7 control samples based on the closest Euclidean distance of the 500 kb intervals.
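The two steps above (evaluating the per-interval linear model on a sample's first five PCA coordinates, and matching a test sample to its nearest controls) can be sketched as follows. The function names and data layout are hypothetical, and the way the correction factor is applied to the read counts is not specified in the text, so it is deliberately left out.

```python
import math


def correction_factor(betas: list, pca_coords: list) -> float:
    """Evaluate beta_0i + sum_k beta_ki * PCA_k for one 500 kb interval i.

    betas holds [beta_0i, beta_1i, ..., beta_5i]; pca_coords holds the
    sample's [PCA_1, ..., PCA_5].
    """
    if len(betas) != len(pca_coords) + 1:
        raise ValueError("expected one intercept plus one coefficient per PCA coordinate")
    return betas[0] + sum(b * c for b, c in zip(betas[1:], pca_coords))


def closest_controls(test_profile: list, control_profiles: dict, n_matches: int = 7) -> list:
    """Names of the n_matches controls nearest to the test sample (Euclidean distance)."""
    distances = sorted(
        (math.dist(test_profile, profile), name)
        for name, profile in control_profiles.items()
    )
    return [name for _, name in distances[:n_matches]]
```

Here each profile is a list of corrected values, one per 500 kb interval, in the same interval order for every sample.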
Data were selected from 84 presumably euploid plasma samples, each containing at least 10 million reads, and each derived from the DNA of normal WBCs. Synthetic aneuploid samples were created by adding reads to (or subtracting reads from) several chromosome arms in these normal DNA samples. The reads were added to or subtracted from 1, 10, 15, or 20 chromosome arms in each sample. The additions and subtractions were designed to represent neoplastic cell fractions ranging from 0.5% to 1.5% and resulted in synthetic samples containing exactly ten million reads. The reads from each chromosome arm were added or subtracted uniformly. For example, when modeling five chromosome arms that were lost, each was lost to the identical degree, and tumor heterogeneity was not incorporated into the model. Furthermore, no synthetic sample was created with more than three copies of any chromosome arm (e.g., four copies of chromosome 3p were not modeled). This simplified approach did not comprehensively cover all biologically plausible aneuploidy events. However, limiting the possible combinations of altered arms made sample generation computationally tractable, and the resulting support vector machine worked well in practice. The synthetically generated samples in which reads from only a single chromosome arm were added or subtracted made it possible to estimate the performance of WALDO when only a single chromosome arm of interest was gained or lost. The pseudocode to generate synthetic samples is shown in
A two-class support vector machine (SVM) was trained to discriminate between euploid samples and aneuploid samples. The training set contained a negative class of 1348 presumably euploid plasma samples from normal individuals, each containing at least 2.5M reads, and a positive class of 635 aneuploid samples. The aneuploid class contained a mixture of synthetic and actual aneuploid samples. SVM training was done with the e1071 package in R, using a radial basis kernel and default parameters. Each sample had 39 Z-score features, representing chromosome arm gains and losses. During training, the positive class was randomly sampled so that the positive class was 10% the size of the negative class. The positive class was randomly sampled at a ratio of two real samples to one synthetic sample. Ten iterations of this procedure were performed. The final genome-wide aneuploidy score was the average of the raw SVM score across the 10 iterations.
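The following Python sketch illustrates one way such synthetic samples could be generated. It is not the pseudocode referred to above. The function name is hypothetical, and so is the dosage model: a single-copy gain or loss present in a fraction f of cells is assumed to scale an arm's reads by (1 ± f/2). Counts are rescaled to a fixed total and left as floating-point values, so producing exact integer totals would need an additional rounding step.

```python
def synthesize_aneuploid(arm_reads: dict, altered_arms: dict,
                         neoplastic_fraction: float,
                         total_reads: float = 10_000_000) -> dict:
    """Scale reads on altered arms, then renormalize to a fixed read total.

    altered_arms maps an arm name to +1 (single-copy gain) or -1 (single-copy
    loss); arms not listed are left unchanged. All altered arms change to the
    same degree, mirroring the uniform additions and subtractions in the text.
    """
    scaled = {}
    for arm, reads in arm_reads.items():
        change = altered_arms.get(arm, 0)
        if change not in (-1, 0, 1):
            raise ValueError("only single-copy gains or losses are modeled")
        # Assumed dosage model: diploid baseline, so one extra copy in a
        # fraction f of cells raises expected reads by a factor of f/2.
        scaled[arm] = reads * (1 + change * neoplastic_fraction / 2)
    norm = total_reads / sum(scaled.values())
    return {arm: value * norm for arm, value in scaled.items()}
```

For example, a 1% neoplastic fraction with a gain of one arm raises that arm's reads by 0.5% relative to unaltered arms before renormalization.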
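The sampling scheme used during training can be sketched as follows. The SVM itself (trained with the e1071 package in R in this Example) is not reproduced. The function names are hypothetical, and rounding the positive-class size to the nearest integer is an assumption.

```python
import random


def sample_positive_class(real: list, synthetic: list, n_negative: int,
                          rng: random.Random, fraction: float = 0.10,
                          real_per_synthetic: int = 2) -> list:
    """Draw a positive class that is `fraction` the size of the negative class,
    at a ratio of `real_per_synthetic` real samples to one synthetic sample."""
    n_positive = round(n_negative * fraction)
    n_synthetic = n_positive // (real_per_synthetic + 1)
    n_real = n_positive - n_synthetic
    return rng.sample(real, n_real) + rng.sample(synthetic, n_synthetic)


def genome_wide_score(per_iteration_scores: list) -> float:
    """Average the raw classifier scores obtained across training iterations."""
    return sum(per_iteration_scores) / len(per_iteration_scores)
```

With 1348 negative samples this draws 135 positive samples per iteration (90 real and 45 synthetic); repeating the draw ten times and averaging the resulting scores corresponds to the ten iterations described above.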
The performance of this assay was assessed on a cohort of 1348 euploid plasma samples and 883 plasma samples from cancer patients (Table 3). The samples from cancer patients included Breast, Colorectum, Esophagus, Liver, Lung, Ovary, Pancreas, and Stomach cancers (
To ensure that all samples included in the results were of high quality, several exclusion criteria were developed. First, samples with fewer than 2.5M reads were excluded. Second, samples with sufficient evidence of contamination were excluded. To be labeled as contaminated, a sample had to have at least 10 chromosome arms with significant allelic imbalance (z-score ≥ 2.5) and fewer than ten chromosome arms with significant gains or losses (z ≥ 2.5 or z ≤ −2.5). Allelic imbalance was determined from SNPs, while gains or losses were assessed through WALDO. As determined through mixing experiments, a relatively large number of allelically imbalanced chromosome arms in the absence of a large number of gains or losses indicated contamination of the sample with DNA from another individual. Third, in plasma analyses, samples in which more than 8.5% of the amplicons were larger than 94 bp (50 base pairs between the forward and reverse primers) were excluded. Such samples were likely to be contaminated with leukocyte DNA. Fourth, samples outside the dynamic range of the assay, as defined by the equation below, were excluded.
The distribution of this metric has long tails. The values of >0.2450 and 0.2320 were selected as the bounds of a dynamic range over which cutoffs could be evaluated. Fifth, plasma samples from patients with known aneuploidy in their leukocytes were excluded; such patients were assumed to have Clonal Hematopoiesis of Indeterminate Potential (CHIP) or congenital disorders.
It was evaluated whether aneuploidy could be integrated as an additional biomarker into the published framework, and the predictive ability of a logistic regression model using aneuploidy and protein markers was compared against that of the original logistic regression model using somatic mutations and protein markers.
Here, 1348 plasma samples from healthy people and 883 plasma samples from cancer patients were analyzed. Of the 1348 healthy samples, only 248 overlapped with the original study. All 883 cancer samples were included in the original study. The sample demographic information is provided in Table 3.
Using the original 812 healthy samples (Cohen et al.) and the 883 cancer samples, a logistic regression model was trained and its performance was assessed using ten rounds of tenfold cross-validation. A full list of samples and their biomarker values is provided in Table 3. Because 564 of the original healthy samples were not analyzed for aneuploidy, each missing sample was assigned an aneuploidy value randomly sampled from the list of scores from the 1348 normal samples. Ten rounds of analysis were performed, and in each new round the collection of 1348 normal scores was again randomly sampled to assign the 564 samples a new score.
To account for variations in the lower limits of detection across different experiments, the 90th percentile feature value in the healthy training samples was used as a threshold. Any feature value below this threshold was set to the 90th percentile threshold. This transformation was done for all training and testing samples. This procedure was done for aneuploidy scores, somatic mutation scores, and protein concentrations. The 90th percentile thresholds and final feature coefficients from the logistic regression model are listed in Table 4.
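The first three exclusion criteria can be expressed as a simple filter, sketched below in Python. The function and reason names are hypothetical. The fourth criterion is omitted because its defining equation is not reproduced in this text, and the fifth depends on matched leukocyte data that this sketch does not model.

```python
def exclusion_reasons(n_reads: int, n_imbalanced_arms: int, n_gain_loss_arms: int,
                      frac_long_amplicons: float, is_plasma: bool = True) -> list:
    """Return the exclusion criteria (first through third) that a sample triggers.

    n_imbalanced_arms: arms with significant allelic imbalance (z >= 2.5).
    n_gain_loss_arms: arms with significant gain or loss (z >= 2.5 or z <= -2.5).
    frac_long_amplicons: fraction of amplicons larger than 94 bp.
    """
    reasons = []
    if n_reads < 2_500_000:
        reasons.append("low_reads")
    if n_imbalanced_arms >= 10 and n_gain_loss_arms < 10:
        reasons.append("contamination")
    if is_plasma and frac_long_amplicons > 0.085:
        reasons.append("leukocyte_dna")
    return reasons
```

A sample returning an empty list passes these three checks and would still need to be evaluated against the remaining criteria.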
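The percentile-based transformation can be sketched as follows. The function names are hypothetical, and the percentile convention (linear interpolation between order statistics) is an assumption, since the text does not specify one.

```python
def percentile(values: list, pct: float) -> float:
    """Percentile by linear interpolation between order statistics (assumed convention)."""
    ordered = sorted(values)
    rank = (len(ordered) - 1) * pct / 100
    lower = int(rank)
    upper = min(lower + 1, len(ordered) - 1)
    return ordered[lower] + (ordered[upper] - ordered[lower]) * (rank - lower)


def floor_at_healthy_percentile(healthy_training: list, values: list, pct: float = 90.0):
    """Raise every value below the healthy-sample percentile up to that threshold."""
    threshold = percentile(healthy_training, pct)
    return threshold, [max(v, threshold) for v in values]
```

The threshold is computed once from the healthy training samples for each feature and then applied unchanged to both training and testing values.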
Comparison of Aneuploidy Detection Sensitivity with Other Cancer Biomarkers
The aneuploidy results were benchmarked against a driver gene mutation panel and a collection of eight protein markers (AFP, CA-125, CA15-3, CA19-9, CEA, HGF, OPN, TIMP1) that were recently published as key biomarkers for cancer detection in plasma samples (
Reliably detecting aneuploidy in only a few picograms (pg) of DNA is necessary for preimplantation diagnostics as well as forensic applications. In preimplantation diagnosis, a few cells picked from a blastocyst are used to assess copy number variations. For example, preimplantation diagnosis includes identifying a mammal as having aneuploidy related to Down Syndrome. To test the limit of detection with respect to input DNA for the methods featured in this disclosure, samples with aneuploidy associated with trisomy 21 were analyzed at input DNA amounts ranging from 3 to 225 pg. The relationship of reads to DNA was based on negative controls (water wells with no DNA) and the known concentration of the euploid control (
Samples from biobanks with low input DNA were assessed for either aneuploidy or identification purposes. The methods as described herein were applied to 793 plasma DNA samples, which had been stored in PCR plates for as long as 10 years. For each of the wells in the PCR plates, all of the DNA volume had been used for other experiments. Five microliters of water was added to the dried (empty) wells and then subjected to the methods as described herein. In 728 samples, more than 2.5 million aligned reads were sequenced, which is a number sufficient to reliably assess aneuploidy. In 768 of these samples, more than 1 million aligned reads were sequenced, a number sufficient to confirm the identity of the plasma DNA to other samples from the same donor.
Plasma cfDNA is often contaminated with DNA that has leaked out of leukocytes, either through phlebotomy or preparation of plasma. This contaminating leukocyte DNA can reduce the sensitivity of aneuploidy testing from plasma samples because leukocytes are not derived from either fetal cells (in NIPT) or cancer cells (in liquid biopsies). Leukocyte genomic DNA (gDNA) has an average fragment size of >1000 bp while cell-free plasma DNA has an average size of <160 bp. Given that small fragments are amplified more efficiently during a PCR reaction, detection of contaminating leukocyte gDNA is difficult because the shorter cfDNA is preferentially amplified. Application of the methods described herein enabled the detection of contaminating leukocyte gDNA by virtue of the amplicons generated with primers SEQ ID NO: 1 and SEQ ID NO: 10. Using these methods, 1241 amplicons were identified that are typically present in gDNA but not cfDNA. Sequencing reads of these amplicons thereby indicated leukocyte contamination in plasma samples. Through mixing of leukocyte DNA with cell-free plasma DNA and using the methods described herein, samples containing >4% of leukocyte DNA could be detected, as shown in Table 5.
Copy number variants of indeterminate length were detected. First, the log ratio of the observed test sample values to the WALDO-predicted values was calculated for every 500 kb interval across each chromosome arm. Using the log ratio, a circular binary segmentation algorithm was applied to find copy number variants throughout each chromosome arm. Any copy number variant ≤5 Mb in size was flagged. Before calculating the statistical significance across each chromosome arm, these flagged CNVs were removed. In general, small CNVs can be used to assess microdeletions or microamplifications, such as those occurring in DiGeorge Syndrome (chromosome 22q11.2) or in breast cancers (chromosome 17q12).
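The log-ratio and size-flagging steps can be sketched in Python as shown below. Circular binary segmentation itself is not implemented; the segments are assumed to be supplied by a separate segmentation step as inclusive (first, last) interval indices. The function names and the choice of base-2 logarithms are assumptions.

```python
import math

INTERVAL_BP = 500_000  # size of each genomic interval, from the text


def log_ratios(observed: list, predicted: list) -> list:
    """Per-interval log2 ratio of observed to predicted values (log base is an assumption)."""
    return [math.log2(o / p) for o, p in zip(observed, predicted)]


def flag_small_segments(segments: list, max_bp: int = 5_000_000) -> list:
    """Return the segments spanning at most max_bp.

    Each segment is an inclusive (first_interval_index, last_interval_index)
    pair, assumed to come from a segmentation step not shown here.
    """
    flagged = []
    for first, last in segments:
        if (last - first + 1) * INTERVAL_BP <= max_bp:
            flagged.append((first, last))
    return flagged
```

With 500 kb intervals, a segment of ten or fewer consecutive intervals falls at or under the 5 Mb limit and is flagged.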
This Example describes the sensitivity of cancer detection with different multi-analyte tests.
Three different multi-analyte tests were used to evaluate the sensitivity of detecting eight cancers (breast, ovary, liver, lung, pancreas, esophagus, stomach, and colorectum) in plasma samples from patients. The three tests were: (1) a three-component test using aneuploidy status, somatic mutation analysis, and protein biomarker evaluation; (2) a two-component test using aneuploidy status and somatic mutation analysis; and (3) a two-component test using aneuploidy status and protein biomarker evaluation. The eight protein biomarkers tested and somatic mutations tested were as described in Cohen et al., Science 359, pp. 926-930, the entire contents of which are hereby incorporated by reference.
As shown in
As shown in
As shown in
Thus, the data disclosed in this Example show that the three-component multi-analyte test with aneuploidy status, somatic mutation analysis, and protein biomarker evaluation can increase the sensitivity of detecting cancer while maintaining a high specificity of cancer detection.
The materials and methods described herein can be used to identify somatic mutations within the sequences of repetitive elements amplified from a sample (e.g., a tumor sample or a non-tumor sample (i.e., a normal sample)). For example, when two samples, a non-tumor sample and a tumor sample, are available from the same patient, mutations that are in one sample but not the other can be discerned. For each sample, the number of somatic mutations can be counted and the spectrum of single base substitutions (SBS) (e.g., A->T, A->C, etc.) determined. When the samples are also analyzed by exomic sequencing, a correlation between the number of SBSs in the repetitive elements amplified herein and the number of SBSs in the exomes can be determined. Thus, the materials and methods as described herein can be used to identify somatic mutations within a sample.
The materials and methods described herein can be used to identify and/or distinguish samples (e.g., distinguish a sample from one subject from a sample from a second subject). In such cases, samples are identified based on the common polymorphisms present in the repetitive elements amplified by materials and methods described herein. Samples are then distinguished from other samples by comparing the sequence at common polymorphisms between samples. Determining the genotype of each polymorphism for each of the amplicons assigns a genotype to the sample. Genotypes can be compared across samples in order to identify samples (e.g., distinguish a tumor sample from a non-tumor sample or a sample from one subject from a sample from a different subject). Samples can be considered to be from different subjects if concordance (e.g., percent similarity between the genotypes) is <0.99 and at least 5,000 amplicons have adequate coverage.
A set of experiments was performed to assess detection of aneuploidy in different stages and different types of cancer. In these experiments, plasma samples from subjects having different stages of breast, colorectum, esophagus, liver, lung, ovary, pancreas and stomach cancers were isolated according to the methods described herein.
Using the RealSeqS method, aneuploidy was detected more commonly than mutations in plasma samples from cancer patients (49% and 34% of 883 samples, respectively; P<10^-20, one-sided binomial test,
A set of experiments was performed to assess sensitivity of cancer detection when combining aneuploidy detection with protein biomarker detection as described herein. In these experiments, plasma samples from the same cohort as in Example 8 (e.g., different stages of breast, colorectum, esophagus, liver, lung, ovary, pancreas and stomach cancer) were assayed for aneuploidy and protein biomarkers.
A set of experiments was performed to assess performance of RealSeqS compared to other next generation sequencing technologies.
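The concordance comparison can be sketched in Python as follows. The function names and the dictionary-based genotype representation are hypothetical, and treating the number of loci genotyped in both samples as a stand-in for the number of amplicons with adequate coverage is an assumption; the 0.99 and 5,000 cutoffs are taken from the text.

```python
def genotype_concordance(sample_a: dict, sample_b: dict):
    """Fraction of shared polymorphic loci with identical genotypes, and the shared-locus count."""
    shared = [locus for locus in sample_a if locus in sample_b]
    if not shared:
        return 0.0, 0
    matches = sum(1 for locus in shared if sample_a[locus] == sample_b[locus])
    return matches / len(shared), len(shared)


def from_different_subjects(sample_a: dict, sample_b: dict,
                            min_concordance: float = 0.99, min_loci: int = 5000):
    """True if the samples appear to come from different subjects, False if they do not,
    and None when too few loci are available to make the call."""
    concordance, n_loci = genotype_concordance(sample_a, sample_b)
    if n_loci < min_loci:
        return None
    return concordance < min_concordance
```

Each sample is represented as a mapping from a locus identifier to a genotype string such as "AA" or "AB".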
In the most common form of NIPT, detection of a gain or loss of a chromosome (e.g., chromosome 21 in Down Syndrome) is the goal. Whole genome sequencing (WGS), FAST-SeqS, and RealSeqS were used to assess performance on samples for DNA admixtures typically encountered in non-invasive prenatal testing (NIPT), i.e., when the fraction of fetal DNA was 5%. For this purpose, actual data obtained with the three methods was used, but then a defined number of reads from various chromosome regions from the same samples were added to simulate what would happen if there was aneuploidy in these regions. The pseudocode used to generate these in silico simulated samples is described
As shown in
Another important aspect of assays for copy number variation is the detection of relatively small regions which are deleted or amplified. For example, the DiGeorge Syndrome deletions are often as small as 1.5 Mb. For data simulating a 5% deletion-containing cell fraction, RealSeqS had 75.0% sensitivity for the 1.5 Mb DiGeorge deletion (at 99% specificity) while WGS and FAST-SeqS had 19.0% and 29.0% sensitivity, respectively (
The detection of amplifications, such as those of ERBB2 in breast cancer, is important for deciding whether patients should be treated with trastuzumab or other targeted therapies. Following the same protocol as described above in this Example, in silico simulated samples with focal amplifications of the ~42 kb ERBB2 gene (20 copies) were generated for WGS, FAST-SeqS, and RealSeqS. RealSeqS detected amplifications in the in silico simulated samples with significantly less sequencing compared to WGS or FAST-SeqS. For a 1% cell fraction, RealSeqS had a 91.0% sensitivity while WGS had 50.0% (
These data show that the RealSeqS technique can detect small regions that are amplified or deleted and that the method has a higher sensitivity at lower amounts of sequencing.
A set of experiments was performed to assess detection of aneuploidy using the RealSeqS method in samples with varying concentrations of tumor-derived DNA. In assessing 302 samples in which the mutant allele fraction had been determined by the analysis of mutations that were present in the plasma (Cohen et al., Science 359, pp. 926-930), aneuploidy was detected in 92% of 65 samples that had mutant allele fractions ≥2%, 71% of 65 samples with mutant allele fractions of 0.5% to 2%, and 49% of 172 samples with mutant allele fractions ranging from 0.01% to 0.5% (
The data show that the RealSeqS method can detect aneuploidy even at low concentrations of tumor DNA, and that the sensitivity of detecting aneuploidy is related to the concentration of circulating tumor DNA in the sample.
It is to be understood that while the invention has been described in conjunction with the detailed description thereof, the foregoing description is intended to illustrate and not limit the scope of the invention, which is defined by the scope of the appended claims. Other aspects, advantages, and modifications are within the scope of the following claims.
This application claims the benefit of U.S. Provisional Application Ser. No. 62/849,662, filed on May 17, 2019; U.S. Provisional Application Ser. No. 62/905,327, filed on Sep. 24, 2019 and U.S. Provisional Application Ser. No. 62/971,050, filed on Feb. 6, 2020. The disclosures of the prior applications are considered part of (and are incorporated by reference herein) the disclosure of this application.
This invention was made with government support under grants CA230691 and CA230400 awarded by the National Institutes of Health. The government has certain rights in the invention.
Filing Document | Filing Date | Country | Kind |
---|---|---|---|
PCT/US2020/033209 | 5/15/2020 | WO |
Number | Date | Country | |
---|---|---|---|
62971050 | Feb 2020 | US | |
62905327 | Sep 2019 | US | |
62849662 | May 2019 | US |