The DNA of most tumors has a reduced content of methylated cytosine residues. This so-called global “hypomethylation” affects primarily DNA sequences that belong to interspersed DNA repeats. In normal human tissues, DNA repeats are predominantly methylated, consistent with the requirement to maintain genomic stability by transcriptional silencing of retroelements whose potential deleterious functions include DNA mobilization as well as the facilitation of recombination events in somatic cells.
Disclosed are methods and compositions of assessing one or more statuses of a subject. Also disclosed are methods and compositions of identifying status biomarkers associated with a status of a subject. Also disclosed are sets of one or more status biomarkers. Also disclosed are methods and compositions of producing status biomarker capture probes.
In some forms of the methods and compositions of assessing one or more statuses of a subject, the method can comprise, for example, determining the methylation state of one or more status biomarkers in the subject, and comparing one or more of the determined methylation states to one or more reference methylation states, wherein a difference, lack of a difference, or both in one or more of the determined methylation states and one or more of the reference methylation states indicates one or more statuses of the subject.
In some forms of the methods and compositions of identifying status biomarkers associated with a status of a subject, the method can comprise, for example, determining the methylation state of one or more status biomarkers in one or more DNA samples, wherein the DNA samples are from sources that are relevant to one or more specific statuses, and comparing one or more of the determined methylation states to one or more reference methylation states, wherein a difference in one or more of the determined methylation states and one or more of the reference methylation states indicates that each status biomarker for which the difference in the methylation states is found is a status biomarker associated with one or more of the specific statuses.
In some forms, the methylation state can be determined by, for example, treating a DNA sample of the subject to differentiate methylated and unmethylated nucleotides, and detecting the level of methylated forms of the one or more status biomarkers in the treated DNA, detecting the level of unmethylated forms of the one or more status biomarkers in the treated DNA, or both, wherein the level of methylated forms of the status biomarkers, the level of unmethylated forms of the status biomarkers, or both indicates the methylation state of the status biomarkers.
In some forms, treating the DNA sample can be accomplished by, for example, incubating the DNA sample with one or more restriction endonucleases and amplifying the incubated DNA, wherein the restriction endonucleases are methylation-sensitive restriction endonucleases, wherein the level of the status biomarkers in the amplified DNA is lower when the status biomarkers have reduced methylation and the level of the status biomarkers in the amplified DNA is higher when the status biomarkers have increased methylation, wherein the level of the status biomarkers comprises the level of methylated forms of the one or more status biomarkers in the treated DNA, the level of unmethylated forms of the one or more status biomarkers in the treated DNA, or both.
In some forms, the restriction endonucleases can further comprise at least one methylation-dependent restriction endonuclease. In some forms, the restriction endonucleases can further comprise at least one methylation-independent restriction endonuclease. In some forms, the restriction endonucleases can comprise AciI and HhaI. In some forms, the restriction endonucleases can comprise McrBC. In some forms, incubating the DNA sample with one or more endonucleases can be accomplished by, for example, incubating different aliquots of the DNA sample with different restriction endonucleases. In some forms, amplifying the incubated DNA can be accomplished by, for example, multiple displacement amplification.
In some forms, treating the DNA sample can be accomplished by, for example, processing the DNA sample with sodium bisulfite.
In some forms, treating the DNA sample can be accomplished by, for example, fragmenting the DNA and separating methylated DNA from unmethylated DNA. In some forms, the DNA can be fragmented by, for example, nebulization, cleavage with a restriction endonuclease, sonication, or a combination. In some forms, methylated DNA can be separated from unmethylated DNA by, for example, binding methylated DNA with a specific binding molecule specific for methyl groups and separating the bound from the unbound DNA. In some forms, the specific binding molecule can comprise, for example, an antibody specific for 5-methylcytosine, methyl-binding protein MBD1, methyl-binding protein MECP2, or a combination.
In some forms, treating the DNA sample can be accomplished by, for example, capturing status biomarker DNA fragments and sequencing the captured status biomarker DNA fragments, wherein the sequencing distinguishes cytosine from methylcytosine, wherein the level of methylcytosine indicates the level of methylated forms of the status biomarkers. In some forms, the status biomarker DNA fragments can be captured by, for example, binding DNA fragments in the DNA sample to status biomarker probes attached to a support. In some forms, one or more of the status biomarker probes can specifically hybridize to one or more repetitive DNA sequences, wherein the one or more of the status biomarker probes comprises degenerate sequence representing a consensus sequence for a family of repetitive DNA sequences. In some forms, each of the status biomarker probes can specifically hybridize to one or more repetitive DNA sequences, wherein each of the status biomarker probes comprises degenerate sequence representing a consensus sequence for a family of repetitive DNA sequences. In some forms, the family of repetitive DNA sequences can be a family of repetitive DNA sequences listed in, for example, Table 1. In some forms, the one or more of the status biomarker probes can comprise different degenerate sequences each representing a consensus sequence for a different one of the families of repetitive DNA sequences listed in, for example, Table 1. In some forms, the one or more of the status biomarker probes can comprise at least 20 different degenerate sequences each representing a different consensus sequence for a different one of the families of repetitive DNA sequences listed in, for example, Table 1. In some forms, the support can comprise, for example, a gel, a bead, a magnetic bead, a plate, a slide, a surface, or a microparticle. In some forms, DNA not captured can be separated from the captured status biomarker DNA fragments.
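By way of illustration only, a degenerate sequence representing a consensus for a family of repetitive DNA sequences could be derived from pre-aligned family members as sketched below. The function name `degenerate_consensus`, the `min_fraction` threshold, and the assumption that the input sequences are already aligned to equal length are hypothetical and are not taken from this disclosure; the single-letter ambiguity codes are the standard IUPAC nucleotide codes.

```python
from collections import Counter

# Standard IUPAC nucleotide ambiguity codes, keyed by the set of bases they stand for.
IUPAC = {
    frozenset("A"): "A", frozenset("C"): "C", frozenset("G"): "G", frozenset("T"): "T",
    frozenset("AG"): "R", frozenset("CT"): "Y", frozenset("CG"): "S", frozenset("AT"): "W",
    frozenset("GT"): "K", frozenset("AC"): "M",
    frozenset("CGT"): "B", frozenset("AGT"): "D", frozenset("ACT"): "H", frozenset("ACG"): "V",
    frozenset("ACGT"): "N",
}


def degenerate_consensus(aligned, min_fraction=0.2):
    """Build a degenerate consensus from equal-length, pre-aligned sequences.

    At each column, every base present in at least `min_fraction` of the
    sequences is kept, and the kept set is written as one IUPAC code.
    `min_fraction` is an illustrative (hypothetical) parameter.
    """
    if not aligned:
        raise ValueError("at least one sequence is required")
    length = len(aligned[0])
    if any(len(seq) != length for seq in aligned):
        raise ValueError("sequences must be aligned to the same length")

    consensus = []
    for column in zip(*aligned):
        counts = Counter(base.upper() for base in column)
        kept = frozenset(
            base for base, n in counts.items()
            if base in "ACGT" and n / len(column) >= min_fraction
        )
        # Columns with no qualifying base (for example, all gaps) fall back to N.
        consensus.append(IUPAC.get(kept, "N"))
    return "".join(consensus)
```

For example, under these assumptions the five aligned toy sequences ACGT, ACGA, ATGT, ACGT, and ACGA yield the degenerate consensus AYGW at the default threshold.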
In some forms, the sequencing can be a form of SMRT sequencing.
In some forms, the method can further comprise, after capturing status biomarker DNA fragments and prior to sequencing the captured status biomarker DNA fragments, releasing the captured status biomarker DNA fragments and recapturing the released status biomarker DNA fragments. In some forms, the status biomarker DNA fragments can be recaptured by binding DNA fragments in the DNA sample to secondary status biomarker probes attached to a support. In some forms, one or more of the status biomarker probes can specifically hybridize to one or more repetitive DNA sequences, wherein the one or more of the status biomarker probes comprises degenerate sequence representing a consensus sequence for a family of repetitive DNA sequences. In some forms, each of the status biomarker probes can specifically hybridize to one or more repetitive DNA sequences, wherein each of the status biomarker probes comprises degenerate sequence representing a consensus sequence for a family of repetitive DNA sequences. In some forms, the family of repetitive DNA sequences can be a family of repetitive DNA sequences listed in, for example, Table 16 and Table 17. In some forms, the family of repetitive DNA sequences can be a family of repetitive DNA sequences listed in Table 16 or 17. For example, the family of repetitive DNA sequences can be the AluY, AluSx, AluSp, AluSg, or AluSc family of repetitive DNA sequences. In some forms, the one or more of the status biomarker probes can comprise different degenerate sequences each representing a consensus sequence for a different one of the families of repetitive DNA sequences listed in, for example, Table 16 and Table 17. In some forms, the one or more of the status biomarker probes can comprise different degenerate sequences each representing a consensus sequence for a different one of the families of repetitive DNA sequences listed in Table 16 or 17, such as AluY, AluSx, AluSp, AluSg, or AluSc. 
In some forms, the support can comprise, for example, gel, a bead, a magnetic bead, a plate, a slide, a surface, or a microparticle. In some forms, DNA not recaptured can be separated from the recaptured status biomarker DNA fragments.
In some forms, detecting the level of the status biomarkers can be accomplished via, for example, an array of probes specific for the status biomarkers. In some forms, the array of probes can be, for example, a microarray.
In some forms, detecting the level of the status biomarkers can be accomplished via, for example, amplifying the processed DNA and determining the ratio of cytosine to thymidine in the amplified DNA and converting the ratio to the level of methylated forms of the status biomarkers. In some forms, the processed DNA can be amplified via, for example, PCR amplification of the status biomarkers using primers specific for the status biomarkers.
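As a minimal sketch of converting a cytosine-to-thymidine ratio into a level of methylated forms, the following assumes that the processed DNA is bisulfite-converted, that conversion is complete, and that unmethylated cytosines are therefore read as thymidine after amplification while methylated cytosines are read as cytosine. The function names are hypothetical and the counts are illustrative.

```python
def methylation_level(c_count: int, t_count: int) -> float:
    """Methylated fraction at a cytosine position: C / (C + T).

    Assumes complete bisulfite conversion, so every T observed at the
    position derives from an unmethylated cytosine.
    """
    total = c_count + t_count
    if total == 0:
        raise ValueError("no informative reads at this position")
    return c_count / total


def level_from_ratio(c_to_t_ratio: float) -> float:
    """Convert a C:T ratio r into the equivalent methylated fraction r / (1 + r)."""
    if c_to_t_ratio < 0:
        raise ValueError("ratio must be non-negative")
    return c_to_t_ratio / (1.0 + c_to_t_ratio)
```

Under these assumptions, 30 cytosine reads and 10 thymidine reads at a position correspond to a C:T ratio of 3 and a methylated fraction of 0.75.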
In some forms, detecting the level of the status biomarkers can be accomplished via, for example, PCR amplification of the status biomarkers using primers specific for the status biomarkers. In some forms, the PCR amplification can be quantitative PCR. In some forms, the PCR amplification can be nanoliter-microarray quantitative PCR.
In some forms, the level of the status biomarkers can be grouped into a plurality of status biomarker families, wherein the level of the status biomarkers in one or more of the families is analyzed, wherein the analyzed level of the status biomarkers in the one or more of the families indicates the methylation state of the status biomarkers in the family. In some forms, the analyzed level of the status biomarkers in one or more of the families can be the average of the levels of the individual status biomarkers in the family. In some forms, one or more of the status biomarker families each independently can consist of, for example, a single class of repetitive DNA element, a single subclass of repetitive DNA element, a single family of repetitive DNA element, a single subfamily of repetitive DNA element, or a combination. In some forms, the analyzed level of the status biomarkers in one or more of the families can be normalized to one or more of the reference methylation states. In some forms, the level of one or more of the status biomarkers can be normalized to one or more of the reference methylation states. In some forms, the level of one or more of the status biomarker families can be normalized to one or more of the reference methylation states. In some forms, the status biomarkers can be grouped according to one or more repetitive DNA sequences that the status biomarkers comprise, wherein each biomarker in each status biomarker family comprises one or more repetitive DNA sequences that belong to a single family of repetitive DNA sequences listed in, for example, Table 1.
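The grouping, averaging, and normalization described above can be sketched as follows. This is an illustration under stated assumptions: the locus identifiers are hypothetical, the family labels are used only as example keys, and normalization is taken to mean division of the family average by the corresponding reference value, which is one possible reading rather than a requirement of the disclosed method.

```python
from collections import defaultdict
from statistics import mean


def family_levels(levels, family_of):
    """Average individual biomarker levels within each status biomarker family.

    `levels` maps a biomarker identifier to its measured level;
    `family_of` maps the same identifier to its repeat family.
    """
    grouped = defaultdict(list)
    for biomarker, level in levels.items():
        grouped[family_of[biomarker]].append(level)
    return {family: mean(values) for family, values in grouped.items()}


def normalize_to_reference(family_level, reference_level):
    """Divide each family average by its reference value (assumed normalization).

    Families with a missing or zero reference value are skipped.
    """
    return {
        family: level / reference_level[family]
        for family, level in family_level.items()
        if reference_level.get(family)
    }


# Hypothetical example data.
levels = {"locus_1": 0.5, "locus_2": 0.25, "locus_3": 0.9}
family_of = {"locus_1": "AluY", "locus_2": "AluY", "locus_3": "THE1B"}
reference = {"AluY": 0.75, "THE1B": 0.9}

averaged = family_levels(levels, family_of)
normalized = normalize_to_reference(averaged, reference)
```

In this toy example the AluY family average is 0.375, which is half of its reference value, while the THE1B family is unchanged relative to its reference.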
In some forms, one or more of the one or more reference methylation states can be a normal methylation state. In some forms, the normal methylation state can be, for example, the methylation state of a healthy subject, the average of the methylation states of healthy subjects, or the average of the methylation states of a population of subjects. In some forms, one or more of the one or more reference methylation states can be, for example, the methylation state of the same subject at a different time, the methylation state of the same subject at an earlier time, the methylation state of the same subject at a later time, or the methylation state of one or more normal cells, tissues, organs, or a combination of the same subject. In some forms, one or more of the one or more reference methylation states can be the methylation state from non-tumor adjacent tissue. In some forms, one or more of the one or more reference methylation states can be a normal methylation state of a status biomarker family.
In some forms, the method can further comprise determining the genetic state of one or more status biomarkers by, for example, comparing one or more of the determined genetic states to one or more reference genetic states, wherein a difference, lack of a difference, or both in one or more of the determined genetic states and one or more of the reference genetic states indicates one or more statuses of the subject. In some forms, the genetic state of one or more status biomarkers can be determined in one or more of the DNA samples.
In some forms, the source of one or more of the DNA samples can be one or more tissues of the subject, organs of the subject, or both. In some forms, the source of one or more of the DNA samples can be a tissue or organ of the subject. In some forms, the source of one or more of the DNA samples can be one or more cells of the subject. In some forms, the source of one or more of the DNA samples can be one or more cells, tissue, skin, lung, head, neck, prostate, breast, ovary, brain, liver, stomach, intestine, kidney, testicle, cervix, uterus, spleen, bone, throat, esophagus, muscle, bodily fluids, blood, urine, semen, lymphatic fluid, cerebrospinal fluid, amniotic fluid, biological samples, tissue culture cells, buccal swabs, mouthwash, stool, tissue slices, biopsy aspiration, or a combination.
In some forms, the subject can be assessed for the status of wellness, level of health, risk to wellness, risk to level of health, or a combination. In some forms, the subject can be assessed for the status of the genome. In some forms, the subject can be assessed for the status of aging, risk of aging, or both. In some forms, the subject can be assessed for the status of cancer, risk of cancer, or both. In some forms, the subject can be assessed for the status of stress response. In some forms, the subject can be assessed for the status of diabetes, risk of diabetes, or both. In some forms, the subject can be assessed for the status of heart disease, risk of heart disease, or both. In some forms, the subject can be assessed for the status of genomic instability. In some forms, the subject can be assessed for the status of tumor burden. In some forms, the subject can be assessed for the status of response to treatment.
In some forms, the subject can be assessed for a change in one or more statuses. In some forms, the change in one or more of the one or more statuses can be assessed compared to an earlier assessment. In some forms, the earlier assessment can have been made at, for example, an earlier time, prior to diagnosis of a disease or condition, prior to a treatment, following diagnosis of a disease or condition, following treatment, or a combination. In some forms, the change in one or more of the one or more statuses can be assessed following the passage of time, prior to diagnosis of a disease or condition, prior to a treatment, following diagnosis of a disease or condition, following treatment, or a combination. In some forms, assessing the subject can comprise assessing one or more tissues of the subject, organs of the subject, or both. In some forms, assessing the subject can comprise assessing a tissue or organ of the subject. In some forms, assessing the subject can comprise assessing one or more cells of the subject.
In some forms, the status biomarkers can comprise nucleic acid sequences in the genome of the species to which the subject belongs. In some forms of the sets of one or more status biomarkers the status biomarkers can comprise, for example, nucleic acid sequences in a genome. In some forms, the nucleic acid sequences can be in proximity to CpG islands or islets, wherein the CpG islands or islets comprise nucleic acid regions greater than 100 nucleotides in length that contain a minimum of 5 CpG residues and have a ratio of CG content to GC content greater than 0.3. In some forms, the CpG islands or islets can comprise nucleic acid regions greater than 200 nucleotides in length. In some forms, the CpG islands or islets can comprise nucleic acid regions greater than 300 nucleotides in length. In some forms, the nucleic acid regions can have a ratio of CG content to GC content greater than 0.4. In some forms, the nucleic acid regions can have a ratio of CG content to GC content greater than 0.5. In some forms, the status biomarkers can be in proximity to CpG islands or islets when they are within 1200 bases of a CpG island or islet.
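The CpG island or islet criteria and the 1200-base proximity rule above can be sketched as follows. Two points are assumptions and not statements of the disclosed method: the "ratio of CG content to GC content" is read literally as the count of CG dinucleotides divided by the count of GC dinucleotides (another common reading would be an observed-to-expected CpG ratio), and genomic intervals are treated as half-open coordinates on the same strand-independent axis. All function and parameter names are hypothetical.

```python
def count_dinucleotide(sequence: str, dinucleotide: str) -> int:
    """Count overlapping occurrences of a dinucleotide in a sequence."""
    sequence = sequence.upper()
    return sum(
        1 for i in range(len(sequence) - 1) if sequence[i:i + 2] == dinucleotide
    )


def is_cpg_islet(sequence, min_length_exclusive=100, min_cpg=5,
                 min_ratio_exclusive=0.3):
    """Apply the length, CpG count, and CG:GC ratio thresholds to one region.

    The ratio is computed as (CG dinucleotides) / (GC dinucleotides),
    which is an assumed interpretation of the text.
    """
    if len(sequence) <= min_length_exclusive:
        return False
    cpg = count_dinucleotide(sequence, "CG")
    if cpg < min_cpg:
        return False
    gpc = count_dinucleotide(sequence, "GC")
    ratio = cpg / gpc if gpc else float("inf")
    return ratio > min_ratio_exclusive


def in_proximity(marker_start, marker_end, islet_start, islet_end, max_gap=1200):
    """True if a biomarker overlaps an islet or lies within `max_gap` bases of it.

    Intervals are half-open [start, end); the gap is 0 when they overlap.
    """
    gap = max(islet_start - marker_end, marker_start - islet_end, 0)
    return gap <= max_gap
```

The stricter forms described above (regions greater than 200 or 300 nucleotides, ratios greater than 0.4 or 0.5) correspond to passing different threshold arguments to `is_cpg_islet`.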
In some forms, one or more of the status biomarkers can overlap with all or part of a CpG island or islet. In some forms, the one or more of the status biomarkers can comprise a probe binding site, wherein the probe binding site of the one or more of the status biomarkers is specific for a probe. In some forms, one or more of the probes can be specific for a repetitive DNA sequence locus, wherein the repetitive DNA sequence locus comprises one or more repetitive DNA sequences, wherein independently for each of the one or more of the probes one or more of the repetitive DNA sequences belongs to a family of repetitive DNA sequences listed in, for example, Table 1. In some forms, each probe can be specific for a repetitive DNA sequence locus, wherein independently for each probe one or more of the repetitive DNA sequences belongs to a family of repetitive DNA sequences listed in, for example, Table 1. In some forms, one or more of the probes can be specific for a repetitive DNA sequence locus, wherein the repetitive DNA sequence locus comprises one or more repetitive DNA sequences, wherein for one or more of the probes one or more of the repetitive DNA sequences is an interspersed repeat element. In some forms, each probe can be specific for a repetitive DNA sequence locus, wherein for each probe one or more of the repetitive DNA sequences is an interspersed repeat element.
In some forms, one or more of the status biomarkers can comprise a PCR amplicon. In some forms, the PCR amplicon of each of the one or more of the status biomarkers can be defined by a first primer specific for a single one of the status biomarkers and a second primer. In some forms, the PCR amplicon of each of the one or more of the status biomarkers can be defined by the same first primer specific for a first type of repetitive DNA sequence and a second primer, wherein the second primer is specific for a second type of repetitive DNA sequence, wherein the second primer is the same for some and different for some of the one or more of the status biomarkers. In some forms, the first primer can be specific for one of the families of repetitive DNA sequences listed in Table 16 or 17, wherein independently for each of the one or more of the status biomarkers the second primer is specific for a family of repetitive DNA sequences listed in, for example, Table 1.
In some forms, one or more of the status biomarkers can comprise one or more repetitive DNA sequences, wherein independently for each of the one or more of the status biomarkers that comprise repetitive DNA sequences one or more of the repetitive DNA sequences belongs to a family of repetitive DNA sequences listed in, for example, Table 1. In some forms, each status biomarker can comprise a repetitive DNA sequence, wherein independently for each of the status biomarkers the repetitive DNA sequence belongs to a family of repetitive DNA sequences listed in, for example, Table 1. In some forms, one or more of the status biomarkers can comprise one or more repetitive DNA sequences, wherein for one or more of the status biomarkers that comprise repetitive DNA sequences one or more of the repetitive DNA sequences is an interspersed repeat element. In some forms, each status biomarker can comprise a repetitive DNA sequence, wherein for each status biomarker the repetitive DNA sequence is an interspersed repeat element.
In some forms, the methylation state of more than 100 biomarkers can be determined. In some forms, the methylation state of more than 1000 biomarkers can be determined. In some forms, the methylation state of more than 10,000 biomarkers can be determined. In some forms, the methylation state of more than 100,000 biomarkers can be determined. In some forms, the methylation state of more than 200,000 biomarkers can be determined. In some forms, the status biomarkers can comprise a set of status biomarkers. In some forms, the set can comprise more than 100 status biomarkers. In some forms, the set can comprise more than 1000 status biomarkers. In some forms, the set can comprise more than 10,000 status biomarkers. In some forms, the set can comprise more than 100,000 status biomarkers. In some forms, the set can comprise more than 200,000 status biomarkers.
In some forms, a plurality of the biomarkers can independently belong to one or more status biomarker families, wherein each biomarker in each status biomarker family comprises one or more repetitive DNA sequences that belong to a single family of repetitive DNA sequences listed in, for example, Table 1. In some forms, a plurality of biomarkers can independently belong to two or more status biomarker families. In some forms, a plurality of biomarkers can independently belong to three or more status biomarker families. In some forms, a plurality of biomarkers can independently belong to four or more status biomarker families. In some forms, a plurality of biomarkers can independently belong to five or more status biomarker families. In some forms, a plurality of biomarkers can independently belong to ten or more status biomarker families. In some forms, a plurality of biomarkers can independently belong to twenty or more status biomarker families.
In some forms, 100 or more biomarkers can belong to one or more of the status biomarker families. In some forms, 100 or more biomarkers can belong to each of the status biomarker families. In some forms, 200 or more biomarkers can belong to one or more of the status biomarker families. In some forms, 200 or more biomarkers can belong to each of the status biomarker families. In some forms, 300 or more biomarkers can belong to one or more of the status biomarker families. In some forms, 300 or more biomarkers can belong to each of the status biomarker families. In some forms, 400 or more biomarkers can belong to one or more of the status biomarker families. In some forms, 400 or more biomarkers can belong to each of the status biomarker families.
In some forms, the status biomarkers can comprise a set of status biomarkers. In some forms, the members of the set of status biomarkers can be status biomarkers that indicate the status of one or more specific statuses. In some forms, the one or more specific statuses can comprise, for example, wellness, level of health, risk to wellness, risk to level of health, status of the genome, genomic instability, aging, risk of aging, cancer, risk of cancer, head and neck cancer, risk of head and neck cancer, breast cancer, risk of breast cancer, lung cancer, risk of lung cancer, prostate cancer, risk of prostate cancer, colon cancer, risk of colon cancer, esophageal cancer, risk of esophageal cancer, ovarian cancer, risk of ovarian cancer, liver cancer, risk of liver cancer, pancreatic cancer, risk of pancreatic cancer, skin cancer, risk of skin cancer, melanoma, risk of melanoma, lymphoma, risk of lymphoma, leukemia, risk of leukemia, cervical cancer, risk of cervical cancer, cervical dysplasia, risk of cervical dysplasia, cervical intraepithelial neoplasia, risk of cervical intraepithelial neoplasia, tumor burden, stress response, diabetes, risk of diabetes, heart disease, risk of heart disease, and/or response to treatment.
In some forms, the one or more specific statuses can comprise the presence of a disease or condition. In some forms, the one or more specific statuses can comprise, for example, a lack of wellness, low level of health, risk to wellness, risk to level of health, poor status of the genome, genomic instability, aging, risk of aging, cancer, risk of cancer, head and neck cancer, risk of head and neck cancer, breast cancer, risk of breast cancer, lung cancer, risk of lung cancer, prostate cancer, risk of prostate cancer, colon cancer, risk of colon cancer, esophageal cancer, risk of esophageal cancer, ovarian cancer, risk of ovarian cancer, liver cancer, risk of liver cancer, pancreatic cancer, risk of pancreatic cancer, skin cancer, risk of skin cancer, melanoma, risk of melanoma, lymphoma, risk of lymphoma, leukemia, risk of leukemia, cervical cancer, risk of cervical cancer, cervical dysplasia, risk of cervical dysplasia, cervical intraepithelial neoplasia, risk of cervical intraepithelial neoplasia, tumor burden, stress response, diabetes, risk of diabetes, heart disease, and/or risk of heart disease.
In some forms of the methods and compositions of producing status biomarker capture probes, the method can comprise, for example, selecting a subset of repetitive DNA sequence loci from a set of repetitive DNA sequence loci, generating a set of status biomarker capture probe sequences, and synthesizing one or more status biomarker capture probes. In some forms, the repetitive DNA sequence loci in the set of repetitive DNA sequence loci can belong to a single one of the families of repetitive DNA sequence listed in, for example, Table 1, wherein the subset of repetitive DNA sequence loci can be selected by identifying those repetitive DNA sequence loci that comprise a repetitive DNA sequence belonging to one of the families of repetitive DNA sequences listed in, for example, Table 16 and Table 17.
In some forms, each status biomarker capture probe sequence in the set can have a length of 50 bases or more, wherein each status biomarker capture probe represented in the set of status biomarker capture probe sequences can hybridize to at least 5% of the repetitive DNA sequence loci in the selected subset of repetitive DNA sequence loci. In some forms, each status biomarker capture probe can have the sequence of one of the status biomarker capture probe sequences.
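The length and coverage criteria for capture probe sequences can be illustrated with the filter below. It is a sketch only: it assumes that the loci to which each candidate probe is predicted to hybridize have already been determined by some separate procedure and are supplied as sets of locus identifiers. The function name `select_probes`, the parameter names, and the candidate data are hypothetical.

```python
def select_probes(candidates, predicted_hits, n_loci,
                  min_length=50, min_fraction=0.05):
    """Keep candidate probe sequences that meet length and coverage thresholds.

    `candidates` maps a probe name to its sequence.
    `predicted_hits` maps a probe name to the set of loci in the selected
    subset that the probe is predicted to hybridize to (computed elsewhere).
    `n_loci` is the number of loci in the selected subset.
    """
    if n_loci <= 0:
        raise ValueError("the selected subset must contain at least one locus")
    selected = []
    for name, sequence in candidates.items():
        coverage = len(predicted_hits.get(name, set())) / n_loci
        if len(sequence) >= min_length and coverage >= min_fraction:
            selected.append(name)
    return selected


# Hypothetical candidates: only probe_a is both long enough and broad enough.
candidates = {"probe_a": "A" * 60, "probe_b": "A" * 40, "probe_c": "C" * 60}
predicted_hits = {
    "probe_a": set(range(10)),
    "probe_b": set(range(50)),
    "probe_c": {1},
}
kept = select_probes(candidates, predicted_hits, n_loci=100)
```

Tighter forms, such as a length of 100 bases or more or hybridization to at least 10% of the loci, correspond to passing `min_length=100` or `min_fraction=0.10`.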
In some forms, the repetitive DNA sequence loci in the set of repetitive DNA sequence loci can belong to a single one of the families of repetitive DNA sequence LTR54B, MER11B, MER34B, LTR56, THE1B, HERV9, LTR14C, HERVFH21, LTR6B, LTR46, MLT1D, MER67D, HERVK11, LTR10B, HERVK22, MER6, MER66C, MLT1G1, MER4D, and MLTD2. In some forms, the repetitive DNA sequence in the subset of repetitive DNA sequence loci can belong to one of the families of repetitive DNA sequences listed in Table 16 or 17, such as AluY, AluSx, AluSp, AluSg, AluSc, LTR9, or LTR9B.
In some forms, the method can further comprise selecting one or more additional subsets of repetitive DNA sequence loci each from a different additional set of repetitive DNA sequence loci, generating one or more additional sets of status biomarker capture probe sequences each based on one of the one or more additional subsets, and synthesizing one or more additional status biomarker capture probes, wherein each additional status biomarker capture probe has the sequence of one of the additional status biomarker capture probe sequences. In some forms, the repetitive DNA sequence loci in each additional set of repetitive DNA sequence loci can independently belong to a different single one of the families of repetitive DNA sequence listed in, for example, Table 1, wherein the repetitive DNA sequence loci in the set of repetitive DNA sequence loci and in each additional set of repetitive DNA sequence loci belong to different families of repetitive DNA sequence.
In some forms, the repetitive DNA sequence loci in each additional set of repetitive DNA sequence loci can independently belong to a single one of the families of repetitive DNA sequence LTR54B, MER11B, MER34B, LTR56, THE1B, HERV9, LTR14C, HERVFH21, LTR6B, LTR46, MLT1D, MER67D, HERVK11, LTR10B, HERVK22, MER6, MER66C, MLT1G1, MER4D, and MLTD2. In some forms, each status biomarker capture probe sequence in the set can have a length of 100 bases or more. In some forms, each status biomarker capture probe represented in the set of status biomarker capture probe sequences can hybridize to at least 10% of the repetitive DNA sequence loci in the selected subset of repetitive DNA sequence loci. In some forms, the set of status biomarker capture probe sequences can comprise from 1 to 100 status biomarker probe capture sequences. In some forms, the set of status biomarker capture probe sequences can comprise from 5 to 100 status biomarker probe capture sequences. In some forms, the set of status biomarker capture probe sequences can comprise from 10 to 100 status biomarker probe capture sequences. In some forms, one or more of the additional sets of status biomarker capture probe sequences each can comprise from 1 to 100 status biomarker probe capture sequences. In some forms, the one or more additional sets of status biomarker capture probe sequences each can comprise from 5 to 100 status biomarker probe capture sequences. In some forms, the one or more additional sets of status biomarker capture probe sequences each can comprise from 10 to 100 status biomarker probe capture sequences.
Additional advantages of the disclosed method and compositions will be set forth in part in the description which follows, and in part will be understood from the description, or may be learned by practice of the disclosed method and compositions. The advantages of the disclosed method and compositions will be realized and attained by means of the elements and combinations particularly pointed out in the appended claims. It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the invention as claimed.
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate several embodiments of the disclosed method and compositions and together with the description, serve to explain the principles of the disclosed method and compositions.
The disclosed method and compositions may be understood more readily by reference to the following detailed description of particular embodiments and the Example included therein and to the Figures and their previous and following description.
It has been discovered that the methylation status and/or level of certain loci in genomes can be used to assess and determine the status of subjects, tissues, and cells. For example, it has been discovered that the methylation status and/or level of certain repetitive DNA sequence loci and families of repetitive DNA sequence loci can distinguish the presence, absence, and/or risk or progress toward a variety of diseases and conditions.
The DNA of most tumors has a reduced content of methylated cytosine residues. This so-called global “hypomethylation” affects primarily DNA sequences that belong to interspersed DNA repeats. In normal human tissues, DNA repeats are predominantly methylated, consistent with the requirement to maintain genomic stability by transcriptional silencing of retroelements whose potential deleterious functions include DNA mobilization as well as the facilitation of recombination events in somatic cells. There have been a considerable number of reports of transcriptional activation of retrotransposons in the context of loss of DNA methylation. Expression of human endogenous retroviruses (HERVs) has been detected in breast cancer (Wang-Johanning et al., 2001), ovarian cancer (Menendez et al., 2004; Wang-Johanning et al., 2007), leukemia cell lines (Patzke et al., 2002), and urothelial and renal cell carcinomas (Florl et al., 1999). Increased transcriptional expression of HERV-K has been reported in teratocarcinoma (Löwer et al., 1984; Herbst et al., 1998), breast cancer cells and adjacent tissues (Wang-Johanning et al., 2003; Golan et al., 2008), and melanoma (Muster et al., 2003; Büscher et al., 2006; Serafino et al., 2009). Stauffer et al. (2004) used massively parallel signature sequencing (MPSS) to define the number and type of transcripts of endogenous retroviruses of the LTR family in various cancers. This study reported that HERV-H, a relatively young retrotransposon, was expressed in cancers of the intestine, bone marrow, bladder, and cervix, and was more highly expressed than the other families in cancers of the stomach, colon, and prostate. Recently, Alves et al. (2008) reported that a specific HERV-H element present in the X chromosome is selectively transcribed in 60% of colon cancers, and in a high proportion of metastatic colon cancers. There is evidence for context-specific induction of LINE-1 transcription during oxidative stress (Teneng et al., 2007).
In a relatively large study of squamous head and neck carcinomas, Smith et al. (2007) reported that the DNA methylation level of LINE-1 elements was significantly reduced, and correlated with environmental insults such as alcohol use and smoking, as well as tumor stage.
Disclosed are methods and compositions of assessing one or more statuses of a subject. Also disclosed are methods and compositions of identifying status biomarkers associated with a status of a subject. Also disclosed are sets of one or more status biomarkers. Also disclosed are methods and compositions of producing status biomarker capture probes.
It is to be understood that the disclosed methods and compositions are not limited to specific synthetic methods, specific analytical techniques, or to particular reagents unless otherwise specified, and, as such, may vary. It is also to be understood that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting.
Disclosed are materials, compositions, and components that can be used for, can be used in conjunction with, can be used in preparation for, or are products of the disclosed methods and compositions. These and other materials are disclosed herein, and it is understood that when combinations, subsets, interactions, groups, etc. of these materials are disclosed that while specific reference to each of the various individual and collective combinations and permutations of these compounds may not be explicitly disclosed, each is specifically contemplated and described herein. For example, if a status biomarker is disclosed and discussed and a number of modifications that can be made to a number of molecules including the status biomarker are discussed, each and every combination and permutation of status biomarker and the modifications that are possible are specifically contemplated unless specifically indicated to the contrary. Thus, if a class of molecules A, B, and C are disclosed as well as a class of molecules D, E, and F and an example of a combination molecule, A-D is disclosed, then even if each is not individually recited, each is individually and collectively contemplated. Thus, in this example, each of the combinations A-E, A-F, B-D, B-E, B-F, C-D, C-E, and C-F are specifically contemplated and should be considered disclosed from disclosure of A, B, and C; D, E, and F; and the example combination A-D. Likewise, any subset or combination of these is also specifically contemplated and disclosed. Thus, for example, the sub-group of A-E, B-F, and C-E are specifically contemplated and should be considered disclosed from disclosure of A, B, and C; D, E, and F; and the example combination A-D. This concept applies to all aspects of this application including, but not limited to, steps in methods of making and using the disclosed compositions.
Thus, if there are a variety of additional steps that can be performed it is understood that each of these additional steps can be performed with any specific embodiment or combination of embodiments of the disclosed methods, and that each such combination is specifically contemplated and should be considered disclosed.
Status biomarkers as used herein refer to nucleic acid sequences in a genome the methylation levels of which can be used to assess the status of a subject and/or one or more diseases, conditions, and/or states in a subject. Status biomarkers also include groups of such nucleic acid sequences, in the case of collective status biomarkers. Example 2 illustrates the identification of biomarkers from which status biomarkers can be identified, and all of the examples illustrate how to identify status biomarkers and how to use them for assessing the status of subjects and samples. Biomarkers from which status biomarkers are selected can be referred to as prospective status biomarkers.
Useful nucleic acid sequences for use as status biomarkers and nucleic acid sequences from which status biomarkers can be selected can include CpG islands or CpG islets and a unique sequence in proximity to a CpG island or CpG islet. Thus, status biomarkers and prospective status biomarkers can be loci having a unique sequence in proximity to a CpG island or CpG islet. CpG islands and CpG islets are described below and elsewhere herein. Proximity to a CpG island or CpG islet is described below and elsewhere herein. By unique sequence, in the context of status biomarkers, is meant a sequence of sufficient length and having a nucleotide sequence distinctive enough to be uniquely identified in the genome by a probe. For example, nucleic acid sequences of, or of at least, 15, 20, 25, 30, 35, 40, 45, 50, 55, 60, 65, 70, 75, 80, 85, 90, 95, or 100 nucleotides in length can be used as unique sequences. Unique sequences can be identified by, for example, analysis of a genome sequence or by analysis of probe hybridization. The examples of selection of unique sequences herein make use of analysis of the human genome sequence. Status biomarkers are referred to herein by different terms such as variables, classifiers, and category classifiers.
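The notion of a unique sequence identified by analysis of a genome sequence can be illustrated by the following minimal sketch. It assumes a genome held in memory as a single upper-case string and exact matching on both strands. The function names and the default minimum length of 15 nucleotides (the smallest length listed above) are illustrative choices only, and the sketch does not model mismatch-tolerant probe hybridization.

```python
def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an upper-case DNA string."""
    complement = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}
    return "".join(complement[base] for base in reversed(seq))


def count_occurrences(genome: str, candidate: str) -> int:
    """Count exact matches of candidate in genome, including overlapping ones."""
    count = 0
    start = genome.find(candidate)
    while start != -1:
        count += 1
        start = genome.find(candidate, start + 1)
    return count


def is_unique_sequence(genome: str, candidate: str, min_length: int = 15) -> bool:
    """True if candidate is long enough and matches exactly one genomic site.

    Assumption: uniqueness means a single exact match counted over both strands.
    """
    if len(candidate) < min_length:
        return False
    forward = count_occurrences(genome, candidate)
    rc = reverse_complement(candidate)
    # A candidate equal to its own reverse complement must not be counted twice.
    reverse = 0 if rc == candidate else count_occurrences(genome, rc)
    return forward + reverse == 1
```

A candidate that occurs in more than one place, or that is shorter than the chosen minimum length, is rejected by this sketch.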
In some forms of the sets of one or more status biomarkers the status biomarkers can comprise, for example, nucleic acid sequences in a genome. In some forms, the status biomarkers can comprise nucleic acid sequences in the genome of the species to which the subject belongs. In some forms, the nucleic acid sequences can be in proximity to CpG islands or islets. CpG islands and CpG islets are one significant location of DNA methylation that can affect gene expression. Example 2 describes the criteria used for selecting CpG islands and CpG islets, which were more lax than standard selection criteria. The CpG islands or islets can comprise nucleic acid regions of or greater than, for example, 20, 30, 40, 50, 60, 70, 80, 90, 100, 120, 140, 160, 180, 200, 220, 240, 260, 280, 300, 350, 400, or 500 nucleotides in length that contain a minimum of 5, 6, 7, 8, 9, 10, 11, or 12 CpG residues. The CpG islands and islets can have a ratio of CG content to GC content of or greater than, for example, 0.2, 0.3, 0.35, 0.38, 0.4, 0.40, 0.42, 0.43, 0.44, 0.45, 0.46, 0.47, 0.48, 0.49, 0.5, 0.50, 0.51, 0.52, 0.53, 0.54, 0.55, 0.56, 0.58, 0.59, 0.6, 0.60, 0.62, 0.65, 0.7, or 0.8. The sequence(s) that define the status biomarkers can be considered to be in proximity to CpG islands or islets when they are within 800, 850, 900, 950, 1000, 1050, 1100, 1150, 1200, 1250, 1300, 1350, 1400, 1450, 1500, 1600, 1700, 1800, 1900, or 2000 bases of a CpG island or islet.
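As a rough illustration of how such length, CpG count, ratio, and proximity criteria could be applied, the following sketch may be helpful. It rests on several assumptions that are not taken from Example 2: the "ratio of CG content to GC content" is interpreted here as the commonly used observed-to-expected CpG ratio, (number of CpG × length) / (number of C × number of G); the default thresholds (100 nucleotides, 8 CpG residues, ratio 0.6, 1000 bases) are single values picked from the lists above for illustration; and all function names are hypothetical.

```python
def cpg_stats(region: str):
    """Return (length, CpG count, observed/expected CpG ratio) for a region."""
    region = region.upper()
    length = len(region)
    cpg_count = region.count("CG")
    c_count = region.count("C")
    g_count = region.count("G")
    # Assumed interpretation of the ratio: observed CpG over expected CpG.
    if c_count == 0 or g_count == 0:
        ratio = 0.0
    else:
        ratio = (cpg_count * length) / (c_count * g_count)
    return length, cpg_count, ratio


def meets_island_or_islet_criteria(region: str, min_length: int = 100,
                                   min_cpg: int = 8, min_ratio: float = 0.6) -> bool:
    """Apply illustrative length, CpG count, and ratio thresholds."""
    length, cpg_count, ratio = cpg_stats(region)
    return length >= min_length and cpg_count >= min_cpg and ratio >= min_ratio


def in_proximity(seq_start: int, seq_end: int, island_start: int,
                 island_end: int, max_distance: int = 1000) -> bool:
    """True if a sequence lies within max_distance bases of an island or islet.

    Coordinates are half-open [start, end); overlapping intervals have gap 0.
    """
    if seq_end <= island_start:
        gap = island_start - seq_end
    elif island_end <= seq_start:
        gap = seq_start - island_end
    else:
        gap = 0
    return gap <= max_distance
```

Changing the default arguments to other values from the lists above gives stricter or more lax selection under the same assumed interpretation.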
569 repetitive DNA sequence families were identified from among the loci identified as CpG island- or CpG islet-containing loci as described in Example 2. Table 18 is a list of these repetitive DNA sequence families. Among the 569 repetitive element families comprising the full set of repetitive DNA sequence status biomarkers, a subset of 138 was identified that are most effective as classifiers. This subset was generated by merging the top 75 categories identified by a Random Forest analysis with another 75 categories that were the best performers using a Support Vector Machine classifier. This produced the list of Top 138 status biomarkers (Table 1). Each of these families represents multiple repetitive DNA sequence loci. Selected loci belonging to these families can be probed via unique sequences in the loci. Useful loci for the Top 138 families are specifically identified in Table 15 by listing of start and ending coordinates of example probe sequences in the loci. The loci identified by these probe sequences can be assessed, probed, detected, etc. according to the disclosed methods. The probe sequences identified in Table 15 are only examples of probe sequences that can be used to detect and assess the identified loci.
In some forms, one or more of the status biomarkers can overlap with all or part of a CpG island or islet. In some forms, the one or more of the status biomarkers can comprise a probe binding site, wherein the probe binding site of the one or more of the status biomarkers is specific for a probe. Probe binding sites can be, for example, all or a portion of a unique sequence in the status biomarker. In some forms, one or more of the probes can be specific for a repetitive DNA sequence locus, wherein the repetitive DNA sequence locus comprises one or more repetitive DNA sequences, wherein independently for each of the one or more of the probes one or more of the repetitive DNA sequences belongs to a family of repetitive DNA sequences listed in, for example, Table 1.
A repetitive DNA sequence is a DNA sequence that is repeated numerous times in a genome. Repetitive DNA sequences can also be referred to as repetitive DNA elements, repetitive sequences, repetitive elements, and repetitive DNA sequence elements. Repetitive DNA sequences can be repeated in different patterns in the genome, such as interspersed repetitive DNA sequences and tandem repetitive DNA sequences. A repetitive DNA sequence locus refers to a locus that includes one or more repetitive DNA sequences. An example of a repetitive DNA sequence locus is shown in
In some forms, one or more of the probes can be specific for a repetitive DNA sequence locus, wherein the repetitive DNA sequence locus comprises one or more repetitive DNA sequences, wherein for one or more of the probes one or more of the repetitive DNA sequences is an interspersed repeat element. In some forms, each probe can be specific for a repetitive DNA sequence locus, wherein for each probe one or more of the repetitive DNA sequences is an interspersed repeat element.
In some forms, one or more of the status biomarkers can comprise a PCR amplicon. A PCR amplicon is a region of nucleic acid including and between the binding sites of PCR primers. PCR amplicons can be said to be defined by the binding sites of the primers and by the primers themselves. In some forms, the PCR amplicon of each of the one or more of the status biomarkers can be defined by a first primer specific for a single one of the status biomarkers and a second primer. A primer specific for a status biomarker refers to a primer that can bind to a sequence in, and prime replication of, the status biomarker. A primer specific for a repetitive DNA sequence refers to a primer that can bind to a sequence in, and prime replication of, the repetitive DNA sequence. In some forms, the PCR amplicon of each of the one or more of the status biomarkers can be defined by the same first primer specific for a first type of repetitive DNA sequence and a second primer, wherein the second primer is specific for a second type of repetitive DNA sequence, wherein the second primer is the same for some and different for some of the one or more of the status biomarkers. In some forms, the first primer can be specific for one of the families of repetitive DNA sequences listed in Table 16 or 17, wherein independently for each of the one or more of the status biomarkers the second primer is specific for a family of repetitive DNA sequences such as the repetitive DNA sequence families listed in, for example, Table 1, Table 12, or Table 13. A primer specific for a family of repetitive DNA sequences refers to a primer that can bind to a sequence in, and prime replication of, one or more repetitive DNA sequences in the family of repetitive DNA sequences.
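The definition of a PCR amplicon as the region including and between the primer binding sites can be sketched as follows. This is a simplified illustration that assumes exact primer matches on a single template string, with the first primer matching the given strand and the second primer matching the opposite strand; the function names and the example sequences used to exercise it are hypothetical.

```python
def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an upper-case DNA string."""
    complement = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}
    return "".join(complement[base] for base in reversed(seq))


def find_amplicon(template: str, first_primer: str, second_primer: str):
    """Return the region including and between the two primer binding sites.

    Assumes the first primer matches the template strand as written and the
    second primer binds the opposite strand, so its reverse complement is
    searched for downstream of the first site. Returns None if either
    binding site is not found.
    """
    start = template.find(first_primer)
    if start == -1:
        return None
    second_site = reverse_complement(second_primer)
    site_start = template.find(second_site, start + len(first_primer))
    if site_start == -1:
        return None
    return template[start:site_start + len(second_site)]
```

The returned string begins with the first primer sequence and ends with the reverse complement of the second primer, which matches the idea that the amplicon is defined by both binding sites.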
In some forms, one or more of the status biomarkers can comprise one or more repetitive DNA sequences, wherein independently for each of the one or more of the status biomarkers that comprise repetitive DNA sequences one or more of the repetitive DNA sequences belongs to a family of repetitive DNA sequences such as the repetitive DNA sequence families listed in, for example, Table 1, Table 12, or Table 13. By independently is meant that, for each component in the group referred to, the specific identity of each component can be the same or different from the specific identity of any other of the components in the group. For example, in the group of status biomarkers above each different status biomarker can comprise the same or a different repetitive DNA sequence as any of the other status biomarkers in the group. In some forms, each status biomarker can comprise a repetitive DNA sequence, wherein independently for each of the status biomarkers the repetitive DNA sequence belongs to a family of repetitive DNA sequences such as the repetitive DNA sequence families listed in, for example, Table 1, Table 12, or Table 13. In some forms, one or more of the status biomarkers can comprise one or more repetitive DNA sequences, wherein for one or more of the status biomarkers that comprise repetitive DNA sequences one or more of the repetitive DNA sequences is an interspersed repeat element. In some forms, each status biomarker can comprise a repetitive DNA sequence, wherein for each status biomarker the repetitive DNA sequence is an interspersed repeat element.
The disclosed components, such as status biomarker loci, repetitive DNA sequences, repetitive DNA loci, biomarkers, status biomarkers, prospective status biomarkers, collective biomarkers, collective status biomarkers, and collective prospective status biomarkers, can be used in sets or groups. For example, sets of biomarker loci, repetitive DNA sequences, repetitive DNA loci, biomarkers, status biomarkers, prospective status biomarkers, collective biomarkers, collective status biomarkers, and collective prospective status biomarkers can include, for example, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 32, 34, 35, 36, 38, 40, 42, 44, 45, 46, 48, 50, 55, 60, 65, 70, 75, 80, 85, 90, 95, 100, 110, 120, 130, 140, 150, 160, 170, 180, 190, 200, 220, 240, 250, 260, 280, 300, 320, 340, 350, 360, 380, 400, 420, 440, 450, 460, 480, 500, 550, 600, 650, 700, 750, 800, 850, 900, 950, 1000, 1100, 1200, 1300, 1400, 1500, 1600, 1700, 1800, 1900, 2000, 2200, 2400, 2500, 2600, 2800, 3000, 3200, 3400, 3500, 3600, 3800, 4000, 4200, 4400, 4500, 4600, 4800, 5000, 5500, 6000, 6500, 7000, 7500, 8000, 8500, 9000, 9500, 10,000, 11,000, 12,000, 13,000, 14,000, 15,000, 16,000, 17,000, 18,000, 19,000, 20,000, 22,000, 24,000, 25,000, 26,000, 28,000, 30,000, 32,000, 34,000, 35,000, 36,000, 38,000, 40,000, 42,000, 44,000, 45,000, 46,000, 48,000, 50,000, 55,000, 60,000, 65,000, 70,000, 75,000, 80,000, 85,000, 90,000, 95,000, 100,000, 110,000, 120,000, 130,000, 140,000, 150,000, 160,000, 170,000, 180,000, 190,000, 200,000, 210,000, 220,000, 230,000, 240,000, 250,000, 260,000, 270,000, 280,000, 290,000, 300,000, 320,000, 340,000, 350,000, 360,000, 380,000, 400,000 or more biomarker loci, repetitive DNA sequences, repetitive DNA loci, biomarkers, status biomarkers, prospective status biomarkers, collective biomarkers, collective status biomarkers, or collective prospective status biomarkers, respectively.
For collective biomarkers, the group of biomarkers making up the collective biomarker can include a number of individual biomarkers as described herein.
As another example, sets of biomarker loci, repetitive DNA sequences, repetitive DNA loci, biomarkers, status biomarkers, prospective status biomarkers, collective biomarkers, collective status biomarkers, and collective prospective status biomarkers can include, for example, exactly or at least 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 32, 34, 35, 36, 38, 40, 42, 44, 45, 46, 48, 50, 55, 60, 65, 70, 75, 80, 85, 90, 95, 100, 110, 120, 130, 140, 150, 160, 170, 180, 190, 200, 220, 240, 250, 260, 280, 300, 320, 340, 350, 360, 380, 400, 420, 440, 450, 460, 480, 500, 550, 600, 650, 700, 750, 800, 850, 900, 950, 1000, 1100, 1200, 1300, 1400, 1500, 1600, 1700, 1800, 1900, 2000, 2200, 2400, 2500, 2600, 2800, 3000, 3200, 3400, 3500, 3600, 3800, 4000, 4200, 4400, 4500, 4600, 4800, 5000, 5500, 6000, 6500, 7000, 7500, 8000, 8500, 9000, 9500, 10,000, 11,000, 12,000, 13,000, 14,000, 15,000, 16,000, 17,000, 18,000, 19,000, 20,000, 22,000, 24,000, 25,000, 26,000, 28,000, 30,000, 32,000, 34,000, 35,000, 36,000, 38,000, 40,000, 42,000, 44,000, 45,000, 46,000, 48,000, 50,000, 55,000, 60,000, 65,000, 70,000, 75,000, 80,000, 85,000, 90,000, 95,000, 100,000, 110,000, 120,000, 130,000, 140,000, 150,000, 160,000, 170,000, 180,000, 190,000, 200,000, 210,000, 220,000, 230,000, 240,000, 250,000, 260,000, 270,000, 280,000, 290,000, 300,000, 320,000, 340,000, 350,000, 360,000, 380,000, 400,000 biomarker loci, repetitive DNA sequences, repetitive DNA loci, biomarkers, status biomarkers, prospective status biomarkers, collective biomarkers, collective status biomarkers, or collective prospective status biomarkers, respectively.
As another example, sets of biomarker loci, repetitive DNA sequences, repetitive DNA loci, biomarkers, status biomarkers, prospective status biomarkers, collective biomarkers, collective status biomarkers, and collective prospective status biomarkers can include, for example, any range from 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 32, 34, 35, 36, 38, 40, 42, 44, 45, 46, 48, 50, 55, 60, 65, 70, 75, 80, 85, 90, 95, 100, 110, 120, 130, 140, 150, 160, 170, 180, 190, 200, 220, 240, 250, 260, 280, 300, 320, 340, 350, 360, 380, 400, 420, 440, 450, 460, 480, 500, 550, 600, 650, 700, 750, 800, 850, 900, 950, 1000, 1100, 1200, 1300, 1400, 1500, 1600, 1700, 1800, 1900, 2000, 2200, 2400, 2500, 2600, 2800, 3000, 3200, 3400, 3500, 3600, 3800, 4000, 4200, 4400, 4500, 4600, 4800, 5000, 5500, 6000, 6500, 7000, 7500, 8000, 8500, 9000, 9500, 10,000, 11,000, 12,000, 13,000, 14,000, 15,000, 16,000, 17,000, 18,000, 19,000, 20,000, 22,000, 24,000, 25,000, 26,000, 28,000, 30,000, 32,000, 34,000, 35,000, 36,000, 38,000, 40,000, 42,000, 44,000, 45,000, 46,000, 48,000, 50,000, 55,000, 60,000, 65,000, 70,000, 75,000, 80,000, 85,000, 90,000, 95,000, 100,000, 110,000, 120,000, 130,000, 140,000, 150,000, 160,000, 170,000, 180,000, 190,000, 200,000, 210,000, 220,000, 230,000, 240,000, 250,000, 260,000, 270,000, 280,000, 290,000, 300,000, 320,000, 340,000, 350,000, 360,000, or 380,000 biomarker loci, repetitive DNA sequences, repetitive DNA loci, biomarkers, status biomarkers, prospective status biomarkers, collective biomarkers, collective status biomarkers, or collective prospective status biomarkers, respectively, to 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 32, 34, 35, 36, 38, 40, 42, 44, 45, 46, 48, 50, 55, 60, 65, 70, 75, 80, 85, 90, 95, 100, 110, 120, 130, 140, 150, 160, 170, 180, 190, 200, 220, 240, 250, 260, 280, 300, 320, 340, 350, 360, 380, 400, 420, 440, 450, 460, 480, 500, 550, 600, 650, 700, 750, 800, 850, 900, 950, 1000, 1100, 1200, 1300, 1400, 1500, 1600, 1700, 1800, 1900, 2000, 2200, 2400, 2500, 2600, 2800, 3000, 3200, 3400, 3500, 3600, 3800, 4000, 4200, 4400, 4500, 4600, 4800, 5000, 5500, 6000, 6500, 7000, 7500, 8000, 8500, 9000, 9500, 10,000, 11,000, 12,000, 13,000, 14,000, 15,000, 16,000, 17,000, 18,000, 19,000, 20,000, 22,000, 24,000, 25,000, 26,000, 28,000, 30,000, 32,000, 34,000, 35,000, 36,000, 38,000, 40,000, 42,000, 44,000, 45,000, 46,000, 48,000, 50,000, 55,000, 60,000, 65,000, 70,000, 75,000, 80,000, 85,000, 90,000, 95,000, 100,000, 110,000, 120,000, 130,000, 140,000, 150,000, 160,000, 170,000, 180,000, 190,000, 200,000, 210,000, 220,000, 230,000, 240,000, 250,000, 260,000, 270,000, 280,000, 290,000, 300,000, 320,000, 340,000, 350,000, 360,000, 380,000, or 400,000 biomarker loci, repetitive DNA sequences, repetitive DNA loci, biomarkers, status biomarkers, prospective status biomarkers, collective biomarkers, collective status biomarkers, or collective prospective status biomarkers, respectively.
The methylation state of any number (such as the numbers and ranges described above) of, for example, biomarker loci, repetitive DNA sequences, repetitive DNA loci, biomarkers, status biomarkers, prospective status biomarkers, collective biomarkers, collective status biomarkers, or collective prospective status biomarkers can be determined. In some forms, the methylation state of more than 100 biomarkers can be determined. In some forms, the methylation state of more than 1000 biomarkers can be determined. In some forms, the methylation state of more than 10,000 biomarkers can be determined. In some forms, the methylation state of more than 100,000 biomarkers can be determined. In some forms, the methylation state of more than 200,000 biomarkers can be determined. In some forms, the biomarker loci, repetitive DNA sequences, repetitive DNA loci, biomarkers, status biomarkers, prospective status biomarkers, collective biomarkers, collective status biomarkers, and collective prospective status biomarkers can comprise a set of biomarker loci, repetitive DNA sequences, repetitive DNA loci, biomarkers, status biomarkers, prospective status biomarkers, collective biomarkers, collective status biomarkers, or collective prospective status biomarkers, respectively. The set can comprise any number (such as the numbers and ranges described above) of, for example, biomarker loci, repetitive DNA sequences, repetitive DNA loci, biomarkers, status biomarkers, prospective status biomarkers, collective biomarkers, collective status biomarkers, or collective prospective status biomarkers. In some forms, the set can comprise more than 100 status biomarkers. In some forms, the set can comprise more than 1000 status biomarkers. In some forms, the set can comprise more than 10,000 status biomarkers. In some forms, the set can comprise more than 100,000 status biomarkers. In some forms, the set can comprise more than 200,000 status biomarkers.
In some forms, a plurality of the biomarkers can independently belong to one or more status biomarker families, wherein each biomarker in each status biomarker family comprises one or more repetitive DNA sequences that belong to a single family of repetitive DNA sequences such as the repetitive DNA sequence families listed in, for example, Table 1, Table 12, or Table 13. In some forms, a plurality of biomarkers can independently belong to 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 32, 34, 35, 36, 38, 40, 42, 44, 45, 46, 48, 50, 55, 60, 65, 70, 75, 80, 85, 90, 95, 100, 110, 120, 130, 140, 150, 160, 170, 180, 190, 200, 220, 240, 250, 260, 280, 300, 320, 340, 350, 360, 380, 400, 420, 440, 450, 460, 480, 500, 550, 600, 650, 700, 750, 800, 850, 900, 950, 1000, 1100, 1200 or more status biomarker families. In some forms, a plurality of biomarkers can independently belong to three or more status biomarker families. In some forms, a plurality of biomarkers can independently belong to four or more status biomarker families. In some forms, a plurality of biomarkers can independently belong to five or more status biomarker families. In some forms, a plurality of biomarkers can independently belong to ten or more status biomarker families. In some forms, a plurality of biomarkers can independently belong to twenty or more status biomarker families.
In some forms, 100 or more biomarkers can belong to one or more of the status biomarker families. In some forms, 100 or more biomarkers can belong to each of the status biomarker families. In some forms, 200 or more biomarkers can belong to one or more of the status biomarker families. In some forms, 200 or more biomarkers can belong to each of the status biomarker families. In some forms, 300 or more biomarkers can belong to one or more of the status biomarker families. In some forms, 300 or more biomarkers can belong to each of the status biomarker families. In some forms, 400 or more biomarkers can belong to one or more of the status biomarker families. In some forms, 400 or more biomarkers can belong to each of the status biomarker families. In some forms, a plurality of, for example, biomarker loci, repetitive DNA sequences, repetitive DNA loci, biomarkers, status biomarkers, prospective status biomarkers, collective biomarkers, collective status biomarkers, and collective prospective status biomarkers can independently belong to 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 32, 34, 35, 36, 38, 40, 42, 44, 45, 46, 48, 50, 55, 60, 65, 70, 75, 80, 85, 90, 95, 100, 110, 120, 130, 140, 150, 160, 170, 180, 190, 200, 220, 240, 250, 260, 280, 300, 320, 340, 350, 360, 380, 400, 420, 440, 450, 460, 480, 500, 550, 600, 650, 700, 750, 800, 850, 900, 950, 1000, 1100, 1200 or more families of biomarker loci, repetitive DNA sequences, repetitive DNA loci, biomarkers, status biomarkers, prospective status biomarkers, collective biomarkers, collective status biomarkers, or collective prospective status biomarkers, respectively.
In some forms, the status biomarkers can comprise a set of status biomarkers. In some forms, the members of the set of status biomarkers can be status biomarkers that indicate the status of one or more specific statuses. In some forms, the one or more specific statuses can comprise, for example, wellness, level of health, risk to wellness, risk to level of health, status of the genome, genomic instability, aging, risk of aging, cancer, risk of cancer, head and neck cancer, risk of head and neck cancer, breast cancer, risk of breast cancer, lung cancer, risk of lung cancer, prostate cancer, risk of prostate cancer, colon cancer, risk of colon cancer, esophageal cancer, risk of esophageal cancer, ovarian cancer, risk of ovarian cancer, liver cancer, risk of liver cancer, pancreatic cancer, risk of pancreatic cancer, skin cancer, risk of skin cancer, melanoma, risk of melanoma, lymphoma, risk of lymphoma, leukemia, risk of leukemia, cervical cancer, risk of cervical cancer, cervical dysplasia, risk of cervical dysplasia, cervical intraepithelial neoplasia, risk of cervical intraepithelial neoplasia, tumor burden, stress response, diabetes, risk of diabetes, heart disease, risk of heart disease, and/or response to treatment.
In some forms, the one or more specific statuses can comprise the presence of a disease or condition. In some forms, the one or more specific statuses can comprise, for example, a lack of wellness, low level of health, risk to wellness, risk to level of health, poor status of the genome, genomic instability, aging, risk of aging, cancer, risk of cancer, head and neck cancer, risk of head and neck cancer, breast cancer, risk of breast cancer, lung cancer, risk of lung cancer, prostate cancer, risk of prostate cancer, colon cancer, risk of colon cancer, esophageal cancer, risk of esophageal cancer, ovarian cancer, risk of ovarian cancer, liver cancer, risk of liver cancer, pancreatic cancer, risk of pancreatic cancer, skin cancer, risk of skin cancer, melanoma, risk of melanoma, lymphoma, risk of lymphoma, leukemia, risk of leukemia, cervical cancer, risk of cervical cancer, cervical dysplasia, risk of cervical dysplasia, cervical intraepithelial neoplasia, risk of cervical intraepithelial neoplasia, tumor burden, stress response, diabetes, risk of diabetes, heart disease, and/or risk of heart disease.
1. Lists of Status Biomarkers
Analysis of methylation levels in biological samples relevant to subject status (for example, normal, margin, tumor) of loci having CpG islands or CpG islets as described herein resulted in identification of various loci showing significant differences in methylation levels based on different status. Such loci are a useful form of status biomarker. Status biomarkers can be grouped in various ways. One useful way to group status biomarkers is into families of repetitive DNA sequences to which the status biomarker belongs. As used herein, a status biomarker belongs to a repetitive DNA sequence family (or category, or subcategory, or class) if the status biomarker comprises a repetitive DNA sequence belonging to that repetitive DNA sequence family (or category, or subcategory, or class). Loci analyzed according to the methods described herein can also be grouped in various ways. One useful way to group loci is into families of repetitive DNA sequences to which the locus belongs. As used herein, a locus belongs to a repetitive DNA sequence family (or category, or subcategory, or class) if the locus comprises a repetitive DNA sequence belonging to that repetitive DNA sequence family (or category, or subcategory, or class). Groups of status biomarkers and groups of loci can themselves be considered status biomarkers. For example, a group of status biomarkers belonging to the LTR54B family of repetitive DNA sequences can be a status biomarker. Such status biomarkers that comprise a group of components (such as a group of individual status biomarkers) can be referred to as a collective status biomarker. The collective status biomarker comprising status biomarkers belonging to the LTR54B family of repetitive DNA sequences can be referred to as an LTR54B family status biomarker.
Collective status biomarkers are useful when determining a collective property of the individual status biomarkers in the group of status biomarkers, such as the average methylation of the individual loci that make up the status biomarkers in a group of status biomarkers. Status biomarkers are referred to herein by different terms such as variables, classifiers, and category classifiers.
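The collective property mentioned above, the average methylation of the individual loci in a family, can be sketched as follows. This is a minimal illustration that assumes per-locus methylation values and a locus-to-family mapping are already available as dictionaries; the function name, the locus identifiers, and the input layout are hypothetical and are not taken from the examples.

```python
from collections import defaultdict
from statistics import mean


def family_average_methylation(locus_methylation: dict, locus_family: dict) -> dict:
    """Average per-locus methylation values within each repetitive DNA family.

    locus_methylation maps a locus identifier to its methylation value for one
    sample; locus_family maps a locus identifier to a family name such as
    "LTR54B". Loci without a family assignment are ignored.
    """
    values_by_family = defaultdict(list)
    for locus, value in locus_methylation.items():
        family = locus_family.get(locus)
        if family is not None:
            values_by_family[family].append(value)
    # One averaged value per family is the collective variable for this sample.
    return {family: mean(values) for family, values in values_by_family.items()}
```

Applying the function to each sample in turn yields one value per family per sample, which is the kind of per-case variable list that a classifier can take as input.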
Various lists of such status biomarkers (both individual and collective) are presented herein. The lists below are lists of collective status biomarkers (groups of status biomarkers) determined to exhibit significant differences in methylation levels based on different status of the tissue (normal, margin, tumor).
The first two lists below arose from utilizing a list of 569 variables (collective status biomarkers) (listed in Table 18), each comprising the average methylation value of all members of a family of repetitive elements that were probed in the microarrays. Each case (a case being a biological sample; 62 total samples were probed) was associated with a corresponding list of 569 values generated by the microarray analysis. After all 62 cases were analyzed using a statistical tool for classification (Support Vector Machine (SVM) or, alternatively, Random Forest), two different lists emerged that yield the best classification results (that is, best identify the status of the case based on the methylation level). The status classification is the same in both experiments: normal tissue sample, vs. tumor tissue sample, vs. nontumor adjacent tissue sample. Both status classification runs were supervised, in the sense that the assignment of each sample (normal, tumor, or margin) was made by a pathologist. The resulting lists are not the same, since different combinations of variables are capable of yielding a reasonably good classifier, and particularly because there are many more variables (569) than there are cases (62). The third list below is the union of the top 75 categories in the first two lists. The resulting list of 138 categories is referred to herein as the Top 138 categories (or status biomarkers or repetitive DNA sequence families).
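The merging of two ranked classifier lists into a single list without duplicates can be sketched as follows. This is only an illustration of forming a union while keeping an approximate rank order; the interleaving rule and the function name are assumptions, and the sketch is not claimed to reproduce the published Top 138 list or its ordering.

```python
def merge_ranked_lists(first: list, second: list) -> list:
    """Return the union of two ranked lists, interleaved by rank.

    Entries are taken alternately from each list in rank order, and an entry
    already present in the merged list is skipped, so each category appears
    once. The interleaving rule is an illustrative assumption.
    """
    merged = []
    seen = set()
    for rank in range(max(len(first), len(second))):
        for ranked in (first, second):
            if rank < len(ranked) and ranked[rank] not in seen:
                seen.add(ranked[rank])
                merged.append(ranked[rank])
    return merged
```

Because categories that appear in both input lists are counted once, the merged list is shorter than the combined length of the two inputs whenever the lists overlap.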
i. List of Top 75 Classifier Categories Obtained Using SVM Analysis, by Rank
LTR54B, MER11B, U1, MER34B, LTR56, THE1B-int, HERV9, LTR14C, HERVFH21, LTR6B, LTR46, centr, MLT1D-int, MER67D, HERVK11, PABL_B, MSR1, AluYa5/8, LTR10B, HERVK22, GSAT, LTR10B1, LTR17, LTR51, MER11A, Other, L1PA12, ERVL-B4, HERVK14, LTR29, LTR6A, ALR/Alpha, LTR48B, MER105, MER67A, HUERS-P1, LTR7B, L1PB1, L1PA15-16, LTR28, MSTB-int, LTR45B, LTR7Y, HERVL18, LTR30, HERVK9, LTR45C, LTR47A, THE1C, LTR66, SST1, MER34B-int, LTR65, MER44D, MER57A-int, HUERS-P2, MER6A, MER50B, MER41E, 7SK, HERVP71A, L1PBa1, MER44C, GSATII, LTR1B, LTR7, MER91C, LTR22, Harlequin, MLT1F1, L1M3f, THE1B, HUERS-P3, MER92B, Charlie3.
ii. List of Top 75 Classifier Categories Obtained Using Random Forest Analysis, by Rank
MER67D, MER6, ERVL, MER66C, HUERS-P3, MLT1G1, MER4D, MLT2D, THE1B, MLT1A1, MER11B, Charlie5, MLT2B3, MER50B, MER70A, Charlie3, MER50, LTR2, MLT1A, HERVL, LTR33A, MSTB-int, Cheshire, MSTA, MER51B, MLT2B2, MSTC, LTR9B, LTR14B, HUERS-P2, MSTB1, MSTD, LTR52, LTR8, LTR8A, MER92B, LTR22, MER51A, LTR36, LTR54B, PABL_A, MER4D1, AcHobo, LTR48, MLT1A0, LTR1B, MSTB, MER11D, LTR19A, MLT1E2, MER115, MER11A, MER34C, THE1C, MLT2B1, L1PA10, MER4A1, MLT1E, MLT2B4, LTR10B, L1MA7, HERVFH21, LTR5, THE1D, L1MA1, LTR9, MER63A, LTR5A, L1PB4, ERVL-B4, MLT1F, MLT2A2, LTR14, HERVK9, MER11C.
iii. List of Top 138 Classifier Categories, by Rank
LTR54B, MER67D, MER11B, MER6, ERVL, U1, MER34B, MER66C, HUERS-P3, LTR56, MLT1G1, THE1B-int, HERV9, MER4D, LTR14C, MLT2D, HERVFH21, THE1B, LTR6B, MLT1A1, LTR46, centr, Charlie5, MLT1D-int, MLT2B3, MER50B, HERVK11, MER70A, Charlie3, PABL_B, MER50, MSR1, AluYa5/8, LTR2, LTR10B, MLT1A, HERVK22, HERVL, GSAT, LTR33A, LTR10B1, MSTB-int, Cheshire, LTR17, LTR51, MSTA, MER11A, MER51B, MLT2B2, SVA, SVA_A, SVA_B, SVA_C, SVA_D, SVA_E, SVA_F, L1PA12, MSTC, ERVL-B4, LTR9B, HERVK14, LTR14B, HUERS-P2, LTR29, LTR6A, MSTB1, ALR/Alpha, MSTD, LTR48B, LTR52, LTR8, MER105, LTR8A, MER67A, HUERS-P1, MER92B, LTR22, LTR7B, L1PB1, MER51A, L1PA15-16, LTR36, LTR28, PABL_A, LTR45B, MER4D1, AcHobo, LTR7Y, HERVL18, LTR48, LTR30, MLT1A0, HERVK9, LTR1B, LTR45C, MSTB, LTR47A, MER11D, LTR19A, THE1C, LTR66, MLT1E2, MER115, SST1, MER34B-int, LTR65, MER34C, MER44D, MER57A-int, MLT2B1, L1PA10, MER4A1, MER6A, MLT1E, MER41E, MLT2B4, 7SK, HERVP71A, L1MA7, L1PBa1, LTR5, MER44C, GSATII, THE1D, L1MA1, LTR7, LTR9, MER63A, MER91C, LTR5A, Harlequin, L1PB4, MLT1F1, L1M3f, MLT1F, MLT2A2, LTR14, MER11C.
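By way of non-limiting illustration, the merging of two ranked classifier lists into a single list without duplicates can be sketched as follows. The function name `merge_ranked`, the rank-by-rank interleaving order, and the abbreviated input lists are assumptions made for this sketch only; the short inputs are chosen to include one shared family and do not reproduce the ranks of the lists above.

```python
from itertools import zip_longest


def merge_ranked(first, second):
    """Interleave two ranked lists rank by rank, keeping the first occurrence of each name."""
    merged, seen = [], set()
    for a, b in zip_longest(first, second):
        for name in (a, b):
            # zip_longest pads the shorter list with None; skip padding and repeats.
            if name is not None and name not in seen:
                seen.add(name)
                merged.append(name)
    return merged


# Abbreviated, illustrative inputs (not the full Top 75 lists).
svm_excerpt = ["LTR54B", "MER11B", "U1"]
rf_excerpt = ["MER67D", "MER6", "MER11B"]

merged = merge_ranked(svm_excerpt, rf_excerpt)
print(merged)       # five names: MER11B is shared and therefore appears once
print(len(merged))
```

Because a family appearing in both inputs is counted once, the merged list is shorter than the sum of the two input lists, which is why two lists of 75 categories can yield fewer than 150 categories.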
569 repetitive DNA sequence families were identified from among the loci identified as CpG island- or CpG islet-containing loci as described in Example 2. Table 18 is a list of these repetitive DNA sequence families. Among the 569 repetitive element families comprising the full set of repetitive DNA sequence status biomarkers, a subset of 138 was identified that are most effective as classifiers. This subset was generated by merging the top 75 categories identified by a Random Forest analysis with another 75 categories that were the best performers using a Support Vector Machine classifier. This produced the list of Top 138 status biomarkers (Table 1). A Random Forest classification analysis was performed utilizing the set of Top 138 status biomarkers, and a second one utilizing the remainder of the 569 (a subset of 431). The list of this subset of 431 status biomarkers can be derived by eliminating the Top 138 status biomarkers in Table 1 from the list of 569 status biomarkers in Table 18. Random Forest analysis using the top 138 status biomarkers gave a classification error of 8.1%. The Receiver Operator Characteristic curves for this analysis gave an AUC of 1 for margin versus normal and an AUC of 0.91 for tumor versus margin. The second Random Forest analysis was performed using the remaining 431 status biomarkers. The classification error in this analysis was 19.0%. Thus, the Top 138 status biomarkers are significantly better for assessing the status of the samples and subjects than the remaining status biomarkers. In a separate experiment using SVM analysis, the superior performance of the top 138 status biomarkers compared to the remaining 431 variables was confirmed. These results provide an objective metric for claiming superior utility of the top 138 biomarkers for assessing status of subjects.
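The two performance figures discussed above, classification error and the area under the Receiver Operator Characteristic curve (AUC), can be illustrated with a minimal sketch. The function names, the sample labels, and the scores below are hypothetical and are not data from the disclosed experiments; the AUC is computed here as the fraction of (positive, negative) pairs ranked correctly, which is one standard way of obtaining it.

```python
def classification_error(true_labels, predicted_labels):
    """Fraction of samples whose predicted status differs from the assigned status."""
    wrong = sum(1 for t, p in zip(true_labels, predicted_labels) if t != p)
    return wrong / len(true_labels)


def roc_auc(positive_scores, negative_scores):
    """AUC as the probability that a positive sample outscores a negative one (ties count 1/2)."""
    wins = 0.0
    for p in positive_scores:
        for n in negative_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positive_scores) * len(negative_scores))


# Hypothetical labels and classifier scores, for illustration only.
truth = ["tumor", "tumor", "margin", "normal"]
calls = ["tumor", "margin", "margin", "normal"]
print(classification_error(truth, calls))   # one of four samples is misclassified

print(roc_auc([0.9, 0.8], [0.2, 0.1]))      # perfect separation gives an AUC of 1
print(roc_auc([0.9, 0.4], [0.5, 0.1]))      # partial overlap gives an AUC below 1
```

An AUC of 1 means that every sample of one class received a higher score than every sample of the other class, whereas a lower AUC reflects some overlap between the score distributions.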
The utility of the Status Biomarkers for distinguishing dysplasia from cancer was optimized by performing a classification analysis that does not include the data from the normal samples, and which can be called a nontumor margin vs. tumor classification. Taking the 569 repetitive element categories as variables (Table 18), classification of margin vs. tumor using Random Forest (RF) was performed, and the best 75 variables were saved. Then, again taking the 569 repetitive element categories as variables, classification of margin vs. tumor using the Support Vector Machine was performed, and the best 75 variables were saved. The union of the best 75 RF variables and the best 75 SVM variables was then calculated, and this yielded 137 variables, which are called the Top performing variables for margin vs. tumor classification (Table 12).
The Top 137 variables were used to perform an RF classification, which yielded a classification error of 9.6%. Using the remaining 432 variables yielded a classification error of 17%, confirming the superior performance of the Top 137 variables.
The overlap between the Top 138 Variables (Table 1) that are the best classifiers for normal vs. margin vs. tumor and the Top 137 categories (Table 12) that are the best classifiers for margin vs. tumor was calculated. The comparison shows that only 48 variables are common to both lists and thus are good classifiers for both tumor-margin and tumor-margin-normal comparison experiments. The 48 common variables are listed below in Table 13.
The 137 categories from Table 12 minus the 48 common variables from Table 13 result in a list of 89 different variables that are good classifiers among tumor and margin comparison experiments but not for tumor-margin-normal comparison experiments. The list of 89 different variables is as follows: AluSg/x, AluYa5, AluYa8, tRNA, Charlie 10, ERVK, FLAM_A, HAL1, HERV16, HERV351, HERVL-A1, HERVL40, HSMAR1, L1M3d, L1M4b, L1MA10, L1MA5, L1MA5A, L1MA9, L1MB1, L1MB4, L1MC1, L1MC2, L1MC3, L1MCb, L1MD, L1MD1, L1ME2, L1P1, L1P2, L1P3, L1P4, L1P5, L1PA13, L1PA15, L1PA2, L1PA3, L1PA6, L1PA7, L1PB2, L3b, LTR12, LTR12D, LTR16A, LTR16B, LTR18B, LTR1D, LTR22C, LTR23, LTR24C, LTR26, LTR26B, LTR27, LTR2B, LTR5_Hs, LTR54, LTR67, MER102b, MER106B, MER110A, MER119, MER21-int, MER21A, MER31B, MER34, MER44B, MER46B, MER50-int, MER57A, MER63D, MER65D, MER69B, MER77, MER81, MER90a, MER91A, MER93B, MER94, MIR3, MIRb, MLT1B, MLT1E1, MLT1J2, MLT1L, MSTA-int, PRIMA4-int, Tigger1, Tigger7, Tip100.
The 138 categories from Table 1 minus the 48 common variables in Table 13 result in a list of 90 different variables that are good classifiers among tumor-margin-normal comparison experiments but not for tumor-margin comparisons. The list of 90 different variables is as follows: 7SK, centr, SVA, Charlie5, Cheshire, ERVL-B4, GSAT, GSATII, Harlequin, HERVFH21, HERVK22, HERVK9, HERVP71A, HUERS-P1, L1M3f, L1MA1, L1MA7, L1PA10, L1PA12, L1PA15-16, L1PB1, L1PB4, LTR14, LTR14B, LTR17, LTR1B, LTR2, LTR22, LTR28, LTR29, LTR30, LTR33A, LTR45B, LTR45C, LTR46, LTR47A, LTR48, LTR48B, LTR5, LTR52, LTR5A, LTR65, LTR66, LTR6A, LTR7, LTR7B, LTR7Y, LTR8, LTR8A, MER105, MER115, MER11B, MER11C, MER34C, MER41E, MER44C, MER44D, MER4A1, MER4D, MER4D1, MER51A, MER51B, MER66C, MER67D, MER6A, MER70A, MER92B, MLT1A1, MLT1D-int, MLT1F, MLT1F1, MLT1G1, MLT2A2, MLT2B1, MLT2B2, MLT2B4, MSTA, MSTB-int, MSTB, MSTB1, MSTD, PABL_A, PABL_B, SVA_A, SVA_B, SVA_C, SVA_D, SVA_E, SVA_F, THE1D.
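The list arithmetic described above (569 families, a Top 138 list, a Top 137 list, and 48 families common to both) can be checked with ordinary set operations. In the following sketch the integer identifiers are placeholders standing in for family names, arranged only so that the two top lists share exactly 48 members; they are an assumption of this illustration and carry no biological meaning.

```python
# Placeholder identifiers 0..568 stand in for the 569 repetitive element families.
all_families = set(range(569))

# Placeholder top lists arranged so that exactly 48 members are shared.
top_138 = set(range(0, 138))     # normal vs. margin vs. tumor (Table 1)
top_137 = set(range(90, 227))    # margin vs. tumor (Table 12)

common = top_138 & top_137               # families common to both lists (Table 13)
margin_tumor_only = top_137 - common     # good for margin vs. tumor only
three_way_only = top_138 - common        # good for normal vs. margin vs. tumor only
remainder_431 = all_families - top_138   # families outside the Top 138
remainder_432 = all_families - top_137   # families outside the Top 137

print(len(common), len(margin_tumor_only), len(three_way_only),
      len(remainder_431), len(remainder_432))
# prints: 48 89 90 431 432
```

With real data, the same intersection and difference operations applied to the family names in Table 1 and Table 12 would produce the lists themselves rather than only their sizes.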
Table 14 reports the repetitive element families present in a 600-base window centered on each microarray probe. This is an example of neighbor repeat analysis. The presence of repetitive DNA sequences belonging to different families of repetitive DNA sequences in the same, for example, status biomarker or repetitive DNA sequence locus can facilitate some of the forms of the disclosed methods. For example, the different repetitive DNA sequences can be used to define a PCR amplicon by, for example, using primers specific for two of the different repetitive DNA sequences.
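A neighbor repeat analysis of the kind just described can be sketched as an interval-overlap query. The `Repeat` record, the function name `neighbor_families`, the half-open coordinate convention, and all coordinates below are hypothetical and serve only to illustrate a 600-base window (300 bases on each side) centered on a probe.

```python
from typing import List, NamedTuple, Set


class Repeat(NamedTuple):
    family: str
    start: int  # 0-based, inclusive
    end: int    # exclusive


def neighbor_families(probe_center: int, repeats: List[Repeat],
                      half_window: int = 300) -> Set[str]:
    """Return the repeat families that overlap the window centered on the probe."""
    lo, hi = probe_center - half_window, probe_center + half_window
    # Two half-open intervals overlap when each one starts before the other ends.
    return {r.family for r in repeats if r.start < hi and r.end > lo}


# Hypothetical annotations on a single chromosome, for illustration only.
annotations = [
    Repeat("Harlequin", 9_800, 10_150),
    Repeat("LTR2", 10_200, 10_650),
    Repeat("MER11B", 12_000, 12_400),
]

found = neighbor_families(10_000, annotations)
print(sorted(found))   # the two families lying within 300 bases of the probe center
```

When two different families are found in the same window, as in this example, primers specific for each could in principle bracket a PCR amplicon spanning the region between them.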
A very interesting feature of this analysis is the presence of LTR2 and LTR2B repetitive elements in the vicinity of Harlequin repeats, which are a special type of LTR repeat. A report in the journal “Oncogene” described an unusual set of human genes known as HOST genes, which contain sequences comprising a mixture of Harlequin repetitive elements joined to LTR2 repetitive elements (Rangel et al., 2003). HOST genes are overexpressed in ovarian cancer (Rangel et al., 2003). The presence of the Harlequin class of repeats in the list of the best classifier probes found by the Support Vector Machine analysis indicates the existence of a large number of genomic loci with a structure similar to that of the ovarian cancer HOST genes. These unusual loci suffer major changes in DNA methylation status in cancers of the head and neck, as revealed by analysis herein.
Table 16 is a list of 126 repetitive element families that occur as neighbors in a window of 2×300 bases near the Top 138 classifier probes.
There are a variety of molecules disclosed herein that are nucleic acid based, including, for example, riboswitches, aptamers, and nucleic acids that encode riboswitches and aptamers. The disclosed nucleic acids can be made up of, for example, nucleotides, nucleotide analogs, or nucleotide substitutes. Non-limiting examples of these and other molecules are discussed herein. It is understood that, for example, when a vector is expressed in a cell, the expressed mRNA will typically be made up of A, C, G, and U. Likewise, it is understood that if a nucleic acid molecule is introduced into a cell or cell environment through, for example, exogenous delivery, it is advantageous that the nucleic acid molecule be made up of nucleotide analogs that reduce the degradation of the nucleic acid molecule in the cellular environment.
So long as their relevant function is maintained, riboswitches, aptamers, expression platforms and any other oligonucleotides and nucleic acids can be made up of or include modified nucleotides (nucleotide analogs). Many modified nucleotides are known and can be used in oligonucleotides and nucleic acids. A nucleotide analog is a nucleotide which contains some type of modification to either the base, sugar, or phosphate moieties. Modifications to the base moiety would include natural and synthetic modifications of A, C, G, and T/U as well as different purine or pyrimidine bases, such as uracil-5-yl, hypoxanthin-9-yl (I), and 2-aminoadenin-9-yl. A modified base includes but is not limited to 5-methylcytosine (5-me-C), 5-hydroxymethyl cytosine, xanthine, hypoxanthine, 2-aminoadenine, 6-methyl and other alkyl derivatives of adenine and guanine, 2-propyl and other alkyl derivatives of adenine and guanine, 2-thiouracil, 2-thiothymine and 2-thiocytosine, 5-halouracil and cytosine, 5-propynyl uracil and cytosine, 6-azo uracil, cytosine and thymine, 5-uracil (pseudouracil), 4-thiouracil, 8-halo, 8-amino, 8-thiol, 8-thioalkyl, 8-hydroxyl and other 8-substituted adenines and guanines, 5-halo particularly 5-bromo, 5-trifluoromethyl and other 5-substituted uracils and cytosines, 7-methylguanine and 7-methyladenine, 8-azaguanine and 8-azaadenine, 7-deazaguanine and 7-deazaadenine and 3-deazaguanine and 3-deazaadenine. Another modified base contains one or more of the 2′-O,4′-C-methylene-β-D-ribofuranosyl nucleosides which are known as locked nucleic acid (LNA™) monomers (Petersen and Wengel, Trends Biotech 21:74-81, 2003). Additional base modifications can be found for example in U.S. Pat. No. 3,687,808, Englisch et al., Angewandte Chemie, International Edition, 1991, 30, 613, and Sanghvi, Y. S., Chapter 15, Antisense Research and Applications, pages 289-302, Crooke, S. T. and Lebleu, B. ed., CRC Press, 1993. 
Certain nucleotide analogs, such as 5-substituted pyrimidines, 6-azapyrimidines and N-2, N-6 and O-6 substituted purines, including 2-aminopropyladenine, 5-propynyluracil, 5-propynylcytosine, and 5-methylcytosine, can increase the stability of duplex formation. Other modified bases are those that function as universal bases. Universal bases include 3-nitropyrrole and 5-nitroindole. Universal bases substitute for the normal bases but have no bias in base pairing. That is, universal bases can base pair with any other base. Base modifications often can be combined with, for example, a sugar modification, such as 2′-O-methoxyethyl, to achieve unique properties such as increased duplex stability. There are numerous United States patents such as U.S. Pat. Nos. 4,845,205; 5,130,302; 5,134,066; 5,175,273; 5,367,066; 5,432,272; 5,457,187; 5,459,255; 5,484,908; 5,502,177; 5,525,711; 5,552,540; 5,587,469; 5,594,121; 5,596,091; 5,614,617; and 5,681,941, which detail and describe a range of base modifications. Each of these patents is herein incorporated by reference in its entirety, and specifically for their description of base modifications, their synthesis, their use, and their incorporation into oligonucleotides and nucleic acids.
LNA™ monomers are a class of nucleic acid analogues in which the ribose ring is “locked” into the ideal conformation for base stacking and backbone pre-organization and can be used just like a regular nucleotide. The nucleic acid contains a methylene bridge connecting the 2′-O and the 4′-C. The “locked” structure increases the stability of oligonucleotides by means of increasing the melting temperature (Kaur et al. Biochemistry 45:7347-55, 2006). LNA™ can be used for a variety of molecular biology techniques. Locked nucleic acids can be used for but are not limited to microarrays, FISH probes, real-time PCR probes, small RNA research, SNP genotyping, mRNA antisense oligonucleotides, allele-specific PCR, RNAi, DNAzymes, fluorescence polarization probes, gene repair/exon skipping, splice variant detection and comparative genome hybridization.
Nucleotide analogs can also include modifications of the sugar moiety. Modifications to the sugar moiety would include natural modifications of the ribose and deoxyribose as well as synthetic modifications. Sugar modifications include but are not limited to the following modifications at the 2′ position: OH; F; O-, S-, or N-alkyl; O-, S-, or N-alkenyl; O-, S- or N-alkynyl; or O-alkyl-O-alkyl, wherein the alkyl, alkenyl and alkynyl can be substituted or unsubstituted C1 to C10 alkyl or C2 to C10 alkenyl and alkynyl. 2′ sugar modifications also include but are not limited to —O[(CH2)n O]m CH3, —O(CH2)n OCH3, —O(CH2)n NH2, —O(CH2)n CH3, —O(CH2)n —ONH2, and —O(CH2)n ON[(CH2)n CH3]2, where n and m are from 1 to about 10.
Other modifications at the 2′ position include but are not limited to: C1 to C10 lower alkyl, substituted lower alkyl, alkaryl, aralkyl, O-alkaryl or O-aralkyl, SH, SCH3, OCN, Cl, Br, CN, CF3, OCF3, SOCH3, SO2 CH3, ONO2, NO2, N3, NH2, heterocycloalkyl, heterocycloalkaryl, aminoalkylamino, polyalkylamino, substituted silyl, an RNA cleaving group, a reporter group, an intercalator, a group for improving the pharmacokinetic properties of an oligonucleotide, or a group for improving the pharmacodynamic properties of an oligonucleotide, and other substituents having similar properties. Similar modifications can also be made at other positions on the sugar, particularly the 3′ position of the sugar on the 3′ terminal nucleotide or in 2′-5′ linked oligonucleotides and the 5′ position of 5′ terminal nucleotide. Modified sugars would also include those that contain modifications at the bridging ring oxygen, such as CH2 and S. Nucleotide sugar analogs can also have sugar mimetics such as cyclobutyl moieties in place of the pentofuranosyl sugar. There are numerous United States patents that teach the preparation of such modified sugar structures such as U.S. Pat. Nos. 4,981,957; 5,118,800; 5,319,080; 5,359,044; 5,393,878; 5,446,137; 5,466,786; 5,514,785; 5,519,134; 5,567,811; 5,576,427; 5,591,722; 5,597,909; 5,610,300; 5,627,053; 5,639,873; 5,646,265; 5,658,873; 5,670,633; and 5,700,920, each of which is herein incorporated by reference in its entirety, and specifically for their description of modified sugar structures, their synthesis, their use, and their incorporation into nucleotides, oligonucleotides and nucleic acids.
Nucleotide analogs can also be modified at the phosphate moiety. Modified phosphate moieties include but are not limited to those that can be modified so that the linkage between two nucleotides contains a phosphorothioate, chiral phosphorothioate, phosphorodithioate, phosphotriester, aminoalkylphosphotriester, methyl and other alkyl phosphonates including 3′-alkylene phosphonate and chiral phosphonates, phosphinates, phosphoramidates including 3′-amino phosphoramidate and aminoalkylphosphoramidates, thionophosphoramidates, thionoalkylphosphonates, thionoalkylphosphotriesters, and boranophosphates. It is understood that these phosphate or modified phosphate linkages between two nucleotides can be through a 3′-5′ linkage or a 2′-5′ linkage, and the linkage can contain inverted polarity such as 3′-5′ to 5′-3′ or 2′-5′ to 5′-2′. Various salts, mixed salts and free acid forms are also included. Numerous United States patents teach how to make and use nucleotides containing modified phosphates and include but are not limited to, U.S. Pat. Nos. 3,687,808; 4,469,863; 4,476,301; 5,023,243; 5,177,196; 5,188,897; 5,264,423; 5,276,019; 5,278,302; 5,286,717; 5,321,131; 5,399,676; 5,405,939; 5,453,496; 5,455,233; 5,466,677; 5,476,925; 5,519,126; 5,536,821; 5,541,306; 5,550,111; 5,563,253; 5,571,799; 5,587,361; and 5,625,050, each of which is herein incorporated by reference in its entirety, and specifically for their description of modified phosphates, their synthesis, their use, and their incorporation into nucleotides, oligonucleotides and nucleic acids.
It is understood that nucleotide analogs need only contain a single modification, but can also contain multiple modifications within one of the moieties or between different moieties.
Nucleotide substitutes are molecules having similar functional properties to nucleotides, but which do not contain a phosphate moiety, such as peptide nucleic acid (PNA). Nucleotide substitutes are molecules that will recognize and hybridize to (base pair to) complementary nucleic acids in a Watson-Crick or Hoogsteen manner, but which are linked together through a moiety other than a phosphate moiety. Nucleotide substitutes are able to conform to a double helix type structure when interacting with the appropriate target nucleic acid.
Nucleotide substitutes can also include nucleotides or nucleotide analogs that have had the phosphate moiety and/or sugar moieties replaced. Nucleotide substitutes do not contain a standard phosphorus atom. Substitutes for the phosphate can be for example, short chain alkyl or cycloalkyl internucleoside linkages, mixed heteroatom and alkyl or cycloalkyl internucleoside linkages, or one or more short chain heteroatomic or heterocyclic internucleoside linkages. These include those having morpholino linkages (formed in part from the sugar portion of a nucleoside); siloxane backbones; sulfide, sulfoxide and sulfone backbones; formacetyl and thioformacetyl backbones; methylene formacetyl and thioformacetyl backbones; alkene containing backbones; sulfamate backbones; methyleneimino and methylenehydrazino backbones; sulfonate and sulfonamide backbones; amide backbones; and others having mixed N, O, S and CH2 component parts. Numerous United States patents disclose how to make and use these types of phosphate replacements and include but are not limited to U.S. Pat. Nos. 5,034,506; 5,166,315; 5,185,444; 5,214,134; 5,216,141; 5,235,033; 5,264,562; 5,264,564; 5,405,938; 5,434,257; 5,466,677; 5,470,967; 5,489,677; 5,541,307; 5,561,225; 5,596,086; 5,602,240; 5,608,046; 5,610,289; 5,618,704; 5,623,070; 5,663,312; 5,633,360; 5,677,437; and 5,677,439, each of which is herein incorporated by reference in its entirety, and specifically for their description of phosphate replacements, their synthesis, their use, and their incorporation into nucleotides, oligonucleotides and nucleic acids.
It is also understood in a nucleotide substitute that both the sugar and the phosphate moieties of the nucleotide can be replaced, by for example an amide type linkage (aminoethylglycine) (PNA). U.S. Pat. Nos. 5,539,082; 5,714,331; and 5,719,262 teach how to make and use PNA molecules, each of which is herein incorporated by reference. (See also Nielsen et al., Science 254:1497-1500 (1991)).
It is also possible to link other types of molecules (conjugates) to nucleotides or nucleotide analogs to enhance for example, cellular uptake. Conjugates can be chemically linked to the nucleotide or nucleotide analogs. Such conjugates include but are not limited to lipid moieties such as a cholesterol moiety. (Letsinger et al., Proc. Natl. Acad. Sci. USA, 1989, 86, 6553-6556). There are many varieties of these types of molecules available in the art and available herein.
A Watson-Crick interaction is at least one interaction with the Watson-Crick face of a nucleotide, nucleotide analog, or nucleotide substitute. The Watson-Crick face of a nucleotide, nucleotide analog, or nucleotide substitute includes the C2, N1, and C6 positions of a purine based nucleotide, nucleotide analog, or nucleotide substitute and the C2, N3, C4 positions of a pyrimidine based nucleotide, nucleotide analog, or nucleotide substitute.
A Hoogsteen interaction is the interaction that takes place on the Hoogsteen face of a nucleotide or nucleotide analog, which is exposed in the major groove of duplex DNA.
The Hoogsteen face includes the N7 position and reactive groups (NH2 or O) at the C6 position of purine nucleotides.
Oligonucleotides and nucleic acids can be comprised of nucleotides and can be made up of different types of nucleotides or the same type of nucleotides. For example, one or more of the nucleotides in an oligonucleotide can be ribonucleotides, 2′-O-methyl ribonucleotides, or a mixture of ribonucleotides and 2′-O-methyl ribonucleotides; about 10% to about 50% of the nucleotides can be ribonucleotides, 2′-O-methyl ribonucleotides, or a mixture of ribonucleotides and 2′-O-methyl ribonucleotides; about 50% or more of the nucleotides can be ribonucleotides, 2′-O-methyl ribonucleotides, or a mixture of ribonucleotides and 2′-O-methyl ribonucleotides; or all of the nucleotides are ribonucleotides, 2′-O-methyl ribonucleotides, or a mixture of ribonucleotides and 2′-O-methyl ribonucleotides. Such oligonucleotides and nucleic acids can be referred to as chimeric oligonucleotides and chimeric nucleic acids.
It is understood that, as discussed herein, the terms homology and identity mean the same thing as similarity. Thus, for example, if the word homology is used between two sequences (non-natural sequences, for example), it is understood that this is not necessarily indicating an evolutionary relationship between these two sequences, but rather is looking at the similarity or relatedness between their nucleic acid sequences. Many of the methods for determining homology between two evolutionarily related molecules are routinely applied to any two or more nucleic acids or proteins for the purpose of measuring sequence similarity regardless of whether they are evolutionarily related or not.
In general, it is understood that one way to define any known variants and derivatives or those that might arise, of the disclosed riboswitches, aptamers, expression platforms, genes and proteins herein, is through defining the variants and derivatives in terms of homology to specific known sequences. This identity of particular sequences disclosed herein is also discussed elsewhere herein. In general, variants of sequences herein disclosed typically have at least, about 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, or 99 percent homology to a stated sequence or a native sequence. Those of skill in the art readily understand how to determine the homology of two proteins or nucleic acids, such as genes. For example, the homology can be calculated after aligning the two sequences so that the homology is at its highest level.
Another way of calculating homology can be performed by published algorithms. Optimal alignment of sequences for comparison can be conducted by the local homology algorithm of Smith and Waterman, Adv. Appl. Math. 2: 482 (1981), by the homology alignment algorithm of Needleman and Wunsch, J. Mol. Biol. 48: 443 (1970), by the search for similarity method of Pearson and Lipman, Proc. Natl. Acad. Sci. U.S.A. 85: 2444 (1988), by computerized implementations of these algorithms (GAP, BESTFIT, FASTA, and TFASTA in the Wisconsin Genetics Software Package, Genetics Computer Group, 575 Science Dr., Madison, Wis.), or by inspection.
The same types of homology can be obtained for nucleic acids by for example the algorithms disclosed in Zuker, M. Science 244:48-52, 1989, Jaeger et al. Proc. Natl. Acad. Sci. USA 86:7706-7710, 1989, Jaeger et al. Methods Enzymol. 183:281-306, 1989 which are herein incorporated by reference for at least material related to nucleic acid alignment. It is understood that any of the methods typically can be used and that in certain instances the results of these various methods can differ, but the skilled artisan understands if identity is found with at least one of these methods, the sequences would be said to have the stated identity.
For example, as used herein, a sequence recited as having a particular percent homology to another sequence refers to sequences that have the recited homology as calculated by any one or more of the calculation methods described above. For example, a first sequence has 80 percent homology, as defined herein, to a second sequence if the first sequence is calculated to have 80 percent homology to the second sequence using the Zuker calculation method even if the first sequence does not have 80 percent homology to the second sequence as calculated by any of the other calculation methods. As another example, a first sequence has 80 percent homology, as defined herein, to a second sequence if the first sequence is calculated to have 80 percent homology to the second sequence using both the Zuker calculation method and the Pearson and Lipman calculation method even if the first sequence does not have 80 percent homology to the second sequence as calculated by the Smith and Waterman calculation method, the Needleman and Wunsch calculation method, the Jaeger calculation methods, or any of the other calculation methods. As yet another example, a first sequence has 80 percent homology, as defined herein, to a second sequence if the first sequence is calculated to have 80 percent homology to the second sequence using each of the calculation methods (although, in practice, the different calculation methods will often result in different calculated homology percentages).
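As a non-limiting sketch of how a percent homology can be obtained after alignment, the following example performs a simple global alignment in the manner of Needleman and Wunsch and then counts identical aligned positions. The scoring values (match +1, mismatch -1, gap -1), the use of the alignment length as the denominator, and the function names are assumptions made for this illustration; published implementations use other scoring schemes and conventions and can report different percentages.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Globally align two sequences and return the two gapped strings."""
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
    # Trace back from the bottom-right corner to recover one optimal alignment.
    i, j, out_a, out_b = n, m, [], []
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return "".join(reversed(out_a)), "".join(reversed(out_b))


def percent_identity(aligned_a, aligned_b):
    """Identical, non-gap columns as a percentage of the alignment length."""
    same = sum(1 for x, y in zip(aligned_a, aligned_b) if x == y and x != "-")
    return 100.0 * same / len(aligned_a)


def has_recited_homology(percent_by_method, recited_percent):
    """True if at least one calculation method reaches the recited percentage."""
    return any(p >= recited_percent for p in percent_by_method.values())


aligned = needleman_wunsch("ACGT", "AGT")
print(aligned, percent_identity(*aligned))
```

The helper `has_recited_homology` mirrors the rule stated above that a recited percent homology is met when any one of the calculation methods yields it; the dictionary of per-method percentages it accepts is a hypothetical input format.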
The term hybridization typically means a sequence driven interaction between at least two nucleic acid molecules, such as a primer or a probe and a riboswitch or a gene. Sequence driven interaction means an interaction that occurs between two nucleotides or nucleotide analogs or nucleotide derivatives in a nucleotide specific manner. For example, G interacting with C and A interacting with T are sequence driven interactions. Typically sequence driven interactions occur on the Watson-Crick face or Hoogsteen face of the nucleotide. The hybridization of two nucleic acids is affected by a number of conditions and parameters known to those of skill in the art. For example, the salt concentrations, pH, and temperature of the reaction all affect whether two nucleic acid molecules will hybridize.
Parameters for selective hybridization between two nucleic acid molecules are well known to those of skill in the art. For example, in some embodiments selective hybridization conditions can be defined as stringent hybridization conditions. For example, stringency of hybridization is controlled by both temperature and salt concentration of either or both of the hybridization and washing steps. For example, the conditions of hybridization to achieve selective hybridization can involve hybridization in high ionic strength solution (6×SSC or 6×SSPE) at a temperature that is about 12-25° C. below the Tm (the melting temperature at which half of the molecules dissociate from their hybridization partners) followed by washing at a combination of temperature and salt concentration chosen so that the washing temperature is about 5° C. to 20° C. below the Tm. The temperature and salt conditions are readily determined empirically in preliminary experiments in which samples of reference DNA immobilized on filters are hybridized to a labeled nucleic acid of interest and then washed under conditions of different stringencies. Hybridization temperatures are typically higher for DNA-RNA and RNA-RNA hybridizations. The conditions can be used as described above to achieve stringency, or as is known in the art (Sambrook et al., Molecular Cloning: A Laboratory Manual, 2nd Ed., Cold Spring Harbor Laboratory, Cold Spring Harbor, N.Y., 1989; Kunkel et al., Methods Enzymol. 154:367, 1987, which is herein incorporated by reference for material at least related to hybridization of nucleic acids). A preferable stringent hybridization condition for a DNA:DNA hybridization can be at about 68° C. (in aqueous solution) in 6×SSC or 6×SSPE followed by washing at 68° C.
Stringency of hybridization and washing, if desired, can be reduced accordingly as the degree of complementarity desired is decreased, and further, depending upon the G-C or A-T richness of any area wherein variability is searched for. Likewise, stringency of hybridization and washing, if desired, can be increased accordingly as homology desired is increased, and further, depending upon the G-C or A-T richness of any area wherein high homology is desired, all as known in the art.
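The temperature ranges recited above can be expressed as a short calculation. In this sketch the function name `stringency_windows` and the example melting temperature are hypothetical; the Tm itself is taken as a given input, and, as stated above, suitable conditions are in practice determined empirically.

```python
def stringency_windows(tm_celsius):
    """Return (low, high) temperature ranges in degrees C derived from a given Tm.

    Hybridization: about 12 to 25 degrees C below the Tm.
    Washing: about 5 to 20 degrees C below the Tm.
    """
    return {
        "hybridization": (tm_celsius - 25, tm_celsius - 12),
        "wash": (tm_celsius - 20, tm_celsius - 5),
    }


# Hypothetical Tm of 80 degrees C, for illustration only.
windows = stringency_windows(80)
print(windows["hybridization"])   # range from 25 below to 12 below the Tm
print(windows["wash"])            # range from 20 below to 5 below the Tm
```

Lowering the desired degree of complementarity corresponds to choosing temperatures toward the lower end of these ranges, and raising it corresponds to choosing temperatures toward the upper end.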
Another way to define selective hybridization is by looking at the amount (percentage) of one of the nucleic acids bound to the other nucleic acid. For example, in some embodiments selective hybridization conditions would be when at least about 60, 65, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100 percent of the limiting nucleic acid is bound to the non-limiting nucleic acid. Typically, the non-limiting nucleic acid is in, for example, 10 or 100 or 1000 fold excess. This type of assay can be performed under conditions where both the limiting and non-limiting nucleic acids are, for example, 10 fold or 100 fold or 1000 fold below their Kd, or where only one of the nucleic acid molecules is 10 fold or 100 fold or 1000 fold below its Kd, or where one or both nucleic acid molecules are above their Kd.
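The relationship between the dissociation constant and the percentage of limiting nucleic acid bound can be illustrated with a simple equilibrium calculation. The sketch below assumes a one-to-one binding model in which the non-limiting nucleic acid is in sufficient excess that its free concentration is approximately its total concentration; this model, the function names, and the example concentrations are assumptions of the illustration and are not part of the disclosed assay.

```python
def fraction_bound(excess_conc, kd):
    """Fraction of the limiting strand bound at equilibrium (simple 1:1 model).

    excess_conc and kd must be expressed in the same concentration units.
    """
    return excess_conc / (excess_conc + kd)


def is_selective(excess_conc, kd, required_percent=80.0):
    """True if the bound percentage reaches the required percentage."""
    return 100.0 * fraction_bound(excess_conc, kd) >= required_percent


# Hypothetical values: the non-limiting strand at 9 times its dissociation constant.
print(fraction_bound(9.0, 1.0))
print(is_selective(9.0, 1.0, required_percent=80.0))
# At a concentration equal to the dissociation constant, half of the limiting strand is bound.
print(is_selective(1.0, 1.0, required_percent=80.0))
```

Under this model, working well below the dissociation constant leaves most of the limiting nucleic acid unbound, while working above it drives the bound percentage toward 100.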
Another way to define selective hybridization is by looking at the percentage of nucleic acid that gets enzymatically manipulated under conditions where hybridization is required to promote the desired enzymatic manipulation. For example, in some embodiments selective hybridization conditions would be when at least about, 60, 65, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100 percent of the nucleic acid is enzymatically manipulated under conditions which promote the enzymatic manipulation, for example if the enzymatic manipulation is DNA extension, then selective hybridization conditions would be when at least about 60, 65, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100 percent of the nucleic acid molecules are extended. Preferred conditions also include those suggested by the manufacturer or indicated in the art as being appropriate for the enzyme performing the manipulation.
Just as with homology, it is understood that there are a variety of methods herein disclosed for determining the level of hybridization between two nucleic acid molecules. It is understood that these methods and conditions can provide different percentages of hybridization between two nucleic acid molecules, but unless otherwise indicated meeting the parameters of any of the methods would be sufficient. For example, if 80% hybridization is required, then as long as hybridization occurs within the required parameters in any one of these methods, it is considered disclosed herein.
It is understood that those of skill in the art understand that if a composition or method meets any one of these criteria for determining hybridization either collectively or singly it is a composition or method that is disclosed herein.
1. Probes and Primers
Disclosed are compositions including primers and probes, which are capable of interacting with the disclosed nucleic acids such as status biomarkers, DNA fragments, repetitive DNA sequences, unique sequences, PCR amplicons, and probe binding sequences. In certain embodiments the primers are used to support DNA amplification reactions. Typically the primers will be capable of being extended in a sequence specific manner. Extension of a primer in a sequence specific manner includes any methods wherein the sequence and/or composition of the nucleic acid molecule to which the primer is hybridized or otherwise associated directs or influences the composition or sequence of the product produced by the extension of the primer. Extension of the primer in a sequence specific manner therefore includes, but is not limited to, PCR, DNA sequencing, DNA extension, DNA polymerization, RNA transcription, or reverse transcription. Techniques and conditions that amplify the primer in a sequence specific manner are preferred. In certain embodiments the primers are used for the DNA amplification reactions, such as PCR or direct sequencing. It is understood that in certain embodiments the primers can also be extended using non-enzymatic techniques, where for example, the nucleotides or oligonucleotides used to extend the primer are modified such that they will chemically react to extend the primer in a sequence specific manner. Typically the disclosed primers hybridize with the disclosed nucleic acids or region of the nucleic acids or they hybridize with the complement of the nucleic acids or complement of a region of the nucleic acids.
Probes for biomarkers can be designed in any suitable manner. Examples of methods and techniques for designing probes are described herein, but any other methods and techniques can be used. Useful probes can be specific for particular biomarkers, loci, families of biomarkers, families of loci, etc. Sequence analysis of biomarker and loci sequences (such as nucleic acid regions containing CpG islands and CpG islets) can be used to identify specific and/or selective probes. Particularly useful probes can be complementary to unique sequences in biomarkers and loci of interest or to characteristic or consensus sequences in biomarker and locus families.
The size of the primers or probes for interaction with the nucleic acids in certain embodiments can be any size that supports the desired enzymatic manipulation of the primer, such as DNA amplification or the simple hybridization of the probe or primer. A typical primer or probe would be at least 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 125, 150, 175, 200, 225, 250, 275, 300, 325, 350, 375, 400, 425, 450, 475, 500, 550, 600, 650, 700, 750, 800, 850, 900, 950, 1000, 1250, 1500, 1750, 2000, 2250, 2500, 2750, 3000, 3500, or 4000 nucleotides long.
In other embodiments a primer or probe can be less than or equal to 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 125, 150, 175, 200, 225, 250, 275, 300, 325, 350, 375, 400, 425, 450, 475, 500, 550, 600, 650, 700, 750, 800, 850, 900, 950, 1000, 1250, 1500, 1750, 2000, 2250, 2500, 2750, 3000, 3500, or 4000 nucleotides long.
The disclosed methods involve the use of probes and primers. For example, probes for status biomarkers are used to capture, detect, measure, and/or assess status biomarkers. These and other probes can be designed and made using any suitable techniques. Many such techniques are known in the art. The examples and other description herein provide examples of the design of probes and of features useful to the probes to be used in the disclosed methods. The disclosed probes can be used, for example, to detect the level of the status biomarkers by using, for example, an array of probes specific for the status biomarkers. In some forms, the array of probes can be, for example, a microarray.
Useful forms of the disclosed probes can be complementary to, and/or specific for, any sequence in a status biomarker. Such complementary sequences in status biomarkers can be referred to as probe binding sites. Particularly useful target sequences for probes are unique sequences and repetitive DNA sequences. Useful probes for unique sequences can be of sufficient length, and have a nucleotide sequence distinctive enough, to hybridize uniquely in the genome at the unique sequence. For example, nucleic acid sequences of, or of at least, 15, 20, 25, 30, 35, 40, 45, 50, 55, 60, 65, 70, 75, 80, 85, 90, 95, or 100 nucleotides in length can be used as probes for unique sequences. Unique sequences can be identified by, for example, analysis of a genome sequence or by analysis of probe hybridization. Probes for repetitive DNA sequences and other targets can have any useful length. For example, nucleic acid sequences of, or of at least, 10, 15, 20, 25, 30, 35, 40, 45, 50, 55, 60, 65, 70, 75, 80, 85, 90, 95, or 100 nucleotides in length can be used as probes.
Probes can be specific for probe binding sites in status biomarkers. In some forms, the one or more of the status biomarkers can comprise a probe binding site, wherein the probe binding site of the one or more of the status biomarkers is specific for a probe. Probe binding sites can be, for example, all or a portion of a unique sequence in the status biomarker. In some forms, one or more of the probes can be specific for a repetitive DNA sequence locus, wherein the repetitive DNA sequence locus comprises one or more repetitive DNA sequences, wherein independently for each of the one or more of the probes one or more of the repetitive DNA sequences belongs to a family of repetitive DNA sequences listed in, for example, Table 1, Table 12, or Table 13. In some forms, each probe can be specific for a repetitive DNA sequence locus, wherein independently for each probe one or more of the repetitive DNA sequences belongs to a family of repetitive DNA sequences such as the repetitive DNA sequence families listed in, for example, Table 1, Table 12, or Table 13.
In some forms, one or more of the probes can be specific for a repetitive DNA sequence locus, wherein the repetitive DNA sequence locus comprises one or more repetitive DNA sequences, wherein for one or more of the probes one or more of the repetitive DNA sequences is an interspersed repeat element. In some forms, each probe can be specific for a repetitive DNA sequence locus, wherein for each probe one or more of the repetitive DNA sequences is an interspersed repeat element.
Primers can be used in the disclosed methods to replicate and/or amplify nucleic acids. For example, primers for PCR can be used to amplify genomic sequences and sequences of status biomarkers. Primers can also be used for other replication and amplification techniques, such as multiple displacement amplification and replication-based nucleic acid sequencing techniques. Many such techniques are known and principles and techniques for design of primers for use in such techniques are known and can be used for the disclosed primers and methods.
In some forms of the disclosed methods, part or all of a status biomarker can be replicated and/or amplified as a PCR amplicon. In some forms, one or more of the status biomarkers can comprise a PCR amplicon. A PCR amplicon is a region of nucleic acid including and between the binding sites of PCR primers. PCR amplicons can be said to be defined by the binding sites of the primers and by the primers themselves. In some forms, the PCR amplicon of each of the one or more of the status biomarkers can be defined by a first primer specific for a single one of the status biomarkers and a second primer. A primer specific for a status biomarker refers to a primer that can bind to a sequence in, and prime replication of, the status biomarker. A primer specific for a repetitive DNA sequence refers to a primer that can bind to a sequence in, and prime replication of, the repetitive DNA sequence. In some forms, the PCR amplicon of each of the one or more of the status biomarkers can be defined by the same first primer specific for a first type of repetitive DNA sequence and a second primer, wherein the second primer is specific for a second type of repetitive DNA sequence, wherein the second primer is the same for some and different for some of the one or more of the status biomarkers. In some forms, the first primer can be specific for one of the families of repetitive DNA sequences listed in Table 16 or 17, wherein independently for each of the one or more of the status biomarkers the second primer is specific for a family of repetitive DNA sequences such as the repetitive DNA sequence families listed in, for example, Table 1, Table 12, or Table 13. A primer specific for a family of repetitive DNA sequences refers to a primer that can bind to a sequence in, and prime replication of, one or more repetitive DNA sequences in the family of repetitive DNA sequences.
The presence of repetitive DNA sequences belonging to different families of repetitive DNA sequences in the same, for example, status biomarker or repetitive DNA sequence locus can facilitate some of the forms of the disclosed methods. For example, the different repetitive DNA sequences can be used to define a PCR amplicon by, for example, using primers specific for two of the different repetitive DNA sequences.
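The definition of a PCR amplicon as the region including and between the primer binding sites can be sketched in code. The example below is a minimal illustration that assumes exact sequence matching (real primers tolerate mismatches and are subject to thermodynamic constraints), a single-stranded template written 5′ to 3′, and a second primer that binds the opposite strand. The function names and test sequences are hypothetical.

```python
from typing import Optional

_COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence (A, C, G, T only)."""
    return "".join(_COMPLEMENT[base] for base in reversed(seq.upper()))


def find_amplicon(template: str, first_primer: str,
                  second_primer: str) -> Optional[str]:
    """Return the region including and between the two primer binding sites.

    The first primer is matched directly on the template strand; the second
    primer is matched through its reverse complement, because it binds the
    opposite strand. Returns None if either binding site is absent.
    """
    template = template.upper()
    start = template.find(first_primer.upper())
    if start < 0:
        return None
    second_site = reverse_complement(second_primer)
    end = template.find(second_site, start + len(first_primer))
    if end < 0:
        return None
    return template[start:end + len(second_site)]
```

With primers specific for two different repetitive DNA sequences in the same locus, the returned region would span from the first repeat to the second.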
In some forms, detecting the level of the status biomarkers can be accomplished via, for example, amplifying the processed DNA and determining the ratio of cytosine to thymidine in the amplified DNA and converting the ratio to the level of methylated forms of the status biomarkers. In some forms, the processed DNA can be amplified via, for example, PCR amplification of the status biomarkers using primers specific for the status biomarkers.
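The conversion of a cytosine-to-thymidine ratio into a methylation level can be illustrated as follows. The sketch assumes that the DNA was processed so that unmethylated cytosines read as thymidine after amplification while methylated cytosines still read as cytosine (as in bisulfite-type conversion, which is an assumption here), and that counts come from a single cytosine position or a pooled set of positions. The function names are hypothetical.

```python
def methylation_level(c_count: int, t_count: int) -> float:
    """Fraction methylated, computed as C / (C + T) at the assayed position(s)."""
    if c_count < 0 or t_count < 0:
        raise ValueError("counts must be non-negative")
    total = c_count + t_count
    if total == 0:
        raise ValueError("no cytosine or thymidine reads to evaluate")
    return c_count / total


def methylation_from_ratio(c_to_t_ratio: float) -> float:
    """Convert a C:T ratio r into the methylated fraction r / (1 + r)."""
    if c_to_t_ratio < 0:
        raise ValueError("ratio must be non-negative")
    return c_to_t_ratio / (1.0 + c_to_t_ratio)
```

For instance, 30 cytosine reads and 10 thymidine reads (a ratio of 3) correspond to a methylation level of 0.75 under these assumptions.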
In some forms, detecting the level of the status biomarkers can be accomplished via, for example, PCR amplification of the status biomarkers using primers specific for the status biomarkers. In some forms, the PCR amplification can be quantitative PCR. In some forms, the PCR amplification can be nanoliter-microarray quantitative PCR.
Probes can also be used to capture status biomarkers and sequences derived from status biomarkers. Such probes can be referred to as capture probes, status biomarker capture probes, or status biomarker probes. In some forms, treating the DNA sample can be accomplished by, for example, capturing status biomarker DNA fragments. In some forms, the status biomarker DNA fragments can be captured by, for example, binding DNA fragments in the DNA sample to status biomarker probes attached to a support. In some forms, one or more of the status biomarker probes can specifically hybridize to one or more repetitive DNA sequences. In some forms, each of the status biomarker probes can specifically hybridize to one or more repetitive DNA sequences. Such probes can be specific for specific repetitive DNA sequences. Such probes can also be specific for a group or family of repetitive DNA sequences or a group or family of status biomarkers. For example, one or more of the status biomarker probes can comprise degenerate sequence representing a consensus sequence for a family of repetitive DNA sequences. In some forms, the family of repetitive DNA sequences can be a family of repetitive DNA sequences such as the repetitive DNA sequence families listed in, for example, Table 1, Table 12, or Table 13. In some forms, the one or more of the status biomarker probes can comprise different degenerate sequences each representing a consensus sequence for a different one of the families of repetitive DNA sequences such as the repetitive DNA sequence families listed in, for example, Table 1, Table 12, or Table 13. 
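A degenerate probe sequence representing a family consensus can be illustrated by expanding standard IUPAC nucleotide ambiguity codes into a pattern and testing which family members the pattern matches. This is a minimal sketch: exact pattern matching stands in for hybridization, and the probe and member sequences used with it are invented for illustration and are not actual consensus sequences from the tables.

```python
import re
from typing import Dict, List

# Standard IUPAC nucleotide ambiguity codes expressed as regex character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}


def degenerate_to_regex(probe: str) -> "re.Pattern":
    """Compile a degenerate (IUPAC) probe sequence into a regular expression."""
    return re.compile("".join(IUPAC[base] for base in probe.upper()))


def matching_members(probe: str, family: Dict[str, str]) -> List[str]:
    """Names of the family members whose sequence contains a probe match."""
    pattern = degenerate_to_regex(probe)
    return [name for name, seq in family.items() if pattern.search(seq.upper())]
```

A probe such as GGCYGGGCRCGG (hypothetical) would match members carrying either C or T at the Y position and either A or G at the R position, which is how one degenerate sequence can cover several members of a family.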
In some forms, the one or more of the status biomarker probes can comprise at least 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 55, 60, 65, 70, 75, 80, 85, 90, 95, 100, 105, 110, 115, 120, 125, 130, or 135 different degenerate sequences each representing a different consensus sequence for a different one of the families of repetitive DNA sequences such as the repetitive DNA sequence families listed in, for example, Table 1, Table 12, or Table 13. The families of repetitive DNA sequences can be selected for in any manner, including by selecting the first at least 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 55, 60, 65, 70, 75, 80, 85, 90, 95, 100, 105, 110, 115, 120, 125, 130, or 135 families in rank order. In some forms, the support can comprise, for example, gel, a bead, a magnetic bead, a plate, a slide, a surface, or a microparticle. In some forms, DNA not captured can be separated from the captured status biomarker DNA fragments. In some forms, the sequencing can be a form of SMRT sequencing.
In some forms, the method can further comprise, after capturing status biomarker DNA fragments and prior to sequencing the captured status biomarker DNA fragments, releasing the captured status biomarker DNA fragments and recapturing the released status biomarker DNA fragments. In some forms, the status biomarker DNA fragments can be recaptured by binding DNA fragments in the DNA sample to secondary status biomarker probes attached to a support. In some forms, one or more of the status biomarker probes can specifically hybridize to one or more repetitive DNA sequences, wherein the one or more of the status biomarker probes comprises degenerate sequence representing a consensus sequence for a family of repetitive DNA sequences. In some forms, each of the status biomarker probes can specifically hybridize to one or more repetitive DNA sequences, wherein each of the status biomarker probes comprises degenerate sequence representing a consensus sequence for a family of repetitive DNA sequences. In some forms, the family of repetitive DNA sequences can be a family of repetitive DNA sequences such as the repetitive DNA sequence families listed in, for example, Table 16 and Table 17. In some forms, the family of repetitive DNA sequences can be a family of repetitive DNA sequences listed in Table 16 or 17. For example, the family of repetitive DNA sequences can be the AluY, AluSx, AluSp, AluSg, or AluSc family of repetitive DNA sequences. In some forms, the one or more of the status biomarker probes can comprise different degenerate sequences each representing a consensus sequence for a different one of the families of repetitive DNA sequences such as the repetitive DNA sequence families listed in, for example, Table 16 and Table 17. 
In some forms, the one or more of the status biomarker probes can comprise different degenerate sequences each representing a consensus sequence for a different one of the families of repetitive DNA sequences listed in Table 16 or 17, such as AluY, AluSx, AluSp, AluSg, or AluSc.
In some forms, status biomarker probes can be produced by, for example, selecting a subset of repetitive DNA sequence loci from a set of repetitive DNA sequence loci, generating a set of status biomarker probe sequences, and synthesizing one or more status biomarker probes. In some forms, the method for producing status biomarker probes can further comprise selecting one or more additional subsets of repetitive DNA sequence loci each from a different additional set of repetitive DNA sequence loci, generating one or more additional sets of status biomarker probe sequences each based on one of the one or more additional subsets, and synthesizing one or more additional status biomarker probes, wherein each additional status biomarker probe has the sequence of one of the additional status biomarker probe sequences. In some forms, the repetitive DNA sequence loci in the set of repetitive DNA sequence loci can belong to a single one of the families of repetitive DNA sequence such as the repetitive DNA sequence families listed in, for example, Table 1, Table 12, or Table 13, wherein the subset of repetitive DNA sequence loci can be selected by identifying those repetitive DNA sequence loci that comprise a repetitive DNA sequence belonging to one of the families of repetitive DNA sequences such as the repetitive DNA sequence families listed in, for example, Table 16 and Table 17. In some forms, the repetitive DNA sequence loci in each additional set of repetitive DNA sequence loci can independently belong to a different single one of the families of repetitive DNA sequence such as the repetitive DNA sequence families listed in, for example, Table 1, Table 12, or Table 13, wherein the repetitive DNA sequence loci in the set of repetitive DNA sequence loci and in each additional set of repetitive DNA sequence loci belong to different families of repetitive DNA sequence.
In some forms, each status biomarker probe sequence in a set can have a length of, for example, 15, 20, 25, 30, 35, 40, 45, 50, 55, 60, 65, 70, 75, 80, 85, 90, 95, or 100 bases or more. In some forms, each status biomarker probe represented in the set of status biomarker probe sequences can hybridize to, for example, at least 1%, 2%, 3%, 4%, 5%, 6%, 7%, 8%, 9%, 10%, 11%, 12%, 13%, 14%, 15% of the repetitive DNA sequence loci in the selected subset of repetitive DNA sequence loci. In some forms, each status biomarker probe can have the sequence of one of the generated status biomarker probe sequences.
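The length and coverage criteria for generated probe sequences can be sketched as a simple filter. The example below treats an exact substring match as a stand-in for hybridization, which is a simplifying assumption, and the function names and default thresholds are hypothetical choices drawn from the ranges recited above.

```python
from typing import List


def percent_loci_covered(probe: str, loci: List[str]) -> float:
    """Percentage of loci that contain the probe sequence (exact match)."""
    if not loci:
        raise ValueError("the subset of loci must not be empty")
    probe = probe.upper()
    hits = sum(1 for locus in loci if probe in locus.upper())
    return 100.0 * hits / len(loci)


def select_probes(candidates: List[str], loci: List[str],
                  min_percent: float = 10.0, min_length: int = 15) -> List[str]:
    """Keep candidates that are long enough and cover enough of the loci."""
    return [
        probe for probe in candidates
        if len(probe) >= min_length
        and percent_loci_covered(probe, loci) >= min_percent
    ]
```

A candidate found in two of four loci in the selected subset has 50 percent coverage and passes a 10 percent threshold, while a candidate found in none of the loci is discarded.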
In some forms, the set of status biomarker probe sequences can include, for example, any range of from 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 32, 34, 35, 36, 38, 40, 42, 44, 45, 46, 48, 50, 55, 60, 65, 70, 75, 80, 85, 90, 95, 100, 110, 120, 130, 140, 150, 160, 170, 180, 190, or 200 to 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 32, 34, 35, 36, 38, 40, 42, 44, 45, 46, 48, 50, 55, 60, 65, 70, 75, 80, 85, 90, 95, 100, 110, 120, 130, 140, 150, 160, 170, 180, 190, or 200 status biomarker probe sequences. In some forms, the set of status biomarker probe sequences can comprise from 5 to 100 status biomarker probe sequences. In some forms, the set of status biomarker probe sequences can comprise from 10 to 100 status biomarker probe sequences. In some forms, one or more of the additional sets of status biomarker probe sequences each can comprise from 1 to 100 status biomarker probe sequences. In some forms, the one or more additional sets of status biomarker probe sequences each can comprise from 5 to 100 status biomarker probe sequences. In some forms, the one or more additional sets of status biomarker probe sequences each can comprise from 10 to 100 status biomarker probe sequences.
The disclosed methods and compositions can use supports. For example, probes and primers can be attached or associated with supports for use in the disclosed methods. Such probe and primer associated supports can take the form of, for example, arrays and microarrays. Solid supports are solid-state substrates or supports with which molecules (such as probes and primers) can be associated. Probes, primers, and other molecules can be associated with solid supports directly or indirectly. For example, probes can be bound to the surface of a solid support or associated with capture agents (e.g., oligonucleotides or molecules that bind a probe) immobilized on solid supports. As another example, probes can be bound to the surface of a solid support or associated with oligonucleotides immobilized on solid supports. An array is a solid support to which multiple probes, primers, or other molecules have been associated in an array, grid, or other organized pattern.
Solid-state substrates for use in solid supports can include any solid material with which components can be associated, directly or indirectly. This includes materials such as gel, acrylamide, agarose, cellulose, nitrocellulose, glass, gold, polystyrene, polyethylene vinyl acetate, polypropylene, polymethacrylate, polyethylene, polyethylene oxide, polysilicates, polycarbonates, Teflon, fluorocarbons, nylon, silicone rubber, polyanhydrides, polyglycolic acid, polylactic acid, polyorthoesters, functionalized silane, polypropylfumarate, collagen, glycosaminoglycans, and polyamino acids. Solid-state substrates can have any useful form including thin film, membrane, bottles, dishes, plates, slides, fibers, woven fibers, shaped polymers, chromatography matrix, particles, magnetic particles, beads, magnetic beads, microparticles, magnetic microparticles, nanoparticles, magnetic nanoparticles, or a combination. Solid-state substrates and solid supports can be porous or non-porous. A chip is a rectangular or square small piece of material. Useful forms for solid-state substrates are thin films, beads, or chips. A useful form for a solid-state substrate is a microtiter dish. In some embodiments, a multiwell glass slide can be employed.
An array can include a plurality of probes, other molecules, compounds or primers immobilized at identified or predefined locations on the solid support. Each predefined location on the solid support generally has one type of component (that is, all the components at that location are the same). Alternatively, multiple types of components can be immobilized in the same predefined location on a solid support. Each location will have multiple copies of the given components. The spatial separation of different components on the solid support allows separate detection and identification.
Although useful, it is not required that the solid support be a single unit or structure. A set of probes, other molecules, compounds and/or primers can be distributed over any number of solid supports. For example, at one extreme, each component can be immobilized in a separate reaction tube or container, or on separate beads or microparticles.
Methods for immobilization of oligonucleotides to solid-state substrates are well established. Oligonucleotides, including address probes and detection probes, can be coupled to substrates using established coupling methods. For example, suitable attachment methods are described by Pease et al., Proc. Natl. Acad. Sci. USA 91(11):5022-5026 (1994), and Khrapko et al., Mol Biol (Mosk) (USSR) 25:718-730 (1991). A method for immobilization of 3′-amine oligonucleotides on casein-coated slides is described by Stimpson et al., Proc. Natl. Acad. Sci. USA 92:6379-6383 (1995). A useful method of attaching oligonucleotides to solid-state substrates is described by Guo et al., Nucleic Acids Res. 22:5456-5465 (1994).
Each of the components (for example, probes, primers, or other molecules) immobilized on the solid support can be located in a different predefined region of the solid support. The different locations can be different reaction chambers. Each of the different predefined regions can be physically separated from each other of the different regions. The distance between the different predefined regions of the solid support can be either fixed or variable. For example, in an array, each of the components can be arranged at fixed distances from each other, while components associated with beads will not be in a fixed spatial relationship. In particular, the use of multiple solid support units (for example, multiple beads) will result in variable distances.
Components can be associated or immobilized on a solid support at any density. Components can be immobilized to the solid support at a density exceeding 400 different components per cubic centimeter. Arrays of components can have any number of components. For example, an array can have at least 1,000 different components immobilized on the solid support, at least 10,000 different components immobilized on the solid support, at least 100,000 different components immobilized on the solid support, or at least 1,000,000 different components immobilized on the solid support.
Any nucleic acid sample can be used with the disclosed methods. Examples of suitable nucleic acid samples include DNA samples, genomic samples, mRNA samples, cDNA samples, nucleic acid libraries (including cDNA and genomic libraries), whole cell samples, culture samples, tissue samples, bodily fluids, biopsy samples, or a combination. Numerous other sources of nucleic acid samples are known or can be developed and any can be used with the disclosed method. Generally, it is useful to use a genomic sample from cells, tissues, subjects, that are relevant to the status being assessed.
The source, identity, and preparation of many such nucleic acid samples are known. The nucleic acid sample can be, for example, a nucleic acid sample from one or more cells, tissue, skin, lung, head, neck, prostate, breast, ovary, brain, liver, stomach, intestine, kidney, testicle, cervix, uterus, spleen, bone, throat, esophagus, muscle, or bodily fluids such as blood, urine, semen, lymphatic fluid, cerebrospinal fluid, or amniotic fluid, or other biological samples, such as tissue culture cells, buccal swabs, mouthwash, stool, tissue slices, and biopsy aspiration. Types of useful DNA samples include blood samples, urine samples, semen samples, lymphatic fluid samples, cerebrospinal fluid samples, amniotic fluid samples, biopsy samples, needle aspiration biopsy samples, cancer samples, tumor samples, tissue samples, cell samples, cell lysate samples, crude cell lysate samples, forensic samples, infection samples, and/or nosocomial infection samples.
Nucleic acid fragments are segments of larger nucleic acid molecules. Nucleic acid fragments, as used in the disclosed method, generally refer to nucleic acid molecules that have been cleaved. A nucleic acid sample that has been incubated with a nucleic acid cleaving reagent is referred to as a digested sample. A nucleic acid sample that has been digested using a restriction enzyme is referred to as a digested sample.
The materials described herein as well as other materials can be packaged together in any suitable combination as a kit useful for performing, or aiding in the performance of, the disclosed method. It is useful if the kit components in a given kit are designed and adapted for use together in the disclosed method. For example, disclosed are kits for assessing status of a subject, the kit comprising probes for status biomarkers. The kits also can contain status biomarker capture probes, primers for multiple displacement amplification, PCR primers, restriction endonucleases, or a combination.
Disclosed are mixtures formed by performing or preparing to perform the disclosed method. For example, disclosed are mixtures comprising a DNA sample and restriction endonucleases, a DNA sample and primers, a DNA sample and probes, digested, amplified DNA and probes, treated DNA and probes, etc.
Whenever the method involves mixing or bringing into contact compositions or components or reagents, performing the method creates a number of different mixtures. For example, if the method includes 3 mixing steps, after each one of these steps a unique mixture is formed if the steps are performed separately. In addition, a mixture is formed at the completion of all of the steps regardless of how the steps were performed. The present disclosure contemplates these mixtures, obtained by the performance of the disclosed methods, as well as mixtures containing any reagent, composition, or component disclosed herein.
Disclosed are systems useful for performing, or aiding in the performance of, the disclosed method. Systems generally comprise combinations of articles of manufacture such as structures, machines, devices, and the like, and compositions, compounds, materials, and the like. Such combinations that are disclosed or that are apparent from the disclosure are contemplated. For example, disclosed and contemplated are systems comprising detection apparatus and arrays of probes.
Disclosed are data structures used in, generated by, or generated from, the disclosed method. Data structures generally are any form of data, information, and/or objects collected, organized, stored, and/or embodied in a composition or medium. A pattern of methylation states and/or levels for status biomarkers stored in electronic form, such as in RAM or on a storage disk, is a type of data structure.
The disclosed method, or any part thereof or preparation therefor, can be controlled, managed, or otherwise assisted by computer control. Such computer control can be accomplished by a computer controlled process or method, can use and/or generate data structures, and can use a computer program. Such computer control, computer controlled processes, data structures, and computer programs are contemplated and should be understood to be disclosed herein.
The disclosed methods and compositions are applicable to numerous areas including, but not limited to, assessment of status of cells, tissues, and/or subjects, such as by assessment of the presence, stage, risk, etc. of a disease or condition. Other uses include assessing aging and/or general health of cells, tissues, and/or subjects. Other uses are disclosed, apparent from the disclosure, and/or will be understood by those in the art.
Disclosed are methods of assessing one or more statuses of a subject. Also disclosed are methods of identifying status biomarkers associated with a status of a subject. Also disclosed are methods of producing status biomarker capture probes.
Status biomarkers can be used to assess one or more statuses of a subject. This can be done by, for example, determining the methylation state of one or more status biomarkers in the subject, and comparing one or more of the determined methylation states to one or more reference methylation states, wherein a difference, lack of a difference, or both in one or more of the determined methylation states and one or more of the reference methylation states indicates one or more statuses of the subject.
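The comparison of determined methylation states to reference methylation states can be sketched as follows. The example assumes that each methylation state is summarized as a fraction between 0 and 1 and that a fixed tolerance separates a difference from a lack of a difference. The tolerance value, the labels, and the function name are hypothetical, and the mapping from a pattern of differences to a particular status is not specified by this sketch.

```python
from typing import Dict


def compare_to_reference(determined: Dict[str, float],
                         reference: Dict[str, float],
                         tolerance: float = 0.1) -> Dict[str, str]:
    """Label each status biomarker relative to its reference methylation state.

    Biomarkers without a reference value are skipped. A determined level
    within `tolerance` of the reference is reported as 'no difference'.
    """
    calls: Dict[str, str] = {}
    for marker, level in determined.items():
        if marker not in reference:
            continue
        delta = level - reference[marker]
        if delta < -tolerance:
            calls[marker] = "hypomethylated"
        elif delta > tolerance:
            calls[marker] = "hypermethylated"
        else:
            calls[marker] = "no difference"
    return calls
```

Both the differences and the lack of differences are retained in the result, since either can be informative about a status.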
i. Determining Methylation State
The methylation state of status biomarkers can be determined using any suitable technique or method. A number of techniques for detecting and determining the presence and level of methylation of DNA are known. Such methods and techniques can be used in the disclosed methods. Generally, methylation can be determined via direct detection of methylated nucleotides or indirectly by altering or separating nucleotides or nucleic acids based on the presence or absence of methylation. In some forms, the methylation state can be determined by, for example, treating a DNA sample of the subject to differentiate methylated and unmethylated nucleotides, and detecting the level of methylated forms of the one or more status biomarkers in the treated DNA, detecting the level of unmethylated forms of the one or more status biomarkers in the treated DNA, or both, wherein the level of methylated forms of the status biomarkers, the level of unmethylated forms of the status biomarkers, or both indicates the methylation state of the status biomarkers.
a. Treating DNA Samples
In some forms, treating the DNA sample can be accomplished by, for example, incubating the DNA sample with one or more restriction endonucleases and amplifying the incubated DNA, wherein the restriction endonucleases are methylation-sensitive restriction endonucleases, wherein the level of the status biomarkers in the amplified DNA is lower when the status biomarkers have reduced methylation and the level of the status biomarkers in the amplified DNA is higher when the status biomarkers have increased methylation, wherein the level of the status biomarkers comprises the level of methylated forms of the one or more status biomarkers in the treated DNA, the level of unmethylated forms of the one or more status biomarkers in the treated DNA, or both. An example of such forms of the methods is described in Example 3. A methylation-sensitive restriction endonuclease is a restriction endonuclease that cleaves only at unmethylated recognition and/or cleavage sites. Amplification can distinguish methylated and unmethylated status biomarkers via differential cleavage by the restriction endonuclease based on the methylation state of the DNA. For example, cleaving DNA into smaller fragments can reduce the amplification of the DNA. Multiple displacement amplification is useful for this purpose. The methylation state can then be determined by detecting or assessing the presence, absence, or level of amplified nucleic acid.
In some forms, the restriction endonucleases can further comprise at least one methylation-dependent restriction endonuclease. A methylation-dependent restriction endonuclease is a restriction endonuclease that cleaves only at methylated recognition and/or cleavage sites. In some forms, the restriction endonucleases can further comprise at least one methylation-independent restriction endonuclease. A methylation-independent restriction endonuclease is a restriction endonuclease that cleaves at both methylated and unmethylated recognition and/or cleavage sites. In some forms, the restriction endonucleases can comprise AciI and HhaI. In some forms, the restriction endonucleases can comprise McrBC. In some forms, incubating the DNA sample with one or more endonucleases can be accomplished by, for example, incubating different aliquots of the DNA sample with different restriction endonucleases. In some forms, amplifying the incubated DNA can be accomplished by, for example, multiple displacement amplification. An example of such forms of the methods is described in Example 3. Techniques useful for these forms of assessment of methylation states are described in U.S. Patent Application Publication No. 20060292585.
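By way of illustration only, the effect of methylation-sensitive cleavage can be sketched in silico. The following Python sketch locates AciI (CCGC/GCGG) and HhaI (GCGC) recognition sites in a fragment and reports whether the fragment would remain intact, under the simplifying assumptions that digestion is complete and that methylation of the CpG within a site fully blocks cleavage. The function names, the 0-based position convention, and the input format are hypothetical and are not taken from the Examples.

```python
import re

# Recognition sequences written 5'->3' for both strands; each contains one CpG.
# AciI: CCGC / GCGG. HhaI: GCGC (palindromic).
SITES = {"AciI": ("CCGC", "GCGG"), "HhaI": ("GCGC",)}


def find_sites(seq, enzyme):
    """Return sorted (site_start, cpg_cytosine_index) pairs, 0-based."""
    seq = seq.upper()
    hits = []
    for motif in SITES[enzyme]:
        cg_offset = motif.index("CG")
        # A lookahead pattern is used so that overlapping sites are found.
        for m in re.finditer("(?=%s)" % motif, seq):
            hits.append((m.start(), m.start() + cg_offset))
    return sorted(hits)


def survives_digest(seq, methylated_cpgs, enzymes=("AciI", "HhaI")):
    """True if every recognition site is protected by CpG methylation.

    Simplified model (an assumption): a site is cleaved if and only if the
    cytosine of its CpG is absent from `methylated_cpgs`.
    """
    for enzyme in enzymes:
        for _, cpg in find_sites(seq, enzyme):
            if cpg not in methylated_cpgs:
                return False
    return True


fragment = "AAGCGCTTCCGCAA"
print(find_sites(fragment, "HhaI"), find_sites(fragment, "AciI"))
print(survives_digest(fragment, {3, 9}), survives_digest(fragment, {3}))
```

In this model an intact fragment stands in for one that would amplify efficiently, and a cleaved fragment for one whose amplification would be reduced. Partial digestion and amplification bias are not modeled.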
In some forms, treating the DNA sample can be accomplished by, for example, processing the DNA sample with sodium bisulfite. An example of such forms of the methods is described in Example 4. Sodium bisulfite converts unmethylated cytosine to uracil, which is read as thymine after amplification, but does not convert methylcytosine. This allows detection of methylation and methylation levels by detecting cytosine and thymidine. The ratio of cytosine to thymidine can be converted to the relative methylation level.
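As a non-limiting sketch of the conversion described above, the following Python code simulates bisulfite treatment of a sequence and converts cytosine and thymine counts at a position into a methylation level, taken here as C / (C + T). The function names and the use of 0-based positions are illustrative assumptions.

```python
def bisulfite_convert(seq, methylated_positions):
    """Simulate bisulfite treatment followed by amplification.

    Unmethylated C is read as T; C at a position listed in
    `methylated_positions` (0-based) is protected and stays C.
    """
    out = []
    for i, base in enumerate(seq.upper()):
        if base == "C" and i not in methylated_positions:
            out.append("T")
        else:
            out.append(base)
    return "".join(out)


def methylation_level(c_count, t_count):
    """Fraction methylated at one cytosine position: C / (C + T)."""
    total = c_count + t_count
    if total == 0:
        raise ValueError("no informative reads at this position")
    return c_count / total


print(bisulfite_convert("ACGTCG", {1}))
print(methylation_level(30, 10))
```

Incomplete conversion and sequencing error are ignored in this sketch. In practice both would bias the estimate and would need to be controlled for.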
In some forms, treating the DNA sample can be accomplished by, for example, fragmenting the DNA and separating methylated DNA from unmethylated DNA. An example of such forms of the methods is described in Example 5. In some forms, the DNA can be fragmented by, for example, nebulization, cleavage with a restriction endonuclease, sonication, or a combination. In some forms, methylated DNA can be separated from unmethylated DNA by, for example, binding methylated DNA with a specific binding molecule specific for methyl groups and separating the bound from the unbound DNA. In some forms, the specific binding molecule can comprise, for example, an antibody specific for 5-methyl cytosine, methyl-binding protein MBD1, methyl-binding protein MECP2, or a combination. Numerous techniques and methods for binding and separating molecules are known and can be adapted for use with the disclosed methods to bind and separate methylated from unmethylated DNA.
In some forms, treating the DNA sample can be accomplished by, for example, capturing status biomarker DNA fragments and sequencing the captured status biomarker DNA fragments, wherein the sequencing distinguishes cytosine from methylcytosine, wherein the level of methylcytosine indicates level of methylated forms of the status biomarkers. Examples of such forms of the methods are described in Examples 6 and 7. In some forms, the status biomarker DNA fragments can be captured by, for example, binding DNA fragments in the DNA sample to status biomarker probes attached to a support. In some forms, one or more of the status biomarker probes can specifically hybridize to one or more repetitive DNA sequences, wherein the one or more of the status biomarker probes comprises degenerate sequence representing a consensus sequence for a family of repetitive DNA sequences. In some forms, each of the status biomarker probes can specifically hybridize to one or more repetitive DNA sequences, wherein each of the status biomarker probes comprises degenerate sequence representing a consensus sequence for a family of repetitive DNA sequences. In some forms, the family of repetitive DNA sequences can be a family of repetitive DNA sequences such as the repetitive DNA sequence families listed in, for example, Table 1, Table 12, or Table 13. In some forms, the one or more of the status biomarker probes can comprise different degenerate sequences each representing a consensus sequence for a different one of the families of repetitive DNA sequences such as the repetitive DNA sequence families listed in, for example, Table 1, Table 12, or Table 13. 
In some forms, the one or more of the status biomarker probes can comprise at least 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 55, 60, 65, 70, 75, 80, 85, 90, 95, 100, 105, 110, 115, 120, 125, 130, or 135 different degenerate sequences each representing a different consensus sequence for a different one of the families of repetitive DNA sequences such as the repetitive DNA sequence families listed in, for example, Table 1, Table 12, or Table 13. In some forms, the support can comprise, for example, gel, a bead, a magnetic bead, a plate, a slide, a surface, or a microparticle. In some forms, DNA not captured can be separated from the captured status biomarker DNA fragments. In some forms, the sequencing can be a form of SMRT sequencing.
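For illustration, a degenerate sequence representing a consensus for a family of repetitive DNA sequences can be derived from aligned family members using the standard IUPAC ambiguity codes. The Python sketch below assumes equal-length, gap-free aligned sequences and a hypothetical frequency cutoff (`min_fraction`) for including a base at a position. Neither the cutoff nor the function name comes from the disclosure.

```python
# IUPAC nucleotide ambiguity codes keyed by the set of bases they represent.
IUPAC = {
    frozenset("A"): "A", frozenset("C"): "C",
    frozenset("G"): "G", frozenset("T"): "T",
    frozenset("AG"): "R", frozenset("CT"): "Y",
    frozenset("CG"): "S", frozenset("AT"): "W",
    frozenset("GT"): "K", frozenset("AC"): "M",
    frozenset("CGT"): "B", frozenset("AGT"): "D",
    frozenset("ACT"): "H", frozenset("ACG"): "V",
    frozenset("ACGT"): "N",
}


def degenerate_consensus(aligned, min_fraction=0.2):
    """Build a degenerate consensus from equal-length aligned sequences.

    A base is kept at a column when its frequency there is at least
    `min_fraction` (an illustrative cutoff). Columns with no qualifying
    base, or with non-ACGT characters, fall back to "N".
    """
    length = len(aligned[0])
    if any(len(s) != length for s in aligned):
        raise ValueError("sequences must be aligned to the same length")
    consensus = []
    for column in zip(*aligned):
        column = [b.upper() for b in column]
        kept = {b for b in set(column)
                if column.count(b) / len(column) >= min_fraction}
        consensus.append(IUPAC.get(frozenset(kept), "N"))
    return "".join(consensus)


members = ["ACGT", "ACGT", "ATGT", "ACGA"]
print(degenerate_consensus(members))
print(degenerate_consensus(members, min_fraction=0.5))
```

A lower cutoff yields a more degenerate probe sequence that matches more family members, at the cost of specificity.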
In some forms, the method can further comprise, after capturing status biomarker DNA fragments and prior to sequencing the captured status biomarker DNA fragments, releasing the captured status biomarker DNA fragments and recapturing the released status biomarker DNA fragments. An example of such forms of the methods is described in Example 7. In some forms, the status biomarker DNA fragments can be recaptured by binding DNA fragments in the DNA sample to secondary status biomarker probes attached to a support. In some forms, one or more of the status biomarker probes can specifically hybridize to one or more repetitive DNA sequences, wherein the one or more of the status biomarker probes comprises degenerate sequence representing a consensus sequence for a family of repetitive DNA sequences. In some forms, each of the status biomarker probes can specifically hybridize to one or more repetitive DNA sequences, wherein each of the status biomarker probes comprises degenerate sequence representing a consensus sequence for a family of repetitive DNA sequences. In some forms, the family of repetitive DNA sequences can be a family of repetitive DNA sequences such as the repetitive DNA sequence families listed in, for example, Table 16 and Table 17. In some forms, the family of repetitive DNA sequences can be a family of repetitive DNA sequences listed in Table 17. For example, the family of repetitive DNA sequences can be the AluY, AluSx, AluSp, AluSg, or AluSc family of repetitive DNA sequences. In some forms, the one or more of the status biomarker probes can comprise different degenerate sequences each representing a consensus sequence for a different one of the families of repetitive DNA sequences such as the repetitive DNA sequence families listed in, for example, Table 16 and Table 17. 
In some forms, the one or more of the status biomarker probes can comprise different degenerate sequences each representing a consensus sequence for a different one of the families of repetitive DNA sequences listed in Table 16 or 17, such as AluY, AluSx, AluSp, AluSg, or AluSc. In some forms, the support can comprise, for example, gel, a bead, a magnetic bead, a plate, a slide, a surface, or a microparticle. In some forms, DNA not recaptured can be separated from the recaptured status biomarker DNA fragments.
ii. Detecting the Level of Status Biomarkers
In some forms, detecting the level of the status biomarkers can be accomplished via, for example, an array of probes specific for the status biomarkers. An example of such forms of the methods is described in Example 3. This detection is useful for DNA that has been treated to differentially amplify or retain DNA based on the methylation state. In some forms, the array of probes can be, for example, a microarray. Myriad techniques are known for detecting and assessing nucleic acid sequences. Such techniques can be used with the disclosed methods to detect and assess status biomarkers and the status of the biomarkers. Multiplex and high throughput techniques are particularly useful for this purpose. Thus, for example, the use of arrays and microarrays for detection is particularly useful.
In some forms, detecting the level of the status biomarkers can be accomplished via, for example, amplifying the processed DNA and determining the ratio of cytosine to thymidine in the amplified DNA and converting the ratio to the level of methylated forms of the status biomarkers. An example of such forms of the methods is described in Example 4. This detection is useful for DNA that has been treated with sodium bisulfite. In some forms, the processed DNA can be amplified via, for example, PCR amplification of the status biomarkers using primers specific for the status biomarkers.
In some forms, detecting the level of the status biomarkers can be accomplished via, for example, PCR amplification of the status biomarkers using primers specific for the status biomarkers. An example of such forms of the methods is described in Example 5. This detection is useful for DNA that has been separated based on methylation or lack of methylation. In some forms, the PCR amplification can be quantitative PCR. In some forms, the PCR amplification can be nanoliter-microarray quantitative PCR.
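As one hedged illustration of how quantitative PCR results on separated fractions might be converted to a level, the Python sketch below estimates the share of a status biomarker recovered in the bound (methylated) fraction from threshold cycle (Ct) values. It assumes that both fractions are assayed at equal dilution, that amplification efficiency is identical in both reactions, and that template amount is proportional to (1 + efficiency) raised to the power of minus Ct. The function and parameter names are hypothetical and this calculation is not taken from Example 5.

```python
def methylated_fraction(ct_bound, ct_unbound, efficiency=1.0):
    """Estimate the fraction of template found in the bound fraction.

    Template amount is modeled as proportional to (1 + efficiency) ** (-Ct),
    so unbound/bound = (1 + efficiency) ** (ct_bound - ct_unbound).
    """
    if not 0 < efficiency <= 1:
        raise ValueError("efficiency must be in (0, 1]")
    base = 1.0 + efficiency
    unbound_to_bound = base ** (ct_bound - ct_unbound)
    return 1.0 / (1.0 + unbound_to_bound)


# Bound fraction crosses threshold two cycles earlier than unbound fraction.
print(methylated_fraction(24.0, 26.0))
```

A lower Ct in the bound fraction than in the unbound fraction yields a value above 0.5, consistent with predominantly methylated template under the stated assumptions.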
iii. Analysis of Groups of Status Biomarkers
In some forms, the level of the status biomarkers can be grouped into a plurality of status biomarker families, wherein the level of the status biomarkers in one or more of the families is analyzed, wherein the analyzed level of the status biomarkers in the one or more of the families indicates the methylation state of the status biomarkers in the family. In some forms, the analyzed level of the status biomarkers in one or more of the families can be the average of the levels of the individual status biomarkers in the family. In some forms, one or more of the status biomarker families each independently can consist of, for example, a single class of repetitive DNA element, a single subclass of repetitive DNA element, a single family of repetitive DNA element, a single subfamily of repetitive DNA element, or a combination. In some forms, the analyzed level of the status biomarkers in one or more of the families can be normalized to one or more of the reference methylation states. In some forms, the level of one or more of the status biomarkers can be normalized to one or more of the reference methylation states. In some forms, the level of one or more of the status biomarker families can be normalized to one or more of the reference methylation states. In some forms, the status biomarkers can be grouped according to one or more repetitive DNA sequences that the status biomarkers comprise, wherein each biomarker in each status biomarker family comprises one or more repetitive DNA sequences that belong to a single family of repetitive DNA sequences such as the repetitive DNA sequence families listed in, for example, Table 1, Table 12, or Table 13.
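The grouping, averaging, and normalization described above can be sketched as follows. The Python code is illustrative only: the locus identifiers, the mapping from locus to family, and the choice of a simple ratio for normalization are assumptions made for the example.

```python
from collections import defaultdict
from statistics import mean


def family_levels(levels, family_of):
    """Average per-biomarker levels within each status biomarker family.

    levels:    {biomarker_id: level}
    family_of: {biomarker_id: family_name}
    """
    grouped = defaultdict(list)
    for biomarker, level in levels.items():
        grouped[family_of[biomarker]].append(level)
    return {family: mean(values) for family, values in grouped.items()}


def normalize(sample, reference):
    """Express each family level as a ratio to the reference level.

    Families missing from the reference, or with a zero reference level,
    are skipped rather than guessed.
    """
    return {family: sample[family] / reference[family]
            for family in sample
            if family in reference and reference[family] != 0}


# Hypothetical loci and levels, for illustration only.
levels = {"locus1": 0.75, "locus2": 0.25, "locus3": 0.5}
family_of = {"locus1": "AluY", "locus2": "AluY", "locus3": "L1"}
sample = family_levels(levels, family_of)
print(sample)
print(normalize(sample, {"AluY": 1.0, "L1": 0.25}))
```

A normalized value below 1 for a family indicates a lower level than the reference for that family. Other normalizations, such as a difference or a log ratio, could be substituted.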
iv. Reference Methylation State
In some forms, one or more of the one or more reference methylation states can be a normal methylation state. In some forms, the normal methylation state can be, for example, the methylation state of a healthy subject, the average of the methylation states of healthy subjects, or the average of the methylation states of a population of subjects. In some forms, one or more of the one or more reference methylation states can be, for example, the methylation state of the same subject at a different time, the methylation state of the same subject at an earlier time, the methylation state of the same subject at a later time, or the methylation state of one or more normal cells, tissues, organs, or a combination of the same subject. In some forms, one or more of the one or more reference methylation states can be the methylation state from non-tumor adjacent tissue. In some forms, one or more of the one or more reference methylation states can be a normal methylation state of a status biomarker family.
v. Determining Genetic State of Status Biomarkers
In some forms, the method can further comprise determining the genetic state of one or more status biomarkers and, for example, comparing one or more of the determined genetic states to one or more reference genetic states, wherein a difference, lack of a difference, or both in one or more of the determined genetic states and one or more of the reference genetic states indicates one or more statuses of the subject. As used herein, “genetic state” refers to a particular sequence or mutation in the biomarker. Thus, for example, a particular SNP in a biomarker is a genetic state of the biomarker. In some forms, the genetic state of one or more status biomarkers can be determined in one or more of the DNA samples. The genetic state of biomarkers can be determined using any technique or method that can determine the sequence of a biomarker. Myriad techniques and methods for sequencing and determining the sequence of nucleic acids are known. Such techniques and methods can be used with the disclosed methods.
In some forms, the source of one or more of the DNA samples can be one or more tissues of the subject, organs of the subject, or both. In some forms, the source of one or more of the DNA samples can be a tissue or organ of the subject. In some forms, the source of one or more of the DNA samples can be one or more cells of the subject. In some forms, the source of one or more of the DNA samples can be one or more cells, tissue, skin, lung, head, neck, prostate, breast, ovary, brain, liver, stomach, intestine, kidney, testicle, cervix, uterus, spleen, bone, throat, esophagus, muscle, bodily fluids, blood, urine, semen, lymphatic fluid, cerebrospinal fluid, amniotic fluid, biological samples, tissue culture cells, buccal swabs, mouthwash, stool, tissue slices, biopsy aspiration, or a combination.
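As a minimal sketch of comparing a determined genetic state to a reference genetic state, the following Python function lists single-base differences between two sequences of equal length. It assumes the sequences are already aligned without insertions or deletions, which is a simplification, and the function name is hypothetical.

```python
def sequence_differences(determined, reference):
    """List (position, reference_base, determined_base) for each mismatch.

    Positions are 0-based. Sequences must already be aligned and of equal
    length; insertions and deletions are outside this sketch.
    """
    if len(determined) != len(reference):
        raise ValueError("sequences must have equal length")
    pairs = zip(determined.upper(), reference.upper())
    return [(i, r, d) for i, (d, r) in enumerate(pairs) if d != r]


print(sequence_differences("ACGTAC", "ACCTAC"))
```

An empty result corresponds to a lack of a difference from the reference genetic state at the compared positions.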
vi. Status of Diseases and Conditions Assessed in Subject
In some forms, the subject can be assessed for the status of wellness, level of health, risk to wellness, risk to level of health, or a combination. In some forms, the subject can be assessed for the status of the genome. The status of the genome can be, for example, the level of methylation of status biomarkers in the genome relative to a reference or normal state. A useful reference state for this purpose can be the average methylation state for young subjects and/or healthy subjects. In some forms, the subject can be assessed for the status of aging, risk of aging, or both. In some forms, the subject can be assessed for the status of cancer, risk of cancer, or both. In some forms, the subject can be assessed for the status of stress response. In some forms, the subject can be assessed for the status of diabetes, risk of diabetes, or both. In some forms, the subject can be assessed for the status of heart disease, risk of heart disease, or both. In some forms, the subject can be assessed for the status of genomic instability. In some forms, the subject can be assessed for the status of tumor burden. In some forms, the subject can be assessed for the status of response to treatment. In all of these, changes in methylation state of relevant status biomarkers can indicate the presence or absence of the disease or condition and/or positive or negative changes and/or risks.
vii. Timing and Comparison of Status Assessments
In some forms, the subject can be assessed for a change in one or more statuses. In some forms, the change in one or more of the one or more statuses can be assessed compared to an earlier assessment. In some forms, the earlier assessment can have been made at, for example, an earlier time, prior to diagnosis of a disease or condition, prior to a treatment, following diagnosis of a disease or condition, following treatment, or a combination. In some forms, the change in one or more of the one or more statuses can be assessed following the passage of time, prior to diagnosis of a disease or condition, prior to a treatment, following diagnosis of a disease or condition, following treatment, or a combination. In some forms, assessing the subject can comprise assessing one or more tissues of the subject, organs of the subject, or both. As used herein, assessing a tissue or organ of a subject being assessed for a particular status means that the tissue or organ is assessed for that status and that such assessment of the tissue or organ constitutes the assessment of the subject. In some forms, assessing the subject can comprise assessing a tissue or organ of the subject. In some forms, assessing the subject can comprise assessing one or more cells of the subject.
B. Method of Identifying Status Biomarkers Associated with Diseases and Conditions
Status biomarkers useful for particular states, diseases, and conditions can be identified using the disclosed methods. For example, status biomarkers associated with a status of a subject can be identified by, for example, determining the methylation state of one or more status biomarkers in one or more DNA samples, wherein the DNA samples are from sources that are relevant to one or more specific statuses, and comparing one or more of the determined methylation states to one or more reference methylation states, wherein a difference in one or more of the determined methylation states and one or more of the reference methylation states indicates that the status biomarkers for which the difference in the methylation states is found is a status biomarker associated with one or more of the specific statuses. Particularly useful status biomarkers can be identified by determining the statistical significance of the change in methylation state in the affected sample versus a relevant reference methylation state.
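One way the statistical significance of a change in methylation state might be assessed is with an exact permutation test on the difference in group means, sketched below in Python using only the standard library. The choice of test, the function name, and the example values are assumptions made for illustration. The disclosure does not prescribe a particular statistical test.

```python
from itertools import combinations
from statistics import mean


def permutation_p_value(affected, reference):
    """Two-sided exact permutation test on the difference of group means.

    Enumerates every way of relabeling the pooled values, so it is only
    practical for small sample sizes.
    """
    pooled = list(affected) + list(reference)
    n = len(affected)
    observed = abs(mean(affected) - mean(reference))
    extreme = 0
    total = 0
    for idx in combinations(range(len(pooled)), n):
        chosen = set(idx)
        group_a = [pooled[i] for i in idx]
        group_b = [pooled[i] for i in range(len(pooled)) if i not in chosen]
        total += 1
        # Small tolerance guards against floating-point ties.
        if abs(mean(group_a) - mean(group_b)) >= observed - 1e-12:
            extreme += 1
    return extreme / total


# Hypothetical methylation levels for one status biomarker.
affected = [0.20, 0.30, 0.25, 0.22]
reference = [0.80, 0.85, 0.90, 0.82]
print(permutation_p_value(affected, reference))
```

When many status biomarkers are screened together, a correction for multiple comparisons would ordinarily be applied to the resulting p-values.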
i. Determining Methylation State
In some forms, the methylation state can be determined by, for example, treating a DNA sample of the subject to differentiate methylated and unmethylated nucleotides, and detecting the level of methylated forms of the one or more status biomarkers in the treated DNA, detecting the level of unmethylated forms of the one or more status biomarkers in the treated DNA, or both, wherein the level of methylated forms of the status biomarkers, the level of unmethylated forms of the status biomarkers, or both indicates the methylation state of the status biomarkers.
In some forms, treating the DNA sample can be accomplished by, for example, incubating the DNA sample with one or more restriction endonucleases and amplifying the incubated DNA, wherein the restriction endonucleases are methylation-sensitive restriction endonucleases, wherein the level of the status biomarkers in the amplified DNA is lower when the status biomarkers have reduced methylation and the level of the status biomarkers in the amplified DNA is higher when the status biomarkers have increased methylation, wherein the level of the status biomarkers comprises the level of methylated forms of the one or more status biomarkers in the treated DNA, the level of unmethylated forms of the one or more status biomarkers in the treated DNA, or both. An example of such forms of the methods is described in Example 3.
In some forms, the restriction endonucleases can further comprise at least one methylation-dependent restriction endonuclease. In some forms, the restriction endonucleases can further comprise at least one methylation-independent restriction endonuclease. In some forms, the restriction endonucleases can comprise AciI and HhaI. In some forms, the restriction endonucleases can comprise McrBC. In some forms, incubating the DNA sample with one or more endonucleases can be accomplished by, for example, incubating different aliquots of the DNA sample with different restriction endonucleases. In some forms, amplifying the incubated DNA can be accomplished by, for example, multiple displacement amplification.
In some forms, treating the DNA sample can be accomplished by, for example, processing the DNA sample with sodium bisulfite. An example of such forms of the methods is described in Example 4.
In some forms, treating the DNA sample can be accomplished by, for example, fragmenting the DNA and separating methylated DNA from unmethylated DNA. An example of such forms of the methods is described in Example 5. In some forms, the DNA can be fragmented by, for example, nebulization, cleavage with a restriction endonuclease, sonication, or a combination. In some forms, methylated DNA can be separated from unmethylated DNA by, for example, binding methylated DNA with a specific binding molecule specific for methyl groups and separating the bound from the unbound DNA. In some forms, the specific binding molecule can comprise, for example, an antibody specific for 5-methyl cytosine, methyl-binding protein MBD1, methyl-binding protein MECP2, or a combination.
In some forms, treating the DNA sample can be accomplished by, for example, capturing status biomarker DNA fragments and sequencing the captured status biomarker DNA fragments, wherein the sequencing distinguishes cytosine from methylcytosine, wherein the level of methylcytosine indicates level of methylated forms of the status biomarkers. Examples of such forms of the methods are described in Examples 6 and 7. In some forms, the status biomarker DNA fragments can be captured by, for example, binding DNA fragments in the DNA sample to status biomarker probes attached to a support. In some forms, one or more of the status biomarker probes can specifically hybridize to one or more repetitive DNA sequences, wherein the one or more of the status biomarker probes comprises degenerate sequence representing a consensus sequence for a family of repetitive DNA sequences. In some forms, each of the status biomarker probes can specifically hybridize to one or more repetitive DNA sequences, wherein each of the status biomarker probes comprises degenerate sequence representing a consensus sequence for a family of repetitive DNA sequences. In some forms, the family of repetitive DNA sequences can be a family of repetitive DNA sequences such as the repetitive DNA sequence families listed in, for example, Table 1, Table 12, or Table 13. In some forms, the one or more of the status biomarker probes can comprise different degenerate sequences each representing a consensus sequence for a different one of the families of repetitive DNA sequences such as the repetitive DNA sequence families listed in, for example, Table 1, Table 12, or Table 13.
In some forms, the one or more of the status biomarker probes can comprise at least 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 55, 60, 65, 70, 75, 80, 85, 90, 95, 100, 105, 110, 115, 120, 125, 130, or 135 different degenerate sequences each representing a different consensus sequence for a different one of the families of repetitive DNA sequences such as the repetitive DNA sequence families listed in, for example, Table 1, Table 12, or Table 13. In some forms, the support can comprise, for example, gel, a bead, a magnetic bead, a plate, a slide, a surface, or a microparticle. In some forms, DNA not captured can be separated from the captured status biomarker DNA fragments. In some forms, the sequencing can be a form of SMRT sequencing.
In some forms, the method can further comprise, after capturing status biomarker DNA fragments and prior to sequencing the captured status biomarker DNA fragments, releasing the captured status biomarker DNA fragments and recapturing the released status biomarker DNA fragments. An example of such forms of the methods is described in Example 7. In some forms, the status biomarker DNA fragments can be recaptured by binding DNA fragments in the DNA sample to secondary status biomarker probes attached to a support. In some forms, one or more of the status biomarker probes can specifically hybridize to one or more repetitive DNA sequences, wherein the one or more of the status biomarker probes comprises degenerate sequence representing a consensus sequence for a family of repetitive DNA sequences. In some forms, each of the status biomarker probes can specifically hybridize to one or more repetitive DNA sequences, wherein each of the status biomarker probes comprises degenerate sequence representing a consensus sequence for a family of repetitive DNA sequences. In some forms, the family of repetitive DNA sequences can be a family of repetitive DNA sequences such as the repetitive DNA sequence families listed in, for example, Table 16 and Table 17. In some forms, the family of repetitive DNA sequences can be a family of repetitive DNA sequences listed in Table 17. For example, the repetitive DNA sequence family can be the AluY, AluSx, AluSp, AluSg, or AluSc family of repetitive DNA sequences. In some forms, the one or more of the status biomarker probes can comprise different degenerate sequences each representing a consensus sequence for a different one of the families of repetitive DNA sequences such as the repetitive DNA sequence families listed in, for example, Table 16 and Table 17. 
In some forms, the one or more of the status biomarker probes can comprise different degenerate sequences each representing a consensus sequence for a different one of the families of repetitive DNA sequences listed in Table 16 or 17, such as AluY, AluSx, AluSp, AluSg, or AluSc. In some forms, the support can comprise, for example, gel, a bead, a magnetic bead, a plate, a slide, a surface, or a microparticle. In some forms, DNA not recaptured can be separated from the recaptured status biomarker DNA fragments.
ii. Detecting the Level of Status Biomarkers
In some forms, detecting the level of the status biomarkers can be accomplished via, for example, an array of probes specific for the status biomarkers. An example of such forms of the methods is described in Example 3. In some forms, the array of probes can be, for example, a microarray.
In some forms, detecting the level of the status biomarkers can be accomplished via, for example, amplifying the processed DNA and determining the ratio of cytosine to thymidine in the amplified DNA and converting the ratio to the level of methylated forms of the status biomarkers. An example of such forms of the methods is described in Example 4. In some forms, the processed DNA can be amplified via, for example, PCR amplification of the status biomarkers using primers specific for the status biomarkers.
In some forms, detecting the level of the status biomarkers can be accomplished via, for example, PCR amplification of the status biomarkers using primers specific for the status biomarkers. In some forms, the PCR amplification can be quantitative PCR. An example of such forms of the methods is described in Example 5. In some forms, the PCR amplification can be nanoliter-microarray quantitative PCR.
iii. Analysis of Groups of Status Biomarkers
In some forms, the level of the status biomarkers can be grouped into a plurality of status biomarker families, wherein the level of the status biomarkers in one or more of the families is analyzed, wherein the analyzed level of the status biomarkers in the one or more of the families indicates the methylation state of the status biomarkers in the family. In some forms, the analyzed level of the status biomarkers in one or more of the families can be the average of the levels of the individual status biomarkers in the family. In some forms, one or more of the status biomarker families each independently can consist of, for example, a single class of repetitive DNA element, a single subclass of repetitive DNA element, a single family of repetitive DNA element, a single subfamily of repetitive DNA element, or a combination. In some forms, the analyzed level of the status biomarkers in one or more of the families can be normalized to one or more of the reference methylation states. In some forms, the level of one or more of the status biomarkers can be normalized to one or more of the reference methylation states. In some forms, the level of one or more of the status biomarker families can be normalized to one or more of the reference methylation states. In some forms, the status biomarkers can be grouped according to one or more repetitive DNA sequences that the status biomarkers comprise, wherein each biomarker in each status biomarker family comprises one or more repetitive DNA sequences that belong to a single family of repetitive DNA sequences such as the repetitive DNA sequence families listed in, for example, Table 1, Table 12, or Table 13.
iv. Reference Methylation State
In some forms, one or more of the one or more reference methylation states can be a normal methylation state. In some forms, the normal methylation state can be, for example, the methylation state of a healthy subject, the average of the methylation states of healthy subjects, or the average of the methylation states of a population of subjects. In some forms, one or more of the one or more reference methylation states can be, for example, the methylation state of the same subject at a different time, the methylation state of the same subject at an earlier time, the methylation state of the same subject at a later time, or the methylation state of one or more normal cells, tissues, organs, or a combination of the same subject. In some forms, one or more of the one or more reference methylation states can be the methylation state from non-tumor adjacent tissue. In some forms, one or more of the one or more reference methylation states can be a normal methylation state of a status biomarker family.
v. Step of Determining Genetic State of Status Biomarkers
In some forms, the method can further comprise determining the genetic state of one or more status biomarkers by, for example, comparing one or more of the determined genetic states to one or more reference genetic states, wherein a difference, lack of a difference, or both in one or more of the determined genetic states and one or more of the reference genetic states indicates one or more statuses of the subject. In some forms, the genetic state of one or more status biomarkers can be determined in one or more of the DNA samples.
In some forms, the source of one or more of the DNA samples can be one or more tissues of the subject, organs of the subject, or both. In some forms, the source of one or more of the DNA samples can be a tissue or organ of the subject. In some forms, the source of one or more of the DNA samples can be one or more cells of the subject. In some forms, the source of one or more of the DNA samples can be one or more cells, tissue, skin, lung, head, neck, prostate, breast, ovary, brain, liver, stomach, intestine, kidney, testicle, cervix, uterus, spleen, bone, throat, esophagus, muscle, bodily fluids, blood, urine, semen, lymphatic fluid, cerebrospinal fluid, amniotic fluid, biological samples, tissue culture cells, buccal swabs, mouthwash, stool, tissue slices, biopsy aspiration, or a combination.
The disclosed methods can be used to design and/or produce probes for status biomarkers, including status biomarker capture probes. For example, status biomarker probes can be designed by, for example, selecting a subset of repetitive DNA sequence loci from a set of repetitive DNA sequence loci, and generating a set of status biomarker probe sequences. Status biomarker probes can then be produced by synthesizing one or more status biomarker probes from the status biomarker probe sequences. In some forms, the repetitive DNA sequence loci in the set of repetitive DNA sequence loci can belong to a single one of the families of repetitive DNA sequence such as the repetitive DNA sequence families listed in, for example, Table 1, Table 12, or Table 13, wherein the subset of repetitive DNA sequence loci can be selected by identifying those repetitive DNA sequence loci that comprise a repetitive DNA sequence belonging to one of the families of repetitive DNA sequences such as the repetitive DNA sequence families listed in, for example, Table 16 and Table 17.
In some forms, each status biomarker probe sequence in the set can have a length of, for example, 15, 20, 25, 30, 35, 40, 45, 50, 55, 60, 65, 70, 75, 80, 85, 90, 95, or 100 bases or more, wherein each status biomarker probe represented in the set of status biomarker probe sequences can hybridize to, for example, at least 1%, 2%, 3%, 4%, 5%, 6%, 7%, 8%, 9%, 10%, 11%, 12%, 13%, 14%, or 15% of the repetitive DNA sequence loci in the selected subset of repetitive DNA sequence loci. In some forms, each status biomarker probe can have the sequence of one of the status biomarker probe sequences.
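The requirement that a probe hybridize to at least a given percentage of the selected loci can be approximated computationally. The following is a minimal sketch, assuming that an ungapped match with a bounded number of mismatches is an acceptable crude proxy for hybridization; real probe design would rely on thermodynamic duplex stability criteria rather than mismatch counts. The sequences are short hypothetical examples, shorter than actual probes, and only one strand is considered.

```python
def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def best_mismatches(probe, locus):
    """Fewest mismatches over all ungapped placements of the probe on the locus.

    Returns None when the locus is shorter than the probe.
    """
    k = len(probe)
    if len(locus) < k:
        return None
    return min(hamming(probe, locus[i:i + k]) for i in range(len(locus) - k + 1))


def coverage(probe, loci, max_mismatches):
    """Fraction of loci the probe matches with at most max_mismatches mismatches."""
    hits = 0
    for locus in loci:
        m = best_mismatches(probe, locus)
        if m is not None and m <= max_mismatches:
            hits += 1
    return hits / len(loci)


# Hypothetical probe and candidate loci (illustrative sequences only).
probe = "ACGTACGTAC"
loci = [
    "TTACGTACGTACGG",  # contains the probe exactly
    "GGACGTTCGTACAA",  # contains the probe with one mismatch
    "TTTTTTTTTTTTTT",  # unrelated sequence
    "ACGT",            # shorter than the probe
]

print(coverage(probe, loci, max_mismatches=1))
print(coverage(probe, loci, max_mismatches=0))
```

A probe would satisfy, for example, a 5% criterion when the returned fraction is at least 0.05 under the chosen mismatch tolerance.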
In some forms, the repetitive DNA sequence loci in the set of repetitive DNA sequence loci can belong to a single one of the families of repetitive DNA sequence LTR54B, MER11B, MER34B, LTR56, THE1B, HERV9, LTR14C, HERVFH21, LTR6B, LTR46, MLT1D, MER67D, HERVK11, LTR10B, HERVK22, MER6, MER66C, MLT1G1, MER4D, and MLTD2. In some forms, the repetitive DNA sequence in the subset of repetitive DNA sequence loci can belong to one of the families of repetitive DNA sequences listed in Table 16 or 17, such as AluY, AluSx, AluSp, AluSg, AluSc, LTR9, or LTR9B.
In some forms, the method can further comprise selecting one or more additional subsets of repetitive DNA sequence loci each from a different additional set of repetitive DNA sequence loci, generating one or more additional sets of status biomarker probe sequences each based on one of the one or more additional subsets, and synthesizing one or more additional status biomarker probes, wherein each additional status biomarker probe has the sequence of one of the additional status biomarker probe sequences. In some forms, the repetitive DNA sequence loci in each additional set of repetitive DNA sequence loci can independently belong to a different single one of the families of repetitive DNA sequence such as the repetitive DNA sequence families listed in, for example, Table 1, Table 12, or Table 13, wherein the repetitive DNA sequence loci in the set of repetitive DNA sequence loci and in each additional set of repetitive DNA sequence loci belong to different families of repetitive DNA sequence.
In some forms, the repetitive DNA sequence loci in each additional set of repetitive DNA sequence loci can independently belong to a single one of the families of repetitive DNA sequence LTR54B, MER11B, MER34B, LTR56, THE1B, HERV9, LTR14C, HERVFH21, LTR6B, LTR46, MLT1D, MER67D, HERVK11, LTR10B, HERVK22, MER6, MER66C, MLT1G1, MER4D, and MLTD2. In some forms, each status biomarker probe sequence in the set can have a length of, for example, 15, 20, 25, 30, 35, 40, 45, 50, 55, 60, 65, 70, 75, 80, 85, 90, 95, or 100 bases or more. In some forms, each status biomarker probe represented in the set of status biomarker probe sequences can hybridize to, for example, at least 1%, 2%, 3%, 4%, 5%, 6%, 7%, 8%, 9%, 10%, 11%, 12%, 13%, 14%, or 15% of the repetitive DNA sequence loci in the selected subset of repetitive DNA sequence loci. In some forms, the set of status biomarker probe sequences can comprise from 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 32, 34, 35, 36, 38, 40, 42, 44, 45, 46, 48, 50, 55, 60, 65, 70, 75, 80, 85, 90, 95, 100, 110, 120, 130, 140, 150, 160, 170, 180, 190, or 200 to 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 32, 34, 35, 36, 38, 40, 42, 44, 45, 46, 48, 50, 55, 60, 65, 70, 75, 80, 85, 90, 95, 100, 110, 120, 130, 140, 150, 160, 170, 180, 190, or 200 status biomarker probe sequences. In some forms, the set of status biomarker probe sequences can comprise from 5 to 100 status biomarker probe sequences. In some forms, the set of status biomarker probe sequences can comprise from 10 to 100 status biomarker probe sequences. In some forms, one or more of the additional sets of status biomarker probe sequences each can comprise from 1 to 100 status biomarker probe sequences. In some forms, the one or more additional sets of status biomarker probe sequences each can comprise from 5 to 100 status biomarker probe sequences. In some forms, the one or more additional sets of status biomarker probe sequences each can comprise from 10 to 100 status biomarker probe sequences.
Status biomarker probes can be designed and produced for any desired status biomarker or family of status biomarkers. For example, capture probes for preferred status biomarkers can be designed by:
a. obtaining the RepeatMasker annotation for all members of a preferred status biomarker family, said annotation comprising the genomic coordinates of each member of the chosen repetitive element family, as well as the annotated DNA sequence;
b. re-organizing the DNA sequences in the list so that all are in the 5′ to 3′ orientation;
c. examining each candidate status biomarker locus by defining a window of 1000 bases, centered in the middle of the repetitive element sequence, and then performing a query of the RepeatMasker annotation to find any other repeats present in the window, whereby those co-localized or neighbor repetitive elements belong to a list of preferred neighbor families (such as those listed in Table 16 and Table 17);
d. choosing the subset of the coordinates corresponding to repetitive elements satisfying the criteria that they contain a neighbor present in the list of neighbors; this is the preferred candidate coordinate and sequence list;
e. generating a list, using standard computational tools, of between 1 and 100 oligonucleotides each with a length of 100 bases or more, each probe capable of forming duplex structures with more than 5% of the sequences present in the preferred candidate sequence list; the duplex structures can contain several mismatches, as long as they are deemed capable of forming a duplex stable enough for performing sequence capture (design criteria for such capture probes are published and well known in the art).
The designed capture probes can be produced and used by, for example, performing synthesis (as DNA or RNA) of the designed oligonucleotides (between 1 and 100 different sequences), and utilizing these oligonucleotides, as a mixture in solution or as a collection of probes bound on a microarray surface, for capturing fragmented genomic DNA from a biological sample, using methods well known in the published art.
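Steps c and d of the design procedure above (the 1000-base window query and the selection of loci having a preferred neighbor) can be sketched as follows. This is a minimal illustration, not a parser for actual RepeatMasker output: the `Repeat` record, its coordinate convention, and the example annotation are assumptions introduced here, and a window is treated as containing a neighbor whenever the neighbor overlaps it.

```python
from typing import NamedTuple


class Repeat(NamedTuple):
    """Hypothetical, simplified annotation record for one repetitive element."""
    chrom: str
    start: int   # 0-based, inclusive
    end: int     # exclusive
    family: str


def select_candidates(repeats, target_family, neighbor_families, window=1000):
    """Keep target-family loci with a preferred-family repeat in the window.

    The window is centered on the midpoint of each candidate element.
    The pairwise scan is quadratic, which is acceptable only for a sketch.
    """
    selected = []
    for rep in repeats:
        if rep.family != target_family:
            continue
        mid = (rep.start + rep.end) // 2
        lo, hi = mid - window // 2, mid + window // 2
        for other in repeats:
            if other is rep or other.chrom != rep.chrom:
                continue
            overlaps_window = other.start < hi and other.end > lo
            if overlaps_window and other.family in neighbor_families:
                selected.append(rep)
                break
    return selected


# Hypothetical annotation (coordinates invented for illustration).
repeats = [
    Repeat("chr1", 10_000, 10_400, "THE1B"),
    Repeat("chr1", 10_500, 10_800, "AluY"),    # preferred neighbor inside window
    Repeat("chr1", 50_000, 50_400, "THE1B"),
    Repeat("chr1", 52_000, 52_300, "AluSx"),   # preferred family, outside window
    Repeat("chr2", 10_000, 10_400, "THE1B"),
    Repeat("chr2", 10_450, 10_700, "L2"),      # inside window, not preferred
]

selected = select_candidates(repeats, "THE1B", {"AluY", "AluSx"})
print(selected)
```

Only the first THE1B element is retained in this example, because it is the only candidate with a neighbor from the preferred list inside its window.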
1. Random Forest
Random forest (Breiman 2001) is a classifier that consists of many decision trees. An individual decision tree is constructed as follows. Suppose there are n observations and p variables (or features) in the data set. (1) Randomly draw a bootstrap sample of size n with replacement from the data set. This sample is called the training set and is used to construct a decision tree. (2) A pre-specified fixed number of variables, say m, is drawn at random from the p variables. The parameter m is chosen such that it is much smaller than p. (3) A tree is constructed from the top down. At each node, the variable that yields the best split is chosen to split the node. (4) Repeat step 3 to grow the tree until no split can further improve the classification. No pruning is conducted.
To classify a new case, run it through all trees in the forest. Each tree gives a classification, called a “vote,” and the final classification given by the forest is the majority vote of all trees. To obtain an estimate of error rates, the set of observations that are not sampled for each tree, which is called the out-of-bag (OOB) set and is about one third of the original data, is used for cross validation. More specifically, each OOB case is run down the decision tree to obtain a classification, or vote. At the end of all runs, each case receives a final classification by simple majority of its OOB votes. Comparing these classifications with the known classes gives an estimate of the error rates.
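The bootstrap sampling, out-of-bag set, and majority voting described above can be illustrated with a short sketch. It does not build decision trees and is not a full random forest; it only shows the sampling and voting mechanics. The function names are hypothetical, and the fixed random seed is chosen only so that the example is repeatable. Because each observation is missed by a bootstrap sample of size n with probability (1 − 1/n)^n, which approaches 1/e ≈ 0.368 for large n, the simulated OOB fraction is expected to be close to the "about one third" stated above.

```python
import random
from collections import Counter


def bootstrap_indices(n, rng):
    """Draw n observation indices with replacement (the per-tree training set)."""
    return [rng.randrange(n) for _ in range(n)]


def oob_indices(n, sampled):
    """Indices never drawn into the bootstrap sample (the out-of-bag set)."""
    drawn = set(sampled)
    return [i for i in range(n) if i not in drawn]


def majority_vote(votes):
    """Most frequent class label; ties resolve to the label seen first."""
    return Counter(votes).most_common(1)[0][0]


rng = random.Random(0)   # fixed seed for repeatability
n = 1000                 # number of observations
fractions = []
for _ in range(200):     # one bootstrap sample per hypothetical tree
    sample = bootstrap_indices(n, rng)
    fractions.append(len(oob_indices(n, sample)) / n)

mean_oob = sum(fractions) / len(fractions)
print(round(mean_oob, 3))

# Hypothetical votes from three trees for a single case.
print(majority_vote(["tumor", "normal", "tumor"]))
```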
The advantages of random forest include: excellent classification accuracy; fast computation speed; efficient handling of large data sets; providing proximities between pairs of cases; generating importance measures for all variables; no need of extra test sets.
2. Support Vector Machine
In Support Vector Machine (SVM) (Vapnik 1998), a set of features that describes an observation is called a vector. SVM classifies observations by constructing hyperplanes that optimally separate the data into different classes, i.e., vectors of different classes are on different sides of the hyperplanes. The vectors close to the hyperplanes are called support vectors. The goal of SVM is to find optimal hyperplanes by maximizing the distances between the support vectors and the hyperplanes. SVM is computationally efficient and can handle large data sets.
Support Vector Machine-Recursive Feature Elimination (SVM-RFE) (Guyon et al., 2002) selects features in a sequential backward elimination manner, which starts with all the features and discards one feature at a time.
3. Others
Several statistical analyses can be performed. A list of other analyses includes, but is not limited to, Linear discriminant analysis (McLachlan 2004), Logistic regression (Agresti 2002), Classification and Regression Trees (CART) (Breiman 1984), Neural Networks (Marques de Sa 2001), Bayesian Additive Regression Trees (Chipman 2006).
Any analyte, including the various compounds and compositions disclosed herein, can be detected. For example, status biomarkers, repetitive DNA sequence, repetitive DNA sequence loci, families of status biomarkers, families of repetitive DNA sequences, etc. can be detected. Detection of status biomarkers can be by, for example, detecting the level, amount, presence, or a combination, of the analyte in a sample or assay. As described below and elsewhere herein, the manner of detection of status biomarkers can be based on the treatment of DNA samples and generally can be in service of detecting and determining the methylation state and presence of methylation in status biomarkers. Detection of the disclosed compounds and compositions can be accomplished in any of a variety of ways and using any of a variety of techniques. Many such detection techniques are known and can be readily adapted for use in the disclosed methods. In most cases, the disclosed methods do not depend on particular techniques of detection. However, certain techniques and reagents are useful for detecting different types of compounds and compositions. Those of skill in the art are aware of the selection of particular techniques for the detection of particular compounds and compositions. Detection can, but need not, involve an element of quantitation.
Detection can be of a class of compounds or compositions or of specific compounds or compositions. Although the disclosed methods generally involve detection of specific compounds and compositions, such as specific DNA molecules, the disclosed methods can also be used to detect classes or groups of compounds or compositions, generally via one or more common properties. In other forms, multiple different specific compounds and/or compositions can be detected. Such detection, accomplished in the same assay or run (or in separate assays or runs performed at the same time), can generally be referred to as multiplex detection.
Detection can involve or include, for example, measuring, sequencing, identification, or a combination. Measurement is useful for determining abundances and levels of an analyte in a sample. Sequencing is useful for identifying nucleic acid sequence and molecules. Uses and forms of detection in the context of the disclosed methods are also described elsewhere herein.
Detection can involve a variety of forms. For example, detecting one or more of the status biomarkers can be accomplished using a probe corresponding to a unique sequence in the status biomarker.
1. Measuring
Any analyte, including the various compounds and compositions disclosed herein, can be detected by measuring, for example, the level, amount, presence, or a combination, of the analyte in a sample or assay. For example, the methylation state and/or level of status biomarkers, repetitive DNA sequence, repetitive DNA sequence loci, families of status biomarkers, families of repetitive DNA sequences, etc. can be measured. Measurement of the level, amount, presence, or a combination, of the analyte can also be accomplished when detection is not an explicit object. Similar to detection, measurement of the disclosed compounds and compositions can be accomplished in any of a variety of ways and using any of a variety of techniques. Many such measurement techniques are known and can be readily adapted for use in the disclosed methods. In most cases, the disclosed methods do not depend on particular techniques of measurement. Measurement can involve an element of quantitation. Many techniques are known for measuring abundances and levels of an analyte in a sample. Such techniques can be adapted for use with the disclosed methods.
2. Sequencing
Nucleic acid sequences and molecules can be detected, measured, identified, and so on, via sequencing. In the context of nucleic acid sequences and molecules, sequencing refers to the determination or identification of some or all of the nucleotide base sequence of a nucleic acid sequence or molecule. Numerous techniques for nucleic acid sequencing are known and can be used with the disclosed methods. Examples of useful types of sequencing techniques include techniques involving detection of individual nucleotide bases (such as by detection of terminated primer extension products) and detection of multiple nucleotide bases (such as by hybridization of probes of known sequence). Any suitable sequencing technique can be used with the disclosed methods. Sequencing is particularly useful for identifying nucleic acid sequences and molecules.
Particularly useful sequencing techniques are those that can generate large amounts of sequence data quickly and accurately. High-throughput and ultra-high throughput sequencing provides a number of advantages, the main two being faster results and the ability to detect and measure a large number of nucleic acid molecules. Examples of useful high-throughput sequencing techniques include Solexa™ sequencing, SOLiD™ sequencing, and sequencing using an Illumina Genome Analyzer™ or a 454™.
Illumina Sequencing technology is based on massively parallel sequencing of millions of fragments using reversible terminator-based sequencing chemistry. Illumina Sequencing technology relies on the attachment of randomly fragmented genomic DNA to a planar, optically transparent surface. Attached DNA fragments are extended and bridge amplified to create an ultra-high density sequencing flow cell with hundreds of millions of clusters, each containing ~1,000 copies of the same template. These templates are sequenced using a four-color DNA sequencing-by-synthesis technology that employs reversible terminators with removable fluorescent dyes. This allows high accuracy and true base-by-base sequencing, eliminating sequence-context specific errors and enabling sequencing through homopolymers and repetitive sequences. High-sensitivity fluorescence detection is achieved using laser excitation and total internal reflection optics. Sequence reads are aligned against a reference genome and genetic differences are called using specially developed data analysis pipeline software.
The SOLiD System involves depositing beads containing template DNA fragments to be sequenced onto a glass slide. Primers hybridize to a sequence within the template. A set of four fluorescently labeled di-base probes compete for ligation to the sequencing primer. Specificity of the di-base probe is achieved by interrogating every 1st and 2nd base in each ligation reaction. Multiple cycles of ligation, detection and cleavage are performed with the number of cycles determining the eventual read length. Following a series of ligation cycles, the extension product is removed and the template is reset with a primer complementary to the n−1 position for a second round of ligation cycles. Five rounds of primer reset are completed for each sequence tag. Through the primer reset process, each base is interrogated in two independent ligation reactions by two different primers. For example, the base at read position 5 is assayed by primer number 2 in ligation cycle 2 and by primer number 3 in ligation cycle 1. This dual interrogation is fundamental to the unmatched accuracy characterized by the SOLiD System.
The SOLiD System relies on an open slide format and flexible bead densities to enable increases in throughput with protocol and chemistry optimizations. The SOLiD System provides system accuracy greater than 99.94%, due to two-base encoding. Two-base encoding enables unique error checking capability, providing higher confidence in each call. The SOLiD™ System can generate over 20 gigabases and 400M tags per run. The independent flow cell configuration of the SOLiD Analyzer allows two completely independent experiments to be performed in a single run. The combination of multiple slide configuration and sample multiplexing capability enables multiple samples to be analyzed cost effectively for a variety of applications. The SOLiD System supports sample preparation for mate-paired libraries with insert sizes ranging from 600 bp up to 10 kbp. This broad range of insert sizes, combined with ultra-high throughput and the flexible two flow cell configuration, enables more precise characterization of structural variation across the genome.
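The error checking capability attributed to two-base encoding can be illustrated with a small sketch. It assumes the commonly described four-color scheme in which each color reports the transition between two adjacent bases, modeled here as the XOR of 2-bit base codes; this mapping and the function names are assumptions made for illustration and are not taken from the description above. Under this scheme, a true single-base difference changes two adjacent colors, whereas an isolated measurement error changes only one.

```python
BASE_TO_BITS = {"A": 0, "C": 1, "G": 2, "T": 3}
BITS_TO_BASE = "ACGT"


def encode(seq):
    """Return the first base and the list of colors for adjacent base pairs."""
    bits = [BASE_TO_BITS[base] for base in seq]
    colors = [a ^ b for a, b in zip(bits, bits[1:])]
    return seq[0], colors


def decode(first_base, colors):
    """Rebuild the base sequence from the first base and the color list."""
    bits = [BASE_TO_BITS[first_base]]
    for color in colors:
        bits.append(bits[-1] ^ color)
    return "".join(BITS_TO_BASE[b] for b in bits)


# Hypothetical reference read and a variant with one substituted base.
reference = "ATGGC"
variant = "ATAGC"   # G -> A at the third base

first_ref, colors_ref = encode(reference)
first_var, colors_var = encode(variant)
changed = sum(a != b for a, b in zip(colors_ref, colors_var))

print(colors_ref, colors_var, changed)
print(decode(first_ref, colors_ref))
```

Because every interior base contributes to two colors, a single changed color with no adjacent change is inconsistent with a real substitution and can be flagged as a likely measurement error.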
3. Identification
In the context of the disclosed methods, identification refers to determination of the particular type or instance of a thing, such as of the disclosed status biomarkers, repetitive DNA sequence, repetitive DNA sequence loci, families of status biomarkers, families of repetitive DNA sequences, etc. Thus, for example, a status biomarker can be identified by determining part of its sequence, where the sequence is characteristic of that status biomarker. In the disclosed method, a number of components are, or can be designed, to correspond to, be complementary to, or be for particular other components. By such correspondence, identification of one component can often allow identification of any other components that correspond. For example, a probe can be designed with a target complement sequence that is complementary to a particular sequence of a status biomarker of interest. The probe can be said to correspond to, or to be for, the status biomarker of interest. When used in the disclosed methods, detection or identification of the probe can result in the detection of the presence, or identification, of the corresponding status biomarker in the sample.
The term “hit” refers to a test compound that shows desired properties in an assay. The term “test compound” refers to a chemical to be tested by one or more screening method(s) as a putative modulator. A test compound can be any chemical, such as an inorganic chemical, an organic chemical, a protein, a peptide, a carbohydrate, a lipid, or a combination thereof. Usually, various predetermined concentrations of test compounds are used for screening, such as 0.01 micromolar, 1 micromolar and 10 micromolar. Test compound controls can include the measurement of a signal in the absence of the test compound or comparison to a compound known to modulate the target.
The terms “higher,” “increases,” “elevates,” or “elevation” refer to increases above basal levels, e.g., as compared to a control. The terms “low,” “lower,” “reduces,” or “reduction” refer to decreases below basal levels, e.g., as compared to a control.
The term “modulate” as used herein refers to the ability of a compound to change an activity in some measurable way as compared to an appropriate control. As a result of the presence of compounds in the assays, activities can increase or decrease as compared to controls in the absence of these compounds. Preferably, an increase in activity is at least 25%, more preferably at least 50%, most preferably at least 100% compared to the level of activity in the absence of the compound. Similarly, a decrease in activity is preferably at least 25%, more preferably at least 50%, most preferably at least 100% compared to the level of activity in the absence of the compound. A compound that increases a known activity is an “agonist”. One that decreases, or prevents, a known activity is an “antagonist.”
The term “inhibit” means to reduce or decrease in activity or expression. This can be a complete inhibition of activity or expression, or a partial inhibition. Inhibition can be compared to a control or to a standard level. Inhibition can be 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, or 100%.
The term “monitoring” as used herein refers to any method in the art by which an activity can be measured.
The term “providing” as used herein refers to any means of adding a compound or molecule to something known in the art. Examples of providing can include the use of pipettes, pipettemen, syringes, needles, tubing, guns, etc. This can be manual or automated. It can include transfection by any means or any other means of providing nucleic acids to dishes, cells, tissue, cell-free systems and can be in vitro or in vivo.
The term “preventing” as used herein refers to administering a compound prior to the onset of clinical symptoms of a disease or conditions so as to prevent a physical manifestation of aberrations associated with the disease or condition.
The term “in need of treatment” as used herein refers to a judgment made by a caregiver (e.g. physician, nurse, nurse practitioner, or individual in the case of humans; veterinarian in the case of animals, including non-human mammals) that a subject requires or will benefit from treatment. This judgment is made based on a variety of factors that are in the realm of a caregiver's expertise, but that include the knowledge that the subject is ill, or will be ill, as the result of a condition that is treatable by the disclosed compounds.
As used herein, “subject” includes, but is not limited to, animals, plants, bacteria, viruses, parasites and any other organism or entity. The subject can be a vertebrate, more specifically a mammal (e.g., a human, horse, pig, rabbit, dog, sheep, goat, non-human primate, cow, cat, guinea pig or rodent), a fish, a bird or a reptile or an amphibian. The subject can be an invertebrate, more specifically an arthropod (e.g., insects and crustaceans). The term does not denote a particular age or sex. Thus, adult and newborn subjects, as well as fetuses, whether male or female, are intended to be covered. A patient refers to a subject afflicted with a disease or disorder. The term “patient” includes human and veterinary subjects.
By “treatment” and “treating” is meant the medical management of a subject with the intent to cure, ameliorate, stabilize, or prevent a disease, pathological condition, or disorder. This term includes active treatment, that is, treatment directed specifically toward the improvement of a disease, pathological condition, or disorder, and also includes causal treatment, that is, treatment directed toward removal of the cause of the associated disease, pathological condition, or disorder. In addition, this term includes palliative treatment, that is, treatment designed for the relief of symptoms rather than the curing of the disease, pathological condition, or disorder; preventative treatment, that is, treatment directed to minimizing or partially or completely inhibiting the development of the associated disease, pathological condition, or disorder; and supportive treatment, that is, treatment employed to supplement another specific therapy directed toward the improvement of the associated disease, pathological condition, or disorder. It is understood that treatment, while intended to cure, ameliorate, stabilize, or prevent a disease, pathological condition, or disorder, need not actually result in the cure, amelioration, stabilization or prevention. The effects of treatment can be measured or assessed as described herein and as known in the art as is suitable for the disease, pathological condition, or disorder involved. Such measurements and assessments can be made in qualitative and/or quantitative terms. Thus, for example, characteristics or features of a disease, pathological condition, or disorder and/or symptoms of a disease, pathological condition, or disorder can be reduced to any effect or to any amount.
A cell can be in vitro. Alternatively, a cell can be in vivo and can be found in a subject. A “cell” can be a cell from any organism including, but not limited to, a bacterium.
By the term “effective amount” of a compound as provided herein is meant a nontoxic but sufficient amount of the compound to provide the desired result. As will be pointed out below, the exact amount required will vary from subject to subject, depending on the species, age, and general condition of the subject, the severity of the disease that is being treated, the particular compound used, its mode of administration, and the like. Thus, it is not possible to specify an exact “effective amount.” However, an appropriate effective amount can be determined by one of ordinary skill in the art using only routine experimentation.
By “pharmaceutically acceptable” is meant a material that is not biologically or otherwise undesirable, i.e., the material can be administered to an individual along with the selected compound without causing any undesirable biological effects or interacting in a deleterious manner with any of the other components of the pharmaceutical composition in which it is contained.
It is understood that the disclosed method and compositions are not limited to the particular methodology, protocols, and reagents described as these may vary. It is also to be understood that the terminology used herein is for the purpose of describing particular embodiments only, and is not intended to limit the scope of the present invention which will be limited only by the appended claims.
It must be noted that as used herein and in the appended claims, the singular forms “a”, “an”, and “the” include plural reference unless the context clearly dictates otherwise. Thus, for example, reference to “a status biomarker” includes a plurality of such status biomarkers, reference to “the status biomarker” is a reference to one or more status biomarkers and equivalents thereof known to those skilled in the art, and so forth.
“Optional” or “optionally” means that the subsequently described event, circumstance, or material may or may not occur or be present, and that the description includes instances where the event, circumstance, or material occurs or is present and instances where it does not occur or is not present.
Ranges may be expressed herein as from “about” one particular value, and/or to “about” another particular value. When such a range is expressed, also specifically contemplated and considered disclosed is the range from the one particular value and/or to the other particular value unless the context specifically indicates otherwise. Similarly, when values are expressed as approximations, by use of the antecedent “about,” it will be understood that the particular value forms another, specifically contemplated embodiment that should be considered disclosed unless the context specifically indicates otherwise. It will be further understood that the endpoints of each of the ranges are significant both in relation to the other endpoint, and independently of the other endpoint unless the context specifically indicates otherwise. Finally, it should be understood that all of the individual values and sub-ranges of values contained within an explicitly disclosed range are also specifically contemplated and should be considered disclosed unless the context specifically indicates otherwise. The foregoing applies regardless of whether in particular cases some or all of these embodiments are explicitly disclosed.
Unless defined otherwise, all technical and scientific terms used herein have the same meanings as commonly understood by one of skill in the art to which the disclosed method and compositions belong. Although any methods and materials similar or equivalent to those described herein can be used in the practice or testing of the present method and compositions, the particularly useful methods, devices, and materials are as described. Publications cited herein and the material for which they are cited are hereby specifically incorporated by reference. Nothing herein is to be construed as an admission that the present invention is not entitled to antedate such disclosure by virtue of prior invention. No admission is made that any reference constitutes prior art. The discussion of references states what their authors assert, and applicants reserve the right to challenge the accuracy and pertinency of the cited documents. It will be clearly understood that, although a number of publications are referred to herein, such reference does not constitute an admission that any of these documents forms part of the common general knowledge in the art.
Throughout the description and claims of this specification, the word “comprise” and variations of the word, such as “comprising” and “comprises,” means “including but not limited to,” and is not intended to exclude, for example, other additives, components, integers or steps.
Those skilled in the art will recognize, or be able to ascertain using no more than routine experimentation, many equivalents to the specific embodiments of the method and compositions described herein. Such equivalents are intended to be encompassed by the following claims.
1. Introduction
There is a need for sensitive and accurate assays capable of detecting the presence of abnormal cells in tissues. Such abnormal cells may represent, in aging tissues, dysplasia, carcinoma in situ, cancer, or metastatic cancer. There is also a need for accurate assays capable of reporting the response of patients to therapies whose purpose is to kill tumor cells, or to reduce the number of abnormal or dysplastic cells in an organ compartment.
The detection of circulating DNA derived from dead or damaged cells is an attractive strategy for implementing assays that are useful for the aforementioned purposes. DNA is a very stable molecule, and can persist for a long time in the circulation. Thus, when tumor cells or other abnormal cells die, the DNA may be detected in the circulation. There is a large literature reporting the detection in the circulation of DNA derived from tumors, or from abnormal cells. Recently Sunami et al. (2008) reported the quantification of LINE-1 in circulating DNA as a molecular biomarker of breast cancer. An earlier report by Rago et al. (2007) had described the assessment of human tumor burdens in mouse xenografts by the analysis of circulating human-specific LINE-1 DNA. In the mouse study, the amount of human LINE-1 DNA present in plasma was shown to spike after tumor cytotoxic therapy, demonstrating the utility of this biomarker for monitoring drug responses. In both of these studies the detection of circulating DNA is facilitated by the fact that each dying cell releases thousands of molecules of LINE-1 DNA, and therefore the limit of detection of these assays corresponds to a relatively small number of cells of interest.
Recently Korshunova et al. (2008) reported the results of a comprehensive methylation pattern analysis from breast cancer clinical tissues and sera obtained using massively parallel bisulphite pyrosequencing of four different gene loci in the human genome. The detailed sequencing analysis of more than 700,000 DNA fragments derived from more than 50 individuals (cancer and cancer-free) revealed an unappreciated complexity of genomic cytosine-methylation patterns in both tissue derived and circulating DNAs. Key observations of this study were as follows: First, there were no tumor-specific molecular methylation sequence patterns obtained from any of the four tested loci. Tumors and cancer-free tissues as well as sera of cancer-free individuals contained nearly every conceivable cytosine methylation pattern. A great variety of methylated molecules were present in all samples, yet no special type of methylation pattern could be found in a statistically meaningful way exclusively in cancerous or exclusively in normal tissues (or serum). At all four tested loci, while there were no tumor-specific molecules, there were different tumor-specific loads of abnormally methylated DNA molecules. The second important finding of the work was that the levels of methylation vary greatly among tumor samples, yet little variation in methylation levels was found in samples considered histologically normal. The third important observation of this study involved the quantification of the background level of circulating, abnormally methylated molecules in cancer-free patient sera. According to cancer-specific mutation-based estimations, tumor DNA in serum at early stages of disease is present at a relative abundance of about 12 haploid genomes for every 10,000 somatically normal haploid genomes (0.12%) or less. Expected methylation signals from such minute amounts are so close to the level of background (in most cases around several tens of the percent) that robust detection of tumor-shed DNA was problematic, especially in the case of an epigenetically complex background.
Over time, many laboratories have been identifying, and will continue to identify, biomarkers that are indicative of aging, dysplasia, or cancer. A common property of the majority of these cancer markers is that they vary depending on the tissue of origin. For example, Ince et al. (2007) have published findings that indicate that transformation of different human breast epithelial cell types leads to distinct tumor phenotypes. This is the case because tumor phenotype tends to resemble progenitor tissue due to natural lineage differentiation relationships. The present disclosure provides methods of identifying biomarkers that can be of general utility in detecting all types of tumors, such that they will serve for detection of any tumor type, as well as for the detection of dysplasia in all types of tissues.
Changes in DNA methylation patterns of non-coding genomic compartments of cancer cells have been explored. The analysis focused on repetitive DNA in cancers of the head and neck. Korshunova et al. (2008) did not examine the methylation abnormalities of repetitive DNA. Most reported studies on DNA methylation of repetitive elements have measured the methylation level of repetitive elements by obtaining average metrics, representative of mixtures of thousands of different repetitive elements. A novel method that reports the DNA methylation level of each individual locus harboring a repetitive DNA element was utilized. The method used for this analysis provided a convenient tool to survey the methylation status of individual repetitive elements.
2. Results
i. Methylation Patterns of Major Classes and Families of DNA Repeats
The DNA methylation profiles of 33 tumors and 17 non-tumor adjacent tissue samples obtained from patients with head and neck squamous carcinoma (HNSCC) were analyzed. DNA methylation profiles from the buccal epithelia of 10 normal individuals were also generated, which served as controls. A novel microarray method for analysis of DNA methylation, based on the use of methylation sensitive as well as methylation dependent endonucleases, enables the interrogation of methylation levels in all compartments of the genome, including repetitive elements. Analysis of a substantial set of samples of squamous carcinomas of the head and neck, as well as non-tumor adjacent tissue and normal controls, reveals a complex framework of epigenetic dysregulation, where loss of methylation differentially affects distinct families of repetitive elements. Predominantly the younger, primate-specific members of retroelement families suffer the most dramatic loss of methylation, with the exception of some extremely young, human-specific retroelements. These complex patterns of differential susceptibility to disruption of silencing are probably a result of the natural history of evolutionary domestication of retroelements in genomes, in interplay with a minimal time requirement for strong silencing to be established. Primate-specific subfamilies of LINE-1 elements appear to suffer a particularly pronounced loss of methylation in tumors, with the most dramatic changes apparently observed for those primate retroelements with conserved promoter regions and longer sequences.
ii. Repetitive Elements as Cancer Biomarkers.
DNA methylation status of repetitive elements has been used as a biomarker for cancer risk. The majority of these studies have focused on the DNA methylation status of Line-1 elements, while a few have utilized Alu elements instead. A sampling of seven exemplary publications on this subject was examined, and five pairs of different DNA primer sequences were identified that have been utilized to amplify Line-1 DNA sequences, typically after treatment of the DNA with sodium bisulfite. Using the computer program FASTA, the positions in the human genome where these five different sets of primers are perfectly aligned were identified, and the exact composition of the amplified repetitive elements was predicted, from the standpoint of repeat masker annotation. The sequences that are predicted to be amplified by the polymerase chain reaction in every case represent a complex mixture of Line-1 elements corresponding to different families of different evolutionary age. The lineages that are most highly represented, shown in the table, are L1 HS (human specific) and L1PA2, a relatively recent lineage that originated in simians approximately 7.6 million years ago (see Table 2).
In some cases, L1PA3 elements are also highly represented. In conclusion, primers used in the published literature for the amplification of Line-1 biomarkers are not designed optimally, and do not specifically sample any chosen L1 subfamily, but rather a mixture of subfamilies. A consequence of the sub-optimal design of all of the primer pairs reported in the literature is that the Line-1 sequences being sampled to generate DNA methylation metrics are not those genomic sequences that contain the most useful information related to the onset of dysplasia and cancer.
iii. Generation of Subsets of Genomic Repetitive Element Sequences that are Useful for Distinguishing Tumors from Adjacent Nontumor Tissue, and from Healthy Normal Tissue
A list of DNA methylation values calculated as the average methylation of each category or sub-category of repetitive element was generated, including retrotransposon-derived elements and DNA-transposon-derived elements. The values are obtained for individual experiments, and each average is generated by multiple probes of the same category, where each category will comprise anywhere from 20 to 48,000 probes. The data for individual members of each individual family were then analyzed. As an example of this analysis, the plot shown in
The arrows point to DNA methylation values calculated by taking the fractional values obtained from Table 1, and calculating a weighted average that takes into account the fractional composition, as well as the DNA methylation value of each class represented in the mixture. The “in-silico PCR” values represent the simulated prediction of the DNA methylation metrics that would be obtained if one were to perform a PCR experiment based on the use of published primer sequences, and utilizing as biological material the DNA obtained from the samples of cancer of the head and neck. It is notable that none of the DNA methylation values indicated by the arrows represents a metric with optimal information content.
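The weighted average described above can be sketched in Python as follows. This is a minimal illustration only: the function name, the subfamily fractions, and the methylation values are hypothetical placeholders and are not taken from Table 1.

```python
def weighted_methylation(composition, methylation):
    """Weighted average of per-subfamily methylation values.

    composition: dict mapping subfamily name -> fraction of the amplified mixture
    methylation: dict mapping subfamily name -> average methylation value
    """
    total = sum(composition.values())
    if total <= 0:
        raise ValueError("composition fractions must sum to a positive value")
    weighted_sum = sum(frac * methylation[name] for name, frac in composition.items())
    # Dividing by the total tolerates fractions that do not sum exactly to 1.
    return weighted_sum / total


# Hypothetical numbers, for illustration only (not values from the disclosure).
composition = {"L1HS": 0.5, "L1PA2": 0.3, "L1PA3": 0.2}
methylation = {"L1HS": -0.2, "L1PA2": -0.4, "L1PA3": -0.1}
estimate = weighted_methylation(composition, methylation)
```

With these placeholder inputs the simulated “in-silico PCR” value is dominated by the most abundant subfamily, which illustrates why a primer pair that samples a mixture reports a blended metric.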
Among the publications cited as examples of papers teaching the use of Line-1 primers as cancer biomarkers, the most recent (Choi et al., 2009, using Line-1 primers designed by Woloszynska-Read et al., 2008) points out that “5-mdC level in leukocyte DNA was significantly lower in breast cancer cases than healthy controls (p=0.001), but no significant case-control differences were observed with LINE-1 methylation”. It is not surprising that Choi et al. did not observe significant case-control differences, in the light of the data presented herein (see
iv. Statistical Analysis of Optimal Repetitive DNA Biomarkers Selected from a Large Set of Repetitive Elements that Suffer DNA Methylation Changes in Dysplasia and Cancer.
A list of repetitive DNA subfamilies that comprises approximately 900 members was generated. A list of DNA methylation values calculated as the average methylation of each category or sub-category of repetitive element was generated, including retrotransposon-derived elements and DNA-transposon-derived elements. The values are obtained for individual experiments, and each average is generated by multiple probes of the same category. Two independent algorithms were used to rank the variables based on their abilities to classify experiments. The Wilcoxon test was used to classify tumor and non-tumor adjacent experiments. Random Forest was used to classify Normal, Non-Tumor Adjacent and Tumor experiments. Both algorithms relied on the same definition of variables. The variables included single probes, or collections of probes sharing a common feature, i.e. proximity to the repetitive element. Both algorithms ranked the variables based on repetitive elements and non-genic, non-repetitive probes very high. Moreover, the repetitive element categories appear to be better classifiers than the gene probes, as evidenced by the enrichment of repetitive element categories in the top ranked categories. Specifically, in the top 30 categories there were 7 gene probes (out of ˜44,000), and 14 repetitive element categories (out of a total of 896) (
The Wilcoxon test results, where the biomarker is ranked based on Wilcoxon test p-value for the top 200 variables out of 138,783 (repetitive elements, genes, non-genic, non-repetitive) are shown in Table 3. The Wilcoxon test results for the top 200 out of 90,007 non-repetitive non-gene probes are shown in Table 4. The Wilcoxon test results for all repetitive categories and literature-based categories (898) are shown in Table 5.
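Ranking of variables by Wilcoxon test p-value can be sketched as follows. This is a simplified illustration under stated assumptions, not the analysis code used to produce the tables: it uses the two-sample rank-sum form of the test with a normal approximation, applies no tie or continuity correction, and the category names and values are hypothetical.

```python
import math


def rank_sum_p_value(x, y):
    """Two-sided Wilcoxon rank-sum p-value using a normal approximation.

    Tied values receive average ranks; no tie correction is applied to the
    variance and no continuity correction is used (simplifying assumptions).
    """
    pooled = sorted([(v, 0) for v in x] + [(v, 1) for v in y])
    n1, n2 = len(x), len(y)
    rank_sum_x = 0.0
    i = 0
    while i < len(pooled):
        j = i
        while j < len(pooled) and pooled[j][0] == pooled[i][0]:
            j += 1
        avg_rank = (i + 1 + j) / 2.0  # average of the 1-based ranks i+1 .. j
        rank_sum_x += avg_rank * sum(1 for k in range(i, j) if pooled[k][1] == 0)
        i = j
    u = rank_sum_x - n1 * (n1 + 1) / 2.0
    mean_u = n1 * n2 / 2.0
    sd_u = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12.0)
    z = (u - mean_u) / sd_u
    return math.erfc(abs(z) / math.sqrt(2.0))


def rank_variables(tumor, adjacent):
    """Return variable names ordered from smallest to largest p-value."""
    scored = [(rank_sum_p_value(tumor[name], adjacent[name]), name) for name in tumor]
    return [name for _, name in sorted(scored)]


# Hypothetical per-category average methylation values (illustration only).
tumor = {"catA": [-1.2, -1.0, -0.9, -1.1, -0.8], "catB": [0.1, -0.1, 0.0, 0.2, -0.2]}
adjacent = {"catA": [-0.1, 0.0, 0.1, -0.2, 0.2], "catB": [0.05, -0.05, 0.15, -0.15, 0.0]}
ranking = rank_variables(tumor, adjacent)
```

In this toy example the category whose tumor values are fully separated from the adjacent-tissue values is ranked first. For small sample sizes an exact test would be preferable to the normal approximation used here.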
indicates data missing or illegible when filed
Next, statistical analysis was performed using random forest binary decision trees. Table 6 shows the importance of the top 45 of 139,379 variables, generated using the Random Forest algorithm. The categories include gene probes (gene), non-genic and non-repetitive probes (nonrep), and repetitive elements. The random forest classifier based on the repetitive element categories alone worked well (89% accuracy). Both algorithms agree on several categories of repetitive elements being the most informative, i.e. both algorithms report them in the top 20, for example: MER67D, HUERS-P3B, MER6, MER66C, ERVL, MLT1G1, MLT2D, MER50B, THE1B (Table 5 and Table 7). In both analyses, the categories based on the primer designs discussed in recent literature ranked much lower, i.e. approximately 200 (Table 5, Wilcoxon) or approximately 350 (Table 7, Random Forest), than the categories defined based on repetitive elements.
307 | X. Rago | 494 | −0.26 | 0.06 | 1.5 | 1.0 | 0.8 | 0.94 | 0.01
318 | K. Sunami | 749 | −0.29 | 0.06 | 1.9 | 0.6 | 0.6 | 0.92 | 0.00
333 | X. Woloszynska | 486 | −0.26 | 0.06 | 1.4 | 1.6 | 0.4 | 0.88 | 0.01
371 | K. Yang | 810 | −0.29 | 0.06 | 1.1 | 1.1 | 0.8 | 0.80 | 0.00
387 | K. Chalitchagorn | 729 | −0.29 | 0.06 | 1.8 | 0.8 | 0.6 | 0.77 | 0.00
indicates data missing or illegible when filed
1. Introduction
Close to 50% of the human genome harbors repetitive sequences originally derived from mobile DNA elements, and in normal cells this sequence compartment is tightly regulated by epigenetic silencing mechanisms involving chromatin-mediated repression. In cancer cells, repetitive DNA elements suffer abnormal demethylation, with potential loss of silencing. A genome-wide microarray approach was used to measure DNA methylation changes in cancers of the head and neck, and to compare these changes to alterations found in adjacent non-tumor tissues. Specific alterations were observed at thousands of small clusters of CpG dinucleotides associated with DNA repeats. Among the 257,599 repetitive elements probed, 5 to 8% showed disease-related DNA methylation alterations. In dysplasia, a large number of local events of loss of methylation appear in apparently stochastic fashion. Loss of DNA methylation is most pronounced for certain members of the SVA, HERV, LINE-1P, AluY, and MaLR families. The methylation levels of retrotransposons are discretely stratified, with younger elements being highly methylated in healthy tissues, while in tumors these young elements suffer the most dramatic loss of methylation. Wilcoxon test statistics reveal that a subset of primate LINE-1 elements is demethylated preferentially in tumors, as compared to non-tumoral adjacent tissue. Sequence analysis of these strongly demethylated elements reveals genomic loci harboring full-length, as opposed to truncated elements, while possible enrichment for functional LINE-1 ORFs is weaker. This analysis indicates that in non-tumor adjacent tissues there is generalized and highly variable disruption of epigenetic control across the repetitive DNA compartment, while in tumor cells a specific subset of LINE-1 retrotransposons that arose during primate evolution suffers the most dramatic DNA methylation alterations.
Herein is a systematic study of DNA methylation changes occurring in the repetitive DNA compartment of squamous carcinomas of the head and neck. In contrast to previous studies, a novel microarray-based approach was used to obtain discrete DNA methylation data at hundreds of thousands of individual repetitive DNA loci in the human genome. Extensive annotation resources for different subfamilies of repeats were then used to evaluate possible relationships between loss of epigenetic silencing in the context of the natural history of cancer, and the evolutionary history of repetitive element sub-compartments in the human genome.
2. Materials and Methods
A specific microarray analysis method permits genome-wide assessment of DNA methylation status using restriction endonucleases (described below). Among the 339,314 probes in the microarray, 257,599 are dedicated to the measurement of the methylation levels of individual members of interspersed DNA repeat families. The probes, and the loci to which they hybridize, can be grouped into families or categories of probes and loci based on, for example, repetitive DNA sequence families to which the loci belong. Such groups can be used as collective status biomarkers.
i. Principle of the DNA Methylation Analysis Method
Multiple displacement amplification (MDA, Dean et al., 2002; Lage et al., 2003; Lage et al., 2005; U.S. Patent Application Publication No. 20040063144) is an isothermal amplification method based on random priming and DNA hyper-branching, catalyzed by a strand-displacing DNA polymerase. The yield of the MDA reaction is strongly influenced by the size of the DNA used as template (Lage et al., 2005). The dependence of amplification yield on the size of the DNA template has been systematically studied, and a computational model of the reaction that fits the experimental data was built. The results of this analysis indicate that the yield of DNA derived from any sequence segment depends on template size, and additionally on the distance of the sequence segment from the nearest DNA terminus on the template molecule. Other amplification techniques that have a similar effect can also be used. A specific cleavage event in a genomic DNA molecule could be detected by measuring DNA amplification yield using a DNA microarray, and a probe in the microarray would be able to measure a local reduction in sequence representation due to cleavage, even if that cleavage event occurred as far as 1200 bases upstream or downstream from the location of the probe. This property enables the use of probe designs that measure cleavage events not only in unique DNA sequences overlapping a probe, but also cleavage events within repetitive DNA sequences that contain CpG dinucleotides, located in the vicinity of a probe of unique sequence, within a window of approximately 2400 bases surrounding the probe. Experimental data is provided that helps to define the approximate size of the window that enables probing-at-a-distance.
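As a rough illustration of probing-at-a-distance, the following Python sketch lists the cleavage sites that fall within the window surrounding a probe. The function name and the coordinates are hypothetical, and the hard cutoff of 1200 bases on either side of the probe is a simplifying assumption adopted for the example; it does not model amplification yield.

```python
def cuts_within_window(probe_pos, cut_sites, half_window=1200):
    """Return the cleavage sites lying within half_window bases of a probe.

    probe_pos: genomic coordinate of the probe (single position, for simplicity)
    cut_sites: iterable of genomic coordinates of predicted cleavage sites
    half_window: distance in bases on either side of the probe (assumed cutoff)
    """
    return [site for site in cut_sites if abs(site - probe_pos) <= half_window]


# Hypothetical coordinates, for illustration only.
probe = 10_000
sites = [8_500, 8_900, 9_950, 11_200, 11_300]
measurable = cuts_within_window(probe, sites)
```

Under these assumptions a probe of unique sequence placed next to a repetitive element can report on cleavage sites inside the element, provided the sites lie within the window.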
ii. Microarray Probe Design
DNA probes of unique sequence (uniqueness assessed using merEngine, Healy et al., 2003) were designed to map as closely as possible to every CpG island in the human genome. The DNA sequences located within a window of plus or minus 4 kb from loci coding for microRNAs were examined, and many of these regions contained small clusters of CpG residues. A relatively lax “CpG islet” specification was then created, requiring that a region in the genome contain a minimum of 7 CpG residues, that the ratio of the CG count to the GC content be larger than 0.53, and that the region be no shorter than 200 bases to be nominated as a CpG islet (this is only an example of a specification of CpG islets; other specifications are disclosed elsewhere herein). Using this specification, 453 out of the 532 microRNA loci in the Sanger database (Griffiths-Jones, 2006) are associated with at least one CpG islet within a window of +/−4 kb. By contrast, based on the more restrictive Takai and Jones definition (2002), the equivalent count of CpG islands in the vicinity of microRNA loci is 141. The total count of CpG islets in the human genome using this relaxed specification is approximately 500,000. A custom microarray containing probes for all CpG islands and CpG islets was designed, in order not to miss DNA methylation changes that may occur in tumors in CpG-rich regions that would not fit the standard CpG island definition. Five broad classes of CpG islands and CpG islets were probed: promoter associated; unique, non-promoter associated; interspersed repeat associated (Jurka, 1998; Smit, 1996-2004); tandem repeat associated (Benson, 1999); and microRNA locus associated (Griffiths-Jones, 2006). A subset of the probes was replicated on the array surface, bringing the total number of probes in the microarray to 377,000. The coordinates of the probes relevant to the Top 138 repetitive DNA sequence families are shown in Table 15.
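The exemplary CpG islet specification can be expressed as a simple sequence test, sketched below in Python. The function name is hypothetical. The phrase “ratio of the CG count to the GC content” is read here as the conventional observed/expected CpG ratio, (number of CpG × length) / (number of C × number of G); that reading is an assumption, and other readings of the specification are possible.

```python
def is_cpg_islet(seq, min_cpg=7, min_ratio=0.53, min_length=200):
    """Test a DNA sequence against the exemplary CpG islet specification.

    Assumption: the ratio criterion is interpreted as the observed/expected
    CpG ratio, (CpG count * length) / (C count * G count).
    """
    seq = seq.upper()
    length = len(seq)
    if length < min_length:
        return False
    cpg_count = seq.count("CG")
    c_count = seq.count("C")
    g_count = seq.count("G")
    if cpg_count < min_cpg or c_count == 0 or g_count == 0:
        return False
    observed_over_expected = cpg_count * length / (c_count * g_count)
    return observed_over_expected > min_ratio


# Synthetic sequences, for illustration only.
rich = "ACGT" * 50              # 200 bases, 50 CpG dinucleotides
short = "ACGT" * 10             # too short to qualify
sparse = "AT" * 100 + "CG" * 3  # long enough, but only 3 CpG dinucleotides
```

A genome-wide scan would apply such a test to sliding windows; the window size and step used for that purpose are not specified here.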
iii. Experimental Work Flow for Microarray Analysis
Relative methylation was measured by splitting the DNA sample in two equal aliquots, and digesting each aliquot with either methylation-sensitive or methylation-dependent restriction endonucleases, respectively, as shown diagrammatically in
Analysis with ASCIIMap can show the locations of the probes and the restriction endonuclease cutting sites in the CpG islands associated with these elements. Probes for unique sequences can overlap with repetitive DNA sequences. The probe design algorithm always ensures that the sequence is unique before designing a probe. The apparent paradox that a repetitive element may have parts that are unique sequences can be explained by considering the age of the repetitive elements for which the probe is designed. Consider, for example, an element of the family MLT1C, which is 85 million years old (MYO): over the span of millions of years since it appeared in its original form in the genome, its sequence has deteriorated from its consensus so much that, although the element can still be classified as MLT1C (based on the overall structure and certain sequence patterns), it has acquired enough random mutations that the probe algorithm can recognize certain parts within this MLT1C as unique in the genome. For repetitive element families that are younger, i.e. the elements that have not had evolutionary time to acquire mutations differentiating them from their respective consensus, the probe designer most likely designs the probe within the 100-base flanking region of the repetitive element. Conversely, for the older repetitive elements (20, 30, 40+ MYO), the probe designer is able to find regions that have uniquely diverged from the global consensus of the repeat family.
The experimental data obtained from 74 different probe loci in the microarray was independently validated by bisulfite sequencing using either Sanger sequencing of individual clones of PCR products, or using the Sequenom EpiTyper platform, which is based on sequencing of transcribed RNA by mass spectrometry. Sanger-based analysis was performed for a total of 59 different microarray probes. To analyze the correlation between the microarray read-out and the results of Sanger sequencing, the sequences were classified as un-methylated, composite or methylated, based on the count of CpGs methylated or demethylated in all the clones of the sequencing result of a locus. For 48 of the 59 probes, there was agreement between the microarray methylation result and the bisulfite Sanger sequencing result, for a concordance of 81.4%. The bisulfite sequencing validation analysis produced methylation results from the microarray analysis, the map position of the probes, and the bisulfite sequencing result for one gene promoter region, one AluSq element and one AluY element. In these three validation experiments there is agreement between the microarray result and the bisulfite sequencing data. It should be noted, however, that in the case of the probe that samples the AluSq element there are neighboring sequences belonging to an MLT1C LTR element, whose methylation status will influence the measurements. An effort was made, through extensive probe annotation data, to keep track of these complex cases. The results presented herein were generated by calculating the average methylation of hundreds, or even thousands, of repetitive elements belonging to specific families of repeats. This averaging process will minimize the influence of the surrounding sequence context.
iv. Specimen Sample Acquisition and DNA Preparation
Tumor samples and adjacent non-tumor tissue were obtained through the Tissue Procurement Program of the Surgical Pathology Laboratory at Yale New Haven Hospital. All patients provided informed consent (IRB/HIC #14414). Representative histological sections of all specimens were reviewed to confirm the nature of the sample. After informed consent, oral epithelial cells from subjects with no known risk for oral cancer were obtained by scraping. DNA from all tissues was obtained using the MasterPure DNA Purification Kit (EPICENTRE). The protocol was as follows: for every reaction a mix of 150 μL of Tissue and Cell Lysis solution and 1.5 μL of proteinase K from the kit was created. Lysate from about 8 mm³ of specimen was collected. The lysate was vortexed every 5 min until the tissue was completely dissolved. Incubation at 65 degrees followed for 30-60 min. Subsequently 0.5 μL of RNase was added to each tube and incubated for 30 min at 37 degrees. 75 μL of MPC protein precipitation agent was added to the lysed sample. After centrifugation for 10 min at 15,000 rpm the supernatant was transferred to a labeled 1.5 mL tube. With 250 μL of isopropanol added to the supernatant, the tube was inverted multiple times. The DNA was then transferred using a Pasteur pipet and resuspended in 100 μL of TE (0.1 mM EDTA). The DNA was then stored for 2 days at 4 degrees. Subsequent quantitation was done using PicoGreen fluorescence.
200 ng of genomic DNA extracted from the head and neck tumor or the corresponding non-tumor adjacent tissue was digested with one of two sets of restriction enzymes. One genomic sample was digested by McrBC (New England Biolabs); the other was digested by AciI and HhaI (New England Biolabs). 20 units of each enzyme were used to set up a 45 μL reaction in the recommended buffer (McrBC: 50 mM NaCl, 10 mM Tris-HCl, 10 mM MgCl2, 1 mM DTT supplemented with 100 μg/mL BSA, and 5 mM GTP; AciI and HhaI: 50 mM Tris-HCl, 100 mM NaCl, 10 mM MgCl2, 1 mM DTT supplemented with 100 μg/mL BSA). Reactions were incubated at 37 °C for 6 hours and then boosted with an additional 10 units of the corresponding enzyme for another 12 hours, and finally inactivated at 65 °C for 20 minutes. One aliquot (20 ng) of each digested genomic DNA was subjected to whole genome amplification using the REPLI-G kit (Qiagen) with 8 hours of incubation at 30 °C. The amplified DNA sample was then purified with the QIAEX II kit (Qiagen) using a slightly modified protocol (3 instead of 2 washes with PE buffer, and final elution in water rather than EB buffer). 4 μg of the purified genomic DNA sample was submitted to Nimblegen for labeling and hybridization.
v. Microarray Response
A control experiment defined the longest distance from a probe at which endonuclease cuts can be measured using the microarray method. DNA was cleaved with PmeI, and processed by isothermal whole genome amplification followed by microarray analysis using uncleaved, amplified DNA as a reference channel. All probes containing a single predicted PmeI cut, and not bounded by another cut within a distance of +/−40 kb were plotted at x=0 in an xy plot. Other probes proximal to the cut site (upstream as well as downstream) were plotted according to their position in the x axis. The ratio of the two microarray channels (cleaved and uncleaved DNA) was plotted in the y axis. The deflection of the y axis in the xy plot indicates that a single endonuclease cut produces large changes in the ratio (y) within a window of +/−3.0 kb, with the most pronounced deflection of the ratio occurring within a window of +/−1.2 kb.
vi. Validation of Microarray-Based Observations Using Bisulfite DNA Sequencing
A total of 74 probe loci that showed DNA methylation changes in tumors were selected, and the DNA methylation status was examined using bisulfite DNA sequencing across a total of 12 experiments, for a total of 207 probe validation data points. DNA sequencing was performed using two different experimental approaches. In the first approach, bisulfite-treated DNA was used to amplify by PCR the genomic regions of interest, and the PCR amplicons were cloned. Individual clones were processed for Sanger sequencing in both strand orientations. In the second approach, bisulfite-treated DNA was used to amplify by PCR the genomic regions of interest, and the PCR amplicons were then transcribed to generate RNA using reagents provided by Sequenom, Inc. as part of their EpiTYPER kit. The RNA was then cleaved with ribonuclease A, and subjected to mass spectrometry analysis. Using software provided by Sequenom, the mass spectrograms were processed to generate a fractional value of DNA methylation between 0.0 and 1.0. When multiple probes associated with a single CpG island were averaged, the concordance of the microarray calls and the bisulfite sequencing results was 87.6%.
vii. Plotting the Data
All members of each phylogenetic branch of the repetitive element subfamilies were grouped together based on their sequence homology and estimated evolutionary age. The average of the individual log2 methylation values was calculated for all microarray probes belonging to each branch, and these subset-specific values were plotted across all experiments. A fourth class of experiments included three technical replicates of a microarray analysis performed using DNA from human sperm.
a. Per-Experiment Plots
Given that each probe in the microarray is annotated with its association to the proximal genomic elements (repetitive element category, gene, miRNA) for every experiment in the library, a query is issued to retrieve a subset of probes in the vicinity of a specific element. The set of probes (from which the subset of probes are retrieved) are complementary to unique sequences in loci containing CpG islands or CpG islets the selection of which is described elsewhere herein. A set of values from the probes in the retrieved subset of probes is then averaged per experiment and plotted accordingly (note, instead of average, any other function can be applied here). This is repeated for every experiment and for every category requested.
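The per-experiment query and averaging described above can be sketched in Python as follows. The data layout (dictionaries keyed by probe identifier), the function names, and the example values are assumptions made for illustration; the summarizing function is a parameter because, as noted, any function other than the average can be applied.

```python
from statistics import mean


def category_average(probe_values, probe_annotations, category, summarize=mean):
    """Summarize the values of all probes annotated with a given category.

    probe_values: dict mapping probe id -> methylation value (one experiment)
    probe_annotations: dict mapping probe id -> set of associated categories
    """
    selected = [
        value
        for probe_id, value in probe_values.items()
        if category in probe_annotations.get(probe_id, ())
    ]
    if not selected:
        return None  # no probe in the vicinity of this category
    return summarize(selected)


def per_experiment_table(experiments, probe_annotations, categories, summarize=mean):
    """Repeat the query for every experiment and every requested category."""
    return {
        name: {
            cat: category_average(values, probe_annotations, cat, summarize)
            for cat in categories
        }
        for name, values in experiments.items()
    }


# Hypothetical probes, annotations and values (illustration only).
annotations = {"p1": {"L1PA2"}, "p2": {"L1PA2", "AluY"}, "p3": {"AluY"}}
experiments = {"tumor_01": {"p1": -1.0, "p2": -0.5, "p3": 0.25}}
table = per_experiment_table(experiments, annotations, ["L1PA2", "AluY", "SVA"])
```

The resulting nested mapping holds one summarized value per experiment and category, which is the quantity plotted in a per-experiment plot.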
b. Per-Category Plots
An alternative to the plots described above are the per-category plots, devised to simplify the presentation of information especially when many categories of repetitive elements are to be plotted. For these plots, once an average of a given category of probes is calculated for all experiments, a box-and-whisker plot is then generated to summarize these values for experiment subsets: normal (top), non-tumor adjacent, tumor and sperm (bottom).
A standard boxplot implementation included in R programming language was embedded in a custom script to generate these plots.
c. On Order of Experiments
The experiments are always grouped (top to bottom) into normal, non-tumor adjacent, tumor and sperm-replicate classes. The order of experiments across all plots (unless stated otherwise) is kept constant. The order has been established based on the difference between the most informative category of L1P (Table 9) and the categories of repetitive elements that are most stable across all experiments: AluSq and DNA transposons.
d. Shannon Information Value
The order of categories in the legend of the per-experiment plots and in the category list of the per-category plots is not accidental. The categories are ordered based on the extent of their variation using the Shannon information content metric (available at cm.bell-labs.com/cm/ms/what/shannonday/shannon1948.pdf). Only Normal and Tumor experiments were used to establish the order of categories. Specifically, once the per-category values (average methylation for
The Shannon Information measure is a foundation of modern Information theory and was devised to estimate the minimum number of bits needed to encode sentence or a string of characters of text, if one wanted to transmit such string digitally. The information measure takes into consideration the frequency of the symbols. As a result, a string made up of the same symbol would require a very simple encoding using one bit of information, whereas a string made up of all the letters in the alphabet would need considerably more bits to represent all the letters unambiguously.
Analogously, the 43 values can be considered as the individual letters of Shannon's string. Shannon's entropy measures how dissimilar the 43 values are from each other. The more dissimilar, the more information is in the set.
The categories are listed from lowest information content (top) to the highest information content (bottom). The most informative categories are highlighted. A custom R script was used to generate the plots and calculate the information content.
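As a rough illustration of the information measure, the following sketch computes the Shannon entropy of a set of per-category values and orders categories from lowest to highest information content. It is not the custom R script referred to above. The discretization into equal-width bins (`n_bins`, `lo`, `hi`) is an assumption made here because the text does not state how the continuous methylation values were turned into symbols. The function names and toy values are hypothetical.

```python
import math
from collections import Counter

def shannon_information(values, n_bins=10, lo=0.0, hi=1.0):
    """Shannon entropy, in bits, of values discretized into equal-width bins.

    The binning scheme is an assumption of this sketch, not taken from the source.
    """
    width = (hi - lo) / n_bins
    # Clamp each value into a valid bin index in [0, n_bins - 1].
    bins = [min(max(int((v - lo) / width), 0), n_bins - 1) for v in values]
    total = len(values)
    return sum(-(c / total) * math.log2(c / total) for c in Counter(bins).values())

def order_by_information(category_values, **kwargs):
    """Return category names sorted from lowest to highest information content."""
    return sorted(category_values,
                  key=lambda name: shannon_information(category_values[name], **kwargs))

# Hypothetical toy values: one stable category and one variable category.
toy = {
    "stable": [0.52, 0.55, 0.53, 0.58],
    "variable": [0.05, 0.35, 0.65, 0.95],
}
ranking = order_by_information(toy)
```

In this toy case all four "stable" values fall into one bin (0 bits), while the four "variable" values fall into four different bins (2 bits), so the variable category is listed last.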
3. Results and Discussion
i. Methylation Patterns of Major Classes and Families of DNA Repeats
The DNA methylation profiles of 33 tumors and 17 non-tumor adjacent tissue samples obtained from patients with head and neck squamous carcinoma (HNSCC) were analyzed. DNA methylation profiles were also generated from the buccal epithelia of 10 normal individuals, which served as controls. In addition, an analysis of sperm DNA was performed in technical triplicate to assess the reproducibility of the microarray results. An average methylation value for selected subsets of “genomic probe compartments” was calculated. An exemplary profile of average methylation for two extremely different genomic probe compartments can be found in
An alternative way of examining the DNA methylation data is to compare methylation levels that have been normalized with respect to the values of the non-tumor adjacent tissue, as shown in
The following sections will focus on exploring in greater detail the dynamics of methylation patterns among various sequence sub-compartments of a given class of repetitive elements. To facilitate navigation through the perhaps unfamiliar nomenclature that identifies the various subclasses of elements, four tables are included that list the different subclasses within a class, in chronological order of estimated evolutionary age. These tables (Tables 10A through 10D) can be found in the supplementary materials. A table for the ERV retroelements is not included because the evolutionary ages and phylogenetic relationships of these elements are still a subject of investigation and revised annotation. To facilitate additional exploration of the data sets, the methylation levels can be grouped by subclass of repetitive element, rather than by tissue type (as in
ii. Methylation of MaLR Elements
Because AluY and L1P, two primate-specific subfamilies of repeats, scored higher on the information content statistic than their respective (and all-inclusive) parents, a relatively well-annotated family tree of “mammalian apparent LTR retrotransposons” (MaLR) was investigated (Smit, 1993). An analysis involving specific subsets of MaLR is shown in
iii. Methylation of Alu Elements
Alu elements are the most abundant class of repetitive elements in the human genome, with over one million copies spanning over 30 lineages. The most detailed published analysis of Alu DNA methylation in normal cells and cancer cells was reported by Rodriguez et al. (2008). These authors targeted unmethylated SmaI sites within Alu sequences and found that normal colon epithelial cells contain a subpopulation of undermethylated Alus, while in tumor cells the number of unmethylated Alu sequences is doubled. They also reported an increased methylation of the younger Alu subfamilies. The microarray-based analysis includes only those Alu lineages for which more than 200 unique locations were probed. As observed for other classes of elements, the younger elements (AluY) are more highly methylated in normal adult tissues, yet suffer a greater loss of DNA methylation in many tumors. Interestingly, the oldest Alu elements remain methylated in sperm, while the younger ones show loss of methylation in this tissue. The Alu lineage that is most informative across normal and tumor tissues is AluYb. Coincidentally, it is also the most active of all Alu lineages and is found primarily in human genomes (Jurka, 1993; Carter et al., 2004). AluYg, the next most informative lineage, remains relatively unknown. Among the other, less informative lineages, the middle-aged AluS families lose methylation in tumor tissue, while the members of the oldest lineage, AluJ, remain methylated at an intermediate level that is constant in all 4 tissue types.
iv. Methylation of ERV Elements
Endogenous retrovirus (ERV) families are a heterogeneous group of sequences with over 60 lineages according to RepBase (Jurka, 1998). There are reports of ERV sequences being involved in extensive chromosomal rearrangement during the last 30 million years of primate evolution (Romano et al., 2006). The methylation pattern of ERV elements was analyzed per lineage. As with the MaLR and Alu elements discussed above, human endogenous retrovirus (HERV) families appear heavily methylated in normal tissues. A gradual loss of methylation is apparent for the HERVH and HERV17 families. To an extent, the methylation levels of HERVE and HERVK also vary among normal, tumor, and non-tumoral adjacent tissues. So far, for the MaLR, Alu, and ERV families of ancient repetitive elements, which predate the mammalian radiation, the microarray DNA methylation analysis indicates that young, primate-specific lineages appear more susceptible to demethylation in disease than older lineages.
v. Methylation of SVA Elements
A similar analysis was performed for SVA elements, which have been extensively mobilized in the human genome after the divergence of hominids from chimpanzees (Xing et al., 2007; Wang et al., 2005; Macfarlane and Simmonds, 2004). SVA elements consist of a combination of sequences derived from other retroelements (Babushok and Kazazian, 2007) and are known to be non-autonomous, depending on LINE-1 elements for mobilization. Wang et al. (2005) have estimated the evolutionary age of different subfamilies of SVA elements, named SVA-A through SVA-F. This analysis reveals that the youngest SVA subfamilies show an unusual relationship between evolutionary age and the level of dysregulation. SVA-F elements, which are human specific and only 3 MY old, are significantly less methylated than the older subfamilies, and their methylation level does not change much among samples, with the exception of sperm, where these elements show loss of methylation. On the other hand, the SVA-A elements, which are the oldest SVA subfamily (16.81 MY), are strongly methylated in normal oral tissues, but their loss of methylation is strikingly variable among samples, with tumors showing the greatest level of variation. Thus, the magnitude and trends of DNA methylation changes for the youngest SVA elements seem to diverge from the patterns observed for AluY, MaLR, and ERV elements. The dramatic DNA methylation dysregulation affecting most SVA subfamilies in non-tumoral adjacent tissue is particularly striking.
vi. Methylation of LINE-1 Elements
Lineages of the LINE-1 family were investigated, limited to categories that could be probed in at least 100 unique genomic loci. Comparing the values across the four classes of experiments, it is apparent that younger, primate-specific classes of LINE-1 elements (LINE-1PA3 (L1PA3), LINE-1PA4 (L1PA4), and LINE-1PA5 (L1PA5), none of which exist in the baboon or marmoset) are more strongly methylated in normal tissue and suffer more dramatic losses of DNA methylation in tumors and sperm. However, similarly to the observations for SVA subfamilies, the newest LINE-1 families that are strictly human specific (L1PA2, L1HS, full-length active LINE-1 (Penzkofer et al., 2005)) are not as highly methylated in normal tissue, and not as dramatically demethylated in tumorigenesis, as the longer-established primate-specific LINE-1PA subfamilies.
It is relevant to explore potential correlations or anti-correlations among the DNA methylation metrics within individual experiments. With this question in mind,
vii. Analysis of Relative CpG Content Among Different Repetitive Element Subfamilies
The formal possibility that some of the differences in DNA methylation levels could be influenced by the CpG content of the DNA sequences being probed was explored. This analysis involves analyzing the count of CpG residues in the repeats and the immediately surrounding sequences, as shown for a single repetitive element family in
viii. Properties of Probes Capable of Best Distinguishing Non-Tumor Adjacent Tissue from Tumor Tissue
The foregoing analysis does not help to identify events that could be tumor-specific. To address this issue, each individual probe associated with a repetitive element was ranked on the basis of its ability to differentiate tumors from non-tumoral adjacent tissue using a Wilcoxon test. A statistical analysis involving those probes that displayed altered methylation was performed by calculating the probe values (ratios) in tumor samples, and the likelihood of random methylation changes as a function of the total number of probes belonging to any one family of repeats. The probes were ranked based on the P-values generated by a hypergeometric t-test, as shown in Table 9. The entries with the most significant P-values include members of the LINE-1P, AluY, LTR, and SVA families of interspersed repeats. Among the primate-specific L1 elements, the L1PA3, L1PA2, and L1PA4 are among the most highly enriched. Among the LTR elements, the LTR7, LTR33, and HERV elements are high on the list. AluY represents the youngest family of Alu elements, and they rank much higher than older Alu elements. The HERV and SVA elements are among the few retrotransposon families known to have been extensively mobilized in the human genome after the divergence of hominids from chimpanzees (Xing et al., 2007; Wang et al., 2005; Macfarlane and Simmonds, 2004).
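The enrichment of a repeat family among the most significant probes can be illustrated with a hypergeometric upper-tail probability, sketched below. This is an assumption-laden illustration rather than the analysis that produced Table 9: the text does not give the exact formulation of its test, and the function names and toy counts here are hypothetical.

```python
from math import comb

def hypergeom_upper_tail(k, K, n, N):
    """P(X >= k) for a hypergeometric draw.

    N: total number of probes considered
    K: probes belonging to the repeat family of interest
    n: probes selected as significant (e.g. by the Wilcoxon ranking)
    k: significant probes that belong to the family
    """
    upper = min(K, n)
    favorable = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return favorable / comb(N, n)

def relative_enrichment(k, K, n, N):
    """Family share among significant probes divided by its share among all probes."""
    return (k / n) / (K / N)

# Hypothetical toy counts: 10 probes in total, 4 in the family,
# 3 probes called significant, 2 of which belong to the family.
p_value = hypergeom_upper_tail(2, 4, 3, 10)
enrichment = relative_enrichment(2, 4, 3, 10)
```

A small P-value together with an enrichment above 1 indicates that the family is over-represented among the significant probes relative to chance.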
The data in Tables 11A and 11B summarize salient properties of the subset of LINE-1 elements that were identified, using the Wilcoxon test, as the best DNA methylation probe variables for distinguishing tumors from non-paired non-tumoral adjacent tissue. In Table 11A, the column corresponding to relative enrichment of a set of elements shows that the highest value (4.757) corresponds to a subset of the L1PA4 subfamily. Members of the L1PA3 subfamily are also highly enriched among the most significant probes. The column specifying the median length of the elements shows that for L1PA5 and L1PA6 there is a noticeable increase in the length of the elements corresponding to the most significant probes (almost a 2-fold increase relative to all probed elements, in the case of L1PA6). A longer length could be associated with a higher likelihood of having an intact L1 promoter, as well as a higher probability of generating a full-length LINE-1 RNA transcriptional product. The table also shows enrichment of probes mapping to full-length L1 elements (FLI-L1) and ORF2-competent L1 elements (ORF2-L1, Jurka, 1998; Penzkofer, et al., 2005). L1PA4 elements, which are the most highly enriched among the significant probes, are unlikely to code for functional ORF2 proteins, and thus unlikely to generate reverse transcriptase. This observation indicates that possible positive selection in tumors for long L1 elements among the most significant probes is not operating at the level of conservation of ORF2 protein-coding function.
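Table 11A (surviving cell values only; the row and column labels were not recoverable): 6,029; 6,028; 6,026; 6,026; 6,025; 6,026; 6,024; 6,023; 6,031; 4.598; 2.19E−83; 6,031; 6,130; 4.757; 2.80E−46; 6,132; 5,659; 6,115.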
An additional level of analysis, shown in Table 11B, involved measurement of the level of homology of sequences near the 5′ end of each L1 sequence with an exemplar sequence represented by the first 700 bases of an active LINE-1 element of the class FLI-L1, which contains an active promoter. Using BLAT (Kent, 2002), those 5′-end sequences of different subclasses of L1PA elements scoring with a homology of 80% or better were selected. The table shows that among the L1PA elements present in the subset of the 15,587 most significant probes, there is a much higher percentage of sequences with good homology matches to the active L1 exemplar. This indicates that possible selection in tumors for demethylated L1 elements could involve specific features of the sequence at the 5′ end of the element, which harbors potential forward promoter as well as antisense promoter activity (discussed in section 3.9, below). If a potentially active promoter exists for any given class of elements (e.g., L1PA5), it may be more likely to be associated with a full-length L1PA5 element. Along this line of thought, the apparent length selection could be a by-product of functional promoter selection.
ix. Functional Significance of the Enrichment of Demethylated LINE-1 Elements in Tumors
The simplest interpretation of the age-stratified dysregulation of DNA methylation of repetitive DNA observed among normal tissue, non-tumoral adjacent tissue, and tumors is that the younger members of repetitive DNA families are the most likely to be transcribed, and that in normal cells these RNA transcripts are best able to trigger RNA-directed chromatin silencing. Silencing efficiency would be additionally enhanced in the younger elements due to reduced sequence divergence, as recently proposed by Reiss and Mager (2007). Paradoxically, for the very youngest members of retrotransposon families, exemplified in the data set by SVA-F and full-length, active LINE-1s, the emergence of optimal silencing may still remain incomplete, for lack of sufficient evolutionary time for RNA-mediated silencing traits to be selected and fixed. Such a hypothesis could explain why the very youngest, human-specific retrotransposon families are relatively under-methylated in normal tissue, as compared to their relatively older and more “mature” primate siblings. It has been reported that heterochromatic piRNA loci interact with potentially active transposons in Drosophila, resulting in transposon control (Brennecke et al., 2007). Normal transcriptional events involving retrotransposon sequences occur in human oocytes (Georgiu et al, 2009) and are well documented in the murine germ-line, where DNA is transiently demethylated, and where piRNAs have been implicated in reestablishing silencing (Aravin et al., 2007, 2008, Kuramochi-Miyagawa, 2008). Unfortunately, the general understanding of the evolutionary history of piRNAs remains extremely limited, particularly with regard to the mechanism responsible for generation of new functional piRNA sequences as novel subclasses of retrotransposons enter the genome.
The most recently evolved repetitive elements may have accumulated fewer mutations or truncations deleterious to their function, and their selective loss of epigenetic silencing could be associated with functions that increase the fitness of tumors and are therefore subject to positive selection. An example of such a function would be the transcriptional activation of genes with oncogenic potential as a result of loss of methylation of cryptic promoter or enhancer sequences within a full-length retrotransposon. For example, Roman-Gomez et al (2005) reported that L1 hypomethylation led to activation of c-MET gene transcription driven by an L1 antisense promoter (Speek, 2001, Nigumann et al, 2002) located within intron one of the c-MET gene in patients with blast crisis chronic myeloid leukemia (BC-CML), where these transcriptional events may contribute to disease progression. More recently, Lin et al. (2006) reported the induction of an abnormal chimeric transcript in esophageal adenocarcinomas, initiated from the antisense promoter located in the 5′-UTR of a full-length LINE-1 element. Another function that could be subject to positive selection in tumor cell lineages is the transcriptional activation of a retrotransposon ORF coding for a reverse transcriptase. It has been reported that the reverse transcriptase inhibitor efavirenz antagonizes the growth of H69 human small-cell lung carcinomas in nude mice (Sinibaldi-Vallebona et al., 2005). The same group has recently reported that inhibition of the reverse transcriptase messenger RNA of LINE-1 elements or HERV-K elements leads to loss of tumorigenic potential in cell lines (Oriccio et al., 2007). Of course, an important caveat is that the reported occurrence in cancer cells of transcripts or proteins derived from retrotransposons could be merely coincidental, not causal.
An interesting functional hypothesis regarding L1 retrotransposon sequences is the possible unselfish participation of expressed and reverse-transcribed LINE-1 elements in nonstandard DNA double strand break repair in the context of oncogenesis, where normal repair mechanisms are disrupted (Helleday et al., 2007). Repair of double-strand breaks by gene conversion involving different endogenous LINE-1 elements has been reported in the mouse (Tremblay et al., 2000). DNA repair by endonuclease-independent LINE-1 retrotransposition was first reported by Morrish et al. (2002, see commentary by Eickbush, 2002) using a model reporter vector transfected into CHO cells. This pathway was found to be dependent on reverse transcriptase activity, and resulted in integration of a truncated LINE-1 sequence lacking target site duplications. Recently Sen et al. (2007) characterized sites in the human genome where L1 elements have integrated without signs of endonuclease-related activity, and found that the structural features of these loci suggested that they arose by double-strand break repair, resulting in translocations or deletions. Also relevant are the findings of Srikanta et al (2009), who scanned the human, chimpanzee, and rhesus macaque genomes, and reported 23 instances of Alu integration events most likely mediated by endonuclease-independent DNA repair (EIDR). Observations of truncated LINE-1 insertions in the context of physiological stress have been reported in two mouse models, lambda-MYC lymphomas and endogenous oxidative stress caused by deficient G6PD expression. In these two models (Rockwood et al, 2004), the LINE-1 insertions, plausibly generated by the EIDR mechanism, have been captured within a chromosomally integrated lac-Z reporter vector. The observed insertions represent predominantly incomplete elements, and their frequency (25% of all events) is higher than the frequency of LINE-1 sequences in the mouse genome (10%).
EIDR involving LINE-1 and Alu elements could be ubiquitous in human cancer cells, and can have adaptive value, enhancing the viability of DNA repair-deficient tumor cells. The rapid rate of progress in high-throughput, low cost DNA sequencing will make it possible to sequence a large number of human tumor genomes to elucidate the sequences found at sites of genomic rearrangements, insertions, and deletions (CGP, 2009). Emerging genome analysis tools will also facilitate the design of experiments to assess the potential adaptive value of EIDR mediated by retroelements.
4. Conclusions
A novel microarray method for analysis of DNA methylation, based on the use of methylation-sensitive as well as methylation-dependent endonucleases, enables the interrogation of methylation levels in all compartments of the genome, including repetitive elements. Analysis of a substantial set of samples of squamous carcinomas of the head and neck, as well as non-tumor adjacent tissue and normal controls, reveals a complex framework of epigenetic dysregulation, where loss of methylation differentially affects distinct families of repetitive elements. Predominantly the younger, primate-specific members of retroelement families suffer the most dramatic loss of methylation, with the exception of some extremely young, human-specific retroelements. These complex patterns of differential susceptibility to disruption of silencing are probably a result of the natural history of evolutionary domestication of retroelements in genomes, in interplay with a minimal time requirement for strong silencing to be established. Primate-specific subfamilies of LINE-1 elements appear to suffer a particularly pronounced loss of methylation in tumors, with the most dramatic changes apparently observed for those primate retroelements with conserved promoter regions and longer sequences.
A buccal sample can be obtained from the cheek of a subject using the “Buccal DNA Sample Collection Kit” (Bode Technologies). The DNA can be processed with two sets of different restriction endonucleases (methylation sensitive, or methylation dependent), and then amplified with phi29 DNA polymerase as described (Szpakowski et al, 2009).
The sample can be applied to a Nimblegen DNA microarray containing a set of DNA oligonucleotide probes, each 50 bases long, representing a genomic sampling of 25 different repetitive element families. Optionally, the probes can be 60, 70, 80, or 90 bases long. On average, each repetitive element family comprises from 30 to several thousand unique probe sequences, designed to be complementary to different specific loci in the genome. Each probe is replicated 4 times to allow for the calculation of the standard deviation of each probe measurement. Thus, with 120 unique probes per family, the total number of probes in a microarray sector is 25×120×4=12,000. The microarray contains 24 sectors, permitting the analysis of 24 buccal samples at once. The total number of probes in the chip is 24×12,000=288,000.
Probe sets: The probe list can be specified by 25 families, chosen from a master set of 138 repetitive element families (Table 1), which are known to yield good classification results. The coordinates of all probes in all 138 families are listed in Table 15.
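The layout arithmetic and the per-probe replicate statistics can be checked with a short sketch. The constant `PROBES_PER_FAMILY = 120` is inferred from the 25×120×4 product given in the text, and the helper `summarize_replicates` and its example signals are hypothetical.

```python
from statistics import mean, stdev

FAMILIES = 25
PROBES_PER_FAMILY = 120  # assumption: inferred from the 25x120x4 product in the text
REPLICATES = 4
SECTORS = 24

probes_per_sector = FAMILIES * PROBES_PER_FAMILY * REPLICATES
probes_per_chip = SECTORS * probes_per_sector

def summarize_replicates(signals):
    """Return (mean, sample standard deviation) of one probe's replicate signals."""
    return mean(signals), stdev(signals)

# Hypothetical signals for the 4 replicates of a single probe.
probe_mean, probe_sd = summarize_replicates([1.0, 2.0, 3.0, 4.0])
```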
The microarray can be subject to a hybridization protocol, and the microarray signals can be processed using bioinformatics protocols as described by Szpakowski et al., 2009.
A Random Forest binary tree classifier can be used to process the data (Strobl et al., 2009), yielding a classification result. The classifier assigns the sample to one of the three following categories: Normal, Tumor, Non-tumor tissue-at-risk.
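To make the classification step concrete, the following is a toy, pure-Python random forest (bootstrap resampling, a random feature subset at each split, majority vote). It is only a sketch of the general technique and is not the implementation cited above. The function names, the tree depth limit, the number of trees, and the two-feature toy data set are all hypothetical.

```python
import random
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_split(rows, labels, features):
    """Return the (feature, threshold) with the lowest weighted Gini, or None."""
    best, best_score = None, gini(labels)
    for f in features:
        values = sorted({r[f] for r in rows})
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2  # midpoint between neighboring observed values
            left = [y for r, y in zip(rows, labels) if r[f] <= t]
            right = [y for r, y in zip(rows, labels) if r[f] > t]
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
            if score < best_score:
                best, best_score = (f, t), score
    return best

def grow_tree(rows, labels, rng, n_try, depth=0, max_depth=5):
    """Grow one tree; a leaf is a label, a node is (feature, threshold, left, right)."""
    majority = Counter(labels).most_common(1)[0][0]
    if depth >= max_depth or len(set(labels)) == 1:
        return majority
    split = best_split(rows, labels, rng.sample(range(len(rows[0])), n_try))
    if split is None:
        return majority
    f, t = split
    lo = [i for i, r in enumerate(rows) if r[f] <= t]
    hi = [i for i, r in enumerate(rows) if r[f] > t]
    return (f, t,
            grow_tree([rows[i] for i in lo], [labels[i] for i in lo], rng, n_try, depth + 1, max_depth),
            grow_tree([rows[i] for i in hi], [labels[i] for i in hi], rng, n_try, depth + 1, max_depth))

def predict_tree(tree, row):
    """Walk from the root to a leaf and return the leaf label."""
    while isinstance(tree, tuple):
        f, t, left, right = tree
        tree = left if row[f] <= t else right
    return tree

def train_forest(rows, labels, n_trees=25, seed=0):
    """Train n_trees trees, each on a bootstrap resample of the data."""
    rng = random.Random(seed)
    n_try = max(1, int(len(rows[0]) ** 0.5))  # features tried per split
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(rows)) for _ in rows]
        forest.append(grow_tree([rows[i] for i in idx], [labels[i] for i in idx], rng, n_try))
    return forest

def predict(forest, row):
    """Majority vote over all trees."""
    return Counter(predict_tree(t, row) for t in forest).most_common(1)[0][0]

# Hypothetical per-family average methylation values (two families per sample).
rows = [(0.90, 0.85), (0.88, 0.92), (0.95, 0.87), (0.86, 0.90),  # Normal
        (0.20, 0.25), (0.15, 0.22), (0.25, 0.18), (0.22, 0.16),  # Tumor
        (0.55, 0.60), (0.50, 0.52), (0.58, 0.56), (0.60, 0.51)]  # Non-tumor tissue-at-risk
labels = ["Normal"] * 4 + ["Tumor"] * 4 + ["Non-tumor tissue-at-risk"] * 4
forest = train_forest(rows, labels)
```

In practice each sample would be described by one averaged methylation value per repetitive element family, and a maintained library implementation would normally be preferred over a hand-written one.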
The list of top 138 Classifier Categories (repetitive element families) in order of rank is as follows: LTR54B, MER67D, MER11B, MER6, ERVL, U1, MER34B, MER66C, HUERS-P3, LTR56, MLT1G1, THE1B-int, HERV9, MER4D, LTR14C, MLT2D, HERVFH21, THE1B, LTR6B, MLT1A1, LTR46, centr, Charlie5, MLT1D-int, MLT2B3, MER50B, HERVK11, MER70A, Charlie3, PABL_B, MER50, MSR1, AluYa5/8, LTR2, LTR10B, MLT1A, HERVK22, HERVL, GSAT, LTR33A, LTR10B1, MSTB-int, Cheshire, LTR17, LTR51, MSTA, MER11A, MER51B, MLT2B2, SVA, SVA_A, SVA_B, SVA_C, SVA_D, SVA_E, SVA_F, L1PA12, MSTC, ERVL-B4, LTR9B, HERVK14, LTR14B, HUERS-P2, LTR29, LTR6A, MSTB1, ALR/Alpha, MSTD, LTR48B, LTR52, LTR8, MER105, LTR8A, MER67A, HUERS-P1, MER92B, LTR22, LTR7B, L1PB1, MER51A, L1PA15-16, LTR36, LTR28, PABL_A, LTR45B, MER4D1, AcHobo, LTR7Y, HERVL18, LTR48, LTR30, MLT1A0, HERVK9, LTR1B, LTR45C, MSTB, LTR47A, MER11D, LTR19A, THE1C, LTR66, MLT1E2, MER115, SST1, MER34B-int, LTR65, MER34C, MER44D, MER57A-int, MLT2B1, L1PA10, MER4A1, MER6A, MLT1E, MER41E, MLT2B4, 7SK, HERVP71A, L1MA7, L1PBa1, LTR5, MER44C, GSATII, THE1D, L1MA1, LTR7, LTR9, MER63A, MER91C, LTR5A, Harlequin, L1PB4, MLT1F1, L1M3f, MLT1F, MLT2A2, LTR14, MER11C.
A buccal sample can be obtained from the cheek of a subject using the “Buccal DNA Sample Collection Kit” (Bode Technologies, Inc.). The DNA can be processed with sodium bisulfite using the Zymo EZ DNA Methylation-Gold kit (Zymo Research, Inc.).
The bisulfite-modified sample can be divided into 12 aliquots, and each aliquot can be amplified by PCR using one of 12 specific primer pairs. For each primer pair, one primer can be anchored on a repeat family, chosen from among 138 informative families (see list in Example 3). The primer can be designed by obtaining the set of DNA sequences comprising the repeat family, and aligning the sequences with the program ClustalW (available at the website ch.embnet.org/software/ClustalW.html). The second primer can be anchored on an AluY repeat consensus sequence specific for AluY elements. The AluY consensus can be obtained by aligning a limited set of 150 randomly chosen AluY sequences with the program ClustalW.
The amplified DNA can be analyzed using a method capable of indirectly reporting the predicted level of methylated cytosines present at CpG dinucleotide positions prior to bisulfite treatment, which converts cytosine to uracil but does not convert methylcytosine. A preferred method, due to its low cost, is electrochemical detection (ECD, Nakahara et al., 1992) of cytosine and thymidine. The ratio of cytosine to thymidine can be converted to a relative DNA methylation level. An alternative method that can be used to obtain the ratio of cytosine to thymidine is Nanopore DNA sequencing (Clarke et al, 2009).
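One simple way to convert the cytosine and thymidine signals into a relative methylation level is sketched below. It assumes that the two signals are measured on comparable scales and that the value is used only as a relative index, since thymidine present in the original sequence also contributes to the thymidine signal. The function names are hypothetical.

```python
def methylation_level(cytosine_signal, thymidine_signal):
    """Relative methylation index C / (C + T) from two non-negative signals."""
    total = cytosine_signal + thymidine_signal
    if total <= 0:
        raise ValueError("no cytosine or thymidine signal")
    return cytosine_signal / total

def methylation_from_ratio(c_to_t_ratio):
    """Same index from a C:T ratio r, since C / (C + T) = r / (1 + r)."""
    return c_to_t_ratio / (1.0 + c_to_t_ratio)

# Hypothetical signals: 30 units of cytosine and 10 units of thymidine.
level = methylation_level(30.0, 10.0)
```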
A Random Forest binary tree classifier can be used to process the data (Strobl et al., 2009), yielding a classification result. The classifier assigns the sample to one of the three following categories: Normal, Tumor, Non-tumor tissue-at-risk.
A buccal sample can be obtained from the cheek of a subject using the “Buccal DNA Sample Collection Kit” (Bode Technologies, Inc.). The DNA can be sheared by nebulization. It can then be immobilized on an antibody column, using an antibody capable of specifically binding 5-methylcytosine. As alternatives to a methyl-binding antibody, either the MBD1 or the MECP2 methyl-binding protein can be used to immobilize the methylated DNA. This step (Sorensen & Collas, 2009) removes methylated DNA from solution, leaving an unmethylated DNA fraction. The immobilized, methylated DNA can then be recovered from the methyl-binding column.
After 5-methyl-C-content separation, the methylated and the unmethylated DNA samples can each be divided into 12 aliquots, and each aliquot is amplified by quantitative PCR (as indicated in the next paragraph) using one of 12 specific primer pairs. For each primer pair, one primer can be anchored on a repeat family, chosen from among 138 informative families (Table 1). The primer can be designed by obtaining the set of DNA sequences comprising the repeat family, and aligning the sequences with the program ClustalW (available at the website ch.embnet.org/software/ClustalW.html). The second primer can be anchored on an AluY repeat consensus sequence specific for AluY elements. The AluY consensus can be obtained by aligning a limited set of 150 randomly chosen AluY sequences with the program ClustalW.
The amount of methylated and unmethylated DNA is determined using nanoliter-microarray quantitative PCR (Morrison et al., 2006; Dixon et al., 2009). This analytical format contains 3072 individual PCR reaction features, and enables the analysis of samples from 64 individuals, in quadruplicate, using specific primer pairs that measure the levels of 12 different repetitive element families.
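The quantitative PCR readout can be converted into a methylated fraction as sketched here. The formula assumes that the amount of template is proportional to efficiency^(−Ct), with a per-cycle amplification efficiency of 2 (perfect doubling) and equal handling of the methylated and unmethylated fractions. These assumptions, the function name, and the example Ct values are not taken from the source.

```python
def methylated_fraction(ct_methylated, ct_unmethylated, efficiency=2.0):
    """Fraction M / (M + U), assuming template amount is proportional to efficiency ** (-Ct)."""
    delta = ct_methylated - ct_unmethylated
    return 1.0 / (1.0 + efficiency ** delta)

# Layout check for the format described in the text:
# 64 individuals x 4 replicates x 12 repetitive element families.
REACTIONS = 64 * 4 * 12

# Hypothetical Ct values: the methylated fraction crosses threshold two cycles earlier.
fraction = methylated_fraction(24.0, 26.0)
```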
A Random Forest binary tree classifier is used to process the data (Strobl et al., 2009), yielding a classification result. The classifier assigns the sample to one of the three following categories: Normal, Tumor, Non-tumor tissue-at-risk.
A buccal sample can be obtained from the cheek of a subject using the “Buccal DNA Sample Collection Kit” (Bode Technologies, Inc.). The DNA from target repetitive element loci can be captured (Gnirke et al, 2009) using several long oligonucleotides (with a few degenerate base positions) specific for a consensus DNA sequence of each of 20 different repetitive element families. The degenerate positions enable binding of repetitive DNA at positions where the consensus sequence is imperfect. In this example, the 20 families are: LTR54B, MER11B, MER34B, LTR56, THE1B, HERV9, LTR14C, HERVFH21, LTR6B, LTR46, MLT1D, MER67D, HERVK11, LTR10B, HERVK22, MER6, MER66C, MLT1G1, MER4D, and MLT2D. These 20 families are chosen from a master set of 138 repetitive element families (Table 1), which are known to yield good classification results. The coordinates of all probes in all 138 families are listed in Table 15.
The captured material can be released from the capture oligonucleotides, and the released DNA can be sequenced using the Pacific Biosciences SMRT system (Flusberg et al., 2010), which is capable of distinguishing cytosine from methylcytosine. The amount of DNA methylation can be calculated using the sequence data.
A Random Forest binary tree classifier can be used to process the data (Strobl et al., 2009), yielding a classification result. The classifier assigns the sample to one of the three following categories: Normal, Tumor, Non-tumor tissue-at-risk.
Because the Pacific Biosciences single-molecule real-time (SMRT) system is capable of producing long sequence reads, the data generated in this example will contain information about single-nucleotide polymorphisms (SNPs) present in the captured DNA loci. The base present at each SNP position in the sequenced locus can differ among the individuals being tested by this method. Thus, in this example data can be generated that specifies the Genetic State for some of the status biomarkers.
A buccal sample can be obtained from the cheek of a subject using the “Buccal DNA Sample Collection Kit” (Bode Technologies, Inc.). The DNA from target repetitive element loci can be captured (Gnirke et al, 2009) using several long oligonucleotides (with a few degenerate base positions) specific for a consensus DNA sequence of each of 20 different repetitive element families. The degenerate positions enable binding of repetitive DNA at positions where the consensus sequence is imperfect. In this example, the 20 families are: LTR54B, MER11B, MER34B, LTR56, THE1B, HERV9, LTR14C, HERVFH21, LTR6B, LTR46, MLT1D, MER67D, HERVK11, LTR10B, HERVK22, MER6, MER66C, MLT1G1, MER4D, and MLT2D. These 20 families are chosen from a master set of 138 repetitive element families (Table 1), which are known to yield good classification results. The coordinates of all probes in all 138 families are listed in Table 15.
The captured material can be released, and then re-captured (Gnirke et al, 2009), using a second set of several capture oligonucleotides specific for a consensus sequence for AluY and another set of consensus sequences for AluSx, AluSp, AluSg, and AluSc repetitive elements. This can result in binding of DNA containing one repetitive element from the first set of 20, as well as a neighboring AluY, AluSx, AluSp, AluSg, or AluSc element.
The twice-captured material can be released from the capture oligonucleotides, and the released DNA can be sequenced using the Pacific Biosciences SMRT system (Flusberg et al., 2010), which is capable of distinguishing cytosine from methylcytosine. The amount of DNA methylation can be calculated using the sequence data.
A Random Forest binary tree classifier can be used to process the data (Strobl et al., 2009), yielding a classification result. The classifier assigns the sample to one of the three following categories: Normal, Tumor, Non-tumor tissue-at-risk.
Due to the fact that the Pacific Biosciences single-molecule real-time SMRT system is capable of producing of long sequence reads, the data generated in this example will contain information about single-nucleotide polymorphisms (SNPs) present in the captured DNA loci. The base present at each SNP position in the sequenced locus will be different in different individuals being tested by this method. Thus, in this example data can be generated that specifies the Genetic State for some of the status biomarkers.
A set of DNA methylation biomarkers that are informative regarding the stability of the genome and the epigenome in tissues is disclosed. The biomarkers were discovered through statistical analysis of a data set generated by microarrays that sampled the entire human genome, and included probes for gene promoters, non-gene-non-repetitive probes, and repetitive element probes.
The original set of microarray data comprised a list of 139,379 variables including gene probes, unique probes and repetitive element probes. In order to improve the robustness of the DNA methylation metrics, a strategy was developed whereby the probes belonging to the set of “repetitive elements” were subdivided into a total of 901 categories, based on their membership in specific sub-families of repetitive elements. For example, the 49 probes in the microarray mapping to a MER67D repetitive element (MER67D is a member of the LTR class of repetitive elements) were placed in one of the 901 categories, and the DNA methylation values of the 49 probes for that specific category were averaged. Repetitive element categories represented by fewer than 30 probes were not included in the set of 901 categories. The average methylation value of each of the 901 categories was used to perform a 3-way classification of normal tissue vs. tumor tissue vs. nontumor margin tissue.
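The category-averaging step can be sketched as follows. This is a minimal illustration, assuming each probe has one methylation value per sample and a known sub-family label; the probe identifiers, the methylation values, the function name, and the small category "RARE_FAMILY" are hypothetical. Only the 30-probe minimum and the 49-probe MER67D example come from the text.

```python
from collections import defaultdict
from statistics import mean

MIN_PROBES = 30  # categories with fewer probes than this are excluded

def category_averages(probe_values, probe_category, min_probes=MIN_PROBES):
    """Average probe-level methylation values within each repeat category.

    probe_values: dict mapping probe_id to a methylation value for one sample.
    probe_category: dict mapping probe_id to a repetitive element sub-family.
    Returns a dict mapping category to mean methylation, restricted to
    categories represented by at least min_probes probes.
    """
    grouped = defaultdict(list)
    for probe_id, value in probe_values.items():
        grouped[probe_category[probe_id]].append(value)
    return {
        category: mean(values)
        for category, values in grouped.items()
        if len(values) >= min_probes
    }

# Hypothetical data: 49 MER67D probes and a 5-probe category that is dropped.
probe_values = {}
probe_category = {}
for i in range(49):
    probe_values["mer67d_%d" % i] = 0.8
    probe_category["mer67d_%d" % i] = "MER67D"
for i in range(5):
    probe_values["rare_%d" % i] = 0.3
    probe_category["rare_%d" % i] = "RARE_FAMILY"

averages = category_averages(probe_values, probe_category)
print(averages)
```

Averaging over many probes per category is what makes each category value less sensitive to noise in any single probe.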
Presented herein is a detailed report on a new set of classification experiments, performed with a subset of variables of higher quality. Using several sets of technical replicates of microarray experiments, a subset of the 901 category variables was selected based on a defined threshold value of the standard deviation of the calculated values for each category, in 3 sets of 3 technical replicates. The subset of variables for which the coefficient of variation was no larger than 15% was then selected. This variance quality filter reduced the number of variables to 569 categories. Table 18 is a list of these variables (repetitive DNA sequence families; status biomarkers).
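The variance quality filter can be sketched as below. The text does not state how the three replicate sets were combined or which standard deviation estimator was used, so this sketch assumes the sample standard deviation and requires the coefficient of variation (standard deviation divided by the mean) to be at most 15% in every replicate set. The category names, values, and function names are hypothetical.

```python
from statistics import mean, stdev

CV_THRESHOLD = 0.15  # 15%, as stated in the text

def coefficient_of_variation(values):
    """Sample standard deviation divided by the absolute mean."""
    m = mean(values)
    if m == 0:
        return float("inf")  # undefined CV; treat as failing the filter
    return stdev(values) / abs(m)

def passes_cv_filter(replicate_sets, threshold=CV_THRESHOLD):
    """Assumption: a category passes only if every replicate set passes."""
    return all(coefficient_of_variation(s) <= threshold for s in replicate_sets)

def filter_categories(data, threshold=CV_THRESHOLD):
    """data: dict mapping category to a list of replicate sets of values."""
    return [c for c, sets in data.items() if passes_cv_filter(sets, threshold)]

# Hypothetical category averages in 3 sets of 3 technical replicates.
data = {
    "MER67D": [[0.80, 0.82, 0.78], [0.75, 0.77, 0.76], [0.81, 0.79, 0.80]],
    "NOISY_FAMILY": [[0.2, 0.5, 0.8], [0.40, 0.41, 0.39], [0.3, 0.3, 0.3]],
}
kept = filter_categories(data)
print(kept)
```

Here the second category is rejected because its first replicate set has a coefficient of variation of 60%, even though its other two sets are tight.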
A classification experiment was performed using a Random Forest (RF) binary decision tree algorithm (Breiman, 2001) with the 569 repeat category variables. The misclassification error rate in this analysis was 13%. A list of the top 75 classifier variables, which comprise categories of repeats, was generated according to the results of the RF classifier.
A classification experiment was then performed using a Support Vector Machine (SVM; Vapnik, 1998; Guyon et al., 2002) classifier run using the 569 variables. A list of the top 75 classifier variables, which comprise categories of repeats, was generated according to the results of the SVM analysis. The performance of the SVM classifier was then tested using top variables only, and the best performance (100% accuracy) was found using either the top 18 or the top 19 variables.
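The procedure of testing a classifier on only the top-ranked variables can be sketched as a sweep over the number of variables retained. To keep the example self-contained, a nearest-centroid classifier is swapped in for the SVM, and accuracy is estimated by leave-one-out testing; both choices are assumptions made for illustration and are not stated in the text. The data are toy values rather than methylation measurements, and all function names are hypothetical.

```python
def nearest_centroid_predict(train_X, train_y, x):
    """Stand-in classifier (not an SVM): pick the class with the closest mean."""
    sums, counts = {}, {}
    for row, label in zip(train_X, train_y):
        acc = sums.setdefault(label, [0.0] * len(row))
        for i, value in enumerate(row):
            acc[i] += value
        counts[label] = counts.get(label, 0) + 1

    def squared_distance(label):
        return sum((s / counts[label] - v) ** 2 for s, v in zip(sums[label], x))

    return min(sums, key=squared_distance)

def loo_accuracy(X, y, columns):
    """Leave-one-out accuracy using only the given feature columns."""
    correct = 0
    for i in range(len(X)):
        train_X = [[row[c] for c in columns] for j, row in enumerate(X) if j != i]
        train_y = [label for j, label in enumerate(y) if j != i]
        held_out = [X[i][c] for c in columns]
        if nearest_centroid_predict(train_X, train_y, held_out) == y[i]:
            correct += 1
    return correct / len(X)

def sweep_top_k(X, y, ranked_columns, max_k):
    """Score the top-k ranked variables for k = 1..max_k and report the best k."""
    scores = {k: loo_accuracy(X, y, ranked_columns[:k]) for k in range(1, max_k + 1)}
    best = max(scores.values())
    return scores, [k for k, s in scores.items() if s == best]

# Toy data: columns 0 and 1 separate the classes, column 2 is large noise.
X = [[0.9, 0.8, 0.0], [0.8, 0.9, 5.0], [0.2, 0.1, 5.0], [0.1, 0.2, 0.0]]
y = ["Normal", "Normal", "Tumor", "Tumor"]
scores, best_k = sweep_top_k(X, y, ranked_columns=[0, 1, 2], max_k=3)
print(scores, best_k)
```

In this toy case the accuracy is highest for the top one or two variables and drops when the uninformative third variable is added, which mirrors the observation that a small set of top-ranked variables can outperform larger sets.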
Finally, the top 50 variables as ranked by the SVM classifier were used in a Random Forest classification run. The misclassification error rate in this RF run was 8.1%. This indicates that the SVM can be more effective than the RF algorithm in surveying all the available 569 variables to find the best classifiers among these variables.
1. Analysis of the Genomic Loci Comprising the Top Classifier Variables
The genomic organization of the repetitive elements that comprise the top variables in the classifiers was examined. It was observed that the genomic loci comprising the best classifiers have a structure characterized by the presence of two or three different repetitive elements, co-existing within a DNA window of approximately 500 to 1000 bases. A common organizational theme is a combination of an element belonging to the LTR family of retrotransposons, and an element belonging to the AluY (Young Alu) or AluSx family of retrotransposons. This information is presented in Table 14.
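The kind of locus structure described above can be searched for with a simple interval comparison. The sketch below assumes repeat annotations are available as (chromosome, start, end, family) tuples and reports LTR/Alu pairs whose combined span fits within a chosen window; the coordinates, the pairing rule, and the function name are hypothetical and are not taken from Table 14.

```python
def cooccurring_loci(ltr_elements, alu_elements, window=1000):
    """Find LTR/Alu pairs that co-exist within a DNA window.

    Each element is a (chrom, start, end, family) tuple. A pair is reported
    when both elements lie on the same chromosome and the distance from the
    leftmost start to the rightmost end is at most `window` bases.
    """
    pairs = []
    for ltr in ltr_elements:
        for alu in alu_elements:
            if ltr[0] != alu[0]:
                continue  # different chromosomes cannot share a window
            span = max(ltr[2], alu[2]) - min(ltr[1], alu[1])
            if span <= window:
                pairs.append((ltr, alu))
    return pairs

# Hypothetical annotations: only the chr1 pair fits within 1000 bases.
ltr_elements = [
    ("chr1", 10000, 10400, "MER67D"),
    ("chr2", 50000, 50450, "THE1B"),
]
alu_elements = [
    ("chr1", 10500, 10800, "AluY"),
    ("chr2", 60000, 60300, "AluSx"),
]
pairs = cooccurring_loci(ltr_elements, alu_elements, window=1000)
print(pairs)
```

The nested loop is adequate for small annotation sets; for genome-scale annotations, sorting the elements by coordinate and scanning once per chromosome would be a more practical design.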
In the majority of cases, the LTR retrotransposon comprising a top classifier variable belongs to a primate-specific family, implying a relatively recent evolutionary origin. A small set of highly performing variables consists of DNA transposons, such as Charlie3_MER1, Charlie5_MER1, and Cheshire MER1, which have a different evolutionary origin. Yet another set of variables comprises repetitive sequences belonging to centromeric DNA, such as mini-satellite repeat 1 (MSR1), Gamma-satellite DNA, and Alpha-ALR-satellite DNA.
The presence of two or even three different repetitive element sequences within a window of 500 to 1000 bases may have biological consequences, for example a tendency of these loci to undergo loss of epigenetic silencing when cells are under stress, such as oxidative stress or cytokine-induced stress. Additionally, it is well known that centromeric sequences, which comprise closely spaced DNA repeats, are subject to loss of methylation under conditions of cellular stress.
This application claims benefit of U.S. Provisional Application No. 61/234,367, filed Aug. 17, 2009. Application No. 61/234,367, filed Aug. 17, 2009, is hereby incorporated herein by reference in its entirety.
This invention was made with government support under grant No. 1R21CA116079 from the National Institutes of Health (NIH). The government has certain rights in the invention.
Filing Document | Filing Date | Country | Kind | 371c Date |
---|---|---|---|---|
PCT/US2010/045788 | 8/17/2010 | WO | 00 | 2/15/2012 |
Number | Date | Country | |
---|---|---|---|
61234367 | Aug 2009 | US |