There have been several recent large-scale efforts to gain insight into both common and rare human genetic variation. Historically, these efforts utilized two principal analytical methods to gather genetic information at large scale: high-density microarrays and whole exome sequencing. More recently, technological advances have allowed for the large-scale sequencing of the whole human genome.
Most studies have generated population-based information on human diversity using low to intermediate coverage of the genome (4× to 20× sequencing depth). The highest coverage (30× or greater) has been reported for the recent sequencing of 1,070 Japanese subjects, 129 trios from the 1000 Genomes Project, and 909 Icelandic subjects. This paradigm shift is reinforced by the recent release of the Illumina HiSeq X Ten, which allows the sequencing of up to 160 genomes at 30× mean depth in 3-day cycles, at an average cost of $1,000 to $2,000 per genome.
These advances create new complications for the health care industry and health professionals. A whole genome sequence from an individual can possess several million nucleotide variations when compared to a reference genome. While it is well appreciated that many different gene and nucleotide variants can have a significant impact on an individual's overall health risk, a significant problem arises when a health care worker is presented with a previously unannotated genetic mutation. This disclosure describes a novel method to determine the impact that any given nucleotide variation has on an individual's overall health risk.
The genomic health risk metrics elaborated herein hold significant advantages for the health care industry. The likelihood that any given genomic sequence variant (GSV) will be deleterious is relatively small. Since every human genome sequenced may yield several million GSVs, the advantage to clinicians of a health risk metric such as a tolerability score, an n-mer score, a context dependent tolerance score, or a protein tolerability score is that it allows them to focus on and prioritize deleterious mutations. Thus, the methods, systems and media of this disclosure solve significant problems that were created by virtue of advances in DNA sequencing and analysis. This disclosure also describes a functional genomic sequencing assay that improves upon, and is more efficient than, previous methods such as whole-genome sequencing and exome sequencing. The functional genomic sequencing assay described herein allows targeted sequencing or analysis of GSVs, increasing the efficiency and reducing the cost of such analysis. This method is superior to other methods such as exome sequencing in that it takes into account GSVs that occur in non-coding regions and thus allows for greater sensitivity and accuracy of nucleic acid analysis.
In certain embodiments, described herein, is a method of identifying a relative genomic health risk of a genomic sequence variant in the DNA sequence of an individual, the method comprising: determining at least one genomic sequence variant in the DNA sequence of the individual; wherein the genomic sequence variant is a difference of at least one nucleotide in the individual when compared to a corresponding position in a reference genome; and comparing the at least one genomic sequence variant of the individual to a tolerability score at a corresponding position within x nucleotides of a genetic element, wherein the tolerability score comprises a function of a nucleotide variation score and an allele proportion score, wherein the nucleotide variation score is the variance observed in a plurality of genomes at the corresponding position, and the allele proportion score is the proportion of genomic variants that exceeds an incidence of 0.0001 in the plurality of genomes at the corresponding position. In certain embodiments, the plurality of genomes is at least 10,000 genomes. In certain embodiments, the plurality of genomes is at least 100,000 genomes. In certain embodiments, the DNA sequence comprises at least 100,000 nucleotides. In certain embodiments, the DNA sequence comprises at least 90% of human haploid genome. In certain embodiments, at least 100 genomic sequence variants are determined in the DNA sequence of the individual. In certain embodiments, the reference genome is generated from at least 10,000 individual genomes. In certain embodiments, the reference genome is generated from at least 100,000 individual genomes. In certain embodiments, the genomic sequence variant is an insertion, a deletion, or a translocation. In certain embodiments, the genomic sequence variant is a point mutation. In certain embodiments, the nucleotide variation score is normalized. 
In certain embodiments, the genetic element is selected from any one or more of a gene promoter, gene enhancer, transcriptional start site, splice donor site, splice acceptor site, polyadenylation site, start codon, stop codon, exon/intron boundary, intron sequence, exon sequence, transcription factor binding site (TFBS), protein domain, non-coding RNA, and a regulatory element. In certain embodiments, the genomic sequence variant is within 500 nucleotides of the genetic element.
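The tolerability score described above can be illustrated with a minimal Python sketch. The text states only that the score is "a function of" the nucleotide variation score and the allele proportion score; the product used below, the interpretation of the variation score as the fraction of aligned sites carrying a variant, and all names (`PositionObservations`, `tolerability_score`, and so on) are assumptions made for illustration, not the method of this disclosure. Only the 0.0001 incidence cutoff is taken from the text.

```python
from dataclasses import dataclass
from typing import Sequence

# Incidence cutoff stated in the text.
COMMON_AF_THRESHOLD = 0.0001


@dataclass
class PositionObservations:
    """Observations pooled at one offset relative to a genetic element (hypothetical container)."""
    n_sites: int                                  # aligned element instances examined at this offset
    variant_allele_frequencies: Sequence[float]   # one allele frequency per site where a variant was seen


def nucleotide_variation_score(obs: PositionObservations) -> float:
    # Assumption: "variance observed" is taken as the fraction of sites carrying any variant.
    if obs.n_sites == 0:
        return 0.0
    return len(obs.variant_allele_frequencies) / obs.n_sites


def allele_proportion_score(obs: PositionObservations) -> float:
    # Proportion of observed variants whose incidence exceeds the 0.0001 cutoff.
    afs = obs.variant_allele_frequencies
    if not afs:
        return 0.0
    return sum(af > COMMON_AF_THRESHOLD for af in afs) / len(afs)


def tolerability_score(obs: PositionObservations) -> float:
    # Assumption: the combining function is a simple product.
    return nucleotide_variation_score(obs) * allele_proportion_score(obs)


example = PositionObservations(n_sites=100,
                               variant_allele_frequencies=[0.00005, 0.001, 0.02, 0.00001])
example_score = tolerability_score(example)
```

Under these assumptions, a position where few sites vary and most variants remain rare receives a low score, which matches the intuition that variation there is poorly tolerated.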
In another embodiment, described herein, is a method of identifying a relative genomic health risk of a genomic sequence variant in the DNA sequence of an individual, the method comprising: determining at least one genomic sequence variant in the DNA sequence of the individual; wherein the genomic sequence variant is a difference of at least one nucleotide in the individual when compared to a corresponding position in a reference genome; and determining an n-variant score for the at least one genomic sequence variant, wherein the n-variant score comprises a function of a count score and an allele frequency score, wherein the count score is the ratio of the number of times any genomic sequence variant occurs in a unique sequence of n-nucleotides in length in the plurality of genomes to the number of times that the unique sequence of n-nucleotides in length occurs in the reference genome, and the allele frequency score is the frequency of the proportion of genomic sequence variants that are fixed in the population, at an allele frequency greater than 0.0001 in the plurality of genomes. In certain embodiments, the unique sequence of n-nucleotides in length is greater than 3 nucleotides. In certain embodiments, the unique sequence of n-nucleotides in length is less than 100 nucleotides. In certain embodiments, the unique sequence of n-nucleotides in length is 7 nucleotides. In certain embodiments, the genomic sequence variant occurs in the center of the unique sequence of n-nucleotides. In certain embodiments, the plurality of genomes is at least 10,000 genomes. In certain embodiments, the plurality of genomes is at least 100,000 genomes. In certain embodiments, the DNA sequence comprises at least 100,000 nucleotides. In certain embodiments, the DNA sequence comprises at least 90% of human haploid genome. In certain embodiments, at least 100 genomic sequence variants are determined in the DNA sequence of the individual. 
In certain embodiments, the reference genome is generated from at least 10,000 individual genomes. In certain embodiments, the reference genome is generated from at least 100,000 individual genomes.
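As a rough illustration of the n-variant score, the sketch below tallies, for each unique sequence of n nucleotides in a reference string, how often it occurs and how often a variant is observed at its center. The function name `n_variant_scores`, the 0-based position convention, the input format (a mapping from position to allele frequency), and the use of a product to combine the count score and the allele frequency score are all assumptions; the text specifies only that the n-variant score is a function of the two.

```python
from collections import Counter
from typing import Dict, Mapping, Tuple


def n_variant_scores(reference: str,
                     variant_af: Mapping[int, float],
                     n: int = 7,
                     af_threshold: float = 0.0001) -> Dict[str, Tuple[float, float, float]]:
    """Return {n-mer: (count_score, allele_frequency_score, combined)}.

    reference  : reference sequence (0-based positions, an assumption of this sketch)
    variant_af : position -> allele frequency of a variant observed in the population
    n          : odd n-mer length, so that the variant sits at the center
    """
    if n % 2 == 0:
        raise ValueError("n must be odd so the variant can occupy the center")
    half = n // 2
    occurrences = Counter()     # times each n-mer occurs in the reference
    variant_counts = Counter()  # times a variant is seen at the center of each n-mer
    common_counts = Counter()   # ...of which the allele frequency exceeds the cutoff

    for center in range(half, len(reference) - half):
        kmer = reference[center - half:center + half + 1]
        occurrences[kmer] += 1
        af = variant_af.get(center)
        if af is not None:
            variant_counts[kmer] += 1
            if af > af_threshold:
                common_counts[kmer] += 1

    scores = {}
    for kmer, total in occurrences.items():
        count_score = variant_counts[kmer] / total
        af_score = (common_counts[kmer] / variant_counts[kmer]
                    if variant_counts[kmer] else 0.0)
        # Assumption: combine the two components as a product.
        scores[kmer] = (count_score, af_score, count_score * af_score)
    return scores


# Toy example with n = 3 for readability; the text names n = 7 as one embodiment.
toy = n_variant_scores("ACGACGACG", {1: 0.01, 4: 0.00001, 2: 0.5}, n=3)
```

In the toy call, the 3-mer `ACG` occurs three times in the reference and carries a central variant twice, only one of which exceeds the 0.0001 cutoff.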
In another embodiment, described herein, is a method of identifying a relative genomic health risk of a genomic sequence variant of an individual, the method comprising: determining at least one genomic sequence variant in a DNA sequence of the individual; wherein the genomic sequence variant is a difference of at least one nucleotide in the individual when compared to a corresponding position in a reference genome; and determining if the at least one genomic sequence variant occurs within a region with a low context dependent tolerance score, wherein the context dependent tolerance score comprises a function of an observed context dependent tolerance score and an expected context dependent tolerance score, wherein the expected context dependent tolerance score is the overall probability to vary of a unique sequence of n-nucleotides in length in a certain region of x nucleotides in length in a plurality of genomes, and the observed context dependent tolerance score is a number of genomic sequence variants in a certain region of x nucleotides in length actually observed and fixed in the plurality of genomes as a function of a length of the region. In certain embodiments, the plurality of genomes is at least 10,000 genomes. In certain embodiments, the plurality of genomes is at least 100,000 genomes. In certain embodiments, the DNA sequence comprises at least 100,000 nucleotides. In certain embodiments, the DNA sequence comprises at least 90% of human haploid genome. In certain embodiments, at least 100 genomic sequence variants are determined in the DNA sequence of the individual. In certain embodiments, the reference genome is generated from at least 10,000 individual genomes. In certain embodiments, the reference genome is generated from at least 100,000 individual genomes. In certain embodiments, the genomic sequence variant is an insertion, a deletion, or a translocation. In certain embodiments, the genomic sequence variant is a point mutation. 
In certain embodiments, the context dependent tolerance score comprises subtracting the expected context dependent tolerance score from the observed context dependent tolerance score.
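The subtraction described above can be sketched as follows. This is a minimal illustration under stated assumptions: the expected score is taken as the mean per-n-mer probability of variation across the region, the observed score as the number of variants in the region divided by the number of scored positions, and the per-n-mer probabilities are supplied by the caller. The function name, argument names, and 0-based coordinates are hypothetical.

```python
from typing import Iterable, Mapping


def context_dependent_tolerance_score(region_seq: str,
                                      variant_positions: Iterable[int],
                                      kmer_variation_prob: Mapping[str, float],
                                      n: int = 7) -> float:
    """Observed minus expected variation for one region (illustrative sketch).

    region_seq          : reference sequence of the region (length x)
    variant_positions   : 0-based offsets within the region where a variant is observed
    kmer_variation_prob : probability that the center of each n-mer varies
    """
    half = n // 2
    centers = range(half, len(region_seq) - half)
    if not centers:
        raise ValueError("region is shorter than n")

    # Expected: average probability to vary, given the n-mer context of each position.
    expected = sum(kmer_variation_prob.get(region_seq[c - half:c + half + 1], 0.0)
                   for c in centers) / len(centers)

    # Observed: variants actually seen, as a function of the region length.
    scored = set(centers)
    observed = sum(1 for p in set(variant_positions) if p in scored) / len(centers)

    return observed - expected


toy_probs = {"ACG": 0.3, "CGA": 0.1, "GAC": 0.2}
toy_cdts = context_dependent_tolerance_score("ACGACGACG", [4], toy_probs, n=3)
```

A negative value, as in the toy call, indicates fewer variants than the sequence context alone would predict, which is how a region with a low context dependent tolerance score would appear in this sketch.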
In another embodiment, described herein, is a method of identifying a relative genomic health risk of a genomic sequence variant of an individual, the method comprising: determining at least one genomic sequence variant in a DNA sequence of the individual; wherein the genomic sequence variant is a difference of at least one nucleotide in the individual when compared to a corresponding position in a reference genome; determining if the at least one genomic sequence variant causes an amino acid variant in an expressed protein, wherein the amino acid variant is a difference of at least one amino acid when compared to a reference genome; and comparing the amino acid variant to a protein tolerability score at a corresponding position within a defined protein class, wherein the protein tolerability score comprises a diversity score, missense score, and a protein allele frequency score, wherein the diversity score is a normalized diversity metric, the missense score is the variance observed in a plurality of genomes at the corresponding position which leads to an amino acid mutation, and the protein allele frequency score is the proportion of genomic variants that leads to an amino acid variant that exceeds an incidence of 0.0001 in the plurality of genomes at the corresponding position. In certain embodiments, the plurality of genomes is at least 10,000 genomes. In certain embodiments, the plurality of genomes is at least 100,000 genomes. In certain embodiments, the DNA sequence comprises at least 100,000 nucleotides. In certain embodiments, the DNA sequence comprises at least 90% of human haploid genome. In certain embodiments, at least 100 genomic sequence variants are determined in the DNA sequence of the individual. In certain embodiments, the reference genome is generated from at least 10,000 individual genomes. In certain embodiments, the reference genome is generated from at least 100,000 individual genomes.
In certain embodiments, the genomic sequence variant is an insertion, a deletion, or a translocation. In certain embodiments, the genomic sequence variant is a point mutation. In certain embodiments, the defined protein class is selected from any one or more of a kinase, a phosphatase, a tyrosine kinase, a serine/threonine kinase, a G protein coupled receptor (GPCR), a nuclear hormone receptor, an acetylase, a chaperone, a protease, a serine protease, and a transcription factor. In certain embodiments, the diversity metric is a Shannon entropy, a Simpson diversity index, or a Wu-Kabat variability coefficient.
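The three diversity metrics named above have standard definitions, which the following Python sketch applies to a single aligned column of amino acids (one residue per sequence at the corresponding position within a protein class). The function names and the choice to normalize Shannon entropy by log2(20) are assumptions of this sketch; the text says only that the diversity metric is normalized. The Simpson index is given here in its 1 minus sum-of-squares form, one of several conventions in use.

```python
import math
from collections import Counter
from typing import Sequence


def _frequencies(column: Sequence[str]):
    if not column:
        raise ValueError("column must contain at least one residue")
    total = len(column)
    return [count / total for count in Counter(column).values()]


def shannon_entropy(column: Sequence[str], normalize: bool = True) -> float:
    # H = -sum(p * log2 p); optionally scaled by log2(20), the maximum for 20 amino acids.
    h = sum(-p * math.log2(p) for p in _frequencies(column))
    return h / math.log2(20) if normalize else h


def simpson_diversity(column: Sequence[str]) -> float:
    # Gini-Simpson form: probability that two residues drawn with replacement differ.
    return 1.0 - sum(p * p for p in _frequencies(column))


def wu_kabat_variability(column: Sequence[str]) -> float:
    # Wu-Kabat: (number of sequences * number of distinct residues) / count of most common residue.
    counts = Counter(column)
    if not counts:
        raise ValueError("column must contain at least one residue")
    return len(column) * len(counts) / max(counts.values())


conserved = "AAAA"   # fully conserved position
split = "AALL"       # two residues, equally frequent
```

For a fully conserved column, the entropy and Simpson index are 0 and the Wu-Kabat coefficient is 1; all three increase as the position becomes more variable.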
In another embodiment, described herein, is a non-transitory computer-readable storage media encoded with a computer program including instructions executable by a processor to create a program to identify a relative genomic health risk of a genomic sequence variant of an individual comprising: a DNA sequence for the individual; a software module to determine at least one genomic sequence variant in the DNA sequence of the individual; wherein the genomic sequence variant is a difference of at least one nucleotide in the individual when compared to a corresponding position in a reference genome; and a software module to compare the at least one genomic sequence variant of the individual to a tolerability score at a corresponding position within x-nucleotides of a genetic element, wherein the tolerability score comprises a function of a nucleotide variation score and an allele proportion score, wherein the nucleotide variation score is the variance observed in a plurality of genomes at the corresponding position, and the allele proportion score is the proportion of genomic variants that exceeds an incidence of 0.0001 in the plurality of genomes at the corresponding position. In certain embodiments, the plurality of genomes is at least 10,000 genomes. In certain embodiments, the plurality of genomes is at least 100,000 genomes. In certain embodiments, the DNA sequence comprises at least 100,000 nucleotides. In certain embodiments, the DNA sequence comprises at least 90% of human haploid genome. In certain embodiments, at least 100 genomic sequence variants are determined in the DNA sequence of the individual. In certain embodiments, the reference genome is generated from at least 10,000 individual genomes. In certain embodiments, the reference genome is generated from at least 100,000 individual genomes. In certain embodiments, the genomic sequence variant is an insertion, a deletion, or a translocation. 
In certain embodiments, the genomic sequence variant is a point mutation. In certain embodiments, the nucleotide variation score is normalized to the size of the genetic element. In certain embodiments, the genetic element is selected from any one or more of a gene promoter, gene enhancer, transcriptional start site, splice donor site, splice acceptor site, polyadenylation site, start codon, stop codon, exon/intron boundary, intron sequence, and an exon sequence. In certain embodiments, the genomic sequence variant is within 50 nucleotides of the genetic element. In certain embodiments, the genomic sequence variant is within 500 nucleotides of the genetic element.
In another embodiment, described herein, is a non-transitory computer-readable storage media encoded with a computer program including instructions executable by a processor to create a program to identify a relative genomic health risk of a genomic sequence variant of an individual comprising: a DNA sequence for the individual; a software module to determine at least one genomic sequence variant in the DNA sequence of the individual; wherein the genomic sequence variant is a difference of at least one nucleotide in the individual when compared to a corresponding position in a reference genome in a unique sequence of n nucleotides in length; and a software module to determine an n-variant score for the at least one genomic sequence variant, wherein the n-variant score comprises a function of a count score and an allele frequency score, wherein the count score is the ratio of the number of times any genomic sequence variant occurs in a unique sequence of n-nucleotides in length in the plurality of genomes to the number of times that the unique sequence of n-nucleotides in length occurs in the reference genome, and the allele frequency score is the frequency of the proportion of genomic sequence variants that are fixed in the population, at an allele frequency greater than 0.0001 in the plurality of genomes. In certain embodiments, the unique sequence of n-nucleotides in length is greater than 4 nucleotides. In certain embodiments, the unique sequence of n-nucleotides in length is less than 100 nucleotides. In certain embodiments, the unique sequence of n-nucleotides in length is 7 nucleotides. In certain embodiments, the genomic sequence variant occurs in the center of the unique sequence of n-nucleotides. In certain embodiments, the plurality of genomes is at least 10,000 genomes. In certain embodiments, the plurality of genomes is at least 100,000 genomes. In certain embodiments, the DNA sequence comprises at least 100,000 nucleotides.
In certain embodiments, the DNA sequence comprises at least 90% of human haploid genome. In certain embodiments, at least 100 genomic sequence variants are determined in the DNA sequence of the individual. In certain embodiments, the reference genome is generated from at least 10,000 individual genomes. In certain embodiments, the reference genome is generated from at least 100,000 individual genomes.
In another embodiment, described herein, is a non-transitory computer-readable storage media encoded with a computer program including instructions executable by a processor to create a program to identify a relative genomic health risk of a genomic sequence variant of an individual comprising: a DNA sequence for the individual; a software module to determine at least one genomic sequence variant in a DNA sequence of the individual; wherein the genomic sequence variant is a difference of at least one nucleotide in the individual when compared to a corresponding position in a reference genome; and a software module to determine if the at least one genomic sequence variant occurs within a region with a low context dependent tolerance score, wherein the context dependent tolerance score comprises a function of an observed context dependent tolerance score and an expected context dependent tolerance score, wherein the expected context dependent tolerance score is the overall probability to vary of a unique sequence of n-nucleotides in length in a certain region of x nucleotides in length actually observed and fixed in a plurality of genomes, and the observed context dependent tolerance score is a number of genomic sequence variants in a certain region of x nucleotides in length actually observed in the plurality of genomes. In certain embodiments, the plurality of genomes is at least 10,000 genomes. In certain embodiments, the plurality of genomes is at least 100,000 genomes. In certain embodiments, the DNA sequence comprises at least 100,000 nucleotides. In certain embodiments, the DNA sequence comprises at least 90% of human haploid genome. In certain embodiments, at least 100 genomic sequence variants are determined in the DNA sequence of the individual. In certain embodiments, the reference genome is generated from at least 10,000 individual genomes. In certain embodiments, the reference genome is generated from at least 100,000 individual genomes. 
In certain embodiments, the genomic sequence variant is an insertion, a deletion, or a translocation. In certain embodiments, the genomic sequence variant is a point mutation. In certain embodiments, the context dependent tolerance score comprises subtracting the expected context dependent tolerance score from the observed context dependent tolerance score.
In another embodiment, described herein, is a non-transitory computer-readable storage media encoded with a computer program including instructions executable by a processor to create a program to identify a relative genomic health risk of a genomic sequence variant of an individual comprising: a DNA sequence for the individual; a software module to determine at least one genomic sequence variant in a DNA sequence of the individual; wherein the genomic sequence variant is a difference of at least one nucleotide in the individual when compared to a corresponding position in a reference genome; a software module to determine if the at least one genomic sequence variant causes an amino acid variant in an expressed protein, wherein the amino acid variant is a difference of at least one amino acid when compared to a reference genome; and a software module to compare the amino acid variant to a protein tolerability score at a corresponding position within a defined protein class, wherein the protein tolerability score comprises a diversity score, missense score, and a protein allele frequency score, wherein the diversity score is a normalized diversity metric, the missense score is the variance observed in a plurality of genomes at the corresponding position which leads to an amino acid mutation, and the protein allele frequency score is the proportion of genomic variants that leads to an amino acid variant that exceeds an incidence of 0.0001 in the plurality of genomes at the corresponding position. In certain embodiments, the plurality of genomes is at least 10,000 genomes. In certain embodiments, the plurality of genomes is at least 100,000 genomes. In certain embodiments, the DNA sequence comprises at least 100,000 nucleotides. In certain embodiments, the DNA sequence comprises at least 90% of human haploid genome. In certain embodiments, at least 100 genomic sequence variants are determined in the DNA sequence of the individual. 
In certain embodiments, the reference genome is generated from at least 10,000 individual genomes. In certain embodiments, the reference genome is generated from at least 100,000 individual genomes. In certain embodiments, the genomic sequence variant is an insertion, a deletion, or a translocation. In certain embodiments, the genomic sequence variant is a point mutation. In certain embodiments, the defined protein class is selected from any one or more of a kinase, a phosphatase, a tyrosine kinase, a serine/threonine kinase, a G protein coupled receptor (GPCR), a nuclear hormone receptor, an acetylase, a chaperone, a protease, a serine protease, and a transcription factor. In certain embodiments, the diversity metric is a Shannon entropy, a Simpson diversity index, or a Wu-Kabat variability coefficient.
In another embodiment, described herein, is a method of creating a genomic health risk database comprising: populating a database with a tolerability score value for each of a plurality of positions in a genome; wherein the tolerability score is determined for each of the plurality of positions in the genome within x nucleotides of a genetic element, wherein the tolerability score comprises a function of a nucleotide variation score and an allele proportion score; wherein the nucleotide variation score is the nucleotide variance observed in a plurality of genomes at each of the plurality of positions in the genome, and the allele proportion score is the proportion of genomic variants that exceed an incidence of 0.0001 in the plurality of genomes at each of the plurality of positions in the genome. In certain embodiments, the plurality of genomes is at least 10,000 genomes. In certain embodiments, the plurality of genomes is at least 100,000 genomes. In certain embodiments, the nucleotide variance is an insertion, a deletion, or a translocation. In certain embodiments, the nucleotide variance is a point mutation.
In certain embodiments, the nucleotide variation score is normalized to the size of the genetic element. In certain embodiments, the plurality of positions is greater than 1,000. In certain embodiments, the genetic element is selected from any one or more of a gene promoter, gene enhancer, transcriptional start site, splice donor site, splice acceptor site, polyadenylation site, start codon, stop codon, exon/intron boundary, intron sequence, and an exon sequence. In certain embodiments, the tolerability score is determined for each of a plurality of positions in the genome within 500 nucleotides of the genetic element.
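Populating a database with a tolerability score for each of a plurality of positions could look like the following sketch, which uses Python's built-in `sqlite3` module. The table name, column names, and the helper functions `build_tolerability_db` and `lookup_tolerability` are hypothetical; the text does not specify a schema or storage technology, and the score values shown are made-up placeholders rather than real data.

```python
import sqlite3
from typing import Iterable, List, Tuple

Row = Tuple[str, int, str, float]  # (chromosome, position, genetic element, tolerability score)


def build_tolerability_db(rows: Iterable[Row], path: str = ":memory:") -> sqlite3.Connection:
    """Create (if needed) and populate a per-position tolerability table."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS tolerability (
               chrom   TEXT    NOT NULL,
               pos     INTEGER NOT NULL,
               element TEXT    NOT NULL,
               score   REAL    NOT NULL,
               PRIMARY KEY (chrom, pos, element)
           )"""
    )
    # Parameterized bulk insert; re-running with the same key overwrites the old score.
    conn.executemany("INSERT OR REPLACE INTO tolerability VALUES (?, ?, ?, ?)", rows)
    conn.commit()
    return conn


def lookup_tolerability(conn: sqlite3.Connection, chrom: str, pos: int) -> List[Tuple[str, float]]:
    """Return (element, score) pairs recorded for one genomic position, lowest score first."""
    cursor = conn.execute(
        "SELECT element, score FROM tolerability WHERE chrom = ? AND pos = ? ORDER BY score",
        (chrom, pos),
    )
    return cursor.fetchall()


# Placeholder rows for illustration only.
demo_conn = build_tolerability_db([
    ("chr1", 1000, "splice donor site", 0.02),
    ("chr1", 1000, "exon sequence", 0.15),
    ("chr1", 1001, "exon sequence", 0.40),
])
demo_hits = lookup_tolerability(demo_conn, "chr1", 1000)
```

Keying on chromosome, position, and element lets one position carry separate scores when it falls within x nucleotides of more than one genetic element; this is a design choice of the sketch, not a requirement stated in the text.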
In another embodiment, described herein, is a method of creating a genomic health risk database comprising: populating a database with an n-variant score value for each of a plurality of positions in a genome; wherein the n-variant score is determined for each of the plurality of positions in the genome, wherein the n-variant score comprises a function of a count score and an allele frequency score; wherein the count score is the ratio of the number of times any genomic sequence variant occurs in a unique sequence of n-nucleotides in length in the plurality of genomes compared to a reference genome to the number of times that the unique sequence of n-nucleotides in length occurs in the reference genome, and the allele frequency score is the frequency of the proportion of genomic sequence variants that are fixed in the population, in the plurality of genomes for each of the plurality of positions in the genome. In certain embodiments, the unique sequence of n-nucleotides in length is greater than 4 nucleotides. In certain embodiments, the unique sequence of n-nucleotides in length is less than 100 nucleotides. In certain embodiments, the unique sequence of n-nucleotides in length is 7 nucleotides. In certain embodiments, the genomic sequence variant occurs in the center of the unique sequence of n-nucleotides. In certain embodiments, the plurality of genomes is at least 10,000 genomes. In certain embodiments, the plurality of genomes is at least 100,000 genomes.
In another embodiment, described herein, is a method of creating a genomic health risk database comprising: populating a database with a context dependent tolerance score for each of a plurality of regions in a genome; wherein the context dependent tolerance score comprises a function of an observed context dependent tolerance score and an expected context dependent tolerance score; wherein the expected context dependent tolerance score is the overall probability to vary of a unique sequence of n-nucleotides in length in a certain region of x nucleotides in length actually observed and fixed in a plurality of genomes, and the observed context dependent tolerance score is a number of genomic sequence variants in a certain region of x nucleotides in length actually observed in the plurality of genomes. In certain embodiments, the plurality of genomes is at least 10,000 genomes. In certain embodiments, the plurality of genomes is at least 100,000 genomes. In certain embodiments, the genomic sequence variant is an insertion, a deletion, or a translocation. In certain embodiments, the genomic sequence variant is a point mutation. In certain embodiments, the context dependent tolerance score comprises subtracting the expected context dependent tolerance score from the observed context dependent tolerance score.
In another embodiment, described herein, is a method of creating a genomic health risk database comprising: populating a database with a protein tolerability score value for each of a plurality of positions in a genome; wherein the protein tolerability score is determined for each of the plurality of positions in the genome, wherein the protein tolerability score comprises a function of a diversity score, missense score, and a protein allele frequency score; wherein the diversity score is a normalized diversity metric, the missense score is the variance observed in a plurality of genomes at each of the plurality of positions in the genome which leads to an amino acid variant, and the protein allele frequency score is the proportion of genomic variants that leads to an amino acid variant at each of the plurality of positions in the genome. In certain embodiments, the plurality of genomes is at least 10,000 genomes. In certain embodiments, the plurality of genomes is at least 100,000 genomes. In certain embodiments, the genomic sequence variant is an insertion, a deletion, or a translocation. In certain embodiments, the genomic sequence variant is a point mutation. In certain embodiments, the defined protein class is selected from any one or more of a kinase, a phosphatase, a tyrosine kinase, a serine/threonine kinase, a G protein coupled receptor (GPCR), a nuclear hormone receptor, an acetylase, a chaperone, a protease, a serine protease, and a transcription factor. In certain embodiments, the diversity metric is a Shannon entropy, a Simpson diversity index, or a Wu-Kabat variability coefficient.
In another embodiment, described herein, is a genomic assay comprising a plurality of polynucleotides bound to a substrate, wherein each of the plurality of polynucleotides possesses a sequence corresponding to a genomic locus, wherein a sequence corresponding to the genomic locus possesses a tolerability score below 0.1, wherein the tolerability score comprises a function of a nucleotide variation score and an allele proportion score, wherein the nucleotide variation score is the variance observed in a plurality of genomes at the corresponding position, and the allele proportion score is the proportion of genomic variants that exceeds an incidence of 0.0001 in the plurality of genomes at the corresponding position. In certain embodiments, the plurality of genomes is at least 10,000 genomes. In certain embodiments, the plurality of polynucleotides is at least 1,000 polynucleotides. In certain embodiments, the plurality of polynucleotides is at least 10,000 polynucleotides. In certain embodiments, the plurality of polynucleotides comprises at least 4,000 distinct nucleotide sequences. In certain embodiments, the plurality of polynucleotides comprises at least 8,000 distinct nucleotide sequences. In certain embodiments, the plurality of polynucleotides are covalently bound to the substrate. In certain embodiments, the plurality of polynucleotides are covalently bound to the substrate at their 5 prime ends. In certain embodiments, the plurality of polynucleotides are covalently bound to the substrate at their 3 prime ends. In certain embodiments, the plurality of polynucleotides further comprises a fluorescent molecule. In certain embodiments, the plurality of polynucleotides further comprises a fluorescent dye. In certain embodiments, the substrate comprises glass. In certain embodiments, the substrate comprises silicon.
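Choosing which genomic loci the substrate-bound polynucleotides should correspond to reduces, in its simplest form, to a threshold filter on precomputed scores. The sketch below assumes a mapping from a locus label to its tolerability score; the function name `select_probe_loci`, the locus label format, and the example values are hypothetical. Only the cutoff of 0.1 comes from the text.

```python
from typing import List, Mapping


def select_probe_loci(locus_scores: Mapping[str, float], threshold: float = 0.1) -> List[str]:
    """Return loci whose score is strictly below the threshold, least tolerant first."""
    selected = [(score, locus) for locus, score in locus_scores.items() if score < threshold]
    selected.sort()  # ascending by score, ties broken by locus label
    return [locus for _, locus in selected]


# Placeholder scores for illustration only.
demo_scores = {
    "chr1:100": 0.05,
    "chr1:200": 0.30,
    "chr2:50": 0.01,
    "chr3:7": 0.10,   # not selected: the text requires a score below 0.1
}
demo_loci = select_probe_loci(demo_scores)
```

The same filter would apply with a different cutoff to any other score for which per-locus values are available.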
In another embodiment, described herein, is a genomic assay comprising a plurality of polynucleotides bound to a substrate, wherein each of the plurality of polynucleotides possesses a sequence corresponding to a genomic locus, wherein a sequence corresponding to the genomic locus possesses an n-variant score below 0.05, wherein the n-variant score comprises a function of a count score and an allele frequency score, wherein the count score is the ratio of the number of times any genomic sequence variant occurs in a unique sequence of n-nucleotides in length in the plurality of genomes to the number of times that the unique sequence of n-nucleotides in length occurs in the reference genome, and the allele frequency score is the frequency of the proportion of genomic sequence variants that are fixed in the population, at an allele frequency greater than 0.0001, in the plurality of genomes. In certain embodiments, the plurality of genomes is at least 10,000 genomes. In certain embodiments, the plurality of polynucleotides is at least 1,000 polynucleotides. In certain embodiments, the plurality of polynucleotides is at least 10,000 polynucleotides. In certain embodiments, the plurality of polynucleotides comprise at least 4,000 distinct nucleotide sequences. In certain embodiments, the plurality of polynucleotides comprise at least 8,000 distinct nucleotide sequences. In certain embodiments, the plurality of polynucleotides are covalently bound to the substrate. In certain embodiments, the plurality of polynucleotides are covalently bound to the substrate at their 5 prime ends. In certain embodiments, the plurality of polynucleotides are covalently bound to the substrate at their 3 prime ends. In certain embodiments, the plurality of polynucleotides further comprise a fluorescent molecule.
In certain embodiments, the plurality of polynucleotides further comprise a fluorescent dye. In certain embodiments, the substrate comprises glass. In certain embodiments, the substrate comprises silicon.
In another embodiment, described herein, is a genomic assay comprising a plurality of polynucleotides bound to a substrate, wherein each of the plurality of polynucleotides possesses a sequence corresponding to a genomic locus, wherein a sequence corresponding to the genomic locus possesses a low context dependent tolerance score, wherein the context dependent tolerance score comprises a function of an observed context dependent tolerance score and an expected context dependent tolerance score, wherein the expected context dependent tolerance score is the overall probability to vary of a unique sequence of n-nucleotides in length in a certain region of x nucleotides in length actually observed and fixed in a plurality of genomes, and the observed context dependent tolerance score is a number of genomic sequence variants in a certain region of x nucleotides in length actually observed in the plurality of genomes. In certain embodiments, the context dependent tolerance score comprises subtracting the expected context dependent tolerance score from the observed context dependent tolerance score. In certain embodiments, the plurality of genomes is at least 10,000 genomes. In certain embodiments, the plurality of polynucleotides is at least 1,000 polynucleotides. In certain embodiments, the plurality of polynucleotides is at least 10,000 polynucleotides. In certain embodiments, the plurality of polynucleotides comprise at least 4,000 distinct nucleotide sequences. In certain embodiments, the plurality of polynucleotides comprise at least 8,000 distinct nucleotide sequences. In certain embodiments, the plurality of polynucleotides are covalently bound to the substrate. In certain embodiments, the plurality of polynucleotides are covalently bound to the substrate at their 5 prime ends. In certain embodiments, the plurality of polynucleotides are covalently bound to the substrate at their 3 prime ends. In certain embodiments, the plurality of polynucleotides further comprise a fluorescent molecule. In certain embodiments, the plurality of polynucleotides further comprise a fluorescent dye. In certain embodiments, the substrate comprises glass. In certain embodiments, the substrate comprises silicon.
Any of the methods of this disclosure can be used to determine a section of the genome for targeted sequencing, resequencing, or SNP analysis.
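As an illustration of how a score could guide the choice of loci for a targeted assay, the following minimal Python sketch keeps only candidate loci whose tolerability score falls below a cutoff. The `Locus` record and the `select_loci` helper are hypothetical names introduced here. The 0.1 default cutoff is taken from the embodiment above, and the three low scores are the example positions reported later in this disclosure. The fourth locus is an invented placeholder.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class Locus:
    """A candidate genomic position with a precomputed score (hypothetical record)."""
    chromosome: str
    position: int
    tolerability_score: float


def select_loci(loci: Iterable[Locus], cutoff: float = 0.1) -> List[Locus]:
    """Return loci scoring strictly below the cutoff, lowest score first."""
    selected = [locus for locus in loci if locus.tolerability_score < cutoff]
    return sorted(selected, key=lambda locus: locus.tolerability_score)


candidates = [
    Locus("chr7", 117587738, 0.0159),
    Locus("chr13", 32326240, 0.0137),
    Locus("chr2", 47480818, 0.0258),
    Locus("chr1", 1000000, 0.42),  # invented placeholder, not from the disclosure
]

for locus in select_loci(candidates):
    print(locus.chromosome, locus.position, locus.tolerability_score)
```

The same filter could be applied to any of the other metrics by swapping the scored field and the cutoff.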
In another embodiment, described herein, is a functional genomic assay comprising: identifying a presence of at least one genomic sequence variant in the nucleic acid sequence of an individual; determining if the at least one genomic sequence variant occurs in a highly conserved genomic region; the highly conserved genomic region having an observed context dependent tolerance score greater than an expected context dependent tolerance score, wherein the expected context dependent tolerance score is the probability to vary of a unique nucleic acid sequence of n-nucleotides in length in a certain region of x nucleotides in length in a plurality of genomes, and the observed context dependent tolerance score is a number of genomic sequence variants in the certain region of x nucleotides in length actually observed in the plurality of genomes. In certain embodiments, the nucleic acid sequence comprises a DNA sequence. In certain embodiments, the DNA sequence comprises a nuclear DNA sequence. In certain embodiments, the plurality of genomes is at least 10,000 genomes. In certain embodiments, the nucleic acid sequence comprises at least 100,000 nucleotides. In certain embodiments, the functional genomic assay comprises identifying the presence of at least 10 genomic sequence variants. In certain embodiments, the at least one genomic sequence variant comprises at least one of an insertion, a deletion, and a translocation. In certain embodiments, the at least one genomic sequence variant comprises a single nucleotide polymorphism. In certain embodiments, n equals 7. In certain embodiments, x is between 400 and 600. In certain embodiments, the functional genomic assay comprises determining if the at least one genomic sequence variant is in a non-coding genomic region that is highly conserved. In certain embodiments, the at least one genomic sequence variant is in a non-coding highly conserved genomic region within 1,000 base pairs of a known disease-associated gene. 
In certain embodiments, the highly conserved genomic region is a genomic region corresponding to a most conserved 1st percentile of all genomic regions. In certain embodiments, the observed context dependent tolerance score is at least 10% greater than an expected context dependent tolerance score. In certain embodiments, at least one of the at least one genomic sequence variant in a non-coding genomic region that is highly conserved is selected from the list consisting of rs587780751, rs745366624, rs777251123, rs778796405, rs774531501, rs587776927, rs768823171, rs749303140, rs376829288, rs750530042, rs587776558, rs372686280, rs111812550, rs143144732, rs193922699, rs750180293, rs398122808, rs757171524, rs773306994, rs773306994, rs372418954, rs762425885, rs397516031, rs397516022, rs730880592, rs730880592, rs397516020, rs397516020, rs373746463, rs373746463, rs373746463, rs387906397, rs387906397, rs587782958, rs730880718, rs730880667, rs113358486, rs111683277, rs112917345, rs730880691, rs397515916, rs730880690, rs111437311, rs397515903, rs727503201, rs112999777, rs397515897, rs727503204, rs397515893, rs397515891, rs587776699, rs587776700, rs376395543, rs748486465, rs149712664, rs199683937, rs144637717, rs587776644, rs730880296, rs397515322, rs558721552, rs531105836, rs587777262, rs267607302, rs387907354, rs398123750, rs727503988, rs587783714, rs148622862, rs763991428, rs761780097, rs770204470, rs387906521, rs387906520, rs79367981, rs749160734, rs587776708, rs587776708, rs34086577, rs199959804, rs587777290, rs386834170, rs386834169, rs144077391, rs386834164, rs386834166, rs770093080, rs587777374, rs45517105, rs45517105, rs45488500, rs45517289, rs45517289, rs137854118, rs45517358, rs189077405, rs515726118, rs386833742, rs386833739, rs755127868, rs200655247, rs376023420, rs747351687, rs113690956, rs376281637, rs765390290, rs773401248, rs61750189, rs530975087, rs201978571, rs267604791, rs80358116, rs80358116, rs273899695, rs80358011, rs80358011, rs80358051, rs730880267, 
rs63751296, rs63750707, rs776442328, rs776820510, rs72653165, rs72667012, rs72667008, rs527398797, rs587780009, rs587776658, rs587782018, rs745620135, rs372651309, rs556992558, rs137853932, rs200253809, rs386833901, rs770882876, rs750550558, rs397507554, rs730880306, rs201613240, rs147952488, rs770241629, rs373494631, rs397517741, rs386833856, rs559854357, rs371496308, rs539645405, rs187510057, rs41298629, rs536892777, rs747330606, rs748559929, rs770277446, rs201685922, rs767245071, rs730882032, rs587776525, rs398123358, rs72659359, rs137853943, rs267607709, rs267607710, rs766168993, rs775288140, rs780041521, rs145564018, rs775456047, rs587776879, rs540289812, rs745832717, rs745915863, rs386833418, rs199422309, rs431905514, rs587784059, rs748086984, rs386833492, rs199988476, rs281865166, rs587776515, rs397518439, rs193922258, rs142637046, rs73717525, rs145483167, rs587777285, rs747737281, rs183894680, rs116735828, rs574673404, rs386833563, rs768154316, rs111033661, rs755363896, rs368953604, rs180177319, rs148049120, rs150676454, rs372655486, rs373842615, rs763389916, rs118203419, rs515726232, rs312262809, rs312262804, rs281865349, rs281865338, rs281865337, rs281865334, rs281865336, rs281865336, rs62638626, rs62638627, rs587784423, rs113951193, rs281874765, rs104886349, rs398123247, rs74315277, rs200346587, rs398122908, rs727503036, rs397515747, and rs587776734. 
In certain embodiments, at least one of the at least one genomic sequence variant in a non-coding region that is highly conserved is selected from the list consisting of rs778796405, rs8177982, rs376829288, rs4253196, rs750180293, rs757171524, rs727503201, rs397515893, rs587776699, rs397516083, rs201078659, rs750425291, rs558721552, rs531105836, rs200782636, rs752197734, rs3093266, rs34086577, rs199959804, rs144077391, rs386834164, rs386834166, rs189077405, rs746701685, rs386833721, rs376023420, rs761146008, rs765390290, rs72648337, rs527398797, rs367567416, rs372651309, rs200253809, rs193922837, rs761737358, rs113994173, rs559854357, rs111951711, rs371496308, rs368123079, rs118192239, rs41298629, and rs536892777. In certain embodiments, the functional genomic assay is for use in determining a likelihood of the individual being diagnosed with a cancer. In certain embodiments, the functional genomic assay is for use in prognosing a cancer of the individual.
In another embodiment, described herein, is a computer-implemented system comprising: a computer comprising: at least one processor, a memory, an operating system configured to perform executable instructions, and a computer program including instructions executable by the at least one processor to create a functional genomic assay application, the functional genomic assay application configured to perform the following: receiving a nucleic acid sequence of an individual; identifying a presence of at least one genomic sequence variant in the nucleic acid sequence of the individual; and determining if the at least one genomic sequence variant occurs in a highly conserved genomic region, the highly conserved genomic region having an observed context dependent tolerance score greater than an expected context dependent tolerance score, wherein the expected context dependent tolerance score is the probability to vary of a unique nucleic acid sequence of n-nucleotides in length in a certain region of x nucleotides in length in a plurality of genomes, and the observed context dependent tolerance score is a number of genomic sequence variants in the certain region of x nucleotides in length actually observed in the plurality of genomes. The nucleic acid sequence may comprise a DNA sequence and in some cases, the DNA sequence comprises a nuclear DNA sequence. In some cases, the plurality of genomes is at least 10,000 genomes. In some cases, the nucleic acid sequence comprises at least 100,000 nucleotides. The functional genomic assay may comprise identifying the presence of at least 10 genomic sequence variants. In some cases, the at least one genomic sequence variant comprises at least one of an insertion, a deletion, and a translocation. In some cases, the at least one genomic sequence variant comprises a single nucleotide polymorphism. In particular embodiments of the functional genomic assay n equals 7. 
In some embodiments of the functional genomic assay x is between 400 and 600. The functional genomic assay may comprise determining if the at least one genomic sequence variant is in a non-coding highly conserved genomic region. In some cases, the at least one genomic sequence variant is in a non-coding highly conserved genomic region within 2 megabases of a known disease-associated gene. In some cases, the highly conserved genomic region is a genomic region corresponding to a most conserved 1st percentile of all genomic regions. In some cases, the observed context dependent tolerance score is at least 10% greater than an expected context dependent tolerance score. In various cases, at least one of the at least one genomic sequence variant in a non-coding genomic region that is highly conserved is selected from the list consisting of rs587780751, rs745366624, rs777251123, rs778796405, rs774531501, rs587776927, rs768823171, rs749303140, rs376829288, rs750530042, rs587776558, rs372686280, rs111812550, rs143144732, rs193922699, rs750180293, rs398122808, rs757171524, rs773306994, rs773306994, rs372418954, rs762425885, rs397516031, rs397516022, rs730880592, rs730880592, rs397516020, rs397516020, rs373746463, rs373746463, rs373746463, rs387906397, rs387906397, rs587782958, rs730880718, rs730880667, rs113358486, rs111683277, rs112917345, rs730880691, rs397515916, rs730880690, rs111437311, rs397515903, rs727503201, rs112999777, rs397515897, rs727503204, rs397515893, rs397515891, rs587776699, rs587776700, rs376395543, rs748486465, rs149712664, rs199683937, rs144637717, rs587776644, rs730880296, rs397515322, rs558721552, rs531105836, rs587777262, rs267607302, rs387907354, rs398123750, rs727503988, rs587783714, rs148622862, rs763991428, rs761780097, rs770204470, rs387906521, rs387906520, rs79367981, rs749160734, rs587776708, rs587776708, rs34086577, rs199959804, rs587777290, rs386834170, rs386834169, rs144077391, rs386834164, rs386834166, rs770093080, rs587777374, rs45517105, 
rs45517105, rs45488500, rs45517289, rs45517289, rs137854118, rs45517358, rs189077405, rs515726118, rs386833742, rs386833739, rs755127868, rs200655247, rs376023420, rs747351687, rs113690956, rs376281637, rs765390290, rs773401248, rs61750189, rs530975087, rs201978571, rs267604791, rs80358116, rs80358116, rs273899695, rs80358011, rs80358011, rs80358051, rs730880267, rs63751296, rs63750707, rs776442328, rs776820510, rs72653165, rs72667012, rs72667008, rs527398797, rs587780009, rs587776658, rs587782018, rs745620135, rs372651309, rs556992558, rs137853932, rs200253809, rs386833901, rs770882876, rs750550558, rs397507554, rs730880306, rs201613240, rs147952488, rs770241629, rs373494631, rs397517741, rs386833856, rs559854357, rs371496308, rs539645405, rs187510057, rs41298629, rs536892777, rs747330606, rs748559929, rs770277446, rs201685922, rs767245071, rs730882032, rs587776525, rs398123358, rs72659359, rs137853943, rs267607709, rs267607710, rs766168993, rs775288140, rs780041521, rs145564018, rs775456047, rs587776879, rs540289812, rs745832717, rs745915863, rs386833418, rs199422309, rs431905514, rs587784059, rs748086984, rs386833492, rs199988476, rs281865166, rs587776515, rs397518439, rs193922258, rs142637046, rs73717525, rs145483167, rs587777285, rs747737281, rs183894680, rs116735828, rs574673404, rs386833563, rs768154316, rs111033661, rs755363896, rs368953604, rs180177319, rs148049120, rs150676454, rs372655486, rs373842615, rs763389916, rs118203419, rs515726232, rs312262809, rs312262804, rs281865349, rs281865338, rs281865337, rs281865334, rs281865336, rs281865336, rs62638626, rs62638627, rs587784423, rs113951193, rs281874765, rs104886349, rs398123247, rs74315277, rs200346587, rs398122908, rs727503036, rs397515747, and rs587776734. 
In various embodiments, at least one of the at least one genomic sequence variant in a non-coding region that is highly conserved is selected from the list consisting of rs778796405, rs8177982, rs376829288, rs4253196, rs750180293, rs757171524, rs727503201, rs397515893, rs587776699, rs397516083, rs201078659, rs750425291, rs558721552, rs531105836, rs200782636, rs752197734, rs3093266, rs34086577, rs199959804, rs144077391, rs386834164, rs386834166, rs189077405, rs746701685, rs386833721, rs376023420, rs761146008, rs765390290, rs72648337, rs527398797, rs367567416, rs372651309, rs200253809, rs193922837, rs761737358, rs113994173, rs559854357, rs111951711, rs371496308, rs368123079, rs118192239, rs41298629, and rs536892777. The functional genomic assay may be for use in determining a likelihood of the individual being diagnosed with a cancer, for use in prognosing a cancer of the individual, and/or for use in determining longevity of the individual.
Unless otherwise defined, all technical terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. As used in this specification and the appended claims, the singular forms “a,” “an,” and “the” include plural references unless the context clearly dictates otherwise. Any reference to “or” herein is intended to encompass “and/or” unless otherwise stated.
As used herein “genomic sequence variant” refers to any nucleotide difference in an individual's genome sequence compared to a reference genome. The variant can be a single nucleotide variant (SNV or SNP), insertion or deletion (Indel), or translocation. In certain embodiments, the indel comprises more than a single nucleotide. In certain embodiments, a genomic sequence variant excludes mitochondrial deoxyribonucleic acid (DNA) sequences. In certain embodiments, a genomic sequence variant excludes variants found on either of the non-autosomal human X or Y chromosomes. In certain embodiments, the genomic sequence variant is a human genomic sequence variant.
As used herein “reference genome” refers to any standard publicly available reference genome, for example GRCh38, the Genome Reference Consortium human genome (build 38). Alternatively, the reference genome can be one that is constructed de novo from sequencing a plurality of genomes. In certain embodiments, the plurality of genomes is greater than 10,000 different genomes. In certain embodiments, the plurality of genomes is greater than 100,000 different genomes.
Described herein, are methods, systems, and media useful for determining the health risk of a genomic sequence variant (GSV) in the nucleic acid sequence of an individual's genome. In certain embodiments, the DNA sequence comprises a sequence for an individual's whole genome. In certain embodiments, the DNA sequence comprises a sequence for only the high confidence regions of an individual's whole genome. In certain embodiments, the DNA sequence comprises a sequence for the high confidence region of an individual's whole genome as defined by the NA12878 Genome-In-A-Bottle call set (GiaB v2.19). In certain embodiments, the DNA sequence comprises a sequence for 90% of the high confidence region of an individual's whole genome as defined by the GiaB v2.19. In certain embodiments, the DNA sequence comprises a sequence for 80% of the high confidence region of an individual's whole genome as defined by the GiaB v2.19. In certain embodiments, the DNA sequence comprises a sequence for 70% of the high confidence region of an individual's whole genome as defined by the GiaB v2.19. In certain embodiments, the DNA sequence comprises a sequence of a plurality of contiguous nucleotides from an individual's genome. In certain embodiments, the DNA sequence comprises a sequence of at least 100 contiguous nucleotides from an individual's genome. In certain embodiments, the DNA sequence comprises a sequence of at least 1,000 contiguous nucleotides from an individual's genome. In certain embodiments, the DNA sequence comprises a sequence of at least 10,000 contiguous nucleotides from an individual's genome. In certain embodiments, the DNA sequence comprises a sequence of at least 100,000 contiguous nucleotides from an individual's genome. In certain embodiments, the DNA sequence comprises a sequence of at least 1,000,000 contiguous nucleotides from an individual's genome. In certain embodiments, the DNA sequence does not comprise the sequence of ribonucleic acid (RNA). 
In certain embodiments, the DNA sequence does not comprise the sequence of cDNA generated from ribonucleic acid (RNA).
Described herein, are methods, systems, and media useful for determining the genomic health risk of a genomic sequence variant (GSV) in the DNA sequence of an individual's genome. Determining a genomic health risk encompasses several different or alternative steps. Further, the genomic health risk itself is with respect to an overall health risk or for specific diseases. In certain embodiments, determining the genomic health risk comprises determining a tolerability score for at least one GSV in an individual. In certain embodiments, determining the genomic health risk comprises determining an n-variant score for at least one GSV in an individual. In certain embodiments, determining the genomic health risk comprises determining a context dependent tolerance score for at least one region in which there is at least one GSV in an individual. In certain embodiments, determining the genomic health risk comprises determining a protein tolerability score for at least one GSV in an individual. In certain embodiments, the genomic health risk is determined using any single genomic health risk metric of this disclosure selected from the list consisting of: a tolerability score, an n-mer score, a context dependent tolerance score, and a protein tolerability score. In certain embodiments, the genomic health risk is determined using any two genomic health risk metrics of this disclosure selected from the list consisting of: a tolerability score, an n-mer score, a context dependent tolerance score, and a protein tolerability score. In certain embodiments, the genomic health risk is determined using any three genomic health risk metrics of this disclosure selected from the list consisting of: a tolerability score, an n-mer score, a context dependent tolerance score, and a protein tolerability score. In certain embodiments, the genomic health risk is determined using all of a tolerability score, an n-mer score, a context dependent tolerance score, and a protein tolerability score.
In certain embodiments, the genomic health risk is determined with respect to any single GSV of an individual. In certain embodiments, the genomic health risk is determined with respect to a plurality of GSVs of an individual. In certain embodiments, the genomic health risk is determined with respect to at least 10 GSVs of an individual. In certain embodiments, the genomic health risk is determined with respect to at least 100 GSVs of an individual. In certain embodiments, the genomic health risk is determined with respect to at least 1,000 GSVs of an individual. In certain embodiments, the genomic health risk is determined with respect to at least 10,000 GSVs of an individual. In certain embodiments, the genomic health risk is determined with respect to at least 100,000 GSVs of an individual.
In certain embodiments, the genomic health risk determined is an overall health risk defined as the increase or decrease in the likelihood of contracting any pathological condition. In certain embodiments, the genomic health risk is an arbitrary designation that communicates the increased risk of any given GSV. In certain embodiments, the genomic health risk is an arbitrary designation that communicates the increased risk of a plurality of GSVs. In certain embodiments, the genomic health risk is a percentage increase risk that any given GSV will be deleterious to the health of the individual. In certain embodiments, the genomic health risk is a percentage increase risk that a plurality of GSVs will be deleterious to the health of the individual. In certain embodiments, genomic health risk comprises the likelihood of contracting or being afflicted with diabetes, high blood pressure, cardiac arrhythmia, cardiovascular disease, atherosclerosis, stroke, non-alcoholic fatty liver disease, cirrhosis, dementia, bipolar disorder, depression, schizophrenia, anxiety disorder, autism, Asperger's syndrome, Parkinson's disease, Alzheimer's disease, Huntington's disease, cancer, breast cancer, prostate cancer, leukemia, melanoma, pancreatic cancer, colon cancer, stomach cancer, kidney cancer, liver cancer, an inborn error of metabolism, a genetically linked immunodeficiency, risk or protective alleles for the contraction. In certain embodiments, the genomic health risk is determined without GSVs known at the date of filing this disclosure that lead to a known disease, for example, known GSVs in the BRCA gene that lead to increased risk of breast cancer.
In certain embodiments, DNA sequence data for use with the methods, systems and media, described herein, is generated by any suitable method. In certain embodiments, the DNA sequence data is generated by Sanger sequencing. In certain embodiments, the DNA sequence data is generated by any next-generation sequencing technology. In certain embodiments, the DNA sequence data is generated by, by way of non-limiting example, pyrosequencing, sequencing by synthesis, sequencing by ligation, ion semiconductor sequencing, or single molecule real time sequencing. In certain embodiments, the DNA sequence data is generated by any technology capable of generating 1 gigabase of nucleotide reads per 24-hour period. In certain embodiments, the DNA sequence data is obtained from a third party.
In certain embodiments, GSVs for use with the methods, systems and media, described herein, are determined de novo during implementation of any of the methods. In certain embodiments, GSVs are determined by a third party and received by the party performing the method. In certain embodiments, determining a GSV encompasses receiving a list or file that comprises an individual's GSVs.
In certain embodiments, GSVs are determined by comparison with a reference genome. In certain embodiments, the reference genome is publicly available. In certain embodiments, the reference genome is NA12878 from the CEPH Utah reference collection. In certain embodiments, the reference genome is the GRCh38, Genome Reference Consortium human genome (build 38). In certain embodiments, the reference genome is any previous or subsequent build of the Genome Reference Consortium human genome. In certain embodiments, the reference genome is constructed from at least 1,000 human genomes. In certain embodiments, the reference genome is constructed from at least 10,000 human genomes. In certain embodiments, the reference genome is constructed from at least 100,000 human genomes. In certain embodiments, the reference genome is constructed from at least 1,000,000 human genomes. In certain embodiments, a GSV is a difference of a single nucleotide compared to a reference genome. In certain embodiments, a GSV is a difference of a plurality of contiguous nucleotides compared to a reference genome. In certain embodiments, a GSV is an insertion of one or more nucleotides compared to a reference genome. In certain embodiments, a GSV is a deletion of one or more nucleotides compared to a reference genome.
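The comparison against a reference genome can be pictured with a toy example. The sketch below assumes the individual's sequence has already been aligned to the reference so that both strings have the same length, with `-` marking a gap. The function name `call_variants` and the sequences are hypothetical. A production pipeline would rely on a read aligner and a dedicated variant caller rather than a column-by-column comparison.

```python
from typing import List, Tuple


def call_variants(reference: str, sample: str) -> List[Tuple[int, str, str]]:
    """Compare two pre-aligned sequences of equal length.

    Returns one (1-based alignment column, reference symbol, sample symbol)
    tuple per differing column. A '-' in the reference marks an insertion in
    the sample; a '-' in the sample marks a deletion.
    """
    if len(reference) != len(sample):
        raise ValueError("sequences must be aligned to the same length")
    variants = []
    for column, (ref_base, alt_base) in enumerate(zip(reference, sample), start=1):
        if ref_base != alt_base:
            variants.append((column, ref_base, alt_base))
    return variants


# Toy alignment: one substitution, one insertion, one deletion.
found = call_variants("ACGT-ACGT", "ACCTTAC-T")
print(found)  # [(3, 'G', 'C'), (5, '-', 'T'), (8, 'G', '-')]
```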
In certain embodiments, the methods, systems and media, described herein comprise determining a tolerability score for at least one GSV. In certain embodiments, the methods, systems and media, described herein comprise determining a tolerability score for a plurality of GSVs. The concept of determining a tolerability score is captured in
The nucleotide variation score in the plurality of genomes is determined for a position x bases upstream or downstream of the above-mentioned landmark. In certain embodiments, the position is less than 10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 200, 300, 400, 500, 600, 700, 800, 900, or 1,000 bases, including increments therein, upstream or downstream from the landmark. The nucleotide variation score is then normalized to the average variability for all positions within x nucleotides of the landmark or genetic element. In certain embodiments, this normalization occurs over 100 to 1,500 base pairs. The nucleotide variation score is then multiplied by the fraction of all alleles at that position x bases from the landmark that exceed 0.0001 (the allele proportion score, where the maximal allelic proportion is 0.5 in a population). In certain embodiments, the tolerability score is a function of the nucleotide variation score and the fraction of all alleles at that position x bases from the landmark that exceed 0.0001. This yields the tolerability score for a position x bases from a given landmark. In certain embodiments, the allele proportion score is determined as the fraction of all alleles at a position x bases from the landmark that exceeds 0.0001, 0.0002, 0.0003, 0.0004, 0.0005, 0.0006, 0.0007, 0.0008, 0.0009, 0.001, 0.002, 0.003, 0.004, 0.005, 0.006, 0.007, 0.008, 0.009, or 0.010. If an individual possesses a GSV x bases from a landmark, the tolerability score for that position is then correlated with the GSV.
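The calculation described above can be sketched in Python under one possible reading of the text. In this sketch the raw variation at an offset is taken to be the fraction of landmark instances carrying a variant at that offset, it is normalized by the mean over all offsets in the window, and it is multiplied by the fraction of those variants whose allele frequency exceeds the threshold. The function name `tolerability_scores`, its parameters, and the toy frequencies are assumptions made for illustration and are not taken from this disclosure.

```python
from statistics import mean
from typing import Dict, List


def tolerability_scores(
    variant_freqs_by_offset: Dict[int, List[float]],
    n_landmarks: int,
    min_freq: float = 0.0001,
) -> Dict[int, float]:
    """Score each offset from a landmark (assumed interpretation).

    variant_freqs_by_offset maps an offset (bases from the landmark) to the
    allele frequencies of the variants seen at that offset across all
    instances of the landmark in the plurality of genomes.
    """
    if not variant_freqs_by_offset:
        return {}
    # Raw variation: share of landmark instances that vary at this offset.
    raw = {off: len(freqs) / n_landmarks for off, freqs in variant_freqs_by_offset.items()}
    window_mean = mean(raw.values())
    scores = {}
    for off, freqs in variant_freqs_by_offset.items():
        # Normalize to the average variability across the window.
        normalized = raw[off] / window_mean if window_mean > 0 else 0.0
        # Allele proportion score: share of variants above the frequency threshold.
        proportion = sum(f > min_freq for f in freqs) / len(freqs) if freqs else 0.0
        scores[off] = normalized * proportion
    return scores


# Toy data: three offsets around a landmark observed at 10 sites.
toy = {
    -1: [0.00005, 0.00002],
    0: [0.01, 0.2, 0.00005, 0.0003],
    1: [0.3, 0.05, 0.001, 0.002, 0.00001, 0.00002],
}
print(tolerability_scores(toy, n_landmarks=10))
```

In this toy input, offset -1 scores zero because none of its variants exceeds the frequency threshold, which is the pattern the text associates with poorly tolerated positions.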
In certain embodiments, a tolerability score that is below 0.01 indicates an increase in the genomic health risk for a given GSV. In certain embodiments, a tolerability score that is below 0.02 indicates an increase in the genomic health risk for a given GSV. In certain embodiments, a tolerability score that is below 0.03 indicates an increase in the genomic health risk for a given GSV. In certain embodiments, a tolerability score that is below 0.04 indicates an increase in the genomic health risk for a given GSV. In certain embodiments, a tolerability score that is below 0.05 indicates an increase in the genomic health risk for a given GSV. In certain embodiments, a tolerability score that is below 0.06 indicates an increase in the genomic health risk for a given GSV. In certain embodiments, a tolerability score that is below 0.07 indicates an increase in the genomic health risk for a given GSV. In certain embodiments, a tolerability score that is below 0.08 indicates an increase in the genomic health risk for a given GSV. In certain embodiments, a tolerability score that is below 0.09 indicates an increase in the genomic health risk for a given GSV. In certain embodiments, a tolerability score that is below 0.10 indicates an increase in the genomic health risk for a given GSV. In certain embodiments, a tolerability score that is below 0.11 indicates an increase in the genomic health risk for a given GSV. In certain embodiments, a tolerability score that is below 0.12 indicates an increase in the genomic health risk for a given GSV. In certain embodiments, a tolerability score that is below 0.13 indicates an increase in the genomic health risk for a given GSV. In certain embodiments, the genomic health risk is increased by at least 20%. In certain embodiments, the genomic health risk is increased by at least 50%. In certain embodiments, the genomic health risk is increased by at least 100%. 
In certain embodiments, the genomic health risk is increased by at least 200%. In certain embodiments, the genomic health risk is increased by at least 300%. In certain embodiments, the genomic health risk is increased by at least 400%. In certain embodiments, the genomic health risk is increased by at least 500%. In certain embodiments, the genomic health risk is increased by at least 1000%.
Position 117587738 on chromosome 7 has a tolerance score of 0.0159 and a variation at that position has been associated with Cystic fibrosis (ClinVar entry: NM_000492.3(CFTR):c.1585-1G>A AND Cystic fibrosis).
Position 32326240 on chromosome 13 has a tolerance score of 0.0137 and a variation at that position has been associated with Breast ovarian cancer (ClinVar entry: NM_000059.3(BRCA2):c.476-2A>G AND Breast-ovarian cancer, familial 2).
Position 47480818 on chromosome 2 has a tolerance score of 0.0258 and a variation at that position has been associated with Lynch syndrome (ClinVar entry: NM_000251.2(MSH2):c.2581C>T (p.Gln861Ter) AND Lynch syndrome).
n-Variant Score
In certain embodiments, the methods, systems and media, described herein comprise determining an n-variant score for at least one GSV. In certain embodiments, the methods, systems and media, described herein comprise determining an n-variant score for a plurality of GSVs. The concept of determining an n-variant score, in this case n=7, is captured in
In certain embodiments, an n-variant score that is below 0.001 indicates an increase in the genomic health risk for a given GSV. In certain embodiments, an n-variant score that is below 0.002 indicates an increase in the genomic health risk for a given GSV. In certain embodiments, an n-variant score that is below 0.003 indicates an increase in the genomic health risk for a given GSV. In certain embodiments, an n-variant score that is below 0.004 indicates an increase in the genomic health risk for a given GSV. In certain embodiments, an n-variant score that is below 0.005 indicates an increase in the genomic health risk for a given GSV. In certain embodiments, an n-variant score that is below 0.006 indicates an increase in the genomic health risk for a given GSV. In certain embodiments, an n-variant score that is below 0.007 indicates an increase in the genomic health risk for a given GSV. In certain embodiments, an n-variant score that is below 0.008 indicates an increase in the genomic health risk for a given GSV. In certain embodiments, an n-variant score that is below 0.009 indicates an increase in the genomic health risk for a given GSV. In certain embodiments, an n-variant score that is below 0.010 indicates an increase in the genomic health risk for a given GSV. In certain embodiments, an n-variant score that is below 0.011 indicates an increase in the genomic health risk for a given GSV. In certain embodiments, an n-variant score that is below 0.012 indicates an increase in the genomic health risk for a given GSV. In certain embodiments, an n-variant score that is below 0.013 indicates an increase in the genomic health risk for a given GSV. In certain embodiments, the genomic health risk is increased by at least 20%. In certain embodiments, the genomic health risk is increased by at least 50%. In certain embodiments, the genomic health risk is increased by at least 100%. In certain embodiments, the genomic health risk is increased by at least 200%. 
In certain embodiments, the genomic health risk is increased by at least 300%. In certain embodiments, the genomic health risk is increased by at least 400%. In certain embodiments, the genomic health risk is increased by at least 500%. In certain embodiments, the genomic health risk is increased by at least 1000%. In certain embodiments, the n-variant score allows the identification of pathogenic variants (health risk associated) without the need for annotation.
n-Variant Score Examples
Position 43115730 on chromosome 17 has a heptamer tolerability score of 0.000397 for the variant T>A, and this variant has been associated with breast-ovarian cancer (ClinVar entry: NM_007294.3(BRCA1):c.130T>A (p.Cys44Ser) AND Breast-ovarian cancer, familial 1).
Position 37028836 on chromosome 3 has a heptamer tolerability score of 0.000393 for the variant A>T, and this variant has been associated with Lynch syndrome (ClinVar entry: NM_000249.3(MLH1):c.1462A>T (p.Lys488Ter) AND Lynch syndrome).
Position 108335959 on chromosome 11 has a heptamer tolerability score of 0.000388 for the variant A>T, and this variant has been associated with hereditary cancer-predisposing syndrome (ClinVar entry: NM_000051.3(ATM):c.8266A>T (p.Lys2756Ter) AND Hereditary cancer-predisposing syndrome).
In certain embodiments, the methods, systems and media described herein comprise determining a context dependent tolerance score (regional variation score) for the region in which at least one GSV occurs. As noted previously, an n-variant score can be determined for each nucleotide in the genome. In
In certain embodiments, the region for which the global probability to vary is determined is between 10 and 10,000 nucleotides in length. In certain embodiments, the region is between 10 and 1,000 nucleotides in length. In certain embodiments, the region is between 10 and 500 nucleotides in length. In certain embodiments, the region is between 10 and 100 nucleotides in length. In certain embodiments, the region is between 100 and 200 nucleotides in length. In certain embodiments, the region is between 120 and 180 nucleotides in length. In certain embodiments, the region is between 140 and 160 nucleotides in length. In certain embodiments, the region is between 300 and 700 nucleotides in length. In certain embodiments, the region is between 400 and 600 nucleotides in length. The region can be any length that is able to be practically analyzed using computer aided means including lengths in excess of 1,000; 5,000; 10,000; 50,000; or 100,000 nucleotides.
In certain exemplary embodiments, if the context dependent tolerance score is represented as an observed context dependent tolerance score divided by the expected context dependent tolerance score, a context dependent tolerance score below 1 increases the genomic health risk of a given GSV. In certain embodiments, a GSV that occurs in a region with a context dependent tolerance score below 0.9 increases the genomic health risk of a given GSV. In certain embodiments, a GSV that occurs in a region with a context dependent tolerance score below 0.8 increases the genomic health risk of a given GSV. In certain embodiments, a GSV that occurs in a region with a context dependent tolerance score below 0.7 increases the genomic health risk of a given GSV. In certain embodiments, a GSV that occurs in a region with a context dependent tolerance score below 0.6 increases the genomic health risk of a given GSV. In certain embodiments, a GSV that occurs in a region with a context dependent tolerance score below 0.5 increases the genomic health risk of a given GSV. In certain embodiments, a GSV that occurs in a region with a context dependent tolerance score below 0.4 increases the genomic health risk of a given GSV. In certain embodiments, a GSV that occurs in a region with a context dependent tolerance score below 0.3 increases the genomic health risk of a given GSV. In certain embodiments, a GSV that occurs in a region with a context dependent tolerance score below 0.2 increases the genomic health risk of a given GSV. In certain embodiments, a GSV that occurs in a region with a context dependent tolerance score below 0.1 increases the genomic health risk of a given GSV. In certain embodiments, the genomic health risk is increased by at least 20%. In certain embodiments, the genomic health risk is increased by at least 50%. In certain embodiments, the genomic health risk is increased by at least 100%. In certain embodiments, the genomic health risk is increased by at least 200%.
In certain embodiments, the genomic health risk is increased by at least 300%. In certain embodiments, the genomic health risk is increased by at least 400%. In certain embodiments, the genomic health risk is increased by at least 500%. In certain embodiments, the genomic health risk is increased by at least 1000%.
The context dependent tolerance score is able to identify potentially pathogenic genomic sequence variants without any a priori knowledge about the genomic location of the sequence variant. In certain embodiments, the context dependent variation score allows the identification of pathogenic (health risk associated) variants without the need for annotation. In certain embodiments, the context dependent variation score allows the identification of pathogenic (health risk associated) variants without the need for functional annotation.
In certain embodiments, the genomic health risk of a particular variant is defined as pathogenic if it falls in a region of the genome in the top 10% of conserved regions. In certain embodiments, the genomic health risk of a particular variant is defined as pathogenic if it falls in a region of the genome in the top 5% of conserved regions. In certain embodiments, the genomic health risk of a particular variant is defined as pathogenic if it falls in a region of the genome in the top 2% of conserved regions. In certain embodiments, the genomic health risk of a particular variant is defined as pathogenic if it falls in a region of the genome in the top 1% of conserved regions.
In certain embodiments, the genomic health risk of a particular variant is defined as pathogenic if it is in the top 10% of conserved genomic loci. In certain embodiments, the genomic health risk of a particular variant is defined as pathogenic if it falls in a region of the genome in the top 5% of genomic loci. In certain embodiments, the genomic health risk of a particular variant is defined as pathogenic if it falls in a region of the genome in the top 2% of genomic loci. In certain embodiments, the genomic health risk of a particular variant is defined as pathogenic if it falls in a region of the genome in the top 1% of genomic loci.
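The percentile cut-offs above can be illustrated with a minimal sketch. It assumes that each region already has a context dependent tolerance score and that a lower score means a more conserved region; the function name `most_conserved_regions` and the toy scores are hypothetical and are not part of the disclosure.

```python
def most_conserved_regions(region_scores, top_fraction=0.01):
    """Return the ids of the regions in the most conserved `top_fraction`.

    region_scores maps a region id to its context dependent tolerance score.
    Assumption of this sketch: a lower score means a more conserved region.
    """
    if not 0 < top_fraction <= 1:
        raise ValueError("top_fraction must be in (0, 1]")
    # Sort region ids from lowest (most conserved) to highest score.
    ranked = sorted(region_scores, key=region_scores.get)
    n_keep = max(1, round(len(ranked) * top_fraction))
    return set(ranked[:n_keep])


# Toy data: 100 regions whose score equals their index.
toy_scores = {f"region_{i}": float(i) for i in range(100)}
top_10_percent = most_conserved_regions(toy_scores, top_fraction=0.10)
```

A variant would then be flagged when the region containing it is a member of the returned set.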
In these examples, the expected context dependent tolerance score (CDTS) is subtracted from the observed context dependent tolerance score to yield the context dependent tolerability score. In this case the more negative the score the more potentially pathogenic the variant. In general, when the CDTS is a subtraction function, a number less than zero indicates an increased health risk of a given variant. In certain embodiments, a CDTS of less than 0, −1, −2, −3, −4, −5, −6, −7, −8, −9, −10, −11, or −12 indicates an increased health risk.
ClinVar pathogenic variant (entry NM_000249.3(MLH1):c.2T>A (p.Met1Lys) AND Lynch syndrome), position 36993549 on chromosome 3 is associated with Lynch syndrome and has a context dependent tolerance score of −12.0987.
ClinVar pathogenic variant (entry NM_000492.3(CFTR):c.350G>A (p.Arg117His) AND Cystic fibrosis), position 117530975 on chromosome 7 is associated with cystic fibrosis and has a context dependent tolerance score of −4.16129.
ClinVar pathogenic variant (entry NM_006516.2(SLC2A1):c.377G>A (p.Arg126His) AND Glucose transporter type 1 deficiency syndrome), position 42930765 on chromosome 1 is associated with Glucose transporter type 1 deficiency syndrome and has a context dependent tolerance score of −9.09988.
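The subtraction form of the score used in these examples can be sketched as follows. This is an illustration under stated assumptions, not the implementation of this disclosure: the expected value for a window is taken here to be the sum of the per-heptamer probabilities to vary over the window, and the names `expected_count`, `cdts` and the toy probability table are hypothetical.

```python
def expected_count(window_sequence, kmer_probability, n=7):
    """Sum the probability to vary of every n-mer centered in the window.

    Assumption of this sketch: the expected score of a window is the sum of
    the genome-wide probabilities to vary of the n-mers it contains.
    """
    half = n // 2
    total = 0.0
    for mid in range(half, len(window_sequence) - half):
        kmer = window_sequence[mid - half:mid + half + 1]
        total += kmer_probability.get(kmer, 0.0)
    return total


def cdts(observed_count, expected):
    """Observed minus expected; negative values indicate fewer variants than expected."""
    return observed_count - expected


# Toy window with three heptamers and made-up probabilities to vary.
toy_window = "ACGTACGTA"
toy_probabilities = {"ACGTACG": 0.5, "CGTACGT": 0.25, "GTACGTA": 0.25}
toy_expected = expected_count(toy_window, toy_probabilities)
toy_score = cdts(0, toy_expected)  # no variants observed in the window
```

In this toy case one variant is expected and none is observed, so the score is negative, which matches the convention that more negative scores mark less tolerant regions.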
In certain embodiments, the methods, systems and media described herein comprise determining a protein tolerability score for at least one GSV. In certain embodiments, the methods, systems and media described herein comprise determining a protein tolerability score for a plurality of GSVs. The concept of determining a protein tolerability score is captured in
In certain embodiments, the protein family is any family of proteins that exhibit an evolutionary relationship, such as kinases. In certain embodiments, the protein family is any family of proteins that exhibit an evolutionary relationship and possess at least 95% similarity. In certain embodiments, the protein family is any family of proteins that exhibit an evolutionary relationship and possess at least 90% similarity. In certain embodiments, the protein family is any family of proteins that exhibit an evolutionary relationship and possess at least 85% similarity. In certain embodiments, the protein family is any family of proteins that exhibit an evolutionary relationship and possess at least 80% similarity. In certain embodiments, the protein family is any family of proteins that exhibit an evolutionary relationship and possess at least 75% similarity. In certain embodiments, the protein family is any family of proteins that exhibit an evolutionary relationship and possess at least 70% similarity. In certain embodiments, a protein tolerability score that is below 0.1 indicates an increase in the genomic health risk for a given GSV. In certain embodiments, a protein tolerability score that is below 0.05 indicates an increase in the genomic health risk for a given GSV. In certain embodiments, a protein tolerability score that is below 0.01 indicates an increase in the genomic health risk for a given GSV. In certain embodiments, a protein tolerability score that is below 0.005 indicates an increase in the genomic health risk for a given GSV.
There is an established relationship between functional units and sequence conservation. Regions that are both functional and conserved are deemed essential for biology. Disclosed herein are methods of using the regional score to enable the identification, and targeting for analysis and sequencing, of those parts of the human genome that are most functionally relevant and, thus, most relevant for health.
The functional genome comprises regions that are known to have a biological role and share properties that assimilate them to probable functional units, despite being poorly annotated.
The methods of this disclosure can be used to develop a functional genomic assay. This functional genomic assay can integrate any of the methods described herein, including a context dependent tolerance score. The functional genomic assay comprises a step of obtaining a nucleic acid sequence from a biological sample from an individual; and determining a presence of at least one genomic sequence variant in a region that is highly conserved; wherein the region that is highly conserved is a region wherein an observed context dependent tolerance score is greater than an expected context dependent tolerance score, wherein the expected context dependent tolerance score is the overall probability to vary of a unique sequence of n-nucleotides in length in a certain region of x nucleotides in length in a plurality of genomes, and the observed context dependent tolerance score is a number of genomic sequence variants in a certain region of x nucleotides in length actually observed and fixed in the plurality of genomes as a function of a length of the region. In a certain instance, the at least one genomic sequence variant is in a non-coding region.
Suitable biological samples can comprise oral swabs, whole-blood samples, peripheral blood mononuclear cells obtained from whole blood, plasma samples, serum samples, biopsy samples (both normal and malignant tissue), semen samples, and fecal/stool samples. Nucleic acids can be isolated from these samples using methods well known in the art, and appropriate nucleic acids for determining genomic sequence variants can comprise RNA, mRNA, or genomic DNA (including circulating cell-free DNA derived from nuclear DNA). In certain instances, the DNA does not comprise mitochondrial DNA or DNA derived from sex-chromosomes.
The step of determining a presence of at least one genomic sequence variant in a region that is highly conserved can be greatly expanded. In some cases, greater than 10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 200, 300, 400, 500, 600, 700, 800, 900, or 1,000 genomic sequence variants can be determined in greater than 2, 3, 4, 5, 6, 7, 8, 9, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 200, 300, 400, 500, 600, 700, 800, 900, or 1,000 highly conserved regions. In some cases, genomic sequence variants can be determined in greater than 10,000; 20,000; 30,000; 40,000; 50,000; 60,000; 70,000; 80,000; 90,000; or 100,000 highly conserved regions. In some cases, genomic sequence variants can be determined in the most highly conserved 0.1%, 1%, 2%, 3%, 4%, 5%, 6%, 7%, 8%, 9%, or 10% of regions of the genome as determined by the method herein or the context dependent tolerability score. A list of exemplary highly conserved regions corresponding to the most conserved 1% of genomic regions is shown in Table 5 (49523-703-201-TABLES.txt) submitted in text format with the instant application. Listed is the human chromosome number and the range of coordinates from X to X (e.g., chr1 902440 903230). Coordinates given are with regard to the Genome Reference Consortium GRCh38 build. Any one or more of these genomic regions are considered highly conserved for the purposes of the functional genomic assay detailed herein.
The sequences can be determined using any method known in the art that is sufficiently high throughput to enrich and identify a plurality of genomic sequence variants, such as, for example, next-generation sequencing (e.g., sequencing by synthesis, ion-semiconductor sequencing, or single molecule real-time sequencing), nucleotide array, massively-multiplex PCR, molecular inversion probes, padlock probes, or connector inversion probes. In certain instances, the step of obtaining a nucleic acid sequence from a biological sample comprises receiving nucleotide sequence data from a third party, including commercial third parties such as 23andMe. Additionally, the sequences may be received as raw data or as pre-called variants in a variant call format (.vcf) file. In certain instances, greater than 10; 100; 1,000; 10,000; 100,000; 1,000,000; 2,000,000; or 3,000,000 GSVs, including increments therein, can be determined.
The genomic sequence variants (GSVs) determined include both germline and somatic mutations. For example, determining somatic GSVs from a biopsy sample, when compared to a normal germline control sample, can help to identify regions that are causative and contribute to an individual's malignancy, allowing for rational selection of a treatment option. This treatment option can comprise specific drugs that target specific pathways or modalities that are associated with particular genomic mutations. The advantage of this functional genomic assay is that no previous knowledge concerning the potential pathogenicity of a particular locus is needed. The genomic sequence variant can include SNPs, indels, translocations, repetitions, or copy number variations.
The pathogenicity of a GSV can be determined with respect to a candidate or known disease associated gene. In certain aspects, the GSV can be within 2 megabases, 1 megabase, 1 kilobase, 200 base pairs, or 100 base pairs of a genomic feature of a known disease associated gene, such as a splice acceptor site, splice donor site, transcriptional start site, or promoter or enhancer region.
Additional advantages of the functional genomic assay are that it is amenable to simultaneous analysis of GSVs without any pre-annotation. In certain instances, greater than 10; 100; 1,000; 10,000; 100,000; 1,000,000; 2,000,000; or 3,000,000 GSVs, including increments therein, can be analyzed without any appreciable additional cost from computing sources used.
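As an illustration of how a region list in the "chromosome start end" layout could be queried, the following sketch tests whether a variant position falls inside a listed region. It assumes whitespace-separated lines, inclusive coordinates and non-overlapping regions; the function names and the second toy region are hypothetical, and only the first region is the example quoted above.

```python
from bisect import bisect_right
from collections import defaultdict


def load_regions(lines):
    """Parse 'chrom start end' lines into per-chromosome sorted interval lists."""
    regions = defaultdict(list)
    for line in lines:
        fields = line.split()
        if len(fields) != 3:
            continue  # skip blank or malformed lines
        regions[fields[0]].append((int(fields[1]), int(fields[2])))
    return {chrom: sorted(intervals) for chrom, intervals in regions.items()}


def in_conserved_region(regions, chrom, position):
    """True if position lies in a listed region (inclusive, non-overlapping regions assumed)."""
    intervals = regions.get(chrom, [])
    starts = [start for start, _ in intervals]
    idx = bisect_right(starts, position) - 1  # last interval starting at or before position
    return idx >= 0 and position <= intervals[idx][1]


toy_lines = [
    "chr1 902440 903230",    # example region quoted in the text
    "chr1 1000000 1000100",  # hypothetical region for illustration only
]
toy_regions = load_regions(toy_lines)
```

For large variant sets the `starts` list would normally be built once per chromosome rather than on every call; it is rebuilt here only to keep the sketch short.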
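The comparison of a biopsy sample against a germline control can be pictured as a set difference over variant calls. The sketch below assumes each call is represented as a (chromosome, position, reference allele, alternate allele) tuple; that representation, the function name and the toy calls are illustrative assumptions only.

```python
def somatic_variants(biopsy_calls, germline_calls):
    """Return calls present in the biopsy sample but absent from the germline control."""
    return set(biopsy_calls) - set(germline_calls)


# Toy calls with made-up coordinates.
toy_biopsy = {("chr1", 100, "A", "T"), ("chr1", 250, "G", "C"), ("chr2", 75, "C", "T")}
toy_germline = {("chr1", 100, "A", "T")}
toy_somatic = somatic_variants(toy_biopsy, toy_germline)
```

The remaining calls are the candidate somatic GSVs that would then be scored like any other variant.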
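The distance criterion can be sketched as a simple window test. The example assumes that the variant and the features lie on the same chromosome and that each feature is summarized by a single coordinate; the function name, feature names and positions are hypothetical.

```python
def features_within(variant_position, features, max_distance):
    """Return the names of features located within max_distance bases of the variant.

    features maps a feature name to its position on the same chromosome.
    """
    return sorted(
        name
        for name, position in features.items()
        if abs(variant_position - position) <= max_distance
    )


# Toy feature table with made-up coordinates.
toy_features = {"splice_donor_site": 1000, "transcriptional_start_site": 5000}
nearby = features_within(1080, toy_features, max_distance=100)
```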
For the described functional genomic assay, the unique sequence of n-nucleotides in length can be any number larger than 2 and smaller than 20. In certain embodiments, n is equal to 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, or 20.
For the described functional genomic assay, the certain region of x nucleotides in length can be greater than 10, 20, 20, 100, 200, 300, 400, 500, 600, 700, 800, 900, or 1,000 base pairs, including increments therein. The certain region of x nucleotides in length can be less than 20, 20, 100, 200, 300, 400, 500, 600, 700, 800, 900, or 1,000 base pairs, including increments therein. In certain embodiments, the certain region of x nucleotides in length can be between 10 and 10,000 nucleotides in length; between 10 and 1,000 nucleotides in length; between 10 and 500 nucleotides in length; between 10 and 100 nucleotides in length; between 100 and 200 nucleotides in length; between 120 and 180 nucleotides in length; between 140 and 160 nucleotides in length; between 300 and 700 nucleotides in length; or between 400 and 600 nucleotides in length. The region can be any length that is able to be practically analyzed using computer aided means including lengths in excess of 1,000; 5,000; 10,000; 50,000; or 100,000 nucleotides, including increments therein.
The probability to vary is calculated from a plurality of genomes. In some instances, the plurality of genomes is greater than 10,000; 20,000; 30,000; 40,000; 50,000; 60,000; 70,000; 80,000; 90,000; 100,000; 200,000; 300,000; 400,000; 500,000; 600,000; 700,000; 800,000; 900,000; or 1,000,000 individual genomes, including increments therein. The probability to vary can be calculated from the allele frequency of all known alleles located in a certain region of x nucleotides in length, and optionally normalized to the length of the certain region of x nucleotides in length.
In certain instances, the functional genomic assay comprises determining the presence of a genomic sequence variant of any 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 20, 30, 40, 50, 60, 70, 80, 90, 100, 200, 300, 400, 500, 600, 700, 800, 900 or more variants, including increments therein, in an individual given in Table 1. In certain instances, the functional genomic assay comprises determining the presence of a genomic sequence variant of all variants given in Table 1. In certain instances, the functional genomic assay comprises determining the presence of a genomic sequence variant of any 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 20, 30, 40, 50, 60, 70, 80, 90, 100, 200 or more variants, including increments therein, in an individual given in Table 2. In certain instances, the functional genomic assay comprises determining the presence of a genomic sequence variant of all variants given in Table 2. In certain instances, the functional genomic assay comprises determining the presence of a genomic sequence variant of any 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 20, 30, 40, 50, 60, 70, 80, 90, 100, 110 or more variants, including increments therein, in an individual given in Table 3. In certain instances, the functional genomic assay comprises determining the presence of a genomic sequence variant of all variants given in Table 3. In certain instances, the functional genomic assay comprises determining the presence of a genomic sequence variant of any of 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 20, 30, 40 or more variants, including increments therein, in an individual given in Table 4.
The functional genomic assay described is useful for determining a likelihood of a subsymptomatic disease, such as a cancer, a metabolic disorder, a physiological disorder, or an autoimmune or inflammatory disorder. In addition, the assay is useful as a predictive measure to determine the likelihood of developing a disease, such as a cancer, a metabolic disorder, a physiological disorder, or an autoimmune or inflammatory disorder. This functional genomic assay can be used as a prognostic indicator for treatment and be performed multiple times on the same individual to guide treatment. These methods can be applied to a biopsy or a cell-free nucleic acid isolated from the plasma, for example, to determine a prognosis of a cancer or to determine the malignant potential of a biopsy. In a certain aspect, the cell-free nucleic acid is an mRNA or DNA. The DNA can be derived from a linear chromosome in the nucleus of a cell and in certain aspects is not derived from mitochondria or a sex-chromosome. The functional genomic assay can assign a certain GSV as high risk when the observed context dependent tolerance score is 5%, 10%, 20%, 25%, 30%, 35%, 40%, 45%, 50%, 60%, 70%, 80%, 90%, 100%, 150%, or 200%, including increments therein, greater than an expected context dependent tolerance score for that GSV. In addition, the functional genomic assay can determine a risk for a plurality of GSVs, in some cases greater than 3, 4, 5, 6, 7, 8, 9, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 200, 300, 400, 500, 600, 700, 800, 900, or 1,000, including increments therein. The risk can be averaged or summed for the specific GSVs. The GSV can be in a certain part of the genome within 100 bp, 500 bp, 1 kb, 5 kb, or 10 kb, including increments therein, of a functional motif such as a splice acceptor site, splice donor site, transcriptional start site, a promoter, or enhancer element.
In certain cases, these functional motifs are associated with a gene known to play a role in cancer, such as a receptor tyrosine kinase (e.g., epidermal growth factor receptor (EGFR), platelet-derived growth factor receptor (PDGFR), vascular endothelial growth factor receptor (VEGFR), HER2/neu, and ROR1); a cytoplasmic tyrosine kinase (e.g., Src-family, Syk-ZAP-70 family, and BTK family of tyrosine kinases, and BCR/ABL); a cytoplasmic serine/threonine kinase or its regulatory subunits (e.g., Raf kinase and cyclin-dependent kinases); a regulatory GTPase (e.g., a Ras gene); a transcription factor (e.g., myc); or a tumor suppressor gene (e.g., p53, BRCA1, BRCA2, RB, PTEN, pVHL, APC, CD95, ST5, YPEL3, ST7, and ST14).
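One possible reading of this calculation is sketched below. The disclosure states only that the probability to vary is calculated from allele frequencies in the region, so the choice made here (summing the allele frequencies of all variants in the region and optionally dividing by the region length) is an assumption for illustration; the function name and toy data are hypothetical.

```python
def probability_to_vary(allele_frequencies, region_start, region_end, normalize=True):
    """Sum allele frequencies of variants in [region_start, region_end].

    allele_frequencies maps a position to the allele frequency observed there.
    Assumption of this sketch: the regional value is a plain sum, optionally
    divided by the region length in nucleotides.
    """
    total = sum(
        frequency
        for position, frequency in allele_frequencies.items()
        if region_start <= position <= region_end
    )
    if normalize:
        total /= region_end - region_start + 1
    return total


# Toy allele frequencies keyed by position; the last one lies outside the region.
toy_frequencies = {100: 0.5, 105: 0.25, 300: 0.1}
toy_raw = probability_to_vary(toy_frequencies, 100, 199, normalize=False)
toy_normalized = probability_to_vary(toy_frequencies, 100, 199)
```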
In certain embodiments, any of a tolerability score, an n-variant score, a context dependent tolerance score, and a protein tolerability score can be pre-determined. In certain embodiments, a health care professional compares any one or more GSVs to a list, a spreadsheet or file with pre-determined health metrics. In certain embodiments, any of the health metrics are pre-determined for each nucleotide in the genome and accessible through a software program, on-line service or portal.
In certain embodiments, described herein, are systems to identify the relative genomic health risk of a genomic sequence variant of an individual comprising: a DNA sequence for the individual; a system to determine at least one genomic sequence variant in the DNA sequence of the individual; wherein the genomic sequence variant is a difference of at least one nucleotide in the individual when compared to a corresponding position in a reference genome; and a system to compare the at least one genomic sequence variant of the individual to a tolerability score at a corresponding position within x-nucleotides of a genetic element, wherein the tolerability score comprises a function of a nucleotide variation score and an allele proportion score, wherein the nucleotide variation score is the variance observed in a plurality of genomes at the corresponding position, and the allele proportion score is the proportion of genomic variants that exceeds an incidence of 0.0001 in the plurality of genomes at the corresponding position.
In certain embodiments, described herein, are systems to identify the relative genomic health risk of a genomic sequence variant of an individual comprising: a DNA sequence for the individual; a system to determine at least one genomic sequence variant in the DNA sequence of the individual; wherein the genomic sequence variant is a difference of at least one nucleotide in the individual when compared to a corresponding position in a reference genome in a unique sequence of n nucleotides in length; and a system to determine an n-variant score for the at least one genomic sequence variant, wherein the n-variant score comprises a function of a count score and an allele frequency score, wherein the count score is the ratio of the number of times any genomic sequence variant occurs in a unique sequence of n-nucleotides in length in the plurality of genomes to the number of times that the unique sequence of n-nucleotides in length occurs in the reference genome, and the allele frequency score is the frequency of the proportion of genomic sequence variants that are fixed in the population, at an allele frequency greater than 0.0001 in the plurality of genomes.
In certain embodiments, described herein, are systems to identify the relative genomic health risk of a genomic sequence variant of an individual comprising: a DNA sequence for the individual; a system to determine at least one genomic sequence variant in a DNA sequence of the individual; wherein the genomic sequence variant is a difference of at least one nucleotide in the individual when compared to a corresponding position in a reference genome; and a system to determine if the at least one genomic sequence variant occurs within a region with a low context dependent tolerance score, wherein the context dependent tolerance score comprises a function of an observed context dependent tolerance score and an expected context dependent tolerance score, wherein the expected context dependent tolerance score is the overall probability to vary of a unique sequence of n-nucleotides in length in a certain region of x nucleotides in length actually observed and fixed in a plurality of genomes, and the observed context dependent tolerance score is a number of genomic sequence variants in a certain region of x nucleotides in length actually observed in the plurality of genomes.
In certain embodiments, described herein, are systems to identify the relative genomic health risk of a genomic sequence variant of an individual comprising: a DNA sequence for the individual; a system to determine at least one genomic sequence variant in a DNA sequence of the individual; wherein the genomic sequence variant is a difference of at least one nucleotide in the individual when compared to a corresponding position in a reference genome; a system to determine if the at least one genomic sequence variant causes an amino acid variant in an expressed protein, wherein the amino acid variant is a difference of at least one amino acid when compared to a reference genome; and a system to compare the amino acid variant to a protein tolerability score at a corresponding position within a defined protein class, wherein the protein tolerability score comprises a diversity score, missense score, and a protein allele frequency score, wherein the diversity score is a normalized diversity metric, the missense score is the variance observed in a plurality of genomes at the corresponding position which leads to an amino acid mutation, and the protein allele frequency score is the proportion of genomic variants that leads to an amino acid variant that exceeds an incidence of 0.0001 in the plurality of genomes at the corresponding position.
The following examples are illustrative and not meant to limit this disclosure in any way.
In an effort to evaluate the capabilities of whole human genome sequencing on the HiseqX platform, we first measured accuracy and generated quality standards by replica analyses of the reference genome NA12878 from the CEPH Utah reference collection (also known as “Genome-In-A-Bottle”, GiaB). We then assessed these quality standards across 10,545 human genomes sequenced to high depth. This allowed for the development of a reliable representation of human single nucleotide variation, and the reporting of clinically relevant single nucleotide variants (SNV) using new high throughput sequencing technology.
We first assessed the extent of genome coverage and representation using the data from 325 technical replicates of NA12878 at different depth of read coverage. We evaluated the accuracy and precision of the laboratory and computational processes to define quality metrics that might be applied to other samples to ensure consistent data quality. At the target mean coverage of 30×, 95% of the NA12878 genome is covered at least at 10×. In contrast,
We next assessed reproducibility on variant calling for the whole genome by restricting the analysis to a set of 200 samples of NA12878 that were sequenced at a mean coverage of 30× to 40×. Due to manufacturer's changes in clustering reagents, we analyzed 100 samples prepared with v1 (original kit) and 100 with v2. In
The canonical NA12878 Genome-In-A-Bottle call set (GiaB v2.19) defines a set of high confidence regions that corresponds to approximately 70% of the total genome. The data for this GiaB high confidence region are derived from 11 technologies: BioNano Genomics, Complete Genomics, Ion Proton, Oxford Nanopore, Pacific Biosciences, SOLiD, 10× Genomics GemCode WGS, and Illumina paired-end, mate-pair, and synthetic long reads. Regions of low complexity (e.g., centromeres, telomeres and repetitive regions) as well as other regions that have proven challenging for sequencing, alignment and variant calling methods are excluded from the GiaB high confidence region. The above analysis of reproducibility addressed the whole genome of NA12878, both in the GiaB high confidence region and beyond those boundaries. We thus used the reproducibility metrics to define regions within GiaB with high (≥90%) versus low (<90%) reproducibility at each position. The reproducibility metrics include the concordance in calls and missingness (defined in this disclosure as a measure of no-PASS calls).
We next defined an extended confidence region (ECR) that includes the high confidence GiaB regions and the highly reproducible regions extending beyond the boundaries of GiaB. We also defined a low confidence region to include the regions within and beyond the boundaries of GiaB that could not be sequenced reliably with the technology in use.
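The per-position reproducibility classification can be illustrated with a short sketch. It assumes that the calls at one position across replicates are given as genotype strings, with `None` standing for a no-PASS call, and that concordance is the share of replicates agreeing with the most frequent PASS call; these conventions and all names are assumptions of the sketch, not the pipeline used in the study.

```python
from collections import Counter


def reproducibility(calls):
    """Return (concordance, missingness) for the calls at one position.

    calls holds one entry per replicate: a genotype string, or None for a
    no-PASS call. Concordance is the fraction of all replicates that agree
    with the most common PASS call (an assumption of this sketch).
    """
    if not calls:
        raise ValueError("at least one replicate call is required")
    passed = [call for call in calls if call is not None]
    missingness = (len(calls) - len(passed)) / len(calls)
    if not passed:
        return 0.0, missingness
    most_common_count = Counter(passed).most_common(1)[0][1]
    return most_common_count / len(calls), missingness


def confidence_class(calls, threshold=0.9):
    """Label a position 'high' when concordance reaches the threshold, else 'low'."""
    concordance, _ = reproducibility(calls)
    return "high" if concordance >= threshold else "low"


# Toy replicate calls: nine concordant heterozygous calls and one no-PASS call.
toy_calls = ["0/1"] * 9 + [None]
toy_metrics = reproducibility(toy_calls)
```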
Creating Metaprofiles that Capture Human Variation
The volume of data presented here provides unprecedented detail on the pattern of sequence conservation and SNVs across the human genome. In
In order to explore the pattern of variation in the human genome in depth, we built “SNV metaprofiles” by collapsing all members of a family of genomic elements into a single alignment. Metaprofiles of protein-coding genes used GENCODE annotated TSS (n=88,046), start codons (n=21,147), splice donor and acceptor sites (n=137,079 and 133,702, respectively), stop codons (n=37,742) and polyadenylation sites (n=88,103).
A second example of functional inference from patterns of variation is provided in
To assess the value of a tolerability score for scoring of functional severity of GSV, we established a tolerance score
However, the assignment of pathogenicity or functional severity can be significantly biased by ascertainment (e.g., “it is at a splice site, it should then be a pathogenic variant”). In addition, variants are still observed at sites with very low metaprofile tolerance scores. In
Much of the non-reference sequence is shared with hominins. In
CDTS Defines Pathogenic Sequence Variance Better than Methods that Use Inter Species Conservation
Traditionally, conservation in the genome has been identified by comparison among species: if a segment of the genome is conserved across many species, it is assumed to be important. To compare the conserved human genomic regions defined by the context-dependent tolerance score (CDTS) with interspecies conservation, we assessed the overlap between regions conserved according to CDTS (i.e., context-dependent conservation in the current human population) and regions conserved according to Genomic Evolutionary Rate Profiling (GERP) across 34 mammalian species (i.e., interspecies conservation). From the 1st to the 10th percentile levels, the overlap between the two scores is limited and heavily enriched for protein-coding regions.
The analysis used deep-sequencing genome data from 11,257 individuals. Analysis was limited to the high confidence region of the genome (as defined in Telenti, A. et al. “Deep sequencing of 10,000 human genomes,” Proc Natl Acad Sci USA), a region covering approximately 84% of the genome and closely overlapping the high confidence region described in the most recent release of Genome in a Bottle (GiaB v3.2).
Metaprofiles comprise the massive alignment of elements of the same nature in the genome. These genomic elements can be chosen based on their structure (e.g., exonic, intronic, intergenic, etc.), function (e.g., transcription factor binding sites, protein domains, etc.) or sequence composition (k-mers). Genetic diversity is assessed at each nucleotide position of the alignment of genomic elements by monitoring both the occurrence of variation in the population (reported as a binary value: presence or absence) and the allelic frequency. More specifically, 3 metrics are computed at each position: (i) the percent of elements with SNVs, (ii) the percent of SNVs with an allelic frequency higher than 0.001 or 0.0001, and (iii) the product of both scores. Each score is calculated using between 10^6 and 10^10 values, a number given by the number of aligned elements present in the genome multiplied by the number of genomes sequenced; the metaprofile strategy therefore massively increases the power to compute the variation rate at nucleotide resolution with high precision.
A priori knowledge of genomic landmarks is required for constructing metaprofiles based on similarity in structure or function. To remove potential biases introduced by this a priori knowledge, we developed a strategy to construct metaprofiles based on all possible heptameric sequences found in the genome (4^7 = 16,384) and scored the middle nucleotide of each of these sequences as described above. As every nucleotide in the genome is part of a heptamer, every single position can be assigned the corresponding genome-wide computed scores. Scores are computed separately for autosomes and chromosome X. To account for the difference in effective population size over history for chromosome X, the allelic frequency threshold is adjusted by a factor of 0.75. In a certain aspect, indels are not used to compute the score.
When testing the score on smaller study populations, the allelic frequency threshold was adjusted to retain only non-singleton positions.
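The heptamer metaprofile described above can be illustrated with a minimal sketch. It assumes a toy reference string in place of a genome and a dictionary of SNV allelic frequencies keyed by position; the function name `heptamer_scores` and its arguments are hypothetical, and fractions are used instead of percents. It is not the implementation used in this disclosure.

```python
from collections import defaultdict

def heptamer_scores(reference, snv_af, af_threshold=0.001):
    """Score the middle nucleotide of every heptamer in `reference`.

    reference: uppercase DNA string (toy stand-in for a genome).
    snv_af: dict mapping 0-based position -> allelic frequency of the SNV there.
    Returns {heptamer: (frac_with_snv, frac_snv_above_threshold, product)}.
    """
    occurrences = defaultdict(int)  # times each heptamer is seen
    varied = defaultdict(int)       # times its middle base carries an SNV
    frequent = defaultdict(int)     # times that SNV exceeds the AF threshold

    for start in range(len(reference) - 6):
        kmer = reference[start:start + 7]
        if set(kmer) - set("ACGT"):
            continue  # skip heptamers containing ambiguous bases
        middle = start + 3
        occurrences[kmer] += 1
        if middle in snv_af:
            varied[kmer] += 1
            if snv_af[middle] > af_threshold:
                frequent[kmer] += 1

    scores = {}
    for kmer, n in occurrences.items():
        frac_snv = varied[kmer] / n
        frac_freq = frequent[kmer] / varied[kmer] if varied[kmer] else 0.0
        scores[kmer] = (frac_snv, frac_freq, frac_snv * frac_freq)
    return scores
```

In this sketch the third value of each tuple plays the role of the heptamer tolerance score, and every genomic position would inherit the score of the heptamer centered on it.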
The variation rates computed through heptamer metaprofiles reflect the chemical propensity of a nucleotide to vary depending on its surrounding context and can be interpreted as an expectation of variation. We reasoned that functional regions would vary significantly less than expected, as assessed genome-wide through the heptamer tolerance score. To evaluate the departure from expectation, we compared the observed and expected tolerance scores obtained in defined genomic regions.
The observed regional tolerance score is the number of SNVs present at an allelic frequency higher than 0.001 in the studied population in a defined region. The expected regional tolerance score is the sum of the heptamer tolerance scores in the same region.
The difference between the observed and expected scores is referred to hereafter as the context-dependent tolerance score (CDTS). Genomic regions are then ranked based on their CDTS. Regions with the lowest rank (1st percentile) have the lowest context-dependent tolerance to variation, and regions with the highest rank (100th percentile) have the highest context-dependent tolerance to variation.
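As a hedged illustration of the observed-minus-expected calculation and the percentile ranking, the sketch below uses hypothetical helper names (`cdts`, `percentile_ranks`) and toy inputs. Ties are broken by input order, which is an assumption and is not specified in the text.

```python
def cdts(observed_snv_count, heptamer_scores_in_region):
    """Observed regional score minus expected regional score.

    observed_snv_count: number of SNVs above the AF threshold in the region.
    heptamer_scores_in_region: per-nucleotide heptamer tolerance scores.
    """
    expected = sum(heptamer_scores_in_region)
    return observed_snv_count - expected

def percentile_ranks(scores):
    """Rank regions from lowest CDTS (percentile 1) to highest (percentile 100)."""
    n = len(scores)
    order = sorted(range(n), key=lambda i: scores[i])
    ranks = [0] * n
    for position, idx in enumerate(order):
        # Integer ceiling of (position + 1) / n * 100, giving values in 1..100.
        ranks[idx] = -(-(position + 1) * 100 // n)
    return ranks
```

A region with fewer observed SNVs than the heptamer context predicts gets a negative CDTS and therefore a low percentile.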
To avoid any use of a priori knowledge and any biases due to the differing size of the regions (i.e., more power to detect a difference between observation and expectation in longer elements), the genome was divided, irrespective of genomic annotations, into sliding windows of the same size. The window size was 1050 bp, sliding every 50 bp, and the CDTS calculated across the 1050 bp window was attributed to the middle 50 bp bin. Only regions with at least 90% of the nucleotides in the 1050 bp window present in high confidence regions were used.
To evaluate the element distribution across those size-defined windows, we built a new annotation model by combining sources of annotation from GenCode (v.23) and ENCODE (annotated features and multicell regulatory elements, Ensembl v84 Regulatory Build). To avoid conflicting and overlapping annotations from the two sources, which would cause the score of the same region to be used multiple times, we prioritized element annotation as follows, such that only the highest order element would be used: exonic, then multicell, then intronic and then annotated features. We assessed the element composition of the different percentiles, using the above-mentioned combined GenCode/ENCODE annotation, by computing the number of nucleotides of an element in each percentile.
The following categories were used: “Exon—protein coding”, referring to nucleotides in exonic regions contained in protein-coding genes (including UTR) as annotated in GenCode; “Exon—non-coding”, referring to nucleotides in exonic regions contained in non-coding RNAs (e.g., snRNA, snoRNA, lincRNA, etc.) as annotated in GenCode; “Intron”, referring to nucleotides in intronic regions contained in either protein-coding or non-coding genes as annotated in GenCode; “Promoter”, “Promoter Flanking” and “Enhancer”, referring to the nucleotides contained in the respective elements as annotated in ENCODE multicell regulatory elements; “H3K9me3” and “H3K27me3”, referring to the nucleotides overlapping with (and only with) the respective elements as annotated in ENCODE annotated features; “Multiple Histone marks”, referring to the nucleotides overlapping with a combination of histone marks, as annotated in ENCODE annotated features; “Others”, referring to the remaining nucleotides with ENCODE annotated features that did not individually cover a substantial part of the genome, which notably encompasses transcription factor binding sites as well as other regulatory element combinations (e.g., nucleotides annotated as both Promoter and Enhancer); and “Unannotated”, referring to nucleotides in regions that had no annotated features in either GenCode or ENCODE.
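The windowing scheme (1050 bp windows, 50 bp step, score attributed to the middle 50 bp bin, at least 90% high confidence coverage) can be sketched as follows. The function name `sliding_windows` and the per-nucleotide boolean mask are assumptions made for illustration; a real pipeline would more likely work on interval files.

```python
def sliding_windows(high_conf, window=1050, step=50, min_frac=0.90):
    """Yield (bin_start, bin_end, window_start, window_end), 0-based half-open.

    high_conf: list of booleans, one per nucleotide of a chromosome,
               True where the position lies in a high confidence region.
    """
    flank = (window - step) // 2  # 500 bp on each side of the middle bin

    # Prefix sums allow the covered count of any window to be read in O(1).
    prefix = [0]
    for flag in high_conf:
        prefix.append(prefix[-1] + (1 if flag else 0))

    for w_start in range(0, len(high_conf) - window + 1, step):
        w_end = w_start + window
        covered = prefix[w_end] - prefix[w_start]
        if covered >= min_frac * window:
            bin_start = w_start + flank
            yield (bin_start, bin_start + step, w_start, w_end)
```

Because the step equals the bin width, consecutive windows tile the chromosome with non-overlapping 50 bp bins, each carrying the CDTS of its surrounding 1050 bp.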
We used gene essentiality (pLI score from ExAC2) as an orthogonal proxy for functionality to assess whether genomic bins annotated with the same genomic element have different biological importance depending on their CDTS ranking. Each genomic bin present within 10 kb of a gene is attributed the essentiality score of its closest or overlapping gene, with the exception of genomic bins annotated as “Promoters,” which have the mandatory constraint of being upstream of the closest gene. The median essentiality score is then assessed per genomic element annotation and per percentile slice. To assess distal CDTS coordination, we used an external chromatin loop dataset. The loop and anchor coordinates were extracted from a previous Hi-C experiment. The median CDTS percentile is computed for every anchor region. To pair distal enhancers with their hypothetically associated genes, for each loop we extracted the genes and enhancers that were the closest to both loop-anchor points. We then kept only meaningful pairs, where an enhancer was annotated in the upstream anchor and a gene in the downstream anchor, or vice versa. In addition, the 5′ end of the gene had to be facing the loop. A maximum of one pair per gene was retained; in cases of several possible pairs, the pair retained was the one with the smallest total distance between the enhancer and the gene after subtracting the loop size. We computed the median CDTS of the enhancers associated in such a distal gene-enhancer pair and compared it to the essentiality score of the associated gene.
We used Genomic Evolutionary Rate Profiling (GERP++) to capture the interspecies conservation. GERP++ provides conservation scores through the quantification of position specific constraint in multiple species alignments. We calculated and attributed the mean GERP scores to the same set of 50 bp bins as mentioned in the section “Region definition and annotation.” Bins were ranked based on the GERP score from the most (percentile 1) to the least conserved (percentile 100). Bins without GERP score, due to insufficient multiple species alignments in the region, were not considered in the ranking process.
A surprising result emerges from the mapping of all human conserved regions as represented by CDTS. The genome structure that is revealed is one of coordination of genes with their respective regulatory regions. For example, a very important gene (“essential gene”) will use a very conserved promoter, cis enhancer, distal regulatory elements and other regulatory signals. These new data provide an enhanced ability to pair genes with their generally under- or un-recognized regulatory units, which is key to understanding function in health and disease. This also allows for using CDTS to identify pathogenic variants, and to build a targeted sequencing and genotyping array for diagnostics.
The description of the conserved genome raises the issue of its relevance to human disease. We assessed whether CDTS ranking was a good proxy to score functional constraint and the consequences of mutations. For this purpose, we investigated the distribution of annotated pathogenic variants across the genome.
We assessed the distribution of known annotated pathogenic variants, defined as either HGMD high DM 14 (Version: HGMD_2016_R1) or ClinVar variants consistently annotated as pathogenic or likely pathogenic and with at least 1 entry with star 1 or more15,16 (Version: ClinVarFullRelease_2016-07.xml.gz), for a total of N=130,767, by counting the number of variants present in each percentile of the genome. For variants in indel regions, the leftmost coordinate was used to establish the genomic bin in which they fell. Pathogenic variants with conflicting annotations were removed, defined here as variants having a high DM in HGMD and a consistent annotation of benign or likely benign with at least 1 entry being star 1 or more in ClinVar. The non-coding variants associated with Mendelian traits were extracted from ClinVar (copy number variants were excluded from analysis) and manually curated with a filter of >5 bp from any splice acceptor or splice donor site, and additional variants were collected by literature review 17-20.
We explored how CDTS compared to other functional predictive scores used to prioritize variants, such as CADD and Eigen. We focused on the performance of these metrics on the non-coding genome. The combination of the three metrics provides the best detection, while the three metrics used alone provide similar ranges of detection as shown in
The CDTS metric was compared to the most widely used metrics for variant prioritization: CADD (Kircher, M. et al. A general framework for estimating the relative pathogenicity of human genetic variants. Nat Genet 46, 310-5 (2014)) and Eigen (Ionita-Laza, I., McCallum, K., Xu, B. & Buxbaum, J. D. A spectral approach integrating functional genomic annotations for coding and non-coding variants. Nat Genet 48, 214-20 (2016)). A “control” set of variants relative to the previously defined pathogenic variants was created using variants from dbSNP (June 2015 release). A control variant was defined as having the “COMMON” and “G5A” tags (>5% minor allele frequency in each population and in all populations overall) and, like the tested pathogenic variant set, as not being present in an exonic region and lying more than 5 bp from any splice site. The remaining working set of non-coding pathogenic and control variants was ranked according to CDTS, CADD or Eigen non-coding scores, and the ranking was normalized from 0 to 100 (for CADD and Eigen, the PHRED scores were converted into probabilities before this step, so that for all metrics the lower the ranking, the more likely a variant is to be pathogenic). To compare the different metrics, the precision (TP/(TP+FP)) was computed at each step of the new ranking. TP are the true positives, in this case the number of pathogenic variants with a ranking ≤ threshold, and FP are the false positives, in this case the number of control variants with a ranking ≤ threshold, where the threshold can be any step in the new ranking (from 0 to 100). The precision was further normalized by the general prevalence of pathogenic variants in the set studied (Σpathogenic/(Σpathogenic+Σcontrol)). This step was done to account for the fact that not all variants were scored by the other metrics (e.g., no scores on chromosome X for Eigen, conversion conflicts from hg19 to hg38, not all indels have a CADD score, etc.). The prevalence-normalized precision provides the enrichment of a metric's pathogenic variant detection compared to random.
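The prevalence-normalized precision can be sketched as below. The text says the precision is "normalized by" the prevalence; the sketch assumes this means division, so that a value of 1 corresponds to random detection. The function name and the list-of-ranks inputs are hypothetical.

```python
def normalized_precision(pathogenic_ranks, control_ranks, threshold):
    """Precision at a rank threshold, divided by the prevalence of pathogenic variants.

    Ranks are normalized to 0..100, lower meaning more likely pathogenic.
    Returns None when no variant falls at or below the threshold.
    """
    tp = sum(1 for r in pathogenic_ranks if r <= threshold)  # true positives
    fp = sum(1 for r in control_ranks if r <= threshold)     # false positives
    if tp + fp == 0:
        return None  # precision is undefined with no calls
    precision = tp / (tp + fp)
    n_path, n_ctrl = len(pathogenic_ranks), len(control_ranks)
    prevalence = n_path / (n_path + n_ctrl)
    return precision / prevalence  # assumed normalization: enrichment over random
```

Computing the prevalence from the variants actually scored by each metric is what lets metrics with different coverage be compared on the same scale.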
We explored how CDTS compared to other functional predictive scores used to prioritize variants in the non-coding genome: Eigen, CADD, DeepSEA, GERP, FunSeq2, and LINSIGHT. To avoid the contribution of pathogenic variants in the proximity of exons, we restricted the analysis to the stringent set of 1,369 non-coding pathogenic variants that were further than 10 bp from any splice site. Eigen and CDTS had the best performance of the metrics as represented by ROC curves as shown in
The CDTS metric was compared to other metrics used for variant prioritization: CADD, Eigen, GERP, DeepSEA, LINSIGHT and FunSeq2. A control set of variants relative to the previously defined pathogenic variants (N=1,369, detailed in the above paragraph) was created using variants from dbSNP 3′ (June 2015 release). The control variants were defined as having the “COMMON” and “G5A” tags (>5% minor allele frequency in each population and in all populations overall, as well as in our own study population), being in the high confidence region1 and, like the tested pathogenic variant set, not being present in an exonic region and lying more than 10 bp from any splice site. The remaining working set of non-coding pathogenic and control variants was ranked according to CDTS, CADD, Eigen, GERP, DeepSEA, LINSIGHT or FunSeq2 scores, and the ranking was normalized from 0 to 100 (the direction of the score values was modified so that, for all metrics, a lower rank represents the pathogenic state). Of note, the CDTS ranking might differ slightly, as only variant positions (control+pathogenic) are used here. To compare the different metrics, the true positive rate (TP/(TP+FN)) and false positive rate (FP/(FP+TN)) were computed at each step of the new ranking. TP are the true positives, in this case the number of pathogenic variants with a ranking ≤ threshold; FP are the false positives, in this case the number of control variants with a ranking ≤ threshold; FN are the false negatives, in this case the number of pathogenic variants with a ranking > threshold; TN are the true negatives, in this case the number of control variants with a ranking > threshold; where the threshold can be any step in the new ranking (from 0 to 100). Given that the control set of variants (N > 5 million) is orders of magnitude bigger than the pathogenic set (N=1,369), a false positive rate of 0.01 (threshold used in
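The true positive and false positive rates defined above can be computed at every rank threshold with a short sketch like the following. The function name `roc_points` and the toy inputs are hypothetical, and both rank lists are assumed to be non-empty.

```python
def roc_points(pathogenic_ranks, control_ranks, thresholds=range(0, 101)):
    """Return (threshold, TPR, FPR) for each rank threshold.

    Ranks are normalized to 0..100, lower meaning more likely pathogenic.
    """
    points = []
    for t in thresholds:
        tp = sum(1 for r in pathogenic_ranks if r <= t)  # pathogenic, called
        fn = len(pathogenic_ranks) - tp                  # pathogenic, missed
        fp = sum(1 for r in control_ranks if r <= t)     # control, called
        tn = len(control_ranks) - fp                     # control, not called
        points.append((t, tp / (tp + fn), fp / (fp + tn)))
    return points
```

Plotting TPR against FPR over all thresholds gives the ROC curve for one metric; repeating the call per metric allows the curves to be compared.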
This example shows how metaprofiles and heptamer content analysis identify genomic elements that have been misannotated so far. In short, we investigated 3 sets of splice sites described in
Results: While the first 2 sets (present in the principal isoforms) behave similarly, the set of sites that are present only in non-principal isoforms does not show the characteristics of exon-intron junctions in terms of tolerance to variation as assessed by metaprofiling (
We assessed 6 candidate genes (POMC, LEP, LEPR, SIM1, MC4R, and PCSK1) that have previously been associated, based on existing literature, with early-onset obesity due to deficiency in the MC4R pathway. To identify new pathogenic SNVs, we started by extracting all variants from a population of unrelated individuals (N=7794) that were found in the genes or their vicinity (15 kb upstream and downstream) as well as in distal regulatory elements, as assessed by Hi-C and promoter-capture Hi-C. The criteria for an SNV to be a candidate were the following: (i) the minimum body mass index (BMI) of the individual(s) carrying the alternative allele must be ≥35; (ii) when applicable, individual(s) homozygous for the alternative allele must have a median BMI higher than the median BMI of individual(s) heterozygous for the alternative allele; (iii) the SNV must be present in the population at an allelic frequency lower than 1/100; and finally, (iv) the SNV must be “likely functional” as assessed by one or more of the following metrics: CDTS, percentile ≤2; CADD, score ≥15; Eigen or non-coding Eigen, score ≥15; GERP, score ≥5; LINSIGHT, score ≥0.8. SNVs meeting all of these criteria are kept as candidates.
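The four candidate criteria can be expressed as a simple filter. This is a sketch under stated assumptions: "when applicable" in criterion (ii) is read as "when both heterozygous and homozygous carriers are observed", metrics missing for a variant are ignored, and all function names and dictionary keys are hypothetical.

```python
from statistics import median

# Thresholds for criterion (iv); keys are hypothetical labels for each metric.
FUNCTIONAL_CHECKS = {
    "cdts_percentile": lambda v: v <= 2,
    "cadd": lambda v: v >= 15,
    "eigen": lambda v: v >= 15,
    "eigen_noncoding": lambda v: v >= 15,
    "gerp": lambda v: v >= 5,
    "linsight": lambda v: v >= 0.8,
}

def is_likely_functional(scores):
    """True if at least one available metric passes its threshold."""
    return any(
        name in scores and passes(scores[name])
        for name, passes in FUNCTIONAL_CHECKS.items()
    )

def is_candidate(het_bmis, hom_bmis, allele_frequency, scores):
    """Apply criteria (i) to (iv) to one SNV.

    het_bmis / hom_bmis: BMI values of heterozygous / homozygous carriers.
    """
    carriers = het_bmis + hom_bmis
    if not carriers or min(carriers) < 35:
        return False  # (i) every carrier must have BMI >= 35
    if het_bmis and hom_bmis and median(hom_bmis) <= median(het_bmis):
        return False  # (ii) assumed to apply only when both genotypes occur
    if allele_frequency >= 0.01:
        return False  # (iii) allelic frequency must be below 1/100
    return is_likely_functional(scores)  # (iv)
```

For example, `is_candidate([36.0, 41.2], [], 0.001, {"cdts_percentile": 1.5})` passes all four checks in this sketch.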
Reports Generated and Delivered to Health Care Professionals and/or Consumers
Referring to
The digital processing device 1801 can be operatively coupled to a computer network (“network”) 1830 with the aid of the communication interface 1820. The network 1830 can be the Internet, an internet and/or extranet, or an intranet and/or extranet that is in communication with the Internet. The network 1830 in some cases is a telecommunication and/or data network. The network 1830 can include one or more computer servers, which can enable distributed computing, such as cloud computing. The network 1830, in some cases with the aid of the device 1801, can implement a peer-to-peer network, which may enable devices coupled to the device 1801 to behave as a client or a server. Reports can be delivered from, for example, a sequencing lab to a health care provider or consumer over the network 1830, or alternatively through the mail or a secure download site such as an FTP site.
While preferred embodiments of the present invention have been shown and described herein, it will be obvious to those skilled in the art that such embodiments are provided by way of example only. Numerous variations, changes, and substitutions will now occur to those skilled in the art without departing from the invention. It should be understood that various alternatives to the embodiments of the invention described herein may be employed in practicing the invention.
This application claims priority to U.S. Provisional Application Ser. No. 62/333,653, filed on May 9, 2016, and U.S. Provisional Application Ser. No. 62/410,783, filed on Oct. 20, 2016, each of which is incorporated herein in its entirety. The instant application contains a Table, which has been submitted in ASCII format via EFS-Web and is hereby incorporated by reference in its entirety. Said ASCII copy, created on May 5, 2017, is named 49523-703-201-TABLES.txt and is 2,508,219 bytes in size.
LENGTHY TABLES
The patent application contains a lengthy table section. A copy of the table is available in electronic form from the USPTO web site (http://seqdata.uspto.gov/?pageRequest=docDetail&DocID=US20170329893A1). An electronic copy of the table will also be available from the USPTO upon request and payment of the fee set forth in 37 CFR 1.19(b)(3).
Number | Date | Country
---|---|---
62333653 | May 2016 | US
62410783 | Oct 2016 | US