Embodiments of the present invention relate generally to the field of genetics. In particular, embodiments of the present invention relate to predicting the risk that one or more specific allele variants will cause gene dysfunction or deleterious mutations associated with disease or reduced likelihood of surviving or reproducing in an organism.
Every year thousands of babies are born with genetic diseases. Often, the parents of these children are both healthy, but each parent possesses genetic mutations that, when passed in combination to the child, endow it from the time of conception with an unmitigated genetic defect. Children with such diseases may suffer and have diminished lifespans, and their care can entail large emotional and financial costs, so many prospective parents attempt to minimize the chance that they pass on genetic elements that cause disease.
Carrier testing, in which both parents are genotyped at loci of their genomes that are known to cause disease, is a technique widely used to achieve this goal. Carrier testing is unique among medical diagnostics in that recessive disease is only predicted to occur in persons other than those actually being tested. Variants in a known disease gene are classified as “pathogenic” when observed in correlation with patients diagnosed with the corresponding disease. A panel of pathogenic variants from the same gene provides the basis for developing a specific test. Persons who carry any targeted clinically validated variant are scored “positive,” and two prospective parents who test positive for the same autosomal gene are assigned a 25% risk of conceiving a diseased child. Conventional carrier testing suffers from several limitations.
Firstly, carrier testing uses a binary classification system, defining an allele variant as having only either a “positive” (pathogenic) or a “negative” (benign) effect of causing a disease. This binary classification fails to identify any continuum or intermediate effects (e.g., a degree of disease or partial functionality of a phenotype) or to illuminate allele-specific or genotype-specific differences in predicted phenotypes from the same gene. In some cases, variants with partial functionality will express allele-specific or genotype-specific effects (e.g., associated with disease in some allelic combinations but not others). The binary classification system cannot differentiate between different phenotypes caused by different allele or genotype combinations of the same gene.
Binary classification is typically useful for patients with a known disease or phenotype who are searching for the variant that causes it. A successful search distinguishes the patient's “pathogenic” variants from “benign” variants. If the patient's condition cannot be ascribed to previously characterized variants, a number of computational tools have been developed for filtering and ranking potential culprits. The performance of these discovery tools is typically measured by the area under a receiver operating characteristic (ROC) curve for benchmark sets of pathogenic and benign variants.
In recently published guidelines for scoring the pathogenicity of DNA sequence variants, the American College of Medical Genetics and Genomics encouraged clinical researchers to “arrive at a single conclusion” that is “determined by the entire body of evidence.” However, the assumption of all-or-none pathogenicity is inappropriate for variants in recessive disease genes. “Pathogenic” implies that a variant has an absolute or determinative causal relationship to a disease or phenotype, and yet, in molecular terms, a single recessive disease allele cannot independently cause a disease, but participates passively in a reduction or loss of function that is tolerated in the heterozygous presence of a fully functional gene copy. Recessive disease will only ensue in a homozygote or compound heterozygote where the molecular sum of functional products from both gene copies fails to rise above the threshold required for health.
A second limitation of conventional carrier testing is that it is very difficult to identify the disease-risk of variants of recessive diseases or traits because many of the patients carrying those variants are heterozygous and do not express the recessive disease or trait. Newly arising mutations in recessive disease genes will usually be transmitted silently from one generation of heterozygotes to the next, without appearing in diseased patients. In a recent analysis of the Cystic Fibrosis Transmembrane Conductance Regulator (CFTR) gene in 60,000 exomes from individuals not affected with cystic fibrosis, the number of likely disease-causing variants that had not been clinically validated was twice the number that were validated. The expanded use of Next Generation Sequencing (NGS) for genetic screening of all recessive disease genes will result in the detection of many more “untested” variants that are not available for informed reproductive decision making under the current testing regimen.
A third limitation of conventional carrier testing is that it typically only tests for variants validated to cause disease in clinical studies. Carrier testing typically relies on the curation of clinical reports as its primary source for variant inclusion. Such tests rely on a defined set of alleles known to cause diseases, and then screen for the presence of these alleles in one or both parents prior to conception. The alleles screened in such tests have been established to cause disease by examining pedigrees of patients with the disease, by using cellular or animal models of the effect of the particular allele, or by alternate means. The incompleteness of these tests is evidenced by the fact that the number of alleles associated with disease in public databases such as ClinVar (http://www.ncbi.nlm.nih.gov/clinvar/) and OMIM (http://www.ncbi.nlm.nih.gov/omim) continues to grow every year, and in turn so does the number of loci tested by carrier screening. Similarly, many patients present with pathologies that appear to have a genetic basis, but for which no specific underlying genetic mutation has yet been determined. In many of these cases, one or more novel pathogenic variants are later discovered by various means and added to the catalog of known disease-associated mutations. For example, the genomes of many patients with similar pathologies can be sequenced and shared mutations found. Alternatively, mutations that occur in an individual patient's genome, appear damaging (missense, nonsense, etc.), and are present in genes known to be associated with a biological process related to the pathology may be tested in a cellular or animal model.
While the steady growth of the catalog of variants known to cause disease implies that carrier testing will improve, it also shows that carrier testing screens only for clinically validated mutations and cannot assess the impact of novel or de novo mutations. If a variant is specific to an individual or family and has not been previously studied, carrier testing cannot determine what effect it may have on future offspring.
A fourth limitation of conventional carrier testing is that a diseased child must be born and diagnosed in order to find a new disease-associated allele. In all cases, the correlation between alleles and genetic diseases is determined by studying one or more individuals that have already been born with the disease. In the case of recessive disease, the problem is compounded because a novel variant usually initially appears only as one half of a heterozygous genotype, which does not express disease, and will spread silently through populations before it is combined with itself or another recessive mutation to express the disease in patients. Thus, it is very difficult to resolve the effect of the mutation until children suffering from the disease are born, and from the perspective of a parent who wants to avoid passing on disease-causing alleles, it is then too late.
A system, device and method are described to overcome the aforementioned longstanding issues inherent in the art.
In an embodiment of the invention, a device, system and method is provided for predicting gene-dysfunction caused by a defined genetic mutation in the genome of an organism. A neural network may be stored, for example in one or more memory units. The neural network may comprise multiple nodes respectively associated with multiple different gene-dysfunction metrics and multiple different confidence weights. One or more processors may process the neural network to combine the multiple gene-dysfunction metrics according to the respective associated confidence weights to generate one or more likelihoods that a genetic mutation causes gene-dysfunction in organisms. The one or more processors may process the neural network in a training-phase and a run-time phase. In a training-phase, the neural network may be trained using an input data set including one or more genetic mutations to generate new gene-dysfunction metrics and new associated confidence weights that optimize the neural network based on a cost factor. In a run-time phase, a genetic mutation may be identified and one or more likelihoods may be computed that the identified genetic mutation causes gene-dysfunction in the organism based on the new gene-dysfunction metrics and the associated new confidence weights of the neural network. The multiple different gene-dysfunction metrics may include combinations of one or more of a population selection component, an evolutionary selection component, a pathogenic predictor component, a mutation class component and/or a clinical classification component.
In an embodiment of the invention, a device, system and method is provided for predicting gene-dysfunction associated with a genetic mutation in an organism based on population-specific selection factors. Multiple population-specific sets of genetic sequences may be received each including multiple genetic sequences obtained from genetic samples of organisms from a different respective one of multiple populations. Each of multiple population-specific measures of homozygosity of the genetic mutation may be generated for each of the respective multiple populations by comparing the count of observed homozygotes of the genetic mutation measured on both chromosomes at a genetic locus in the population-specific set and an expected homozygote count based on a total observed count of the genetic mutation measured on either chromosome at the genetic locus in the population-specific set. One or more likelihoods may be computed that the genetic mutation causes gene-dysfunction in the organism based on one or more of the multiple population-specific measures of homozygosity.
In an embodiment of the invention, a device, system and method is provided for predicting gene-dysfunction associated with a genetic mutation in an organism based on the evolution of genetic variation of multiple organisms within one species (“single-species” or “intra-species” model) or across multiple different species (“multi-species” or “inter-species” model). Past evolutionary trends in allele mutations of extant or surviving (currently or once-living) organisms representative of one or more species or populations may be analyzed to predict the future fitness of a living organism or a potential hypothetical or virtual progeny simulated for two living potential parents. In some embodiments of the invention, a system, device and method may receive multiple aligned genetic sequences obtained from genetic samples of multiple organisms of one or more different species. Genetic loci may be aligned from different sequences for different organisms that are derived from one or more common ancestral genetic loci correlated with the same trait(s), disease(s), codon(s), that are positioned or sandwiched between other correlated marker loci, or that are otherwise related. A measure of evolutionary variation may be computed for one or more alleles at each of one or more aligned genetic loci of the multiple aligned sequences. The measure of evolutionary variation may be a function of variation in alleles at corresponding aligned genetic loci in the multiple aligned genetic sequences. One or more likelihoods may be computed that an allele, either a new mutation or one present in the alignment, at each of the one or more genetic loci in an organism will be deleterious based on the measure of evolutionary variation of alleles at the corresponding aligned genetic loci for the multiple organisms.
The subject matter regarded as the invention is particularly pointed out and distinctly claimed in the concluding portion of the specification. The invention, however, both as to organization and method of operation, together with objects, features, and advantages thereof, may best be understood by reference to the following detailed description when read with the accompanying drawings in which:
It will be appreciated that for simplicity and clarity of illustration, elements shown in the figures have not necessarily been drawn to scale. For example, the dimensions of some of the elements may be exaggerated relative to other elements for clarity. Further, where considered appropriate, reference numerals may be repeated among the figures to indicate corresponding or analogous elements.
Embodiments of the invention provide a system, device and method for analyzing a DNA sequence to determine risk or probability of gene dysfunction associated with specific variants or allele combinations in the DNA sequence, for example, associated with disease or reduced likelihood of surviving or reproducing in an organism. The DNA sequence may be sequenced from a biological sample of a living organism (a “real” or “extant” organism) or may be simulated (e.g., simulating a mating) by combining at least a portion of genetic information representing genetic material obtained from biological DNA samples of two living potential parents (e.g. as shown in
Embodiments of the invention replace the unrealistic conventional binary classification system of disease-risk with a continuous variant-weighted component-based dysfunction scoring system for each single gene copy. Embodiments of the invention may compute one or more likelihoods of variant-specific gene dysfunction by integrating multiple gene dysfunction categories. These multiple gene dysfunction categories may be weighted according to variant-defined levels of confidence and summed to generate a variant-specific gene dysfunction (VGD). The multiple gene dysfunction components may be combined using a neural network (e.g.
The multiple gene dysfunction categories integrated into the variant-specific gene dysfunction score may include, for example, clinical classification, mutation class, pathogenic predictors, evolutionary constraint, and population selection.
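The weighting and summing of components described above can be sketched as a confidence-weighted average. This is a minimal illustration and not the trained neural network: the function name `combine_components`, the normalization by total weight, and the use of `None` for a null component are assumptions made for the example.

```python
from typing import Optional, Sequence, Tuple

# Each component is a (score, weight) pair, or None when the component is null
# (for example, no clinical classification exists for the variant).
Component = Optional[Tuple[float, float]]

def combine_components(components: Sequence[Component]) -> Optional[float]:
    """Confidence-weighted average of component dysfunction scores.

    Assumption: the combined score is sum(w * S) / sum(w); the source only
    states that components are weighted by confidence and summed.
    """
    weighted_sum = 0.0
    total_weight = 0.0
    for component in components:
        if component is None:
            continue  # null components contribute nothing
        score, weight = component
        weighted_sum += weight * score
        total_weight += weight
    if total_weight == 0.0:
        return None  # no evidence available for this variant
    return weighted_sum / total_weight
```

In this sketch a high-confidence component dominates a low-confidence one, so a contested or weak signal shifts the combined score only slightly.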
The population selection component measures a “homozygous effect,” a “heterozygous effect” and/or a “dominant effect” in each of a plurality of human populations (r=1, . . . R). In one example, populations include non-Finnish Europe, Finland, South Asia, East Asia, Africa, and the Americas, although other populations or groups may be used.
The homozygous effect may measure each population's natural selection against homozygote forms of a recessive genetic variant. The homozygous effect may compare the observed incidence or count of a homozygote of a variant genotype Qjr,obs (e.g. observed on both chromosomes) to a predicted or expected incidence or count of the homozygous variant genotype Qjr,exp (e.g. based on the observed allele frequency on either chromosome (fjr) and the population size (Njr), such as, Qjr,exp=(fjr)^2×Njr), in each population. The homozygous effect is based on a “null hypothesis” that if a variant's effect is neutral (e.g. having substantially no negative or positive consequence on survival or reproductive ability), the observed incidence of the homozygous variant genotype would be approximately equal to the expected incidence of the homozygous variant genotype.
If, however, the variant genotype causes dysfunction, disease, or reduced likelihood of surviving or reproducing, there would be a selective force against expression of that variant in the population, thereby suppressing the observed incidence of the homozygous variant genotype relative to the heterozygote or total variant genotype.
A relatively low observed incidence of the homozygous variant genotype compared to the expected incidence of the homozygous variant genotype (e.g. Qobs<<Qexp) results in a relatively high variant-specific gene dysfunction score, whereas a relatively neutral or high observed incidence of the homozygous variant compared to the expected incidence of the homozygous variant (e.g. Qobs≧Qexp) results in a relatively low variant-specific gene dysfunction score. At the limits of the homozygous effect, if there is no observed incidence of a homozygous variant genotype in a population (r) (e.g. Qr,obs=0), the homozygous effect score reaches its maximum (e.g., Thom,r≈1) whereas if there are the same or more observed incidences of a homozygous variant genotype than expected in a population (r) (e.g. Qr,obs≧Qr,exp) the homozygous effect score reaches its minimum (e.g., Thom,r=0). The homozygous effect is a powerful and accurate measure of gene dysfunction caused by each variant (j) in a population (r), particularly in situations with a sufficiently large sampled population (e.g., Njr>>1,000, such as >50,000 people) or a sufficiently high frequency of mutation (e.g., fjr>1%). However, in cases in which the population does not have many sequenced individuals (e.g. Njr<1,000 people) or the mutation is at a sufficiently low allele frequency (e.g., fjr<0.1%), random fluctuations in mutations may skew the ratio of observed to expected homozygotes, and thus the homozygous effect becomes less powerful. To reflect the varied power in the score, the weight or confidence of the homozygous effect is proportional to the maximal number of observed or expected incidences of the homozygous variant genotype, and thus is also diminished in cases with low variant frequency and/or population size. In such cases when the power of the homozygous effect is diminished, the heterozygous effect may compensate.
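The homozygous effect can be sketched for a single variant in a single population as follows. The expected count follows the relation Qexp = f² × N given above. The linear form of the score, 1 − Qobs/Qexp clipped at zero, is an assumption chosen only because it matches the stated limits (about 1 when no homozygotes are observed, 0 when observed meets or exceeds expected). The weight is taken as the larger of the observed and expected counts, with a proportionality constant of 1 assumed. The function names are hypothetical.

```python
from typing import Tuple

def expected_homozygotes(allele_freq: float, population_size: float) -> float:
    """Expected homozygote count under neutrality: Qexp = f^2 * N."""
    return allele_freq ** 2 * population_size

def homozygous_effect(q_obs: float, allele_freq: float,
                      population_size: float) -> Tuple[float, float]:
    """Return (score, weight) for one variant in one population.

    Assumed score: max(0, 1 - Qobs/Qexp). Assumed weight: max(Qobs, Qexp).
    """
    q_exp = expected_homozygotes(allele_freq, population_size)
    if q_exp <= 0.0:
        return 0.0, 0.0  # variant absent from the sample: no evidence
    score = max(0.0, 1.0 - q_obs / q_exp)
    weight = max(q_obs, q_exp)
    return score, weight
```

For a variant at 1% frequency in 50,000 sampled individuals, about 5 homozygotes are expected; observing none gives the maximum score with a weight of about 5, whereas the same variant in 500 individuals would yield an expected count of 0.05 and a correspondingly negligible weight.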
The heterozygous effect may measure the impact of heterozygote forms of a recessive genetic variant. The heterozygous effect may measure the relationship between the count or frequency of a variant in each population (fjr) with the variant's clinical visibility. A variant (j)'s clinical visibility CVj may be based on, for example a number of published articles that reference the variant (pm), a number of compound heterozygotes or combinations of the variant with other variants that is described in the clinical literature (ch), and/or a number of search results for the variant or an order or ranking of a search result for the variant in a database or web search. A rare variant or mutation is generally expected to only rarely (or never) appear in clinical studies. However, extensive documentation in clinical studies of a rare variant is shown to correlate with a higher likelihood that the variant causes gene dysfunction resulting in disease. The heterozygous effect increases when a variant has an unexpectedly or relatively high clinical visibility compared to its frequency in a population. The heterozygous effect is thereby of particular importance, for example, in cases where a variant has a high allele frequency (e.g. greater than 0.5%) or disproportionately high clinical visibility. In cases where a variant has a disproportionately high clinical visibility, the heterozygous effect may identify damaging or disease-causing variants even when they appear frequently in a population (e.g., with a sufficiently high frequency that would have otherwise indicated lack of damage). Although most disease variants are rare, for example, the variant primarily responsible for sickle-cell anemia (SCA) affects up to 10% of people in sub-Saharan countries with endemic malaria, which is a relatively high allele frequency for a damaging variant. 
However, the SCA variant has an extremely high clinical visibility, and thus both the score and weight for its heterozygous effect are relatively high. Allele frequencies from different populations may be treated unequally given that clinical studies are disproportionately prevalent in certain regions of the world. For example, for a variant with a null or 0 clinical visibility (e.g., not mentioned in the clinical literature), a 0.5% allele frequency in Europe would indicate a relatively low or minimal likelihood of that variant causing dysfunction, but a 0.5% allele frequency in Africa may still be aligned with the variant causing dysfunction, under the assumption that more clinical studies have been conducted in Europe than in Africa.
The dominant effect may measure each population's natural selection against dominant genetic variants. The dominant effect may measure the relationship between a variant's observed allele count (e.g., the number of people with either one or two copies of the variant) compared to an expected allele count for any pathogenic variant in the same gene, for example, based on a distribution of allele counts across a plurality of (or all) known pathogenic mutations in the gene. In a gene with a fully dominant disease, pathogenic variants are generally not expected to be observed in more than a small fraction of healthy individuals. If, for example, none or few of a gene's pathogenic variants have been observed in any individuals, and the number of pathogenic variants in the gene is sufficiently large (e.g., >30), then a variant that has been observed significantly more than would be expected of a pathogenic variant in that gene (e.g., allele count>2) would be given a relatively low dominant effect score. The weight or confidence of the dominant effect may be based on the variant's allele count, the number of pathogenic variants in the gene, and/or the distribution of allele counts across a plurality of pathogenic mutations in the gene.
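One way to illustrate the comparison against the distribution of pathogenic allele counts is an empirical tail fraction. This is a sketch under assumptions, not the scoring function of the source: the score is taken as the fraction of known pathogenic variants in the gene whose allele count is at least as large as the variant's, and the function name `dominant_effect` and the parameter `min_pathogenic` (mirroring the ">30" example above) are hypothetical.

```python
from typing import Optional, Sequence

def dominant_effect(allele_count: int,
                    pathogenic_allele_counts: Sequence[int],
                    min_pathogenic: int = 30) -> Optional[float]:
    """Assumed dominant effect score for one variant in one gene.

    Returns the fraction of known pathogenic variants in the gene observed
    at least as often as this variant. A variant seen far more often than
    any pathogenic variant therefore scores near 0. Returns None when the
    gene has too few pathogenic variants to form a usable distribution.
    """
    n = len(pathogenic_allele_counts)
    if n <= min_pathogenic:
        return None
    at_least_as_common = sum(
        1 for count in pathogenic_allele_counts if count >= allele_count
    )
    return at_least_as_common / n
```

With 31 pathogenic variants of which 29 are unobserved and two have allele counts of 1 and 2, a candidate variant with an allele count of 5 scores 0 in this sketch, consistent with the low dominant effect score described above.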
Whereas the population selection component represents a variant's success or failure on a population-level, e.g., separately for each of a plurality of specific populations within the single human species (e.g., non-Finnish Europe, Finland, South Asia, East Asia, Africa, and the Americas), the evolutionary selection component represents a variant's success or failure on a species-level (in the “single-species” model) or across all species (in the “multi-species” model).
The evolutionary selection component predicts the likelihood that variants cause disease or dysfunction, for example, based on their frequency or rarity of occurrence, across multiple reference genetic sequences (e.g. see
The evolutionary selection and population selection components may complement each other, providing better prediction when used together than when used separately. In one instance, the evolutionary selection component may predict disease in variants that are suppressed throughout evolution, for example, the variant that eliminates the development of wisdom teeth. However, whereas wisdom teeth are typically essential to the survival of many animal species, humans are an exception protected by the intelligence and altruism of our species and the adaptation of diet. This leads to an anomaly: the evolutionary selection component alone would have mischaracterized the wisdom teeth variant as damaging, but its combination with the population selection component neutralizes this effect.
The mutation class component may measure the type or class of mutation of variant (j). Table 2 shows an example of the mutation class component for various mutation types. Example mutation classes include start-loss, stop-gain, stop-retained, frame-shift indel, essential splice site (associated with loss-of-function), splice region, untranslated region, microsatellite (e.g., a sequence of a repeating base type, such as AAAAA . . . , of length STRj (short tandem repeats)), synonymous (e.g., not affecting gene expression), intron, in-frame indel (e.g., an insertion or deletion of a multiple of three bases, or an integer number of amino acids AAj, so as not to shift the reading frame), missense (e.g., a non-synonymous variant that changes an amino acid), and stop-loss (e.g., loss of the normal stop codon by mutation to encode an amino acid). The loss-of-function (LoF) mutations (start-loss, stop-gain, frame-shift indel, essential splice site) typically cause complete dysfunction and may be assigned a relatively high or maximum dysfunction score (e.g., SjM=1) with relatively high confidence (e.g., wjM>50). A missense mutation alters an amino acid and may be assigned an intermediate dysfunction score (e.g., SjM=0.5), but because the altered amino acid may or may not damage a protein depending on which amino acid is changed and other more complicated factors involved in the protein structure, it is associated with a relatively small weight (e.g., wjM=0.01). An in-frame indel mutation inserts or deletes an integer number (AAj) of amino acids. Because the likelihood of dysfunction typically increases with the number of damaged amino acids, in-frame indel mutation dysfunction scores and weights increase as the number (AAj) of damaged amino acids increases (e.g., as shown in
The clinical classification component may measure dysfunction for variants that were clinically validated, for example, as “pathogenic” (e.g., SjC=1) or “benign” (e.g., SjC=0). The weights of the score may be based on validated confidence levels (e.g., uncontested or certain classifications have a relatively high or maximal weight such as wjC=20, probable classifications have a relatively moderate weight such as wjC=10, and contested classifications have a relatively low or minimal weight of wjC=1). The clinical classification component may be null when there is no clinical classification for a variant, in which case, the remaining (e.g., non-zero) gene-dysfunction components compensate for the null clinical classification component to predict dysfunction for the variant in the absence of clinically validated data.
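The clinical classification component can be illustrated as a simple lookup using the example values above (SjC of 1 or 0, and wjC of 20, 10 or 1). The label strings and the function name `clinical_component` are hypothetical and chosen only for this sketch; a null classification returns `None` so that other components can carry the prediction.

```python
from typing import Optional, Tuple

# Example scores and weights taken from the text; the string labels are
# assumptions made for this illustration.
CLINICAL_SCORES = {"pathogenic": 1.0, "benign": 0.0}
CONFIDENCE_WEIGHTS = {"uncontested": 20.0, "probable": 10.0, "contested": 1.0}

def clinical_component(classification: Optional[str],
                       confidence: Optional[str]) -> Optional[Tuple[float, float]]:
    """Return (score, weight), or None when no clinical classification exists."""
    if classification is None:
        return None  # null component: remaining components compensate
    return CLINICAL_SCORES[classification], CONFIDENCE_WEIGHTS[confidence]
```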
The pathogenic predictor component may measure a likelihood that a variant is pathogenic, inputting one or more of the following metrics: PROVEAN predicts whether a protein sequence variation affects protein function, Combined Annotation Dependent Depletion (CADD) predicts the effects of single nucleotide variants as well as insertion/deletion variants, Variant Effect Scoring Tool (VEST) predicts the effects of missense mutations, and PolyPhen-2 (Polymorphism Phenotyping v2) predicts the effects of an amino acid substitution on the structure and function of a human protein. These metrics may be composed into the pathogenic predictor component as follows. The PROVEAN metric may be transformed to a linear scale subscore (e.g., the greater the PROVEAN metric, the more likely the variant damages a gene) and may be assigned a weight proportional to the metric (e.g., the greater the PROVEAN metric, the more certain its impact on gene function). The CADD metric may be transformed from a Phred scale to a linear scale subscore (e.g., the greater the CADD metric, the more likely the variant damages a gene) and may be assigned a weight that increases when the CADD metric is above a threshold parameter (sc) (e.g., above a threshold, the greater the CADD metric, the more certain its impact on gene function). PolyPhen-2 includes two metrics (HumDiv and HumVar) that may be averaged (e.g., the greater the combined metrics, the more likely the variant damages a gene) and assigned a weight inversely proportional to the difference between the two metrics (e.g., the more the two metrics disagree, the less reliable they are). The VEST metric may be assigned as a subscore (e.g., the greater the VEST metric, the more likely the variant damages a gene) and may be assigned a constant weight. In general, all of these metrics have approximately 80% accuracy in assigning predicted non-pathogenic variants relatively low scores and predicted pathogenic variants relatively high scores.
However, these four metrics are generally inconsistent in the scoring of a subset of partially functional pathogenic variants due to the overly simplistic pathogenic categorization in ClinVar. It is only by combining these components with other gene dysfunction components in the neural network that embodiments of the invention produce cumulative gene dysfunction scores with sufficient accuracy, for example greater than 90-95%, as described below.
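Two of the transformations described above can be sketched briefly. The CADD conversion uses the standard inverse of a Phred scale, 1 − 10^(−PHRED/10), which maps a scaled score to a fraction between 0 and 1; whether the embodiments use exactly this mapping is not stated here, so it is an assumption. For PolyPhen-2, the small constant `eps`, which keeps the weight finite when HumDiv and HumVar agree exactly, and both function names are hypothetical.

```python
from typing import Tuple

def cadd_phred_to_linear(phred: float) -> float:
    """Invert a Phred-scaled score: PHRED 10 -> 0.9, PHRED 20 -> 0.99."""
    return 1.0 - 10.0 ** (-phred / 10.0)

def polyphen_subscore(humdiv: float, humvar: float,
                      eps: float = 0.05) -> Tuple[float, float]:
    """Average HumDiv and HumVar and weight by their agreement.

    The weight is inversely proportional to the difference between the two
    metrics; eps is an assumed constant that avoids division by zero.
    """
    score = (humdiv + humvar) / 2.0
    weight = 1.0 / (abs(humdiv - humvar) + eps)
    return score, weight
```

In this sketch, HumDiv and HumVar values of 0.9 and 0.1 average to 0.5 with a low weight, while two values of 0.9 yield a score of 0.9 with a much higher weight.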
Embodiments of the invention may provide the following improvements and overcome the aforementioned longstanding issues inherent in the art:
Improved results: Table 1 and
Neural Network: The neural network composes multiple gene-dysfunction components representing different and complementary aspects of gene dysfunction in an optimized manner to produce a cumulative prediction that is greater than the sum of its parts. Whereas analyzing these gene dysfunction components separately, one at a time, correctly predicts, for example, at most 80% of pathogenic variants, analyzing multiple gene dysfunction components optimized together in the neural network improves gene-dysfunction accuracy, correctly predicting, for example, greater than 90-95% of pathogenic variants (see e.g., discussion of
Continuous Classification: In contrast to conventional binary classification systems, embodiments of the invention provide a continuous classification system of scores distributed on a continuous (e.g., linear) scale. Because the causation between variants and gene-dysfunction is typically uncertain, in particular, when analyzing novel variants not yet clinically validated, or only validated to a contested or uncertain confidence, the continuous likelihood or variant-specific gene dysfunction described according to embodiments of the invention may more accurately represent pathology as compared to a conventional binary classification. Further, the continuous likelihood described according to embodiments of the invention may account for the varying degree of gene dysfunction, for example, to identify any continuum or intermediate effects (e.g., a degree of disease or partial functionality of a phenotype) or to differentiate between different degrees of gene-dysfunction caused by different allele or genotype combinations of the same gene.
Population Selection: In contrast to the conventional carrier testing which has no way to test for population-specific anomalies or differences in selective factors, such as, deafness or the elimination of wisdom teeth, embodiments of the invention provide a population selection component that measures the relative propensity or aversion for specific variants on a population-by-population basis.
Homozygous Recessive Predictive Screening: A recessive gene must typically be a homozygote (having two copies) to express a recessive disease or trait. Because many newly arising mutations in recessive disease genes have not yet expressed as homozygotes, or worse yet, have killed off all patients with those homozygotes, conventional carrier testing has no way to test for new recessive disease gene mutations. In contrast, embodiments of the invention provide a homozygous effect component that measures the observed incidence or frequency of homozygote variants compared to an expected homozygote incidence. The expected homozygote incidence may be generated by extrapolating from the heterozygote or total variant incidence to predict what the expected homozygote incidence would be if there was no selective factor against the homozygote form of the variant. A relatively low observed homozygote incidence compared to the expected homozygote incidence indicates a likely selective factor against the homozygote forms of the genotype, increasing the likelihood that the variant causes a recessive disease or dysfunctional trait.
Heterozygous Effect: In contrast to the conventional classification systems which have no way of testing the effect of heterozygous variants for recessive traits, embodiments of the invention propose a heterozygous effect component that is a relative measure between variant frequency and clinical visibility. The heterozygous effect component identifies variants that have a disproportionately high clinical visibility compared to their frequency in a population. This unexpectedly high clinical visibility may indicate that these variants are likely candidates for recessive disease or dysfunction. Conversely, variants that have a disproportionately low clinical visibility compared to their frequency in a population may indicate that these variants are unlikely to contribute to recessive disease or dysfunction.
Dominant Predictive Screening: Variants linked to dominant disease or gene-dysfunction are typically difficult to detect because organisms with those variants seldom survive, or only proliferate for a few generations. The VGD model however can predict dominant disease gene mutations based on the allele count of the variant compared to a distribution of allele counts across a plurality of pathogenic mutations in the same gene. If the allele count of that particular variant is relatively higher than expected based on the distribution of the allele counts of the pathogenic variants for that gene, the likelihood that the variant causes gene dysfunction as a dominant allele may be relatively decreased.
Evolutionary Selection: The evolutionary selection component may use evolution as a “four billion year experiment.” During the course of evolution, nearly all variants or mutations have likely been tested and their success or failure in propagating through one or more species by natural selection indicates which variants cause dysfunction or disease (rare variants) and which variants are innocuous or positive (frequent variants). The evolutionary selection component may thus relate the measure of evolutionary variation of alleles at a genetic locus to the likelihood that a variant at that locus will cause disease or dysfunction.
Pre-clinical screening: In contrast to the conventional carrier testing which only tests for clinically validated disease-causing variants, embodiments of the invention may use the population selection and evolutionary selection components to predict the effect or disease-risk of allele variations without clinicians having ever observed those allele variations in diseased patients.
Reference is made to
The 2012 graph in
The VGD scores use the population selection, mutation class, pathogenic predictors, and evolutionary constraint components to predict the effect or disease-risk of allele variations without clinicians having ever observed those allele variations in diseased patients. This allows geneticists to assess the disease-risk of novel or de novo mutations or variants that have never before been validated or studied in diseased patients. The graphs in
Pre-conception screening: Conventional carriers screening only tests for variants that have already been identified and validated in a living diseased child. Some embodiments of the invention compute the VGD score or likelihood of disease-risk of allele variations in “virtual progeny” (non-existing, pre-conception progeny) based on measures of population selection or evolutionary variation, instead of only based on clinically validating the genomes of real diseased children (existing, post-conception progeny). Embodiments of the invention can thereby predict the likelihood that an allele variation will cause disease without requiring that any child ever be conceived with that disease or disease-causing variant. As shown in
Other or different advantages may be realized according to embodiments of the invention.
Reference is made to
System 100 may include a genetic sequencer 102, a sequence aligner 104 and/or a sequence analyzer 106. Units 102-106 may be implemented in one or more computerized devices as hardware and/or software units, for example, specifying instructions configured to be executed by a processor. One or more of units 102-106 may be implemented as separate devices or combined as an integrated device.
Genetic sequencer 102 may input DNA obtained from biological samples, such as, blood, tissue, or saliva, of one or more real living organisms and may output each organism's genetic sequence including the organism's genetic information at one or more genetic loci, for example, a human genome. A single organism's DNA sample may be sequenced for performing carrier testing on that individual, two potential parents' DNA samples may be sequenced for performing carrier testing on a virtual progeny generated by combining at least a portion of the two potential parents' genetic sequences, or a single potential parent's DNA sample may be sequenced for combining with each of a plurality of candidate donor sequences to generate a plurality of virtual progeny to determine an optimal and/or a least optimal subset of one or more donors.
Sequence analyzer 106 may generate virtual progeny by inputting two potential parents' genetic sequences to simulate a mating by combining at least a portion of genetic information derived therefrom and output a virtual progeny genetic sequence of a virtual gamete, for example, as described in reference to
Sequence aligner 104 may align one or more loci of the organism or virtual progeny's genetic sequence with a plurality of reference genetic sequences of extant organisms from (a) one or more different populations for generating a population selection component and (b) one or more different species for generating an evolutionary constraint component. In some embodiments, a sequence aligner need not be used.
Sequence analyzer 106 may input the multiple sequence alignment and may compute measures of (a) population-specific variation of alleles and (b) species-wide evolutionary variation of alleles at one or more aligned genetic loci. Sequence analyzer 106 may generate (a) a population selection component and (b) an evolutionary constraint component based on these measures (e.g. as shown in
Genetic sequencer 102, sequence aligner 104, and sequence analyzer 106 may include one or more controller(s) or processor(s) 108, 110, and 112, respectively, configured for executing operations and one or more memory unit(s) 114, 116, and 118, respectively, configured for storing data such as genetic information or sequences and/or instructions (e.g., software) executable by a processor, for example for carrying out methods as disclosed herein. Processor(s) 108, 110, and 112 may include, for example, a central processing unit (CPU), a digital signal processor (DSP), a microprocessor, a controller, a chip, a microchip, an integrated circuit (IC), or any other suitable multi-purpose or specific processor or controller. Processor(s) 108, 110, and 112 may individually or collectively be configured to carry out embodiments of a method according to the present invention by for example executing software or code. Memory unit(s) 114, 116, and 118 may include, for example, a random access memory (RAM), a dynamic RAM (DRAM), a flash memory, a volatile memory, a non-volatile memory, a cache memory, a buffer, a short term memory unit, a long term memory unit, or other suitable memory units or storage units. Genetic sequencer 102, sequence aligner 104, and sequence analyzer 106 may include one or more input/output devices, such as output display 120 (e.g., such as a monitor or screen) for displaying to users results provided by sequence analyzer 106 (e.g., visualizing
Reference is made to
DNA image 200 may be an image of DNA extracted from biological samples, such as, blood, tissue, or saliva. The organism may be a living organism or a virtual organism. When screening a living organism, DNA image 200 may be an image of DNA of the living organism undergoing screening. When screening a virtual organism, DNA image 200 may be an image of DNA of one or more of the two living potential parents whose DNA is combined to generate the virtual organism undergoing screening. For example, when two potential parents undergo carrier screening to predict disease or dysfunction in their potential child, the two potential parents' DNA may both be imaged, whereas when one potential parent seeks screening with a pool of donor candidates, the image of the DNA of the one potential parent may be displayed alone (without DNA images of candidate donors, e.g., for privacy issues) or together in a sequence of displays with the DNA image of each respective candidate donor. DNA image 200 may display a portion of or the entire length of a human genome.
DNA sequence 202 may be a schematic representation of the DNA described above, such as a sequence of nucleotides, bases, amino acids or other genetic information representing the DNA (e.g., sequenced by genetic sequencer 102 of
One or more genetic mutations 204 or their positions may be identified or labeled in DNA image 200 and/or DNA sequence 202 e.g., by color, tone, labeling, highlighting, etc. Identified genetic mutations 204 may be variants identified or flagged by a processor (e.g., 112 of
The display may provide detailed information 206 about the identified genetic mutations 204, for example, automatically where flagged as pathogenic or if selected or identified by a user (e.g., by hovering a cursor over the variant in DNA image 200 and/or DNA sequence 202). The detailed information 206 may include, for example, one or more likelihoods that the identified genetic mutation 204 causes gene-dysfunction in the organism (e.g., a VGD score and/or a naïve VGD score, VGD−Cl, omitting clinical classification data), a frequency of the genetic mutation (e.g., a total number of instances of the mutation in one or more populations, and population information), a mutation class, HGVSp (the standard notation for identifying an amino acid substitution), and/or a clinical classification. Other information, combinations of information, or visualizations of information may be provided.
Analytic targets—Analyses described according to embodiments of the invention targeted an example set of exons from 480 genes associated with autosomal recessive disease. The gene set was composed of all autosomal genes covered by Illumina's TruSight Inherited Disease Sequencing Panel (URL: http://support.illumina.com/downloads/trusight_inherited_disease_product_files.html accessed on Sep. 24, 2014) as well as all additional autosomal genes targeted by Counsyl's Family Prep platform. Illumina TruSight One Sequencing Panel's intervals were used to target the exon regions of each gene analyzed. Exon intervals were padded with a minimum of 10 bp (base pairs) to a maximum of 50 bp of intronic sequence to include variants listed as “pathogenic” in the ClinVar datasets.
ExAC dataset—Population-specific allele and genotype frequency data for variants in targeted intervals were obtained from an example of 60,706 sequenced exomes that were consolidated and processed by the Exome Aggregation Consortium (ExAC) and made publicly available through the Consortium's website (Version 0.3, URL: http://exac.broadinstitute.org accessed on Jan. 13, 2015). The ExAC cohort is composed of unrelated individuals from six geographically defined and, in most cases, genetically distinguishable populations: non-Finnish Europe (NFE; population size, n=33,370), Finland (FIN; n=3,307), South Asia (SAS; n=8,256), East Asia (EAS; n=4,327), Africa (AFR; n=5,203), and the Americas (AMR; n=5,789). ExAC subjects represent individual participants in several large-scale disease-specific and population genetic studies. Persons diagnosed with targeted pediatric diseases were generally excluded from participation.
ClinVar annotation archive—The ClinVar archive of DNA variants with clinical annotations is maintained in two partially overlapping datasets indexed on the human reference genome (hg19) or the HGVS format for describing variation in transcribed genomic intervals. Clinical assertions were retrieved related to variants located in targeted regions by parsing the two files clinvar.vcf and variant_summary.txt (both accessed at URL: http://www.ncbi.nlm.nih.gov/clinvar/ on Mar. 6, 2015; file parsing details can be found in more detail below).
VariBench dataset—Data from the VariBench website (URL: http://structure.bmc.lu.se/VariBench/, accessed on Feb. 6, 2015) was downloaded and processed as a second variant benchmark set. The VariBench metric clusters experimentally verified variants into “pathogenic” and “neutral” (or synonymously benign) datasets. The VariBench variants were divided into subsets that overlap targeted intervals.
Online Mendelian Inheritance in Man (OMIM)—OMIM is a catalogue of human genes and diseases with separate English language narrations provided for major allelic variants of disease genes (accessed as the compressed file omim.txt.Z located at URL: ftp.omim.org/OMIM/ on Mar. 3, 2015). Among other information, allele-specific narrations contain references to other indexed alleles in the same gene that have been detected in compound heterozygous patients. This analysis used a “natural language” approach and cross-indexing among alleles to enumerate distinct compound heterozygous second alleles found in association with each primary allele.
All data sources and testing parameters are described as examples, and are optional and may be omitted, or replaced by equivalents.
Reference is made to
In
Component weights may initially be “raw” weights, $w_j^{0b}$ (initial or raw denoted by a “0” in the superscript), on a continuous raw scale (e.g., [0.0 to 99]), for example, derived according to component-defined sets of functions and rules. Variant components with missing or unused values may be designated as “unassigned” and/or may be given a null raw weight of, for example, 0.0. The final or scaled component weights, $w_j^{b}$, for each variant j may be obtained by recalibrating the raw weights, for example, to sum to a maximal value on the scale (e.g., 1.0 or N):
The likelihood of variant-specific gene dysfunction for variant j, denoted as VGDj, may be computed as a weighted sum of the B dysfunction components:
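The recalibration and weighted sum described above can be sketched in Python. This is a minimal illustration only, assuming that the scaled weights are obtained by dividing each raw weight by the sum of the raw weights (so that they sum to 1.0) and that unassigned components carry a raw weight of 0.0. The function names, component names, and example values are hypothetical.

```python
def scale_weights(raw_weights):
    """Rescale raw component weights so they sum to 1.0 (assumed normalization)."""
    total = sum(raw_weights.values())
    if total == 0:
        raise ValueError("at least one component needs a non-zero raw weight")
    return {name: w / total for name, w in raw_weights.items()}


def vgd_score(subscores, raw_weights):
    """Weighted sum of the component subscores using the scaled weights."""
    weights = scale_weights(raw_weights)
    return sum(weights[name] * subscores.get(name, 0.0) for name in weights)


# Hypothetical variant whose clinical component is unassigned (raw weight 0.0).
subscores = {"population": 0.8, "predictors": 0.6, "mutation_class": 0.5, "clinical": 0.0}
raw_weights = {"population": 3.0, "predictors": 1.0, "mutation_class": 0.01, "clinical": 0.0}
score = vgd_score(subscores, raw_weights)
```

Because the weights are rescaled per variant, a component with a negligible raw weight (here the mutation class) only matters when the other components are unassigned or similarly small.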
The specific component subscores and weights are explained in more detail in their respective sections below. The neural network of
where $t_g^{bs}$ is a pathogenic bootstrap threshold providing a lower bound for the VGD scores of a predetermined number or percentage (e.g., 99%) of known pathogenic mutations, $u_g^{>bs}$ is a measure of how many uncharacterized variants with a minimum allele count (e.g., allele count>5) fall above that pathogenic bootstrap threshold, $U_g$ is a total count of uncharacterized variants used to scale $u_g^{>bs}$ as a ratio, and $\hat{d}_g^{cv2}$ is a measure of the mean dysfunction values for benign variants (e.g., classified as ClinVar-2 or ClinVar-3). The neural network is optimized in the training phase by reducing the cost factor $C_{VGD}$. Minimizing or reducing $C_{VGD}$ may optimize the VGD model to better match three groups of variants: (1) known pathogenic variants are matched by shifting the center of VGD of known pathogenic mutations toward one or more maximal likelihoods (e.g., by minimizing $(1 - t_g^{bs})$ in the cost factor when the maximal likelihood is 1); (2) known benign variants are matched by shifting the center of VGD of known benign mutations toward one or more minimal likelihoods (e.g., by minimizing $\hat{d}_g^{cv2}$ in the cost factor); and (3) uncharacterized variants are matched by shifting the center of VGD of uncharacterized variants away from the one or more maximal and/or minimal likelihoods (e.g., by minimizing $u_g^{>bs}/U_g$ in the cost factor). The neural network is optimized in the training phase on a gene-by-gene basis, after which the gene-specific optimization results are aggregated or summed across multiple genes, genome segments or an entire genome, for example, to obtain a combined genome-wide cost factor to be minimized.
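The three quantities that the cost factor penalizes can be computed per gene as sketched below. The way the terms are combined here, as an unweighted sum, is an assumption made for illustration and is not taken from the text; the function names and example scores are likewise hypothetical.

```python
import math


def bootstrap_threshold(pathogenic_scores, coverage=0.99):
    """VGD score met or exceeded by at least `coverage` of known pathogenic variants."""
    if not pathogenic_scores:
        raise ValueError("need at least one known pathogenic variant")
    ordered = sorted(pathogenic_scores, reverse=True)
    needed = math.ceil(coverage * len(ordered))
    return ordered[needed - 1]


def gene_cost(pathogenic_scores, benign_scores, uncharacterized, min_allele_count=5):
    """Per-gene cost; `uncharacterized` is a list of (vgd_score, allele_count) pairs."""
    t = bootstrap_threshold(pathogenic_scores)
    above = sum(1 for s, ac in uncharacterized if ac > min_allele_count and s > t)
    frac_above = above / len(uncharacterized) if uncharacterized else 0.0
    mean_benign = sum(benign_scores) / len(benign_scores) if benign_scores else 0.0
    # Assumed combination: unweighted sum of the three penalized terms.
    return (1.0 - t) + frac_above + mean_benign


cost = gene_cost(
    pathogenic_scores=[0.9, 0.95, 0.8, 0.99],
    benign_scores=[0.1, 0.2],
    uncharacterized=[(0.85, 10), (0.9, 2), (0.3, 50), (0.5, 7)],
)
```

A genome-wide cost could then be formed by summing `gene_cost` over genes, in line with the gene-by-gene aggregation described above.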
Population Selection Component
The “population selection” component PopScorej (SjP) may provide a population-specific measure that each individual population suppresses or naturally selects against a particular allele variant or mutation. The population selection component may be inversely related to the observed frequency of the allele in the population. Because natural selection typically factors against damaging or disease-causing variants, the more frequent a variant is, the less likely the variant is to cause a trait that is considered dysfunctional or disease-causing in a particular population; whereas the less frequent a variant is, the more likely the variant is to cause such a trait. Because each population is unique, different populations will generally select differently against some traits (e.g., deafness may be considered acceptable for survival in some modern populations, but not in other populations such as hunting populations) and may select similarly against some common or universally dysfunctional traits (e.g., traits that threaten survival for all humans such as cancers). The population selection component balances a homozygous effect, a heterozygous effect and/or a dominant effect.
The homozygous effect may be used to identify variants that cause recessive disease or traits. The homozygous effect may measure the frequency or rarity of homozygote variants for each of multiple populations, for example, by comparing an observed homozygote incidence of a variant (e.g., measured on both chromosomes at the variant's genetic locus) relative to an expected homozygous incidence (e.g., based on a total observed incidence of the identified genetic mutation measured on either chromosome at the genetic locus and the population size) in each population. The likelihood of variant-specific gene dysfunction score may be inversely proportional to the ratio of the observed versus expected homozygote variants because a suppressed incidence of homozygotes in the population may indicate a population selection against those homozygotes (e.g., for causing a recessive trait or disease).
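As an illustration of the observed-versus-expected comparison, the sketch below derives the expected homozygote count from the total allele count and the population size. It assumes Hardy-Weinberg proportions in the absence of selection, which is one way to perform the extrapolation; the function names and the example counts are hypothetical.

```python
def expected_homozygotes(allele_count, n_individuals):
    """Expected homozygote count with no selection (Hardy-Weinberg assumption)."""
    q = allele_count / (2 * n_individuals)  # allele frequency over both chromosomes
    return n_individuals * q * q


def homozygote_ratio(observed_homozygotes, allele_count, n_individuals):
    """Observed / expected homozygotes; None when no homozygotes are expected."""
    expected = expected_homozygotes(allele_count, n_individuals)
    if expected == 0:
        return None
    return observed_homozygotes / expected


# Hypothetical population: 10,000 individuals, 400 variant alleles, 1 homozygote seen.
ratio = homozygote_ratio(observed_homozygotes=1, allele_count=400, n_individuals=10_000)
```

In this example four homozygotes are expected but only one is observed, so the ratio is well below 1, which under the description above would raise the dysfunction likelihood for the variant.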
The heterozygous effect may measure the total frequency of variants (e.g. measured on either chromosome) in each population relative to its prevalence in the clinical literature. The likelihood of variant-specific gene dysfunction score may increase when the relative clinical visibility compared to its allele frequency in a population increases, indicating the variant is an anomaly or especially relevant to disease diagnostics.
The dominant effect may be used to identify variants that cause dominant diseases or traits. The dominant effect may measure a variant's observed allele count compared to an expected allele count for any pathogenic variant in the same gene, for example, based on a distribution (e.g., a Poisson or CFD distribution) of allele counts across a plurality of (or all) known pathogenic mutations in the gene. If the allele count of that particular variant is relatively higher than expected based on the distribution of the allele counts of the pathogenic variants for that gene, the likelihood that the variant causes gene dysfunction as a dominant allele is relatively decreased.
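One way to compare a variant's allele count with the counts of known pathogenic variants is a Poisson tail probability, sketched below. The choice of a Poisson model whose mean equals the average pathogenic allele count is an assumption for illustration, and the function names are hypothetical.

```python
import math


def poisson_upper_tail(k, mean):
    """P(X >= k) for X ~ Poisson(mean)."""
    cumulative = 0.0
    term = math.exp(-mean)  # P(X = 0)
    for i in range(k):
        cumulative += term
        term *= mean / (i + 1)  # P(X = i + 1) from P(X = i)
    return max(0.0, 1.0 - cumulative)


def dominant_tail_probability(variant_allele_count, pathogenic_allele_counts):
    """Chance that a pathogenic variant in this gene reaches the observed allele count."""
    if not pathogenic_allele_counts:
        raise ValueError("need allele counts for at least one pathogenic variant")
    mean = sum(pathogenic_allele_counts) / len(pathogenic_allele_counts)
    return poisson_upper_tail(variant_allele_count, mean)


rare = dominant_tail_probability(1, [1, 2, 3, 2])
common = dominant_tail_probability(50, [1, 2, 3, 2])
```

A very small tail probability, as for `common`, indicates an allele count higher than expected for a pathogenic variant in the gene, which corresponds to a decreased likelihood of dominant gene dysfunction.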
Pathogenic Predictors Component
The “pathogenic predictors” component $\mathrm{PathScore}_j$ ($S_j^{PP}$) may comprise one or more metrics ($R_j^{pp}$) that predict a degree of pathology of a variant or variant class under analysis. For example, a PROVEAN metric ($R_j^{PR}$) and a VEST metric ($R_j^{V}$) predict whether a protein sequence variation affects protein function, a CADD metric ($R_j^{C}$) predicts the effects of any type of variant, and a PolyPhen-2 metric ($R_j^{P2}$) predicts the effects of a missense amino acid substitution on the structure and function of a human protein. All four pathogenic predictor metrics are trained by machine-learning on sets of presumed pathogenic and presumed benign variants. The pathogenic predictors component ($S_j^{PP}$) may be defined, for example, as:
$$S_j^{PP} = \sum_{pp=1}^{PP} u_j^{pp} R_j^{pp} \qquad (4),$$
where $R_j^{pp}$ is the predictive damage score generated from each of the PP pathogenic predictor metrics (pp=1, 2, . . . , PP), and $u_j^{pp}$ is the corresponding subcomponent weight. The choice of pathogenic predictor metrics ($R_j^{pp}$) may also be dynamic; the pathogenic predictors component may add, recompose, and remove subcomponent metrics in equation (4), for example, eliminating nodes that have no data (e.g., “unassigned”) or whose confidence weight is substantially negligible (e.g., nodes with negligible weight may be considered equivalent to discounting the nodes completely, as long as other nodes have significant weight).
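The weighted combination in equation (4), including the dropping of unassigned metrics, can be sketched as follows. This is an illustrative reading only: it assumes the subcomponent weights are rescaled to sum to 1.0 over the metrics that have data, and the function name and example values are hypothetical.

```python
def pathogenic_predictor_score(metrics):
    """Combine predictor metrics.

    `metrics` maps a predictor name to (R, raw_weight), or to None when the
    predictor has no data for the variant ("unassigned").
    """
    assigned = {k: v for k, v in metrics.items() if v is not None and v[1] > 0.0}
    total = sum(w for _, w in assigned.values())
    if total == 0:
        return None  # no usable predictor for this variant
    return sum(r * w / total for r, w in assigned.values())


# Hypothetical variant with no CADD data.
s_pp = pathogenic_predictor_score(
    {"PROVEAN": (0.9, 2.0), "VEST": (0.7, 1.0), "CADD": None, "PolyPhen-2": (0.5, 1.0)}
)
```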
Raw data for the pathogenic predictors, PolyPhen-2, VEST, CADD, and PROVEAN, may be obtained, for example, by sending query batches to the respective websites (http://genetics.bwh.harvard.edu/pph2/bgi.shtml, http://www.cravat.us/, http://cadd.gs.washington.edu/score, and http://provean.jcvi.org/genome_submit_2.php).
The raw data from each pathogenic predictor may be scaled differently, and the pathogenic predictor metrics ($R_j^{pp}$) may map or transform the raw pathogenic predictor data onto a uniform scale. The scale for this and all component scores may be, for example, a [0.0, 1.0] scale, or any [M, N] scale, where M and N are rational numbers such as integers.
Raw VEST data is provided on a [0.0, 1.0] scale. The VEST metric (RjV) may map the raw VEST values linearly by a constant factor. For example, if the pathogenic predictors component is provided on a [0.0, N.0] scale (where N is an integer), the raw VEST values may be multiplied by N. In the example where the pathogenic predictors components (Rjpp) use the same [0.0, 1.0] scale as the raw VEST values, the raw VEST values need not be scaled and may be set to equal the VEST metric (RjV).
Raw PolyPhen-2 data provides two metrics, HumDiv and HumVar, which are trained using two different models. Each of the HumDiv and HumVar values are provided on a [0.0, 1.0] scale. The PolyPhen-2 metric (RjP2) averages these two metrics (HumDiv and HumVar), thereby mapping each of them onto a half-scale (dividing each metric by a factor of 2). For example, if the pathogenic predictors component is provided on a [0.0, N.0] scale (where N is an integer), the raw PolyPhen-2 values HumDiv and HumVar may each be multiplied by N/2. In the example where the pathogenic predictors components (Rjpp) use the same [0.0, 1.0] scale as the raw PolyPhen-2 values, the raw PolyPhen-2 values HumDiv and HumVar are each scaled by ½. In other embodiments, the HumDiv and HumVar may be scaled differently (e.g., 1/n and (n−1)/n), for example, depending on the relative sample size or confidence level of the damaging vs. non-damaging training set. The raw PolyPhen-2 weight (wjP2) is a measure of the confidence of the PolyPhen-2 data, for example, based on the consistency or difference between its two values, (HumDiv−HumVar).
Raw CADD data may provide a “C-score” ($CADD_j$) for each variant representing its pathogenicity or deleteriousness ranking, for example, relative to the approximately 8.6 billion (~$10^{9.9}$) single base changes that could occur in the human genome. The raw CADD C-score ($CADD_j$) is defined on a “Phred” scale, which is a base-10 logarithmic scale in which each 10 points corresponds to an order of magnitude. The raw CADD C-scores may be transformed, for example, based on an optimized logistic function into a sigmoidal distribution of the adjusted CADD scores (e.g., equation (6)). Reference is made to
Raw PROVEAN scores (PROVj) may be transformed to a linear scale using an optimized logistic transformation (e.g., equation (7)). Reference is made to
The raw CADD and PROVEAN values may also indicate intrinsic levels of confidence implied by initially extreme raw CADD and PROVEAN scores. For example, a variant ranked by CADD as among the top 0.1% of variants (e.g., CADDj>30) or among the top 0.0000001% of variants (e.g., CADDj>90), in which all (or substantially all) of the contributing support vectors agree that the variant is damaging, may be transformed to the most damaging pathogenic predictor level (e.g., RjC≈1.0). The confidence level of the CADD and PROVEAN values may be represented by the pathogenic predictor weights, ujC and ujPR. The raw CADD and PROVEAN values are transformed to raw CADD and PROVEAN weights, uj0C and uj0PR, for example, using a power function such as in equations (8) and (9), respectively.
When there is no raw data for a particular variant, the respective subcomponent score may be designated as “not assigned,” and the raw weight, uj0pp, may be set to a null or negligible value, for example, to 0.0. The final individual metric weights, ujpp, may be recalibrated, for example, so that they sum to 1.0. The overall raw weight for the predictive damage component in equation (2), wjPP, may be set equal or proportional to the sum of the final individual metric weights (e.g., wj0PP˜Σpp=1PPujpp, as shown in
Mutation Class Component
The mutation class component MutScorej (SjM) may represent a mutation type, category or class, of a variant. Table 2 shows an example of mutation class subscores (SjM) and raw weights (wj0M) for various mutation types. Mutation types may be a first order categorization of molecular impact on RNA splicing, mRNA translation, and protein function. Examples of mutation types include, start-loss, stop-gain, frame-shift indel, essential splice site, microsatellite, synonymous, in-frame indel, missense, and stop-loss, although other mutation types may be used.
Start-Loss, Stop-Gain, Frame-shift Indel, and Essential Splice Site Mutation Classes: These mutation classes are associated with severe gene dysfunction. Variants in these mutation classes may be assigned a relatively high mutation class subscore (e.g., SjM=1.0) and a relatively high confidence raw weight (e.g., wj0M=99).
Synonymous Mutation Class: This mutation class is associated with a low incidence of gene dysfunction. Variants in the synonymous mutation class may be assigned a relatively low or null mutation class subscore (e.g., SjM=0.0). The synonymous mutation class may be given a neutral confidence raw weight (e.g., wj0M=1.0).
Microsatellite Mutation Class: For variants in the microsatellite mutation class, the longer the length of the repeating microsatellite sequence, the less likely the microsatellite impacts gene function and the less likely that a mutation of one of its alleles will cause dysfunction. Accordingly, the mutation class subscore for variants in a microsatellite class may be inversely proportional to the length of the repeating microsatellite sequence (e.g., $S_j^{M} = 1/(1 + m_{ms}\,STR^{e_{ms}})$, where STR is equal to the number of short tandem repeats that exist in all of the microsatellites in ExAC at a given position, and $e_{ms}$ and $m_{ms}$ are tunable parameters). The microsatellite mutation class may be given a neutral confidence raw weight (e.g., $w_j^{0M}$=1.0).
Missense and Stop-loss Mutation Class: Variants in the missense and stop-loss mutation classes may be assigned a relatively moderate mutation class subscore (e.g., SjM=0.5) because their impact on gene function is highly variable.
In-Frame Indel Mutation Class: An in-frame indel mutation typically inserts or deletes an integer number (AAj) of amino acids or codons. Because the likelihood of dysfunction typically increases the greater the number of damaged amino acids, in-frame indel mutation dysfunction scores may increase with the number of codons added or subtracted, for example, asymptotically approaching a maximum score (e.g., 1.0), for example, as defined in equations (10a) or (10b). Reference is made to
In-frame indels, missense, stop-loss and other mutation classes may be assigned a relatively low raw weight (e.g., wj0M=0.01) because their impact on gene function may be more accurately assessed by other scoring components. The mutation class subscore for variants of these mutation classes will thus only significantly contribute to the aggregate VGDj score if the variant's remaining gene dysfunction components also have relatively small weights or are unassigned. All remaining mutation classes, including introns and un-translated region, may be assigned a relatively low or null mutation class score (e.g., SjM=0.0).
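The example subscores and raw weights given for the mutation classes can be collected into a small lookup, sketched below. The microsatellite formula is read here as 1/(1 + m_ms·STR^e_ms), the default parameter values are placeholders, and in-frame indels are deliberately left out because equations (10a) and (10b) are not reproduced in this section. All function and constant names are hypothetical.

```python
HIGH_IMPACT = {"start-loss", "stop-gain", "frame-shift indel", "essential splice site"}


def microsatellite_subscore(str_count, m_ms=1.0, e_ms=1.0):
    """Subscore that falls as the short tandem repeat count grows (placeholder parameters)."""
    return 1.0 / (1.0 + m_ms * str_count ** e_ms)


def mutation_class_component(mutation_class, str_count=None):
    """Return (subscore, raw_weight) for a mutation class, using the example values."""
    if mutation_class in HIGH_IMPACT:
        return 1.0, 99.0
    if mutation_class == "synonymous":
        return 0.0, 1.0
    if mutation_class == "microsatellite":
        if str_count is None:
            raise ValueError("microsatellite variants need a repeat count")
        return microsatellite_subscore(str_count), 1.0
    if mutation_class in {"missense", "stop-loss"}:
        return 0.5, 0.01
    if mutation_class == "in-frame indel":
        raise NotImplementedError("depends on equations (10a)/(10b), not shown here")
    return 0.0, 0.01  # remaining classes, e.g. intronic or untranslated region
```

The very different raw weights (99 versus 0.01) are what let a stop-gain call dominate the aggregate score while a missense call defers to the other components.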
Clinical Classification Component
The clinical classification component, ClinScorej (SjCl), may represent a clinical classification assigned to a variant j. The clinical classification component is typically only considered (e.g., non-zero) when a variant has been clinically validated as causing disease or dysfunction in a living patient with that disease or dysfunction. The clinical classification component may therefore not be considered (e.g., zero) for all new or de novo variants not yet linked to cause disease or dysfunction in a clinical setting (e.g., variants listed in Table 1 or labeled as “No ClinVar Pathogenicity Classification” in
In some embodiments, to counterbalance low weights for variants with probable or contested pathogenic classifications, the weight for all non-benign variants may be increased based on a compound heterozygote count, hj
where hj may be the number of alternative alleles reported in compound heterozygote patients, for example, by a disease database such as OMIM. The compound heterozygote count, hj, may act as a proxy for independent clinical replication of pathogenic findings. Thus, the compound heterozygote count, hj, may also be used to reduce the confidence assigned to a benign classification, e.g., with hj≧3, for example to reduce its weight (e.g., to wj0Cl=0.0).
Bootstrap Pathogenicity Thresholds
Pathogenicity thresholds may be used to delineate a continuous range of VGD scores that are pathogenic or associated with gene dysfunction. Reference is made to
Expected Discovery Rate of Novel Disease Variants in Heterozygotes or Homozygotes
An expected newborn frequency of deleterious variants in heterozygotes compared to homozygotes (HET:HOM) is derived from the Hardy-Weinberg equilibrium as a function of disease incidence:
For example, based on a disease incidence of 1/2,500, a cystic fibrosis-causing CFTR variant is 98-fold more likely to appear in an unaffected person compared to an affected newborn. Most serious recessive diseases have lower individual incidences and, thus, higher predicted HET:HOM ratios.
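The 98-fold figure follows directly from Hardy-Weinberg proportions. If q is the combined frequency of disease alleles, the disease incidence is q², the heterozygote frequency is 2q(1−q), and the HET:HOM ratio is therefore 2(1−q)/q. The short sketch below evaluates this ratio; the function name is hypothetical.

```python
import math


def het_hom_ratio(disease_incidence):
    """Heterozygote-to-homozygote frequency ratio under Hardy-Weinberg equilibrium."""
    q = math.sqrt(disease_incidence)  # disease allele frequency, since incidence = q**2
    return 2.0 * (1.0 - q) / q


cf_ratio = het_hom_ratio(1 / 2500)      # q = 0.02, ratio of about 98
rarer_ratio = het_hom_ratio(1 / 40000)  # a rarer disease gives a higher ratio
```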
Clinical Classification Data Parsing Details
ClinVar has two sources of data: the clinvar.vcf file (CV, in standard VCF format) and the variant_summary.txt file (VST). The CV and VST files label the positions of deletions differently, so their data are difficult to combine. The VST file right-aligns the position of deletions, while the CV file (like other VCF files) left-aligns the position of deletions. For example, where the first four positions of a reference sequence are ‘AAAA’ and a single ‘A’ is deleted, there is no way to determine which of the four ‘A’s is actually deleted; however, the deletion must be assigned a single position. VCF format labels this deletion at position 0, while VST labels this same deletion at position 4; yet they refer to the exact same variant. A similar issue occurs when labeling a specific sequence change of any indel. There is a need in the art to provide a consistent way to label specific sequence changes that works with any initial format, for example, even identical indels with different reference alleles.
To properly recognize any indel, embodiments of the invention propose a new technique to transform all variants into a standard labeling format. In this format, the reference is a single base, and an indel may be indicated by a first code or symbol (e.g., “+”) preceding all inserted bases and a second code or symbol (e.g., “−”) preceding all deleted bases. This format properly labels the insertion(s) and deletion(s) made to mutate the reference base into a mutated variant or allele. Additional unique difficulties and exceptions may be handled on a case-by-case basis for the corresponding variants.
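A labeling scheme of this kind can be sketched as below. This is one possible reading, not the exact procedure of the embodiments: it assumes that equivalent indels are first shifted as far left as the reference allows, and that the label is the single reference base before the change followed by "+" and the inserted bases or "-" and the deleted bases. The function names and the example reference sequences are hypothetical.

```python
def left_align(reference, anchor, seq):
    """Shift an insertion or deletion as far left as the reference allows.

    `anchor` is the 0-based index of the reference base immediately before the
    inserted or deleted bases `seq`.
    """
    while anchor > 0 and reference[anchor] == seq[-1]:
        # Rotate the indel one base to the left; the resulting allele is unchanged.
        seq = reference[anchor] + seq[:-1]
        anchor -= 1
    return anchor, seq


def standard_label(reference, anchor, seq, kind):
    """Return (anchor index, label), with kind '+' for insertion or '-' for deletion."""
    if kind not in ("+", "-"):
        raise ValueError("kind must be '+' or '-'")
    anchor, seq = left_align(reference, anchor, seq)
    return anchor, f"{reference[anchor]}{kind}{seq}"


# The same single-base deletion reported right-aligned and left-aligned.
right_aligned = standard_label("GAAAAT", 3, "A", "-")
left_aligned = standard_label("GAAAAT", 0, "A", "-")
```

Both calls yield the same position and label, so records from differently aligned sources can be matched on the resulting pair.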
Population Details
In some embodiments, allele frequency in the Finnish population may be ignored because it provides skewed results due to a bottleneck effect that took place during the founding of Finland roughly 2,000 years ago. This bottleneck effect has contributed to “Finnish Disease Heritage,” in which deleterious mutations that are rare elsewhere may exist at disproportionately higher frequencies in Finland. Accordingly, in such embodiments, the Finnish data set may be ignored or only used as part of the global data (in other embodiments, the Finnish data set may be considered).
In some embodiments, low-quality samples (e.g., indicated with a low allele number such as AN<500) may be selectively removed or ignored in the training set on a variant-by-variant basis. In one example, frequencies for a variant were only considered based on super-populations with an above threshold AN (e.g., ANsuper-pop≧500). If none of the super-populations surpassed this threshold, but a global AN exceeded a global threshold (e.g., ANglobal≧1,000) for the variant, the global population frequency may be used instead. For a rare case in which the global AN is below the global threshold, or the variant was not observed in ExAC, the variant may be determined to be novel.
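By way of a non-limiting illustration, the threshold logic of this example may be sketched as follows. The function name `select_frequencies`, the dictionary layout, and the population labels are hypothetical; only the two example thresholds (500 and 1,000) are taken from the text.

```python
from typing import Dict, Optional, Tuple

SUPER_POP_MIN_AN = 500   # example per-super-population threshold from the text
GLOBAL_MIN_AN = 1000     # example global threshold from the text

Counts = Tuple[int, int]  # (allele count AC, allele number AN)


def select_frequencies(
    super_pops: Dict[str, Counts],
    global_counts: Optional[Counts],
) -> Tuple[str, Dict[str, float]]:
    """Pick the frequencies to use for one variant.

    Returns (source, frequencies), where source is "super-population",
    "global", or "novel". `global_counts` is None when the variant was
    not observed at all.
    """
    kept = {
        name: ac / an
        for name, (ac, an) in super_pops.items()
        if an >= SUPER_POP_MIN_AN
    }
    if kept:
        return "super-population", kept
    if global_counts is not None and global_counts[1] >= GLOBAL_MIN_AN:
        ac, an = global_counts
        return "global", {"global": ac / an}
    return "novel", {}
```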
Transformation of CADD and PROVEAN Values
The raw CADD and PROVEAN values may be mapped or transformed onto a uniform linear VGD scale, for example, an [M, N] scale such as [0.0, 1.0].
To transform the raw CADD values, the Phred scale CADD C-scores may be transformed to values on a continuous VGD scale (e.g., [0.0, 1.0]), for example, using the following optimized logistic equation with midpoint (mC) and steepness (kC) parameters:
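By way of a non-limiting illustration, a logistic curve with a midpoint and a steepness parameter may be sketched as follows. The standard logistic form used here is an assumption, as are the function name `cadd_to_vgd_scale` and any parameter values passed to it; the optimized values of mC and kC are not reproduced in this example.

```python
import math


def cadd_to_vgd_scale(c_score: float, midpoint: float, steepness: float) -> float:
    """Map a Phred-scaled CADD C-score onto (0.0, 1.0).

    Assumed form: 1 / (1 + exp(-k * (C - m))), where m is the midpoint
    (the C-score mapped to 0.5) and k > 0 controls how sharply the
    curve rises around that midpoint.
    """
    return 1.0 / (1.0 + math.exp(-steepness * (c_score - midpoint)))
```

With any positive steepness, a C-score equal to the midpoint maps to 0.5, and higher C-scores map monotonically closer to 1.0.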
Similarly, the raw PROVEAN values (PROVj) for each variant (j) may be transformed to the continuous VGD scale (e.g., [M, N], [0.0, 1.0]), for example, using the following equation:
The confidence of the raw CADD and PROVEAN values may increase disproportionately greatly for highly deleterious values of CADDj and PROVj (e.g., extremely high, such as top 0.1%, of CADDj values and extremely low, such as bottom 0.1%, of PROVj values, since PROVj values are negative). Accordingly, the raw CADD and PROVEAN weights, uj0C and uj0PR, may be defined, for example, as:
Mutation Class Component MutScorej of In-Frame Indels
The mutation class component MutScorej (SjM) of in-frame indels typically increases with the number of codons added or subtracted for coding amino acids, denoted (AAj) (e.g., AAj=⅓*number of bases added or subtracted, so that if 12 bases are inserted, AAj=4). As AAj increases, the likelihood of the mutation disrupting protein function increases, but at a diminishing rate (e.g., asymptotically approaching a maximum value, such as, 1.0).
In one embodiment, as shown in
$$S_j^{M} = I_j\left[0.6 + 0.4\left(1 - e^{-AA_j\,\ldots}\right)\right]$$
where Ij is an indicator variable, for example, that is 1 if variant j is an in-frame indel, and 0 otherwise.
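By way of a non-limiting illustration, a saturating score of this general shape may be sketched as follows. The exponent is not fully legible in the equation above, so the sketch assumes a plain exponent of −AAj multiplied by a hypothetical `decay` parameter (default 1.0); the function name `inframe_indel_score` is likewise an assumption.

```python
import math


def inframe_indel_score(n_bases: int, is_inframe_indel: bool, decay: float = 1.0) -> float:
    """Mutation class subscore for an in-frame indel.

    Assumed form: I_j * (0.6 + 0.4 * (1 - exp(-decay * AA_j))), where
    AA_j = n_bases / 3 is the number of codons added or removed.
    `decay` is a hypothetical parameter, not taken from the source.
    """
    if not is_inframe_indel:
        return 0.0                      # indicator I_j is 0
    if n_bases <= 0 or n_bases % 3 != 0:
        raise ValueError("an in-frame indel changes a positive multiple of 3 bases")
    aa = n_bases / 3                    # e.g. 12 inserted bases give AA_j = 4
    return 0.6 + 0.4 * (1.0 - math.exp(-decay * aa))
```

Under these assumptions the score rises from just above 0.6 for a single codon toward 1.0 as more codons are added or removed, with each additional codon contributing less than the previous one.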
In another embodiment, shown in
Evolutionary Selection Component
The “evolutionary selection” component EvoScorej (SjE) may provide a single or cross-species measure that natural selection suppresses a particular allele variant or mutation. An evolutionary model may predict likelihoods that allele mutations or variations would be deleterious based on their frequency or rarity of occurrence across multiple reference genetic sequences in a single species (“single-species” model) or multiple different species (“multi-species” model). For example, allele mutations or variations that are relatively more rare across the reference genetic sequences may be considered negatively selected for evolutionarily (e.g. associated with a deleterious trait for which an organism cannot or has a relatively lower likelihood of surviving or reproducing), while allele mutations or variations that are relatively more common across the reference genetic sequences may be considered positively or neutrally selected for evolutionarily (e.g. not associated with a deleterious trait, but traits for which an organism has a neutral or improved likelihood of surviving or reproducing). Embodiments of the invention may compute a measure of evolutionary variation of alleles (fj) at each of one or more aligned genetic loci as a function of variation in alleles at corresponding aligned genetic loci in the multiple sequence alignment (MSA) (e.g.,
Some embodiments may assign one or more likelihoods that an allele mutation in an organism is deleterious based on the evolutionary variation at the allele loci in real extant species or populations, for example, in order to diagnose living organisms, cells or tumors, or to analyze virtual progeny to filter out prospective pairings of gametes prior to conception.
In some embodiments, multiple reference genetic sequences from the one or more species may be aligned to link or associate one or more genetic loci in each of the multiple different sequences. Aligned loci of the different sequences may be derived from one or more common ancestral genetic loci and/or may relate to the same features, diseases or traits. A measure of evolutionary variation of alleles at one or more of the aligned genetic loci may be computed, for example, as a function of variation in alleles at corresponding aligned genetic loci in the multiple aligned reference genetic sequences. Aligned genetic loci associated with a relatively lower frequency of allele variation may indicate that the alleles are “functional” or relatively important to an organism's survival, so that their mutations may have a relatively higher likelihood of causing deleterious traits in an organism. Conversely, aligned genetic loci associated with a relatively higher frequency of allele variation in the reference genetic sequences may indicate that the alleles are less functional or non-functional and may be mutated with a relatively lower likelihood of affecting survival or causing deleterious traits in an organism. In some embodiments, each reference genetic sequence in the model may be weighted according to the evolutionary proximity of its population or species to the population or species of the organism and/or potential parent. For example, more weight may be assigned to reference genetic sequences of populations or species that are relatively more closely related evolutionarily (e.g., closer on a phylogenetic tree or having a relatively smaller Hamming distance).
Genetic sequences may be obtained from a living organism or from two potential parents, such as two individuals who plan on mating, or one individual seeking a genetic donor and each of a plurality of candidates from a pool of genetic donors. The potential parents' genetic sequences may be obtained from DNA samples of biological material from the potential parents. A mating may be simulated between two potential parents, for example, by combining the genetic information from the two potential parents' genetic sequences to generate one or more genetic sequences of simulated virtual progeny.
The living or virtual organism's genetic sequence may be aligned with one or more of the reference genetic sequences to identify one or more alleles that evolved from the same ancestral genetic loci. The organism may be assigned one or more of the likelihoods of exhibiting deleterious traits associated with one or more alleles or mutations in the virtual progeny genetic sequences based on the measure of evolutionary variation of alleles at the corresponding aligned genetic loci in the reference genetic sequences.
Embodiments of the invention overcome the limitations of relying on specific information derived from human or cellular studies of the effect of mutation in order to score the propensity or probability that a particular mutation or allele will cause a deleterious phenotype, trait or disease.
An insight recognized according to embodiments of the invention is that extant genetic variation, that is existing or surviving genetic variation present amongst homologous or paralogous reference DNA sequences present in different organisms or members of a population, represents the outcome of an experiment that can be informative for predicting whether a given mutation or allele variation in an organism's genome is likely to be deleterious.
This experiment is the four-billion-year-long process of evolution, which has governed the replication and diversification of life on Earth. Today, there are many species, and the individuals within each species all contain copies of genetic material derived from common ancestral versions. As species and individuals reproduce and copy their DNA, mutations appear that make these descendant copies distinct from the parental versions. The eventual fate of such new mutations, whether they will continue to be passed along to offspring or eventually die out, is determined by a stochastic process that is influenced by the mutation's effect on the reproductive fitness of the organism. Mutations that have no functional effect (neutral mutations) or are beneficial to an organism (positive mutations) are more likely to eventually increase in frequency and persist in the population, increasing diversity or replacing their parental version. In contrast, mutations that lower the reproductive fitness of an organism (negative or deleterious mutations) are unlikely to persist and contribute to future genetic variation.
Over the course of evolutionary time, a great many mutations have appeared and persisted, leading to the present diversity amongst DNA sequences derived from a common ancestor. However, this diversity is not equally distributed amongst all sequence positions in a genome. Although mutations are essentially introduced during the replication process independent of any functional effect they may have, the evolutionary filtering process is greatly influenced by such effects. As such, when comparing the genomes of several species or individuals today, some areas are conserved (such as having the same coding sequence and/or non-coding sequence), while others have much more greatly diverged (having very diverged sequences from each other or relative to the ancestral copy number).
Reference is made to
The abbreviations for the species in
In the example of
A multiple sequence alignment of present day reference genetic sequences may be derived from common ancestral genetic loci of multiple species (e.g. different vertebrate sequenced genomes) or multiple individuals within a single species (e.g. a collection of human sequences). A substantially large sample size of organisms, populations or species (e.g., tens, hundreds, or more) may be used for statistically significant likelihoods, for example, to reduce bias error due to a skewed sample set.
Embodiments of the invention may compute a measure of evolutionary variation of alleles f at each of one or more aligned genetic loci as a function of variation in alleles F at corresponding aligned genetic loci in the multiple sequence alignment (MSA). The measure of evolutionary variation of alleles f may be transformed into a likelihood or sub score (SjE) associated with a relative propensity that this allele mutation would be damaging in an organism. This likelihood or subscore (SjE) may be derived, for example, using two functional transformations F and S, to convert columns of aligned genetic loci of a multiple sequence alignment (MSA) and a putative mutation or allele in an organism into a propensity score or likelihood (SjE) relevant to assessing the effect of that particular allele or mutation on the organism, for example, as:
Multiple Sequence Alignment (MSA)→f=F(MSA)→SjE=S(f) (11)
The first functional transformation shown in equation (11), f=F(MSA), may be used to compute a measure of evolutionary variation of alleles f at each of one or more genetic loci derived from one or more common ancestral genetic loci in the multiple organisms as a function of variation in alleles F at corresponding aligned genetic loci in the multiple aligned genetic sequences. The first functional transformation may create a raw score that quantifies the relative amount of sequence conservation at the one or more genetic loci. There are many possible instantiations of this function that may be used according to embodiments of the invention. For example, one such function may input information from the DNA or amino acid genetic sequences present in the alignment and output a Shannon entropy of the sequence characters at each of the one or more genetic loci. Denoting the frequency of a particular symbol (DNA base or amino acid) at a particular genetic locus or column position, (j), in a multiple sequence alignment as pi, i={A, C, G, T} (for DNA, or the set of amino acid symbols if considering a protein alignment), the Shannon entropy function may be calculated, for example, as shown in equation (12):
$$F(\mathrm{MSA}_j) = -\sum_i p_i \log_2 p_i \qquad (12)$$
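By way of a non-limiting illustration, the Shannon entropy of a single alignment column may be computed as sketched below. The function name `column_entropy` is hypothetical, and the sketch assumes the column is given as a string with one symbol per aligned sequence, with gap characters, if any, treated as ordinary symbols.

```python
import math
from collections import Counter


def column_entropy(column: str) -> float:
    """Shannon entropy (in bits) of the symbols in one alignment column."""
    if not column:
        raise ValueError("column must contain at least one symbol")
    n = len(column)
    counts = Counter(column)
    # p * log2(1 / p) equals -p * log2(p) and avoids returning -0.0.
    return sum((c / n) * math.log2(n / c) for c in counts.values())
```

A fully conserved column such as `"AAAA"` yields 0.0 bits, while a DNA column in which all four bases are equally frequent, such as `"ACGT"`, yields the maximum of 2.0 bits.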
Another example of the first functional transformation shown in equation (11), f=F(MSA), may take the average pairwise difference between different symbols (S) in an aligned sequence column of length N, for example, as in equation (13):
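By way of a non-limiting illustration, one common definition of the average pairwise difference in a column of N symbols is the fraction of the N(N−1)/2 unordered pairs whose symbols differ. The sketch below assumes that definition; it is not asserted to match equation (13) term for term, and the function name `mean_pairwise_difference` is hypothetical.

```python
from itertools import combinations


def mean_pairwise_difference(column: str) -> float:
    """Fraction of unordered symbol pairs in a column that differ."""
    n = len(column)
    if n < 2:
        raise ValueError("at least two aligned symbols are required")
    n_pairs = n * (n - 1) // 2
    # True counts as 1 when summed, so this counts the mismatching pairs.
    n_different = sum(a != b for a, b in combinations(column, 2))
    return n_different / n_pairs
```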
Other possible functional forms of the first functional transformation, F(MSA), may calculate a distance metric from a particular species or sequence in the reference alignment. For example, the function may rank all the sequences in the alignment according to their Hamming distance from the reference (e.g., human) sequence, and then calculate the rank of the first sequence with a divergent symbol at the relevant position in the alignment, or if ranking a particular mutation, the rank of the first sequence matching that particular mutation. Additional functional forms such as not using the ordinal rankings of sequences by Hamming distance, but instead using the Hamming distance itself as the metric may be used.
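By way of a non-limiting illustration, the rank-based form described above may be sketched as follows. The function names `hamming` and `rank_of_first_divergence` are hypothetical, and the sketch assumes equal-length, already aligned sequences with a 0-based column index.

```python
from typing import Optional, Sequence


def hamming(a: str, b: str) -> int:
    """Number of aligned positions at which two sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b))


def rank_of_first_divergence(
    reference: str, others: Sequence[str], position: int
) -> Optional[int]:
    """Rank (1 = closest to the reference) of the first sequence that
    differs from the reference at `position`, or None if none does."""
    ranked = sorted(others, key=lambda seq: hamming(reference, seq))
    for rank, seq in enumerate(ranked, start=1):
        if seq[position] != reference[position]:
            return rank
    return None
```

A low rank indicates that even closely related sequences diverge at the position, whereas a high rank or `None` suggests that the position is conserved among near relatives.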
Additionally, the function F(MSA) need not return a single value or be a function of a single column in the multiple sequence alignment. The function may be a composite function of one or more of the functions previously described in addition to others (e.g., F1, F2, F3, . . . ), and may output a vector of values (e.g., (s1, s2, s3, . . . )) rather than a single value. The function may also take as input multiple columns (j), or even the entire alignment, when calculating the value(s), f, and may also take as input a particular mutation under consideration, which may or may not affect the calculation of the values returned by the function.
The second functional transformation specified by equation (11), SjE=S(f), is a function which converts the measure of evolutionary variation of alleles into a subscore or likelihood of being damaging. Many possible instantiations of this function are also possible. For example, a function SjE=S(f) may score the value(s) of f according to its ranking in the empirical distribution of values for all mutations or alleles considered, or that could be considered, based on a collection of multiple reference sequence alignments. Other scoring methods may also be used. For example, a function trained to discriminate mutations in a database of known or suspected to be damaging alleles from neutral alleles may be used to assess the likelihood of damage (e.g., using any variety of statistical models or derived variants from experimental findings), or a function which is trained to assess the likelihood of a mutation reaching a certain frequency in the population. In all cases, this functional transformation, in combination with the first, allows particular genetic allele mutations to be ranked and assessed for likelihood of survival or damage if produced in a child.
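By way of a non-limiting illustration, scoring a value of f by its rank in an empirical distribution may be sketched as follows. The function name `empirical_rank_score` is hypothetical; whether the returned fraction or its complement is used as the subscore depends on whether larger values of f indicate more or less conservation, which this sketch leaves open.

```python
from bisect import bisect_right
from typing import Sequence


def empirical_rank_score(value: float, background: Sequence[float]) -> float:
    """Fraction of background values that are less than or equal to `value`."""
    if not background:
        raise ValueError("background distribution must not be empty")
    ordered = sorted(background)
    return bisect_right(ordered, value) / len(ordered)
```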
Embodiments of the invention may create functional forms of the measure of evolutionary variation F(MSA) using a phylogenetic history for assessing likelihoods of deleterious effects of alleles. Because DNA replication is semi-conservative, the evolutionary history of a DNA molecule may be represented by a bifurcating tree, known as a “phylogenetic” tree, that represents the known or inferred historical or evolutionary relationships between present day extant reference genetic sequences. A large body of scientific literature has developed over the past 30 years that studies the problem of inferring this tree from present day sequences. Typically, such models envision the evolutionary process between nodes of the tree as being similar to a general time reversible (GTR) Markov chain. In these models, in an interval of time, t, there is a certain probability that a base in the sequence will mutate, or transition to another base (e.g., A→C). Such models may be described using a transition matrix that describes the relative probability of transition from one base to another, for example, as shown in equation (14):
In equation (14) above, the elements of the transition matrix Q may define a probability that each base denoted by the row will transition to each base denoted by the column, for example, in an infinitely small evolutionary time interval Δt. Note that the diagonal terms in the transition matrix are not shown, as they are simply equal to one minus the other elements in the row (the probabilities of the elements in each row sum to 1). The πi terms represent the equilibrium frequency of the nucleotide bases {i=A, C, G, T}, and the symbols a, b, c, d, e and f are parameters that further govern the substitution dynamics. This matrix represents a generalized time-reversible model, in which each rate below the diagonal equals a reciprocal rate above the diagonal multiplied by the equilibrium ratio of the two bases. Equation (14) is only one example of a substitution matrix used for phylogenetic inference.
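By way of a non-limiting illustration, a transition matrix with these properties may be sketched as follows. The off-diagonal form (an exchangeability parameter multiplied by the equilibrium frequency of the target base and by Δt) is the usual GTR parametrization and is assumed here; the assignment of the six parameters to base pairs, all numerical values, and the function name `gtr_step_matrix` are hypothetical.

```python
from typing import Dict, FrozenSet

BASES = "ACGT"

# Hypothetical equilibrium frequencies and exchangeabilities; the six
# exchangeabilities play the role of the parameters a through f.
PI = {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3}
EXCH: Dict[FrozenSet[str], float] = {
    frozenset("AC"): 1.0, frozenset("AG"): 4.0, frozenset("AT"): 1.0,
    frozenset("CG"): 1.0, frozenset("CT"): 4.0, frozenset("GT"): 1.0,
}


def gtr_step_matrix(pi, exch, dt):
    """Transition probabilities over a small time step dt.

    Off-diagonal entry (i, j) is exch[{i, j}] * pi[j] * dt; each diagonal
    entry is one minus the other entries in its row, so rows sum to 1.
    """
    matrix = {}
    for i in BASES:
        row = {j: exch[frozenset((i, j))] * pi[j] * dt for j in BASES if j != i}
        row[i] = 1.0 - sum(row.values())
        matrix[i] = row
    return matrix
```

Because the same exchangeability is used in both directions, π_i·P(i→j) equals π_j·P(j→i) for every pair of bases, which is the time-reversibility property described above.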
Reference is made to
After fitting the Markov model that accounts for the phylogenetic history of the sequence alignment, whether using a neural network, Bayesian approach or maximum likelihood approach, embodiments of the invention may provide not only an inferred phylogenetic tree, but also a model for the evolutionary process of the multiple reference sequences in that tree.
Embodiments of the invention may use this evolutionary model to directly assess the likelihood that a mutation is damaging by examining the probability that a mutation found in an organism's genome persists into the future. That is, rather than using the model to infer the past history of evolution, the model is trained using the outcomes of the past and then used to calculate the likelihood of an allele persisting into the future. Because this likelihood is directly related to the probability that the allele is damaging (less likely mutations being more likely to be damaging), this phylogenetic approach, which directly accounts for the past history of the sequence in a parametric model, is a uniquely valuable functional form for the F(MSA) specified by equation (11).
Important extensions to the phylogenetic model are those which either change the model to account for sequence context (e.g., information about sequence location or what a sequence encodes, such as, methylation or homopolymer status) and functional effect (e.g., synonymous vs. non-synonymous, or affecting or not affecting expression), or that partition the sequence in some way to account for varying rates of substitutions, for example, based on the location of loci in the genome.
To account for varying rates of substitutions, for example, instead of directly applying the substitution rate matrix in equation (14) to all sequence substitutions, an alternate model may specify that, although the relative rates of different types of substitutions at all alignment positions are governed by equation (14), the global rate at each site (that is, the total mutation rate, denoted μ) may vary across sites or loci in the genome. A multitude of such models are possible. For example, the rates at different genetic loci may be drawn from a parametric distribution, such as a Γ-distribution that is also fit during the modeling procedure, or the distribution of rates may be derived from several categories of possible distributions. In one embodiment, this model may specify two different distribution categories (such as conserved or rapidly evolving categories) and then train the model to identify to which distribution category the observed sequence belongs, for example, using a hidden Markov procedure. In some embodiments of the invention, the inferred or posterior probability that a genetic locus or mutation belongs to a category in the phylogenetic model may be returned by the F(MSA) function instead of the likelihood itself.
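By way of a non-limiting illustration, the posterior probability that a locus belongs to each rate category may be obtained from per-category likelihoods by Bayes' rule, as sketched below. This is a simplification that treats each locus independently and omits the dependence between neighboring loci that a hidden Markov procedure would add; the function name `category_posteriors` and the category labels are hypothetical, and the likelihoods are taken as given rather than computed from a phylogenetic model.

```python
from typing import Dict


def category_posteriors(
    likelihoods: Dict[str, float], priors: Dict[str, float]
) -> Dict[str, float]:
    """Posterior probability of each rate category for one locus.

    likelihoods[c] is the probability of the observed alignment column
    under category c; priors[c] is the prior weight of category c.
    """
    joint = {c: likelihoods[c] * priors[c] for c in likelihoods}
    total = sum(joint.values())
    if total <= 0.0:
        raise ValueError("at least one category must have positive probability")
    return {c: value / total for c, value in joint.items()}
```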
To account for functional effects, the phylogenetic model used to form the likelihood for the F(MSA) function may directly account for the functional consequence of a mutation. For example, the coding sequence of a protein is determined by triplets of neighboring DNA nucleotides that form a functional unit referred to as a “codon.” A mutation in a nucleotide within a codon may either have a functional effect of changing the amino acid sequence encoded by that codon (in which case it is referred to as “non-synonymous” since the mutation encodes for a different amino acid sequence) or may be a substitution with no functional effect on the amino acid encoded due to the redundancy of the genetic code (in which case it is referred to as “synonymous” since the mutation encodes for the same amino acid sequence). The Markov model that may be used in predicting the likely damaging effect of a mutation may directly account for such functional effects. For example, the transition probability of mutating from nucleotide i to j specified by the ijth matrix Q element qij in equation (14), may be replaced by an instantaneous transition probability tij, for example, defined by equation (15).
This would allow a new instantaneous transition matrix, T, to be used in the model, together with a new parameter, ω, equal to the ratio of the non-synonymous to the synonymous substitution rate, in predicting the likelihood that an allele is damaging or that it persists into the future. In practice, the ω parameter may be constant for an entire multiple sequence alignment, may be assigned to each codon position in the alignment by assuming the values are drawn from some hierarchical distribution, or may be uniquely assigned to each codon position. The substitution model specified by equation (15) may also be altered to account for each of the 64×64 possible elements of a transition matrix representing the rate at which each of the 64 possible codons (e.g., the 4^3=64 different combinations of four nucleotide states (A, T, C, G) at three nucleotide positions in each codon) transitions to each of the 64 possible codons. In all instantiations of the evolutionary model, the functional effect of a sequence change, whether on the amino acid, regulatory context or other biological context, may be directly accounted for and used to predict the likelihood that the allele is damaging.
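By way of a non-limiting illustration, the distinction between synonymous and non-synonymous substitutions, and one way ω might enter a rate, may be sketched as follows. The codon table is the standard genetic code. The rule that a non-synonymous rate is the nucleotide rate multiplied by ω while a synonymous rate is left unchanged is an assumption about the general form of such codon models, not a transcription of equation (15); the function names are hypothetical.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order ("*" marks stop codons).
_BASES = "TCAG"
_AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(_BASES, repeat=3), _AMINO)}


def substitution_class(codon: str, position: int, new_base: str) -> str:
    """Classify a single-nucleotide change within a codon (0-based position)."""
    mutated = codon[:position] + new_base + codon[position + 1:]
    same = CODON_TABLE[codon] == CODON_TABLE[mutated]
    return "synonymous" if same else "non-synonymous"


def scaled_rate(base_rate: float, codon: str, position: int,
                new_base: str, omega: float) -> float:
    """Assumed rule: non-synonymous changes are scaled by omega."""
    if substitution_class(codon, position, new_base) == "synonymous":
        return base_rate
    return omega * base_rate
```

For example, changing CTT to CTC leaves leucine unchanged and is classified as synonymous, whereas changing ATG to ATA replaces methionine with isoleucine and is classified as non-synonymous.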
The particular values, scores, subscores, parameters, graphs, functions, thresholds, relationships, slopes, or rates depicted herein may vary for example based on the input data, validation datasets, updated disease classifications, analyzed genome sequence, reference genome sequences, or other data; however, the trends and relationships such as increasing values, decreasing values, direct and indirect proportionality, continuity, stair-step relationships, etc. may be generalized in some embodiments of the invention to any or a broad range of datasets. These values were obtained using datasets that may have been limited and may be revised with more current datasets to provide more accurate results.
Example Data Acquisition and Exome Sequencing Interval Processing
The VGD model generated according to some embodiments of the invention was trained in one example using variant calling metadata from 60,706 ExAC samples. This dataset was filtered into non-overlapping intervals covering coding regions of 480 autosomal genes associated with severe recessive pediatric disease. These regions were selected because they are known to contribute to genetic disease and collectively constitute a commonly used genetic testing panel.
Filtered intervals covered a total of 1,343,919 base pairs of the human genome. For example, three overlapping subsets of variants located in the targeted regions were considered: 238,461 variants observed in at least one subject in the ExAC dataset, 17,748 ClinVar variants, and 6,274 VariBench variants. Only 8,341 ClinVar and 2,345 VariBench variants in the filtered intervals were present in the ExAC samples. Curated variants not found in the ExAC samples (e.g., having a frequency of ≦1/124,000) may be assigned an allele frequency of zero. In total, individual variant-specific gene dysfunction (VGD) contributions were determined for 250,419 unique variants.
Novel Variant Scoring in a Genotype-Based Disease Liability Analysis
Embodiments of the invention provide a novel and dynamic neural network that incorporates multiple sources of information to compute one or more likelihoods of variant-specific gene dysfunction, VGDj, for each variant j in a variant dataset (j=1, 2, . . . , J). VGDj may be computed as the weighted sum of B component subscores (e.g., equation (2)). Each component subscore may be computed on a continuous uniform scale (e.g., a linear scale defined by [M,N], where M, N are rational numbers, such as a [0.0, 1.0] scale). In one embodiment, the VGD score may be directly proportional to gene dysfunction, for example, where substantially low or below threshold values correlate with a fully functional gene and substantially high or above threshold values correlate with complete gene dysfunction. (Alternatively, the VGD score may be inversely proportional to gene dysfunction, with opposite trends). In one embodiment, the VGD score may be computed as a combination of any two or more of the following components:
$$\mathrm{VGD}_j = w_j^{Cl} S_j^{Cl} + w_j^{M} S_j^{M} + w_j^{PP} S_j^{PP} + w_j^{E} S_j^{E} + w_j^{P} S_j^{P} \qquad (16),$$
where SjCl (or ClinScorej) is the clinical assessment of “pathogenicity,” SjM (or MutScorej) is the expected impact of the mutation class, e.g., on RNA processing and translation, SjPP (or PathScorej) is a combination of predictive damage values obtained from established tools, SjE (or EvoScorej) is the evolutionary constraint factor measuring the natural selection against a variant across one or more species, and SjP (or PopScorej) is the population selection factor measuring the natural selection against a variant within each of multiple individual human populations. Each subscore Sjb is associated with a weight, wjb, for example, directly determined by the attributes of the corresponding variant and representing a level of confidence in the corresponding scoring component. A general overview of example equations, parameters and values used to calculate these components is outlined in
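By way of a non-limiting illustration, the weighted sum of equation (16) may be sketched as follows. The function name `vgd_score`, the dictionary keys, and the numerical subscores and weights in the usage note are hypothetical; in particular, the sketch does not assume how the weights are derived or whether they are normalized.

```python
from typing import Dict

# Component labels as used in equation (16).
COMPONENTS = ("Cl", "M", "PP", "E", "P")


def vgd_score(subscores: Dict[str, float], weights: Dict[str, float]) -> float:
    """Weighted sum of the component subscores available for one variant.

    Components missing from `subscores` are skipped, reflecting that the
    score may combine any two or more of the components.
    """
    used = [b for b in COMPONENTS if b in subscores]
    if len(used) < 2:
        raise ValueError("at least two component subscores are required")
    return sum(weights[b] * subscores[b] for b in used)
```

As a usage example with made-up values, subscores of 1.0, 0.8, 0.9, 0.7 and 1.0 for the Cl, M, PP, E and P components, with weights of 0.4, 0.1, 0.2, 0.1 and 0.2, give a VGD score of approximately 0.93.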
Embodiments of the invention may thereby provide a solution that accounts for many distinct types of variants and incorporates many genetic properties and components to increase detection sensitivity. Rather than indiscriminately accepting any single metric (or a fixed combination of metrics) as being the most appropriate tool for all variants, embodiments of the invention may evaluate the pertinence or weight of the available data for each unique variant. None of the utilized public metrics is the “best” for all variants and each misses diagnosing variants of a sub-optimal type; the tool with the largest numerical contribution to the final VGDj score is directly determined by the corresponding variant. Further, the final VGDj score may be dynamically updated at any time with the advent of additional data or variant evaluation tools.
Impact of Allele Frequency and Clinical Annotations on VGD Score
The VGD model described according to embodiments of the invention uniquely incorporates observed population frequencies and species frequencies from a large ethnically diverse cohort in order to generate a summary disease contribution score VGDj for each variant j. In this way, the VGD model is able to recognize mislabeled variants that others may miss (see e.g., Table 1 and description of
Failure to find literature-based evidence for disease liability becomes less and less tenable as the frequency of disease variants increases. Embodiments of the invention generally assume that if a variant were both truly damaging and relatively common, the variant would have a sufficiently high clinical visibility (bolstering the heterozygous effect in the population selection component) and there would be a sufficient number of affected individuals to lead to this variant's clinical classification (e.g., in the ClinVar database) (bolstering the clinical classification component). Therefore, variants with a high frequency in a population or ExAC dataset, in the absence of annotation in clinical databases, are very unlikely to actually be damaging, regardless of the other scoring factors.
Among the variants with existing clinical annotation, in an example ClinVar dataset, 6,651 variants were unambiguously labeled as “pathogenic.” These variants were classified as “ClinVar-5.1” and have a VGD score approximately equal to Aj. Because damaging variants are unlikely to exist at high frequencies in the population, the vast majority of ClinVar 5.1 variants will have scores close to 1.0 (see below).
An additional 140 ClinVar-5 variants with replicated homozygosity for the presumed disruptive allele in the ExAC cohort, and/or also assigned to one of the “non-pathogenic” ClinVar (ClinVar-2 or ClinVar-3) categories were labeled as “ClinVar-5.2” or “ambiguously assigned ClinVar 5's” (Table 4).
Though the vast majority of ClinVar-5s do not exhibit replicated homozygosity, the few that do are disproportionately represented in the population. To estimate each set of variants' influence on the population, the number of chromosomes in the ExAC cohort was totaled for all variants in each category. For example, the 6,651 ClinVar-5.1 variants were detected in 31,948 chromosomes, while the 140 ClinVar-5.2 variants were detected in 425,139 chromosomes. Because one individual can have multiple different ClinVar-5 variants, some chromosomes may be counted more than once, and the total number of chromosomes for a set of variants may exceed the actual number of chromosomes in the ExAC data.
An alternative cause of inconsistency between a variant's raw VGD score and its frequency may occur when there is a true disease liability in homozygotes, but the allele frequency is lifted, for example, by a heterozygous advantage. A well-known example of such a variant is the HBB sickle cell variant G6V. Under the assumption that all such instances would have been identified and recorded by now, the corresponding allele database (e.g., OMIM) is expected to include compound heterozygote descriptions for the variant. This incidence of compound heterozygotes is accounted for in the VGD score by the heterozygote effect parameter (ch), which defines the number of compound heterozygotes, or combinations of the variant with other variants, described in the clinical literature. The more compound heterozygote instances a variant has, the more likely the variant has a disease-contributing effect, boosting the final VGD damage score. Of the 17,748 ClinVar variants tested, 2,008 were found in compound heterozygotes, only 163 of which have more than 2 compound heterozygote instances. For reference, the well-known CFTR variant F508del was measured to have 19 compound heterozygotes, more than any other variant examined in these tests.
Example Distribution of VGD Scores
Not surprisingly, in one example test, the VGD model identified a majority (e.g., >50%) of the 238,461 ExAC variants examined as not causing disease or gene dysfunction. The median damage score was 0.32 with a large interquartile range (IQR) of (0.03, 0.77) (Table 5). The distribution of the final VGD score may be bimodal, with a large peak, e.g., near 0.0 and a more moderate peak, e.g., near 1.0; however, there are many variants with scores in between these extremes (see e.g.,
VariBench Categorization Results
VariBench data was also analyzed as an alternate to the ClinVar data described herein. Reference is made to
The median VGD score of benign VariBench variants is, e.g., 0.00 (IQR: 0.00, 0.35) compared to, e.g., 0.98 (IQR: 0.83, 1.00) for pathogenic VariBench variants (Table 5). The 86 “pathogenic” VariBench variants with replicated homozygosity in the ExAC data were classified as “VariBench P.2” variants. These VariBench P.2 variants were assigned a much lower score than their VariBench P.1 counterparts, which lacked replicated homozygosity in the ExAC data (e.g., having a median VGD score of 0.00 and 0.98, respectively) (Table 5).
The pathogenic predictors, PolyPhen-2, VEST, CADD, and PROVEAN, were computed for the VariBench P.2 variants, which produced less deleterious damage scores on average than the VariBench P.1 variants (see e.g.,
Comparison of ClinVar and VariBench Categorizations
Demonstrating their ascertainment biases, the ClinVar and VariBench datasets both have a relatively high median damage score of at least 0.96 for all analyzed variants (Table 5). The ClinVar variants were slightly more common than those in the VariBench database. Both types of variants had a median maximum population frequency of 0.0, but the upper bound of the frequency IQR was 4.05×10−4 for ClinVar and only 8.77×10−5 for VariBench (Table 5). In addition, 25% of all ClinVar variants were homozygous in at least three ExAC individuals (Table 5).
To determine the accuracy of the VGD model compared to clinical models, all ClinVar variants were analyzed using the VGD model to generate a naive VGD score, VGD−Cl, calculated without any clinical classification (e.g., ClinVar) information. Reference is made to
Even without their clinical classification data, the VGD model assigned “likely pathogenic” (ClinVar-4) and “pathogenic” (ClinVar-5) variants naive dysfunction scores, VGD−Cl, significantly greater than their benign counterparts (see e.g.,
ClinVar-5.2 variants have significantly lower VGD scores than their ClinVar-5.1 counterparts (e.g., median score of 0.02 compared to 1.0). The VGD model differentiates these two types of variants and scores them accordingly.
Variants categorized as “non-pathogenic” (ClinVar-2) or “likely non-pathogenic” (ClinVar-3) generally have very low VGD scores; for example, both variant types have median VGD scores of 0.00 and average values less than 0.07 (Table 5,
Embodiments of the invention may similarly identify both benign and pathogenic variants characterized by the VariBench data source (see e.g.,
VGD Score for Each Mutation Type
To determine the accuracy of the VGD model as compared to mutation type data, all variants were analyzed using the VGD model to generate the naive VGD score, VGD−M, calculated without any mutation type data (see e.g.,
Start-loss, essential splice site, nonsense, and frame-shift variants have the largest impact on the resulting protein, and by design, also have the largest scores in the VGD model (Table 2,
Missense variants have a large spread of damage scores, with slightly more deleterious than non-deleterious variants (e.g., median VGD=0.57). Stop-loss variants have an intermediate level of damage (e.g., median VGD=0.36). All other mutation types are associated with lower damage scores (e.g., medians less than 0.02) (Table 6,
The VGD model classification was therefore shown to generally concur with each variant's mutation type data, even in a blind test that did not consider the mutation type data. Variants associated with non-damaging mutation types (e.g., synonymous mutations, non-essential splice sites, etc.) were assigned a relatively low probability of disease contribution according to their VGD scores, and variants associated with damaging mutation types (e.g., start-loss, essential splice site, nonsense, and frame-shift variants) were assigned a relatively high probability of disease contribution according to their VGD scores.
Comparison of Results to Other Variant Scoring Techniques
To determine the accuracy of the VGD model as compared to pathogenic predictor metrics, a side-by-side comparison of VGD scores and PolyPhen-2, PROVEAN, CADD, and VEST scores is provided. Reference is made to
In general, for many of the analyzed variants, each of the four pathogenic predictor metrics correctly assigns a deleterious score to “pathogenic” and a non-deleterious score to “benign” ClinVar and VariBench variants (see e.g.,
Further limitations of the four pathogenic predictor metrics are shown in
VGD Outperforms Existing Variant Classification Databases
Embodiments of the invention may assess the disease-risk of novel or de novo mutations or variants that have never before been validated or studied in diseased patients. The VGD model identified 29,317 novel variants as pathogenic (e.g., with VGD scores of at least 0.95) that were not yet clinically classified (e.g., by ClinVar or VariBench) as well as 68,594 novel variants as benign (e.g., with VGD scores less than or equal to 0.05) (Table 1). Accordingly, embodiments of the invention may be used to discover disease-correlated variants before the variants are clinically validated, thereby predicting disease risk in patients that would have otherwise been ignored. The majority of these novel variants have relatively low maximum population frequencies (e.g., median of 6.06×10−5 for the likely damaging variants and 9.63×10−5 for the likely benign variants), which may explain why they have yet to be observed in enough diseased individuals to be included in clinical (e.g., ClinVar and VariBench) curations.
For the 192 genes with at least ten ClinVar 5 and 5.1 variants, the 99% and 99.9% one-sided bootstrap confidence intervals were calculated for the average VGDj−Cl (VGDj without including the ClinVar information) among all ClinVar 5 and just ClinVar 5.1 variants (see e.g., Table 7,
Several genes were found that have a relatively large difference between their interval bounds for their ClinVar 5 and 5.1 variants. Such relatively large intervals are shown, for example, in
An organism that is genetically screened according to embodiments of the invention may be a living organism whose DNA is obtained from the organism's biological sample and sequenced to identify a genetic mutation and assess one or more likelihoods that the genetic mutation causes gene-dysfunction in organisms. The living organism may be one or more of a pre-natal organism, a fetus, a newborn, a post-natal organism, a child, an adult, blood, tissue, saliva, a stem-cell, and a tumor, to perform genotype screening, such as, pre-natal genotype screening, post-natal genotype screening, newborn genotype screening, stem-cell screening, and tumor screening. Other types of living organisms and genotype screenings may be used.
An organism that is genetically screened according to embodiments of the invention may be a virtual or simulated organism. Although the organism is virtual, the organism's genetic information is real, for example, derived by combining at least a portion of real genetic information sequenced from DNA obtained from biological samples of two living potential parents. Accordingly, a virtual organism's genetic information represents transformed, permuted, or intertwined biological DNA samples of two living human organisms.
Reference is made to
For each of the two potential parents, a processor (e.g., sequence analyzer processor 112 of
The two virtual gamete haploid genetic sequences 406 and 408 for the two respective potential parents may be combined to simulate a mating between the first and second potential parents resulting in a virtual progeny diploid genetic sequence 410 (a discrete genome of a child potentially to be conceived).
Since the selection of alleles is at least partially random, this mating is just one of the many possible genetic combinations for the first and second potential parents. This process may be repeated multiple times (e.g., hundreds or thousands of times), each time following a different recombination path (e.g., a different sequence of alleles selected) for one or both of the potential parents, to generate multiple genetic permutations that are possible for mating the first and second potential parents. The virtual progeny diploid genetic sequence 410 may include a single (e.g., most probable) genetic sequence or a probability distribution of multiple possible sequences, for example, to indicate, for many possible matings, the overall likelihood of each of multiple alleles at each of one or more loci in a virtual or hypothetical progeny.
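The repeated simulation of matings described above can be sketched as follows. This is a minimal illustration only, assuming each parent's diploid genotype is given per locus as a pair of alleles and that loci segregate independently (no linkage or recombination map is modeled, which other embodiments may handle differently). The names `virtual_gamete` and `simulate_matings` are hypothetical and do not come from the specification.

```python
import random
from collections import Counter

def virtual_gamete(diploid, rng):
    """Pick one allele per locus at random (assumes independent loci)."""
    return [rng.choice(pair) for pair in diploid]

def simulate_matings(parent1, parent2, n_iter=1000, seed=0):
    """Return, per locus, the relative frequency of each progeny genotype."""
    rng = random.Random(seed)
    counts = [Counter() for _ in parent1]
    for _ in range(n_iter):
        g1 = virtual_gamete(parent1, rng)
        g2 = virtual_gamete(parent2, rng)
        for locus, (a, b) in enumerate(zip(g1, g2)):
            # Sort so that ("A", "a") and ("a", "A") count as one genotype.
            counts[locus][tuple(sorted((a, b)))] += 1
    return [{gt: c / n_iter for gt, c in ctr.items()} for ctr in counts]

# Both parents are carriers at locus 0; parent 1 is homozygous at locus 1.
parent1 = [("A", "a"), ("B", "B")]
parent2 = [("A", "a"), ("B", "b")]
dist = simulate_matings(parent1, parent2, n_iter=10000)
```

Each entry of `dist` is a probability distribution over genotypes at one locus, which corresponds to the case where the virtual progeny is represented as a distribution of possible sequences rather than a single sequence.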
Embodiments of the invention may use methods for simulating a mating between two potential parents and generating a virtual progeny genetic sequence described in U.S. Pat. No. 8,805,620, which is incorporated herein by reference in its entirety. Other methods may also be used.
Once the virtual progeny genetic sequence 410 is generated, it may be assigned one or more of the likelihoods that one or more alleles or mutations in the virtual genetic sequence 410 would be deleterious, for example, if replicated in a real living progeny.
Operations described herein may be executed by one or more processor(s) (e.g., controller(s) or processor(s) 108, 110, and/or 112 of
Reference is made to
In operation 1900, a memory may store and a processor may access a neural network including multiple nodes respectively associated with multiple different gene-dysfunction metrics and multiple different confidence weights. The neural network may combine, aggregate or compose the multiple gene-dysfunction metrics according to the respective associated confidence weights to generate one or more likelihoods that a genetic mutation causes gene-dysfunction in organisms.
In operation 1902, a processor may execute a training-phase. In the training-phase, the processor may train the neural network using an input data set including one or more genetic mutations to generate new gene-dysfunction metrics and new associated confidence weights that optimize the neural network based on a cost factor. In some embodiments, the processor may optimize the neural network in the training-phase by shifting a center of the one or more likelihoods of known pathogenic mutations toward one or more maximal likelihoods, shifting a center of the one or more likelihoods of known benign mutations toward one or more minimal likelihoods, and/or shifting a center of the one or more likelihoods of uncharacterized mutations away from the one or more maximal or minimal likelihoods. In some embodiments, the processor may optimize the neural network in the training-phase by reducing the cost factor associated with the known pathogenic mutations by generating one or more pathogenic thresholds providing a lower bound for the one or more likelihoods of a plurality of the known pathogenic mutations and minimizing the difference between the one or more pathogenic thresholds and respective one or more maximal likelihoods. In some embodiments, the processor may optimize the neural network in the training-phase by reducing the cost factor associated with the uncharacterized mutations by generating one or more pathogenic thresholds providing a lower bound for the one or more likelihoods of a plurality of the known pathogenic mutations and minimizing the number of the uncharacterized mutations having one or more likelihoods above the one or more pathogenic thresholds. In some embodiments, the processor may optimize the neural network in the training-phase by reducing the cost factor associated with the known benign mutations by minimizing mean distribution values of the one or more likelihoods of the known benign mutations. 
In some embodiments, the processor may optimize the neural network in the training-phase on a gene-by-gene basis and may aggregate gene-specific optimization results, for example, across a genome to obtain a combined genome-wide cost factor. Training phase operation 1902 may be repeated, for example, each time a new input data set is received, new nodes are added to the neural network, optimization parameters are changed, or periodically.
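One possible form of the cost factor described for the training-phase is sketched below. This is an illustration under stated assumptions, not the cost function of any embodiment: the pathogenic threshold is taken as a low percentile of the known pathogenic likelihoods (so that it lower-bounds most of them), and the three cost terms are summed with equal weight. The function name `cost_factor` and the parameter `bound_fraction` are hypothetical.

```python
def cost_factor(pathogenic, benign, uncharacterized, bound_fraction=0.05):
    """Illustrative cost; lower is better. Inputs are lists of likelihoods in [0, 1]."""
    ranked = sorted(pathogenic)
    # Pathogenic threshold: a lower bound for all but a small fraction
    # of the known pathogenic likelihoods (assumed percentile rule).
    threshold = ranked[int(bound_fraction * len(ranked))]
    # Term 1: distance between the threshold and the maximal likelihood.
    gap_to_max = 1.0 - threshold
    # Term 2: share of uncharacterized mutations scoring above the threshold.
    frac_above = sum(s > threshold for s in uncharacterized) / len(uncharacterized)
    # Term 3: mean likelihood of the known benign mutations.
    benign_mean = sum(benign) / len(benign)
    return gap_to_max + frac_above + benign_mean

well_separated = cost_factor([0.95, 0.98, 1.0], [0.0, 0.02], [0.1, 0.5, 0.4])
poorly_separated = cost_factor([0.5, 0.6, 0.9], [0.3, 0.4], [0.7, 0.8, 0.2])
```

A training loop would adjust the metrics and confidence weights to lower this value; computing it per gene and summing the results would give a genome-wide cost in the manner described above.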
In operation 1904, a processor may execute a run-time phase. In the run-time phase, the processor may identify a genetic mutation and compute the multiple gene-dysfunction metrics for the identified genetic mutation based on the new gene-dysfunction metrics and the associated new confidence weights of the neural network. The run-time phase may include one or more of operations 1904-1918.
In operation 1906, a processor may compute one or more population selection nodes in the neural network associated with multiple population-specific measures of homozygosity for each of multiple populations, which is further described in reference to
In operation 1908, a processor may compute one or more evolutionary constraint or selection nodes in the neural network associated with a measure of evolutionary variation of alleles at each of one or more common ancestral genetic loci in multiple organisms corresponding to one or more loci of the identified genetic mutation. In one embodiment, the one or more likelihoods that the identified genetic mutation causes gene-dysfunction in organisms may be inversely proportional to the measure of evolutionary variation in alleles. In one embodiment, the processor may compute the measure of evolutionary variation corresponding to the identified genetic mutation based on a frequency with which the genetic mutation has occurred and persisted in the multiple organisms over evolutionary history. In another embodiment, the processor may compute the measure of evolutionary variation corresponding to the identified genetic mutation based on a proximity in a phylogenetic tree representing an evolutionary timescale between a reference genetic sequence of the same species as the organism and one or more other species in which the genetic mutation has occurred. In another embodiment, the processor may compute the measure of evolutionary variation corresponding to the identified genetic mutation based on an average pairwise difference between different alleles at the corresponding one or more loci of the identified genetic mutation. In another embodiment, the processor may compute the measure of evolutionary variation corresponding to the identified genetic mutation based on a distance metric between a reference genetic sequence and a genetic sequence of the organism. In another embodiment, the processor may compute the measure of evolutionary variation corresponding to the identified genetic mutation based on a probability of transitioning from a reference allele to the genetic mutation.
In another embodiment, the processor may compute the measure of evolutionary variation corresponding to the identified genetic mutation based on a ratio w of a non-synonymous substitution rate to a synonymous substitution rate, wherein a non-synonymous substitution is an allele substitution in a codon that does change an amino acid encoded by the codon and a synonymous substitution is an allele substitution in the codon that does not change the amino acid. The processor may weigh the evolutionary constraint nodes with one or more evolutionary constraint confidence weights based on a distribution of mutation rates at different genetic loci in the multiple aligned genetic sequences. In some embodiments, the multiple organisms may be from multiple different species, whereas in other embodiments, the multiple organisms are from a single species. The evolutionary selection node is described in further detail in reference to
In operation 1910, a processor may compute one or more mutation class nodes in the neural network that measure a mutation type metric associated with a mutation type of the identified genetic mutation. In various embodiments, the mutation type of the identified genetic mutation may include one or more of: start-loss, stop-loss, stop-gain, stop-retained, frame-shift indel, in-frame indel, essential splice site, splice region, microsatellite, synonymous, missense, intron, and untranslated region. In one embodiment, the processor may compute the mutation class metric for the identified genetic mutation of a microsatellite mutation type inversely proportionally to a length of a repeating microsatellite sequence in the identified genetic mutation. The processor may weigh the microsatellite mutation class metric with a microsatellite mutation class weight that is directly proportional to the length of the repeating microsatellite sequence in the identified genetic mutation. In one embodiment, the processor may compute the mutation class metric for the identified genetic mutation of an in-frame indel mutation type directly proportionally to a number of codons inserted or deleted by the identified genetic mutation. The processor may weigh the in-frame indel mutation class metric with an in-frame indel mutation class weight that is directly proportional to a number of codons inserted or deleted by the identified genetic mutation.
In operation 1912, a processor may compute one or more pathogenic predictor nodes in the neural network that measure one or more pathogenic predictor metrics predicting a degree of pathology of the identified genetic mutation. In one embodiment, the processor may compute the one or more pathogenic predictor metrics by transforming a PROVEAN value of the identified genetic mutation to a linear scale and an associated confidence weight proportional to the PROVEAN value. In one embodiment, the processor may compute the one or more pathogenic predictor metrics by transforming a CADD value of the identified genetic mutation from a Phred scale to a linear scale and an associated confidence weight proportional to the CADD value. In one embodiment, the processor may compute the one or more pathogenic predictor metrics by transforming two PolyPhen-2 values of the identified genetic mutation, HumDiv and HumVar, into an average value and an associated confidence weight that is inversely proportional to the difference between the two PolyPhen-2 values. In one embodiment, the processor may compute the one or more pathogenic predictor metrics from a VEST value of the identified genetic mutation.
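Two of the transformations described above can be sketched as follows. The exact formulas are not given in this description, so both are assumptions: the CADD conversion uses the standard inversion of a Phred-scaled value, and the PolyPhen-2 confidence weight uses one minus the absolute difference between the two scores as a simple quantity that decreases as the scores disagree. The function names are hypothetical.

```python
def cadd_phred_to_linear(phred):
    """Invert a Phred-scaled score: 10 -> 0.9, 20 -> 0.99, 30 -> 0.999."""
    return 1.0 - 10.0 ** (-phred / 10.0)

def polyphen2_metric(humdiv, humvar):
    """Average the HumDiv and HumVar scores; confidence falls as they disagree."""
    value = (humdiv + humvar) / 2.0
    # Assumed weight form; both PolyPhen-2 scores lie in [0, 1].
    weight = 1.0 - abs(humdiv - humvar)
    return value, weight

linear_cadd = cadd_phred_to_linear(20.0)
pp_value, pp_weight = polyphen2_metric(0.9, 0.7)
```

Both outputs lie on a common 0-to-1 scale, which is what allows them to be combined with the other gene-dysfunction metrics.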
In operation 1914, a processor may compute one or more clinical classification nodes in the neural network that measure one or more clinical classification metrics defining available clinical classification data for the identified genetic mutation. In one embodiment, the processor may compute the clinical classification metrics to be substantially maximal when the identified genetic mutation is clinically classified as pathogenic and substantially minimal when the identified genetic mutation is clinically classified as benign. In one embodiment, the processor may compute a clinical classification confidence weight associated with the clinical classification metrics to be substantially maximal when the clinical classification is uncontested and relatively lower than maximal when the clinical classification is contested.
In operation 1916, a processor may combine or aggregate the values or outputs from the multiple gene-dysfunction metrics to compute one or more likelihoods that the identified genetic mutation causes gene-dysfunction in the organism. In one embodiment, in the run-time phase, a processor may compare the one or more likelihoods to one or more pathogenic threshold ranges to predict if the genetic mutation will cause gene-dysfunction in the organism.
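As a minimal sketch of this aggregation step, the metrics may be combined as a confidence-weighted average and the result compared to threshold ranges. The weighted-average form is an assumption chosen for illustration (a neural network may combine nodes in other ways), and the function names `aggregate` and `classify` are hypothetical. The default thresholds of 0.95 and 0.05 mirror the example cut-offs used elsewhere in this description.

```python
def aggregate(metrics):
    """metrics: list of (value, confidence_weight) pairs, values in [0, 1]."""
    total_weight = sum(w for _, w in metrics)
    if total_weight == 0:
        raise ValueError("no metric carries any confidence")
    return sum(v * w for v, w in metrics) / total_weight

def classify(likelihood, pathogenic_min=0.95, benign_max=0.05):
    """Map a continuous likelihood onto example threshold ranges."""
    if likelihood >= pathogenic_min:
        return "likely pathogenic"
    if likelihood <= benign_max:
        return "likely benign"
    return "intermediate"

# A high-confidence damaging metric outweighs a low-confidence benign one.
score = aggregate([(1.0, 3.0), (0.0, 1.0)])
label = classify(score)
```

Because the score is continuous, the "intermediate" range is retained rather than forced into a binary call.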
In operation 1918, a display may display results, input data, or intermediate data from operations 1900-1916 or data represented as
Reference is made to
In operation 1920, a processor may receive multiple population-specific sets of genetic sequences each including multiple genetic sequences obtained from genetic samples of organisms from a different respective one of multiple populations.
In operation 1922, a processor may generate each of multiple population-specific measures of homozygosity of the genetic mutation for each of the respective multiple populations. The measures of homozygosity in each population may be computed by comparing the count of observed homozygotes of the genetic mutation measured on both chromosomes at a genetic locus in the population-specific set and an expected homozygote count based on a total observed count of the genetic mutation measured on either chromosome at the genetic locus in the population-specific set.
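The comparison of observed and expected homozygote counts can be sketched as below. The expected count is derived here under Hardy-Weinberg equilibrium, which is an assumption made for this example; the description above requires only that the expectation be based on the total observed count of the mutation. The function names are hypothetical.

```python
def expected_homozygotes(allele_count, n_individuals):
    """Expected homozygotes under Hardy-Weinberg: N * q^2, with q = AC / (2N)."""
    q = allele_count / (2 * n_individuals)
    return n_individuals * q * q

def homozygosity_deficit(observed_hom, allele_count, n_individuals):
    """Positive when fewer homozygotes are observed than expected."""
    expected = expected_homozygotes(allele_count, n_individuals)
    if expected == 0:
        return 0.0
    return (expected - observed_hom) / expected

# 200 copies of the variant among 1,000 individuals (2,000 chromosomes).
expected = expected_homozygotes(200, 1000)
deficit_none_seen = homozygosity_deficit(0, 200, 1000)
deficit_excess = homozygosity_deficit(20, 200, 1000)
```

A deficit near 1 (no homozygotes seen where about ten are expected) is consistent with selection against homozygotes, whereas a negative value indicates more homozygotes than expected.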
In operation 1924, a processor may generate each of multiple population-specific measures of heterozygosity associated with the genetic mutation for each of the respective multiple populations based on the total observed count of the genetic mutation and the clinical visibility of the genetic mutation. In one embodiment, each of the multiple population-specific measures of heterozygosity is directly proportional to the clinical visibility of the genetic mutation and inversely proportional to the total count of the genetic mutation in the corresponding population. In one embodiment, each of the multiple population-specific measures of heterozygosity increases when the clinical visibility of the genetic mutation is relatively greater than the total count of the genetic mutation in the corresponding population. In one embodiment, the processor may weigh the multiple population-specific measures of heterozygosity based on the total frequency of the genetic mutation in the corresponding population. In one embodiment, the processor may compute the clinical visibility of the genetic mutation based on one or more measures of: a frequency with which the genetic mutation is cited in clinical studies or literature, a number of published articles referencing the genetic mutation, a number of compound heterozygotes of the genetic mutation with other variants described in clinical studies or literature, and a number of search results or an order or ranking in a search result for the genetic mutation.
In operation 1926, a processor may generate one or more measures of a dominant effect associated with the genetic mutation based on an allele count of the identified genetic mutation compared to a distribution of allele counts across a plurality of pathogenic mutations in an identified gene. In one embodiment, the processor may weigh the measures of the dominant effect with one or more confidence weights defined based on the allele count, a number of the pathogenic mutations in the identified gene, and/or a distribution of allele counts across the plurality of pathogenic mutations in the identified gene.
In operation 1928, a processor may compute one or more likelihoods that the genetic mutation causes gene-dysfunction in the organism based on one or more of the multiple population-specific measures of homozygosity. In one embodiment, the processor may compute the one or more likelihoods that the genetic mutation causes gene-dysfunction to be greater when the observed homozygote count is less than the expected homozygote count and to be smaller when the observed homozygote count is greater than the expected homozygote count. In one embodiment, the processor may compute the one or more likelihoods that the genetic mutation causes gene-dysfunction to increase when the observed homozygote count is less than the expected homozygote count. In one embodiment, the processor may compare the one or more likelihoods to one or more pathogenic threshold ranges to predict if the genetic mutation will cause gene-dysfunction in the organism. In one embodiment, the processor may compare the one or more likelihoods to one or more benign threshold ranges to predict if the genetic mutation is benign in the organism. In some embodiments, the processor may compute the one or more likelihoods based on a combination of the multiple population-specific measures of homozygosity corresponding to the multiple populations, each weighted according to an independent population-specific weight. In one embodiment, each of the population-specific weights is defined based on a magnitude of the observed or expected homozygote count in each population-specific set. In one embodiment, the population-specific weights are defined based on degrees from which the organism descended from each population. In some embodiments, the processor may compute the one or more likelihoods based on a single population-specific measure of homozygosity corresponding to a single primary population from which the organism descended. 
In some embodiments, the processor may compute the one or more likelihoods and weights by training a model to discriminate between genetic mutations known to cause pathology and genetic mutations known to be benign. The one or more likelihoods may be computed on a continuous scale.
In operation 1930, a display may display results, input data, or intermediate data from operations 1920-1928 or data represented as
Reference is made to
In operation 1932, a processor may receive multiple aligned reference genetic sequences of multiple extant organisms representative of one or more species or populations (e.g., as shown in
In operation 1934, the processor may build or obtain a model representing measures of evolutionary variation of alleles or nucleotides at one or more aligned genetic loci between the multiple organisms. The model may be a single-species model (e.g., the multiple organisms are from the same single species) or a multi-species model (e.g., the multiple organisms are from different multiple species). The model may include, for example, a phylogenetic tree (e.g., as shown in
In operation 1936, the processor may receive a genetic sequence of an organism to be genetically screened.
In operation 1938, the processor may use the model of operation 1934 defining the evolutionary past variation among multiple extant organisms of different populations or species to predict or interpolate a likelihood or probability of evolutionary health of the organism in operation 1936. The processor may determine the differences between the organism's genetic sequence and one or more aligned reference genetic sequences and may assign each allele (or only different or mutated alleles) a measure of evolutionary variation that is a function of variations in alleles at corresponding aligned genetic loci in the multiple aligned genetic sequences (e.g., loci derived from one or more common ancestral genetic loci in the multiple organisms). The processor may compute one or more likelihoods that an allele mutation at each of the one or more genetic loci in the organism is deleterious based on the measure of evolutionary variation of alleles at the corresponding aligned genetic loci for the multiple organisms. The likelihoods may include one or more likelihoods or likelihood distributions for one or more alleles, one or more allele mutations, one or more genes, one or more codons, one or more genetic loci or loci segments, for one or more living organisms or virtual progeny of two potential parents (e.g., generated by repeatedly simulating a mating using different virtual gamete(s) in each iteration) and/or for one or more pairs of potential parents (e.g., generated by repeatedly simulating a mating step, in each iteration using the genetic information obtained in operation 1936). The one or more likelihoods may be compared to one or more thresholds or other statistical models to predict if (or a likelihood or degree in which) an allele mutation is deleterious in the organism.
For example, mutations at genetic loci with relatively constant or fixed alleles and relatively lower measures of evolutionary variation may be associated with relatively higher likelihoods of deleterious traits, whereas mutations at genetic loci with relatively volatile or changing alleles and relatively higher measures of evolutionary variation may be associated with relatively lower likelihoods of resulting in deleterious traits.
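The relationship between evolutionary variation and the likelihood of a deleterious effect can be illustrated with the average pairwise difference, one of the measures of variation mentioned above. The sketch below treats one aligned locus as a column of alleles, one per aligned sequence. Mapping the variation to a constraint score as one minus the variation is an assumption made for illustration, and the function names are hypothetical.

```python
from itertools import combinations

def average_pairwise_difference(column):
    """Fraction of sequence pairs whose alleles differ at one aligned locus."""
    pairs = list(combinations(column, 2))
    if not pairs:
        return 0.0
    return sum(a != b for a, b in pairs) / len(pairs)

def constraint_score(column):
    """Higher when the locus is conserved (assumed form: 1 - variation)."""
    return 1.0 - average_pairwise_difference(column)

conserved = constraint_score("AAAA")   # same allele in every aligned sequence
mixed = average_pairwise_difference("AATT")
volatile = constraint_score("ACGT")    # a different allele in every sequence
```

A mutation at the conserved locus would receive the higher likelihood of being deleterious, and a mutation at the volatile locus the lower one.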
In operation 1940, an output device (e.g., output device 320 of
The organism screened for gene-dysfunction in
When the organism is a virtual organism, a processor may generate the virtual progeny by combining at least a portion of genetic information representing DNA obtained from biological samples of two living potential parents. In one embodiment, the processor may simulate a mating between the two potential parents by combining their genetic sequences to generate one or more virtual progeny genetic sequences (e.g., sequence 410 of
Other or different operations or orders of operations may be used and operations may be repeated, e.g., until the likelihoods or optimization cost factor converge or asymptotically approach a statistically stable result.
Current literature and variant classification databases distribute recessive-disease associated variants into binary “pathogenic” and “benign” groupings. These generalizations are based on outdated and limited genetic practices.
There is a need in the art to incorporate multiple complementary resources to generate a more accurate computational score. With the advent of extensive disease databases, such as the ExAC database, and with additional exome sequencing data of healthy and diseased individuals, the VGD model provides a comprehensive score that incorporates in silico models and actual observed variant frequencies. The VGD model is flexible and dynamic, for example, using a neural network to incorporate a growing combination of model components or nodes, and to re-train the model based on growing clinical resources and genetic datasets. The VGD model may be used to assess gene-dysfunction for any variant and is dynamic enough to handle unique variants. The VGD model combined with the reduction in cost of next-generation sequencing provides the capability to efficiently and accurately estimate disease contribution on a continuous scale for any variant.
There is an imperative need for a disease classification system that moves beyond the binary “pathogenic” and “benign” categorizations for recessive-disease associated variants. The VGD model hereby provides estimates of the functional impact that variants have on the expression of single locus recessive diseases on a continuous scale. The VGD model further combines several metrics into one or more likelihoods, increasing the accuracy and sensitivity for assessing the probability that a variant is disease contributing, and identifying new variants that were previously ignored as well as mistaken variants that were previously erroneously classified as pathogenic due to more rudimentary methods, exposing their limitations and restrictions. The VGD model can be used to quantify the level of risk associated with any variant and represents a more accurate metric for describing potential disease contributing variants.
As used herein, a “chromosome” may refer to a molecule of DNA with a sequence of basepairs that corresponds closely to a defined chromosome reference sequence of the organism in question.
As used herein, a “gene” may refer to a DNA sequence in a chromosome that codes for a product (either RNA or its translation product, a polypeptide) or otherwise plays a role in the expression of said product. A gene contains a DNA sequence with biological function. The biological function may be contained within the structure of the RNA product or a coding region for a polypeptide. The coding region includes a plurality of coding segments (“exons”) and intervening non-coding sequences (“introns”) between individual coding segments and non-coding regions preceding and following the first and last coding regions respectively.
As used herein, a “locus” may refer to any segment of DNA sequence defined by chromosomal coordinates in a reference genome known to the art, irrespective of biological function. A DNA locus can contain multiple genes or no genes; it can be a single base pair or millions of base pairs.
As used herein, a “polymorphic locus” may refer to a genomic locus at which two or more alleles have been identified.
As used herein, an “allele” may refer to one of two or more existing genetic variants of a specific polymorphic genomic locus.
As used herein, a “variant” or “genetic mutation” may be any one or more bases, nucleotides, or alleles, which may or may not differ compared to reference, common or expected bases, nucleotides, or alleles, for example, of one or more reference genetic sequences.
As used herein, “genotype” may refer to the diploid combination of alleles at a given genetic locus, or set of related loci, in a given cell or organism. A homozygote includes two copies of the same allele and a heterozygote includes two distinct alleles. In the simplest case of a locus with two alleles “A” and “a”, three genotypes can be formed: A/A, A/a, and a/a, of which A/A and a/a are homozygotes and A/a is a heterozygote.
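By way of a non-limiting illustration, the enumeration of genotypes described above may be sketched as follows. The function names `enumerate_genotypes` and `is_homozygote` are hypothetical and are introduced here only for the example; genotypes are treated as unordered pairs of alleles.

```python
from itertools import combinations_with_replacement


def enumerate_genotypes(alleles):
    """Return every unordered diploid genotype that the given alleles can form."""
    return [tuple(pair) for pair in combinations_with_replacement(alleles, 2)]


def is_homozygote(genotype):
    """A homozygote carries two copies of the same allele."""
    return genotype[0] == genotype[1]


# Simplest case from the definition: one locus with alleles "A" and "a".
genotypes = enumerate_genotypes(["A", "a"])
homozygous_flags = [is_homozygote(g) for g in genotypes]
```

For a locus with k alleles, this enumeration yields k(k+1)/2 genotypes, of which k are homozygotes.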
As used herein, “genotyping” may refer to any experimental, computational, or observational protocol for distinguishing an individual's genotype at one or more well-defined loci.
As used herein, a “haplotype” may refer to a unique set of alleles at separate loci that are normally grouped closely together on the same DNA molecule, and are observed to be inherited as a group. A haplotype can be defined by a set of specific alleles at each defined polymorphic locus within a haploblock.
As used herein, a “haploblock” may refer to a genomic region that maintains genetic integrity over multiple generations and is recognized by linkage disequilibrium within a population. Haploblocks are defined empirically for a given population of individuals.
As used herein, “linkage disequilibrium” may refer to the non-random association of alleles at two or more loci within a particular population. Linkage disequilibrium is measured as a departure from the null hypothesis of linkage equilibrium, where each allele at one locus associates randomly with each allele at a second locus in a population of individual genomes.
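By way of a non-limiting illustration, one commonly used measure of this departure is the coefficient D = p(AB) − p(A)·p(B), where p(AB) is the observed frequency of haplotypes carrying both alleles and p(A) and p(B) are the individual allele frequencies. The sketch below assumes that phased two-locus haplotypes are available as pairs of alleles; the function name `linkage_disequilibrium` and the sample data are hypothetical, and other measures of linkage disequilibrium may equally be used.

```python
def linkage_disequilibrium(haplotypes, allele_1, allele_2):
    """Compute D = p(AB) - p(A) * p(B) from a list of two-locus haplotypes.

    haplotypes: list of (allele_at_locus_1, allele_at_locus_2) tuples.
    """
    n = len(haplotypes)
    p_a = sum(1 for h in haplotypes if h[0] == allele_1) / n
    p_b = sum(1 for h in haplotypes if h[1] == allele_2) / n
    p_ab = sum(1 for h in haplotypes if h == (allele_1, allele_2)) / n
    return p_ab - p_a * p_b


# Alleles always travel together: strong non-random association.
linked = [("A", "B")] * 4 + [("a", "b")] * 4
# Every combination equally common: linkage equilibrium.
unlinked = [("A", "B"), ("A", "b"), ("a", "B"), ("a", "b")]

d_linked = linkage_disequilibrium(linked, "A", "B")
d_unlinked = linkage_disequilibrium(unlinked, "A", "B")
```

A value of D equal to zero corresponds to the null hypothesis of linkage equilibrium; a non-zero value indicates non-random association.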
As used herein, a “genome” may refer to the total genetic information carried by an individual organism or cell, represented by the complete DNA sequences of its chromosomes.
As used herein, a “genome profile” may refer to a representative subset of the total information contained within a genome. A genome profile contains genotypes at a particular set of polymorphic loci.
As used herein, a “personal genome profile”, abbreviated PGP, may refer to the genome profile of a particular individual person.
As used herein, a genetic “trait” may refer to a distinguishing attribute of an individual, whose expression is fully or partially influenced by an individual's genetic constitution.
As used herein, a “phenotype” may refer to a class of alternative traits which may be discrete or continuous.
As used herein, “haploid cell” may refer to a cell with a haploid number (n) of chromosomes.
As used herein, “gametes” may refer to specialized haploid cells (e.g., spermatozoa and oocytes) produced through the process of meiosis and involved in sexual reproduction.
As used herein, “gametotype” may refer to single copies with one allele of each of one or more loci in the haploid genome of a single gamete.
As used herein, an “autosome” may refer to any chromosome exclusive of the X and Y sex chromosomes.
As used herein, a “diploid cell” may refer to a cell that has a homologous pair of each of its autosomal chromosomes, and thus has two copies (2n) of each autosomal genetic locus.
As used herein, a “haplopath” may refer to a haploid path laid out along a defined region of a diploid genome by a single iteration of a Monte Carlo simulation or a single chain generated through a Markov process. A haplopath can be formed by starting at one end of a personal chromosome or genome and walking from locus to locus, choosing a single allele at each step based on available linkage disequilibrium information, inter-locus allele association coefficients, and formal rules of genetics that describe the natural process of gamete production in a sexually reproducing organism. A “haplopath” is generated through the application of formal rules of genetics that describe the reduction of the diploid genome into haploid genomes through the natural process of meiosis.
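By way of a non-limiting illustration, a single Monte Carlo iteration of such a walk may be sketched as follows. This is a deliberately simplified sketch: it assumes the diploid genotype is already phased into two strands, and it stands in for the linkage disequilibrium information and inter-locus association coefficients with a single hypothetical per-interval probability of switching strands (`switch_probs`). The function name `sample_haplopath` is likewise hypothetical.

```python
import random


def sample_haplopath(phased_genotype, switch_probs, rng):
    """Walk locus to locus, choosing one allele per locus.

    phased_genotype: list of (allele_on_strand_0, allele_on_strand_1) per locus.
    switch_probs: probability of switching strands between adjacent loci
                  (one value per interval, so len(phased_genotype) - 1 values).
    rng: a random.Random instance, so that a run can be reproduced.
    """
    strand = rng.randrange(2)  # start on either strand with equal chance
    path = [phased_genotype[0][strand]]
    for locus, p_switch in zip(phased_genotype[1:], switch_probs):
        if rng.random() < p_switch:  # simulated crossover between the loci
            strand = 1 - strand
        path.append(locus[strand])
    return path


genotype = [("A", "a"), ("B", "b"), ("C", "c")]
haplopath = sample_haplopath(genotype, [0.1, 0.4], random.Random(0))
```

Each call produces one haploid path; repeating the call yields a collection of haplopaths from the same diploid genome.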
As used herein, a “Virtual Gamete” may refer to a single haplopath that extends across an entire genome.
As used herein, a “Virtual Progeny” or “Virtual Progeny genome sampling” may refer to the discrete genetic product of two Virtual Gametes. Virtual Progeny may be generated as disclosed in U.S. Pat. No. 8,805,620, incorporated herein by reference in its entirety.
As used herein, a “Virtual Progeny genome” may refer to a collection of discrete Virtual Progeny genome samplings, each generated by combining two uniquely-derived random Virtual Gametes. In some instances, a Virtual Progeny genome is represented as a probability mass function over a sample space of all discrete genome states. In some instances, a Virtual Progeny is an informed simulation of a child or children that might result as a consequence of sexual reproduction between two individuals.
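By way of a non-limiting illustration, the representation of a Virtual Progeny genome as a probability mass function may be sketched by repeatedly combining two randomly drawn gametes and tallying the resulting discrete genome states. The sketch assumes, for simplicity only, that each locus segregates independently (no linkage); the function names `sample_gamete` and `virtual_progeny_pmf` and the example parents are hypothetical.

```python
import random
from collections import Counter


def sample_gamete(parent, rng):
    """Pick one allele per locus; loci are assumed unlinked in this sketch."""
    return tuple(rng.choice(locus) for locus in parent)


def virtual_progeny_pmf(parent_1, parent_2, n_samples, rng):
    """Estimate a probability mass function over discrete progeny genome states."""
    counts = Counter()
    for _ in range(n_samples):
        gamete_1 = sample_gamete(parent_1, rng)
        gamete_2 = sample_gamete(parent_2, rng)
        # Sort each allele pair so that A/a and a/A count as the same genotype.
        progeny = tuple(tuple(sorted(pair)) for pair in zip(gamete_1, gamete_2))
        counts[progeny] += 1
    return {genome: count / n_samples for genome, count in counts.items()}


# Two carrier parents, each heterozygous (A/a) at a single locus.
carrier = [("A", "a")]
pmf = virtual_progeny_pmf(carrier, carrier, 10_000, random.Random(0))
```

For two heterozygous carriers at a single autosomal locus, the estimated probability of the a/a state approaches the conventional 25% figure as the number of samplings grows.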
As used herein, a “Virtual Progeny phenome” may refer to a multi-dimensional likelihood function representing the likelihood and/or likely degree of expression of a set of one or more traits from a complete Virtual Progeny genome. In some instances, a Virtual Progeny phenome is represented as a probability mass function over a sample space of discrete or continuous phenotypic states. In some instances, a Virtual Progeny phenome is an informed simulation of a child or children that might result as a consequence of sexual reproduction between two individuals.
As used herein, “potential parent” may refer to an individual whose genetic information is combined with another's genetic information to simulate a mating before a child is conceived. The mating may be simulated for two potential parents both interested in their combined genetic code, or a single individual iterating the mating over a plurality of candidate donors, for example, to select an optimal donor from a sperm or egg donor bank. A “partner” may refer to a marriage partner, sexual or reproductive partner, domestic partner, opposite-sex partner, or same-sex partner.
As used herein, a “living” organism may refer to a real, extant, surviving, currently living, or previously living (now deceased), organism. A “virtual” organism may refer to a never or non-living organism, for example, simulated by computer models, where all its genetic information is derived by combining data representing real biological DNA obtained from living organisms such as two potential parents by simulating a mating or conception process.
As used herein, a “DNA image” may refer to a magnified picture or image of real biological DNA, or may refer to a simplified schematic representation thereof such as a nucleotide or DNA sequence. In one example, a DNA image may be a zoomed-out view of a DNA sequence.
As used herein, “reference genetic sequence” may refer to a genetic sequence used to generate an evolutionary model, such as, a phylogenetic tree. Reference genetic sequences may include standardized genetic sequences from organisms representative of one or more evolutionarily extant (currently or previously living) populations or species, such as those released by genome consortia (e.g., human reference genome, such as, Genome Reference Consortium Human Build 37 (GRCh37) provided by the Genome Reference Consortium). Reference genetic sequences may additionally or alternatively include non-standardized sequences of organisms, such as, any member of a population or species. A single-species model may be generated using reference genetic sequences from multiple organisms of the same single species, e.g., 1,000 chimpanzees or humans. A multi-species model may be generated using reference genetic sequences from multiple organisms of multiple different species, e.g., one model using 1,000 humans, 10 chimpanzees and one gorilla, or another model using a single different organism from each different species as shown in
As used herein, “potential parent genetic sequences” may refer to genetic sequences of real (currently or previously living) potential parents, for example, from which genetic information is combined to simulate a virtual mating generating one or more virtual children or progeny, to predict, before they conceive a child, a likelihood that such a child would have a deleterious trait. The potential parent genetic sequences may be obtained from genetic samples of two potential parents seeking to mate, or from a first potential parent seeking a genetic donor and a second potential parent from a pool of candidate donors.
As used herein, “virtual progeny genetic sequences” may refer to genetic sequences of simulated (never living) virtual progeny generated by simulating a mating or combining genetic information from two potential parent genetic sequences. Each virtual progeny genetic sequence may be a prediction or simulation of one possible genetic sequence of a child of the two potential parents, before that child is conceived. To achieve more robust results, the simulated mating may be repeated to generate multiple virtual progeny genetic sequences for each pair of potential parents. The virtual progeny genetic sequences may be compared to the reference genetic sequences, for example, to identify evolutionarily rare, and therefore, likely deleterious traits.
In some embodiments, genetic information may be used interchangeably for potential parent genetic sequences and reference genetic sequences. In one example, genetic information from a potential parent or donor may be used instead of, or in combination with reference consortium genetic sequences, to generate an evolutionary model or phylogenetic tree. In another example, reference consortium genetic sequences may be used instead of, or in combination with potential parent or donor genetic sequences, to simulate matings or predict likelihoods of deleterious traits in offspring.
As used herein, a “genetic sequence” may include genetic information representing one or more bases, nucleotides or alleles (sequences of nucleotides defining different forms of a gene) for any number of sequential or non-sequential genetic loci. For example, a “genetic sequence” may refer to allele information at a single genetic locus, or multiple genetic loci, such as, one or more gene segments or an entire genome. A genetic sequence is a data structure representing genetic information at one or more loci of a real or virtual genome. Genetic sequence data structures may include, for example, one or more vectors, scalar values, functions, sequences, sets, matrices, tables, lists, arrays, and/or other data structures, representing one or more bases, nucleotides, genes, alleles, codons or other genetic material. The data structures representing each single chromosome sequence may be one dimensional (e.g., representing a single base or allele per locus) or multi-dimensional (e.g., representing multiple or all bases A, T, C, G or alleles at each locus and a probability associated with the likelihood of each existing in a potential progeny). The same (or different) data structures may be used for real and virtual genome sequences, though real genome sequences generally represent real genetic material (e.g., DNA extracted from a currently or previously existing genetic sample), while virtual genome sequences are generated by combining at least a portion of genetic sequences from biological DNA samples of two living potential parents.
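By way of a non-limiting illustration, the one-dimensional and multi-dimensional representations described above may be sketched as follows. The variable and function names, the choice of a list of dictionaries, and the example values are assumptions made only for this sketch; any of the other data structures listed above may be used instead.

```python
# One-dimensional: a single base per locus.
one_dimensional = ["A", "T", "C"]

# Multi-dimensional: at each locus, candidate bases mapped to the probability
# of each existing in a potential progeny (hypothetical example values).
multi_dimensional = [
    {"A": 0.75, "G": 0.25},
    {"T": 1.0},
    {"C": 0.25, "T": 0.75},
]


def is_valid(loci, tol=1e-9):
    """Check that the probabilities at every locus sum to one."""
    return all(abs(sum(probs.values()) - 1.0) <= tol for probs in loci)


def most_likely_sequence(loci):
    """Collapse a multi-dimensional sequence to its most probable base per locus."""
    return [max(probs, key=probs.get) for probs in loci]


collapsed = most_likely_sequence(multi_dimensional)
```

Collapsing the multi-dimensional form in this way discards the per-locus probabilities, which is why the two forms are kept distinct.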
As used herein, “a~b” may represent a proportional “~” relationship between a and b.
As used herein, “a≈b” may represent an approximate equivalence “≈” between a and b, for example, within 10% of either value.
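By way of a non-limiting illustration, the 10% example above may be sketched as a simple comparison. The sketch assumes one possible reading of “within 10% of either value”, namely that the absolute difference does not exceed 10% of the larger of the two magnitudes; the function name `approximately_equal` and the `tolerance` parameter are hypothetical.

```python
def approximately_equal(a, b, tolerance=0.10):
    """Return True when |a - b| is within `tolerance` of the larger magnitude.

    Assumed reading of "within 10% of either value"; other readings are possible.
    """
    return abs(a - b) <= tolerance * max(abs(a), abs(b))


close = approximately_equal(100, 91)      # differ by 9, allowed up to 10
not_close = approximately_equal(100, 89)  # differ by 11, allowed up to 10
```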
In the foregoing description, various aspects of the present invention are described. For purposes of explanation, specific configurations and details are set forth in order to provide a thorough understanding of the present invention. However, it will also be apparent to persons of ordinary skill in the art that the present invention may be practiced without the specific details presented herein. Furthermore, well known features may be omitted or simplified in order not to obscure the present invention.
Unless specifically stated otherwise, it is appreciated that throughout the specification discussions utilizing terms such as “processing,” “computing,” “calculating,” “determining,” or the like, refer to the action and/or processes of a computer or computing system, or similar electronic computing device, that manipulates and/or transforms data represented as physical, such as electronic, quantities within the computing system's registers and/or memories into other data similarly represented as physical quantities within the computing system's memories, registers or other such information storage, transmission or display devices.
Embodiments of the invention may include an article such as a computer or processor readable non-transitory storage medium, such as for example a memory, a disk drive, or a USB flash memory device (e.g., memory unit(s) 114, 116, and/or 118 of
Different embodiments are disclosed herein. Features of certain embodiments may be combined with features of other embodiments; thus certain embodiments may be combinations of features of multiple embodiments.
The foregoing description of the embodiments of the invention has been presented for the purposes of illustration and description. It is not intended to be exhaustive or to limit the invention to the precise form disclosed. It should be appreciated by persons of ordinary skill in the art that many modifications, variations, substitutions, changes, and equivalents are possible in light of the above teaching. For example, it should be appreciated that sign conventions are equivalent and that embodiments of the invention in which values are above a lower bound or threshold are equivalent to embodiments of the invention in which values are below an upper bound or threshold, since the difference is a mere convention of sign. It is, therefore, to be understood that the appended claims are intended to cover all such modifications and changes as fall within the true spirit of the invention.
This patent application claims the benefit of U.S. Provisional Patent Application No. 62/151,116 filed Apr. 22, 2015 and is a continuation-in-part of U.S. patent application Ser. No. 14/568,456 filed Dec. 12, 2014, which claims the benefit of U.S. Provisional Patent Application No. 62/013,139 filed Jun. 17, 2014, all of which are incorporated herein by reference in their entirety.
Number | Date | Country
---|---|---
62151116 | Apr 2015 | US
62013139 | Jun 2014 | US

 | Number | Date | Country
---|---|---|---
Parent | 14568456 | Dec 2014 | US
Child | 15136093 | | US