DEVICE, SYSTEM AND METHOD FOR ASSESSING RISK OF VARIANT-SPECIFIC GENE DYSFUNCTION

FIELD OF THE INVENTION

Embodiments of the present invention relate generally to the field of genetics. In particular, embodiments of the present invention relate to predicting the risk that one or more specific allele variants will cause gene dysfunction or deleterious mutations associated with disease or reduced likelihood of surviving or reproducing in an organism.

BACKGROUND OF THE INVENTION

Every year thousands of babies are born with genetic diseases. Often, the parents of these children are both healthy, but each parent possesses genetic mutations that when passed in combination to the child, endow it from the time of conception with an unmitigated genetic defect. Children with such diseases may suffer, have diminished lifespans and can entail large emotional and financial costs, so many prospective parents attempt to minimize the chance that they pass on genetic elements that cause disease.

Carrier testing, in which both parents are genotyped at loci of their genomes that are known to cause disease, is a technique widely used to achieve this goal. Carrier testing is unique among medical diagnostics in that recessive disease is only predicted to occur in persons other than those actually being tested. Variants in a known disease gene are classified as “pathogenic” when observed in correlation with patients diagnosed with the corresponding disease. A panel of pathogenic variants from the same gene provides the basis for developing a specific test. Persons who carry any targeted clinically validated variant are scored “positive,” and two prospective parents who test positive for the same autosomal gene are assigned a 25% risk of conceiving a diseased child. Conventional carrier testing suffers from several limitations.

Firstly, carrier testing uses a binary classification system, defining an allele variant as having only either a “positive” (pathogenic) or a “negative” (benign) effect of causing a disease. This binary classification fails to identify any continuum or intermediate effects (e.g., a degree of disease or partial functionality of a phenotype) or to illuminate allele-specific or genotype-specific differences in predicted phenotypes from the same gene. In some cases, variants with partial functionality will express allele-specific or genotype-specific effects (e.g., associated with disease in some allelic combinations but not others). The binary classification system cannot differentiate between different phenotypes caused by different allele or genotype combinations of the same gene.

Binary classification is typically useful for patients with a known disease or phenotype to search for the variant that causes their disease or phenotype. A successful search distinguishes the patient's “pathogenic” variants from “benign” variants. If the patient's condition cannot be ascribed to previously characterized variants, a number of computational tools have been developed for filtering and ranking potential culprits. The performance of these discovery tools is typically measured by an area under a receiver operator characteristic (ROC) curve for benchmark sets of pathogenic and benign variants.

In recently published guidelines for scoring the pathogenicity of DNA sequence variants, the American College of Medical Genetics and Genomics encouraged clinical researchers to “arrive at a single conclusion” that is “determined by the entire body of evidence.” However, the assumption of all-or-none pathogenicity is inappropriate for variants in recessive disease genes. “Pathogenic” implies that a variant has an absolute or determinative causal relationship to a disease or phenotype, and yet, in molecular terms, a single recessive disease allele cannot independently cause a disease, but participates passively in a reduction or loss of function that is tolerated in the heterozygous presence of a fully functional gene copy. Recessive disease will only ensue in a homozygote or compound heterozygote where the molecular sum of functional products from both gene copies fails to rise above the threshold required for health.

A second limitation of conventional carrier testing is that it is very difficult to identify the disease-risk of variants of recessive diseases or traits because many of the patients carrying those variants are heterozygous and do not express the recessive disease or trait. Newly arising mutations in recessive disease genes will usually be transmitted silently from one generation of heterozygotes to the next, without appearing in diseased patients. In a recent analysis of the Cystic Fibrosis Transmembrane Conductance Regulator (CFTR) gene in 60,000 exomes from individuals not affected with cystic fibrosis, the number of likely disease-causing variants that had not been clinically validated was twice the number that were validated. The expanded use of Next Generation Sequencing (NGS) for genetic screening of all recessive disease genes will result in the detection of many more “untested” variants that are not available for informed reproductive decision making under the current testing regimen.

A third limitation of conventional carrier testing is that it typically only tests for variants validated to cause disease in clinical studies. Carrier testing typically relies on the curation of clinical reports as its primary source for variant inclusion. Such tests rely on a defined set of alleles known to cause diseases, and then screen for the presence of these alleles in one or both parents prior to conception. The alleles screened in such tests have been established to cause disease by examining pedigrees of patients with the disease, by using cellular or animal models of the effect of the particular allele, or alternate means. The incompleteness of these tests is evidenced by the fact that the number of alleles associated with disease in public databases such as ClinVar (http://www.ncbi.nlm.nih.gov/clinvar/) and OMIM (http://www.ncbi.nlm.nih.gov/omim) continues to grow every year, and in turn so do the number of loci tested by carrier screening. Similarly, many patients can present with pathologies which appear to have a genetic basis, but for which no specific underlying genetic mutation has yet been determined. In many of these cases, a novel pathogenic variant or variants is then later discovered by various means and added to the catalog of known disease associated mutations. For example, the genomes of many patients with similar pathologies can be sequenced and shared mutations found. Alternatively, mutations that occur in an individual patient's genome which appear damaging (missense, nonsense, etc.) and are present in genes known to be associated with a biological process related to the pathology, may be tested in a cellular or animal model.

While the steady increase of the catalog of variants known to cause disease implies that carrier testing will get better, it also evinces that it suffers from the limitation that it only screens for clinically validated mutations, and cannot assess the impact of novel or de novo mutations. If a variant is specific to an individual or family and has not been previously studied, carrier testing cannot determine what effect it may have on future offspring.

A fourth limitation of conventional carrier testing is that a diseased child must be born and diagnosed in order to find a new disease associated allele. In all cases, the correlation between alleles and genetic diseases are determined by studying one or more individuals that have already been born with the disease. In the case of recessive disease, the problem is compounded because novel variants usually initially only appear as one half of a heterozygote genotype which does not express disease, and will spread silently through populations before it is combined with itself or another recessive mutation as homozygotes to express the disease in patients. Thus, it is very difficult to resolve the effect of the mutation until children suffering from the disease are born, and from the perspective of a parent who wants to avoid passing on disease causing alleles, it is too late.

SUMMARY OF THE INVENTION

A system, device and method are described to overcome the aforementioned longstanding issues inherent in the art.

In an embodiment of the invention, a device, system and method is provided for predicting gene-dysfunction caused by a defined genetic mutation in the genome of an organism. A neural network may be stored, for example in one or more memory units. The neural network may comprise multiple nodes respectively associated with multiple different gene-dysfunction metrics and multiple different confidence weights. One or more processors may process the neural network to combine the multiple gene-dysfunction metrics according to the respective associated confidence weights to generate one or more likelihoods that a genetic mutation causes gene-dysfunction in organisms. The one or more processors may process the neural network in a training-phase and a run-time phase. In a training-phase, the neural network may be trained using an input data set including one or more genetic mutations to generate new gene-dysfunction metrics and new associated confidence weights that optimize the neural network based on a cost factor. In a run-time phase, a genetic mutation may be identified and one or more likelihoods may be computed that the identified genetic mutation causes gene-dysfunction in the organism based on the new gene-dysfunction metrics and the associated new confidence weights of the neural network. The multiple different gene-dysfunction metrics may include combinations of one or more of a population selection component, an evolutionary selection component, a pathogenic predictor component, a mutation class component and/or a clinical classification component.

In an embodiment of the invention, a device, system and method is provided for predicting gene-dysfunction associated with a genetic mutation in an organism based on population-specific selection factors. Multiple population-specific sets of genetic sequences may be received each including multiple genetic sequences obtained from genetic samples of organisms from a different respective one of multiple populations. Each of multiple population-specific measures of homozygosity of the genetic mutation may be generated for each of the respective multiple populations by comparing the count of observed homozygotes of the genetic mutation measured on both chromosomes at a genetic locus in the population-specific set and an expected homozygote count based on a total observed count of the genetic mutation measured on either chromosome at the genetic locus in the population-specific set. One or more likelihoods may be computed that the genetic mutation causes gene-dysfunction in the organism based on one or more of the multiple population-specific measures of homozygosity.

In an embodiment of the invention, a device, system and method is provided for predicting gene-dysfunction associated with a genetic mutation in an organism based on the evolution of genetic variation of multiple organisms within one species (“single-species” or “intra-species” model) or across multiple different species (“multi-species” or “inter-species” model). Past evolutionary trends in allele mutations of extant or surviving (currently or once-living) organisms representative of one or more species or populations may be analyzed to predict the future fitness of a living organism or a potential hypothetical or virtual progeny simulated for two living potential parents. In some embodiments of the invention, a system, device and method may receive multiple aligned genetic sequences obtained from genetic samples of multiple organisms of one or more different species. Genetic loci may be aligned from different sequences for different organisms that are derived from one or more common ancestral genetic loci correlated with the same trait(s), disease(s), codon(s), that are positioned or sandwiched between other correlated marker loci, or that are otherwise related. A measure of evolutionary variation may be computed for one or more alleles at each of one or more aligned genetic loci of the multiple aligned sequences. The measure of evolutionary variation may be a function of variation in alleles at corresponding aligned genetic loci in the multiple aligned genetic sequences. One or more likelihoods may be computed that an allele, either a new mutation or one present in the alignment, at each of the one or more genetic loci in an organism will be deleterious based on the measure of evolutionary variation of alleles at the corresponding aligned genetic loci for the multiple organisms.

BRIEF DESCRIPTION OF THE DRAWINGS

The subject matter regarded as the invention is particularly pointed out and distinctly claimed in the concluding portion of the specification. The invention, however, both as to organization and method of operation, together with objects, features, and advantages thereof, may best be understood by reference to the following detailed description when read with the accompanying drawings in which:

FIG. 1 schematically illustrates a system for assessing risk or likelihood of variant-specific gene dysfunction according to an embodiment of the invention;

FIG. 2 schematically illustrates a display visualizing a DNA image and/or sequence of an organism comprising one or more variants or mutations labeled in the DNA sequence together with a likelihood of variant-specific gene dysfunction for the organism according to an embodiment of the invention;

FIGS. 3A-3D schematically illustrate an example of (A) a neural network combining multiple gene dysfunction components for determining the risk of variant-specific gene dysfunction in an organism; (B) an exploded view of the population selection component; (C) an exploded view of the clinical classification component, mutation class component and pathogenic predictors component; and (D) optimized parameters for the components in FIGS. 3A-3C, according to an embodiment of the invention;

FIGS. 4A and 4B show violin plots of an example of variant-specific gene dysfunction (A) without and (B) with a clinical classification component according to an embodiment of the invention;

FIGS. 5A and 5B show violin plots of an example of variant-specific gene dysfunction (A) without and (B) with a mutation type component according to an embodiment of the invention;

FIGS. 6A-6D show violin plots of an example comparison of variant-specific gene dysfunction and (A) PolyPhen-2, (B) adjusted CADD, (C) adjusted PROVEAN, and (D) VEST values, stratified by clinical classification category according to an embodiment of the invention;

FIGS. 7A-7B are graphs of an example of (A) one-sided bootstrap confidence intervals per gene, (B) a distribution of the density of confidence interval thresholds versus variant-specific gene dysfunction (VGD^−Cl), according to some embodiments of the invention;

FIG. 8 is a graph of an example of Phred scaled CADD values versus adjusted CADD values (R_j^C) and raw weights (w_j^0C) according to an embodiment of the invention;

FIG. 9 is a graph of an example of raw PROVEAN values versus adjusted PROVEAN values R_j^PRand raw weights (w_j^0PR) according to an embodiment of the invention;

FIG. 10 is a graph of an example relationship between in-frame indel mutation class values (S_j^M) vs. the number of inserted codons according to an embodiment of the invention;

FIGS. 11A-11B are plots of maximum population frequency for all variants in (A) Pore Forming Protein 1 (PRF1) and (B) Phosphorylase, Glycogen, Muscle (PYGM), according to some embodiments of the invention;

FIG. 12 shows a violin plot of an example of variant-specific gene dysfunction for each variant stratified by VariBench category according to an embodiment of the invention;

FIGS. 13A-13D show violin plots of an example comparison of variant-specific gene dysfunction and (A) PolyPhen-2, (B) adjusted CADD, (C) adjusted PROVEAN, and (D) VEST values, stratified by VariBench category according to an embodiment of the invention;

FIGS. 14A-14D show violin plots of an example comparison of variant-specific gene dysfunction and (A) PolyPhen-2, (B) adjusted CADD, (C) adjusted PROVEAN, and (D) VEST values, stratified by mutation type for all variants according to an embodiment of the invention;

FIG. 15 schematically illustrates an example of an alignment of multiple genetic sequences according to an embodiment of the invention;

FIG. 16 schematically illustrates an example of simulating a hypothetical mating of two potential parents for generating a virtual progeny according to an embodiment of the invention;

FIG. 17 schematically illustrates an example of a phylogenetic tree inferred from the multiple sequence alignment shown in FIG. 16 according to an embodiment of the invention;

FIGS. 18A and 18B are graphs of the likelihood of variant-specific gene dysfunction for 400 variants of an example disease gene, transglutaminase (TGM) 1, compared with clinically validated results in (A) 2012 and (B) 2016 according to an embodiment of the invention;

FIG. 19A is a flowchart of a method for assessing risk of variant-specific gene dysfunction according to an embodiment of the invention;

FIG. 19B is a flowchart of a method of predicting gene-dysfunction associated with a genetic mutation in an organism based on population-specific selection factors or nodes according to an embodiment of the invention; and

FIG. 19C is a flowchart of a method of predicting gene-dysfunction associated with a genetic mutation in an organism based on evolutionary selection factors or nodes according to an embodiment of the invention.

It will be appreciated that for simplicity and clarity of illustration, elements shown in the figures have not necessarily been drawn to scale. For example, the dimensions of some of the elements may be exaggerated relative to other elements for clarity. Further, where considered appropriate, reference numerals may be repeated among the figures to indicate corresponding or analogous elements.

DETAILED DESCRIPTION OR EMBODIMENTS OF THE INVENTION

Embodiments of the invention provide a system, device and method for analyzing a DNA sequence to determine risk or probability of gene dysfunction associated with specific variants or allele combinations in the DNA sequence, for example, associated with disease or reduced likelihood of surviving or reproducing in an organism. The DNA sequence may be sequenced from a biological sample of a living organism (a “real” or “extant” organism) or may be simulated (e.g., simulating a mating) by combining at least a portion of genetic information representing genetic material obtained from biological DNA samples of two living potential parents (e.g. as shown in FIG. 17) (a “virtual” or “simulated” progeny). All genetic information that is genetically screened is derived or transformed from biological DNA samples of living human organisms.

Embodiments of the invention replace the unrealistic conventional binary classification system of disease-risk with a continuous variant-weighted component-based dysfunction scoring system for each single gene copy. Embodiments of the invention may compute one or more likelihoods of variant-specific gene dysfunction by integrating multiple gene dysfunction categories. These multiple gene dysfunction categories may be weighted according to variant-defined levels of confidence and summed to generate a variant-specific gene dysfunction (VGD). The multiple gene dysfunction components may be combined using a neural network (e.g. FIGS. 3A-3D). Each node of the neural network may represent a different gene dysfunction component, for example, associated with one or more score(s) and weight(s). The neural network may be trained by machine-learning based on a training set of known variant dysfunction measures to optimize or tune network parameters to improve the network score(s) and weight(s) (e.g. see “Parameter” key in FIG. 3D). The neural network is trained, for example, using a cost or error factor (e.g. see “Optimization cost function” in FIG. 3A), for example, to optimize predictive agreement with one or more benchmark datasets of pathogenic and benign variants on a targeted set of multiple recessive disease genes. Clinical classifications of variants known to be “pathogenic” or “benign” may be used as a truth class to validate the model for each gene and set parameters (e.g., FIG. 3D) and sensitivity thresholds (e.g., FIGS. 7A-7B).

The multiple gene dysfunction categories integrated into the variant-specific gene dysfunction score may include, for example, clinical classification, mutation class, pathogenic predictors, evolutionary constraint, and population selection.

The population selection component measures a “homozygous effect,” a “heterozygous effect” and/or a “dominant effect” in each of a plurality of human populations (r=1, . . . R). In one example, populations include non-Finnish Europe, Finland, South Asia, East Asia, Africa, and the Americas, although other populations or groups may be used.

The homozygous effect may measure each population's natural selection against homozygote forms of a recessive genetic variant. The homozygous effect may compare the observed incidence or count of a homozygote of a variant genotype Q_j^r,obs(e.g. observed on both chromosomes) to a predicted or expected incidence or count of the homozygous variant genotype Q_j^r,exp(e.g. based on the observed allele frequency on either chromosome (f_j^r) and the population size (N_j^r), such as, Q_j^r,exp=(f_j^r)²N_j^r), in each population. The homozygous effect is based on a “null hypothesis” that if a variant's effect is neutral (e.g. having substantially no negative or positive consequence on survival or reproductive ability), the observed incidence of the homozygous variant genotype would be approximately equal to the expected incidence of the homozygous variant genotype

$(e . g . \frac{Q^{obs}}{Q^{\exp}} \approx 1) .$

If however, the variant genotype causes dysfunction, disease, or reduced likelihood of surviving or reproducing, there would be a selective force against that variant expressing in the population, thereby suppressing the observed incidence of the homozygous variant genotype relative to the heterozygote or total variant genotype

$(e . g . \frac{Q^{obs}}{Q^{\exp}} < 1) .$

A relatively low observed incidence of the homozygous variant genotype compared to the expected incidence of the homozygous variant genotype (e.g. Q^obs<<Q^exp) results in a relatively high variant-specific gene dysfunction score, whereas a relatively neutral or high observed incidence of the homozygous variant compared to the expected incidence of the homozygous variant (e.g. Q^obs≧Q^exp) results in a relatively low variant-specific gene dysfunction score. At the limits of the homozygous effect, if there is no observed incidence of a homozygous variant genotype in a population (r) (e.g. Q^r,obs=0), the homozygous effect score reaches its maximum (e.g., T^hom,r≈1) whereas if there are the same or more observed incidences of a homozygous variant genotype than expected in a population (r) (e.g. Q^r,obs≧Q^r,exp) the homozygous effect score reaches its minimum (e.g., T^hom,r=0). The homozygous effect is a powerful and accurate measure of gene dysfunction caused by each variant (j) in a population (r), particularly in situations with a sufficiently large sampled population (e.g., N_j^r>>1,000, such as >50,000 people) or a sufficiently high frequency of mutation (e.g., f_j^r>1%). However, in cases in which the population does not have many sequenced individuals (e.g. N_j^r<1,000 people) or the mutation is at a sufficiently low allele frequency (e.g., f_j^r<0.1%), random fluctuations in mutations may skew the ratio of observed to expected homozygotes, and thus the homozygous effect becomes less powerful. To reflect the varied power in the score, the weight or confidence of the homozygous effect is proportional to the maximal number of observed or expected incidences of the homozygous variant genotype, and thus is also diminished in cases with low variant frequency and/or population size. In such cases when the power of the homozygous effect is diminished, the heterozygous effect may compensate.

The heterozygous effect may measure the impact of heterozygote forms of a recessive genetic variant. The heterozygous effect may measure the relationship between the count or frequency of a variant in each population (f_j^r) with the variant's clinical visibility. A variant (j)'s clinical visibility CV_jmay be based on, for example a number of published articles that reference the variant (pm), a number of compound heterozygotes or combinations of the variant with other variants that is described in the clinical literature (ch), and/or a number of search results for the variant or an order or ranking of a search result for the variant in a database or web search. A rare variant or mutation is generally expected to only rarely (or never) appear in clinical studies. However, extensive documentation in clinical studies of a rare variant is shown to correlate with a higher likelihood that the variant causes gene dysfunction resulting in disease. The heterozygous effect increases when a variant has an unexpectedly or relatively high clinical visibility compared to its frequency in a population. The heterozygous effect is thereby of particular importance, for example, in cases where a variant has a high allele frequency (e.g. greater than 0.5%) or disproportionately high clinical visibility. In cases where a variant has a disproportionately high clinical visibility, the heterozygous effect may identify damaging or disease-causing variants even when they appear frequently in a population (e.g., with a sufficiently high frequency that would have otherwise indicated lack of damage). Although most disease variants are rare, for example, the variant primarily responsible for sickle-cell anemia (SCA) affects up to 10% of people in sub-Saharan countries with endemic malaria, which is a relatively high allele frequency for a damaging variant. However, the SCA variant has an extremely high clinical visibility, and thus both the score and weight for its heterozygous effect are relatively high. Allele frequencies from different populations may be treated unequally given that clinical studies are disproportionately prevalent in certain regions of the world. For example, for a variant with a null or 0 clinical visibility (e.g., not mentioned in the clinical literature), a 0.5% allele frequency in Europe would indicate a relatively low or minimal likelihood of that variant causing dysfunction, but a 0.5% allele frequency in Africa may still be aligned with the variant causing dysfunction, under the assumption that more clinical studies have been conducted in Europe than in Africa.

The dominant effect may measure each population's natural selection against dominant genetic variants. The dominant effect may measure the relationship between a variant's observed allele count (e.g., the number of people with either one or two copies of the variant) compared to an expected allele count for any pathogenic variant in the same gene, for example, based on a distribution of allele counts across a plurality of (or all) known pathogenic mutations in the gene. In a gene with a fully dominant disease, pathogenic variants are generally not expected to be observed in more than a small fraction of healthy individuals. If, for example, none or few of a gene's pathogenic variants have been observed in any individuals, and the number of pathogenic variants in the gene is sufficiently large (e.g., >30), then a variant that has been observed significantly more than would be expected of a pathogenic variant in that gene (e.g., allele count>2) would be given a relatively low dominant effect score. The weight or confidence of the dominant effect may be based on the variant's allele count, the number of pathogenic variants in the gene, and/or the distribution of allele counts across a plurality of pathogenic mutations in the gene.

Whereas the population selection component represents a variant's success or failure on a population-level, e.g., separately for each of a plurality of specific populations within the single human species (e.g., non-Finnish Europe, Finland, South Asia, East Asia, Africa, and the Americas), the evolutionary selection component represents a variant's success or failure on a species-level (in the “single-species” model) or across all species (in the “multi-species” model).

The evolutionary selection component predicts the likelihood that variants cause disease or dysfunction, for example, based on their frequency or rarity of occurrence, across multiple reference genetic sequences (e.g. see FIG. 15) sampled within one species (“single-species” model) or across multiple different species (“multi-species” model, see e.g. FIG. 16). The evolutionary selection component links a variant's success or failure to propagate throughout evolution by natural selection to the likelihood that the variant causes dysfunction or disease (rare variants) or that the variant is innocuous or positive (frequent variants). Accordingly, the evolutionary selection component for dysfunction may be inversely related to the frequency of the variant through evolution. Allele mutations or variations that have a relatively low frequency across the reference genetic sequences (e.g. negatively selected for evolutionarily) increase the variant-specific gene dysfunction score, whereas allele mutations or variations that are relatively more frequent across the reference genetic sequences (e.g. positively or neutrally selected for evolutionarily) decrease the variant-specific gene dysfunction score. At the limits of the evolutionary selection component, if there is no observed incidence of a variant in the specie(s) (e.g., f_j=0), the evolutionary selection component reaches its maximum (e.g., S_j^E=1), whereas if the variant has a sufficiently high frequency (e.g. f_j>>1), the evolutionary selection component approaches its minimum (e.g., S_j^E→0).

The evolutionary selection and population selection component may complement each other, providing improved prediction when used together than when used separately. In one instance, the evolutionary selection component may predict disease in variants that are suppressed throughout evolution, for example, the variant that eliminates the development of wisdom teeth. However, whereas wisdom teeth are typically essential to survival of many animal species, humans are an exception protected by the intelligence and altruism of our species and the adaptation of diet. This leads to an anomaly, whereas the evolutionary selection component alone would have caused a mischaracterization of the wisdom teeth variant as damaging, it combination with the population selection component neutralizes its effect.

The mutation class component may measure the type or class of mutation of variant (j). Table 2 shows an example of the mutation class component for various mutation types. Example mutation classes include start-loss, stop-gain, stop-retained, frame-shift indel, essential splice site (associated with loss-of-function), splice region, untranslated region, microsatellite (e.g., a microsatellite or sequence of a repeating base type, such as, AAAAA . . . , of length STR_j(short tandem repeats), synonymous (e.g., not affecting gene expression), intron, in-frame indel (e.g., an insertion or deletion of a multiple of three bases, or an integer number of amino acids AA_j, so as not to shift the reading frame), missense (e.g., a non-synonymous variant that changes an amino acid), and stop-loss (e.g., loss of the normal stop codon by mutation to encode an amino acid). The loss-of-function (LoF) mutations, start-loss, stop-gain, frame-shift indel, essential splice site, typically cause complete dysfunction and may be assigned a relatively high or maximum dysfunction score (e.g., S_j^M=1) with relatively high confidence (e.g., w_j^M>50). A missense mutation alters an amino acid and may be assigned an intermediate dysfunction score (e.g., S_j^M=0.5), but because the amino acid may or may not damage a protein depending on which amino acid is damaged and other more complicated factors involved in the protein structure, it is associated with a relatively small weight (e.g., w_j^M=0.01). An in-frame indel mutation inserts or deletes an integer number (AA_j) of amino acids. Because the likelihood of dysfunction typically increases the greater the number of damaged amino acids, in-frame indel mutation dysfunction scores and weights increase as the number (AA_j) of damaged amino acids increases (e.g., as shown in FIG. 10). A micro-satellite mutation typically alters, adds, or deletes, one or more variants of a micro-satellite sequence. Because the likelihood of dysfunction typically decreases the longer the length of the micro-satellite (STR_j) the dysfunction score of a micro-satellite mutation is inversely proportional to the number of bases or length of the micro-satellite sequence (STR_j) of damaged amino acids (e.g., S_j^M˜STR_j⁻¹) and the weight of the micro-satellite mutation is directly proportional to the length of the micro-satellite (e.g., w_j^M˜STR_j).

The clinical classification component may measure dysfunction for variants that were clinically validated, for example, as “pathogenic” (e.g., S_j^C=1) or “benign” (e.g., S_j^C=0). The weights of the score may be based on validated confidence levels (e.g., uncontested or certain classifications have a relatively high or maximal weight such as w_j^C=20, probable classifications have a relatively moderate weight such as w_j^C=10, and contested classifications have a relatively low or minimal weight of w_j^C=1). The clinical classification component may be null when there is no clinical classification for a variant, in which case, the remaining (e.g., non-zero) gene-dysfunction components compensate for the null clinical classification component to predict dysfunction for the variant in the absence of clinically validated data.

The pathogenic predictor component may measure a likelihood that a variant is pathogenic, inputting one or more of the following metrics: PROVEAN predicts whether a protein sequence variation affects protein function, Combined Annotation Dependent Depletion (CADD) predicts the effects of a single nucleotide variants as well as insertion/deletions variants, Variant Effect Scoring Tool (VEST) predicts the effects of missense mutations, and PolyPhen-2 (Polymorphism Phenotyping v2) predicts the effects of an amino acid substitution on the structure and function of a human protein. These metrics may be composed into the pathogenic predictor component as follows. The PROVEAN metric may be transformed to a linear scale subscore (e.g., the greater the PROVEAN metric, the more likely the variant damages a gene) and may be assigned a weight proportional to the metric (e.g., the greater the PROVEAN metric, the more certain its impact on gene function). The CADD metric may be transformed from a Phred scale to a linear scale subscore (e.g., the greater the CADD metric, the more likely the variant damages a gene) and may be assigned a weight that increases when the CADD metric is above a threshold parameter (sc) (e.g., above a threshold, the greater the CADD metric, the more certain its impact on gene function). PolyPhen-2 includes two metrics (HumDiv and HumVar) that may be averaged (e.g., the greater the combined metrics, the more likely the variant damages a gene) and assigned a weight inversely proportional to the difference between the metric (e.g., the more the two metrics disagree, the less reliable are the metrics). The VEST metric may be assigned as a subscore (e.g., the greater the VEST metric, the more likely the variant damages a gene) and may be assigned a constant weight. In general, all of these metrics have approximately 80% accuracy in assigning predicted non-pathogenic variants relatively low scores and predicted pathogenic variants relatively high scores. However, these four metrics are generally inconsistent in the scoring of a subset of partially functional pathogenic variants due to the overly simplistic pathogenic categorization in ClinVar. It is only by combining these components with other gene dysfunction components in the neural network that embodiments of the invention produce cumulative gene dysfunction scores with sufficient accuracy of for example greater than 90-95% as described below.

Improvements

Embodiments of the invention may provide the following improvements and overcome the aforementioned longstanding issues inherent in the art:

Improved results: Table 1 and FIGS. 18A-B (discussed below) both show examples of novel or de novo variants, missed by standard models, but identified by the VGD model as pathogenic or benign. The discovery of these novel variants is due to the introduction of several new metrics of gene-dysfunction—the population selection component, the homozygous effect, the heterozygous effect, the dominant effect and the evolutionary constraint component—as well as an optimization of their combination using a neural network.

Neural Network: The neural network composes multiple gene-dysfunction components representing different and complementary aspects of gene dysfunction in an optimized manner to produce a cumulative prediction that is greater than the sum of its parts. Whereas separately analyzing all of these gene dysfunction components one-at-a-time predicts for example up to at most only 80% of pathogenic variants, analyzing multiple gene dysfunction components optimized together in the neural network improves gene-dysfunction accuracy predicting for example greater than 90-95% of pathogenic variants (see e.g., discussion of FIGS. 18A-B below). For example, in one instance, the homozygous effect provides relatively high accuracy for predicting gene dysfunction of variants in cases with relatively high variant frequency or relatively large population size, but provides relatively low accuracy in cases with relatively low variant frequency or population size. In these latter cases, the homozygous effect weight is diminished and other components such as the heterozygous effect dominate to compensate for an inaccurate homozygous effect. For example, in cases where a population sample and/or variant frequency is small, a disease-gene may still have wide clinical visibility, bolstering its heterozygous effect, or may be identified as damaging by its clinical classification, pathogenic predictors or mutation class. In another case of dominant traits or disease, the homozygous effect (adapted to identify recessive traits) and the heterozygous effect (only significant if the trait has substantial clinical visibility) may be insufficient. For dominant traits, the dominant effect compensates for these other effects, accounting for dominant disease variants by diminishing scores of known gene mutation that have higher than expected incidence compared to an expected incidence for pathogenic mutations of the same dominant gene. In another example, relatively high frequency variants typically have a relatively low evolutionary constraint component. Such relatively high frequency variants would therefore be ignored if they were classified on the basis of the evolutionary constraint component alone. Because the variant-specific gene dysfunction integrates multiple gene dysfunction components, for diseases with relatively high frequencies in certain populations, such as sickle-cell anemia (SCA), this suppressed evolutionary component may be augmented by a relatively high population selection, pathogenic predictors, clinical classification and/or mutation class components. In one example, variants within a gene associated with deafness may have a very high-damage score for the evolutionary constraint component indicating death or low-likelihood of survival because the ability to hear is critical to survival for most species. However, human populations have adapted to deafness, so the ability to hear is no longer critical for survival in most human populations. Accordingly, the population selection component corrects for an otherwise inflated damage score for deafness due to the evolutionary constraint component. Thus, each of the variant-specific gene dysfunction components are adapted to identify gene dysfunction with more or less accuracy in different sets of circumstances, thereby filling in “blind spots” where other components may not fully capture the gene effect, so that the combination of the multiple variant-specific gene dysfunction components together more accurately predicts gene dysfunction than analyzing the individual components separately.

Continuous Classification: In contrast to conventional binary classification systems, embodiments of the invention provide a continuous classification system of scores distributed on a continuous (e.g., linear) scale. Because the causation between variants and gene-dysfunction is typically uncertain, in particular, when analyzing novel variants not yet clinically validated, or only validated to a contested or uncertain confidence, the continuous likelihood or variant-specific gene dysfunction described according to embodiments of the invention may more accurately represent pathology as compared to a conventional binary classification. Further, the continuous likelihood described according to embodiments of the invention may account for the varying degree of gene dysfunction, for example, to identify any continuum or intermediate effects (e.g., a degree of disease or partial functionality of a phenotype) or to differentiate between different degrees of gene-dysfunction caused by different allele or genotype combinations of the same gene.

Population Selection: In contrast to the conventional carrier testing which has no way to test for population-specific anomalies or differences in selective factors, such as, deafness or the elimination of wisdom teeth, embodiments of the invention provide a population selection component that measures the relative propensity or aversion for specific variants on a population-by-population basis.

Homozygous Recessive Predictive Screening: A recessive gene must typically be a homozygote (having two copies) to express a recessive disease or trait. Because many newly arising mutations in recessive disease genes have not yet expressed as homozygotes, or worse yet, have killed off all patients with those homozygotes, conventional carrier testing has no way to test for new recessive disease gene mutations. In contrast, embodiments of the invention provide a homozygous effect component that measures the observed incidence or frequency of homozygote variants compared to an expected homozygote incidence. The expected homozygote incidence may be generated by extrapolating from the heterozygote or total variant incidence to predict what the expected homozygote incidence would be if there was no selective factor against the homozygote form of the variant. A relatively low observed homozygote incidence compared to the expected homozygote incidence indicates a likely selective factor against the homozygote forms of the genotype, increasing the likelihood that the variant causes a recessive disease or dysfunctional trait.

Heterozygous Effect: In contrast to the conventional classification systems which have no way of testing the effect of heterozygous variants for recessive traits, embodiments of the invention propose a heterozygous effect component that is a relative measure between variant frequency and clinical visibility. The heterozygous effect component identifies variants that have a disproportionately high clinical visibility compared to their frequency in a population. This unexpectedly high clinical visibility may indicate that these variants are likely candidates for recessive disease or dysfunction. Conversely, variants that have a disproportionately low clinical visibility compared to their frequency in a population may indicate that these variants are unlikely to contribute to recessive disease or dysfunction.

Dominant Predictive Screening: Variants linked to dominant disease or gene-dysfunction are typically difficult to detect because organisms with those variants seldom survive, or only proliferate for a few generations. The VGD model however can predict dominant disease gene mutations based on the allele count of the variant compared to a distribution of allele counts across a plurality of pathogenic mutations in the same gene. If the allele count of that particular variant is relatively higher than expected based on the distribution of the allele counts of the pathogenic variants for that gene, the likelihood that the variant causes gene dysfunction as a dominant allele may be relatively decreased.

Evolutionary Selection: The evolutionary selection component may use evolution as a “four billion year experiment.” During the course of evolution, nearly all variants or mutations have likely been tested and their success or failure in propagating through one or more species by natural selection indicates which variants cause dysfunction or disease (rare variants) and which variants are innocuous or positive (frequent variants). The evolutionary selection component may thus relate the measure of evolutionary variation of alleles at a genetic locus to the likelihood that a variant at that locus will cause disease or dysfunction.

Pre-clinical screening: In contrast to the conventional carrier testing which only tests for clinically validated disease-causing variants, embodiments of the invention may use the population selection and evolutionary selection components to predict the effect or disease-risk of allele variations without clinicians having ever observed those allele variations in diseased patients.

Reference is made to FIGS. 18A and 18B, which are graphs of the VGD scores for 400 variants of an example disease gene, transglutaminase (TGM) 1, compared with clinically validated results in (A) 2012 and (B) 2016 according to an embodiment of the invention. Each dot in each graph represents one of the about 400 variants that were analyzed. The VGD score for each variant is plotted along the x-axis of both graphs. The vertical dashed line in the graphs (e.g., at VGD=0.953) delineates a pathogenic threshold, above which variants are predicted to be pathogenic or disease-causing. The key at the bottom of the graphs represents clinical data indicating which variants were clinically validated (observed to cause an effect in a real patient) to be “Benign”, “Contested Pathogenic”, “Probably Pathogenic”, or “Pathogenic” (observed to cause disease in a real diseased patient), and which variants were not yet clinically validated “No ClinVar Pathogenicity Classification” (not yet observed to cause an effect in a real patient). The two graphs show different years of clinical validation data—FIG. 18A shows clinical validation data for 2012, labeling the only 11 variants that were clinically validated as pathogenic for TGM1 in 2012, and FIG. 18B shows clinical validation data for 2016, labeling the increased number of 26 variants that were clinically validated as pathogenic for TGM1 by 2016.

The 2012 graph in FIG. 18A shows that the VGD model correctly identified 10 out of 11 pathogenic variants known for the TGM1 gene in 2012 (90.9% accuracy). The 2016 graph in FIG. 18B shows that the VGD model correctly identified 24 out of 26 pathogenic variants known for the TGM1 gene in 2016 (92.3% accuracy). Comparing the two graphs in FIGS. 18A and 18B, the VGD model predicted 14 out of the 15 variants (93.3%) of the TGM1 gene that were newly validated as Pathogenic between 2012 and 2016 (variants whose classification switched from a “No Clinical Validation” classification in the 2012 graph of FIG. 18A to a “Pathogenic” classification in the 2016 graph of FIG. 18B). Thus, 14 novel disease-causing variants were predicted based on the VGD scores (at the time of prediction, there was no clinical curation because these variants were not yet known).

The VGD scores use the population selection, mutation class, pathogenic predictors, and evolutionary constraint components to predict the effect or disease-risk of allele variations without clinicians having ever observed those allele variations in diseased patients. This allows geneticists to assess the disease-risk of novel or de novo mutations or variants that have never before been validated or studied in diseased patients. The graphs in FIGS. 18A and 18B show that the VGD score accurately predicted pre-clinical disease-risk with a confidence of greater than 90% and predicted 93.3% of newly validated variants as disease-causing in the example of the TGM1 gene. Carrier screening is thereby no longer restricted to testing for disease caused by the limited number of already-known clinically validated disease variants. These graphs show that, in 2012, carrier screening only screened for at most 11 pathogenic variants of TGM1, while the VGD model screened for 24 pathogenic variants of TGM1. This indicates that the VGD score accurately predicted 13 new pathogenic or disease-causing variants for the TGM1 gene (only later validated in 2016) before geneticists would have ever tested for them in carrier screening because those 13 variants had not yet been validated as disease-causing at the time of testing (non-validated in 2012). This improves genomics and carrier screening by discovering disease-correlated variants before the variants are clinically validated, thereby predicting disease risk in patients that would have otherwise been ignored, improving the accuracy of carrier screening and potentially mitigating disease and saving lives. (Table 1 also provides a list of example de novo variants, missed by standard models, but identified by the VGD model as likely to be disease-contributing or neutral.)

Pre-conception screening: Conventional carriers screening only tests for variants that have already been identified and validated in a living diseased child. Some embodiments of the invention compute the VGD score or likelihood of disease-risk of allele variations in “virtual progeny” (non-existing, pre-conception progeny) based on measures of population selection or evolutionary variation, instead of only based on clinically validating the genomes of real diseased children (existing, post-conception progeny). Embodiments of the invention can thereby predict the likelihood that an allele variation will cause disease without requiring that any child ever be conceived with that disease or disease-causing variant. As shown in FIGS. 18A and 18B, the VGD score can be used to identify novel, never before validated, variants as disease-causing with extremely high accuracy (93.3% of newly validated variants in the example of the TGM1 gene), without ever having to clinically observe such a variant in a real living organism. This improves genomics by discovering disease-causing variants at an earlier stage, before a child is ever conceived with the disease or disease-causing variant, thereby potentially reducing disease in children.

Other or different advantages may be realized according to embodiments of the invention.

System Overview

Reference is made to FIG. 1, which schematically illustrates a system 100 for assessing risk or likelihood of variant-specific gene dysfunction according to an embodiment of the invention.

System 100 may include a genetic sequencer 102, a sequence aligner 104 and/or a sequence analyzer 106. Units 102-106 may be implemented in one or more computerized devices as hardware and/or software units, for example, specifying instructions configured to be executed by a processor. One or more of units 102-106 may be implemented as separate devices or combined as an integrated device.

Genetic sequencer 102 may input DNA obtained from biological samples, such as, blood, tissue, or saliva, of one or more real living organisms and may output each organism's genetic sequence including the organism's genetic information at one or more genetic loci, for example, a human genome. A single organism's DNA sample may be sequenced for performing carrier testing on that individual, two potential parents' DNA samples may be sequenced for performing carrier testing on a virtual progeny generated by combining at least a portion of the two potential parents' genetic sequences, or a single potential parent's DNA sample may be sequenced for combining with each of a plurality of candidate donor sequences to generate a plurality of virtual progeny to determine an optimal and/or a least optimal subset of one or more donors.

Sequence analyzer 106 may generate virtual progeny by inputting two potential parents' genetic sequences to simulate a mating by combining at least a portion of genetic information derived therefrom and output a virtual progeny genetic sequence of a virtual gamete, for example, as described in reference to FIG. 17.

Sequence aligner 104 may align one or more loci of the organism or virtual progeny's genetic sequence with a plurality of reference genetic sequences of extant organisms from (a) one or more different populations for generating a population selection component and (b) one or more different species for generating an evolutionary constraint component. In some embodiments, a sequence aligner need not be used.

Sequence analyzer 106 may input the multiple sequence alignment and may compute measures of (a) population-specific variation of alleles and (b) species-wide evolutionary variation of alleles at one or more aligned genetic loci. Sequence analyzer 106 may generate (a) a population selection component and (b) an evolutionary constraint component based on these measures (e.g. as shown in FIG. 3A-B). Sequence analyzer 106 may use a neural network to combine these and other components (e.g., population selection component, evolutionary constraint component, clinical classification component, and/or mutation class components) to generate one or more variant-specific gene dysfunction likelihoods or scores measuring the relative propensity or risk that these alleles are damaging in a living organism or would be damaging if produced in a child. Sequence analyzer 106 may have a training phase and a runtime phase. In the training phase, sequence analyzer 106 may compare VGD predictions to known results to validate the model and set parameters (see e.g., “Parameter” in FIG. 3D), confidence weights, and sensitivity thresholds. In the runtime phase, sequence analyzer 106 may generate VGD scores to test the risk of gene dysfunction due to specific allele variants or mutations, including de novo variants (not yet clinically validated), in living organisms or in virtual progeny of potential parents (before a child of those potential parents is conceived). In some embodiments, sequence analyzer 106 may analyze a specific one or more targeted variants, or may progress locus-by-locus to analyze all (or a plurality) of variants throughout the human genome. Sequence analyzer 106 may analyze variants in one pass (a full analysis of all components), or multiple passes (e.g., using a coarse analysis of only one or a subset of components in a first pass to flag a subset of variants with an above-threshold likelihood to be more closely analyzed in a second pass). In the training and/or runtime phases, sequence analyzer 106 may compute each node in the neural network or variant, in series or in parallel.

Genetic sequencer 102, sequence aligner 104, and sequence analyzer 106 may include one or more controller(s) or processor(s) 108, 110, and 112, respectively, configured for executing operations and one or more memory unit(s) 114, 116, and 118, respectively, configured for storing data such as genetic information or sequences and/or instructions (e.g., software) executable by a processor, for example for carrying out methods as disclosed herein. Processor(s) 108, 110, and 112 may include, for example, a central processing unit (CPU), a digital signal processor (DSP), a microprocessor, a controller, a chip, a microchip, an integrated circuit (IC), or any other suitable multi-purpose or specific processor or controller. Processor(s) 108, 110, and 112 may individually or collectively be configured to carry out embodiments of a method according to the present invention by for example executing software or code. Memory unit(s) 114, 116, and 118 may include, for example, a random access memory (RAM), a dynamic RAM (DRAM), a flash memory, a volatile memory, a non-volatile memory, a cache memory, a buffer, a short term memory unit, a long term memory unit, or other suitable memory units or storage units. Genetic sequencer 102, sequence aligner 104, and sequence analyzer 106 may include one or more input/output devices, such as output display 120 (e.g., such as a monitor or screen) for displaying to users results provided by sequence analyzer 106 (e.g., visualizing FIGS. 2, 4-18) and an input device 122 (e.g., such as a mouse, keyboard or touchscreen) for example to control the operations of system 100 and/or provide user input or feedback, such as, selecting one or more models or phylogenetic trees, selecting one or more genus or species to use for generating the models, selecting input genetic sequences, selecting two potential parents or multiple donor candidates from a pool of potential parents with which to simulate mating, selecting a number of iterations for simulating a mating with a different pair of virtual gametes in each iteration from each pair of potential parents, etc.

Reference is made to FIG. 2, which schematically illustrates a display visualizing a DNA image 200 and/or sequence 202 of an organism comprising one or more variants or mutations 204 labeled in the DNA together with detailed information 206 including a likelihood of variant-specific gene dysfunction according to an embodiment of the invention.

DNA image 200 may be an image of DNA extracted from biological samples, such as, blood, tissue, or saliva. The organism may be a living organism or a virtual organism. When screening a living organism, DNA image 200 may be an image of DNA of the living organism undergoing screening. When screening a virtual organism, DNA image 200 may be an image of DNA of one or more of the two living potential parents whose DNA is combined to generate the virtual organism undergoing screening. For example, when two potential parents undergo carrier screening to predict disease or dysfunction in their potential child, the two potential parents' DNA may both be imaged, whereas when one potential parent seeks screening with a pool of donor candidates, the image of the DNA of the one potential parent may be displayed alone (without DNA images of candidate donors, e.g., for privacy issues) or together in a sequence of displays with the DNA image of each respective candidate donor. DNA image 200 may display a portion of or the entire length of a human genome.

DNA sequence 202 may be a schematic representation of the DNA described above, such as a sequence of nucleotides, bases, amino acids or other genetic information representing the DNA (e.g., sequenced by genetic sequencer 102 of FIG. 1). In one embodiment, DNA sequence 202 may be a magnified view of a section or segment of DNA image 200, and DNA image 200 may be highlighted to label the DNA segment to which DNA sequence 202 corresponds. DNA image 200 and/or DNA sequence 202 may be dynamic displays. For example, a user may control DNA image 200 by translating or zooming in or out of the stream, or may control DNA sequence 202 by selecting a corresponding center-point, segment, or magnification range, of DNA image 200.

One or more genetic mutations 204 or their positions may be identified or labeled in DNA image 200 and/or DNA sequence 202 e.g., by color, tone, labeling, highlighting, etc. Identified genetic mutations 204 may be variants identified or flagged by a processor (e.g., 112 of FIG. 1) to have an above threshold likelihood of variant-specific gene dysfunction (e.g., flagged as likely “pathogenic”), or variants selected by user-input to undergo screening.

The display may provide detailed information 206 about the identified genetic mutations 204, for example, automatically where flagged as pathogenic or if selected or identified by a user (e.g., by hovering a cursor over the variant in DNA image 200 and/or DNA sequence 202). The detailed information 206 may include, for example, one or more likelihoods that the identified genetic mutation 204 causes gene-dysfunction in the organism (e.g., a VGD score and/or a naïve VGD score, VGD^−Cl, omitting clinical classification data), a frequency of the genetic mutation (e.g., a total number of instances of the mutation in one or more populations, and population in formation), a mutation class, HGVsp (the standard notation for identifying an amino acid substitution), and/or clinical classification. Other information, combinations of information, or visualizations of information.

Example Data Sources

Analytic targets—Analyses described according to embodiments of the invention targeted an example set of exons from 480 genes associated with autosomal recessive disease. The gene set was composed of all autosomal genes covered by Illumina's TruSight Inherited Disease Sequencing Panel (URL: http://support.illumina.com/downloads/trusight_inherited_disease_product_files.html accessed on Sep. 24, 2014) as well as all additional autosomal genes targeted by Counsyl's Family Prep platform. Illumina TruSight One Sequencing Panel's intervals were used to target the exon regions of each gene analyzed. Exon intervals were padded with a minimum of 10 by (base pairs) to a maximum of 50 by of intronic sequence to include variants listed as “pathogenic” in the ClinVar datasets.

ExAC dataset—Population-specific allele and genotype frequency data for variants in targeted intervals were obtained from an example of 60,706 sequenced exomes that were consolidated and processed by the Exome Aggregation Consortium (ExAC) and made publically available through the Consortium's website (Version 0.3, URL: http://exac.broadinstitute.org accessed on Jan. 13, 2015). The ExAC cohort is composed of unrelated individuals from six geographically defined, and in most cases, genetically distinguishable populations: non-Finnish Europe (NFE; population size, n=33,370), Finland (FIN; n=3,307), South Asia (SAS; n=8,256), East Asia (EAS; n=4,327), Africa (AFR; n=5,203), and the Americas (AMR; n=5,789). ExAC subjects represent individual participants in several large-scale disease-specific and population genetic studies. Persons diagnosed with targeted pediatric diseases were generally excluded from participation.

ClinVar annotation archive—The ClinVar archive of DNA variants with clinical annotations is maintained in two partially overlapping datasets indexed on the human reference genome (hg19) or the HGVS format for describing variation in transcribed genomic intervals. Clinical assertions were retrieved related to variants located in targeted regions by parsing the two files clinvar.vcf and variant_summary.txt (both accessed at URL: http://www.ncbi.nlm.nih.gov/clinvar/ on Mar. 6, 2015; file parsing details can be found in more detail below).

VariBench dataset—Data from the VariBench website (URL: http://structure.bmc.lu.se/VariBench/, accessed on Feb. 6, 2015) was downloaded and processed as a second variant benchmark set. The VariBench metric clusters experimentally verified variants into “pathogenic” and “neutral” (or synonymously benign) datasets. The VariBench variants were divided into subsets that overlap targeted intervals.

Online Mendelian Inheritance in Man (OMIM)—OMIM is a catalogue of human genes and diseases with separate English language narrations provided for major allelic variants of disease genes (accessed as the compressed file omim.txt.Z located at URL: ftp.omim.org/OMIM/ on Mar. 3, 2015). Among other information, allele-specific narrations contain references to other indexed alleles in the same gene that have been detected in compound heterozygous patients. This analysis used a “natural language” approach and cross-indexing among alleles to enumerate distinct compound heterozygous second alleles found in association with each primary allele.

All data sources and testing parameters are described as examples, and are optional and may be omitted, or replaced by equivalents.

Variant-Specific Gene Dysfunction Modeling

Reference is made to FIGS. 3A-3D, which schematically illustrate an example of (A) a neural network combining multiple gene dysfunction components for determining the risk of variant-specific gene dysfunction in an organism; (B) an exploded view of the population selection component; (C) an exploded view of the clinical classification component, mutation class component and pathogenic predictors component; and (D) optimized parameters for the components in FIGS. 3A-3C, according to an embodiment of the invention.

In FIG. 3A, the neural network combines multiple gene-dysfunction components for determining the risk or likelihood that a specific variant or genetic mutation causes gene-dysfunction in organisms. For example, the multiple gene-dysfunction components include clinical classification, mutation class, pathogenic predictors, evolutionary constraint, and/or population selection. The neural network includes a plurality of nodes (b=1, 2, . . . , B), each node representing a different gene-dysfunction component, for example, associated with one or more sub-likelihoods or subscore(s) S_j^band associated weight(s) w_j^b. For each of one or more variants, j (j=1, 2, . . . J), the neural network may compute the B dysfunction components or subscores at the nodes. Each subscore S_j^bmay be computed on a continuous scale (e.g., [0.0, 1.0], or more generally [N, M] where N, M are rational numbers). Each subscore, S_j^b(b=1, 2, . . . , B), may be multiplied by a corresponding “final” component weight, w_j^b, which may represent the relative level of confidence ascribed to a particular scoring component b for a particular variant j.

Component weights may initially be “raw” weights, w_j^0b(initial or raw denoted by a “0” in the superscript) on a continuous raw scale (e.g., [0.0 to 99]), for example, derived according to component-defined sets of functions and rules. Variant components with missing or unused values may be designated as “unassigned” and/or may be given a null raw weight of, for example, 0.0. The final or scaled component weights, w_j^b, for each variant j may be obtained by recalibrating the raw weights, for example, to sum to a maximal value on the scale (e.g., 1.0 or N):

$\begin{matrix} w_{j}^{b} = \frac{w_{j}^{0 b}}{\sum_{b = 1}^{B} w_{j}^{0 b}} . & (1) \end{matrix}$

The likelihood of variant-specific gene dysfunction for variant j, denoted as VGD_j, may be computed as a weighted sum of the B dysfunction components:

$\begin{matrix} V G D_{j} = \sum_{b = 1}^{B} w_{j}^{b} S_{j}^{b} . & (2) \end{matrix}$

The specific component subscores and weights are explained in more detail in their respective sections below. The neural network of FIGS. 3A-3D may incorporate an unlimited number B of component dysfunction subscores (s_j^b) at its nodes, each weighted according to an independent confidence or weight (w_j^b). In some embodiments of the invention, the neural network is not static, but dynamic. Dynamic embodiments may add new component nodes or update the values for the current nodes (e.g., adding new clinical classification or mutation type data as it is discovered), or may update the system parameters (see e.g., “Parameter” in FIG. 3D), confidence weights, and sensitivity thresholds by re-training the model using new input data in a training phase. In some examples, the VGD model may be re-trained periodically and/or upon a trigger such as the uploading or indication of new input or training data. Typically, the population size (N_j^r), variant frequencies (f_j^r) and/or clinical visibility (CV) are revised with the expansion of available data. The training phase may optimize the neural network based on a cost factor, for example:

$\begin{matrix} C_{VGD} = \frac{1}{3 G} \sum_{g}^{G} [(1 - t_{g}^{bs}) + \frac{u_{g}^{> bs}}{U_{g}} + {\hat{d}}_{g}^{cv 2}], & (3) \end{matrix}$

where t_g^bsis a pathogenic bootstrap threshold providing a lower bound for the VGD scores of a predetermined number or percentage (e.g., 99%) of known pathogenic mutations, u_g^>bsis a measure of how many uncharacterized variants with a minimum allele count (e.g. allele count>5) fall above that pathogenic bootstrap threshold, U_gis a total count of uncharacterized variants to scale u_g^>bsas a ratio, {circumflex over (d)}_g^cv2is a measure of the mean dysfunction values for benign variants (e.g., classified as ClinVar-2 or ClinVar-3). The neural network is optimized in the training-phase by reducing the cost factor C_VGD. Minimizing or reducing C_VGDmay optimize the VGD model to better match three groups of variants: (1) known pathogenic variants are matched by shifting the center of VGD of known pathogenic mutations toward one or more maximal likelihoods (e.g., by minimizing (1−t_g^bs) in the cost factor when the maximal likelihood is 1); (2) known benign variants are matched by shifting the center of VGD of known benign mutations toward one or more minimal likelihoods (e.g., by minimizing {circumflex over (d)}_g^cv2in the cost factor); and (3) uncharacterized variants are matched by shifting the center of VGD of uncharacterized away from the one or more maximal and/or minimal likelihoods (e.g., by minimizing u_g^>bs/U_gin the cost factor). The neural network is optimized in the training-phase on a gene-by-gene basis, after which the gene-specific optimization results are aggregated or summed across multiple genes, genome segments or an entire genome, for example, to obtain a combined genome-wide cost factor to be minimized.

Population Selection Component

The “population selection” component PopScore_j(S_j^P) may provide a population-specific measure that each individual population suppresses or naturally selects against a particular allele variant or mutation. The population selection component may be inversely related to the observed frequency of the allele in the population. Because natural selection typically factors against damaging or disease-causing variants, the more frequent a variant is, the less likely the variant is to cause a trait that is considered dysfunctional or disease-causing in a particular population; whereas the less frequent a variant is, the more likely the variant is to cause such a trait. Because each population is unique, different populations will generally select differently against some traits (e.g., deafness may be considered acceptable for survival in some modern populations, but not in other populations such as hunting populations) and may select similarly against some common or universally dysfunctional traits (e.g., traits that threaten survival for all humans such as cancers). The population selection component balances a homozygous effect, a heterozygous effect and/or a dominant effect.

The homozygous effect may be used to identify variants that cause recessive disease or traits. The homozygous effect may measure the frequency or rarity of homozygote variants for each of multiple populations, for example, by comparing an observed homozygote incidence of a variant (e.g., measured on both chromosomes at the variant's genetic locus) relative to an expected homozygous incidence (e.g., based on a total observed incidence of the identified genetic mutation measured on either chromosome at the genetic locus and the population size) in each population. The likelihood of variant-specific gene dysfunction score may be inversely proportional to the ratio of the observed versus expected homozygote variants because a suppressed incidence of homozygotes in the population may indicate a population selection against those homozygotes (e.g., for causing a recessive trait or disease).

The heterozygous effect may measure the total frequency of variants (e.g. measured on either chromosome) in each population relative to its prevalence in the clinical literature. The likelihood of variant-specific gene dysfunction score may increase when the relative clinical visibility compared to its allele frequency in a population increases, indicating the variant is an anomaly or especially relevant to disease diagnostics.

The dominant effect may be used to identify variants that cause dominant diseases or traits. The dominant effect may measure a variant's observed allele count compared to an expected allele count for any pathogenic variant in the same gene, for example, based on a distribution (e.g., a Poisson or CFD distribution) of allele counts across a plurality of (or all) known pathogenic mutations in the gene. If the allele count of that particular variant is relatively higher than expected based on the distribution of the allele counts of the pathogenic variants for that gene, the likelihood that the variant causes gene dysfunction as a dominant allele is relatively decreased.

Pathogenic Predictors Component

The “pathogenic predictors” component PathScore_j(S_j^PP) may compose one or more metrics (R_j^pp) that predict a predicted degree of pathology of a variant or variant class under analysis. For example, a PROVEAN metric (R_j^PR) and a VEST metric (R_j^V) predict whether a protein sequence variation affects protein function, a CADD metric (R_j^C) predicts the effects of any type of variant, and PolyPhen-2 metric (R_j^P2) predicts the effects of a missense amino acid substitution on the structure and function of a human protein. All four pathogenic predictor metrics are trained by machine-learning on sets of presumed pathogenic and presumed benign variants. The pathogenic predictors component (S_j^PP) may be defined, for example, as:

S
_j
^PP=Σ_pp=1^PPu_j^PPR_j^PP (4),

where (R_j^pp) is the predictive damage score generated from each of PP pathogenic predictor metrics (pp=1, 2, . . . , P), and u_j^ppis the corresponding subcomponent weight. The choice of pathogenic predictor metrics (R_j^pp) may also be dynamic; the pathogenic predictors component may add, recompose, and remove subcomponent metrics in equation (4), for example, eliminating nodes that have no data (e.g., “unassigned”) or when a confidence weight is substantially negligible (e.g., nodes with negligible weight may be considered equivalent to discounting the nodes completely, as long as other nodes have significant weight).

Raw data for the pathogenic predictors, PolyPhen-2, VEST, CADD, and PROVEAN, may be obtained, for example, by sending query batches to the respective websites (http://genetics.bwh.harvard.edu/pph2/bgi.shtml, http://www.cravat.us/, http://cadd.gs.washington.edu/score, and http://provean.jcvi.org/genome_submit_2.php).

Each of the raw pathogenic predictor data may be differently scaled and the pathogenic predictor metrics (R_j^pp) may map or transform the raw pathogenic predictor data onto a uniform scale. The scale for this and all component scores may be, for example, a [0.0, 1.0]) scale, or any [M, N] scale, where M, N are rational numbers such as integers.

Raw VEST data is provided on a [0.0, 1.0] scale. The VEST metric (R_j^V) may map the raw VEST values linearly by a constant factor. For example, if the pathogenic predictors component is provided on a [0.0, N.0] scale (where N is an integer), the raw VEST values may be multiplied by N. In the example where the pathogenic predictors components (R_j^pp) use the same [0.0, 1.0] scale as the raw VEST values, the raw VEST values need not be scaled and may be set to equal the VEST metric (R_j^V).

Raw PolyPhen-2 data provides two metrics, HumDiv and HumVar, which are trained using two different models. Each of the HumDiv and HumVar values are provided on a [0.0, 1.0] scale. The PolyPhen-2 metric (R_j^P2) averages these two metrics (HumDiv and HumVar), thereby mapping each of them onto a half-scale (dividing each metric by a factor of 2). For example, if the pathogenic predictors component is provided on a [0.0, N.0] scale (where N is an integer), the raw PolyPhen-2 values HumDiv and HumVar may each be multiplied by N/2. In the example where the pathogenic predictors components (R_j^pp) use the same [0.0, 1.0] scale as the raw PolyPhen-2 values, the raw PolyPhen-2 values HumDiv and HumVar are each scaled by ½. In other embodiments, the HumDiv and HumVar may be scaled differently (e.g., 1/n and (n−1)/n), for example, depending on the relative sample size or confidence level of the damaging vs. non-damaging training set. The raw PolyPhen-2 weight (w_j^P2) is a measure of the confidence of the PolyPhen-2 data, for example, based on the consistency or difference between its two values, (HumDiv−HumVar).

Raw CADD data may provide a “C-score” (CADD_j) for each variant representing its pathogenic or deleteriousness ranking, for example, relative to the approximately 8.6 billion (˜10^9.9) single base changes that could occur in the human genome. The raw CADD C-score (CADD_j) is defined on a “Phred” scale, which is a base-10 logarithmic scale in which each 10 points corresponds to an order of magnitude. The raw CADD C-scores may be transformed, for example, based on an optimized logistic function into a sigmoidal distribution of the adjusted CADD scores (e.g., equation (6)). Reference is made to FIG. 8, which is a graph of an example logistic transformation from the raw Phred scaled CADD scores (CADD_j) to the adjusted CADD scores (R_j^C) (solid line) according to an embodiment of the invention. The raw Phred scaled CADD scores (CADD_j) may be transformed to a linear scale such as [M, N], for example, [0.0, 1.0].

Raw PROVEAN scores (PROV_j) may be transformed to a linear scale using an optimized logistic transformation (e.g., equation (7)). Reference is made to FIG. 9, which is a graph of an example logistic transformation from the raw PROVEAN scores (PROV_j) to the adjusted PROVEAN scores (R_j^PR) (solid line) according to an embodiment of the invention. The raw PROVEAN scores (PROVO may be transformed to a linear scale such as [M, N], for example, [0.0, 1.0].

The raw CADD and PROVEAN values may also indicate intrinsic levels of confidence implied by initially extreme raw CADD and PROVEAN scores. For example, a variant ranked by CADD as among the top 0.1% of variants (e.g., CADD_j>30) or among the top 0.0000001% of variants (e.g., CADD_j>90), in which all (or substantially all) of the contributing support vectors agree that the variant is damaging, may be transformed to the most damaging pathogenic predictor level (e.g., R_j^C≈1.0). The confidence level of the CADD and PROVEAN values may be represented by the pathogenic predictor weights, u_j^Cand u_j^PR. The raw CADD and PROVEAN values are transformed to raw CADD and PROVEAN weights, u_j^0Cand u_j^0PR, for example, using a power function such as in equations (8) and (9), respectively. FIG. 8 shows the relationship between the raw CADD scores (CADD_j) and the raw CADD weights (u_j^0C) and FIG. 9 shows the relationship between the raw PROVEAN scores (PROV_j) and the raw PROVEAN weights (u_j^0PR) (dashed lines). These non-linear exponential weight relationships provide disproportionately greater weights to the extreme-valued raw metrics (extremely high CADD values and extremely small PROVEAN values, as compared to moderate or relatively small CADD values and relatively large PROVEAN values). Accordingly, relatively highly deleterious raw values of CADD_jand PROV_jresult in relatively higher weights for u_j^Cand u_j^PR, respectively (e.g., as shown in FIGS. 8 and 9).

When there is no raw data for a particular variant, the respective subcomponent score may be designated as “not assigned,” and the raw weight, u_j^0pp, may be set to a null or negligible value, for example, to 0.0. The final individual metric weights, u_j^pp, may be recalibrated, for example, so that they sum to 1.0. The overall raw weight for the predictive damage component in equation (2), w_j^PP, may be set equal or proportional to the sum of the final individual metric weights (e.g., w_j^0PP˜Σ_pp=1^PPu_j^pp, as shown in FIG. 3C).

Mutation Class Component

The mutation class component MutScore_j(S_j^M) may represent a mutation type, category or class, of a variant. Table 2 shows an example of mutation class subscores (S_j^M) and raw weights (w_j^0M) for various mutation types. Mutation types may be a first order categorization of molecular impact on RNA splicing, mRNA translation, and protein function. Examples of mutation types include, start-loss, stop-gain, frame-shift indel, essential splice site, microsatellite, synonymous, in-frame indel, missense, and stop-loss, although other mutation types may be used.

Start-Loss, Stop-Gain, Frame-shift Indel, and Essential Splice Site Mutation Classes: These mutation classes are associated with severe gene dysfunction. Variants in these mutation classes may be assigned a relatively high mutation class subscore (e.g., S_j^M=1.0) and a relatively high confidence raw weight (e.g., w_j^0M=99).

Synonymous Mutation Class: These mutation classes are associated with low incidences of gene dysfunction. Variants in the synonymous mutation class may be assigned a relatively low or null mutation class subscore (e.g., S_j^M=0.0). The synonymous mutation class is given a neutral confidence raw weight (e.g., w_j^0M=1.0).

Microsatellite Mutation Class: For variants in the microsatellite mutation class, the longer the length of the repeating microsatellite sequence, the less likely the microsatellite impacts gene function and the less likely that a mutation of one of its alleles will cause dysfunction. Accordingly, the mutation class subscore for variants in a microsatellite class may be inversely proportional to the length of the repeating microsatellite sequence (e.g., S_j^M=1/(1+mmsSTR^ems, where STR is equal to the number of short tandem repeats that exist in all of the microsatellites in ExAC at a given position, and ems and mms are tunable parameters). The microsatellite mutation class may be given a neutral confidence raw weight (e.g., w_j^0M=1.0).

Missense and Stop-loss Mutation Class: Variants in the missense and stop-loss mutation classes may be assigned a relatively moderate mutation class subscore (e.g., S_j^M=0.5) because their impact on gene function is highly variable.

In-Frame Indel Mutation Class: An in-frame indel mutation typically inserts or deletes an integer number (AA_j) of amino acids or codons. Because the likelihood of dysfunction typically increases the greater the number of damaged amino acids, in-frame indel mutation dysfunction scores may increase with the number of codons added or subtracted, for example, asymptotically approaching a maximum score (e.g., 1.0), for example, as defined in equations (10a) or (10b). Reference is made to FIG. 10, which is a graph of an example relationship between in-frame indel mutation class scores (S_j^M) vs. the number of inserted codons according to an embodiment of the invention. Red asterisks identify codons.

In-frame indels, missense, stop-loss and other mutation classes may be assigned a relatively low raw weight (e.g., w_j^0M=0.01) because their impact on gene function may be more accurately assessed by other scoring components. The mutation class subscore for variants of these mutation classes will thus only significantly contribute to the aggregate VGD_jscore if the variant's remaining gene dysfunction components also have relatively small weights or are unassigned. All remaining mutation classes, including introns and un-translated region, may be assigned a relatively low or null mutation class score (e.g., S_j^M=0.0).

Clinical Classification Component

The clinical classification component, ClinScore_j(S_j^Cl), may represent a clinical classification assigned to a variant j. The clinical classification component is typically only considered (e.g., non-zero) when a variant has been clinically validated as causing disease or dysfunction in a living patient with that disease or dysfunction. The clinical classification component may therefore not be considered (e.g., zero) for all new or de novo variants not yet linked to cause disease or dysfunction in a clinical setting (e.g., variants listed in Table 1 or labeled as “No ClinVar Pathogenicity Classification” in FIG. 18). Where a variant is clinically validated, for example as “pathogenic” or “benign”, that information should be weighed heavily against other predictive factors.

FIG. 3C shows one embodiment of clinical classification subscores (S_j^Cl) and raw weights (w_j^0Cl) for various types of clinical classification. The clinical component, ClinScore_j(S_j^Cl) may be assigned a maximal relatively high value (e.g., S_j^Cl=1.0) for all variants that have a (e.g., ClinVar) classification of “pathogenic” and a minimal or relatively low value (e.g., S_j^Cl=0.0) for all variants that have a most damaging (e.g., ClinVar) classification of “benign” (see e.g., Table 3). The raw clinical classification weight (w_j^0Cl) may represent the certainty of the classification. For example, variants classified as “uncontested” pathogenic or benign may be assigned a maximal or relatively high weight (e.g., w_j^0Cl=20), variants classified as “probably” pathogenic or benign may be assigned a moderate or relatively lower weight (e.g., w_j^0Cl=10), and variants classified as “contested” pathogenic or benign may be assigned a minimal or relatively lowest weight (e.g., w_j^0Cl=1) (see e.g., Table 3).

In some embodiments, to counterbalance low weights for variants with probable or contested pathogenic classifications, the weight for all non-benign variants may be increased based on a compound heterozygote count, h_j

$(e . g . {(\frac{h_{j}}{3})}^{2}),$

where h_jmay be the number of alternative alleles reported in compound heterozygote patients, for example, by a disease database such as OMIM. The compound heterozygote count, h_j, may act as a proxy for independent clinical replication of pathogenic findings. Thus, the compound heterozygote count, h_j, may also be used to reduce the confidence assigned to a benign classification, e.g., with h_j≧3, for example to reduce its weight (e.g., to w_j^0Cl=0.0).

Bootstrap Pathogenicity Thresholds

Pathogenicity thresholds may be used to delineate a continuous range of VGD scores that are pathogenic or associated with gene dysfunction. Reference is made to FIG. 7A, which is a graph of an example of one-sided bootstrap confidence intervals per gene (t_g^bs) according to some embodiments of the invention. One-sided bootstrap confidence intervals may be generated, e.g., during a training-phase, as a lower bound for the VGD scores of a plurality or majority (e.g., 99%) of known pathogenic mutations. That is, a predetermined number or percentage (e.g., 99%) of a random sampling of uncontested pathogenic variant's VGD scores are computed above this threshold. In one example of a training-phase, the one-sided bootstrap confidence intervals may be generated to conform non-clinical VGD scores (e.g., average VGD_j^−Clwithout the clinical classification (ClinVar) component among ClinVar-5 and 5.1 variants for each gene) with clinical classification data. In some embodiments, an optimization may be executed to shift the center of the VGD scores of known pathogenic variants above the threshold and shift the center of the VGD scores of uncharacterized variants below the threshold. In one example training-phase, 10,000 boot-strapped samples were used in the training set to calculate pathogenicity thresholds and genes were only considered with at least ten ClinVar-5 variants. Other training parameters, optimizations, and training sets may be used. Once bootstrap pathogenicity thresholds are computed, they may be applied in a run-time phase for example to detect novel or de novo (e.g., uncharacterized) variants as dysfunctional that have VGD scores above these thresholds (see e.g., Table 1).

Expected Discovery Rate of Novel Disease Variants in Heterozygotes or Homozygotes

An expected newborn frequency of deleterious variants in heterozygotes compared to homozygotes (HET:HOM) is derived from the Hardy-Weinberg equilibrium as a function of disease incidence:

$\begin{matrix} \frac{HET}{HOM} = \frac{2}{\sqrt{I}} - 2. & (5) \end{matrix}$

For example, based on a disease incidence of 1/2,500, a cystic fibrosis-causing CFTR variant is 98-fold more likely to appear in an unaffected person compared to an affected newborn. Most serious recessive diseases have lower individual incidences and, thus, higher predicted HET:HOM ratios.

Clinical Classification Data Parsing Details

ClinVar has two sources of data: clinvar.vcf (CV in standard VCF format) and variant_summary.txt (VST) files. The CV and VST files label positions of deletions differently and their data are thus difficult to combine. The VST file right-aligns the position of deletions, while the CV file (like other VCF files) left-align the position of deletions. For example, in a situation where the first four positions of a reference sequence is ‘AAAA’, and there is a deletion of a single ‘A’. There is no way to determine which of the 4 ‘A’s is actually deleted; however the deletion must be assigned a single position. VCF format labels this deletion at position 0, while VST labels this same deletion at position 4; yet they refer to the exact same variant. A similar issue occurs when labeling a specific sequence change of any indel. There is a need in the art to provide a consistent way to label specific sequence changes that works with any initial format, for example, even identical indels with different reference alleles.

To properly recognize any indel, embodiments of the invention propose a new technique to transform all variants into a standard labeling format. In this format, the reference is a single base, and an indel may be indicated by a first code or symbol (e.g., “+”) preceding all inserted bases and a second code or symbol (e.g., “−”) preceding all deleted bases. This format properly labels the insertion(s) and deletion(s) made to mutate the reference base into a mutated variant or allele. Additional unique difficulties and exceptions may be handled on a case-by-case basis for the corresponding variants.

Population Details

In some embodiments, allele frequency in the Finnish population may be ignored because it provides skewed results due to a bottleneck effect that took place during the founding of Finland roughly 2,000 years ago. This bottleneck effect has contributed to “Finnish Disease Heritage,” in which deleterious mutations that are rare elsewhere may exist at disproportionately higher frequencies in Finland. Accordingly, in such embodiments, the Finnish data set may be ignored or only used as part of the global data (in other embodiments, the Finnish data set may be considered).

In some embodiments, low-quality samples (e.g., indicated with a low allele number such as AN<500) may be selectively removed or ignored in the training set on a variant-by-variant basis. In one example, frequencies for a variant were only considered based on super-populations with an above threshold AN (e.g., AN_super-pop≧500). If none of the super-populations surpassed this threshold, but a global AN exceeded a global threshold (e.g., AN_global≧1,000) for the variant, the global population frequency may be used instead. For a rare case in which the global AN is below the global threshold, or the variant was not observed in ExAC, the variant may be determined to be novel.

Transformation of CADD and PROVEAN Values

The raw CADD and PROVEAN values may be mapped or transformed onto a uniform linear VGD scale, for example, a [M, N] scale such as [0.0, 1.0]).

To transform the raw CADD values, the Phred scale CADD C-scores may be transformed to values on a continuous VGD scale (e.g., [0.0, 1.0]), for example, using the following optimized logistic equation with midpoint (m_C) and steepness (k_C) parameters:

$\begin{matrix} R_{j}^{c} = \frac{1}{1 + 10^{kc (m c - {CADD}_{j}) / 10}} . & (6) \end{matrix}$

FIG. 8 shows a graph of this example logistic transformation from the raw Phred scaled CADD scores (CADD_j) to the adjusted CADD scores (R_j^C) (solid line). In one example implementation, parameters were optimized to generate a midpoint (m_C) of 20.4 and a steepness (k_C) of 2.3 (although other parameters and values may be used).

Similarly, the raw PROVEAN values (PROV_j) for each variant (j) may be transformed to the continuous VGD scale (e.g., [M, N], [0.0, 1.0]), for example, using the following equation:

$\begin{matrix} R_{j}^{pr} = \frac{1}{1 + e^{kpr ({PROV}_{j} + mpr)}} . & (7) \end{matrix}$

FIG. 9 shows a graph of this example logistic transformation from the raw PROVEAN scores (PROV_j) to the adjusted PROVEAN scores (R_j^PR) (solid line). In one example implementation, the parameters were optimized to generate a midpoint (m_PR) of 1.6 and a steepness (k_PR) of 1.6 (although other parameters and values may be used).

The confidence of the raw CADD and PROVEAN values may increase disproportionately greatly for highly deleterious values of CADD_jand PROV_j(e.g., extremely high, such as top 0.1%, of CADD_jvalues and extremely low, such as bottom 0.1%, of PROV_jvalues, since PROV_jvalues are negative). Accordingly, the raw CADD and PROVEAN weights, u_j^0Cand u_j^0PR, may be defined, for example, as:

$\begin{matrix} u_{j}^{0 c} = bc + {(\frac{\max ({CADD}_{j} - sc, 0.0)}{10})}^{ec} . & (8) \\ u_{j}^{0 pr} = bpr + {(\frac{{PROV}_{j} + mpr}{dpr})}^{2} (e^{- sp + [{PROV}_{j} + mpr]}) . & (9) \end{matrix}$

FIG. 8 shows the relationship between the raw CADD values (CADD_j) and the raw CADD weights (u_j^0C) and FIG. 9 shows the relationship between the raw PROVEAN values (PROV_j) and the raw PROVEAN weights (u_j^0PR) (dashed lines). These relationships provide disproportionately greater weights to extreme-valued raw metrics (extremely high CADD values and extremely small PROVEAN values). For example, the raw CADD and PROVEAN weights, u_j^0Cand u_j^0PR, scale variants with the most deleterious values (e.g., the top 0.1% of values) to have relatively high raw weights (e.g., u_j^0Cand u_j^0PR≈99), while the remaining majority of variants have relatively negligible or neutral raw weights (e.g., u_j^0Cand u_j^0PR≈1.0). The precise power functions and tuning metrics may be optimized in the training-phase.

Mutation Class Component MutScore_jof In-Frame Indels

The mutation class component MutScore_j(S_j^M) of in-frame indels typically increases with the number of codons added or subtracted for coding amino acids, denoted (AA_j) (e.g., AA_j=⅓*number of bases added or subtracted, so that if 12 bases are inserted, AA_j=4). As AA_jincreases, the likelihood of the mutation disrupting protein function increases, but at a diminishing rate (e.g., asymptotically approaching a maximum value, such as, 1.0).

In one embodiment, as shown in FIG. 10 and Table 2, the mutation class component MutScore_j(S_j^M) of in-frame indels may be calculated, for example, as:

S
_j
^M
=I
_j[0.6+0.4*(1−e^−AA^j^/10)] (10a).

where I_jis an indicator variable, for example, that is 1 if variant j is an in-frame indel, and 0 otherwise.

In another embodiment, shown in FIG. 3, the mutation class component MutScore_j(S_j^M) of in-frame indels may be calculated, for example, as:

$\begin{matrix} S_{j}^{M} = \frac{1}{1 + mii ({bii}^{- {AA}_{j}})} . & (10 b) \end{matrix}$

Evolutionary Selection Component

The “evolutionary selection” component EvoScore_j(S_j^E) may provide a single or cross-species measure that natural selection suppresses a particular allele variant or mutation. An evolutionary model may predict likelihoods that allele mutations or variations would be deleterious based on their frequency or rarity of occurrence across multiple reference genetic sequences in a single species (“single-species” model) or multiple different species (“multi-species” model). For example, allele mutations or variations that are relatively more rare across the reference genetic sequences may be considered negatively selected for evolutionarily (e.g. associated with a deleterious trait for which an organism cannot or has a relatively lower likelihood of surviving or reproducing), while allele mutations or variations that are relatively more common across the reference genetic sequences may be considered positively or neutrally selected for evolutionarily (e.g. not associated with a deleterious trait, but traits for which an organism has a neutral or improved likelihood of surviving or reproducing). Embodiments of the invention may compute a measure of evolutionary variation of alleles (f_j) at each of one or more aligned genetic loci as a function of variation in alleles at corresponding aligned genetic loci in the multiple sequence alignment (MSA) (e.g., FIG. 15). The measure of evolutionary variation of alleles may be transformed into a likelihood or subscore (S_j^E) associated with a relative propensity that this allele mutation is damaging in an organism or would be damaging if produced in a child. The evolutionary selection component (S_j^E) may represent one or more likelihoods that an allele mutation at each of the one or more genetic loci will cause dysfunction or disease based on the measure of evolutionary variation (f_j) of alleles at the corresponding aligned genetic loci for one or multiple organisms.

Some embodiments may assign one or more likelihoods that an allele mutation in an organism is deleterious based on the evolutionary variation at the allele loci in real extant species or populations, for example, in order to diagnose living organisms, cells or tumors, or to analyze virtual progeny to filter out prospective pairings of gametes prior to conception.

In some embodiments, multiple reference genetic sequences from the one or more species may be aligned to link or associate one or more genetic loci in each of the multiple different sequences. Aligned loci of the different sequences may be derived from one or more common ancestral genetic loci and/or may relate to the same features, diseases or traits. A measure of evolutionary variation of alleles at one or more of the aligned genetic loci may be computed, for example, as a function of variation in alleles at corresponding aligned genetic loci in the multiple aligned reference genetic sequences. Aligned genetic loci associated with a relatively lower frequency of allele variation may indicate that the alleles are “functional” or relatively important to an organism's survival and their mutations may have a relatively higher likelihood of causing deleterious traits in an organism, whereas aligned genetic loci associated with a relatively higher frequency of allele variation in the reference genetic sequences may indicate that the alleles are less or non-functional and may be mutated with a relatively lower likelihood of impacting the survival or formation of deleterious traits in an organism. In some embodiments, the reference genetic sequences in the model may be weighted according to their evolutionary proximity of its population or species to the population or species of the organism and/or potential parent. For example, more weight may be assigned to reference genetic sequences of populations or species that are relatively more evolutionarily related (e.g., closer on a phylogenetic tree or having a relatively smaller Hamming distance).

Genetic sequences may be obtained from a living organism or two potential parents, such as, two individuals that plan on mating or between one individual seeking a genetic donor and each of a plurality of candidates from a pool of genetic donors. The potential parents' genetic sequences may be obtained from genetic DNA samples of biological material from the potential parents. A mating may be simulated between two potential parents, for example, by combining the genetic information from the two potential parents' genetic sequences to generate one or more genetic sequences of simulated virtual progeny.

The living or virtual organism's genetic sequence may be aligned with one or more of the reference genetic sequences to identify one or more alleles that evolved from the same ancestral genetic loci. The organism may be assigned one or more of the likelihoods of exhibiting deleterious traits associated with one or more alleles or mutations in the virtual progeny genetic sequences based on the measure of evolutionary variation of alleles at the corresponding aligned genetic loci in the reference genetic sequences.

Embodiments of the invention overcome the limitations of relying on specific information derived from human or cellular studies of the effect of mutation in order to score the propensity or probability that a particular mutation or allele will cause a deleterious phenotype, trait or disease.

An insight recognized according to embodiments of the invention is that extant genetic variation, that is existing or surviving genetic variation present amongst homologous or paralogous reference DNA sequences present in different organisms or members of a population, represents the outcome of an experiment that can be informative for predicting whether a given mutation or allele variation in an organism's genome is likely to be deleterious.

This experiment is the four billion year long process of evolution, which has governed the replication and diversification of life on Earth. Today, there are many species, and individuals within a species all contain copies of genetic material, which is derived from common ancestral versions. As species and individuals reproduce and copy their DNA, mutations appear which make these descendent copies distinct from the parental versions. The eventual fate of such new mutations, whether they will continue to be passed along to offspring or eventually die out, is determined by a stochastic process that is influenced by the mutation's effect on the reproductive fitness of the organism. Mutations that have no functional effect (neutral mutations) or are beneficial to an organism (positive mutations) are more likely to eventually increase in frequency and persist in the population, increasing diversity or replacing their parental version. In contrast, mutations which lower the reproductive fitness of an organism (negative or deleterious mutations) are unlikely to persist and contribute to future genetic variation.

Over the course of evolutionary time, a great many mutations have appeared and persisted, leading to the present diversity amongst DNA sequences derived from a common ancestor. However, this diversity is not equally distributed amongst all sequence positions in a genome. Although mutations are essentially introduced during the replication process independent of any functional effect they may have, the evolutionary filtering process is greatly influenced by such effects. As such, when comparing the genomes of several species or individuals today, some areas are conserved (such as having the same coding sequence and/or non-coding sequence), while others have much more greatly diverged (having very diverged sequences from each other or relative to the ancestral copy number).

Reference is made to FIG. 15, which shows an example of an alignment of multiple genetic sequences (SEQ ID Nos.: 1-36, proceeding from top to bottom, respectively) according to an embodiment of the invention.

The abbreviations for the species in FIG. 15 are as follows: LATCH=Latimeria chalumnae; XENTR=Xenopus tropicalis; TAEGU=Taeniopygia guttata; MELGA=Meleagris gallopavo; CHICK=Gallus gallus; ORNAN=Ornithorhynchus anatinus; LOXAF=Loxodonta africana; HORSE=Equus caballus; TURTR=Tursiops truncatus; MYOLU=Myotis lucifugus; AILME=Ailuropoda melanoleuca; OTOGA=Otolemur garnettii; CALJA=Callithrix jacchus; MACMU=Macaca mulatta; NOMLE=Nomascus leucogenys; PONAB=Pongo abelii; GORILLA=Gorilla gorilla; CHIMP=Pan troglodytes; HUMAN=Homo sapiens; TUPBE=Tupaia belangeri; OCHPR=Ochotona princeps; CAVPO=Cavia porcellus; SPETR=Spermophilus tridecemlineatus; DIPOR=Dipodomys ordii; MOUSE=Mus musculus; RAT=Rattus norvegicus; SARHA=Sarcophilus harrisii; MONDO=Monodelphis domestica; MACEU=Macropus eugenii; DANRE=Danio rerio; GADMO=Gadus morhua; ORYLA=Oryzias latipes; GASAC=Gasterosteus aculeatus; ORENI=Oreochromis niloticus; TETNG=Tetraodon nigroviridis; TAKRU=Takifugu rubripes.

In the example of FIG. 15, the multiple aligned reference genetic sequences represent a portion of the DNA sequence coding for the PEX10 proteins present in organisms from multiple vertebrate species. Item A in the figure shows a nucleic acid genetic locus which is completely conserved across all species in the alignment, as all species have a Guanine (symbolized by the letter G) at this locus position. Although many mutations that change the amino acid at this position have undoubtedly been introduced into this gene over the course of the 500 million years of vertebrate evolution, the fact that no such mutation persists today is a strong indication that such mutations are likely to be deleterious and reduce evolutionary fitness. In contrast, the position in the gene indicated by item B in FIG. 15 is much more variable, with different species having at that locus position one of the following DNA bases: Guanine (G), Adenine (A), Thymine (T), Cytosine (C). The diversity of DNA (or alternatively the amino acids encoded by the DNA) at this genetic locus provides an indication that it is relatively less likely that a mutation at this position in an organism's genotype will be deleterious, relative to a mutation at the genetic locus position indicated by item A.

A multiple sequence alignment of present day reference genetic sequences may be derived from common ancestral genetic loci of multiple species (e.g. different vertebrate sequenced genomes) or multiple individuals within a single species (e.g. a collection of human sequences). A substantially large sample size of organisms, populations or species (e.g., tens, hundreds, or more) may be used for statistically significant likelihoods, for example, to reduce bias error due to a skewed sample set.

Embodiments of the invention may compute a measure of evolutionary variation of alleles f at each of one or more aligned genetic loci as a function of variation in alleles F at corresponding aligned genetic loci in the multiple sequence alignment (MSA). The measure of evolutionary variation of alleles f may be transformed into a likelihood or sub score (S_j^E) associated with a relative propensity that this allele mutation would be damaging in an organism. This likelihood or subscore (S_j^E) may be derived, for example, using two functional transformations F and S, to convert columns of aligned genetic loci of a multiple sequence alignment (MSA) and a putative mutation or allele in an organism into a propensity score or likelihood (S_j^E) relevant to assessing the effect of that particular allele or mutation on the organism, for example, as:

Multiple Sequence Alignment (MSA)→f=F(MSA)→S_j^E=S(f) (11)

The first functional transformation shown in equation (11), f=F(MSA), may be used to compute a measure of evolutionary variation of alleles f at each of one or more genetic loci derived from one or more common ancestral genetic loci in the multiple organisms as a function of variation in alleles F at corresponding aligned genetic loci in the multiple aligned genetic sequences. The first functional transformation may create a raw score that quantifies the relative amount of sequence conservation at the one or more genetic loci. There are many possible instantiations of this function that may be used according to embodiments of the invention. For example, one such function may input information from the DNA or amino acid genetic sequences present in the alignment and output a Shannon entropy of the sequence characters at each of the one or more genetic loci. Denoting a frequency of a particular symbol (DNA base or amino acid) at a particular genetic locus or column position, (j), in a multiple sequence alignment as P_i, i={A, C, G, T} (for DNA, or the set of amino acid symbols if considering a protein alignment), the Shannon entropy function may be calculated, for example, as shown in equation (12):

F(MSA_j)=Σ_ip_i·log₂p_i (12)

Another example of the first functional transformation shown in equation (11), f=F(MSA), may take the average pairwise difference between different symbols (S) in an aligned sequence column of length N, for example, as in equation (13):

$\begin{matrix} F ({MSA}_{j}) = {(\begin{matrix} N \\ 2 \end{matrix})}^{- 1} \sum_{i = 1}^{N} \sum_{k = (i + 1)}^{N} {\begin{matrix} 1 & if S_{i} \neq S_{k} \\ 0 & if S_{i} = S_{k} \end{matrix}} & (13) \end{matrix}$

Other possible functional forms of the first functional transformation, F(MSA), may calculate a distance metric from a particular species or sequence in the reference alignment. For example, the function may rank all the sequences in the alignment according to their Hamming distance from the reference (e.g., human) sequence, and then calculate the rank of the first sequence with a divergent symbol at the relevant position in the alignment, or if ranking a particular mutation, the rank of the first sequence matching that particular mutation. Additional functional forms such as not using the ordinal rankings of sequences by Hamming distance, but instead using the Hamming distance itself as the metric may be used.

Additionally, the function F(MSA) need not return a single value or be a function of a single column in the multiple sequence alignment. The function may be a composite function of one or more of the functions previously described in addition to others (e.g., F₁, F₂, F₃, . . . ), and may output a vector of values (e.g., (s₁, s₂, S₃, . . . )) rather than a single value. The function may also take as input multiple columns (j), or even the entire alignment, when calculating the value(s), f, and may also take as input a particular mutation under consideration, which may or may not affect the calculation of the values returned by the function.

The second functional transformation specified by equation (11), S_j^E=S(f), is a function which converts the measure of evolutionary variation of alleles into a subscore or likelihood of being damaging. Many possible instantiations of this function are also possible. For example, a function S_j^E=S(f) may score the value(s) of f according to its ranking in the empirical distribution of values for all mutations or alleles considered, or that could be considered, based on a collection of multiple reference sequence alignments. Other scoring methods may also be used. For example, a function trained to discriminate mutations in a database of known or suspected to be damaging alleles from neutral alleles may be used to assess the likelihood of damage (e.g., using any variety of statistical models or derived variants from experimental findings), or a function which is trained to assess the likelihood of a mutation reaching a certain frequency in the population. In all cases, this functional transformation, in combination with the first, allows particular genetic allele mutations to be ranked and assessed for likelihood of survival or damage if produced in a child.

Embodiments of the invention may create functional forms of the measure of evolutionary variation F(MSA) using a phylogenetic history for assessing likelihoods of deleterious effects of alleles. Because DNA replication is semi-conservative, the evolutionary history of a DNA molecule may be represented by a bifurcating tree, known as a “phylogenetic” tree, that represents the known or inferred historical or evolutionary relationships between present day extant reference genetic sequences. A large body of scientific literature has developed over the past 30 years that studies the problem of inferring this tree from present day sequences. Typically, such models envision the evolutionary process between nodes of the tree as being similar to a general time reversible (GTR) Markov chain. In these models, in an interval of time, t, there is a certain probability that a base in the sequence will mutate, or transition to another base (e.g., A→C). Such models may be described using a transition matrix that describes the relative probability of transition from one base to another, for example, as shown in equation (14):

$\begin{matrix} T C A G Q = \begin{matrix} T \\ C \\ A \\ G \end{matrix} (\begin{matrix} \cdot & a π_{c} & b π_{A} & c π_{G} \\ a π_{T} & \cdot & d π_{A} & e π_{G} \\ b π_{T} & d π_{C} & \cdot & f π_{G} \\ c π_{T} & e π_{C} & f π_{A} & \cdot \end{matrix}) & (14) \end{matrix}$

In equation (14) above, the elements of the transition matrix Q may define a probability that each base denoted by the row will transition to each base denoted by the column, for example, in an infinitely small evolutionary time interval Δt. Note that the diagonal terms in the transition matrix are not shown, as they are simply equal to one minus the other elements in the row (the probabilities of the elements in each row sum to 1). The π_iterms represent the equilibrium frequency of the nucleotide bases {i=A, C, G, T}, and the symbols a, b, c, d, e and f are parameters that further govern the substitution dynamics. This matrix represents a generalized time-reversible model, in which each rate below the diagonal equals a reciprocal rate above the diagonal multiplied by the equilibrium ratio of the two bases. Equation (14) is only one example of a substitution matrix used for phylogenetic inference.

Reference is made to FIG. 16, which schematically illustrates an example of a phylogenetic tree inferred from a multiple sequence alignment, for example, as shown in FIG. 15, according to an embodiment of the invention. The length of the branches shown in the tree may represent or be a function of an inferred number of substitutions or allele mutations per genetic locus that have occurred from its direct ancestral sequence. In the example shown in FIG. 16, a scale bar for the branch lengths is shown for a 0.2 branch length. The phylogenetic tree may be directly inferred using the Markov model described above. In inferring the tree, the likelihood of a tree and its branch length (e.g., given in units of expected substitutions) are maximized by varying the tree or branch lengths, until a tree with the highest likelihood is found. Alternatively, the neural network described herein may be used to compute all parameters in the phylogenetic model. In another embodiment, a prior probability distribution, as used in Bayesian statistics, may be specified for all parameters in the phylogenetic model, and several trees may be sampled according to their posterior probability distribution, as used in Bayesian statistics.

After fitting the Markov model that accounts for the phylogenetic history of the sequence alignment, whether using a neural network, Bayesian approach or maximum likelihood approach, embodiments of the invention may provide not only an inferred phylogenetic tree, but also a model for the evolutionary process of the multiple reference sequences in that tree.

Embodiments of the invention may use this evolutionary model to directly assess the likelihood that a mutation is damaging by examining the probability that a mutation found in an organism's genome persists into the future. That is, rather than using the model to infer the past history of evolution, the model is trained using the outcomes of the past and then used to calculate the likelihood of an allele persisting into the future. Because this likelihood is directly related to the probability that the allele is damaging (less likely mutations being more likely to be damaging), this phylogenetic approach, which directly accounts for the past history of the sequence in a parametric model, is a uniquely valuable functional form for the F(MSA) specified by equation (11).

Important extensions to the phylogenetic model are those which either change the model to account for sequence context (e.g., information about sequence location or what a sequence encodes, such as, methylation or homopolymer status) and functional effect (e.g., synonymous vs. non-synonymous, or affecting or not affecting expression), or that partition the sequence in some way to account for varying rates of substitutions, for example, based on the location of loci in the genome.

To account for varying rates of substitutions, for example, instead of the direct application of the substation rate matrix in equation (14) to all sequence substitutions, an alternate model may specify that although the relative rates of different types of substitutions at all alignment positions was governed by (14), the global rate at each site (that is the total mutation rate, denoted μ), may vary across sites or loci in the genome. A multitude of such models are possible. For example, the rates at different genetic loci may be drawn from a parametric distribution, such as a Γ-distribution that is also fit during the modeling procedure, or the distribution of rates may be derived from several categories of possible distributions. In one embodiment, this model may specify two different distribution categories (such as, conserved or rapidly evolving categories) and then train the model to identify to which distribution category the observed sequence belongs, for example, using a hidden Markov procedure. In some embodiments of the invention, the inferred or posterior probability that a genetic locus or mutation belongs to a category in the phylogenetic model may be returned by the F(MSA) function instead of the likelihood itself.

To account for functional effects, the phylogenetic model used to form the likelihood for the F(MSA) function may directly account for the functional consequence of a mutation. For example, the coding sequence of a protein is determined by triplets of neighboring DNA nucleotides that form a functional unit referred to as a “codon.” A mutation in a nucleotide within a codon may either have a functional effect of changing the amino acid sequence encoded by that codon (in which case it is referred to as “non-synonymous” since the mutation encodes for a different amino acid sequence) or may be a substitution with no functional effect on the amino acid encoded due to the redundancy of the genetic code (in which case it is referred to as “synonymous” since the mutation encodes for the same amino acid sequence). The Markov model that may be used in predicting the likely damaging effect of a mutation may directly account for such functional effects. For example, the transition probability of mutating from nucleotide i to j specified by the ijth matrix Q element q_ijin equation (14), may be replaced by an instantaneous transition probability t_ij, for example, defined by equation (15).

$\begin{matrix} t_{ij} = {\begin{matrix} ω q_{ij} & if i \to j is non - synonymous \\ q_{ij} & if i \to j is synonymous \end{matrix} & (15) \end{matrix}$

This would allow a new instantaneous transition matrix, T, to be used in the model, and a new parameter, ω, which is equal to the non-synonymous to synonymous substitution rate to be used in predicting the likelihood that an allele is damaging or that it persists into the future. In practice, the ω parameter may be constant for an entire multiple sequence alignment, may be assigned to each codon position in the alignment by assuming they are drawn from some hierarchical distribution, or may be uniquely assigned to each codon position. The substitution model specified by equation (15) may also be altered to account for each combination of the 64×64 possible elements of a transition matrix representing the rate in which each of the 64 possible codons (e.g., the 4³=64 different combinations of four nucleotide states (A, T, C, G) at three nucleotide positions in each codon) transition to each of the 64 possible codons. In all instantiations of the evolutionary model, the functional effect of a sequence change, whether on the amino acid, regulatory context or other biological context may be directly accounted for, and used to predict the likelihood that the allele was damaging.

Results and Discussion

The particular values, scores, subscores, parameters, graphs, functions, thresholds, relationships, slopes, or rates depicted herein may vary for example based on the input data, validation datasets, updated disease classifications, analyzed genome sequence, reference genome sequences, or other data; however, the trends and relationships such as increasing values, decreasing values, direct and indirect proportionality, continuity, stair-step relationships, etc. may be generalized in some embodiments of the invention to any or a broad range of datasets. These values were obtained using datasets that may have been limited and may be revised with more current datasets to provide more accurate results.

Example Data Acquisition and Exome Sequencing Interval Processing

The VGD model generated according to some embodiments of the invention was trained in one example using variant calling metadata from 60,706 ExAC samples. This dataset was filtered into non-overlapping intervals covering coding regions of 480 autosomal genes associated with severe recessive pediatric disease. These regions were selected because they are known to contribute to genetic disease and collectively constitute a commonly used genetic testing panel.

Filtered intervals covered a total of 1,343,919 base pairs of the human genome. E.g., three overlapping subsets of variants were considered located in targeted regions: 238,461 variants were observed in at least one subject in these regions in the ExAC dataset, 17,748 ClinVar variants, and 6,274 VariBench variants. Only 8,341 ClinVar and 2,345 VariBench variants in our intervals were present in the ExAC samples. Curated variants not found in the ExAC samples (e.g., having a frequency of ≦1/124,000) may be assigned an allele frequency of zero. In total, individual variant-specific gene dysfunction (VGD) contributions were determined for 250,419 unique variants.

Novel Variant Scoring in a Genotype-Based Disease Liability Analysis

Embodiments of the invention provide a novel and dynamic neural network that incorporates multiple sources of information to compute one or more likelihoods of variant-specific gene dysfunction, VGD_j, for each variant j in a variant dataset (j=1, 2, . . . , J). VGD_jmay be computed as the weighted sum of B component subscores (e.g., equation (2)). Each component subscore may be computed on a continuous uniform scale (e.g., a linear scale defined by [M,N], where M, N are rational numbers, such as a [0.0, 1.0] scale). In one embodiment, the VGD score may be directly proportional to gene dysfunction, for example, where substantially low or below threshold values correlate with a fully functional gene and substantially high or above threshold values correlate with complete gene dysfunction. (Alternatively, the VGD score may be inversely proportional to gene dysfunction, with opposite trends). In one embodiment, the VGD score may be computed as a combination of any two or more of the following components:

VGD_j=w_j^ClS_j^Cl+w_j^MS_j^M+w_j^PPS_j^PP+w_j^ES_j^E+w_j^PS_j^P (16),

where S_j^Cl(or ClinScore_j) is the clinical assessment of “pathogenicity,” S_jP(or PathScore_j), S_j^M(or MutScore_j) is the expected impact of the mutation class e.g., on RNA processing and translation, S_j^PP(or PathScore_j) is a combination of predictive damage values obtained from established tools, S_j^E(or EvoScore_j) is the evolutionary constraint factor measuring the natural selection against a variant across one or more species, and S_j^P(or PopScore_j) is the population selection factor measuring the natural selection against a variant within each of multiple individual human populations. Each subscore S_j^bis associated with a weight, w_j^b, for example, directly determined by the attributes of the corresponding variant and representing a level of confidence in the corresponding scoring component. A general overview of example equations, parameters and values used to calculate these components is outlined in FIGS. 3A-3D, though other equations, parameters and values may be used.

Embodiments of the invention may thereby provide a solution that accounts for many distinct types of variants and incorporates many genetic properties and components to increase detection sensitivity. Rather than indiscriminately accepting any single metric (or a fixed combination of metrics) as being the most appropriate tool for all variants, embodiments of the invention may evaluate the pertinence or weight of the available data for each unique variant. None of the utilized public metrics is the “best” for all variants and each misses diagnosing variants of a sub-optimal type; the tool with the largest numerical contribution to the final VGD_jscore is directly determined by the corresponding variant. Further, the final VGD_jscore may be dynamically updated at any time with the advent of additional data or variant evaluation tools.

Impact of Allele Frequency and Clinical Annotations on VGD Score

The VGD model described according to embodiments of the invention uniquely incorporates observed population frequencies and species frequencies from a large ethnically diverse cohort in order to generate a summary disease contribution score VGD_jfor each variant j. In this way, the VGD model is able to recognize mislabeled variants that others may miss (see e.g., Table 1 and description of FIGS. 18A-18B). Embodiments of the invention generally assume that truly deleterious mutations typically cannot be maintained in a population or species at substantial frequencies due to negative selection. Accordingly, the VGD model generates a population selection component based on each variant's frequency on a population level in each of a plurality of distinct populations or super-populations. The VGD model also generates an evolutionary selection component based on each variant's frequency on a species-level using an allele frequency in each of a plurality of reference genomes within a single or across multiple species. Embodiments of the invention generally assume that if a variant is of high enough frequency in a single super-population to not be deleterious in that population, it is highly unlikely for that variant to be truly deleterious in any population. The same is generally assumed across multiple different species, though the correlation between human survival and survival in other species may be weighed based on the evolutionary proximity of the other species to humans.

Failure to find literature-based evidence for disease liability becomes less and less tenable as the frequency of disease variants increases. Embodiments of the invention generally assume that if a variant were both truly damaging and relatively common, the variant would have a sufficiently high clinical visibility (bolstering the heterozygous effect in the population selection component) and there would be a sufficient number of affected individuals to lead to this variant's clinical classification (e.g., in the ClinVar database) (bolstering the clinical classification component). Therefore, variants with a high frequency in a population or ExAC dataset, in the absence of annotation in clinical databases, are very unlikely to actually be damaging, regardless of the other scoring factors.

Among the variants with existing clinical annotation, in an example ClinVar dataset, 6,651 variants were unambiguously labeled as “pathogenic.” These variants were classified as “ClinVar-5.1” and have a VGD score approximately equal to A_j. Because damaging variants are unlikely to exist at high frequencies in the population, the vast majority of ClinVar 5.1 variants will have scores close to 1.0 (see below).

An additional 140 ClinVar-5 variants with replicated homozygosity for the presumed disruptive allele in the ExAC cohort, and/or also assigned to one of the “non-pathogenic” ClinVar (ClinVar-2 or ClinVar-3) categories were labeled as “ClinVar-5.2” or “ambiguously assigned ClinVar 5's” (Table 4).

Though the vast majority of ClinVar-5s do not exhibit replicated homozygosity, the few that do are disproportionately represented in the population. To estimate each set of variants' influence on the population, the number of chromosomes in the ExAC cohort were totaled for all variants in each category. For example, the 6,651 ClinVar-5.1 variants were detected in 31,948 chromosomes, while the 140 ClinVar-5.2 variants were detected in 425,139 chromosomes. Because one individual can have multiple different ClinVar-5 variants, some chromosomes may be counted more than once, and the total number of chromosomes for a set of variants may exceed the actual number of chromosomes in the ExAC data.

An alternative cause of inconsistency between a variant's raw VGD score and frequency may occur when there is a true disease liability in homozygotes, but the allele frequency is lifted for example by a heterozygous advantage. A well-known example of such a variant is the HBB sickle cell variant G6V. Under the assumption that all such instances would have been identified and recorded by now, the corresponding allele database (e.g., OMIM) is expected to include compound heterozygote descriptions for the variant. This incidence of compound heterozygote is accounted for in the VGD score as the heterozygote effect parameter (ch) defining a number of compound heterozygotes or combinations of the variant with other variants that is described in the clinical literature. The more compound heterozygote instances a variant has, the more likely the variant has a disease contributing effect, boosting the final VGD damage score. Of the 17,748 ClinVar variants tested, 2,008 were found to be compound heterozygotes, only 163 of which have more than 2 compound heterozygotes instances. For reference, the well-known CFTR variant F508del was measured to have 19 compound heterozygotes, more than any other variant examined in these tests.

Example Distribution of VGD Scores

Not surprisingly, in one example test, the VGD model identified a majority (e.g., >50%) of the 238,461 ExAC variants examined as not causing disease of gene dysfunction. The median damage score was 0.32 with a large interquartile range (IQR) of (0.03, 0.77) (Table 5). The distribution of the final VGD score may be bimodal, with a large peak e.g., near 0.0 and a more moderate peak near e.g., 1.0; however, there are many variants with scores in between these extremes (see e.g., FIGS. 4A-B). Only 68,406 (28.2%) ExAC variants were observed to have a damage score less than 0.05 and only 29,972 (12.3%) variants to have a VGD greater than 0.95. Fewer than 25% of ExAC variants were ever observed in the homozygous state, and only half have a maximum population frequency greater than 8.66×10-5 (Table 5).

VariBench Categorization Results

VariBench data was also analyzed as an alternate to the ClinVar data described herein. Reference is made to FIG. 12, which shows a violin plot of an example of variant-specific gene dysfunction for each variant stratified by VariBench category according to an embodiment of the invention. “Pathogenic” VariBench variants have generally very high disruption scores, while “benign” variants have generally very low scores. VariBench P.2 variants, which are homozygous in at least two ExAC subjects have generally lower scores than the other pathogenic VariBench variants.

The VGD median score of benign VariBench variants is e.g., 0.00 (IQR: 0.00, 0.35) compared to e.g., 0.98 (IQR: 0.83, 1.00) for pathogenic VariBench variants (Table 5). The 86 “pathogenic” VariBench variants were classified with replicated homozygosity in the ExAC data as “VariBench P.2” variants. These VariBench P.2 variants were assigned a much lower score than their VariBench P.1 counterparts, which lacked replicated homozygosity in the ExAC data (e.g., having a median VGD score of 0.00 and 0.98, respectively) (Table 5).

The pathogenic predictors, PolyPhen-2, VEST, CADD, and PROVEAN, were computed for the VariBench P.2 variants, which produced less deleterious damage scores on average than the VariBench P.1 variants (see e.g., FIG. 13A-13D, Table 5). Unlike the final VGD model, however, these raw pathogenic predictors, PolyPhen-2, VEST, CADD, and PROVEAN, fail to calculate a low probability of damage to the large majority of “benign” VariBench variants, and instead assigned these 752 variants a large spread of damaging scores (see e.g., FIGS. 13A-13D).

Comparison of ClinVar and VariBench Categorizations

Demonstrating their ascertainment biases, the ClinVar and VariBench datasets both have a relatively high median damage score of at least 0.96 for all analyzed variants (Table 5). The ClinVar variants were slightly more common than those in the VariBench database. Both types of variants had a median maximum population frequency of 0.0, but the upper bound of the frequency IQR was 4.05×10⁻⁴for ClinVar and only 8.77×10⁻⁵for VariBench (Table 5). In addition, 25% of all ClinVar variants were homozygous in at least three ExAC individuals (Table 5).

To determine the accuracy of the VGD model compared to clinical models, all ClinVar variants were analyzed using the VGD model to generate a naive VGD score, VGD^−Cl, calculated without any clinical classification (e.g., ClinVar) information. Reference is made to FIGS. 4A and 4B, which show violin plots of an example of VGD scores computed (a) without clinical classification data, VGD^−Cl, and (b) with clinical classification data, VGD, respectively, according to an embodiment of the invention.

Even without their clinical classification data, the VGD model assigned “likely pathogenic” (ClinVar-4) and “pathogenic” (ClinVar-5) variants naive dysfunction scores, VGD^−Cl, significantly greater than their benign counterparts (see e.g., FIG. 4A). Upon adding the clinical classification data, as expected, the variants' final VGD scores were further elevated (e.g., medians of 0.99 for ClinVar-4 and 1.00 for ClinVar-5) and low frequencies and homozygous incidences (e.g., all with medians of 0) (see e.g., FIG. 4B, Table 5). The 6,651 ClinVar-5.1 variants for which there was no conflicting classification (either in ExAC or the literature) have a median final VGD score of 1.0 (see e.g., FIG. 4B).

ClinVar-5.2 variants have significantly lower VGD scores than their ClinVar-5.1 counterparts (e.g., median score of 0.02 compared to 1.0). The VGD model differentiates these two types of variants and scores them accordingly.

Variants categorized as “non-pathogenic” (ClinVar-2) or “likely non-pathogenic” (ClinVar-3) generally have very low VGD scores; for example, both variant types have median VGD scores of for example 0.00 and average values for example less than 0.07 (Table 5, FIG. 4B). Consistent with their benign effect, these variants also have relatively high median maximum population frequencies (0.021 and 0.015, respectively) and numbers of homozygotes in the ExAC dataset (8 and 5, respectively) (Table 5).

Embodiments of the invention may similarly identify both benign and pathogenic variants characterized by the VariBench data source (see e.g., FIG. 12, Table 5). The VGD model classification was therefore shown to concur with clinically validated results, even in a blind test without considering clinical classification data: variants that are known to be non-pathogenic were assigned a relatively low probability of disease contribution according to their VGD scores and variants that are known to be pathogenic were assigned a relatively high probability of disease contribution according to their VGD scores.

VGD Score for Each Mutation Type

To determine the accuracy of the VGD model as compared to mutation type data, all variants were analyzed using the VGD model to generate the naive VGD score, VGD^−M, calculated without any mutation type data (see e.g., FIG. 5A). Reference is made to FIGS. 5A and 5B, which show violin plots of an example of VGD scores computed (a) without mutation type data, VGD^−M, and (b) with mutation type data, VGD, respectively, stratified by mutation type for all variants, according to an embodiment of the invention.

Start-loss, essential splice site, nonsense, and frame-shift variants have the largest impact on the resulting protein, and by design, also have the largest scores in the VGD model (Table 2, FIGS. 5A and 5B). The score distribution of variants of these types is highly skewed toward a maximum value (e.g., 1.0), especially when comparing dysfunction scores without VGD^−Mand with VGD the mutation type contribution (FIGS. 5A and 5B). The median VGD score for each of these variants is at least 0.98 (Table 6). In-frame indels also have relatively high damage scores (e.g., median VGD of 0.82).

Missense variants have a large spread of damage scores, with slightly more deleterious than non-deleterious variants (e.g., median VGD=0.57). Stop-loss variants have an intermediate level of damage (e.g., median VGD=0.36). All other mutation types are associated with lower damage scores (e.g., medians less than 0.02) (Table 6, FIG. 5B).

The VGD model classification was therefore shown to generally concur with the variant's mutation type data, even in a blind test without considering the mutation type data: variants that are associated with non-damaging mutation types e.g. synonymous mutations, non-essential splice sites, etc., were assigned a relatively low probability of disease contribution according to their VGD scores and variants that are associated with damaging mutation types e.g. start-loss, essential splice site, nonsense, and frame-shift variants, were assigned a relatively high probability of disease contribution according to their VGD scores.

Comparison of Results to Other Variant Scoring Techniques

To determine the accuracy of the VGD model as compared to pathogenic predictor metrics, a side-by-side comparison of VGD scores and PolyPhen-2, PROVEAN, CADD, and VEST scores is provided. Reference is made to FIGS. 6A-6D and FIGS. 13A-13D, which show violin plots of an example side-by-side comparison of VGD scores and (A) PolyPhen-2, (B) adjusted CADD, (C) adjusted PROVEAN, and (D) VEST scores, stratified by clinical classification category, according to an embodiment of the invention. FIGS. 6A-6D are stratified by ClinVar category and FIGS. 13A-13D are stratified by VariBench category. FIGS. 6A-6D use the same colors for each ClinVar category as used in FIGS. 4A-4B and FIGS. 13A-13D use the same colors for each VariBench category as used in FIG. 12. Only the variants with available pathogenic predictor metrics are included in these figures.

In general, for many of the analyzed variants, each of the four pathogenic predictor metrics correctly assigns a deleterious score to “pathogenic” and a non-deleterious score to “benign” ClinVar and VariBench variants (see e.g., FIGS. 6A-6D, FIGS. 13A-13D, Table 5). However, these four pathogenic predictor metrics are not sensitive enough to handle variants with conflicting evidence such as our ClinVar-5.2's. All of the four pathogenic predictor metrics inaccurately assign variants of these types relatively high probabilities of damage, whereas the VGD model more accurately assigns variants of these types relatively low probabilities of damage (see e.g., FIGS. 6A-6D, Table 5).

Further limitations of the four pathogenic predictor metrics are shown in FIGS. 14A-14D, which show violin plots of an example comparison of variant-specific gene dysfunction and (A) PolyPhen-2, (B) adjusted CADD, (C) adjusted PROVEAN, and (D) VEST scores, stratified by mutation type for all variants, according to an embodiment of the invention. FIG. 14A shows that PolyPhen-2 is inherently binary (e.g., few of these variants have intermediate PolyPhen-2 values), and assigns very high or low damage scores to all missense variants. FIG. 14B shows that CADD also assigns a lower damage score to stop-gain and stop-loss variants and is inherently binary for all missense variants (see e.g., FIG. 14B, Table 6).

VGD Outperforms Existing Variant Classification Databases

Embodiments of the invention may assess the disease-risk of novel or de novo mutations or variants that have never before been validated or studied in diseased patients. The VGD model identified 29,317 novel variants as pathogenic (e.g., with VGD scores of at least 0.95) that were not yet clinically classified (e.g., by ClinVar or VariBench) as well as 68,594 novel variants as benign (e.g., with VGD scores less than or equal to 0.05) (Table 1). Accordingly, embodiments of the invention may be used to discover disease-correlated variants before the variants are clinically validated, thereby predicting disease risk in patients that would have otherwise been ignored. The majority of these novel variants have relatively low maximum population frequencies (e.g., median of 6.06×10⁻⁵for the likely damaging variants and 9.63×10⁻⁵for the likely benign variants), which may explain why they have yet to be observed in enough diseased individuals to be included in clinical (e.g., ClinVar and VariBench) curations.

For the 192 genes with at least ten ClinVar 5 and 5.1 variants, the 99% and 99.9% one-sided bootstrap confidence intervals were calculated for the average VGD_j^−Cl(VGD_jwithout including the ClinVar information) among all ClinVar 5 and just ClinVar 5.1 variants (see e.g., Table 7, FIG. 7A). FIG. 7B shows a distribution of the density of confidence interval thresholds versus variant-specific gene dysfunction (VGD^−Cl) among all ClinVar 5 variants and ClinVar 5.1 variants for each gene at two confidence levels (99% and 99.9%). In FIG. 7B, the distribution of the lower confidence interval bounds is highly skewed, with most genes having values above 0.80. There is a 99% (or 99.9%) confidence that the average per-gene naive VGD score for ClinVar 5.1 (and 5) variants is above the corresponding threshold. These intervals may thus be utilized as per-gene lower bound proxies for “likely pathogenic” naive VGD scores for variants not covered by ClinVar.

Several genes were found that have a relatively large difference between their interval bounds for their ClinVar 5 and 5.1 variants. Such relatively large intervals are shown, for example, in FIG. 11A, which shows all variants in Pore Forming Protein 1 (PRF1), and in FIG. 11B, which shows all variants in Phosphorylase, Glycogen, Muscle (PYGM), according to some embodiments of the invention. The median maximum population frequency of the 20,055 non-ClinVar “likely pathogenic” variants (with naive VGD scores greater than their per-gene thresholds) is 6.06×10⁻⁵(IQR: 6.06×10⁻⁵to 0.00012), which is higher than their ClinVar 5.1 counterparts (0; IQR: 0 to 3.02×10⁻⁵) (Table 5).

Living Organisms

An organism that is genetically screened according to embodiments of the invention may be a living organism whose DNA is obtained from the organism's biological sample and sequenced to identify a genetic mutation and assess one or more likelihoods that the genetic mutation causes gene-dysfunction in organisms. The living organism may be one or more of a pre-natal organism, a fetus, a newborn, a post-natal organism, a child, an adult, blood, tissue, saliva, a stem-cell, and a tumor, to perform genotype screening, such as, pre-natal genotype screening, post-natal genotype screening, newborn genotype screening, stem-cell screening, and tumor screening. Other types of living organisms and genotype screenings may be used.

Virtual Organisms

An organism that is genetically screened according to embodiments of the invention may be a virtual or simulated organism. Although the organism is virtual, the organism's genetic information is real, for example, derived by combining at least a portion of real genetic information sequenced from DNA obtained from biological samples of two living potential parents. Accordingly, a virtual organism's genetic information represents transformed, permuted, or intertwined biological DNA samples of two living human organisms.

Reference is made to FIG. 17, which schematically illustrates an example of simulating a hypothetical mating of two (i.e. a first and a second) potential parents for generating a virtual progeny according to an embodiment of the invention.

For each of the two potential parents, a processor (e.g., sequence analyzer processor 112 of FIG. 1) may receive a potential parent's diploid genetic sequence 402, 404. A “diploid” genetic sequence includes two alleles from the two sets of chromosomes respectively labeled “A” and “B” at each genetic locus of a diploid cell of the potential parent, whereas a “haploid” genetic sequence includes one allele from one chromosome at each genetic locus of a haploid cell of the potential parent. For each of the two potential parents' diploid genetic sequences 402 and 404, the processor may simulate genetic recombination of the two sets of chromosomes A and B from the parent's diploid genetic sequence 402, 404 (having two alleles at each genetic locus) to generate a virtual gamete haploid genetic sequence 406, 408 (having one allele per genetic locus). The processor may simulate recombination by progressing locus-by-locus along a “haplopath” through each parent's diploid genetic sequence 402, 404 and selecting one of the two alleles at each genetic locus (either the allele in chromosome A or the allele in chromosome B). The selection of alleles may be at least partially random and/or at least partially non-random, for example, based on defined correlations between alleles at different loci referred to as “linkage disequilibrium”. The haploid genetic sequence may mimic or simulate recombination of the genetic material in the two chromosomes A and B to form a discrete haploid genetic sequence of a virtual gamete 406, 408, e.g., a virtual sperm or virtual egg.

The two virtual gamete haploid genetic sequences 406 and 408 for the two respective potential parents may be combined to simulate a mating between the first and second potential parents resulting in a virtual progeny diploid genetic sequence 410 (a discrete genome of a child potentially to be conceived).

Since the selection of alleles is at least partially random, this mating is just one of the many possible genetic combinations for the first and second potential parents. This process may be repeated multiple times (e.g., hundreds or thousands of times), each time following a different recombination path (e.g., a different sequence of alleles selected) for one or both of the potential parents, to generate multiple genetic permutations that are possible for mating the first and second potential parents. The virtual progeny diploid genetic sequence 410 may include a single (e.g., most probable) genetic sequence or a probability distribution of multiple possible sequences, for example, to indicate, for many possible matings, the overall likelihood of each of multiple alleles at each of one or more loci in a virtual or hypothetical progeny.

Embodiments of the invention may use methods for simulating a mating between two potential parents and generating a virtual progeny genetic sequence described in U.S. Pat. No. 8,805,620, which is incorporated herein by reference in its entirety. Other methods may also be used.

Once the virtual progeny genetic sequence 410 is generated, it may be assigned one or more of the likelihoods that one or more alleles or mutations in the virtual genetic sequence 410 would be deleterious, for example, if replicated in a real living progeny.

Workflows

Operations described herein may be executed by one or more one or more processor(s) (e.g., controller(s) or processor(s) 108, 110, and/or 112 of FIG. 1), data or data structures described herein may be stored in one or more memor(ies) (e.g., memory unit(s) 114, 116, and/or 118 of FIG. 1), and any visualizations, or data may be displayed on one or more display(s) (e.g., output display 120 or any use display).

Reference is made to FIG. 19A, which is a flowchart of a method for assessing risk of variant-specific gene dysfunction according to an embodiment of the invention.

In operation 1900, a memory may store and a processor may access a neural network including multiple nodes respectively associated with multiple different gene-dysfunction metrics and multiple different confidence weights. The neural network may combine, aggregate or compose the multiple gene-dysfunction metrics according to the respective associated confidence weights to generate one or more likelihoods that a genetic mutation causes gene-dysfunction in organisms.

In operation 1902, a processor may execute a training-phase. In the training-phase, the processor may train the neural network using an input data set including one or more genetic mutations to generate new gene-dysfunction metrics and new associated confidence weights that optimize the neural network based on a cost factor. In some embodiments, the processor may optimize the neural network in the training-phase by shifting a center of the one or more likelihoods of known pathogenic mutations toward one or more maximal likelihoods, shifting a center of the one or more likelihoods of known benign mutations toward one or more minimal likelihoods, and/or shifting a center of the one or more likelihoods of uncharacterized mutations away from the one or more maximal or minimal likelihoods. In some embodiments, the processor may optimize the neural network in the training-phase by reducing the cost factor associated with the known pathogenic mutations by generating one or more pathogenic thresholds providing a lower bound for the one or more likelihoods of a plurality of the known pathogenic mutations and minimizing the difference between the one or more pathogenic thresholds and respective one or more maximal likelihoods. In some embodiments, the processor may optimize the neural network in the training-phase by reducing the cost factor associated with the uncharacterized mutations by generating one or more pathogenic thresholds providing a lower bound for the one or more likelihoods of a plurality of the known pathogenic mutations and minimizing the number of the uncharacterized mutations having one or more likelihoods above the one or more pathogenic thresholds. In some embodiments, the processor may optimize the neural network in the training-phase by reducing the cost factor associated with the known benign mutations by minimizing mean distribution values of the one or more likelihoods of the known benign mutations. In some embodiments, the processor may optimize the neural network in the training-phase on a gene-by-gene basis and may aggregate gene-specific optimization results, for example, across a genome to obtain a combined genome-wide cost factor. Training phase operation 1902 may be repeated, for example, each time a new input data set is received, new nodes are added to the neural network, optimization parameters are changed, or periodically.

In operation 1904, a processor may execute a run-time phase. In the run-time phase, the processor may identify a genetic mutation and compute the multiple gene-dysfunction metrics for the identified genetic mutation based on the new gene-dysfunction metrics and the associated new confidence weights of the neural network. The run-time phase may include one or more of operations 1904-1918.

In operation 1906, a processor may compute one or more population selection nodes in the neural network associated with multiple population-specific measures of homozygosity for each of multiple populations, which is further described in reference to FIG. 19B. The processor may generate each of the multiple population-specific measures of homozygosity by comparing the count of observed homozygotes of the identified genetic mutation measured on both chromosomes at a genetic locus in a population-specific set of genetic sequences and an expected homozygote count based on a total observed count of the identified genetic mutation measured on either chromosome at the genetic locus in the population-specific set. The processor may weigh the measures of homozygosity with the one or more population-specific confidence weights defined based on a magnitude of the observed or expected homozygote counts in each population-specific set. The processor may compute one or more population selection nodes in the neural network associated with multiple population-specific measures of heterozygosity for each of multiple populations. The processor may generate each of the multiple population-specific measures of heterozygosity based on the total observed count of the identified genetic mutation and the clinical visibility of the identified genetic mutation. The processor may weigh the measures of heterozygosity with one or more population-specific confidence weights defined based on a total count of the identified genetic mutation in the corresponding population. The processor may compute one or more population selection nodes in the neural network associated with multiple population-specific measures of a dominant effect based on an allele count of the identified genetic mutation compared to a distribution of allele counts across a plurality of pathogenic mutations in an identified gene. The processor may weigh the measures of the dominant effect with one or more confidence weights defined based on a number of the plurality of pathogenic mutations in an identified gene. The population selection node is described in further detail in reference to FIG. 19B.

In operation 1908, a processor may compute one or more evolutionary constraint or selection nodes in the neural network associated with a measure of evolutionary variation of alleles at each of one or more common ancestral genetic loci in multiple organisms corresponding to one or more loci of the identified genetic mutation. In one embodiment, the one or more likelihoods that the identified genetic mutation causes gene-dysfunction in organisms may be indirectly proportional to the measure of evolutionary variation in alleles. In one embodiment, the processor may compute the measure of evolutionary variation corresponding to the identified genetic mutation based on a frequency with which the genetic mutation has occurred and persisted in the multiple organisms over evolutionary history. In another embodiment, the processor may compute the measure of evolutionary variation corresponding to the identified genetic mutation based on a proximity in a phylogenetic tree representing an evolutionary timescale between a reference genetic sequence of the same species as the organism and one or more other species in which the genetic mutation has occurred. In another embodiment, the processor may compute the measure of evolutionary variation corresponding to the identified genetic mutation based on an average pairwise difference between different alleles at the corresponding one or more loci of the identified genetic mutation. In another embodiment, the processor may compute the measure of evolutionary variation corresponding to the identified genetic mutation based on a distance metric between a reference genetic sequence and a genetic sequence of the organism. In another embodiment, the processor may compute the measure of evolutionary variation corresponding to the identified genetic mutation based on a probability of transitioning from a reference allele to the genetic mutation. In another embodiment, the processor may compute the measure of evolutionary variation corresponding to the identified genetic mutation based on ratio w of a non-synonymous substitution rate to a synonymous substitution rate, wherein a non-synonymous substitution is an allele substitution in a codon that does not change an amino acid encoded by the codon and a synonymous substitution is an allele substitution in the codon that does change the amino acid. The processor may weigh the evolutionary constraint nodes with one or more evolutionary constraint confidence weights based on a distribution of mutation rates at different genetic loci in the multiple aligned genetic sequences. In some embodiments, the multiple organisms may be from multiple different species, whereas in other embodiments, the multiple organisms are from a single species. The evolutionary selection node is described in further detail in reference to FIG. 19C.

In operation 1910, a processor may compute one or more mutation class nodes in the neural network that measure a mutation type metric associated with a mutation type of the identified genetic mutation. In various embodiments, the mutation type of the identified genetic mutation may include one or more of: start-loss, stop-loss, stop-gain, stop-retained, frame-shift indel, in-frame indel, essential splice site, splice region, microsatellite, synonymous, missense, intron, and untranslated region. In one embodiment, the processor may compute the mutation class metric for the identified genetic mutation of a microsatellite mutation type inversely proportionally to a length of a repeating microsatellite sequence in the identified genetic mutation. The processor may weigh the microsatellite mutation class metric with a microsatellite mutation class weight that is directly proportional to the length of the repeating microsatellite sequence in the identified genetic mutation. In one embodiment, the processor may compute the mutation class metric for the identified genetic mutation of an in-frame indel mutation type directly proportionally to a number of codons inserted or deleted by the identified genetic mutation. The processor may weigh the in-frame indel mutation class metric with an in-frame indel mutation class weight that is directly proportional to a number of codons inserted or deleted by the identified genetic mutation.

In operation 1912, a processor may compute one or more pathogenic predictor nodes in the neural network that measure one or more pathogenic predictor metrics predicting a degree of pathology of the identified genetic mutation. In one embodiment, the processor may compute the one or more pathogenic predictor metrics by transforming a PROVEAN value of the identified genetic mutation to a linear scale and an associated confidence weight proportional to the PROVEAN value. In one embodiment, the processor may compute the one or more pathogenic predictor metrics by transforming a CADD value of the identified genetic mutation from a Phred scale to a linear scale and an associated confidence weight proportional to the CADD value. In one embodiment, the processor may compute the one or more pathogenic predictor metrics by transforming two PolyPhen-2 values of the identified genetic mutation, HumDiv and HumVar, into an average value and an associated confidence weight that is inversely proportional to the difference between the two PolyPhen-2 values. In one embodiment, the processor may compute the one or more pathogenic predictor metrics from a VEST value of the identified genetic mutation.

In operation 1914, a processor may compute one or more clinical classification nodes in the neural network that measure one or more clinical classification metrics defining available clinical classification data for the identified genetic mutation. In one embodiment, the processor may compute the clinical classification metrics to be substantially maximal when the identified genetic mutation is clinically classified as pathogenic and substantially minimal when the identified genetic mutation is clinically classified as benign. In one embodiment, the processor may compute a clinical classification confidence weight associated with the clinical classification metrics to be substantially maximal when the clinical classification is uncontested and relatively lower than maximal when the clinical classification is contested.

In operation 1916, a processor may combine or aggregate the values or outputs form the multiple gene-dysfunction metrics to compute one or more likelihoods that the identified genetic mutation causes gene-dysfunction in the organism. In one embodiment, in the run-time phase, a processor may compare the one or more likelihoods to one or more pathogenic threshold ranges to predict if the genetic mutation will cause gene-dysfunction in the organism.

In operation 1918, a display may display results, input data, or intermediate data from operations 1900-1916 or data represented as FIGS. 2, 4-18. In some embodiments, the display may display a visualization of the genetic mutation predicted to cause gene-dysfunction in an image or sequence of the organism's DNA together with the one or more likelihoods that the genetic mutation causes gene-dysfunction.

Reference is made to FIG. 19B, which is a flowchart of a method of predicting gene-dysfunction associated with a genetic mutation in an organism based on population-specific selection factors or nodes according to an embodiment of the invention.

In operation 1920, a processor may receive multiple population-specific sets of genetic sequences each including multiple genetic sequences obtained from genetic samples of organisms from a different respective one of multiple populations.

In operation 1922, a processor may generate each of multiple population-specific measures of homozygosity of the genetic mutation for each of the respective multiple populations. The measures of homozygosity in each population may be computed by comparing the count of observed homozygotes of the genetic mutation measured on both chromosomes at a genetic locus in the population-specific set and an expected homozygote count based on a total observed count of the genetic mutation measured on either chromosome at the genetic locus in the population-specific set.

In operation 1924, a processor may generate each of multiple population-specific measures of heterozygosity associated with the genetic mutation for each of the respective multiple populations based on the total observed count of the genetic mutation and the clinical visibility of the genetic mutation. In one embodiment, each of the multiple population-specific measures of heterozygosity is directly proportional to the clinical visibility of the genetic mutation and indirectly proportional to the total count of the genetic mutation in the corresponding population. In one embodiment, each of the multiple population-specific measures of heterozygosity increases when the clinical visibility of the genetic mutation is relatively greater than the total count of the genetic mutation in the corresponding population. In one embodiment, the processor may weigh the multiple population-specific measures of heterozygosity based on the total frequency of the genetic mutation in the corresponding population. In one embodiment, the processor may compute the clinical visibility of the genetic mutation based on one or more measures of: a frequency with which the genetic mutation is cited in clinical studies or literature, a number of published articles referencing the genetic mutation, a number of compound heterozygotes of the genetic mutation with other variants described in clinical studies or literature, and a number of search results or an order or ranking in a search result for the genetic mutation.

In operation 1926, a processor may generate one or more measures of a dominant effect associated with the genetic mutation based on an allele count of the identified genetic mutation compared to a distribution of allele counts across a plurality of pathogenic mutations in an identified gene. In one embodiment, the processor may weigh the measures of the dominant effect with one or more confidence weights defined based on the allele count, a number of the pathogenic mutations in the identified gene, and/or a distribution of allele counts across the plurality of pathogenic mutations in the identified gene.

In operation 1928, a processor may compute one or more likelihoods that the genetic mutation causes gene-dysfunction in the organism based on one or more of the multiple population-specific measures of homozygosity. In one embodiment, the processor may compute the one or more likelihoods that the genetic mutation causes gene-dysfunction to be greater when the observed homozygote count is less than the expected homozygote count and to be smaller when the observed homozygote count is greater than the expected homozygote count. In one embodiment, the processor may compute the one or more likelihoods that the genetic mutation causes gene-dysfunction to increase when the observed homozygote count is less than the expected homozygote count. In one embodiment, the processor may compare the one or more likelihoods to one or more pathogenic threshold ranges to predict if the genetic mutation will cause gene-dysfunction in the organism. In one embodiment, the processor may compare the one or more likelihoods to one or more benign threshold ranges to predict if the genetic mutation is benign in the organism. In some embodiments, the processor may compute the one or more likelihoods based on a combination of the multiple population-specific measures of homozygosity corresponding to the multiple populations, each weighted according to an independent population-specific weight. In one embodiment, each of the population-specific weights is defined based on a magnitude of the observed or expected homozygote count in each population-specific set. In one embodiment, the population-specific weights are defined based on degrees from which the organism descended from each population. In some embodiments, the processor may compute the one or more likelihoods based on a single population-specific measure of homozygosity corresponding to a single primary population from which the organism descended. In some embodiments, the processor may compute the one or more likelihoods and weights by training a model to discriminate between genetic mutations known to cause pathology and genetic mutations known to be benign. The one or more likelihoods may be computed on a continuous scale.

In operation 1930, a display may display results, input data, or intermediate data from operations 1920-1928 or data represented as FIGS. 2, 4-18. In some embodiments, the display may display a visualization of the genetic mutation predicted to cause gene-dysfunction in an image or sequence of the organism's DNA together with the one or more likelihoods that the genetic mutation causes gene-dysfunction.

Reference is made to FIG. 19C, which is a flowchart of a method of predicting gene-dysfunction associated with a genetic mutation in an organism based on evolutionary selection factors or nodes according to an embodiment of the invention.

In operation 1932, a processor may receive multiple aligned reference genetic sequences of multiple extant organisms representative of one or more species or populations (e.g., as shown in FIG. 15). The reference genetic sequences may be sequenced by a genetic sequencer (e.g., sequencer 102 of FIG. 1) or pre-stored and retrieved from a memory or database (e.g., memory unit(s) 114, 116, and/or 118 of FIG. 1) and may be aligned by a sequence aligner (e.g., sequence aligner 104 of FIG. 1) or pre-aligned in the memory or database.

In operation 1934, the processor may build or obtain a model representing measures of evolutionary variation of alleles or nucleotides at one or more aligned genetic loci between the multiple organisms. The model may be a single-species model (e.g., the multiple organisms are from the same single species) or a multi-species model (e.g., the multiple organisms are from different multiple species). The model may include, for example, a phylogenetic tree (e.g., as shown in FIG. 16), or another data structure.

In operation 1936, the processor may receive a genetic sequence of an organism to be genetically screened.

In operation 1938, the processor may use the model of operation 1934 defining the evolutionary past variation among multiple extant organisms of different populations or species to predict or interpolate a likelihood or probability of evolutionary health of the organism in operation 1936. The processor may determine the differences between the organism's genetic sequence and one or more aligned reference genetic sequences and may assign each allele (or only different or mutated alleles) a measure of evolutionary variation that is a function of variations in alleles at corresponding aligned genetic loci in the multiple aligned genetic sequences (e.g., loci derived from one or more common ancestral genetic loci in the multiple organisms). The processor may compute one or more likelihoods that an allele mutation at each of the one or more genetic loci in the organism is deleterious based on the measure of evolutionary variation of alleles at the corresponding aligned genetic loci for the multiple organisms. The likelihoods may include one or more likelihoods or likelihood distributions for one or more alleles, one or more allele mutations, one or more genes, one or more codons, one or more genetic loci or loci segments, for one or more living organisms or virtual progeny of two potential parents (e.g., generated by repeatedly simulating a mating using different virtual gamete(s) in each iteration) and/or for one or more pairs of potential parents (e.g., generated by repeatedly simulating a mating step, in each iteration using the genetic information obtained in operation 1936. The one or more likelihoods may be compared to one or more thresholds or other statistical models to predict if (or a likelihood or degree in which) an allele mutation is deleterious in the organism. For example, mutations at genetic loci with relatively constant or fixed alleles and relatively lower measures of evolutionary variation may be associated with relatively higher likelihoods of deleterious traits, whereas mutations at genetic loci with relatively volatile or changing alleles and relatively higher measures of evolutionary variation may be associated with relatively lower likelihoods of resulting in deleterious traits.

In operation 1940, an output device (e.g., output device 320 of FIG. 3) may output or display, e.g., to a user, the one or more likelihoods or likelihood distributions that the organism would have deleterious traits, or other results, input data, or intermediate data from operations 1932-1938. For example, the output device may output one or more likelihoods that an allele mutation at each of the one or more genetic loci in the simulated virtual progeny will be deleterious based on the measure of evolutionary variation of alleles at the corresponding aligned genetic loci for the multiple organisms.

The organism screened for gene-dysfunction in FIGS. 17A-17C may be a living organism or a virtual organism. When the organism is a living organism, a processor may sequence the organism's DNA obtained from a biological sample to identify the genetic mutation.

When the organism is a virtual organism, a processor may generate the virtual progeny by combining at least a portion of genetic information representing DNA obtained from biological samples of two living potential parents. In one embodiment, the processor may simulate a mating between the two potential parents by combining their genetic sequences to generate one or more virtual progeny genetic sequences (e.g., sequence 410 of FIG. 17). The processor may generate a virtual gamete (haploid genetic sequence) for each potential parent by at least partially randomly selecting one of two allele copies in the parent's two chromosomes (diploid genetic sequence) to simulate recombination at each of a sequence of genetic loci. A virtual gamete for each of the two potential parents (e.g., one virtual sperm and one virtual egg) may be combined to generate the genetic sequence of the virtual progeny. Multiple virtual gametes may be generated for each potential parent by repeating the recombination process each time selecting a different at least partially random sequence of alleles. Multiple virtual progeny genetic sequences may be generated for multiple pairs of potential parents by repeating the step of combining two virtual gametes for each of a plurality of different combinations of two virtual gametes. In one embodiment, the independent carrier status of an individual may be determined by simulating a mating combining the individual's genetic sequence information with that of a sample, averaged, or reference genetic sequence of the same species.

Other or different operations or orders of operations may be used and operations may be repeated, e.g., until the likelihoods or optimization cost factor converge or asymptotically approach a statistically stable result.

SUMMARY

Current literature and variant classification databases distribute recessive-disease associated variants into binary “pathogenic” and “benign” groupings. These generalizations are based on outdated and limited genetic practices.

There is a need in the art to incorporate multiple complementary resources to generate a more accurate computational score. With the advent of extensive disease databases, such as, the ExAC database, and with additional exome sequencing data of healthy and diseased individuals, the VGD model provides a comprehensive score that incorporates in silico models and actual observed variant frequencies. The VGD model is flexible and dynamic, for example, using a neural network to incorporate a growing combination of model components or nodes, and to re-train the model based on growing clinical resources and genetic datasets. The VGD model may be used to assess gene-dysfunction for any variant and dynamic enough to handle unique variants. The VGD model combined with the reduction in cost of next-generation sequencing provides the capability to efficiently and accurately estimate disease contribution on a continuous scale for any variant.

There is an imperative need for a disease classification system that moves beyond the binary “pathogenic” and “benign” categorizations for recessive-disease associated variants. The VGD model hereby provides estimates of the functional impact that variants have on the expression of single locus recessive diseases on a continuous scale. The VGD model further combines several metrics into one or more likelihoods, increasing the accuracy and sensitivity for assessing the probability that a variant is disease contributing, and identifying new variants that were previously ignored as well as mistaken variants that were previously erroneously classified as pathogenic due to more rudimentary methods, exposing their limitations and restrictions. The VGD model can be used to quantify the level of risk associated with any variant and represents a more accurate metric for describing potential disease contributing variants.

DEFINITIONS

As used herein, a “chromosome” may refer to a molecule of DNA with a sequence of basepairs that corresponds closely to a defined chromosome reference sequence of the organism in question.

As used herein, a “gene” may refer to a DNA sequence in a chromosome that codes for a product (either RNA or its translation product, a polypeptide) or otherwise plays a role in the expression of said product. A gene contains a DNA sequence with biological function. The biological function may be contained within the structure of the RNA product or a coding region for a polypeptide. The coding region includes a plurality of coding segments (“exons”) and intervening non-coding sequences (“introns”) between individual coding segments and non-coding regions preceding and following the first and last coding regions respectively.

As used herein, a “locus” may refer to any segment of DNA sequence defined by chromosomal coordinates in a reference genome known to the art, irrespective of biological function. A DNA locus can contain multiple genes or no genes; it can be a single base pair or millions of base pairs.

As used herein, a “polymorphic locus” may refer to a genomic locus at which two or more alleles have been identified.

As used herein, an “allele” may refer to one of two or more existing genetic variants of a specific polymorphic genomic locus.

As used herein a “variant” or “genetic mutation” may be any one or more bases, nucleotides, or alleles, which may or may not differ compared to reference, common or expected bases, nucleotides, or alleles, for example, of one or more reference genetic sequences.

As used herein, “genotype” may refer to the diploid combination of alleles at a given genetic locus, or set of related loci, in a given cell or organism. A homozygote includes two copies of the same allele and a heterozygote includes two distinct alleles. In the simplest case of a locus with two alleles “A” and “a”, three genotypes can be formed: A/A, A/a, and a/a, of which A/A and a/a are homozygotes and A/a are heterozygotes.

As used herein, “genotyping” may refer to any experimental, computational, or observational protocol for distinguishing an individual's genotype at one or more well-defined loci.

As used herein, a “haplotype” may refer to a unique set of alleles at separate loci that are normally grouped closely together on the same DNA molecule, and are observed to be inherited as a group. A haplotype can be defined by a set of specific alleles at each defined polymorphic locus within a haploblock.

As used herein, a “haploblock” may refer to a genomic region that maintains genetic integrity over multiple generations and is recognized by linkage disequilibrium within a population. Haploblocks are defined empirically for a given population of individuals.

As used herein, “linkage disequilibrium” may refer to the non-random association of alleles at two or more loci within a particular population. Linkage disequilibrium is measured as a departure from the null hypothesis of linkage equilibrium, where each allele at one locus associates randomly with each allele at a second locus in a population of individual genomes.

As used herein, a “genome” may refer to the total genetic information carried by an individual organism or cell, represented by the complete DNA sequences of its chromosomes.

As used herein, a “genome profile” may refer to a representative subset of the total information contained within a genome. A genome profile contains genotypes at a particular set of polymorphic loci.

As used herein, a “personal genome profile”, abbreviated PGP, may refer to the genome profile of a particular individual person.

As used herein, a genetic “trait” may refer to a distinguishing attribute of an individual, whose expression is fully or partially influenced by an individual's genetic constitution.

As used herein, a “phenotype” may refer to a class of alternative traits which may be discrete or continuous.

As used herein, “haploid cell” may refer to a cell with a haploid number (n) of chromosomes.

As used herein, “gametes”, may refer to specialized haploid cells (e.g., spermatozoa and oocytes) produced through the process of meiosis and involved in sexual reproduction.

As used herein, “gametotype” may refer to single copies with one allele of each of one or more loci in the haploid genome of a single gamete.

As used herein, an “autosome” may refer to any chromosome exclusive of the X and Y sex chromosomes.

As used herein, “diploid cell” may have a homologous pair of each of its autosomal chromosomes, and has two copies (2n) of each autosomal genetic locus.

As used herein, a “haplopath” may refer to a haploid path laid out along a defined region of a diploid genome by a single iteration of a Monte Carlo simulation or a single chain generated through a Markov process. A haplopath can be formed by starting at one end of a personal chromosome or genome and walking from locus to locus, choosing a single allele at each step based on available linkage disequilibrium information, inter-locus allele association coefficients, and formal rules of genetics that describe the natural process of gamete production in a sexually reproducing organism. A “haplopath” is generated through the application of formal rules of genetics that describe the reduction of the diploid genome into haploid genomes through the natural process of meiosis.

As used herein, a “Virtual Gamete” may refer to a single haplopath that extends across an entire genome.

As used herein, a “Virtual Progeny” or “Virtual Progeny genome sampling” may refer to the discrete genetic product of two Virtual Gametes. Virtual Progeny may be generated as disclosed in U.S. Pat. No. 8,805,620, incorporated herein by reference in its entirety.

As used herein, a “Virtual Progeny genome” may refer to a collection of discrete Virtual Progeny genome samplings, each generated by combining two uniquely-derived random Virtual Gametes. In some instances, a Virtual Progeny genome is represented as a probability mass function over a sample space of all discrete genome states. In some instances, a Virtual Progeny is an informed simulation of a child or children that might result as a consequence of sexual reproduction between two individuals.

As used herein, a “Virtual Progeny phenome” may refer to a multi-dimensional likelihood function representing the likelihood and/or likely degree of expression of a set of one or more traits from a complete Virtual Progeny genome. In some instances, a Virtual Progeny phenome is represented as a probability mass function over a sample space of discrete or continuous phenotypic states. In some instances, a Virtual Progeny phenome is an informed simulation of a child or children that might result as a consequence of sexual reproduction between two individuals.

As used herein, “potential parent” may refer to an individual who genetic information is combined with another's genetic information to simulate a mating before a child is conceived. The mating may be simulated for two potential parents both interested in their combined genetic code, or a single individual iterating the mating over a plurality of candidate donors, for example, to select an optimal donor from a sperm or egg donor bank. A “partner” may refer to a marriage partner, sexual or reproductive partner, domestic partner, opposite-sex partner, and same-sex partner.

As used herein, a “living” organism may refer to a real, extant, surviving, currently living, or previously living (now deceased), organism. A “virtual” organism may refer to a never or non-living organism, for example, simulated by computer models, where all its genetic information is derived by combining data representing real biological DNA obtained from living organisms such as two potential parents by simulated a mating or conception process.

As used herein, a “DNA image” may refer to a magnified picture or image or real biological DNA, or may refer to a simplified schematic representation thereof such as a nucleotide or DNA sequence. In one example, a DNA image may be zoomed out view of a DNA sequence.

As used herein, “reference genetic sequence” may refer to a genetic sequence used to generate an evolutionary model, such as, a phylogenetic tree. Reference genetic sequences may include standardized genetic sequences from organisms representative of one or more evolutionarily extant (currently or previously living) populations or species, such as those released by genome consortia (e.g., human reference genome, such as, Genome Reference Consortium Human Build 37 (GRCh37) provided by the Genome Reference Consortium). Reference genetic sequences may additionally or alternatively include non-standardized sequences of organisms, such as, any member of a population or species. A single-species model may be generated using reference genetic sequences from multiple organisms of the same single species, e.g., 1,000 chimpanzee or humans. A multi-species model may be generated using reference genetic sequences from multiple organisms of multiple different species, e.g., one model using 1,000 humans, 10 chimpanzee and one gorilla, or another model using a single different organism from each different species as shown in FIG. 15. Reference genetic sequences may be used to analyze the evolution of successful (positive) or neutral (non-deleterious) allele mutations or variations across one or more extant species. An evolutionary model may predict likelihoods that allele mutations or variations would be deleterious based on their frequency or rarity of occurrence across the multiple reference genetic sequences. For example, allele mutations or variations that are relatively more rare across the reference genetic sequences may be considered negatively selected for evolutionarily (e.g. associated with a deleterious trait for which an organism cannot or has a relatively lower likelihood of surviving or reproducing), while allele mutations or variations that are relatively more common across the reference genetic sequences may be considered positively or neutrally selected for evolutionarily (e.g. not associated with a deleterious trait, but traits for which an organism has a neutral or improved likelihood of surviving or reproducing).

As used herein, “potential parent genetic sequences” may refer to genetic sequences of real (currently or previously living) potential parents, for example, from which genetic information is combined to simulate a virtual mating generating one or more virtual children or progeny, to predict before they conceive a child, a likelihood that such a child would have a deleterious trait. The potential parent genetic sequences may be obtained from genetic samples of two potential parents seeking to mate, or from a first potential parent seeking a genetic donor and a second potential parent from a pool of candidate donors.

As used herein, “virtual progeny genetic sequences” may refer to genetic sequences of simulated (never living) virtual progeny generated by simulating a mating or combining genetic information from two potential parent genetic sequences. Each virtual progeny genetic sequence may be a prediction or simulation of one possible genetic sequence of a child of the two potential parents, before that child is conceived. To achieve more robust results, the simulated mating may be repeated to generate multiple virtual progeny genetic sequences for each pair of potential parents. The virtual progeny genetic sequences may be compared to the reference genetic sequences, for example, to identify evolutionarily rare, and therefore, likely deleterious traits.

In some embodiments, genetic information may be used interchangeably for potential parent genetic sequences and reference genetic sequences. In one example, genetic information from a potential parent or donor may be used instead of, or in combination with reference consortium genetic sequences, to generate an evolutionary model or phylogenetic tree. In another example, reference consortium genetic sequences may be used instead of, or in combination with potential parent or donor genetic sequences, to simulate matings or predict likelihoods of deleterious traits in offspring.

As used herein, a “genetic sequence” may include genetic information representing one or more bases, nucleotides or alleles (sequences of nucleotides defining different forms of a gene) for any number of sequential or non-sequential genetic loci. For example, a “genetic sequence” may refer to allele information at a single genetic locus, or multiple genetic loci, such as, one or more gene segments or an entire genome. A genetic sequence is a data structure representing genetic information at one or more loci of a real or virtual genome. Genetic sequence data structures may include, for example, one or more vectors, scalar values, functions, sequences, sets, matrices, tables, lists, arrays, and/or other data structures, representing one or more bases, nucleotides, genes, alleles, codons or other generic material. The data structures representing each single chromosome sequence may be one dimensional (e.g., representing a single base or allele per locus) or multi-dimensional (e.g., representing multiple or all bases A, T, C, G or alleles at each locus and a probability associated with the likelihood of each existing in a potential progeny). The same (or different) data structures may be used for real and virtual genome sequences, though real genome sequences generally represent real genetic material (e.g., DNA extracted from a currently or previously existing genetic sample), while virtual genome sequences are generated by combining at least a portion of genetic sequences from biological DNA samples of two living potential parents.

As used herein, “a˜b” may represent a proportional “˜” relationship between a and b.

As used herein, “a≈b” may represent an approximate equivalence “≈” between a and b, for example, within 10% of either value.

CONCLUSION

In the foregoing description, various aspects of the present invention are described. For purposes of explanation, specific configurations and details are set forth in order to provide a thorough understanding of the present invention. However, it will also be apparent to persons of ordinary skill in the art that the present invention may be practiced without the specific details presented herein. Furthermore, well known features may be omitted or simplified in order not to obscure the present invention.

Unless specifically stated otherwise, it is appreciated that throughout the specification discussions utilizing terms such as “processing,” “computing,” “calculating,” “determining,” or the like, refer to the action and/or processes of a computer or computing system, or similar electronic computing device, that manipulates and/or transforms data represented as physical, such as electronic, quantities within the computing system's registers and/or memories into other data similarly represented as physical quantities within the computing system's memories, registers or other such information storage, transmission or display devices.

Embodiments of the invention may include an article such as a computer or processor readable non-transitory storage medium, such as for example a memory, a disk drive, or a USB flash memory device (e.g., memory unit(s) 114, 116, and/or 118 of FIG. 1) encoding, including or storing instructions, e.g., computer-executable instructions, which when executed by a processor or controller (e.g., controller(s) or processor(s) 108, 110, and/or 112 of FIG. 1), cause the processor or controller to carry out methods disclosed herein.

Different embodiments are disclosed herein. Features of certain embodiments may be combined with features of other embodiments; thus certain embodiments may be combinations of features of multiple embodiments.

The foregoing description of the embodiments of the invention has been presented for the purposes of illustration and description. It is not intended to be exhaustive or to limit the invention to the precise form disclosed. It should be appreciated by persons of ordinary skill in the art that many modifications, variations, substitutions, changes, and equivalents are possible in light of the above teaching. For example, it should be appreciated that sign conventions are equivalent and that embodiments of the invention in which values are above a lower bound or threshold are equivalent to embodiments of the invention in which values are below an upper bound or threshold, since the difference is a mere convention of sign. It is, therefore, to be understood that the appended claims are intended to cover all such modifications and changes as fall within the true spirit of the invention.

TABLES

TABLE 1

Summary of Variants Detected by VGD Model and Missed by ClinVar or VariBench

Table 1 lists de novo variants (not identified by the ClinVar or VariBench models) identified

according to embodiments of the invention to cause gene dysfunction, for example, with VGD

scores of above 0.95 or identified according to embodiments of the invention to be benign, for

example, with VGD scores below 0.05.

Median

VGD
VGD
# of
Median
PP2
CADD
PROVEAN
VEST

Median
mean
homozygotes
max pop.
median
median
median
median

(IQR)
(sd)
(IQR)
freq (IQR)
(IQR)
(IQR)
(IQR)
(IQR)

ExAC variants
0
0.0109
0 (0, 0)
9.64e−05
0.001
7.5
−0.01
0.066

not in ClinVar
(0, 0.02)
(0.0156)

(3e−05,
(0,
(2.42,
(−0.34,
(0.039,

with VGD <=0.05

0.000262)
0.006)
10.6)
0.41)
0.107)

ExAC variants
0.99
0.985
0 (0, 0)
6.06e−05
1
25.1
−5.7
0.924

not in ClinVar
(0.97, 1)
(0.0162)

(1.51e−05,
(0.999,
(21.6,
(−7.31, −4.36)
(0.868,

with VGD >=0.95

9.92e−05)
1)
32)

0.965)

All ExAC
0.32
0.402
0 (0, 0)
8.64e−05
0.759
16.2
−1.89
0.403

variants not in
(0.03,
(0.369)

(1.61e−05,
(0.022,
(9.81,
(−3.56, −0.75)
(0.165,

ClinVar
0.77)

0.000169)
0.998)
22.6)

0.727)

ExAC variants
0
0.0107
0 (0, 0)
9.82e−05
0.001
7.62
−0.04
0.068

not in VariBench
(0, 0.02)
(0.0155)

(3.01e−05,
(0,
(2.5,
(−0.39,
(0.04,

with VGD <=0.05

0.000347)
0.008)
10.8)
0.39)
0.114)

ExAC variants
0.99
0.985
0 (0, 0)
6.06e−05
1
25
−5.67
0.924

not in VariBench
(0.97, 1)
(0.0161)

(1.51e−05,
(0.999,
(21.5,
(−7.29, −4.31)
(0.867,

with VGD >=0.95

0.000102)
1)
32)

0.966)

All ExAC
0.31
0.401
0 (0, 0)
8.66e−05
0.76
16.2
−1.89
0.403

variants not in
(0.03,
(0.37)

(1.65e−05,
(0.022,
(9.81,
(−3.55, −0.75)
(0.165,

VariBench
0.77)

0.000174)
0.998)
22.7)

0.729)

ExAC variants
0
0.0109
0 (0, 0)
9.63e−05
0.001
7.49
0.01
0.065

not in ClinVar or
(0, 0.02)
(0.0156)

(3e−05,
(0,
(2.42,
(−0.32,
(0.039,

VariBench with

0.00026)
0.006)
10.6)
0.42)
0.105)

VGD <=0.05

ExAC variants
0.99
0.985
0 (0, 0)
6.06e−05
1
25.1
−5.7
0.923

not in ClinVar or
(0.97, 1)
(0.0162)

(1.51e−05,
(0.999,
(21.6,
(−7.31, −4.36)
(0.867,

VariBench with

9.9e−05)
1)
32)

0.964)

VGD >=0.95

ExAC variants
0.98
0.935
0 (0, 0)
6.06e−05
1
24.5
−4.54
0.852

not in ClinVar or
(0.92, 1)
(0.105)

(1.51e−05,
(0.993,
(21.3,
(−6.31, −3.24)
(0.693,

VariBench with

0.000116)
1)
29.9)

0.936)

VGD naive >=

per gene

thresholds

All ExAC
0.32
0.401
0 (0, 0)
8.64e−05
0.753
16.1
−1.89
0.401

variants not in
(0.03,
(0.368)

(1.61e−05,
(0.022,
(9.8,
(−3.55, −0.75)
(0.165,

ClinVar or
0.76)

0.000166)
0.998)
22.6)

0.724)

VariBench

TABLE 2

Example Mutation Class Subscores and

Weights for Various Mutation Types

Raw Mut

Mutation Class
MutScore
Weight

Start-loss, stop-gain, frame-
1.0
99

shift, essential splice site

Microsatellite
1/STR{circumflex over ( )}2
1.0

Synonymous
0.0
1.0

In-frame indel
0.6 + 0.4*(1 − exp(−y/10))
0.01

Missense, stop-loss
0.5
0.01

Other annotation
0.0
0.01

TABLE 3

Example Clinical Classification Subscores

and Weights for Various Mutation Types

ClinVar Class
Clinical Class
Raw ClinVar

(numeric)
(description)
Weight
ClinScore

5.1
Uncontested “Pathogenic”
20
1.0

4
“Likely Pathogenic”
10
1.0

5.2
Contested “Pathogenic”
1
1.0

3
“Likely Benign”
10
0.0

2
“Benign”
20
0.0

0, 1, 255
Uncharacterized
0.0
0.0

not in ClinVar

TABLE 4

Example Scores for ClinVar 5.2 variants

Chr:bp
Gene
VGD
Strand
Genotype
SNP
HGVSp

01:043212925
P3H1
0.36
−
C/T
rs137853890
p.A691=|p.ARA689−691=

01:046655645
POMGNT1
0
−
C/T
rs74374973
p.D534N|p.D556N

01:063872032
ALG6
0
+
T/C
rs35383149
p.Y131H

01:076198337
ACADM
0.29
+
G/A
rs147559466
p.E43K|p.E47K|p.E7K|p.P13=

01:097915614
DPYD
0.16
−
C/T
rs3918290
.

01:098348885
DPYD
0
−
G/A
rs1801265
p.C29=|p.C29R|p.R29C

01:100672060
DBT
0.72
−
T/G
rs12021720
p.G323S|p.G384=

01:100672060
DBT
0
−
T/C
rs12021720
p.G323S|p.G384=|p.G384S|p.S384G

01:155204994
GBA
0.55
−
C/G
rs1135675|rs606231143
p.A456P|p.L444P|p.V386=|p.V412=|p.V450=|p.V460V|p.V499=

01:155205008
GBA
0.57
−
C/G
rs368060
p.A382P|p.A408P|p.A446P|p.A456P|p.A495P|p.L444P|p.V460V

01:155206167
GBA
0.02
−
C/T
rs2230288
p.D140H|p.E252K|p.E278K|p.E316K|p.E326K|p.E365K

01:156108401
LMNA
0.46
+
G/A
rs59886214
p.V607=|p.V607V

01:156108404
LMNA
0.61
+
C/T
rs58596362
p.G608=|p.G608G

01:156848918
NTRK1
0.01
+
C/T
rs6336
p.G607V|p.H568Y|p.H598Y|p.H601Y|p.H604Y|p.Q9X|p.Y604H

01:156848946
NTRK1
0.01
+
G/T
rs6339
p.G577V|p.G607V|p.G610V|p.G613V|p.H598Y|p.Q9X|p.V613G

01:216498841
USH2A
0.34
−
G/T
rs111033272
p.R317=

01:227170648
ADCK3
0
+
C/T
rs41303129
p.F279=|p.F331=|p.F52=

01:241665852
FH
0.99
−
T/G
rs200796606
p.Q376P

01:241675301
FH
0.99
−
G/C
rs199822819
p.P174R

02:026461825
HADHA
0.65
−
G/T
rs147103714
p.F10L|p.R53=

02:145156798
ZEB2
0.33
−
G/A
rs587784563
p.Y652=

02:215813331
ABCA12
0
−
C/T
rs726070
p.D2047N|p.D2363N|p.D2365N

02:219674479
CYP27A1
0.31
+
G/T
rs587778796
p.G145=|p.G51=

02:219678877
CYP27A1
0
+
C/T
rs41272687
p.P384L

02:219754966
WNT10A
0
+
G/A
rs147680216
p.G213S

02:219755011
WNT10A
0.2
+
T/A
rs121908120
p.F228I

02:241808314
AGXT
0.01
+
C/T
rs34116584
p.P11L

02:241817516
AGXT
0
+
A/G
rs4426527
p.I340M

03:014200382
XPC
0
−
G/T
rs74737358
p.P218H|p.P297H|p.P334H

03:015677019
BTD
0
+
G/A
rs34885143
p.G25R|p.G45R|p.G47R

03:015677098
BTD
0
+
T/C
rs397514333
p.L51P|p.L71P|p.L73P

03:015686243
BTD
0
+
A/G
rs35976361
p.I274V|p.I294V|p.I296V

03:015686331
BTD
0
+
A/G
rs397507176
p.H303R|p.H323R|p.H325R

03:015686534
BTD
0
+
C/T
rs35034250
p.P371S|p.P391S|p.P393S

03:015686693
BTD
0.06
+
G/C
rs13078881
p.A171T|p.D424H|p.D444H|p.D446H|p.F403V

03:128598490
ACAD9
0
+
C/+TAAG
rs28384402|rs369565142|rs387906242
.

03:142275281
ATR
0.37
−
T/C
rs587776690
p.G674=

03:150645894
CLRN1
0.99
−
A/C
rs121908140
p.Y100X|p.Y176X|p.Y189X

03:165491280
BCHE
0
−
C/T
rs1803274
p.A29T|p.A539T|p.A567T|p.A97T

03:165547569
BCHE
0.09
−
C/A
rs28933390
p.G390V|p.G418V

03:165548529
BCHE
0.09
−
T/C
rs1799807
p.D70G|p.D98G

03:183966564
ALG3
0.64
−
G/A
rs387906273
p.G53=|p.G55=|p.H33Y

04:005755524
EVC
0
+
G/A
rs35953626
p.R443Q

04:187004074
TLR3
0
+
C/T
rs3775291
p.L135F|p.L348F|p.L412F

05:073981270
HEXB
0.31
+
T/G
rs820878
p.S62=|p.S62Lc.185C=

05:073981270
HEXB
0
+
T/C
rs820878
p.L62S|p.S62=|p.S62L

05:118792052
HSD17B4
0.66
+
C/T
rs587777442
p.A34V|p.S37=

05:131729380
SLC22A5
0.01
+
G/A
rs28383481
p.R488H|p.R512H

06:007542236
DSP
0.06
+
G/A
rs121912998
p.V30M

06:032006858
CYP21A2
0.97
+
C/G
rs6467
p.P105A|p.T100S|p.T70S

06:161159625
PLG
0.04
+
G/A
rs121918027
p.A601T|p.A620T

07:065432754
GUSB
0.37
−
G/A
rs377519272
p.S393=|p.S488=|p.S539=

07:087060844
ABCB4
0.1
−
C/T
rs45575636
p.R590Q

07:087082273
ABCB4
0
−
T/C
rs58238559
p.T175A|p.T175V

07:117149147
CFTR
0
+
G/A
rs1800076
p.R75Q

07:117175372
CFTR
0
+
A/G
rs121909046
p.E187G|p.E217G

07:117188682
CFTR
0.04
+
G/−TT
rs200454589|rs727504486
.

07:117227874
CFTR
0
+
A/G
rs75789129
p.I495V|p.I526V|p.I556V

07:117243663
CFTR
0.48
+
C/T
rs121909034
p.G1244V|p.S851L|p.S882L|p.S912L

07:117251704
CFTR
0.09
+
G/A
rs78769542
p.R1009Q|p.R1040Q|p.R1070Q|p.R12Q

07:117267556
CFTR
0
+
T/C
rs373002889
.

07:117304834
CFTR
0
+
G/C
rs113857788
p.Q1291H|p.Q1322H|p.Q1352H|p.Q61H

08:022021460
SFTPC
0
+
G/A
rs34957318
p.R108Q|p.R114Q|p.R161Q|p.R167Q

08:086392989
CA2
0
+
A/G
rs2228063
p.N252D

08:090983460
NBN
0.4
−
G/A
rs34767364
p.R127W|p.R133W|p.R215W

08:100832259
VPS13B
0.18
+
A/G
rs28940272
p.N2968S|p.N2993S

09:034647661
GALT
0.62
+
T/C
rs367543254
p.S112=|p.V96A

09:034648848
GALT
0.5
+
G/A
rs111033761
p.R259=

09:034649442
GALT
0
+
A/G
rs2070074
p.L218L|p.N205D|p.N314D

09:036217445
GNE
0.06
−
C/T
rs121908627
p.V586M|p.V622M|p.V691M|p.V696M|p.V727M

09:136301982
ADAMTS13
0
+
C/G
rs2301612
p.C508Y|p.Q120E|p.Q417E|p.Q448E

09:136302063
ADAMTS13
0
+
C/T
rs11575933
p.P147S|p.P444S|p.P475S

10:050678722
ERCC6
0
−
G/C
rs4253208
p.P1095R|p.P465R

10:072358722
PRF1
0
−
T/C
rs28933375
p.N252S

10:072360648
PRF1
0
−
C/T
rs35418374
p.R4H

10:102749444
C10orf2
0.58
+
C/T
rs80356541
p.A429=

11:005247873
HBB
0.82
−
C/T
rs33991993
p.E6V|p.K82B|p.K83N

11:005248029
HBB
0.98
−
C/A
rs1135071
p.R30S|p.R31S

11:005248177
HBB
0.64
−
A/T
rs33951465|rs75680770
p.G24G|p.G25=|p.G25G

11:017531093
USH1C
0.77
−
G/C
rs41282932
p.P608R|p.R608P

11:017548622
USH1C
1
−
C/indelrs387906330
.
c.496+59_496+103[9]|c.VNTR

11:017548878
USH1C
0
−
C/T
rs55843567
p.V130I|p.V141I|p.V99I

11:017552978
USH1C
0.64
−
C/T
rs151045328
p.V41=|p.V72=|p.V72fs|p.V83=

11:036615075
RAG2
0
−
G/A
rs35691292
p.T215I|p.W215I

11:064525266
PYGM
0.01
−
C/T
rs116315896
p.K127=|p.K215=

11:066287196
BBS1
0
+
G/A
rs35520756
p.E122K|p.E143K|p.E234K|p.E271K

11:068562328
CPT1A
0
−
C/T
rs2229738
p.A275T|p.A27T

11:071906793
FOLR1
0
+
T/C
rs144637717
.

11:076869378
MYO7A
0.6
+
G/A
rs41298135
p.R291H|p.R302H

11:108143299
ATM
0
+
A/G
rs3092857
p.M1040V

11:128709126
KCNJ1
0
−
A/G
rs59172778
p.M338T|p.M357T

12:006458350
SCNN1A
0.05
−
A/G
rs5742912
p.W193R|p.W493R|p.W515R|p.W516R|p.W552R

12:102164255
GNPTAB
0
−
T/G
rs7958709
p.I348L

12:121175678
ACADS
0
+
C/T
rs1800556
p.R147W|p.R171W

12:121176083
ACADS
0
+
G/A
rs1799958
p.G185S|p.G205S|p.G209S

12:122277904
HPD
0.44
−
G/C
rs137852868
p.I296M|p.I335M

12:122295335
HPD
0
−
T/C
rs1154510
p.A33T|p.T33A

13:020763234
GJB2
0.78
−
T/C
rs80338949
p.M163V

13:020763553
GJB2
0.98
−
C/-A
rs80338942
p.L56RfsX26

13:020763612
GJB2
0
−
C/T
rs72474224
p.V37Ic.109G>A

13:020763620
GJB2
0.53
−
A/G
rs35887622
p.M34T

13:020763685
GJB2
0.94
−
A/-C
rs1801002|rs398123814|rs80338939
p.G12VfsX2

13:020763710
GJB2
0.11
−
C/T
rs111033222
p.G4D

13:052513266
ATP7B
0
−
T/C
rs7334118
p.H1000R|p.H1096R|p.H1129R|p.H1142R|p.H1207R|p.H418R|p.H777R

13:052532497
ATP7B
0.96
−
T/C
rs137853287|rs193922103
p.M41V|p.M658V|p.M769V

13:108863591
LIG4
0
−
G/A
rs1805388
p.T9I

13:108863609
LIG4
0
−
G/A
rs1805389
p.A3V

14:024724663
TGM1
0.02
−
C/T
rs35312232
p.V209M|p.V518M|p.V76M

14:024731434
TGM1
0.72
−
G/T
rs41295338
p.S42Y

14:088452941
GALC
0.21
−
T/C
rs147313927
p.T112A|p.T56A|p.T86A|p.T89A

15:072638893
HEXA
0.58
−
G/A
rs587779406
p.Y262=|p.Y435=|p.Y446=

15:072641434
HEXA
0.64
−
A/T
rs28942072
p.V324=|p.V324V

15:080472526
FAH
0
+
C/T
rs11555096
p.R271W|p.R341W|p.R41W

15:089865073
POLG
0.01
−
T/C
rs41549716
p.Y831C

16:000222980
HBA2
0.33
+
C/T
rs63751457
p.G22G|p.G23=

16:003293257
MEFV
0.14
−
C/A
rs61732874
p.A533S|p.A564S|p.A744S

16:003293310
MEFV
0.01
−
A/G
rs28940579
p.V515A|p.V546A|p.V726A

16:003293403
MEFV
0
−
T/C
rs104895094
p.K484R|p.K515R|p.K695R

16:003304626
MEFV
0
−
C/G
rs3743930
p.E148Q|p.M694I

16:023200921
SCNN1G
0
+
G/A
rs5736
p.G183S

16:023200963
SCNN1G
0
+
G/A
rs5738
p.E197K

16:023360165
SCNN1B
0.54
+
C/G
rs35731153
p.S127C|p.S82C

16:053720436
RPGRIP1L
0
−
C/T
rs61747071
p.A229T

16:088502971
ZNF469
0.08
+
G/-CTTCCCGGGAACACC
rs281865162
p.L3004_T3008del|p.L3032_T3036del

17:003550800
CTNS
0
+
G/A
rs35086888
p.V42I

17:007123838
ACADVL
0
+
C/T
rs28934585
p.K247Q|p.P65L|p.P88L

17:015134364
PMP22
0.94
−
G/A
rs104894619
p.H58=|p.T118M

17:041063017
G6PC
0.27
+
G/T
rs80356484
p.L216=

17:078078341
GAA
0.91
+
T/G
rs199951626|rs386834236
.

19:007125518
INSR
0
−
C/T
rs1799816
p.V1000M|p.V1012M

19:012917556
RNASEH2A
0.63
+
G/A
rs397515480
p.V23=|p.V23V

19:012917562
RNASEH2A
0.64
+
C/T
rs397515479
p.R25=|p.R25R

19:018710530
CRLF1
0.5
−
C/T
rs104894670
p.L374R|p.R81H

19:036339044
NPHS1
0.02
−
C/T
rs28939695
p.D819V|p.E447K

19:045860626
ERCC2
0.81
−
G/C
rs121913016
p.L169V|p.L383V|p.L437V|p.L461V

20:043255233
ADA
0.96
−
G/A
rs121908736
p.R76W

20:043280227
ADA
0
−
C/T
rs11555565|rs73598374
p.D8N

22:050962500
SCO2
0.34
−
C/T
rs145100473
p.R114H

22:051064039
ARSA
0
−
G/C
rs743616
p.T16S|p.T307S|p.T391S|p.T393S

22:051064416
ARSA
0
−
T/C
rs2071421
p.N266S|p.N350S|p.N352S

Chr:bp
HGVSc
ClinVar entries

01:043212925
c.2055+12_2055+18delTCGAGCGinsTCGAGCA|c.2067_2073delTCGAGCGinsTCGAGCA|c.2073G>A
5

01:046655645
c.*1335G>A|c.1600G>A|c.1666G>A
0|2|255|3|5

01:063872032
c.391T>C
2|255|3|5

01:076198337
c.−259G>A|c.118+4164G>A|c.119−201G>A|c.127G>A|c.139G>A|c.19G>A|c.39G>A
2|255|5

01:097915614
c.1905+1G>A
1|255|5

01:098348885
c.39+37555C>T|c.85C>T|c.85T=|c.85T>C
0|1|5

01:100672060
c.1150G=
2|3|5

01:100672060
c.1150A>G|c.1150G=|c.1150G>A
2|255|3|5

01:155204994
c.1158G>C|c.1236G>C|c.1350G>C|c.1497G>C
2|5

01:155205008
c.1144G>C|c.1222G>C|c.1336G>C|c.1483G>C
2|5

01:155206167
c.1093G>A|c.754G>A|c.832G>A|c.946G>A
5

01:156108401
c.1821G>A
1|5

01:156108404
c.1824C>T
5

01:156848918
c.*402C>T|c.1702C>T|c.1792C>T|c.1801C>T|c.1810C>T
2|5

01:156848946
c.*430G>T|c.1730G>T|c.1820G>T|c.1829G>T|c.1838G>T
2|5

01:216498841
c.949C>A
5

01:227170648
c.102+184C>T|c.156C>T|c.837C>T|c.993C>T, EX8|c.993C>T
2|5

01:241665852
c.1127A>C
255|3|5

01:241675301
c.521C>G
255|3|5

02:026461825
c.157C>A|c.30C>A
5

02:145156798
c.1956C>T
5

02:215813331
c.6139G>A|c.7093G>A
5

02:219674479
c.153G>T|c.256−2466G>T|c.435G>T
5

02:219678877
c.1151C>T
5

02:219754966
c.264−2530G>A|c.637G>A
5

02:219755011
c.264−2485T>A|c.682T>A
5

02:241808314
c.32C>T
2|255|5

02:241817516
c.1020A>G
2|255|5

03:014200382
c.*454C>A|c.1001C>A|c.890C>A
1|5

03:015677019
c.133G>A|c.139G>A|c.73G>A
5

03:015677098
c.152T>C|c.212T>C|c.218T>C
5

03:015686243
c.820A>G|c.880A>G|c.886A>G
5

03:015686331
c.908A>G|c.968A>G|c.974A>G
0|5

03:015686534
c.1111C>T|c.1171C>T|c.1177C>T
2|5

03:015686693
c.1270G>C|c.1330G>C|c.1336G>C
5

03:128598490
c.−44_−41dupTAAG|c.−45delCinsCTAAG|c.insTAAG
5

03:142275281
c.2022A>G|c.2101A-G
5

03:150645894
c.104+13475T>G|c.162+13475T>G|c.300T>G|c.528T>G|c.567T>G
5

03:165491280
c.*205G>A|c.*89G>A|c.1699G>A|c.289G>A|c.85G>A
2|255|5

03:165547569
c.−98+7533G>T|c.107+7533G>T|c.1253G>T
5

03:165548529
c.−98+6573A>G|c.107+6573A>G|c.293A>G
5

03:183966564
c.154+6C>T|c.158C>T|c.159+6C>T|c.160del37|c.165C>T|c.2106+104069G>A|c.52+450C>T|c.76+214C>T|c.97C>T
5

04:005755524
c.1328G>A
5

04:187004074
c.1042C>T|c.1234C>T|c.403C>T
5

05:073981270

2|5

05:073981270
c.−376−3883T>C|c.185C=|c.185C>T|c.185T>C
2|5

05:118792052
c.−311C>T|c.−37C>T|c.101C>T|c.111C>T|c.58+3724C>T
5

05:131729380
c.*195G>A|c.*315G>A|c.1463G>A|c.1535G>A
5

06:007542236
c.88G>A
255|3|4|5

06:032006858
c.203−13C>G|c.209C>G|c.293−13C>G|c.299C>G|c.313C>G
5

06:161159625
c.1858G>A
5

07:065432754
c.*884C>T|c.*997C>T|c.1179C>T|c.1464C>T|c.1617C>T|c.1642del38
5

07:087060844
c.1769G>A
5

07:087082273
c.523A>G
5

07:117149147
c.−20G>A|c.224G>A
0|2|3|5

07:117175372
c.560A>G|c.650A>G
5

07:117188682
c.1120−13_1120−11delGTTinsG|c.1209+6520_1209+6522delGTTinsG|c.1210−12[5]|c.1210−12_1210−6T[5]|c.1210−13_1210−11delGTTinsG|c.1210−7_1210−6delTT|c.5T
0|255|5

07:117227874
c.1483A>G|c.1576A>G|c.1666A>G
4|5

07:117243663
c.2552C>T|c.2645C>T|c.2735C>T
2|5

07:117251704
c.3026G>A|c.3119G>A|c.3209G>A|c.34G>A
1|5

07:117267556
c.294−20T>C|c.3286−20T>C|c.3379−20T>C|c.3469−20T>C|c.3601, T-C, −20
5

07:117304834
c.182G>C|c.3873G>C|c.3966G>C|c.4056G>C
5

08:022021460
c.*127G>A|c.*383G>A|c.277−319G>A|c.323G>A|c.341G>A|c.482G>A|c.500G>A
5

08:086392989
c.*341A>G|c.754A>G
5

08:090983460
c.*516C>T|c.379C>T|c.397C>T|c.643C>T
0|1|255|5

08:100832259
c.*4761A>G|c.8903A>G|c.8978A>G
2|5

09:034647661
c.*145T>C|c.*80T>C|c.252+406T>C|c.253−168T>C|c.287T>C|c.336T>C|c.51−168T>C
5

09:034648848
c.777G>A
5

09:034649442
c.*560A>G|c.432+989A>G|c.613A>G|c.940A>G
1|2|255|3|5

09:036217445
c.1756G>A|c.1864G>A|c.2071G>A|c.2086G>A|c.2179G>A|c.485+13269C>T
5

09:136301982
c.*146C>G|c.*210C>G|c.*626C>G|c.1249C>G|c.1342C>G|c.358C>G
5

09:136302063
c.*227C>T|c.*291C>T|c.*707C>T|c.1330C>T|c.1423C>T|c.439C>T
5

10:050678722
c.1394C>G|c.3284C>G
5

10:072358722
c.755A>G
5

10:072360648
c.11G>A
5

10:102749444
c.1287C>T
5

11:005247873
c.249G>Y
255|5

11:005248029
c.93G>T
2|5

11:005248177
c.75T>A
5

11:017531093
c.1192−7566C>G|c.1211−7566C>G|c.1228−7566C>G|c.1285−7566C>G|c.1823C>G
3|5

11:017548622
EXP
2|5

11:017548878
c.295G>A|c.388G>A|c.421G>A
2|5

11:017552978
c.123G>A|c.216G−A|c.216G>A|c.249G>A
5

11:036615075
c.644C>T
5

11:064525266
c.381G>A|c.645G>A
2|5

11:066287196
c.*158G>A|c.*360G>A|c.*403G>A|c.*407G>A|c.*489G>A|c.364G>A|c.427G>A|c.433−1545G>A|c.700G>A|c.811G>A
255|3|5

11:068562328
c.79G>A|c.823G>A
2|5

11:071906793
c.493+2T>C
5

11:076869378
c.872G>A|c.905G>A
2|5

11:108143299
c.3118A>G
2|3|5

11:128709126
c.1013T>C|c.1070T>C
5

12:006458350
c.*548T>C|c.1439+143T>C|c.1477T>C|c.1543T>C|c.1546T>C|c.1654T>C|c.577T>C
5

12:102164255
c.1042A>C
5

12:121175678
c.473−174C>T|c.511C>T
255|4|5

12:121176083
c.613G>A|c.625G>A
2|255|4|5

12:122277904
c.1005C>G|c.888C>G
5

12:122295335
c.−21A>G|c.97A>G|c.97G>A
5

13:020763234
c.487A>G
3|5

13:020763553
c.167delT
5

13:020763612

5

13:020763620
c.101T>C
0|255|5

13:020763685
c.35delG
5

13:020763710
c.11G>A
2|5

13:052513266
c.1253A>G|c.2330A>G|c.2999A>G|c.3287A>G|c.3386A>G|c.3425A>G|c.3620A>G
2|255|3|5

13:052532497
c.121A>G|c.1286−8200A>G|c.1870−754A>G|c.1972A>G|c.2122−754A>G|c.2305A>G
2|5

13:108863591
c.26C>T
2|5

13:108863609
c.8C>T
2|5

14:024724663
c.1552G>A|c.226G>A|c.625G>A
5

14:024731434
c.−130C>A|c.125C>A
5

14:088452941
c.*205A>G|c.*83A>G|c.166A>G|c.256A>G|c.265A>G|c.334A>G
5

15:072638893
c.*110C>T|c.*559−227C>T|c.*715C>T|c.*826C>T|c.1305C>T|c.1338C>T|c.498−227C>T|c.786C>T|c.946−227C>T
5

15:072641434
c.972T>A
5

15:080472526
c.1021C>T|c.119C>T|c.811C>T
5

15:089865073
c.185+900A>G|c.196−582A>G|c.2492A>G
2|5

16:000222980
c.69C>T
5

16:003293257
c.*434G>T|c.*506G>T|c.*614G>T|c.*714G>T|c.*747G>T|c.*855G>T|c.*863G>T|c.1597G>T|c.1690G>T|c.2230G>T
0|255|5

16:003293310
c.*381T>C|c.*453T>C|c.*561T>C|c.*661T>C|c.*694T>C|c.*802T>C|c.*810T>C|c.1544T>C|c.1637T>C|c.2177T>C
5

16:003293403
c.*288A>G|c.*360A>G|c.*468A>G|c.*568A>G|c.*601A>G|c.*709A>G|c.*717A>G|c.1451A>G|c.1544A>G|c.2084A>G
5

16:003304626
c.277+1685G>C|c.442G>C
0|5

16:023200921
c.547G>A
5

16:023200963
c.589G>A
5

16:023360165
c.245C>G|c.380C>G
5

16:053720436
c.685G>A
2|255|3|5

16:088502971
c.9010_9024delCTTCCCGGGAACACC|c.9094_9108delCTTCCCGGGAACACC
5

17:003550800
c.−113+7239G>A|c.−216−7492G>A|c.124G>A|c.5+7239G>A
5

17:007123838
c.*149C>T|c.139−85C>T|c.194C>T|c.263C>T
2|5

17:015134364
c.*62C>T|c.174C>T|c.353C>T
5

17:041063017
c.*40G>T|c.648G>T
5

17:078078341
c.−32−13T>G
5

19:007125518
c.2998G>A|c.3034G>A
4|5

19:012917556
c.69G>A
5

19:012917562
c.75C>T
5

19:018710530
c.242G>A
2|5

19:036339044
c.1338_1339delTGinsTA|c.1339G>A
3|5

19:045860626
c.1147C>G|c.1309C>G|c.1381C>G|c.504C>G
1|5

20:043255233
c.218+2455C>T|c.226C>T
1|5

20:043280227
c.22G>A
5

22:050962500
c.341G>A
5

22:051064039
c.1172C>G|c.1178C>G|c.46C>G|c.920C>G
2|5

22:051064416
c.1049A>G|c.1055A>G|c.797A>G
1|2|5

Maximum

population
Population with

Number of

Chr:bp
Variant type
PolyPhen-2
CADD (adj)
PROVEAN (adj)
VEST
frequency
highest frequency
AN
Carriers
homozygotes
Variant source

01:043212925
Synonymous
.
7.309
.
.
1.50511739916e−05
NFE
121028
1
0
ExAC and ClinVar

01:046655645
Missense
0.997
23.1
−3.12
0.148
0.0126391218174
NFE
118596
1064
9
ExAC and ClinVar

01:063872032
Missense
0.014
20.3
−3.57
0.315
0.0397375848196
NFE
121148
3318
79
ExAC and ClinVar

01:076198337
Missense
0.422
26.9
−2.69
0.744
0.00336761080041
NFE
120842
270
1
ExAC and ClinVar

01:097915614
Essential splice site (intron)
.
22.6
.
.
0.00583313339731
NFE
12125
624
5
ExAC and ClinVar

01:098348885
Missense
.
23.8
.
0.201
0.416378316032
AFR
121114
20943
3749
ExAC and ClinVar

01:100672060
Missense
.
22.8
−2.05
0.463
0
.
.
.
.
ClinVar

01:100672060
Missense
.
7.686
.
0.158
0.235825485297
AFR
121412
9188
641
ExAC and ClinVar

01:155204994
Synonymous
.
14.73

0.00103998151144
EAS
121388
35
0
ExAC and ClinVar

01:155205008
Missense
0.037
13.26
−1.51
0.831
0.00038454143434
AFR
121380
14
0
ExAC, ClinVar, and VariBench

01:155206167
Missense
0.015
21.7
−1.2
0.402
0.011964017991
NFE
121320
1164
12
ExAC, ClinVar, and VariBench

01:156108401
Synonymous
.
15.94
.
.
0
.
.
.
.
ClinVar

01:156108404
Synonymous
.
18.65
.
.
0
.
.
.
.
ClinVar

01:156848918
Missense
0.986
28.2
−5.34
0.727
0.0550196552767
NFE
120020
4832
136
ExAC and ClinVar

01:156848946
Missense
0.379
22.3
−2.72
0.463
0.0549432298587
NFE
120296
4825
138
ExAC and ClinVar

01:216498841
Synonymous
.
10.51
.
.
2.99922020275e−05
NFE
121146
2
0
ExAC and ClinVar

01:227170648
Synonymous
.
17.34
.
.
0.0286132439656
NFE
85834
1729
33
ExAC and ClinVar

01:241665852
Missense
1.0
33.0
−5.97
0.95
0.000149902563334
NFE
121332
11
0
ExAC and ClinVar

01:241675301
Missense
0.999
27.9
−7.67
0.919
2.99679343103e−05
NFE
121408
2
0
ExAC and ClinVar

02:026461825
Synonymous
.
22.8
.
.
0.000164897763387
NFE
121300
11
0
ExAC and ClinVar

02:145156798
Synonymous
.
0.067
.
.
0
.
.
.
.
ClinVar

02:215813331
Missense
0.908
21.1
−1.2
0.016
0.0321069073238
AMR
121124
3203
81
ExAC, ClinVar, and VariBench

02:219674479
Synonymous
.

6.272
0.000693962526024
EAS
120044
6
0
ExAC and ClinVar

02:219678877
Missense
1.0
20.7
−8.96
0.327
0.0304626937984
SAS
121374
2205
37
ExAC and ClinVar

02:219754966
Missense
1.0
33.0
−5.2
0.982
0.0237364194615
EAS
118674
200
4
ExAC and ClinVar

02:219755011
Missense
0.999
31.0
−4.95
0.843
0.02004450901
NFE
117764
1471
14
ExAC and ClinVar

02:241808314
Missense
1.0
12.92
−8.95
0.523
0.212677536463
NFE
114234
14238
1706
ExAC and ClinVar

02:241817516
Missense
0.0
0.003
3.24
0.095
0.210442461763
NFE
120178
16067
1897
ExAC, ClinVar, and VariBench

03:014200382
Missense
0.855
9.104
−1.23
0.075
0.0250665029671
AFR
120636
341
2
ExAC and ClinVar

03:015677019
Missense
0.001
10.21
−0.39
0.078
0.0133503146539
NFE
121412
1186
11
ExAC, ClinVar, and VariBench

03:015677098
Missense
0.0
6.242
3.27
0.634
0.0116279069767
SAS
121412
177
8
ExAC and ClinVar

03:015686243
Missense
0.0
0.001
0.64
0.018
0.042675893887
AFR
121398
457
12
ExAC, ClinVar, and VariBench

03:015686331
Missense
1.0
25.3
−5.52
0.877
0.0164728682171
SAS
121162
275
3
ExAC and ClinVar

03:015686534
Missense
0.002
12.3
−0.63
0.077
0.0210680891872
NFE
121406
1650
24
ExAC, ClinVar, and VariBench

03:015686693
Missense
1.0
21.0
−5.48
0.817
0.0393802259718
NFE
121398
3678
83
ExAC and ClinVar

03:128598490
5′ untranslated
.
9.008
.
.
0.12606292517
AFR
117854
4457
220
ExAC and ClinVar

03:142275281
Synonymous
.
13.68
.
.
0
.
.
.

ClinVar

03:150645894
Nonsense
.
13.83
.
0.9999
0.000487466331775
GLOBAL
21034
55
2
ExAC and ClinVar

03:165491280
Missense
0.001
22.9
−0.52
0.204
0.21094359241
NFE
113876
17478
2029
ExAC, ClinVar, and VariBench

03:165547569
Missense
0.088
21.9
−1.96
0.295
0.00467878351629
NFE
121182
362
2
ExAC, ClinVar, and VariBench

03:165548529
Missense
0.687
24.7
−5.56
0.913
0.017657825252
NFE
121188
1433
15
ExAC, ClinVar, and VariBench

03:183966564
Synonymous
.
22.4
.
.
3.34515287349e−05
NFE
107798
2
0
ExAC and ClinVar

04:005755524
Missense
0.265
17.45
−0.53
0.191
0.225009611688
AFR
121404
1987
283
ExAC, ClinVar, and VariBench

04:187004074
Missense
1.0
23.9
−3.54
0.256
0.33735219105
EAS
120980
23184
4813
ExAC and ClinVar

05:073981270
Missense
.
9.454
−0.82
0.386
0
.
.
.
.
ClinVar

05:073981270
Missense
.
9.134
.
0.191
0.0379595528849
NFE
113532
3207
53
ExAC and ClinVar

05:118792052
Synonymous
.
23.0
.
.
0.000115580212668
EAS
121398
1
0
ExAC and ClinVar

05:131729380
Missense
0.972
34.0
−3.27
0.876
0.0057004664018
AMR
121410
388
4
ExAC and ClinVar

06:007542236
Missense
0.0
8.982
0.4
0.683
0.00570114022805
SAS
47946
147
1
ExAC and ClinVar

06:032006858
Intron
.
.
.
.
0.00278810408922
NFE
87414
202
2
ExAC and ClinVar

06:161159625
Missense
1.0
23.8
−2.77
0.802
0.0181418996996
EAS
121404
155
3
ExAC, ClinVar, and VariBench

07:065432754
Synonymous
.
11.35
.
.
5.99430540986e−05
NFE
121400
4
0
ExAC and ClinVar

07:087060844
Missense
0.999
36.0
−3.62
0.732
0.00574023560445
NFE
121370
503
2
ExAC, ClinVar, and VariBench

07:087082273
Missense
0.841
25.0
−3.82
0.702
0.0132630813953
SAS
121306
1266
11
ExAC and ClinVar

07:117149147
Missense
1.0
27.2
−2.23
0.914
0.024814081804
NFE
121258
1807
28
ExAC, ClinVar, and VariBench

07:117175372
Missense
0.053
24.5
−2.03
0.422
0.0102866389274
EAS
121392
454
11
ExAC, ClinVar, and VariBench

07:117188682
Intron
.
13.2
.
.
0.0651453340133
AFR
99808
2746
26
ExAC and ClinVar

07:117227874
Missense
0.334
22.0
−0.07
0.582
0.0400603808639
EAS
119930
325
12
ExAC and ClinVar

07:117243663
Missense
0.055
9.102
−0.1
0.717
0.00131930077059
NFE
121272
108
0
ExAC and ClinVar

07:117251704
Missense
1.0
22.0
−0.14
0.936
0.00539083557951
SAS
120098
93
2
ExAC, ClinVar, and VariBench

07:117267556
Intron
.
.
.
.
0.0132863675689
SAS
118048
208
7
ExAC and ClinVar

07:117304834
Missense
1.0
26.0
−4.19
0.969
0.0134259259259
EAS
121348
109
4
ExAC and ClinVar

08:022021460
Missense
0.424
5.102
0.16
0.046
0.0161023947151
AFR
119764
160
3
ExAC and ClinVar

08:086392989
Missense
0.0
13.24
−1.41
0.039
0.0913110342176
AFR
121250
916
47
ExAC, ClinVar, and VariBench

08:090983460
Missense
1.0
26.2
−4.1
0.743
0.00470841155453
NFE
120222
349
3
ExAC and ClinVar

08:100832259
Missense
0.998
23.4
−4.61
0.488
0.0049163618922
NFE
121316
394
0
ExAC, ClinVar, and VariBench

09:034647661
Synonymous
.
20.7
.
.
1.49907057624e−05
NFE
121364
1
0
ExAC and ClinVar

09:034648848
Synonymous
.
17.3
.
.
0
.
.
.
.
ClinVar

09:034649442
Missense
0.0
16.87
0.59
0.381
0.183244487521
SAS
121390
9785
692
ExAC and ClinVar

09:036217445
Missense
0.993
22.3
−0.52
0.952
0.014091350826
SAS
121022
229
3
ExAC, ClinVar, and VariBench

09:136301982
Missense
0.0
5.693
1.81
0.068
0.510825982358
AMR
60658
16727
5376
ExAC and ClinVar

09:136302063
Missense
0.024
7.79
−0.58
0.097
0.0506594724221
AMR
52912
380
4
ExAC and ClinVar

10:050678722
Missense
0.0
7.757
−0.45
0.101
0.041915016343
AFR
121382
439
10
ExAC, ClinVar, and VariBench

10:072358722
Missense
0.0
1.583
−0.59
0.007
0.0101883890811
AFR
121358
622
3
ExAC, ClinVar, and VariBench

10:072360648
Missense
0.0
9.447
1.06
0.052
0.117486338798
AFR
48114
511
23
ExAC, ClinVar, and VariBench

10:102749444
Synonymous
.
19.24
.
.
0
.
.
.
.
ClinVar

11:005247873
Synonymous
.
19.62
.
.
0
.
.
.
.
ClinVar

11:005248029
Essential splice site (exon)
.
18.98
.
.
0.000280426247897
GLOBAL
121244
34
0
ExAC and ClinVar

11:005248177
Synonymous
.
21.9
.
.
9.60984047665e−05
AFR
121356
2
0
ExAC and ClinVar

11:017531093
Missense
0.984
24.3
−1.82
0.727
0.000845686996319
NFE
108256
57
0
ExAC and ClinVar

11:017548622
Intron
.
.
.
.
0
.
.
.
.
ClinVa r

11:017548878
Essential splice site (exon)
.
19.43
.
.
0.0462371832076
AFR
120926
473
10
ExAC and ClinVar

11:017552978
Synonymous
.
19.46
.
.
7.59186152445e−05
NFE
119738
5
0
ExAC and ClinVar

11:036615075
Missense
0.51
12.39
−2.65
0.478
0.0249515503876
SAS
121360
422
6
ExAC, ClinVar, and VariBench

11:064525266
Synonymous
.
18.78
.
.
0.00675324675325
AMR
120960
528
3
ExAC and ClinVar

11:066287196
Missense
0.99
22.3
−2.29
0.432
0.101719474498
AFR
120856
1017
55
ExAC, ClinVar, and VariBench

11:068562328
Missense
0.0
13.13
−0.07
0.085
0.0896940273907
NFE
121408
7079
388
ExAC and ClinVar

11:071906793
Essential splice site (intron)
.
12.06
.
.
0.0136870155039
SAS
121396
400
8
ExAC and ClinVar

11:076869378
Missense
0.998
30.0
−4.28
0.697
0.00588035559946
NFE
116632
409
2
ExAC, ClinVar, and VariBench

11:108143299
Missense
0.0
0.141
−0.14
0.12
0.0413846451363
AFR
121182
428
10
ExAC, ClinVar, and VariBench

11:128709126
Missense
0.004
12.33
−0.1
0.213
0.011991845545
NFE
121350
950
12
ExAC, ClinVar, and VariBench

12:006458350
Missense
1.0
22.1
−12.93
0.939
0.0250121124031
SAS
121400
2158
28
ExAC and ClinVar

12:102164255
Missense
0.904
23.3
−1.85
0.307
0.0415224913495
AFR
121328
447
4
ExAC and ClinVar

12:121175678
Missense
0.994
13.76
−4.32
0.194
0.0445456464698
NFE
120756
3588
97
ExAC and ClinVar

12:121176083
Essential splice site (exon)
.
19.66
.
.
0.378772112383
AMR
121136
22322
4537
ExAC and ClinVar

12:122277904
Missense
0.978
18.43
−2.78
0.888
0.00763081395349
SAS
121382
252
2
ExAC and ClinVar

12:122295335
Missense
.
13.89
.
0.345
0.280331835465
AMR
121370
14873
1653
ExAC and ClinVar

13:020763234
Missense
0.91
17.93
−2.16
0.95
0.000606869765748
SAS
121056
20
0
ExAC and ClinVar

13:020763553
Frameshift
.
13.32
.
0.9985
0.00113977204559
NFE
121310
77
3
ExAC and ClinVar

13:020763612
Missense
1.0
11.2
−0.82
0.703
0.0724201758445
EAS
121300
721
39
ExAC, ClinVar, and VariBench

13:020763620
Missense
0.038
5.565
−3.8
0.396
0.0122214557778
NFE
121354
1006
13
ExAC and ClinVar

13:020763685
Frameshift
.
12.64
.
0.9892
0.00877245598776
NFE
121352
727
3
ExAC and ClinVar

13:020763710
Missense
0.088
1.016
−2.15
0.091
0.00404530744337
EAS
121210
49
1
ExAC and ClinVar

13:052513266
Missense
0.044
14.93
−4.29
0.112
0.160760587727
AMR
120728
2727
187
ExAC and ClinVar

13:052532497
Missense
1.0
25.9
−3.23
0.894
0.000104972707096
NFE
120692
8
0
ExAC, ClinVar, and VariBench

13:108863591
Missense
0.52
20.2
−2.7
0.139
0.233560685554
AMR
116530
16240
1962
ExAC and ClinVar

13:108863609
Missense
0.376
23.2
−0.54
0.237
0.12023977433
EAS
115050
6277
234
ExAC and ClinVar

14:024724663
Missense
0.985
29.5
−1.42
0.407
0.0156704992345
NFE
121200
1249
6
ExAC, ClinVar, and VariBench

14:024731434
Missense
0.978
22.7
−0.33
0.823
0.00581448169347
NFE
120520
480
2
ExAC, ClinVar, and VariBench

14:088452941
Missense
0.991
23.4
−2.67
0.9
0.00363043411377
NFE
117482
302
2
ExAC and ClinVar

15:072638893
Synonymous
.
19.85
.
.
0.00043192812716
AMR
121304
6
0
ExAC and ClinVar

15:072641434
Synonymous
.
19.55
.
.
0
.
.
.
.
ClinVar

15:080472526
Missense
1.0
28.4
−5.4
0.758
0.0229433272395
NFE
119604
1973
19
ExAC, ClinVar, and VariBench

15:089865073
Missense
0.995
17.7
−2.82
0.764
0.00891652929717
NFE
121386
756
3
ExAC, ClinVar, and VariBench

16:000222980
Synonymous
.
2.648
.
.
0
.
.
.
.
ClinVar

16:003293257
Missense
0.0
0.003
1.05
0.492
0.00214302841386
NFE
121396
176
2
ExAC, ClinVar, and VariBench

16:003293310
Missense
0.001
0.001
1.93
0.514
0.00325161831695
NFE
121408
212
6
ExAC, ClinVar, and VariBench

16:003293403
Missense
0.939
1.747
−0.95
0.218
0.00791129757267
NFE
121410
660
4
ExAC, ClinVar, and VariBench

16:003304626
Missense
0.995
23.3
−1.3
0.386
0.315009692606
EAS
92068
6211
1038
ExAC, ClinVar, and VariBench

16:023200921
Missense
0.001
0.114
0.92
0.221
0.0373822794542
AFR
121392
561
8
ExAC and ClinVar

16:023200963
Missense
0.004
14.77
−0.82
0.741
0.0074396280186
NFE
121330
566
4
ExAC and ClinVar

16:023360165
Missense
0.997
16.78
−4.04
0.805
0.00689862027594
NFE
121310
575
3
ExAC and ClinVar

16:053720436
Missense
0.005
17.44
−0.19
0.081
0.103657362849
AFR
121106
4265
138
ExAC and ClinVar

16:088502971
Inframe indel
.
17.04
−5.38
.
0.0059508736389
SAS
17306
76
2
ExAC and ClinVar

17:003550800
Missense
0.007
12.64
−0.02
0.351
0.0596914175506
AFR
121160
634
13
ExAC and ClinVar

17:007123838
Missense
0.019
8.442
−1.95
0.113
0.113527034828
AFR
121338
1136
68
ExAC, ClinVar, and VariBench

17:015134364
Missense
1.0
29.0
−5.48
0.926
0.00735939913383
NFE
120452
560
2
ExAC, ClinVar, and VariBench

17:041063017
Synonymous
.
7.36
.
.
0.00127108851398
EAS
121406
11
0
ExAC and ClinVar

17:078078341
Intron
.
.
.
.
0.00531336583313
NFE
101332
359
2
ExAC and ClinVar

19:007125518
Missense
0.992
34.0
−2.42
0.662
0.0225317989098
SAS
121148
1068
11
ExAC and ClinVar

19:012917556
Synonymous
.
19.19
.
.
0.0
NFE
28346
0
0
ExAC and ClinVar

19:012917562
Synonymous
.
21.2
.
.
0
.
.
.
.
ClinVar

19:018710530
Missense
0.004
16.56
−1.9
0.436
0.000532339632686
AFR
91230
5
0
ExAC, ClinVar, and VariBench

19:036339044
Missense
0.996
30.0
−1.73
0.738
0.039027810236
EAS
119744
327
10
ExAC, ClinVar, and VariBench

19:045860626
Missense
0.982
25.3
−2.43
0.934
0.00710382513661
SAS
120472
158
2
ExAC, ClinVar, and VariBench

20:043255233
Missense
1.0
22.3
−5.52
0.941
0.00259615384615
AFR
121358
38
2
ExAC, ClinVar, and VariBench

20:043280227
Missense
0.001
23.1
0.28
0.038
0.138583482485
SAS
13566
1628
80
ExAC and ClinVar

22:050962500
Missense
1.0
16.29
−3.47
0.847
0.00234252993233
AMR
119592
89
2
ExAC and ClinVar

22:051064039
Missense
0.0
8.571
0.13
0.106
0.539283024552
NFE
120940
29102
14735
ExAC and ClinVar

22:051064416
Missense
0.134
17.91
−1.17
0.126
0.394781144781
AMR
40070
7001
839
ExAC and ClinVar

TABLE 5

Summary VGD scores stratified by ClinVar and VariBench category

VGD
VGD
Median # of
Median max

# of
Median
mean
homozygotes
population

Variants
(IQR)
(sd)
(IQR)
frequency (IQR)

ClinVar 2
2087
0
0.0496
8 (1, 211)
0.0211
(0.00377,

(0, 0.01)
(0.158)

0.118)

ClinVar 3
1597
0
0.0624
5 (0, 166)
0.0148
(0.000606,

(0, 0.04)
(0.142)

0.102)

ClinVar 4
1280
0.99
0.921
0 (0, 0)
0
(0, 1.5e−05)

(0.94, 1)
(0.172)

ClinVar 5
6793
1
0.975
0 (0, 0)
0
(0, 4.51e−05)

(0.99, 1)
(0.129)

ClinVar 5.1
6653
1
0.991
0 (0, 0)
0
(0, 3.02e−05)

(1, 1)
(0.0575)

ClinVar 5.2
140
0.02
0.249
6 (2, 38)
0.0118
(0.00111,

(0, 0.508)
(0.33)

0.0416)

Other or unkown
6406
0.755
0.631
0 (0, 0)
0
(0, 0.000108)

variants

(0.3, 0.97)
(0.364)

All ClinVar variants
17748
0.97
0.661
0 (0, 3)
0
(0, 0.000405)

(0.14, 1)
(0.42)

VariBench Benign
753
0
0.193
8 (0, 135)
0.0159
(0.000207,

(0, 0.29)
(0.312)

0.0865)

VariBench
5521
0.97
0.869
0 (0, 0)
0
(0, 2.2e−05)

Pathogenic

(0.83, 0.99)
(0.216)

VariBench
5431
0.98
0.883
0 (0, 0)
0
(0, 1.53e−05)

Pathogenic group 1

(0.85, 0.99)
(0.189)

VariBench
90
0
0.0649
9.5 (3, 28)
0.0186
(0.00906,

Pathogenic group 2

(0, 0.02)
(0.19)

0.0468)

All VariBench
6274
0.96
0.788
0 (0, 0)
0
(0, 8.7e−05)

variants

(0.75, 0.99)
(0.318)

All ExAC variants
242782
0.32
0.403
0 (0, 0)
8.66e−05
(1.65e−05,

(0.03, 0.77)
(0.372)

0.000176)

CADD

median
PROVEAN
VEST median

PP2 median (IQR)
(IQR)
median (IQR)
(IQR)

ClinVar 2
0.141
12.6
−0.99
0.144

(0.002, 0.972)
(6.68, 18.6)
(−2.35, −0.26)
(0.063, 0.35)

ClinVar 3
0.073
12.9
−1.08
0.144

(0.003, 0.937)
(7.34, 18.2)
(−2.2, −0.345)
(0.0632, 0.326)

ClinVar 4
1
23.8
−4.72
0.925

(0.982, 1)
(19.6, 33)
(−6.8, −2.97)
(0.778, 0.974)

ClinVar 5
1
22.8
−4.76
0.941

(0.985, 1)
(18.6, 29.8)
(−6.67, −3.2)
(0.847, 0.981)

ClinVar 5.1
1
22.8
−4.82
0.945

(0.988, 1)
(18.8, 30)
(−6.72, −3.28)
(0.855, 0.982)

ClinVar 5.2
0.841
18.7
−1.96
0.463

(0.005, 0.998)
(11.9, 23)
(−3.66, −0.503)
(0.191, 0.814)

Other or unkown
0.979
21.1
−2.75
0.738

variants
(0.151, 1)
(14, 25)
(−4.73, −1.22)
(0.364, 0.923)

All ClinVar variants
0.995
21.2
−3.33
0.834

(0.329, 1)
(13.8, 25.6)
(−5.49, −1.44)
(0.381, 0.956)

VariBench Benign
0.243
17.2
−1.31
0.183

(0.003, 0.985)
(9.73, 23)
(−2.8, −0.39)
(0.07, 0.461)

VariBench
1
22.8
−4.71
0.935

Pathogenic
(0.975, 1)
(18.4, 27.3)
(−6.51, −2.97)
(0.828, 0.977)

VariBench
1
22.8
−4.75
0.937

Pathogenic group 1
(0.978, 1)
(18.4, 27.4)
(−6.53, −3.03)
(0.836, 0.977)

VariBench
0.601
21.4
−1.58
0.402

Pathogenic group 2
(0.00525, 0.992)
(11, 24.4)
(−2.88, −0.5)
(0.149, 0.728)

All VariBench
0.999
22.6
−4.34
0.918

variants
(0.892, 1)
(17.5, 26.8)
(−6.25, −2.51)
(0.739, 0.973)

All ExAC variants
0.771
16.2
−1.91
0.408

(0.023, 0.998)
(9.85, 22.7)
(−3.59, −0.76)
(0.166, 0.737)

TABLE 6

Summary VGD scores stratified by mutation type

VGD
Median # of
Median max

# of
VGD Median
mean
homozygotes
population
PP2 median

Variants
(IQR)
(sd)
(IQR)
frequency (IQR)
(IQR)

Unknown variants
331
0 (0, 0)
0.0181
0 (0, 0)
0.000129
NA (NA, NA)

(0.113)

(3.29e−05, 0.00043)

Non-coding change
360
0.02 (0, 0.07)
0.116
0 (0, 0)
0.000135
NA (NA, NA)

variants

(0.243)

(6.15e−05, 0.000443)

Coding unknown
3
0.02 (0.01, 0.07)
0.0467
0 (0, 48.5)
0.000123
NA (NA, NA)

variants

(0.0643)

(7.67e−05, 0.039)

Intergenic
66
0 (0, 0)
0.0395
0 (0, 0)
0.000207
NA (NA, NA)

(0.145)

(0.000133, 0.000638)

Intronic variants
7586
0 (0, 0.01)
0.0454
0 (0, 0)
9.02e−05
NA (NA, NA)

(0.141)

(2.09e−05, 0.000209)

Synonymous
63473
0.02 (0, 0.15)
0.0981
0 (0, 0)
9.61e−05
NA (NA, NA)

variants

(0.146)

(3e−05, 0.000219)

Non-essential splice
13915
0.02 (0, 0.11)
0.116
0 (0, 0)
8.7e−05
NA (NA, NA)

site varaints

(0.214)

(1.69e−05, 0.000184)

3′ variants
2984
0 (0, 0)
0.0109
0 (0, 0)
0.000133
NA (NA, NA)

(0.0646)

(6.06e−05, 0.000459)

5′ variants
1657
0 (0, 0)
0.0214
0 (0, 0)
9.94e−05
NA (NA, NA)

(0.107)

(1.98e−05, 0.000258)

Stoploss variants
105
0.36 (0.05, 0.83)
0.422
0 (0, 0)
6.07e−05
NA (NA, NA)

(0.364)

(1.67e−05, 0.000107)

Missense variants
132248
0.57 (0.23, 0.83)
0.534
0 (0, 0)
8.64e−05
0.771

(0.323)

(1.6e−05, 0.000172)
(0.023, 0.998)

Inframe indel
2353
0.82 (0.53, 0.97)
0.704
0 (0, 0)
7.57e−05
NA (NA, NA)

(0.311)

(1.54e−05, 0.000125)

Startloss variants
277
0.98 (0.98, 0.99)
0.943
0 (0, 0)
7.69e−05
NA (NA, NA)

(0.167)

(1.9e−05, 0.000136)

Essential splice site
4704
0.99 (0.99, 1)
0.954
0 (0, 0)
8.64e−05
NA (NA, NA)

variants (exon)

(0.167)

(1.59e−05, 0.000131)

Essential splice site
3276
1 (0.99, 1)
0.958
0 (0, 0)
6.06e−05
NA (NA, NA)

variants (intron)

(0.173)

(1.51e−05, 0.000116)

Nonsense variants
3812
1 (0.99, 1)
0.99
0 (0, 0)
6.06e−05
NA (NA, NA)

(0.064)

(1.51e−05, 0.000102)

Frameshift
5632
1 (0.99, 1)
0.986
0 (0, 0)
6.04e−05
NA (NA, NA)

(0.0825)

(1.5e−05, 9.93e−05)

All ExAC variants
242782
0.32 (0.03, 0.77)
0.403
0 (0, 0)
8.66e−05
0.771

(0.372)

(1.65e−05, 0.000176)
(0.023, 0.998)

CADD median
PROVEAN
VEST median

(IQR)
median (IQR)
(IQR)

Unknown variants
8.04
(4.17, 14.1)
NA
(NA, NA)
NA
(NA, NA)

Non-coding change
10.1
(5.25, 12.8)
NA
(NA, NA)
NA
(NA, NA)

variants

Coding unknown
10
(8.7, 11.8)
NA
(NA, NA)
NA
(NA, NA)

variants

Intergenic
4.01
(2.02, 6.76)
NA
(NA, NA)
NA
(NA, NA)

Intronic variants
6.72
(3.01, 11.5)
NA
(NA, NA)
NA
(NA, NA)

Synonymous
11.6
(6.54, 16.2)
NA
(NA, NA)
NA
(NA, NA)

variants

Non-essential splice
9.95
(5.67, 13.6)
NA
(NA, NA)
NA
(NA, NA)

site varaints

3′ variants
5.61
(2.74, 8.84)
NA
(NA, NA)
NA
(NA, NA)

5′ variants
11
(8.43, 15.3)
NA
(NA, NA)
NA
(NA, NA)

Stoploss variants
14.5
(11.3, 19.5)
NA
(NA, NA)
1
(1, 1)

Missense variants
20.3
(13, 24)
−1.88
(−3.53, −0.75)
0.403
(0.164, 0.729)

Inframe indel
19.2
(14.8, 22.4)
−5.9
(−10, −2.27)
1
(0.999, 1)

Startloss variants
16.9
(12.5, 21.9)
NA
(NA, NA)
0.86
(0.645, 0.936)

Essential splice site
19.5
(13.7, 23.5)
NA
(NA, NA)
1
(0.998, 1)

variants (exon)

Essential splice site
22
(15.9, 23.8)
NA
(NA, NA)
NA
(NA, NA)

variants (intron)

Nonsense variants
32.5
(19.4, 39)
NA
(NA, NA)
1
(1, 1)

Frameshift
23
(17.6, 35)
−2.92
(−6.68, −0.408)
1
(0.999, 1)

All ExAC variants
16.2
(9.85, 22.7)
−1.91
(−3.59, −0.76)
0.408
(0.166, 0.737)

TABLE 7

Bootstrap 99% one-sided confidence interval thresholds for each gene

# of

# of

Gene
ClinVar 5
Average VGD

ClinVar 5.1

Name
variants
naive
99% cutoff
99.9% cutoff
variants

PEX10
10
0.98
0.967
0.963
10

NPHP4
8
NA
NA
NA
8

PLEKHG5
7
NA
NA
NA
7

PLOD1
5
NA
NA
NA
5

ALPL
22
0.934545454545455
0.893636363636364
0.878180909090909
22

HSPG2
1
NA
NA
NA
1

HMGCL
5
NA
NA
NA
5

FUCA1
8
NA
NA
NA
8

SEPN1
9
NA
NA
NA
9

NDUFS5
0
NA
NA
NA
0

PPT1
8
NA
NA
NA
8

ZMPSTE24
9
NA
NA
NA
9

CLDN19
3
NA
NA
NA
3

P3H1
9
NA
NA
NA
8

MPL
9
NA
NA
NA
9

ST3GAL3
1
NA
NA
NA
1

MMACHC
13
0.98
0.964615384615385
0.958461538461538
13

POMGNT1
20
0.936
0.7965
0.736
19

CPT2
18
0.919444444444444
0.863333333333333
0.842777777777778
18

DHCR24
7
NA
NA
NA
7

ALG6
7
NA
NA
NA
6

SLC35D1
3
NA
NA
NA
3

ACADM
19
0.878421052631579
0.766842105263158
0.718947368421053
18

DPYD
9
NA
NA
NA
7

AGL
16
0.938125
0.755
0.693125
16

DBT
24
0.905
0.782083333333333
0.72916625
22

TSHB
3
NA
NA
NA
3

HSD3B2
10
0.968
0.932
0.920999
10

HFE2
8
NA
NA
NA
8

CTSK
7
NA
NA
NA
7

HAX1
8
NA
NA
NA
8

GBA
68
0.821323529411765
0.759704411764706
0.734116470588235
65

PKLR
8
NA
NA
NA
8

LMNA
66
0.78969696969697
0.714392424242424
0.691969545454546
64

NTRK1
15
0.662
0.395333333333333
0.295997333333333
13

MPZ
50
0.8368
0.78
0.7601984
50

CD247
6
NA
NA
NA
6

FASLG
3
NA
NA
NA
3

NPHS2
11
0.874545454545455
0.632727272727273
0.538165454545455
11

LAMC2
4
NA
NA
NA
4

LAMB3
14
0.84
0.66
0.592142857142857
14

USH2A
87
0.963103448275862
0.92321724137931
0.905860459770115
86

RAB3GAP2
5
NA
NA
NA
5

LBR
7
NA
NA
NA
7

ADCK3
14
0.890714285714286
0.696428571428571
0.634278571428571
13

GJC2
8
NA
NA
NA
8

TBCE
1
NA
NA
NA
1

LYST
47
0.990212765957447
0.976170212765957
0.970425531914894
47

FH
26
0.943076923076923
0.869226923076923
0.846536923076923
24

HADHA
9
NA
NA
NA
8

HADHB
5
NA
NA
NA
5

MPV17
26
0.971153846153846
0.955
0.947692307692308
26

SRD5A2
14
0.853571428571429
0.758571428571429
0.722854285714286
14

LRPPRC
1
NA
NA
NA
1

LHCGR
24
0.849583333333333
0.8125
0.799582083333333
24

PEX13
2
NA
NA
NA
2

ALMS1
6
NA
NA
NA
6

DGUOK
4
NA
NA
NA
4

MOGS
3
NA
NA
NA
3

SUCLG1
3
NA
NA
NA
3

SFTPB
1
NA
NA
NA
1

ST3GAL5
2
NA
NA
NA
2

EIF2AK3
2
NA
NA
NA
2

NPHP1
4
NA
NA
NA
4

IL1RN
2
NA
NA
NA
2

ERCC3
4
NA
NA
NA
4

RAB3GAP1
8
NA
NA
NA
8

ZEB2
62
0.952258064516129
0.896612903225806
0.872901935483871
61

NEB
7
NA
NA
NA
7

ABCB11
7
NA
NA
NA
7

LRP2
19
0.948947368421053
0.851052631578947
0.818421052631579
19

ITGA6
0
NA
NA
NA
0

CHRNA1
14
0.86
0.79
0.769997857142857
14

AGPS
4
NA
NA
NA
4

HIBCH
1
NA
NA
NA
1

STAT1
24
0.791666666666667
0.684583333333333
0.650411666666667
24

CASP10
6
NA
NA
NA
6

ALS2
23
0.996086956521739
0.990869565217391
0.988695652173913
23

ICOS
0
NA
NA
NA
0

FASTKD2
1
NA
NA
NA
1

ACADL
0
NA
NA
NA
0

CPS1
5
NA
NA
NA
5

ABCA12
13
0.908461538461538
0.688453846153846
0.602307692307692
12

BCS1L
13
0.913076923076923
0.856923076923077
0.834613076923077
13

CYP27A1
55
0.922363636363636
0.846905454545455
0.818545272727273
53

WNT10A
8
NA
NA
NA
6

NHEJ1
3
NA
NA
NA
3

COL4A4
6
NA
NA
NA
6

COL4A3
6
NA
NA
NA
6

SP110
12
0.926666666666667
0.8275
0.763333333333333
12

CHRND
8
NA
NA
NA
8

CHRNG
5
NA
NA
NA
5

COL6A3
8
NA
NA
NA
8

AGXT
16
0.8275
0.61936875
0.53187375
14

WNT7A
5
NA
NA
NA
5

XPC
2
NA
NA
NA
1

BTD
161
0.839813664596273
0.788941614906832
0.770683167701863
155

GLB1
42
0.97
0.947140476190476
0.935714285714286
42

CRTAP
5
NA
NA
NA
5

MYD88
3
NA
NA
NA
3

PTH1R
9
NA
NA
NA
9

TREX1
23
0.904782608695652
0.830869565217391
0.80608347826087
23

COL7A1
36
0.952777777777778
0.917222222222222
0.903333333333333
36

SLC25A20
9
NA
NA
NA
9

LAMB2
9
NA
NA
NA
9

AMT
7
NA
NA
NA
7

RFT1
1
NA
NA
NA
1

HESX1
8
NA
NA
NA
8

GBE1
16
0.95
0.820625
0.77625
16

POU1F1
17
0.974117647058824
0.933523529411765
0.919998235294118
17

HGD
17
0.985882352941176
0.971176470588235
0.965882352941176
17

IQCB1
10
0.995
0.98
0.975
10

ACAD9
6
NA
NA
NA
5

NPHP3
9
NA
NA
NA
9

PCCB
21
0.96
0.887619047619048
0.858571428571429
21

MRPS22
2
NA
NA
NA
2

ATR
4
NA
NA
NA
3

CLRN1
10
0.909
0.823
0.791
9

GFM1
4
NA
NA
NA
4

IFT80
5
NA
NA
NA
5

BCHE
11
0.592727272727273
0.325454545454545
0.251811818181818
8

DNAJC19
3
NA
NA
NA
3

ALG3
5
NA
NA
NA
4

CLDN1
0
NA
NA
NA
0

IDUA
30
0.726333333333333
0.572666666666667
0.535330666666667
30

DOK7
12
0.970833333333333
0.895
0.87
12

EVC2
6
NA
NA
NA
6

EVC
5
NA
NA
NA
4

SGCB
7
NA
NA
NA
7

SRD5A3
6
NA
NA
NA
6

GNRHR
18
0.798333333333333
0.68555
0.646665
18

FRAS1
4
NA
NA
NA
4

ANTXR2
8
NA
NA
NA
8

COQ2
4
NA
NA
NA
4

DMP1
4
NA
NA
NA
4

HADH
4
NA
NA
NA
4

PRSS12
0
NA
NA
NA
0

MFSD8
8
NA
NA
NA
8

MMAA
4
NA
NA
NA
4

FGA
9
NA
NA
NA
9

ETFDH
11
0.97
0.943636363636364
0.933636363636364
11

AGA
8
NA
NA
NA
8

TLR3
1
NA
NA
NA
0

F11
13
0.857692307692308
0.680769230769231
0.598460769230769
13

NDUFS6
1
NA
NA
NA
1

NSUN2
3
NA
NA
NA
3

AMACR
2
NA
NA
NA
2

LIFR
2
NA
NA
NA
2

OXCT1
7
NA
NA
NA
7

MOCS2
14
0.926428571428571
0.712857142857143
0.640714285714286
14

NDUFS4
4
NA
NA
NA
4

ERCC8
5
NA
NA
NA
5

NDUFAF2
1
NA
NA
NA
1

SMN2
0
NA
NA
NA
0

SMN1
18
0.798888888888889
0.62
0.542776666666667
18

HEXB
13
0.832307692307692
0.609984615384615
0.53692
11

AP3B1
5
NA
NA
NA
5

ARSB
18
0.94
0.809444444444444
0.762221111111111
18

ADGRV1
19
0.945789473684211
0.787894736842105
0.735263157894737
19

HSD17B4
11
0.910909090909091
0.779090909090909
0.735451818181818
10

ALDH7A1
7
NA
NA
NA
7

SLC22A5
24
0.861666666666667
0.72375
0.674999166666667
23

UQCRQ
1
NA
NA
NA
1

SIL1
4
NA
NA
NA
4

SLC26A2
14
0.976428571428571
0.959285714285714
0.953571428571429
14

IL12B
1
NA
NA
NA
1

NSD1
130
0.986230769230769
0.964230769230769
0.956230769230769
130

PROP1
15
0.992666666666667
0.986666666666667
0.984666666666667
15

DSP
20
0.9185
0.783
0.7334965
19

NHLRC1
8
NA
NA
NA
8

ALDH5A1
3
NA
NA
NA
3

NEU1
13
0.864615384615385
0.781530769230769
0.753845384615385
13

CYP21A1P
0
NA
NA
NA
0

CYP21A2
19
0.702631578947368
0.539468421052632
0.498941578947368
18

MOCS1
7
NA
NA
NA
7

MUT
15
0.962666666666667
0.908
0.886665333333333
15

PKHD1
38
0.959736842105263
0.900263157894737
0.878420526315789
38

RAB23
1
NA
NA
NA
1

SLC17A5
8
NA
NA
NA
8

BCKDHB
25
0.986
0.974
0.9695996
25

SLC35A1
0
NA
NA
NA
0

NDUFAF4
1
NA
NA
NA
1

GRIK2
0
NA
NA
NA
0

PDSS2
2
NA
NA
NA
2

OSTM1
1
NA
NA
NA
1

TSPYL1
0
NA
NA
NA
0

LAMA2
34
0.980882352941176
0.925588235294118
0.906764705882353
34

ENPP1
13
0.980769230769231
0.96
0.951537692307692
13

AHI1
15
0.994
0.988
0.985333333333333
15

PEX7
14
0.808571428571429
0.5757
0.51357
14

IFNGR1
14
0.962857142857143
0.920714285714286
0.904285714285714
14

STX11
3
NA
NA
NA
3

EPM2A
7
NA
NA
NA
7

GTF2H5
2
NA
NA
NA
2

PLG
11
0.837272727272727
0.610909090909091
0.509999090909091
10

FAM20C
1
NA
NA
NA
1

FAM126A
3
NA
NA
NA
3

DDC
6
NA
NA
NA
6

GUSB
19
0.887894736842105
0.754731578947368
0.709472105263158
18

ASL
11
0.905454545454546
0.764545454545455
0.701817272727273
11

SBDS
13
0.972307692307692
0.934615384615385
0.919999230769231
13

POR
8
NA
NA
NA
8

ABCB4
17
0.878235294117647
0.698817647058824
0.597058235294118
15

PEX1
11
0.921818181818182
0.846363636363636
0.824545454545455
11

COL1A2
26
0.962307692307692
0.929615384615385
0.91846
26

RELN
4
NA
NA
NA
4

SLC26A4
46
0.940217391304348
0.888910869565217
0.872391086956522
46

DLD
11
0.969090909090909
0.941818181818182
0.932724545454545
11

CFTR
255
0.904862745098039
0.870509803921569
0.858862470588235
247

CLN8
5
NA
NA
NA
5

TUSC3
2
NA
NA
NA
2

SFTPC
6
NA
NA
NA
5

ESCO2
32
0.9340625
0.8096875
0.7778125
32

STAR
13
0.871538461538461
0.666923076923077
0.575383076923077
13

HGSNAT
12
0.974166666666667
0.9375
0.925833333333333
12

TTPA
16
0.9125
0.804375
0.761868125
16

GDAP1
17
0.857647058823529
0.758235294117647
0.72411705882353
17

CA2
5
NA
NA
NA
4

CNGB3
8
NA
NA
NA
8

NBN
34
0.9629411876470588
0.891467647058824
0.869705882352941
33

TMEM67
12
0.960833333333333
0.916666666666667
0.902498333333333
12

PDP1
1
NA
NA
NA
1

UQCRB
0
NA
NA
NA
0

VPS13B
37
0.961351351351351
0.895405405405405
0.867837837837838
36

RRM2B
37
0.944054054054054
0.885672972972973
0.866756486486487
37

TNFRSF11B
2
NA
NA
NA
2

TRAPPC9
3
NA
NA
NA
3

CYP11B1
10
0.905
0.822
0.797999
10

PLEC
5
NA
NA
NA
5

DOCK8
1
NA
NA
NA
1

VLDLR
5
NA
NA
NA
5

GLDC
10
0.973
0.956
0.950999
10

APTX
7
NA
NA
NA
7

B4GALT1
0
NA
NA
NA
0

GALT
231
0.858484848484849
0.822897835497835
0.810085454545455
228

RMRP
13
0.509230769230769
0.236923076923077
0.168461538461538
13

GNE
17
0.752352941176471
0.597052941176471
0.548234117647059
16

GRHPR
3
NA
NA
NA
3

AUH
8
NA
NA
NA
8

FANCC
17
0.977647058823529
0.935882352941176
0.919997647058824
17

HSD17B3
9
NA
NA
NA
9

XPA
4
NA
NA
NA
4

ALG2
3
NA
NA
NA
3

INVS
4
NA
NA
NA
4

ALDOB
16
0.945625
0.88124375
0.856874375
16

FKTN
18
0.928333333333333
0.860555555555556
0.832776111111111
18

IKBKAP
3
NA
NA
NA
3

NR5A1
23
0.79695652173913
0.635652173913044
0.58347652173913
23

GLE1
4
NA
NA
NA
4

DOLK
5
NA
NA
NA
5

ASS1
22
0.887727272727273
0.781363636363636
0.739089090909091
22

POMT1
18
0.911666666666667
0.786111111111111
0.733331111111111
18

SURF1
4
NA
NA
NA
4

ADAMTS13
23
0.84695652173913
0.693039130434783
0.635646086956522
21

ADAMTSL2
7
NA
NA
NA
7

LHX3
7
NA
NA
NA
7

DCLRE1C
4
NA
NA
NA
4

PDSS1
1
NA
NA
NA
1

ERCC6
12
0.7825
0.504166666666667
0.403333333333333
11

PCDH15
18
0.991111111111111
0.98
0.976110555555556
18

EGR2
9
NA
NA
NA
9

NEUROG3
2
NA
NA
NA
2

PRF1
12
0.738333333333333
0.48
0.386665833333333
10

CDH23
34
0.893823529411765
0.785585294117647
0.736762352941177
34

PSAP
8
NA
NA
NA
8

MRPS16
1
NA
NA
NA
1

PTEN
60
0.949166666666667
0.903996666666667
0.886333333333333
60

FAS
21
0.780952380952381
0.614285714285714
0.56142619047619
21

PLCE1
8
NA
NA
NA
8

COX15
2
NA
NA
NA
2

C10orf2
16
0.811875
0.698125
0.650623125
15

CYP17A1
23
0.807826086956522
0.727391304347826
0.69478
23

COL17A1
6
NA
NA
NA
6

UROS
14
0.945714285714286
0.905
0.892142857142857
14

SLC25A22
3
NA
NA
NA
3

CTSD
3
NA
NA
NA
3

TH
9
NA
NA
NA
9

STIM1
9
NA
NA
NA
9

HBB
140
0.888071428571429
0.8415
0.825142
137

SMPD1
30
0.936333333333333
0.897996666666667
0.882333
30

TPP1
10
0.879
0.781
0.742999
10

ABCC8
34
0.837647058823529
0.744117647058824
0.707939705882353
34

USH1C
10
0.805
0.53595
0.446998
6

PDHX
7
NA
NA
NA
7

RAG1
19
0.872631578947368
0.823684210526316
0.805256842105263
19

RAG2
9
NA
NA
NA
8

SLC35C1
4
NA
NA
NA
4

DDB2
4
NA
NA
NA
4

RAPSN
9
NA
NA
NA
9

NDUFS3
2
NA
NA
NA
2

TMEM216
4
NA
NA
NA
4

FERMT3
4
NA
NA
NA
4

PYGM
23
0.943913043478261
0.816086956521739
0.769997391304348
22

RNASEH2C
2
NA
NA
NA
2

EFEMP2
12
0.975
0.955833333333333
0.9475
12

BBS1
10
0.648
0.315
0.193995
9

PC
14
0.889285714285714
0.790714285714286
0.755712857142857
14

NDUFV1
3
NA
NA
NA
3

UNC93B1
0
NA
NA
NA
0

NDUFS8
7
NA
NA
NA
7

TCIRG1
4
NA
NA
NA
4

CPT1A
34
0.899705882352941
0.775882352941176
0.720293529411765
33

IGHMBP2
8
NA
NA
NA
8

DHCR7
24
0.888333333333333
0.782495833333333
0.736664583333333
24

FOLR1
3
NA
NA
NA
2

MYO7A
63
0.982063492063492
0.960952380952381
0.951111111111111
62

ALG8
5
NA
NA
NA
5

DYNC2H1
21
0.932857142857143
0.842371428571429
0.801428571428571
21

ACAT1
19
0.932631578947368
0.869473684210526
0.841052105263158
19

ATM
143
0.956713286713287
0.919300699300699
0.906713076923077
142

ALG9
2
NA
NA
NA
2

CD3E
3
NA
NA
NA
3

CD3D
3
NA
NA
NA
3

CD3G
4
NA
NA
NA
4

SLC37A4
12
0.921666666666667
0.829158333333333
0.7858325
12

DPAGT1
12
0.889166666666667
0.819166666666667
0.794165833333333
12

SC5D
3
NA
NA
NA
3

KCNJ1
8
NA
NA
NA
7

SCNN1A
5
NA
NA
NA
4

PEX5
2
NA
NA
NA
2

FGD4
10
0.982
0.965
0.959
10

VDR
14
0.947142857142857
0.89
0.866427142857143
14

TUBA1A
12
0.855833333333333
0.800825
0.776666666666667
12

AAAS
7
NA
NA
NA
7

SUOX
3
NA
NA
NA
3

ERBB3
0
NA
NA
NA
0

CYP27B1
16
0.941875
0.860625
0.82875
16

TSFM
4
NA
NA
NA
4

BBS10
12
0.895833333333333
0.781666666666667
0.739166666666667
12

CEP290
23
0.909565217391304
0.744347826086956
0.697391304347826
23

GNPTAB
188
0.954627659574468
0.926755319148936
0.91664829787234
187

PAH
78
0.91051282051282
0.858202564102564
0.841282051282051
78

MMAB
3
NA
NA
NA
3

MVK
17
0.907058823529412
0.804117647058823
0.761760588235294
17

ACADS
17
0.798235294117647
0.605294117647059
0.532939411764706
15

ORAI1
2
NA
NA
NA
2

HPD
7
NA
NA
NA
5

ATP6V0A2
11
0.998181818181818
0.995454545454545
0.993636363636364
11

GJB2
73
0.808219178082192
0.744928767123288
0.721232602739726
67

SACS
11
0.952727272727273
0.894545454545455
0.873635454545455
11

FREM2
2
NA
NA
NA
2

SLC25A15
17
0.944705882352941
0.9
0.879998235294118
17

SUCLA2
4
NA
NA
NA
4

RNASEH2B
2
NA
NA
NA
2

ATP7B
36
0.894722222222222
0.789997222222222
0.748608888888889
34

CLN5
8
NA
NA
NA
8

EDNRB
4
NA
NA
NA
4

PCCA
8
NA
NA
NA
8

ERCC5
10
0.983
0.964
0.956
10

LIG4
7
NA
NA
NA
5

TGM1
38
0.904736842105263
0.824207894736842
0.793419473684211
36

FOXG1
16
0.99
0.965
0.95625
16

MGAT2
4
NA
NA
NA
4

NPC2
16
0.9125
0.768125
0.712496875
16

POMT2
19
0.969473684210526
0.935263157894737
0.922631052631579
19

VIPAS39
4
NA
NA
NA
4

GALC
9
NA
NA
NA
8

FBLN5
12
0.833333333333333
0.649991666666667
0.5883325
12

UBE3A
150
0.975466666666667
0.9604
0.955733066666667
150

SLC12A6
14
1
1
1
14

IVD
8
NA
NA
NA
8

UBR1
2
NA
NA
NA
2

BLOC1S6
1
NA
NA
NA
1

SLC12A1
3
NA
NA
NA
3

MYO5A
0
NA
NA
NA
0

RAB27A
5
NA
NA
NA
5

CLN6
13
0.871538461538462
0.77
0.73923
13

HEXA
49
0.952040816326531
0.903061224489796
0.886325714285714
47

STRA6
18
0.908333333333333
0.832222222222222
0.809439444444444
18

CYP11A1
6
NA
NA
NA
6

MPI
4
NA
NA
NA
4

ETFA
3
NA
NA
NA
3

FAH
16
0.798125
0.575625
0.487499375
15

POLG
25
0.8424
0.707996
0.6575992
24

BLM
45
0.987555555555556
0.964
0.954221333333333
45

VPS33B
6
NA
NA
NA
6

HBA2
17
0.811176470588235
0.637052941176471
0.577646470588235
16

HBA1
3
NA
NA
NA
3

CLCN7
7
NA
NA
NA
7

ABCA3
6
NA
NA
NA
6

MEFV
18
0.251111111111111
0.121666666666667
0.0894438888888889
14

ALG1
8
NA
NA
NA
8

PMM2
24
0.867916666666667
0.812083333333333
0.792916666666667
24

ERCC4
11
0.984545454545455
0.966363636363636
0.959090909090909
11

SCNN1G
3
NA
NA
NA
1

SCNN1B
12
0.861666666666667
0.719166666666667
0.6733325
11

COG7
0
NA
NA
NA
0

CLN3
4
NA
NA
NA
4

TUFM
1
NA
NA
NA
1

CD19
0
NA
NA
NA
0

RPGRIP1L
13
0.897692307692308
0.684607692307692
0.609215384615385
12

COQ9
1
NA
NA
NA
1

TK2
36
0.906111111111111
0.849166666666667
0.8313875
36

HSD11B2
9
NA
NA
NA
9

COG8
1
NA
NA
NA
1

TAT
5
NA
NA
NA
5

GOSH
0
NA
NA
NA
0

ZNF469
21
0.348095238095238
0.214761904761905
0.185237619047619
20

ASPA
9
NA
NA
NA
9

CTNS
19
0.898421052631579
0.752105263157895
0.69841
18

ACADVL
25
0.8488
0.702392
0.6483972
24

MPDU1
4
NA
NA
NA
4

SCO1
3
NA
NA
NA
3

COX10
6
NA
NA
NA
6

PMP22
16
0.963125
0.93936875
0.929999375
15

ALDH3A2
11
0.976363636363636
0.941818181818182
0.929090909090909
11

FOXN1
1
NA
NA
NA
1

PEX12
9
NA
NA
NA
9

NAGLU
18
0.891666666666667
0.836111111111111
0.819443333333333
18

G6PC
23
0.900869565217391
0.775652173913043
0.730433913043478
22

NAGS
8
NA
NA
NA
8

G6PC3
6
NA
NA
NA
6

WNT3
1
NA
NA
NA
1

PNPO
3
NA
NA
NA
3

SGCA
8
NA
NA
NA
8

COL1A1
55
0.937090909090909
0.907636363636364
0.897272181818182
55

MKS1
8
NA
NA
NA
8

TRIM37
4
NA
NA
NA
4

COG1
0
NA
NA
NA
0

USH1G
7
NA
NA
NA
7

TSEN54
9
NA
NA
NA
9

ITGB4
9
NA
NA
NA
9

GALK1
5
NA
NA
NA
5

UNC13D
3
NA
NA
NA
3

ACOX1
7
NA
NA
NA
7

GAA
29
0.910689655172414
0.839996551724138
0.816895862068965
28

SGSH
10
0.887
0.803
0.774
10

NPC1
46
0.922173913043478
0.852391304347826
0.824782173913043
46

LAMA3
4
NA
NA
NA
4

TCF4
24
0.991666666666667
0.982916666666667
0.979583333333333
24

ATP8B1
12
0.820833333333333
0.6225
0.5458325
12

NDUFS7
2
NA
NA
NA
2

GAMT
6
NA
NA
NA
6

INSR
31
0.910645161290323
0.814516129032258
0.773225483870968
30

MCOLN1
5
NA
NA
NA
5

STXBP2
3
NA
NA
NA
3

NDUFA7
0
NA
NA
NA
0

TYK2
0
NA
NA
NA
0

MAN2B1
12
0.975
0.936666666666667
0.9233325
12

RNASEH2A
12
0.813333333333333
0.689158333333333
0.6349975
10

GCDH
11
0.883636363636364
0.773636363636364
0.726362727272727
11

JAK3
5
NA
NA
NA
5

IL12RB1
5
NA
NA
NA
5

CRLF1
18
0.932222222222222
0.829444444444444
0.781665
17

HAMP
1
NA
NA
NA
1

COX6B1
1
NA
NA
NA
1

NPHS1
9
NA
NA
NA
8

DLL3
3
NA
NA
NA
3

PRX
10
0.966
0.875
0.844
10

BCKDHA
32
0.953125
0.916875
0.9021875
32

ETHE1
5
NA
NA
NA
5

ERCC2
12
0.880833333333333
0.7725
0.723333333333333
11

OPA3
7
NA
NA
NA
7

FKRP
20
0.7825
0.626995
0.557499
20

NUP62
1
NA
NA
NA
1

ETFB
2
NA
NA
NA
2

SLC4A11
16
0.92625
0.85875
0.8375
16

PANK2
12
0.914166666666667
0.805833333333333
0.765
12

DNMT3B
9
NA
NA
NA
9

GSS
6
NA
NA
NA
6

SAMHD1
19
0.985789473684211
0.966315789473684
0.95999947368421
19

ADA
26
0.841538461538462
0.725
0.675766538461538
24

DPM1
4
NA
NA
NA
4

EDN3
4
NA
NA
NA
4

IFNGR2
5
NA
NA
NA
5

HLCS
9
NA
NA
NA
9

CBS
16
0.84625
0.78
0.753123125
16

CSTB
9
NA
NA
NA
9

AIRE
11
0.984545454545455
0.969090909090909
0.963636363636364
11

COL6A1
10
0.92
0.779
0.707999
10

COL6A2
21
0.842857142857143
0.7
0.642849047619048
21

PEX26
7
NA
NA
NA
7

SNAP29
1
NA
NA
NA
1

LARGE
4
NA
NA
NA
4

PLA2G6
37
0.938378378378378
0.877024324324324
0.847566216216216
37

ALG12
6
NA
NA
NA
6

MLC1
12
0.890833333333333
0.660833333333333
0.561665
12

SCO2
8
NA
NA
NA
7

TYMP
9
NA
NA
NA
9

ARSA
58
0.867241379310345
0.783443103448276
0.752068103448276
56

All
6793
0.902891211541292
0.897057205947299
0.89538492713087
6653

Average VGD

Gene
naive (only using
99% cutoff (only
99.9% cutoff (only

Name
CV 5.1)
using CV 5.1)
using CV 5.1)

PEX10
0.98
0.967
0.962

NPHP4
NA
NA
NA

PLEKHG5
NA
NA
NA

PLOD1
NA
NA
NA

ALPL
0.934545454545455
0.893636363636364
0.874545

HSPG2
NA
NA
NA

HMGCL
NA
NA
NA

FUCA1
NA
NA
NA

SEPN1
NA
NA
NA

NDUFS5
NA
NA
NA

PPT1
NA
NA
NA

ZMPSTE24
NA
NA
NA

CLDN19
NA
NA
NA

P3H1
NA
NA
NA

MPL
NA
NA
NA

ST3GAL3
NA
NA
NA

MMACHC
0.98
0.963846153846154
0.956923076923077

POMGNT1
0.985263157894737
0.948421052631579
0.934210526315789

CPT2
0.919444444444444
0.864444444444444
0.844998888888889

DHCR24
NA
NA
NA

ALG6
NA
NA
NA

SLC35D1
NA
NA
NA

ACADM
0.911666666666667
0.836661111111111
0.805554444444445

DPYD
NA
NA
NA

AGL
0.938125
0.755
0.693125

DBT
0.954545454545455
0.905909090909091
0.887271363636364

TSHB
NA
NA
NA

HSD3B2
0.968
0.935
0.920999

HFE2
NA
NA
NA

CTSK
NA
NA
NA

HAX1
NA
NA
NA

GBA
0.845230769230769
0.790921538461538
0.769383846153846

PKLR
NA
NA
NA

LMNA
0.80421875
0.7328125
0.710624375

NTRK1
0.762307692307692
0.491538461538462
0.396920769230769

MPZ
0.8368
0.779198
0.7575998

CD247
NA
NA
NA

FASLG
NA
NA
NA

NPHS2
0.874545454545455
0.627254545454546
0.515454545454545

LAMC2
NA
NA
NA

LAMB3
0.84
0.652142857142857
0.597850714285714

USH2A
0.974186046511628
0.945581395348837
0.932790232558139

RAB3GAP2
NA
NA
NA

LBR
NA
NA
NA

ADCK3
0.959230769230769
0.905384615384615
0.883074615384615

GJC2
NA
NA
NA

TBCE
NA
NA
NA

LYST
0.990212765957447
0.976168085106383
0.970425106382979

FH
0.939583333333333
0.8625
0.831665833333333

HADHA
NA
NA
NA

HADHB
NA
NA
NA

MPV17
0.971153846153846
0.954230769230769
0.947307692307692

SRD5A2
0.853571428571429
0.755707142857143
0.717857142857143

LRPPRC
NA
NA
NA

LHCGR
0.849583333333333
0.812495833333333
0.80124875

PEX13
NA
NA
NA

ALMS1
NA
NA
NA

DGUOK
NA
NA
NA

MOGS
NA
NA
NA

SUCLG1
NA
NA
NA

SFTPB
NA
NA
NA

ST3GAL5
NA
NA
NA

EIF2AK3
NA
NA
NA

NPHP1
NA
NA
NA

IL1RN
NA
NA
NA

ERCC3
NA
NA
NA

RAB3GAP1
NA
NA
NA

ZEB2
0.967868852459016
0.92917868852459
0.91295

NEB
NA
NA
NA

ABCB11
NA
NA
NA

LRP2
0.948947368421053
0.851578947368421
0.803157894736842

ITGA6
NA
NA
NA

CHRNA1
0.86
0.792142857142857
0.768571428571429

AGPS
NA
NA
NA

HIBCH
NA
NA
NA

STAT1
0.791666666666667
0.687495833333333
0.65583125

CASP10
NA
NA
NA

ALS2
0.996086956521739
0.990869565217391
0.988695652173913

ICOS
NA
NA
NA

FASTKD2
NA
NA
NA

ACADL
NA
NA
NA

CPS1
NA
NA
NA

ABCA12
0.984166666666667
0.950833333333333
0.938333333333333

BCS1L
0.913076923076923
0.856923076923077
0.839999230769231

CYP27A1
0.957169811320755
0.911884905660377
0.894904528301887

WNT10A
NA
NA
NA

NHEJ1
NA
NA
NA

COL4A4
NA
NA
NA

COL4A3
NA
NA
NA

SP110
0.926666666666667
0.8275
0.795

CHRND
NA
NA
NA

CHRNG
NA
NA
NA

COL6A3
NA
NA
NA

AGXT
0.945
0.898571428571429
0.877142857142857

WNT7A
NA
NA
NA

XPC
NA
NA
NA

BTD
0.871935483870968
0.830255483870968
0.81419335483871

GLB1
0.97
0.947140476190476
0.937618571428571

CRTAP
NA
NA
NA

MYD88
NA
NA
NA

PTH1R
NA
NA
NA

TREX1
0.904782608695652
0.832169565217391
0.800869130434783

COL7A1
0.952777777777778
0.916388888888889
0.903333055555556

SLC25A20
NA
NA
NA

LAMB2
NA
NA
NA

AMT
NA
NA
NA

RFT1
NA
NA
NA

HESX1
NA
NA
NA

GBE1
0.95
0.820625
0.775625

POU1F1
0.974117647058824
0.932352941176471
0.917058823529412

HGD
0.985882352941176
0.971176470588235
0.964705882352941

IQCB1
0.995
0.98
0.975

ACAD9
NA
NA
NA

NPHP3
NA
NA
NA

PCCB
0.96
0.887142857142857
0.861428095238095

MRPS22
NA
NA
NA

ATR
NA
NA
NA

CLRN1
NA
NA
NA

GFM1
NA
NA
NA

IFT80
NA
NA
NA

BCHE
NA
NA
NA

DNAJC19
NA
NA
NA

ALG3
NA
NA
NA

CLDN1
NA
NA
NA

IDUA
0.726333333333333
0.579663333333333
0.525664

DOK7
0.970833333333333
0.895
0.868333333333333

EVC2
NA
NA
NA

EVC
NA
NA
NA

SGCB
NA
NA
NA

SRD5A3
NA
NA
NA

GNRHR
0.798333333333333
0.685555555555556
0.646666111111111

FRAS1
NA
NA
NA

ANTXR2
NA
NA
NA

COQ2
NA
NA
NA

DMP1
NA
NA
NA

HADH
NA
NA
NA

PRSS12
NA
NA
NA

MFSD8
NA
NA
NA

MMAA
NA
NA
NA

FGA
NA
NA
NA

ETFDH
0.97
0.943636363636364
0.934545454545455

AGA
NA
NA
NA

TLR3
NA
NA
NA

F11
0.857692307692308
0.683846153846154
0.616152307692308

NDUFS6
NA
NA
NA

NSUN2
NA
NA
NA

AMACR
NA
NA
NA

LIFR
NA
NA
NA

OXCT1
NA
NA
NA

MOCS2
0.926428571428571
0.712857142857143
0.641428571428571

NDUFS4
NA
NA
NA

ERCC8
NA
NA
NA

NDUFAF2
NA
NA
NA

SMN2
NA
NA
NA

SMN1
0.798888888888889
0.612216666666667
0.55222

HEXB
0.955454545454545
0.893627272727273
0.871817272727273

AP3B1
NA
NA
NA

ARSB
0.94
0.811111111111111
0.760555

ADGRV1
0.945789473684211
0.788421052631579
0.736315263157895

HSD17B4
0.952
0.854
0.819999

ALDH7A1
NA
NA
NA

SLC22A5
0.898695652173913
0.796517391304348
0.763912608695652

UQCRQ
NA
NA
NA

SIL1
NA
NA
NA

SLC26A2
0.976428571428571
0.959285714285714
0.952142857142857

IL12B
NA
NA
NA

NSD1
0.986230769230769
0.964306923076923
0.951999615384615

PROP1
0.992666666666667
0.986666666666667
0.984666666666667

DSP
0.965263157894737
0.926842105263158
0.911052105263158

NHLRC1
NA
NA
NA

ALDH5A1
NA
NA
NA

NEU1
0.864615384615385
0.781538461538462
0.750768461538462

CYP21A1P
NA
NA
NA

CYP21A2
0.687777777777778
0.517772222222222
0.460550555555556

MOCS1
NA
NA
NA

MUT
0.962666666666667
0.907333333333333
0.884666666666667

PKHD1
0.959736842105263
0.898684210526316
0.874736578947368

RAB23
NA
NA
NA

SLC17A5
NA
NA
NA

BCKDHB
0.986
0.9748
0.97

SLC35A1
NA
NA
NA

NDUFAF4
NA
NA
NA

GRIK2
NA
NA
NA

PDSS2
NA
NA
NA

OSTM1
NA
NA
NA

TSPYL1
NA
NA
NA

LAMA2
0.980882352941176
0.925585294117647
0.906470588235294

ENPP1
0.980769230769231
0.960769230769231
0.953076923076923

AHI1
0.994
0.988
0.985333333333333

PEX7
0.808571428571429
0.572135714285714
0.498568571428571

IFNGR1
0.962857142857143
0.918571428571429
0.899999285714286

STX11
NA
NA
NA

EPM2A
NA
NA
NA

GTF2H5
NA
NA
NA

PLG
0.917
0.805
0.769

FAM20C
NA
NA
NA

FAM126A
NA
NA
NA

DDC
NA
NA
NA

GUSB
0.933333333333333
0.891666666666667
0.880555555555556

ASL
0.905454545454546
0.764545454545455
0.70818

SBDS
0.972307692307692
0.934615384615385
0.920766153846154

POR
NA
NA
NA

ABCB4
0.988666666666667
0.974
0.967333333333333

PEX1
0.921818181818182
0.85
0.820906363636364

COL1A2
0.962307692307692
0.93
0.916537692307692

RELN
NA
NA
NA

SLC26A4
0.940217391304348
0.888258695652174
0.869346086956522

DLD
0.969090909090909
0.940909090909091
0.93

CFTR
0.932186234817814
0.906760728744939
0.898137044534413

CLN8
NA
NA
NA

TUSC3
NA
NA
NA

SFTPC
NA
NA
NA

ESCO2
0.9340625
0.810309375
0.7778121875

STAR
0.871538461538461
0.668461538461538
0.577689230769231

HGSNAT
0.974166666666667
0.938333333333333
0.9233275

TTPA
0.9125
0.801875
0.764374375

GDAP1
0.857647058823529
0.759411764705882
0.72529294117647

CA2
NA
NA
NA

CNGB3
NA
NA
NA

NBN
0.98030303030303
0.922121212121212
0.90272696969697

TMEM67
0.960833333333333
0.918333333333333
0.906665833333333

PDP1
NA
NA
NA

UQCRB
NA
NA
NA

VPS13B
0.983055555555556
0.961666666666667
0.952221666666667

RRM2B
0.944054054054054
0.886751351351351
0.863512972972973

TNFRSF11B
NA
NA
NA

TRAPPC9
NA
NA
NA

CYP11B1
0.905
0.824
0.804999

PLEC
NA
NA
NA

DOCK8
NA
NA
NA

VLDLR
NA
NA
NA

GLDC
0.973
0.957
0.950999

APTX
NA
NA
NA

B4GALT1
NA
NA
NA

GALT
0.866798245614035
0.833113596491228
0.820482456140351

RMRP
0.509230769230769
0.234615384615385
0.168458461538462

GNE
0.795625
0.688125
0.65499875

GRHPR
NA
NA
NA

AUH
NA
NA
NA

FANCC
0.977647058823529
0.937052941176471
0.918235294117647

HSD17B3
NA
NA
NA

XPA
NA
NA
NA

ALG2
NA
NA
NA

INVS
NA
NA
NA

ALDOB
0.945625
0.88
0.854375

FKTN
0.928333333333333
0.86
0.841665555555556

IKBKAP
NA
NA
NA

NR5A1
0.79695652173913
0.632608695652174
0.574779130434783

GLE1
NA
NA
NA

DOLK
NA
NA
NA

ASS1
0.887727272727273
0.783636363636364
0.753174090909091

POMT1
0.911666666666667
0.779438888888889
0.729442777777778

SURF1
NA
NA
NA

ADAMTS13
0.927619047619048
0.88
0.863332857142857

ADAMTSL2
NA
NA
NA

LHX3
NA
NA
NA

DCLRE1C
NA
NA
NA

PDSS1
NA
NA
NA

ERCC6
0.853636363636364
0.635454545454545
0.497272727272727

PCDH15
0.991111111111111
0.980555555555556
0.976111111111111

EGR2
NA
NA
NA

NEUROG3
NA
NA
NA

PRF1
0.886
0.785
0.754999

CDH23
0.893823529411765
0.783235294117647
0.744115588235294

PSAP
NA
NA
NA

MRPS16
NA
NA
NA

PTEN
0.949166666666667
0.905833333333333
0.889665666666667

FAS
0.780952380952381
0.618090476190476
0.564284285714286

PLCE1
NA
NA
NA

COX15
NA
NA
NA

C10orf2
0.841333333333333
0.74466
0.714665333333333

CYP17A1
0.807826086956522
0.724347826086956
0.694781739130435

COL17A1
NA
NA
NA

UROS
0.945714285714286
0.905
0.889999285714286

SLC25A22
NA
NA
NA

CTSD
NA
NA
NA

TH
NA
NA
NA

STIM1
NA
NA
NA

HBB
0.891240875912409
0.842260583941606
0.825764890510949

SMPD1
0.936333333333333
0.897333333333333
0.879666

TPP1
0.879
0.78399
0.752

ABCC8
0.837647058823529
0.743526470588235
0.708529117647059

USH1C
NA
NA
NA

PDHX
NA
NA
NA

RAG1
0.872631578947368
0.823684210526316
0.806841052631579

RAG2
NA
NA
NA

SLC35C1
NA
NA
NA

DDB2
NA
NA
NA

RAPSN
NA
NA
NA

NDUFS3
NA
NA
NA

TMEM216
NA
NA
NA

FERMT3
NA
NA
NA

PYGM
0.986363636363636
0.973636363636364
0.966817727272727

RNASEH2C
NA
NA
NA

EFEMP2
0.975
0.956666666666667
0.950833333333333

BBS1
NA
NA
NA

PC
0.889285714285714
0.794285714285714
0.757849285714286

NDUFV1
NA
NA
NA

UNC93B1
NA
NA
NA

NDUFS8
NA
NA
NA

TCIRG1
NA
NA
NA

CPT1A
0.926969696969697
0.811509090909091
0.765443333333333

IGHMBP2
NA
NA
NA

DHCR7
0.888333333333333
0.783745833333333
0.740827916666667

FOLR1
NA
NA
NA

MYO7A
0.988387096774194
0.976774193548387
0.971935322580645

ALG8
NA
NA
NA

DYNC2H1
0.932857142857143
0.840471428571429
0.80666619047619

ACAT1
0.932631578947368
0.867894736842105
0.845262105263158

ATM
0.963450704225352
0.926688732394366
0.912183098591549

ALG9
NA
NA
NA

CD3E
NA
NA
NA

CD3D
NA
NA
NA

CD3G
NA
NA
NA

SLC37A4
0.921666666666667
0.828333333333333
0.784996666666667

DPAGT1
0.889166666666667
0.819166666666667
0.791665833333333

SC5D
NA
NA
NA

KCNJ1
NA
NA
NA

SCNN1A
NA
NA
NA

PEX5
NA
NA
NA

FGD4
0.982
0.965
0.959999

VDR
0.947142857142857
0.888571428571429
0.867141428571429

TUBA1A
0.855833333333333
0.800833333333333
0.780833333333333

AAAS
NA
NA
NA

SUOX
NA
NA
NA

ERBB3
NA
NA
NA

CYP27B1
0.941875
0.8625
0.8318725

TSFM
NA
NA
NA

BBS10
0.895833333333333
0.779166666666667
0.740833333333333

CEP290
0.909565217391304
0.743473913043478
0.696956086956522

GNPTAB
0.959732620320856
0.93475935828877
0.925881818181818

PAH
0.91051282051282
0.860512820512821
0.839614230769231

MMAB
NA
NA
NA

MVK
0.907058823529412
0.806470588235294
0.768235294117647

ACADS
0.904666666666667
0.842666666666667
0.819999333333333

ORAI1
NA
NA
NA

HPD
NA
NA
NA

ATP6V0A2
0.998181818181818
0.995454545454545
0.993636363636364

GJB2
0.832537313432836
0.779550746268657
0.761192388059701

SACS
0.952727272727273
0.892727272727273
0.878180909090909

FREM2
NA
NA
NA

SLC25A15
0.944705882352941
0.899411764705882
0.881764705882353

SUCLA2
NA
NA
NA

RNASEH2B
NA
NA
NA

ATP7B
0.919117647058823
0.827352941176471
0.785

CLN5
NA
NA
NA

EDNRB
NA
NA
NA

PCCA
NA
NA
NA

ERCC5
0.983
0.963
0.954

LIG4
NA
NA
NA

TGM1
0.935555555555556
0.894722222222222
0.881109166666667

FOXG1
0.99
0.96375
0.95625

MGAT2
NA
NA
NA

NPC2
0.9125
0.766875
0.69874875

POMT2
0.969473684210526
0.934736842105263
0.923683684210526

VIPAS39
NA
NA
NA

GALC
NA
NA
NA

FBLN5
0.833333333333333
0.654166666666667
0.5833325

UBE3A
0.975466666666667
0.961066
0.9551996

SLC12A6
1
1
1

IVD
NA
NA
NA

UBR1
NA
NA
NA

BLOC1S6
NA
NA
NA

SLC12A1
NA
NA
NA

MYO5A
NA
NA
NA

RAB27A
NA
NA
NA

CLN6
0.871538461538462
0.768453846153846
0.727692307692308

HEXA
0.974042553191489
0.944468085106383
0.932126595744681

STRA6
0.908333333333333
0.833333333333333
0.804443888888889

CYP11A1
NA
NA
NA

MPI
NA
NA
NA

ETFA
NA
NA
NA

FAH
0.851333333333333
0.651326666666667
0.579328666666667

POLG
0.877083333333333
0.774995833333333
0.731662083333333

BLM
0.987555555555556
0.964666666666667
0.954888666666667

VPS33B
NA
NA
NA

HBA2
0.861875
0.736875
0.6881225

HBA1
NA
NA
NA

CLCN7
NA
NA
NA

ABCA3
NA
NA
NA

MEFV
0.318571428571429
0.175714285714286
0.140713571428571

ALG1
NA
NA
NA

PMM2
0.867916666666667
0.811666666666667
0.795833333333333

ERCC4
0.984545454545455
0.965454545454545
0.960908181818182

SCNN1G
NA
NA
NA

SCNN1B
0.892727272727273
0.761818181818182
0.700906363636364

COG7
NA
NA
NA

CLN3
NA
NA
NA

TUFM
NA
NA
NA

CD19
NA
NA
NA

RPGRIP1L
0.9725
0.905
0.88

COQ9
NA
NA
NA

TK2
0.906111111111111
0.850555555555556
0.831666666666667

HSD11B2
NA
NA
NA

COG8
NA
NA
NA

TAT
NA
NA
NA

GOSH
NA
NA
NA

ZNF469
0.362
0.226995
0.1899995

ASPA
NA
NA
NA

CTNS
0.948333333333333
0.89
0.869999444444444

ACADVL
0.884166666666667
0.760416666666667
0.7070825

MPDU1
NA
NA
NA

SCO1
NA
NA
NA

COX10
NA
NA
NA

PMP22
0.965333333333333
0.939993333333333
0.930666

ALDH3A2
0.976363636363636
0.941809090909091
0.929090909090909

FOXN1
NA
NA
NA

PEX12
NA
NA
NA

NAGLU
0.891666666666667
0.836666666666667
0.818332777777778

G6PC
0.941818181818182
0.903181818181818
0.887727272727273

NAGS
NA
NA
NA

G6PC3
NA
NA
NA

WNT3
NA
NA
NA

PNPO
NA
NA
NA

SGCA
NA
NA
NA

COL1A1
0.937090909090909
0.906363636363636
0.895272545454545

MKS1
NA
NA
NA

TRIM37
NA
NA
NA

COG1
NA
NA
NA

USH1G
NA
NA
NA

TSEN54
NA
NA
NA

ITGB4
NA
NA
NA

GALK1
NA
NA
NA

UNC13D
NA
NA
NA

ACOX1
NA
NA
NA

GAA
0.910714285714286
0.835
0.810356785714286

SGSH
0.887
0.802
0.779

NPC1
0.922173913043478
0.85304347826087
0.821521521739131

LAMA3
NA
NA
NA

TCF4
0.991666666666667
0.983329166666667
0.979166666666667

ATP8B1
0.820833333333333
0.615
0.55

NDUFS7
NA
NA
NA

GAMT
NA
NA
NA

INSR
0.941
0.898666666666667
0.882999666666667

MCOLN1
NA
NA
NA

STXBP2
NA
NA
NA

NDUFA7
NA
NA
NA

TYK2
NA
NA
NA

MAN2B1
0.975
0.9375
0.920833333333333

RNASEH2A
0.881
0.801
0.773999

GCDH
0.883636363636364
0.769090909090909
0.731817272727273

JAK3
NA
NA
NA

IL12RB1
NA
NA
NA

CRLF1
0.964705882352941
0.911764705882353
0.890588235294118

HAMP
NA
NA
NA

COX6B1
NA
NA
NA

NPHS1
NA
NA
NA

DLL3
NA
NA
NA

PRX
0.966
0.875
0.843

BCKDHA
0.953125
0.915625
0.9012490625

ETHE1
NA
NA
NA

ERCC2
0.888181818181818
0.770909090909091
0.728180909090909

OPA3
NA
NA
NA

FKRP
0.7825
0.6205
0.5534975

NUP62
NA
NA
NA

ETFB
NA
NA
NA

SLC4A11
0.92625
0.85875
0.83375

PANK2
0.914166666666667
0.805825
0.7658325

DNMT3B
NA
NA
NA

GSS
NA
NA
NA

SAMHD1
0.985789473684211
0.966836842105263
0.956842105263158

ADA
0.872083333333333
0.803333333333333
0.782914583333333

DPM1
NA
NA
NA

EDN3
NA
NA
NA

IFNGR2
NA
NA
NA

HLCS
NA
NA
NA

CBS
0.84625
0.77875
0.753125

CSTB
NA
NA
NA

AIRE
0.984545454545455
0.969090909090909
0.963636363636364

COL6A1
0.92
0.773
0.720999

COL6A2
0.842857142857143
0.702371428571429
0.64238

PEX26
NA
NA
NA

SNAP29
NA
NA
NA

LARGE
NA
NA
NA

PLA2G6
0.938378378378378
0.877564864864865
0.852159189189189

ALG12
NA
NA
NA

MLC1
0.890833333333333
0.660833333333333
0.5625

SCO2
NA
NA
NA

TYMP
NA
NA
NA

ARSA
0.898214285714286
0.8275
0.80232125

All
0.917565008266947
0.912709950398317
0.911042912971592

	Number	Date	Country
	62151116	Apr 2015	US
	62013139	Jun 2014	US

	Number	Date	Country
Parent	14568456	Dec 2014	US
Child	15136093		US

DEVICE, SYSTEM AND METHOD FOR ASSESSING RISK OF VARIANT-SPECIFIC GENE DYSFUNCTION

Information

Publication Number

Date Filed

Date Published

Inventors

Original Assignees

CPC

International Classifications

Abstract

Description

Claims

CROSS-REFERENCE TO RELATED APPLICATIONS

Provisional Applications (2)

Continuation in Parts (1)