The present invention relates to systems and methods for estimating a probability of a disease occurrence in a patient based on genetic data.
In some implementations, the systems and methods described herein provide better understanding of the risk of a genetic disease given a specific genetic variation. The approach currently used by the medical genetics community, promulgated by the American College of Medical Genetics and Genomics (ACMG) and the Association for Molecular Pathology (AMP), translates variant data into knowledge of whether that variant can or cannot induce disease; the terminology used is “causative” for disease. In contrast, the framework described in this disclosure integrates variant information (e.g., function, in silico predictions, etc.) into a single annotation ranging from “Benign” to “Pathogenic” (with multiple intermediate classification levels in between). Each additional satisfied criterion raises or lowers the probability the variant is classified “Pathogenic”. This approach is unique because it interprets genetic variants as probabilistically increasing the risk for disease. The systems and methods described herein are centered around a hypothesis that clinically meaningful knowledge is lost in the compression of variant information into a single ACMG annotation. Accordingly, the systems and methods described herein translate variant-specific data into knowledge about the probability of a disease phenotype, the post-test probability of disease or positive predictive value (PPV) for each variant. This is enabled by two major shifts from current thinking: 1) the probability of a genetic disease is estimated by measuring the penetrance of disease (the number of affected carriers divided by the total number of carriers) and 2) variant-specific data is calibrated to this estimate of disease penetrance (or probability/PPV) to better estimate the probability of disease for very rare genetic variants. In some implementations, the systems and methods include translating penetrance in three-dimensional space which is a modification of previous structure-derived features.
In some implementations, variant interpretation is limited to a desire for “yes/no” answers to what is in truth a continuum of probabilities. Other implementations described herein allows for better calibration of variant features because the outcome is not binary and, therefore, contains more information than current strategies. Accordingly, the outcome, or target, enables a more precise interpretation of both variant features and disease probability.
In some implementations, the systems and methods described herein provide the following: (1) Interpretation of all variants (ranging from common to rare) as raising (or not raising) the probability of disease; this probability is measured as the penetrance of disease; (2) Calibrating the estimation of disease probability on penetrance data. This enables an estimate of the probability of disease before observing healthy or sick carriers of genetic variants; and (3) a 3D protein structural feature as a covariate in the method.
The systems and methods described herein are distinct from other technologies that try to understand the consequences of genetic variations in disease-associated genes because those other technologies attempt to classify variants which CAUSE disease whereas the new method determines the probability of disease. Systems and methods described herein are calibrated using penetrance data from many variants, not classification data from many variants. This more precise framework of the relationship between genetic variation and disease probability allows for a more precise understanding of how each individual input feature associates with disease and therefore more informative estimates of disease probability.
In one embodiment, the invention provides a method of assessing a probability of a disease occurring in a patient based on an estimation of disease penetrance corresponding to a specific genetic variant of interest. A set of variants is selected including a specific variant of interest and a database is accessed including genetic and disease data for each of a plurality of individuals. An empirical prior estimate of disease penetrance is calculated based on a number of individuals from the plurality of individuals that have both the disease and at least one variant of the set of variants, and a number of individuals from the plurality of individuals that have at least one variant of the set of variants. A posterior penetrance estimate for the specific variant of interest is then calculated based on the empirical prior estimate and a number of individuals that have the specific variant of interest. A recursive regression modeling is then applied to the posterior penetrance estimate and, in each iteration, an estimated penetrance for the specific variant of interest is fitted to the posterior penetrance estimate and the posterior penetrance estimate is recalculated based on a set of revised variant-specific priors determined based on the regression modeling. The recursive regression modeling continues until one or more defined exit criteria are satisfied (e.g., until the posterior penetrance estimate converges towards a final value). The probability of the disease occurring in a patient that has the specific variant of interest is then determined based on the posterior penetrance estimate as determined by the recursive regression modeling.
Other aspects of the invention will become apparent by consideration of the detailed description and accompanying drawings.
Before any embodiments of the invention are explained in detail, it is to be understood that the invention is not limited in its application to the details of construction and the arrangement of components set forth in the following description or illustrated in the following drawings. The invention is capable of other embodiments and of being practiced or of being carried out in various ways.
A major challenge emerging in genomic medicine is how to assess best disease risk from rare or novel variants found in disease-related genes. The expanding volume of data generated by very large phenotyping efforts coupled to DNA sequence data presents an opportunity to reinterpret genetic liability of disease risk. Here we propose a framework to estimate the probability of disease given the presence of a genetic variant conditioned on features of that variant. We refer to this as the penetrance; the fraction of all variant heterozygotes that will present with disease.
In various specific examples described herein, we demonstrate this methodology using a well-established disease-gene pair, the cardiac sodium channel gene SCNSA and the heart arrhythmia Brugada syndrome. From a review of 756 publications, we developed a pattern mixture algorithm, based on a Bayesian Beta-Binomial model, to generate SCNSA penetrance probabilities for the Brugada syndrome conditioned on variant-specific attributes. These probabilities are determined from variant-specific features (e.g. function, structural context, and sequence conservation) and from observations of affected and unaffected heterozygotes. Variant functional perturbation and structural context prove most predictive of Brugada syndrome penetrance.
The clinical implications for genetic variants, even definitively pathogenic variants, can vary strikingly across individuals. Lack of evidence to estimate the probability of disease from identified genetic variants, especially rare variants, presents a major barrier to integrating genotype information into clinical care. Here we advance an approach to estimate the penetrance, or positive predictive value of the discovery of a genetic variant, in service of advancing the use of genetic information in personalized medicine.
A major barrier to integrating genotype information into clinical care is accurately linking genetic variants to disease risk. As cheap whole genome, exome, and gene panel sequencing becomes more widely used, the genetics community frequently observes novel, ultra-rare variants—ones carried by a single or few (often related) individuals. Indeed, most variants found in large population genome sequencing efforts are novel or ultra rare. The number of possible single nucleotide variants in the human genome is in the billions; the number of variants becomes uncountable if insertion and/or deletions (indels) are included. The majority of these discovered variants will never be observed in a sufficient number of heterozygotes to ascertain a causal link with disease. In addition to finding rare variants, large-scale genetic sequencing efforts taking place around the world are identifying greater numbers of individuals, ostensibly unaffected, who carry variants previously thought to be disease-inducing. As a consequence of insufficient heterozygote counts and conflicting annotations, many diagnostic laboratories annotate such variants as “Variants of Uncertain Significance” (VUS), despite more confident past assessments of “Likely Pathogenic” or “Pathogenic.”.
To help assess the impact of genetic variants, the American College of Medical Genetics and Genomics (ACMG) suggests integrating multiple sources of information including population, functional, computational, and segregation data to classify variants. This is consistent with a continuous, Bayesian framework where each additional satisfied classification criterion modifies the probability a variant is causative for disease (pathogenic) or not (benign). Given the resulting probabilities, a final classification can be made into one of the five categories commonly used to distinguish variants including, for example, “benign”, “likely benign”, “variant of uncertain significance”, “likely pathogenic”, or “pathogenic”. However, a remaining challenge even after classification is that the clinical implications for definitively pathogenic variants can vary strikingly across individuals, including variable expressivity and incomplete penetrance. We attempt here to address one aspect of this clinical variability by developing a method to estimate variant-induced disease risk.
In this study, we sought to develop a method to estimate the probability of disease given variant-specific information—which we refer to as the penetrance of a variant—and we also provide the uncertainty for that estimate. The pathogenicity of a variant for a specific individual at a given point in time is binary but unknown. This pathogenicity may have a time dependence such as for diseases which present later in life. Penetrance is one metric that captures the degree to which the pathogenicity will manifest as a human phenotype such as a disease or a trait. We provide posterior probability estimates of the penetrance, asymptotic with respect to age, which can be thought of as the positive predictive value of disease given the known variant information. We also provide a 95% credible interval that represents the uncertainty in that estimate. Our method relies on “borrowing strength” or sharing information across variants to produce variant-specific, quantitative penetrance estimates even in the absence of a large number of heterozygotes. These estimates can be especially informative for interpreting rare and novel variants.
where α is the number of variant heterozygotes diagnoses with a particular disease and β is the number of unaffected heterozygotes of the same variant, i is the specific variant in question and mean posterior penetrance is the estimated penetrance or probability of disease. As the total number of observed heterozygotes increases, the estimated penetrance converges to the traditional definition. The mean posterior penetrance can be thought of as a shrunken estimate of the observed penetrance, especially for variants with small numbers of known heterozygotes.
To generate “priors” from available data, the method of
Penetrance Estimatei=β0+β1(Peak Current)i+β2(Penetrance Density)i+Σnβi,n(In Silica Variant Classifiers)i,n+εi (2)
The fitted model is then used to generate an updated prior distribution and, by addition of observed cases (i.e., heterozygotes with disease) and controls (i.e., heterozygotes without the disease) for each variant, a subsequent posterior expected penetrance. The new EM priors are unique to each variant and replace αprior and βprior in equation (1) for the next iteration. The updated posterior penetrance is then used to build a new fitted model and further refine the posterior expected penetrance. The procedure is iterated until it converges to a maximum likelihood solution. Although the example of
As illustrated in
Next a regression model is used to fit an estimated penetrance based on variant-specific features (e.g., the result of Equation (2)) to the posterior penetrance estimate of Equation (1) (step 109). The variant-specific priors (i.e., the probability correlation of the variant-specific features to the disease) are then revised based on the regression model (step 111) and the posterior penetrance estimate (equation (1)) is recalculated based on the updated variant-specific priors (step 113). In other words, after the first iteration, αprior and βprior in the empirical model are replaced with new variant-specific values of αi,prior and βi,prior based on the regression model. After each subsequent iteration, αi,prior and βi,prior are replaced with updated values based on the regression model. This process is repeated iteratively by fitting a regression model of the estimated penetrance based on variant-specific features (e.g., equation (2)) to the updated posterior penetrance estimate until one or more convergence criteria are met (e.g., until an iteration fails to change the calculated mean penetrance by more than a defined percentage) (step 115). When the convergence criteria (or criterion) are satisfied, then the most recent version of the recalculated posterior penetrance estimate is the final EM penetrance estimate for the specific variant (step 117).
In some implementations, the estimated disease penetrance for the specific variant can be used to identify and/or quantify risk conditions for the disease in patients. For example, in the method of
The variant penetrance computer system 301 analyzes and processes data from the genomic data library 307 and the linked EHR data library 309 in order to calculate an estimated disease penetrance for the specific variant (e.g., using the method of
To further illustrate this approach, below are a series of examples applying the method of
Variants in SCN5A have been associated with BrS1 since 1998, some variants affecting almost all known heterozygous individuals, some variants conferring only modestly increased risk, and others have no influence on arrhythmia presentation. SCN5A variants that do not influence the gene in any way do not predispose or protect against BrS1, e.g. many synonymous variants. These variants therefore have a relatively low penetrance of the arrhythmia, similar to the general population. SCN5A variants that produce no sodium current result in a higher fraction of heterozygotes presenting with BrS1, much higher than in the general population. However, BrS1 presentation, as for nearly all inherited diseases, is not homogeneous even amongst heterozygotes of SCN5A haploinsufficiency alleles. In fact, even highly penetrant variants such as R878C and E1784K still leave some heterozygotes unaffected: 100% penetrance is extremely rare.
Our hypothesis is that variant-specific features (e.g. variant-induced changes in function and location in structure) contain information equivalent to clinically phenotyping heterozygotes and can therefore be used to inform the prior distribution in a Bayesian framework. This prior distribution is combined directly with clinically phenotyped heterozygotes (the likelihood function) to produce more accurate estimates of disease risk probability (posterior penetrance;
Precision and accuracy of BrS1 penetrance priors. To evaluate performance over the distribution of BrS1 prior penetrances (
(where αprior and βprior are the tuning parameters for the beta-binomial distribution and are set equivalent to the number of affected and unaffected individual heterozygotes in the prior), as the dependent variable; both empirical and EM priors were evaluated as indicated in the table of
Inclusion of individuals from gnomAD. Individuals in gnomAD are mostly unaffected, given the rarity of BrS; however, the data available from that resource could be contaminated with individuals presenting with BrS, though likely at or near the rate in the general public. To test the sensitivity of our results to this type of misclassification, we randomly switched individuals from unaffected (gnomAD) to BrS cases for each variant and examined the change in penetrance due to misclassification. We did this with 24 and 240 misclassified cases. With 24 misclassifications, the median rate of penetrance change is 0.4% and the expected number of variants with a penetrance change is 6. The average mean absolute difference in penetrance change is 0.02% (first quartile of 0.0014% and third quartile of 0.02%). With 240 misclassifications, the median rate of penetrance change is 2%, and the expected number of variants with a penetrance change is 28. The average mean absolute difference in penetrance change is 0.2% (first quartile of 0.1% and third quartile of 0.3%). These results suggest minimal influence of small or modest misclassification rates on penetrance estimates.
Structure and peak current improve prediction of penetrance. The resulting prior BrS1 mean penetrance estimates reflect the known topology of Nav1.5 (protein product of SCN5A;
A modified Bayesian approach to estimate BrS penetrance. An Empirical Bayes approach combines information across all variants to estimate a single prior distribution and estimate a variant-specific posterior penetrance from that prior. These estimates assume all variant effects have the same prior and therefore shrink towards a global mean across all variants. Here we put forward a method to model the penetrance for each variant using variant-specific predictive features. The resulting penetrance and uncertainty estimates yield a posterior that can be re-used as variant-specific prior (interpretable as equivalent to hypothetical observations of affected and unaffected heterozygotes) in a Bayesian updating scheme. This information is accessible before clinically phenotyping a single heterozygote; example estimates of high BrS1 penetrance [c.4213G>C (p.Val1405Leu), c.4259G>T (p.Gly1420Val), and c.4258G>C (p.Gly1420Arg)] and low BrS1 penetrance [c.4418T>G (p.Phe1473Cys), c.4459A>C (p.Met1487Leu), and c.4467G>T (p.Glu1489Asp)] are seen in
Comparison between penetrance prediction and ACMG variant classification. We put forward a method to estimate the probability that a SCN5A variant will manifest in BrS1 for a given patient (our ‘risk score’), and uncertainty for that score, conditioned on variant attributes. We are not assessing the causality of the variant and its attributes on the manifestation of disease, but rather their association. Hence, our framework diverges from that of the ACMG. For example, in our formulation, a VUS with many affected heterozygotes would have the same probability distribution as a pathogenic variant with many affected heterozygotes [provided the number of observations of cases and controls is the same and the other predictive covariates (variant attributes) are the same]. If there are comparatively few heterozygotes of the VUS, given the same predictive covariates, greater uncertainty would be reflected by a wider distribution of penetrance probability (
Prospects for applications of this method. Our approach provides a risk score for disease, in this case, for BrS1. However, Brugada syndrome has degrees of electrophysiologic phenotypes and symptoms. We envision being able to predict these degrees of clinical phenotype from variant-specific properties in the future by integrating electronic health records with linked genetic data. However, at present, these granular electrophysiologic and symptom data are not available for a number of unique heterozygotes and unique variants sufficient for statistical analysis. Beyond SCN5A and BrS1, a reasonable next step would involve the 59 genes for which the ACMG recommends clinical diagnostic laboratories report secondary variant discovery. Of these, 36 have greater than or equal to 20 missense “pathogenic”/“likely pathogenic” variants in ClinVar, suggesting that many variants are described in the literature and can be curated in a similar manner to SCN5A. It is also important to note that the penetrance estimates derived in our approach are not static and will continue to be refined as additional data become available, i.e. phenotype data from case reports and large biobank projects, additional in vitro functional studies, and improved computational and structural predictors.
Our approach provides a risk score for disease, in this case BrS1, analogous to a diagnostic test (might patient X develop BrS1 given they have variant Y). If we know patient X already has BrS1, we can use their data to inform other individuals' risk scores, but we cannot use our approach to absolutely determine the role of variant Y manifesting disease. One application of our approach is that we can examine the ratio P(BrS1|Variant X)/P(BrS1|wild-type) to see if the data better support that variant X is on the causal pathway to disease. But we caution that this approach is imperfect; it does not allow for variants to interact, for example. Additionally, while clinical evidence affirms a strong relationship between SCN5A variants and BrS1, many genetic and environmental factors influence the ultimate presentation of BrS1 in an individual. Not accounting for additional demographic, genetic, or environmental factors certainly increased the noise in our analysis. To counter this as best as possible, we included the maximum number of carriers for the maximum number of unique variants. Finally, we recognize the likely bias intrinsic to compiling a list of affected and unaffected heterozygotes in the manner outlined in the methods section above; however, the most probable manifestation of these biases would be the loss of an observable relationship between the predictive features and penetrance, not the creation of a spurious relationship.
We advance a method to estimate a degree of clinical heterogeneity in variant impact, incomplete penetrance. Here we have demonstrated how BrS1 penetrance can be estimated with high accuracy and precision. Using a Bayesian framework to estimate penetrance allows us to quantitatively integrate clinical phenotypic data with variant-specific functional measurements, variant classifiers, and sequence- and structure-based features to accurately estimate penetrance. This method can be extended to other genes and disorders in order to enable quantitative interpretation of variants probabilistically and quantitatively.
Applying the method of
To generate priors from our available data, we use the expectation maximization (EM) algorithm described above in reference to
In this example using equation (2) specifically for BrS1 penetrance, equation (2) peak current is an in vitro measurement of the maximum current through a channel (normalized to wild type), penetrance density is a structure-based metric (as described in further detail below), and in silico variant-classifiers is a vector populated with commonly used variant classification servers such as PROVEAN and PolyPhen (see below); all predictors used are continuous, not categorical or binary (in the table of
to the posterior mean penetrance derived by adding BrS1 cases and controls for each variant to the empirical prior,
or the EM prior,
Collection of the SCN5A variant dataset. The dataset was curated from 711 papers in a previous publication, to which we added an additional 45 papers on SCN5A that had been published since the previous dataset was constructed. Briefly, we searched publications for the number of heterozygotes of each variant mentioned, the number of unaffected and affected individuals with diagnosed BrS1, and variant-induced changes in channel function, if reported; all recorded values were normalized to wild-type values reported in the same publications. We supplemented this dataset with all SCN5A variants in the gnomAD database of population variation (http://gnomad.broadinstitute.org/; release 2.0). Due to the rarity of BrS1 (-1 in 10,000), all heterozygotes found in gnomAD were counted as unaffected. An interactive version of the dataset, the SCN5A Variant Browser, is available at http://oates.mc.vanderbilt.edu/vancart/SCN5A/. We further collected in silico pathogenicity predictions from three commonly used servers: SIFT, Polyphen-2, and PROVEAN. We also include basic local alignment search tool position-specific scoring matrix (BLAST-PSSM) for SCN5A and the per residue evolutionary rate, previously shown to have predictive value for predicting functional perturbation for the cardiac potassium channel gene KCNQ1, and point accepted mutation score (PAM). Additionally, we leveraged structures of the SCN5A protein product and derived a penetrance density as previously described (as described further below). In-frame indels are treated as missense variants. We include these variants as variations at a residue where the indel starts, and only note whether they are an insertion or deletion. Some of these variants have functional data available and their penetrance densities are calculated from the residue starting the indel. These are simplifications to enable an analysis of as many variants and heterozygote individuals as possible. For these variants, we did not include in silico pathogenicity predictions. We included compound heterozygotes (individuals with more than one SCN5A variant) as separate records when these data are available, though these were very rare. Additionally, our inclusion criteria are not modified by relatedness. We did not include intronic variants in our analysis.
Initial Empirical Bayes beta-binomial prior penetrance calculation. Using the data from the aforementioned literature curation, we estimated the penetrance for each observed variant using a beta-binomial empirical Bayes model. To calculate the empirical BrS1 penetrance prior, we calculated αprior, empirical and βprior, empirical by finding the weighted mean penetrance over all variants in the dataset and estimating the variance. Weighting was done using the following equation:
Equation 3 ensures variants with a greater number of total heterozygotes (and therefore higher confidence in penetrance estimate) had a greater weight in the preliminary analysis. We then estimated the variance in penetrance as the mean squared error (MSE) between the estimated penetrance mean and the observed penetrance from Equation 1 with αprior and βprior equal to zero. With these estimated mean and MSE-derived variance, the empirical prior penetrance was calculated to be an αprior and βprior equal to 0.45 and 2.73, respectively. The variant-specific empirical posterior for each variant was then calculated by adding observed heterozygote counts of affected (BrS1 cases) and unaffected to αprior, empirical and βprior, empirical respectively, and the resulting posterior mean penetrance was used as the dependent variable of the subsequent regression model (Equation 2). The inverse variance of the estimated posterior beta distributions capped at the ninth decile determined in this step were used to weight subsequent regression models and Pearson R2 calculations.
Expectation maximization Bayesian beta-binomial penetrance predictions. To deal with missing data in a prediction model, we followed an approach which avoids multiple imputation but guarantees maximum predictive accuracy across missing data patterns. In short, for every missing data pattern, we estimate a separate prediction model. For example, His558Arg, where penetrance density, in silico predictors, and functional data are all available, the estimate of penetrance is regressed on all other variants where all of these covariates are available (n=238). For Tyr1449Cys, however, only penetrance density and in silico predictors are available, so only those covariates are used in the regression (n=1,382; much higher since functional data have been collected for relatively few variants). The models were built with a linear regression pattern-mixture algorithm, updating posterior mean penetrances iteratively until the resulting estimated mean penetrance,
changed by <0.01% from the previous iteration. This process typically converged within eight iterations. For variant, i, the variance was estimated from this converged EM mean penetrance according to (Equation 4):
We then adjusted v, equivalent to the number hypothetical observations of clinically phenotyped heterozygotes, to balance overcoverage of variants with low to moderate BrS1 penetrance and poorer coverage of variants with high estimated mean penetrance, resulting in a range of v, from approximately 15 to 20 (see
As discussed above, one variant-specific feature that may be used in some implementations to estimate the disease penetrance for the specific variant is “penetrance density.” In some implementations, “penetrance densities” are calculated (for BrS1 or for other genes/variants) based on an approach akin to k-nearest neighbors. We average empirical BrS1 mean posterior penetrance
of variants weighted by the inverse distance in space from the variant of interest. This calculated feature therefore depends on how many variants are near the variant of interest, with regions in three-dimensional space dense with high penetrance variants—“hotspots”—yielding a higher penetrance prediction. Penetrance density is calculated as follows:
where ρj is BrS1 penetrance density of the jth variant, BrS1 Posterior Penetrancex,i,mutation(i≠j) is the empirical BrS1 mean posterior penetrance for the ith variant, and di,j is the distance between the center of mass of residues i and j. i does include residue j, but only if the identity of the amino-acid mutation is changed, i.e. mutation(i)≠mutation(j). For residues in the flexible linkers where structural data are not available, we assumed a flexible amino-acid polymer and calculated the distance between residues as di,j=3.8*√{square root over (Ri,j)}, where Ri,j is the number of residues between residue i and residue j.
Also, as discussed above in reference to equation (4), we scaled the variance from the convergent EM result by a factor of “v.” At each level of ‘v’ and for each variant, we sampled from binomial distribution with n of 100, and probability of
We calculated the resulting 95% posterior credible interval from the Beta distribution with shape parameters 1) BrS1 sampled+αprior,EM and 2) 100−BrS1 sampled+βprior,EM. We repeated this process 1000 times, and calculated the rate of the posterior credible interval covering the probability
We selected the best ‘v’ from the coverage plots which balances the tradeoff of over-coverage in variants with medium low BrS1 penetrance and under-coverage of variants with high BrS1 penetrance. From this procedure we chose a tuning parameter of ‘v’=19 (see, e.g.,
Accordingly, the systems and methods described herein provide mechanisms for estimating variant-specific disease penetrance and then assigning a classification of disease risk for patients exhibiting the variant based on the variant-specific disease penetrance. In some implementations, the systems and methods described herein provide automated mechanisms for detecting occurrences of a specific variant in electronic medical records and for then taking mitigative action (e.g., notifying a medical professional and/or initiating medical treatment) based on the classification of disease risk assigned to the variant based on the variant-specific disease penetrance. Further features and advantages are set forth in the accompanying claims.
This application claims the priority benefit of U.S. Provisional Patent Application No. 63/212,928, filed Jun. 21, 2021, entitled “A BAYESIAN METHOD TO ESTIMATE VARIANT-INDUCED DISEASE PENETRANCE,” the entire contents of which are hereby incorporated herein by reference.
This invention was made with government support under grant R00HL135442 awarded by the National Institutes of Health. The government has certain rights in the invention.
Number | Date | Country | |
---|---|---|---|
63212928 | Jun 2021 | US |