Genomic approaches in studying disease provide useful tools in the field, such as the ability to replace informed but biased hypotheses with unbiased but generic ones, such as the equal treatment of all genetic variants in genome-wide association studies (GWAS). However, for both rare variants of large effect and common variants of weak effect, the use of prior knowledge can be critical for disease gene discovery (Cooper et al. 2010; Cooper et al. 2011(a); Musunuru et al. 2010; Ward & Kellis 2012). For example, exome sequencing is an effective discovery strategy because it focuses on protein-altering variation, which is enriched for causal effects (Ng et al. 2009).
Although many existing annotation methods are useful for prioritizing causal variants to boost discovery power (for example, PolyPhen (Adzhubei et al. 2010), SIFT (Ng & Henikoff 2003) and GERP (Cooper et al. 2005)), current approaches tend to suffer from one or more major limitations. First, annotation methods vary widely with respect to both inputs and outputs. For example, conservation metrics (Cooper et al. 2005; Siepel et al. 2005; Pollard et al. 2010) are defined across the genome but do not use functional information and are not allele specific, whereas protein-based metrics (Adzhubei et al. 2010; Ng. & Henikoff 2003) apply only to coding and often only to missense variants, thereby excluding >99% of human genetic variation. Further, each annotation method has its own metric, and these metrics are rarely comparable, making it difficult to evaluate the relative importance of distinct variant categories or annotations. Additionally, annotation methods trained on known pathogenic mutations are subject to major ascertainment biases and may not be generalizable. Moreover, it is a major practical challenge to obtain, let alone to objectively evaluate or combine, the existing panoply of partially correlated and partially overlapping annotations; this challenge will only increase in size as large-scale projects such as the Encyclopedia of DNA Elements (ENCODE) (ENCODE Project Consortium et al. 2012) continually increase the amount of relevant data available. The net result of these limitations is that many potentially relevant annotations are ignored, while the annotations that are used are applied and combined in ad hoc and subjective ways that undermine their usefulness. Therefore, it would be useful to develop a framework which would provide a single metric to prioritize variants across annotation methods.
According to the embodiments described herein a method performed by a computing system for determining the relative effect of a genetic variant is provided. The method may include a set of applying a machine learning model to a dataset, wherein the dataset comprises one or more genetic variants, each of which is associated with values or states of each of a set of annotations. In some aspects, the machine learning model is a support vector machine (SVM) model. The method may also include a step of calculating and/or assigning (e.g., a raw integrated deleteriousness score or a scaled integrated deleteriousness score) for each of the one or more genetic variants. The integrated deleteriousness score of each genetic variant may be used to determine the relative effect of said genetic variant when compared to other integrated deleteriousness scores.
In certain embodiments, a system for generating an integrated deleteriousness score is provided. The system may include a computer-readable storage medium which stores computer-executable instructions. In some aspects, the computer-executable instructions include, but are not limited to (i) instructions for applying a machine learning model to a dataset, wherein the dataset comprises one or more genetic variants, each of which is associated with values or states of each of a set of annotations; and/or (ii) instructions for calculating an integrated deleteriousness score to each of the one or more genetic variants. They system may also include a processor. In some aspects, the processor may be configured to perform steps including, but not limited to, receiving the dataset by a user and/or executing the computer-executable instructions stored in the computer-readable storage medium.
In certain embodiments, a computer-readable storage medium is provided. Such a computer-readable storage medium may store computer-executable instructions including, but not limited to (i) instructions for applying a machine learning model to a dataset, wherein the dataset comprises one or more genetic variants, each of which is associated with values or states of each of a set of annotations, and/or (ii) instructions for calculating an integrated deleteriousness score to each of the one or more genetic variants.
Provided herein is a general framework and methods for integrating diverse genome annotations and scoring a human genomic variant, e.g., a genetic mutation resulting in a single-nucleotide variant (SNV) (also referred to as a single nucleotide polymorphism (SNP) or an insertion or deletion event (also referred to as an insertion-deletion event or an “indel” event). According to the embodiments described herein, this framework may be implemented by various computer-based methods for determining the relative effect (e.g., pathogenicity or functionality) of a genetic variant using a single metric (or score), and is also referred to as “Combined Annotation—Dependent Depletion”, or CADD. As referred to herein, the term “genetic variant” is any alternation or change to the nucleotide sequence of a gene, genome or any other DNA molecule derived from the genetic material of a human or other organism. Such alternations may include, but are not limited to, single-nucleotide polymorphisms (SNPs), (also referred to herein as a single nucleotide variant, or SNV), insertion or deletion events (or “indels”), and copy number variants. The alternation or change may have no effect, may alter the expression or function of a gene or its expression product, or may prevent the gene or its expression product from functioning properly. Effects caused by genetic variants may be neutral in effect, beneficial in effect, or pathogenic in effect. Genetic variants that are rare and/or abnormal among the population are also known as mutations. Many mutations cause pathogenic changes associated with human diseases (inheritable or non-inheritable), but mutations may also result in neutral or beneficial effects.
The basis of the CADD framework and methods described herein is to contrast a set of annotations for fixed or nearly fixed derived alleles in humans (i.e., observed human derived variants) with those of simulated variants. Deleterious variants—that is, variants that reduce organismal fitness—are depleted by natural selection in fixed but not simulated variation. The CADD framework therefore measures deleteriousness by way of assigning a calculated integrated deleteriousness score to a genetic variant or a set of genetic variants, as described in detail below. Deleteriousness is a property that strongly correlates with both molecular functionality and pathogenicity (Kimura 1983). Metrics of deleteriousness (e.g., an integrated deleteriousness score), in contrast to metrics limited to pathogenicity or molecular functionality, have many advantages for use in genomics field (e.g., clinicians, researchers, patients, etc.). Whereas metrics limited to pathogenicity or molecular functionality are limited in scope to a small set of genetically or experimentally well-characterized mutations and are subject to major ascertainment biases, deleteriousness can be measured systematically across a genome assembly (see Cooper et al. 2005; Siepel et al. 2005; Pollard et al. 2010 and description below). Further, selective constraint on genetic variants is related to the totality of their phenotype-relevant effects rather than to any individual molecular or phenotypic consequence. Measures of deleteriousness can therefore provide, in principle, a genome-wide, data-rich, functionally generic and organismally relevant estimate of variant effect that can be useful across datasets and annotation tools in the field.
In some embodiments, the methods for determining the relative effect (e.g., pathogenicity or functionality) of a genetic variant may include a step of applying a machine learning model to a dataset. As referred to herein, a “dataset” includes a set of one or more genetic variants and a set of one or more annotations, wherein each of the one or more genetic variants are associated with values or states of each of the one or more annotations. In some aspects, the dataset may be a training set (e.g., a set of observed variants, a set of simulated variants, or both) that, when applied to a machine learning model, trains the machine learning model. In other aspects, the dataset may be a test set (e.g., one or more variants derived from a genome, gene, or other DNA molecule) that may be used in applying a machine learning model. In certain embodiments, the dataset includes a set of one or more genetic variants organized in rows of a table and a set of one or more annotations organized in columns of the table. In other embodiments, the dataset includes a set includes a set of one or more genetic variants organized in columns of a table and a set of one or more annotations organized in rows of the table. In some aspects, said table provides an organizational structure, within which the one or more genetic variants are associated with values or states of each of the one or more annotations. Such associations may form the basis of an annotation matrix that may be used to apply the machine learning models described below in accordance with the embodiments described herein.
Models that are based on a form machine learning are established by constructing systems that can learn from data, rather than follow only explicitly programmed instructions. Several forms of machine learning that are based on learning algorithms are known in the art including, but not limited to, supervised learning, unsupervised learning, reinforcement learning, semi-supervised learning, transduction, learning to learn, and developmental learning. These forms of machine learning give rise to several approaches for generating a machine learning model. Approaches of machine learning that may be used to generate a model in accordance with the embodiments described herein include, but are not limited to, decision tree learning, association rule learning, artificial neural networks, inductive logic programming, support vector machines, clustering, Bayesian networks, reinforcement learning, representation learning, similarity and metric learning, and sparse dictionary learning.
In one aspect, the machine learning model used in the embodiments described herein is a support vector machine (SVM) model (also known as a “classifier”) (see Franc & Sonnenburg 2009, the subject matter of which is hereby incorporated by reference as if fully set forth herein). In machine learning, SVMs are supervised learning models having associated learning algorithms that analyze data and recognize patterns. SVMs are used, for example, for classification and regression analysis. When applied to a given a set of training examples (i.e., “training sets”), each marked as belonging to one of two categories, an SVM training algorithm builds a model that assigns new examples into one category or the other. A SVM model is a representation of the examples as points in space, mapped so that the examples of the separate categories are divided by a line such that on each side, the gap between the line and the points on the side are maximized. In cases where a perfect separation of points from different categories by a line is not possible, the SVM seeks the best possible such line. New examples are then mapped into that same space and predicted to belong to a category based on which side of the line they fall on.
The SVM may be trained using any methods known in the art. In some methods, the SVM is trained to distinguish between a training set that includes a set of simulated variants and a set of observed variants.
In certain methods, the SVM may be trained using a linear (or non-linear) kernel function (k(x,y)). SVMs are extremely robust classifiers for binary classification problems when the points to be separated are linearly separable. Their utility is extended to nonlinearly separable data by using kernels that implicitly map data to a higher dimension where such data are more likely to be linearly separable. In spaces with more than two dimensions, the term hyperplane is applied, rather than a line, which is a generalization of the notion of a line (see e.g.,
The SVM model may be designed to construct a hyperplane or a set of hyperplanes in a high- or infinite-dimensional space, which can be used for classification, regression, or other tasks. In some aspects, a hyperplane may be defined as the set of points whose dot product with a vector in that space is constant. In this case, the vectors defining the hyperplanes can be chosen to be linear combinations with parameters ai of images of feature vectors that occur in the training set or a test set (or database). In such case, the points x in the feature space that are mapped into the hyperplane are defined by the relation:
In some embodiments, an SVM model may be trained on features derived from an annotation matrix that includes one or more suitable annotations (e.g., X1, . . . , Xn shown in Function 1 below) used to classify a set of genetic variants of a dataset (e.g., a training set or a test set). Any number of annotations may be used to train the SVM model. The annotations may be derived from one or more annotation tools or pipelines such as AnnoVar, Ensembl Variant Effect Predictor (VEP), snpEffect, Panther, SeattleSeq, FamAnn, RefSeq, GATK VariantAnnotater, VAAST 2.0, Mutalyzer 2, VAT, or any other suitable annotation tool in the art.
In one embodiment, the set of annotations may include, but are not limited to, one or more of: Alt allele, bStatistic, cDNApos, CDSpos, Consequence, Dst2Splice, Dst2SplType, EncExp, EncH3K27Ac, EncH3K4Me1, EncH3K4Me3, EncNucleo, EncOCC, EncOCctcfSig, EncOCDNaseSig, EncOCFaireSig, EncOCmycSig, EncOCpolIISig, GerpN, GerpRS, GerpRSpval, GerpS, Grantham, Indel length, Local CpG density, Local GC density, Mammalian PhastCons, Mammalian PhyloP, minDistTSE, minDistTSS, motifDist, motifECount, motifEHIPos, motifEName, motifEScoreChng, Mutation type, nAA, oAA, PolyPhenCat, PolyPhenVal, Primate PhastCons, Primate PhyloP, protPos, Ref allele, relcDNApos, relCDSpos, relProtPos, Segway, SIFTcat, SIFTval, TFBS, TFBSPeaks, TFBSPeaksMax, tOverlapMotifs, Transversion?, Vertebrate PhastCons, and Vertebrate PhyloP.
In some aspects, the annotations may be part of or associated with an annotation category. Such categories include, but are not limited to, evolutionary constraint annotations (i.e., conservation metrics) (e.g., Primate PhastCons, Mammalian PhastCons, Vertebrate PhastCons, Primate PhyloP, Mammalian PhyloP, Vertebrate PhyloP, GerpN, GerpS, GerpRS, GerpRSpval, bStatistic); missense annotations (e.g., Grantham, PolyPhenCat, PolyPhenVal, SIFTcat, SIFTval, oAA, nAA); epigenetic measurement annotations (e.g., EncExp, EncH3K27Ac, EncH3K4Me1, EncH3K4Me3, EncNucleo, EncOCC, EncOCDNaseSig, EncOCFaireSig, EncOCpolIISig, EncOCctcfSig, EncOCmycSig); functional prediction annotations (e.g., tOverlapMotifs, motifDist, motifECount, motifEName, motifEHlPos, motifEScoreChng, TFBS, TFBSPeaks, TFBSPeaksMax, Segway); sequence context annotations (e.g., Ref allele, Alt allele, Mutation type, Transversion?, Indel length, Local GC density, Local CpG density); and gene model annotations (e.g., Consequence, minDistTSS, minDistTSE, cDNApos, relcDNApos, CDSpos, relCDSpos, protPos, relProtPos, Dst2Splice, Dst2SplType). The list of annotations and categories of annotations above is non-limiting, as the SVM model described herein may be updated and/or re-trained to include additional annotations including newly discovered alternative annotations. In one embodiment, a set of annotations which includes those described above are shown in
In one embodiment, an SVM model was trained with a linear kernel on features derived from a number (Xn) of annotations, supplemented by a limited number of interaction terms. In one embodiment, the number of annotations is 63 (see
According to certain embodiments, a set of genetic variants (e.g., those part of a training set or a test set) that may be used for generating annotations, training an SVM or other machine learning model, or applying the SVM or other machine learning model described above, may be derived from any suitable source, such as one or more public variant databases known in the art, or from one or more customized databases that include one or more variants of interest identified by a user (e.g., a researcher or clinician). In some aspects, a set of genetic variants may be derived from a variant database including, but not limited to, Exome Variant Server (EVS), dbSNP (NCBI), dbNSFP, 1000 Genomes (variants deposited in dbSNP), 1000 Genomes (provided through the European Bioinformatics Institute), ENCODE Project, UCSC Genome Browser, COSMIC (Catalogue of Somatic Mutations In Cancer) Project, gwasCatalog (GWAS), refGene, knownGene, ccdsGene, phastCons, cytoBand, keggPathway, or CancerGeneCensus.
In one embodiment, an annotation matrix may be generated using a set of genetic variants derived from the following sources: the Ensembl Variant Effect Predictor (McClaren et al. 2010) (VEP), data from the ENCODE Project (ENCODE Project Consortium et al. 2012) and information from UCSC Genome Browser tracks (Meyer et al. 2013 (
In some embodiments, the method the methods for determining the relative effect (e.g., pathogenicity or functionality) of a genetic variant may include a step of calculating and/or assigning an integrated deleteriousness score (also referred to herein as “CADD scores” or “C-Scores”) for each of one or more genetic variants of the dataset based on the machine learning model described above.
The integrated deleteriousness score may be a raw integrated deleteriousness score or a scaled integrated deleteriousness score. Integrated deleteriousness scores are useful in at least two distinct forms, namely “raw” and “scaled”. In the embodiments described above, “raw” integrated deleteriousness scores come straight from the SVM, and are interpretable as the extent to which the annotation profile for a given variant suggests that that variant is likely to be “observed” (negative values) vs “simulated” (positive values). These values have no absolute unit of meaning and are incomparable across distinct annotation combinations, training sets, or SVM model parameters. However, raw values do have relative meaning, with higher values indicating that a variant is more likely to be simulated (or “not observed”) and therefore more likely to have deleterious effects.
Since the raw scores do have relative meaning, a specific group of variants may be identified, and the rank for each variant may be defined within that group. The ranked value may then be used as a “normalized” or “scaled” integrated deleteriousness score, which is an externally comparable unit of analysis. In the embodiments described in Example 1 below, all ˜8.6 billion SNVs of a GRCh37/hg19 reference genome were scored and then those values were “PHRED-scaled” by expressing them as rounded, order of magnitude values (with precision increasing for low ranks). For example, reference genome single nucleotide variants at the top 10% of CADD scores are assigned to CADD-10, top 1% to CADD-20, top 0.1% to CADD-30, etc. The results of this transformation are the “scaled” CADD scores.
Raw and scaled integrated deleteriousness scores are useful in different contexts. In one embodiment, raw scores may be used for resolution of genetic variants. For example, raw scores offer superior resolution across the entire spectrum, and preserve relative differences between scores that may otherwise be rounded away in the scaled integrated deleteriousness scores. For example, using the reference genome described in Example 1 below, the bottom 90% (˜7.74 billion) of all GRCh37/hg19 reference SNVs (˜8.6 billion) are compressed into scaled CADD units of 0 to 10, while the next 9% (top 10% to top 1%, spanning ˜774 million SNVs) occupy CADD-10 to CADD-20, etc., with the scaled units only getting close to resolving individual SNVs from one another at the extreme top end. As a result, many variants that have substantive raw score differences between them are rounded to the same or very similar scaled value.
In certain embodiments, a scaled integrated deleteriousness score may be used as a frame of reference e.g., between different reference genes or genomes, different versions of the machine learning models, or different/separate analyses. Since there must always be a top-ranked variant, second-ranked variant, etc., scaled scores are easier to interpret at first glance and will be comparable across different CADD framework versions as, for example, the SVM is updated to include new annotations or use alternative model-building methods. A scaled score of 10, for example, refers to the top 10% of all reference genome SNVs, regardless of the details of the annotation set, model parameters, etc. Furthermore, with scaled values one can always infer, with just a simple glance, the probability of picking a variant(s) at that score or greater when selecting randomly from all possible reference SNVs.
According to one embodiment described in the Examples below, ten SVM models, independently trained on observed variants and different samples of simulated variants, were highly correlated (all pairwise Spearman rank correlations >0.99;
According to the embodiments described herein, the integrated deleteriousness score assigned to a genetic variant may be used to determine its relative effect or effects (e.g., relative pathogenicity or functionality) when compared to other integrated deleteriousness scores. For example, in some embodiments, the integrated deleteriousness score assigned to a genetic variant may be compared to a plurality of integrated deleteriousness scores that are assigned or calculated for a reference gene or genome. In some aspects, the integrated deleteriousness scores for the reference gene or genome are precomputed and are used to provide a reference scoring scheme, within which an integrated deleteriousness score assigned to a genetic variant of interest may fit or be compared. For example, the reference genome described in Example 1 below may include a precomputed set of raw and/or scaled reference universal deleterious scores that may serve as a backdrop reference with which to compare a raw and/or scaled universal deleterious score of a genetic variant of interest. In other embodiments, the integrated deleteriousness score assigned to a genetic variant may be compared to a plurality of integrated deleteriousness scores that are assigned or calculated for a plurality of genetic variants that are part of the same dataset, or part of a different dataset.dataset In one aspect, a genetic variant of interest that is part of a dataset that includes 100 genetic variants, the integrated deleteriousness score of the genetic variant of interest may be compared to the integrated deleteriousness score of the other 99 genetic variants that are part of the dataset. The use of an integrated deleteriousness score is further discussed below.
According to some embodiments, the methods described herein may be used in several applications as follows, depending on the appropriate choice of scores.
In certain embodiments, the methods described herein may be used to discover causal variants within an individual, or small groups, of exomes or genomes. Scaled CADD scores are most useful in this context, as one will generally only be interested or capable of reviewing a small set of the “most interesting” variants. In this setting, the distinction between a variant at the 25th percentile and 75th percentile is effectively irrelevant (scaled scores of ˜0 to 1), while the difference between a variant in the top 10% (scaled score of 10) vs 1% (scaled score of 20) may be quite meaningful. Further, the absolute frame of the reference is valuable here, allowing an analyst to quickly place a variant in context and facilitate easier translation of results across publications, studies, etc.
In certain embodiments, the methods described herein may be used for fine-mapping to discover causal variants within associated loci. As above, scaled scores are likely to be more useful here by allowing focus on a small set of manually reviewable best candidates and providing the absolute frame of the reference genome.
In certain embodiments, the methods described herein may be used to compare distributions of scores between groups of variants, e.g., cases vs controls. In this case, raw scores should be used, as they preserve distinctions that may be relevant across the entire scoring spectrum. Scaled scores may obscure systematic and potentially highly significant distinctions between two groups of variants (e.g., the first and third quartiles of all hg19 SNV scores). Further, since such analyses are generally conducted computationally and without manual intervention, the absolute frame of reference advantage to scaled scores is not as valuable in this context.
In the analyses presented in the Examples described herein, both types of scores (raw and scaled) were used. For many FIGS. and Tables (e.g.
The methods described herein may be used in the context of a computer system or as part of software or computer-executable instructions that are stored in a computer-readable storage medium.
In some embodiments, a system (e.g., a computer system) may be used to implement certain features of some of the embodiments of the invention. For example, in certain embodiments, a system (e.g., a computer system) for generating an integrated deleteriousness score is provided. The integrated deleteriousness score generated by the system may be used to determine the relative effect (e.g., the relative pathogenicity) of a genetic variant in accordance with the features of the embodiments described above.
In certain embodiments, the system may include one or more memory and/or storage devices. The memory and storage devices may be one or more computer-readable storage media that may store computer-executable instructions that implement at least portions of the various embodiments of the invention. In one embodiment, the system may include a computer-readable storage medium which stores computer-executable instructions that include, but are not limited to, one or both of the following: (i) instructions for applying a machine learning model to a dataset including one or more genetic variants, each of which is associated with values or states of each of a set of annotations; and (ii) instructions for calculating and/or assigning an integrated deleteriousness score to each of one or more genetic variants. Such instructions may be carried out in accordance with the methods described in the embodiments above.
In certain embodiments, the system may include a processor configured to perform one or more steps including, but not limited to, (i) receiving a dataset (e.g., a set of genetic variants and associated annotation data entered or uploaded by a user); and (ii) executing a set of computer-executable instructions stored in a computer-readable storage medium, such as that described above. The steps may be performed in accordance with the methods described in the embodiments above.
The computer system may be a server computer, a client computer, a personal computer (PC), a user device, a tablet PC, a laptop computer, a personal digital assistant (PDA), a cellular telephone, an iPhone, an iPad, a Blackberry, a processor, a telephone, a web appliance, a network router, switch or bridge, a console, a hand-held console, a (hand-held) gaming device, a music player, any portable, mobile, hand-held device, wearable device, or any machine capable of executing a set of instructions, sequential or otherwise, that specify actions to be taken by that machine.
The computing system may include one or more central processing units (“processors”), memory, input/output devices, e.g. keyboard and pointing devices, touch devices, display devices, storage devices, e.g. disk drives, and network adapters, e.g. network interfaces, that are connected to an interconnect.
According to some aspects, the interconnect is an abstraction that represents any one or more separate physical buses, point-to-point connections, or both, connected by appropriate bridges, adapters, or controllers. The interconnect, therefore, may include, for example a system bus, a peripheral component interconnect (PCI) bus or PCI-Express bus, a HyperTransport or industry standard architecture (ISA) bus, a small computer system interface (SCSI) bus, a universal serial bus (USB), IIC (12C) bus, or an Institute of Electrical and Electronics Engineers (IEEE) standard 1394 bus, also referred to as Firewire.
In addition, data structures and message structures may be stored or transmitted via a data transmission medium, e.g. a signal on a communications link. Various communications links may be used, e.g. the Internet, a local area network, a wide area network, or a point-to-point dial-up connection. Thus, computer readable media can include computer-readable storage media, e.g. non-transitory media, and computer-readable transmission media.
The instructions stored in memory can be implemented as software and/or firmware to program one or more processors to carry out the actions described above. In some embodiments of the invention, such software or firmware may be initially provided to the processing system by downloading it from a remote system through the computing system, e.g. via the network adapter.
The various embodiments of the invention introduced herein can be implemented by, for example, programmable circuitry, e.g. one or more microprocessors, programmed with software and/or firmware, entirely in special-purpose hardwired, i.e. non-programmable, circuitry, or in a combination of such forms. Special purpose hardwired circuitry may be in the form of, for example, one or more ASICs, PLDs, FPGAs, etc.
The CADD methods described herein provide a generic, expandable framework that may be used for integrating information contained in diverse annotations of genetic variation into a single score. It was demonstrated that in a variety of contexts this approach is better than other widely used annotations prioritizing functional and pathogenic variants (see Examples below). Further, beyond usefulness in any one setting, there are practical and conceptual advantages to CADD that provide significant value to genetic studies of human disease for at least the following reasons. First, the information content of many individual annotations is objectively merged into a single value, which is far preferable to ad hoc approaches for combining annotations and is likely to improve performance, consistent with the benefits seen for consensus methods in missense mutation—specific annotation (Gonzalez-Perez & Lopez-Bigas 2011). Second, the CADD framework can readily be updated to incorporate expansions to existing annotations and entirely new annotations. This ability to indefinitely and readily integrate new information is crucial in light of annotation tools and projects such as ENCODE, which are continuously and rapidly expanding available annotations (ENCODE Project Consortium et al. 2012). Third, the CADD framework combines the generality of conservation-based metrics with the specificity of subset-relevant functional metrics (for example, PolyPhen), exploiting the advantages of both approaches while attenuating their respective disadvantages.
The one-stop nature of CADD confers practical and conceptual value to future sequencing studies. It will minimize the scope and diversity of annotations that have to be generated, tracked and evaluated by a laboratory or project and will reduce the need for ad hoc combinations of filters, scores and parameters as is now routinely carried out. For example, a standard approach in exome studies is to merge missense (with or without an annotation of damaging or a given level of conservation), nonsense and splice-disrupting variants into a single, internally unranked list of protein-altering variants before genetic analysis (Ng et al. 2009). With CADD, one might avoid arbitrary filters or thresholds altogether, including both coding and noncoding variants on a single, meaningfully ranked list. For example, a recent study of recessive, non-syndromic pancreatic agenesis identified five causal noncoding variants that disrupted the function of a distal enhancer of PTF1A (Weedon et al. 2014). C-scores for these noncoding, disease-causal variants (scaled scores between 23.2 and 24.5) rank them higher than 99.5% of all possible human SNVs, higher than 97% of missense SNVs in a typical exome and higher than 56% of pathogenic SNVs in ClinVar (Baker 2012).
In research and in the clinic, the capacity to define catalogs of genetic variants currently exceeds the ability to systematically evaluate their potential effects. This disparity will deepen as sequencing accelerates, genomes displace exomes and the array of functional categories and annotations expands. As a solution, a unified, quantitative and scalable framework (such as the CADD framework described herein) that is capable of exploiting many genomic annotations may be used to meet the challenge posed. Thus, the model, methods, and systems described herein are broadly useful, and will improve over time with updates, allowing for better interpretation of variants of uncertain significance in a clinical setting and improving discovery power for genetic studies of both Mendelian and complex diseases.
The following examples are intended to illustrate various embodiments of the invention. As such, the specific embodiments discussed are not to be construed as limitations on the scope of the invention. It will be apparent to one skilled in the art that various equivalents, changes, and modifications may be made without departing from the scope of invention, and it is understood that such equivalent embodiments are to be included herein. Further, all references cited in the disclosure are hereby incorporated by reference in their entirety, as if fully set forth herein.
A. Simulated and Observed Variants
The basis of the CADD framework is to capture correlates of selective constraint as manifested in differences between two datasets: (1) simulated events generated using parameters estimated from whole genome species alignments, which contain some proportion of deleterious alleles, and (2) species differences that underwent many generations of mostly purifying/negative selection and are depleted for deleterious alleles.
Simulated variants. A genome-wide simulator of de novo germline variation was developed in accordance with the methods and embodiments described herein. Differences between human genomes and the inferred human-chimpanzee ancestral genome (Paten et al. 2008a) where humans carry a derived allele with a frequency of at least 95% (14.9 million SNVs and 1.7 million indels) were identified. Nearly all of these events are fully fixed in the human lineage, with fewer than 5% appearing as nearly fixed polymorphisms in the 1000 Genomes Project (1000 Genomes Project Consortium et al. 2012) variant catalog (derived allele frequency (DAF) ≧95%). The simulator is partially based on the parameters of the General Time Reversible (GTR) model (Tavaré 1986), but because the standard GTR does not naturally accommodate asymmetric CpG-specific mutation rates, a fully empirical model of sequence evolution with a separate rate for CpG dinucleotides and local adjustment of mutation rates (on a 1-Mb scale) was used to simulate de novo mutations. Simulation parameters were obtained from Ensembl Enredo-Pecan-Ortheus (EPO) (Paten et al. 2008b; Paten et al. 2008a) whole genome alignments of six primate species (Ensembl Compara release 66). Using a custom script, an inferred human-chimpanzee ancestor was compared with its aligned human reference sequence (GRch37) to obtain a genome-wide substitution rate matrix, local mutation rate estimates in blocks of 100 kb, and frequency and length distribution of insertion and deletion events.
These parameters were applied to simulate single nucleotide variants (SNV, also referred to as single nucleotide polymorphisms, or SNPs) and insertion/deletion (also referred to herein as “indel”) variants based on the human reference sequence (GRCh37). Variants were simulated by iterating through all bases of the human reference autosomes and the X chromosome and picking sites for mutation with probabilities corresponding to the genome-wide substitution rate matrix. The Y chromosome and additional contigs were not included in this embodiment to exclude effects due to variation in sequence quality. The implementation of the simulator uses a predefined approximate number of mutations, including the relative rates of substitutions and indels based on the EPO alignments. Further, it locally adjusts the overall mutation rate based on the local mutation rate estimated by averaging over the five 100 kb blocks up- and downstream of the site as well as the block of the actual site (i.e. a 1.1 Mb sliding window). Using an approximate number of 40 million autosomal and 2 million X-chromosomal mutations, a total of 46,735,302 SNVs, 2,227,688 insertions (1 to 50 bp) and 3,291,250 deletions (1 to 50 bp) were simulated. The simulated variants were limited to genomic regions for which an inferred human-chimpanzee ancestor sequence is available from the EPO alignments in this embodiment; this reduced the final numbers to 44,182,238 SNVs, 2,108,268 insertions and 3,116,551 deletions. These are referred to as “simulated variants”.
Observed variants. For observed human-derived changes, sites at which the human reference genome differs from the inferred human-chimp ancestral genome were extracted from the Ensembl EPO. This resulted in six primate alignments defined above, excluding variants in the most recent 1000 Genomes Project data (1000 Genomes Project Consortium et al. 2012, variant release 3) with a frequency of greater than 5%, and including variants where the human reference carries an ancestral allele (i.e. matching the inferred human-chimp ancestor sequence) but where the derived allele is observed with frequency above 95% in the 1000 Genomes project data. Low frequency derived variants (average derived allele frequency (DAF) less than 95%) were excluded in order to guarantee that alleles were exposed to many generations of natural selection. A total of 14,893,290 SNVs, and 627,071 insertions and 1,107,414 deletions (less than 50 bp in length) were identified. This set of variants is referred to herein as “HCdiff variants” or “observed variants”. It is noted that even though high frequency derived alleles that are not fully fixed were included, they constitute a small proportion of the observed variants; 99.37% of indels and 95.41% of SNVs in the set of observed variants are invariant in 1000G data.
B. Variant Annotation Matrix
The Ensembl Variant Effect Predictor (VEP, Ensembl Gene annotation v68) (McLaren et al. 2010) was used to obtain gene model annotation for single nucleotide and indel variants. For single nucleotide variants (SNVs) within coding sequence, SIFT (Ng & Henikoff 2003) and PolyPhen-2 (Adzhubei et al. 2010) scores from VEP were also obtained. Output lines describing MotifFeatures were combined with the other annotation lines, reformatted it to a pure tabular format and reduced the different Consequence output values to the following 17 levels: 3PRIME_UTR, 5PRIME_UTR, DOWNSTREAM, UPSTREAM, INTERGENIC, INTRONIC, NONCODING_CHANGE, SYNONYMOUS, NON_SYNONYMOUS, REGULATORY, CANONICAL_SPLICE, SPLICE_SITE, STOP_GAINED, STOP_LOST, INFRAME, FRAME_SHIFT, and UNKNOWN. For training, a 4-level hierarchy was implemented in the case of overlapping annotations. This hierarchy is as follows: if multiple VEP annotation lines were reported for the same variant (due to overlapping annotations), the most deleterious were selected based on the following ranking scheme: (1) VEP effect Sequence Ontology annotation containing substrings “coding”, “missense”, “synonymous”, “stop”, “mature”, “splice”, “initiator_codon”, “frame”, or “terminal_codon”, (2) Sequence Ontology annotations containing “utr” or “regulatory”, (3) Sequence Ontology annotations containing “intronic”, “upstream”, or “downstream”, (4) everything else. A random annotation was selected if multiple lines with the same priority were observed.
To the 6 VEP input derived columns (chromosome, start, reference allele, alternative allele, variant type: SNV/INS/DEL, length) and 26 actual VEP output derived columns, 56 columns were added that contain the following annotations: the ancestral primate allele as obtained from the EPO six primate alignments; a Boolean column indicating whether the ancestral allele is different from the alternative allele; a Boolean column indicating whether the base substitution is a transition or transversion; the Duke University mapability score of 20 bp and 35 bp sequences as distributed by UCSC (Rosenbloom et al. 2012); segmental duplication annotation as provided by UCSC (Fujita et al. 2011); PhastCons and phyloP conservation scores (Hubisz et al. 2011) for primate, mammalian and vertebrate multi-species alignments—all determined starting from UCSC whole genome alignments (Siepel et al. 2005) but excluding the human reference sequence in score calculation; GERP++ (Davydov et al. 2010) N/S and region scores/p-values; the background selection score (original coordinates transferred from NCBI36 to GRCh37) (Meyer et al. 2012; McVicker et al. 2009); the maximum expression value, maximum H3K27 acetylation peak, maximum H3K4 methylation peak, maximum H3K4 trimethylation peak and maximum value in the nucleosome occupancy tracks provided for ENCODE cell lines in the UCSC super tracks (Rosenbloom et al. 2012); maximum peaks and p-values from the Encode open chromatin UCSC track (includes Faire, Dnase, Poll!, CTCF, Myc values as well as two summary scores) (Rosenbloom et al. 2012); the genomic segment type assignment obtained from clustering of ENCODE features (Segway (Hoffman et al. 2012)); the total number of predicted transcription factor (TF) binding sites and the difference in base compositionb from the reference allele to the alternative allele for TF binding motifs (ENCODE Project Consortium et al. 2012); the number of different overlapping ENCODE transcription factors; the number and maximum peak of all overlapping ENCODE ChIP-seq transcription factor binding sites in different cell/tissue types (ENCODE Project Consortium et al. 2012) (UCSC EncodeAwgTfbsUniform tracks excluding transcriptions factors already used in open chromation track); a Boolean column indicating whether this site is observed in the above described 1000 Genome variants or the Exome Sequencing Project (ESP) variants (Tennessen et al. 2012); the average allele frequency in 1000 Genomes and the average allele frequency in 1000 Genomes limited to Asian populations, limited to South American population, limited to African populations, and limited to European populations; the average alternative allele frequency in ESP and the average alternative allele frequency in ESP for individuals of African ancestry and individuals of European ancestry; the distance to the closest transcribed sequence start (TSS) and transcribed sequence end (TSE) position in the Ensembl v68 transcript annotation; the distance to the next splice site if 20 bp upstream or downstream, in which case it is also indicated whether this site is approached from within an exon or intron and whether it is a splice acceptor or donor site; and finally the Grantham score (Grantham 1974) associated with a reported amino acid substitution.
If position values (cDNApos, CDSpos, protPos) for indels were provided as value ranges by VEP, the first value reported for the interval was selected. For the additional annotations, the most extreme value was extracted across the positions impacted by an indel event (i.e. all deleted bases for deletions and the base before and after the event for insertions).
C. Imputation and Final Training Data Set
From the annotation described above, some columns are not useful for model training (Chrom, Pos, AnnoType, ConsScore, ConsDetail, motifEName, GeneID, FeatureID, CCDS, GeneName, Exon, Intron) or need to be excluded from training as they will differ between the simulated variants and the human-chimpanzee ancestor differences for technical reasons (Anc, isDerived, mapAbility20 bp, mapAbility35 bp, scoreSegDup, known variation status and ESP/1000G frequency information). Importantly, no allele frequency information was used in model training. In order to fit models, missing values in the remaining annotations must be imputed. Missing values in genome-wide measures were imputed by the genome average obtained from the simulated data, or set missing values to 0 where appropriate (
When extracting differences between the human-chimp ancestor and present-day human alleles, a deletion from the ancestor is described as an insertion into the human reference; the same applies when describing mutations from the ancestor; they are thus oriented back in time. In contrast, the simulation contains forward mutations of the human reference. To correct this effect, Ref/Alt and nAA/oAA columns were interchanged for HCdiff. For the same reasons, INS/DEL levels in the Type column and STOP_GAINED/STOP_LOST levels in the Consequence column were interchanged for the HCdiff variants before training.
Sites from the simulation were labeled +1 and human derived variants (i.e., sites identified from HCdiff) −1. Only insertions and deletions shorter than 50 bp were considered for model training and the Length column was capped at 49 for the prediction of longer events. The ratio of indel events to SNV events observed for the simulation (1:8.46) was also set for HCdiff by sampling an equal number of variants for both data sets: 13,141,299 SNVs, 627,071 insertions and 926,968 deletions each.
D. Exploratory Analysis of Annotations
Univariate analyses of SNVs, insertions, and deletions. The following analyses were performed separately on the SNVs, insertions, and deletions. The variants were split into equally-sized training and test sets. For each feature, a univariate logistic regression model was fit on the training set in order to predict whether a site is observed or simulated using just that feature. Test set performance was evaluated using (1) area under the curve (AUC), which is equivalent to a Mann-Whitney U-statistic, and which quantifies the extent to which simulated sites are given higher predictions of deleteriousness than observed sites; and (2) depletion of observed sites among the 0.1%, 1%, and 10% of sites predicted to be most deleterious. An AUC of 0.5 is expected by chance, and an AUC near 1 indicates a model that successfully assigns higher predictions of deleteriousness to simulated sites than to observed sites. Depletion is defined as (fraction of observed sites among the x % predicted to be most deleterious)/(fraction of observed sites in the full data set); a value of 1 is expected by chance, and a small value indicates that the sites predicted to be most deleterious are predominantly simulated. Results are given in
Correlations among quantitative features.
Interactions among features. To determine if predictions of whether a SNV is observed or simulated could be improved by including interactions in the model. For each pair of features x1 and x2, linear regression was used to fit a main effects model of the form y˜x1+x2, as well as an interaction model of the form y˜x1+x2+x1x2. Here y is a vector that encodes whether each variant is observed or simulated.
Distance to splice sites. Logistic regression models were fit to predict whether a SNV is observed or simulated, using its distance from splice site (treated as a categorical variable) for sites in the exon donor, intron donor, intron acceptor, and exon acceptor regions. Variants within 20 bp of a splice site that were neither non-synonymous, stop-gain, nor stop-loss events were included.
Depletion of observed sites by Consequence. For each consequence bin, the depletion of observed SNVs in that bin was computed: namely, (fraction of sites in that bin that are observed)/(fraction of all sites that are observed). Results are shown in
Interaction between Consequence and position of mutation in cDNA. In order to determine whether the deleteriousness of a given Consequence is associated with the SNV's position within the cDNA, a logistic regression model was fit to predict whether a SNV is observed or simulated on the basis of Consequence, relcDNApos, and an interaction between the two. Results are shown in
Summary. The validity of the general approach was first assessed by constructing a series of univariate models described above that contrast observed and simulated variants using each of the 63 annotations as individual predictors. Nearly all models were highly predictive for distinguishing observed and simulated variants (
In addition, correlations between annotations (
Overall, the analyses described above demonstrate that substantial biological differences are present between the observed and simulated variants with respect to the 63 annotations and that linear models capture much of this information.
E. Model Training
Ten training data sets were generated by sampling an equal number of 13,141,299 SNVs, 627,071 insertions and 926,968 deletions from both the simulated variant and observed variant datasets. To train each support vector machine (SVM) model, the processed data was converted to a sparse matrix representation after converting all n-level categorical values to n individual Boolean flags. 1% of sites (˜132,000 SNVs, 6,000 insertions and 9,000 deletions each) were randomly selected and used as a test data set. All other sites were used to train linear SVMs using the LIBOCAS v0.96 library (Franc & Sonnenburg 2009).
The SVM model fits a hyperplane as defined below (Function 1). X1, . . . , Xn represent the 63 annotations described above (which are expanded from 63 to 166 features due to the treatment of categorical annotations), W1, . . . , W11 represent the Boolean features that indicate whether a given feature (out of cDNApos, relcDNApos, CDSpos, relCDSpos, protPos, relProtPos, Grantham, PolyPhenVal, SIFTval, as well as Dst2Splice ACCEPTOR and DONOR) is undefined, 1{A} is an indicator variable for whether the event A holds, and D is the set of bStatistic, cDNApos, CDSpos, Dst2Splice, GerpN, GerpS, mamPhCons, mamPhyloP, minDistTSE, minDistTSS, priPhCons, priPhyloP, protPos, relcDNApos, relCDSpos, relProtPos, verPhCons, and verPhyloP. Due to the coding of categorical values using Boolean variables, the total number of features for this model is 949. The hyperplane is defined by Function 1 as follows:
SVM models were trained, using various values for the generalization parameter (C), which assigns the cost of misclassifications.
F. Model Testing and Validation
All 8.6 billion possible substitutions in the human reference genome (GRCh37) were annotated, and the model was applied to score all possible substitutions. When scoring sites with multiple VEP annotation lines, all possible annotations were scored first and then the one with the highest deleteriousness after applying the four hierarchy levels was reported. As the scale of the model-based combined scores (“C-scores”) resulting from the SVM model is effectively arbitrary, the C-scores were mapped to a phred-like scale (“scaled C-scores”) ranging from 1 to 99 based on their rank relative to all possible substitutions in the human reference genome, i.e. −10log10(rank/total number of substitutions). For example, the 1% (10−2) of all possible substitutions with the lowest scores—that is, least likely to be observed human alleles under the model—were assigned values of 20 or greater (“C20”). Several datasets extracted from the literature and public databases were used to look at the performance of the model scores.
Functional consequences for possible substitutions to reference genome. In accordance with the model described above, the proportion of all possible substitutions with a given scaled C-score having specific functional consequences was calculated (
Genetic diversity for all possible substitutions to reference genome. Scaled C-scores were compared with levels of genetic diversity, finding that C-scores were negatively correlated with the average derived allele frequencies (DAFs) of variants listed by the 1000 Genomes Project (1000 Genomes Project Consortium et al. 2012) or the Exome Sequencing Project (Fu et al. 2013) (ESP) (
The CADD framework described above may be used for prioritizing functional and disease-relevant variation. This use is evidenced in accordance with the five distinct contexts as described below. For these contexts, several data sets extracted from the literature and public databases and were used to examine the performance of model scores.
A. C-Scores in Specific Gene Classes.
To determine the relationship of C-scores within specific gene classes, the following groups of genes were obtained for analysis: (i) genes with at least 5 disease mutations (“DM”; missense, non-sense and indels) from HGMD (Stenson et al. 2009); (ii) 120 human non-immune essential genes (with associated diseases) described by Liao et al. (“essential”) (Liao et al. 2008); (iii) GWAS genes as available from the reported genes column of http://www.genome.gov/Pages/About/OD/OPG/GWAS%20Catalog/GWASCatalog112608.xls; (iv) loss-of-function (LoF) genes from MacArthur et al. (MacArthur et al. 2012; supplementary material 1, filtered column==0, at least 2 observations); and (v) olfactory genes by matching “olfactory receptor” in the description field of the Ensembl 68 gene build.
All obtained gene IDs were matched to Ensembl 68 protein-coding gene identifiers and following hierarchy for genes observed in multiple categories were applied: essential, disease, GWAS, olfactory, LoF. In addition, 500 random non-overlapping protein-coding genes were selected for the “other” category.
B. MLL2 Variants.
A total of 210 mutations in KMT2D (MLL2) associated with Kabuki syndrome were downloaded (see mutations in Makrythanasis et al. 2013, which is hereby incorporated by reference as is fully set forth herein). From the variants in
The Kabuki syndrome-associated KMT2D (MLL2) variants are 46% frameshift indels, 37% nonsense, 16% missense, 1% inframe indels and <1% splice site events, while the ESP-based MLL2 variants are 40% missense, 31% synonymous, 21% intronic, 3% splice site events, 2% inframe indels and 6% other. Further, C-scores for the KMT2D (MLL2) variants enabled the discrimination of a diverse set of disease-associated alleles (Makrythanasis et al. 2013) from rare, likely benign variants listed in ESP (Fu et al. 2013) (Wilcoxon rank-sum test P=9.9×10−94; n=210 disease associated/679 likely benign). Other metrics were markedly inferior in terms of accuracy or comprehensiveness (
C. HBB Variants.
A total of 119 SNVs, 30 insertions and 63 deletions (all required to be at most 50nt) within or near HBB that give rise to β-thalassemia from HbVar (Giardine et al. 2007) were downloaded. Disease categories were used as defined by HbVar, except that all types that are not “beta0” or “beta+” were pooled into one category, “other”. Results for this analysis are presented in
C-scores of disease-associated alleles (Giardine et al. 2007)—a set of indels (n=93) and SNVs (n=119) with regulatory/upstream (n=54), splicing (n=37), missense (n=22), nonsense (n=18) and other effects—were significantly and more strongly correlated with 3 levels of phenotypic severity than other measures (Kruskal-Wallis rank-sum test P=2.4×10−7; n=48 mild/65 intermediate/99 severe;
D. ClinVar.
The ClinVar (Baker 2012) data set (release date Jun. 16, 2012, ftp://ftp.ncbi.nih.gov/snp/organisms/human_9606/VCF/clinvar_00-latestvcf.gz) was obtained from the American National Center for Biotechnology Information (NCBI). Variants that were marked “pathogenic” or “non-pathogenic (benign)” were extracted. However, it was noticed that the benign variation had a very different composition in terms of the Consequence annotation compared to the pathogenic variation. Due to the restriction of the most predictive publically available scores (i.e. PolyPhen, SIFT) to non-synonymous changes, those scores were underrepresented in the benign set. Therefore a set of apparently benign (≧5% allele frequency) variants were selected from ESP that were matched to the pathogenic ClinVar sites in terms of their Consequence annotations. In addition, a data set was generated where ESP and ClinVar frequencies were matched to three decimal precisions of the alternative allele frequency. Further, due to the overlap of ClinVar and ESP variants with the PolyPhen training data set, a separate classifier was trained without the PolyPhen features and the performance on the subset of ClinVar and ESP variants not used for PolyPhen training was also checked. To compare the performance of CADD with other publically available missense annotations not used in model training, scores from dbNSFP 2.0 (Liu et al. 2011) were downloaded. Finally, ClinVar pathogenic variants were analyzed in the context of the eleven men data (see Example 3 below). Results for these analyses are presented in
The ClinVar pathogenic variants used here are 76% missense, 18% nonsense, 3% splice site events, 1% frameshift indels and 2% other (and ESP benign variants were always matched to the same distribution of categorical consequences). It is noted that there was substantial overlap between ClinVar and the training data underlying PolyPhen. When the corresponding sites were excluded from the test data set or when PolyPhen was excluded as a training feature from CADD, C-scores continued to outperform all or nearly all missense-only metrics and conservation measures (
E. ALDOB, ECR11 Enhancers and HBB Promoter.
The expression fold change was obtained for each base substitution in ALDOB and ECR11 enhancers from a data set which contains a total of 777 variants for ALDOB and 1860 variants for ECR11 (see Patwardhan et al. 2012. Supplementary data, http://www.nature.com/nbt/journal/v30/n3/extref/nbt.2136-S2.zip, the subject matter of which is incorporated by reference as if fully set forth herein). Further, HBB promoter data was obtained from a data set which contains a total of 210 variants associated with an expression fold change (see Patwardhan et al. 2009, Supplementary data, http://www.nature.com/nbt/journal/v27/n12/extref/nbt.1589-S2.zip, the subject matter of which is incorporated by reference as if fully set forth herein). Results for this analysis are presented in
The two enhancers (Patwardhan et al. 2012) and one promoter (Patwardhan et al. 2009) were examined, in which saturation mutagenesis was previously performed. C-scores were significantly correlated with experimentally measured fold change in absolute expression from individual variants and were overall more significantly correlated than measures of sequence conservation (Spearman rank correlation of combined data=0.31, P=1.9×10−65, n=2,847;
F. IARC p53 Variants.
A list of 23,788 single nucleotide somatic cancer mutations in p53 were obtained, which were reported to the International Agency for Research on Cancer (IARC, http://p53.iarc.fr/TP53Somatic Mutations.aspx). These mutations correspond to 2,068 distinct variants. The number of times that each variant was reported was recorded. The Spearman rank correlation between the number of observations per variant and the C-score is 0.38, p=5.8×10−73 (n=2,068). Accordingly, the C-scores strongly correlated with the number of observed somatic cancer mutations in TP53 (encoding p53) reported to the International Agency for Research on Cancer (IARC)
G. Summary.
Collectively, these analyses demonstrate that CADD is quantitatively predictive of deleteriousness, pathogenicity and molecular functionality, both protein altering and regulatory, in a variety of experimental and disease contexts. In each of these contexts, the predictive usefulness of CADD was much better than measures of sequence conservation, the only comprehensive type of variant score, and also tended to be better, in most cases substantially so, than function-specific metrics when restricted to the appropriate variant subsets.
The CADD framework described above is also useful in evaluating candidate variation within exome or genome-wide studies, as evidenced by the following studies.
A. Autism and Intellectual Disability Variants.
All high confidence de novo mutations were combined from five family based autism exome sequencing studies (O'Roak et al. 2011; O'Roak et al. 2012; Sanders et al. 2012; Neale et al. 2012; lossifov et al. 2012). Calls from individuals studied in multiple studies were merged for a total of 948 ASD probands and 590 unaffected siblings. These included all validated variants from studies (O'Roak et al. 2011; O'Roak et al. 2012; Sanders et al. 2012; Neale et al. 2012; lossifov et al. 2012) as well as all sites from lossifov et al. (lossifov et al. 2012) that passed SNVFilter and IndelFilter. Only coding and canonical splice-site positions were considered. Coding sites were defined by RefSeq, CCDS, and UCSC genes. All indel start positions (noting the position prior to the change) and alleles were reformatted to match the current VCF convention. In the case of complex mutation events (i.e. multiple base changes in close proximity) where multiple nucleotide changes were predicted to alter the protein, the complex mutation was provided to VEP, treating it like an indel for the scoring process. For sites with a non-synonymous and synonymous change, only the non-synonymous was considered. For the special case of two missense events reported in 12624.p1 (separated by 1.9kb, involving a possible processed pseudogene), only the 5′-most variant (12:58129165) was considered.
Further, the coding variants were obtained as described above for two family-based intellectual disability (ID) studies (Rauch et al. 2012; de Ligt et al. 2012). These calls came from 151 ID and 20 unrelated control families.
The de novo exome variants (SNVs and indels) identified in children with autism spectrum disorders (ASD) and intellectual disability (see above) were analyzed along with unaffected siblings or controls, considering 88 nonsense, 1,015 missense, 359 synonymous, 32 canonical splice-site and 150 other variants, including indels. This correlates to 61%/63% missense variants, 6%14% nonsense variants, 4%/2% splice site events, 20%/25% synonymous variants, and 10%/6% other variants in probands and controls for ASD and intellectual disability, respectively.
Results of this analysis are shown in
B. Eleven Men.
GATK VCF variant call files were obtained for all autosomes and the X chromosome from shotgun sequencing of eleven men originating from diverse human populations (Meyer et al. 2012, see http://cdna.eva.mpg.de/denisova/VCF/human/). Variants were filtered as previously described (Meyer et al. 2012) using the annotation available in the obtained VCF files. The following variant positions were removed: (1) positions with extremely high or low coverage (upper and lower 2.5% of the coverage distribution for each sample), (2) positions surrounding insertions/deletions (±5 by of an insertion/deletion), (3) positions identified as prone to systematic error in Illumina sequencing, (4) positions marked by soft masking in the human reference sequence, (5) positions with a 20-mer mapability score <1, (6) positions with genotype quality (GQ) <40, as well as (7) positions with a non-empty GATK flag field. Results of this analysis are shown in
It is well established that annotations such as PolyPhen and conservation scores are valuable in the sequencing-based identification of disease-causal genes by virtue of their ability to highly rank pathogenic variants (Cooper et al. 2010; Cooper & Shendure 2011a; Ng et al. 2010). Therefore, the distribution of C-scores for variants in the genomes of 11 individuals representing diverse populations (Rohland & Reich 2012) were examined, finding that CADD highly ranked known disease-causal (ClinVar pathogenic) variants within the complete spectrum of variation in personal genomes (
C. Increased C-Scores of GWAS Lead SNPs.
The National Human Genome Research Institute (NHGRI) genome-wide association study (GWAS) catalog was downloaded (http://www.genome.gov/gwastudies/), which included 9,977 distinct SNP-trait associations spanning 7,531 unique SNPs in the 1000 Genomes release (v3 20101123); these variants are hereafter referred to as “lead SNPs”. The Genome Variation Server (GVS, http://gvs.gs.washington.edu/GVS137/) was used to find all SNPs within 100 kb of a lead SNP that have a pairwise correlation of R2>=0.8 within Utah residents with ancestry from northern and western Europe (CEU); it is noted that not all GWAS studies in the catalog were conducted within populations of European ancestry, but CEU was chosen as the single most broadly applicable population. This resulted in an additional 56,538 unique SNPs, hereafter referred to as “tag SNPs”. Lead and tag SNPs are referred to as “trait-associated” or simply “associated”.
“Control” SNP sets were also developed, and were selected to match trait-associated SNPs for a variety of features that may bias SNPs found by GWAS in the absence of any causal effects. Specifically, for each trait-associated SNP the closest SNP that has the same reference and alternate alleles, has a 1000 Genomes average alternate allele frequency within 5%, and has a similar SNP array presence profile was chosen. For the last criterion, rather than attempt to match for all possible SNP array designs, a match for presence/absence on four widely used genotyping arrays that were directly used in many GWASs and indirectly capture many of the general biases that affect SNP array design was chosen: Affymetrix 5.0, Affymetrix 6.0, Illumina HumanHap 550, and Illumina 1M. Each control SNP was required to match the exact same combination of array presence/absence; for example, if an associated SNP is present on both Affymetrix arrays but neither Illumina array, then its matched control must also be present on both Affymetrix arrays and neither Illumina array.
No control SNPs were selected more than 500 kbp away from an associated SNP, and each control SNP was assigned to one and only one associated SNP. In total, 5,498 lead SNPs (˜73%) and 46,195 tag SNPs (˜82%) were matched. The median distance between associated SNPs and their matched controls is 32,601 bp, while the median alternate allele frequency difference between associated SNPs and matched controls is 2%.
C-score distributions were subsequently compared between the associated and control SNPs defined above. Details of all statistical tests, including SNP set descriptions, counts, and p-values, are supplied in
It was found that lead GWAS SNPs had significantly higher C-scores than control SNPs (one-sided Wilcoxon rank-sum test P=1.3×10−12, n=5,498 GWAS/5,498 control); nearby SNPs in linkage disequilibrium with lead SNPs (tag SNPs) scored lower on average than lead SNPs but also had significantly higher scores than their matched controls (P=5.1×10−107). Lead SNP C-scores are significantly higher than lead-matched controls (p=1.27×10−12); the difference is less pronounced, in terms of absolute score difference, for tag SNPs but also highly significant (p=5.11×10−107). The drop in scores from lead (mean scaled C-score of 4.46) to tag SNPs (mean of 3.89) would be expected if causal variants are more highly enriched among lead SNPs relative to tag SNPs, as is likely given that lead SNPs for any given locus are selected on the basis of showing the strongest association signal within that locus (www.genome.gov/gwastudies). However, the differences between leads and tags also correlates with differences in allele frequencies, as tag SNPs tend to have higher alternate allele frequencies (median of 35%) than lead SNPs (median of 32%) and would thus tend to exhibit lower CADD scores as a result (
Differences in C-score remained significant after controlling for properties such as gene-body effect, gene expression level, conservation and regulatory element overlap; each of these properties was significantly different (all P<0.01) for associated and control SNPs, but none could fully explain discrepancies in C-score (
Further, CADD scores for SNPs identified by GWAS of complex traits were analyzed, contrasting them with scores for nearby control SNPs matched for allele frequency and genotyping array availability (
Very similar results are observed when comparing C-scores to the p-values of the individual associations (
The above results were generated using only those associated SNPs for which a match could be obtained; associated SNPs for which no match could be identified (i.e., no SNP meeting all the matching criteria within 500 kbp) have similar C-scores as associated SNPs with a match and also correlate with sample size and association significance (not shown). Their inclusion therefore tends to result in similarly highly significant test results as those presented here.
In addition, it was found that neither physical distance nor allele frequency discrepancies can explain the effects that were observed. For example, CADD scores are significantly higher for lead SNPs that are <10 kb from their matched control, for those that have a similar (+/−1%) 1000 Genomes alternate allele frequency as their matched control, and also for lead SNPs that meet both criteria (
Moreover, the role of individual annotation contributions to the C-score differences between associated and control SNPs was examined. In particular, the contributions of missense variation, distance to transcriptional start site (TSS), gene body overlap/consequence, and sequence conservation were evaluated. Such annotations may have intrinsic biases with respect to GWAS signals but are also likely to correlate with variant functionality/causality. Indeed, these features are among the most widely used criteria to evaluate candidate variants in disease studies and among the largest individual contributors to C-scores (
It was found that each annotation explains part of the C-score differences, but none are sufficient to fully explain them, even when conservatively controlled for alone or in combination. For example (all relevant p-values and other information can be found in
All the same trends and significant differences described here for lead and lead-matched control SNPs (i.e., controlling for conservation, gene consequence, missense, or distance to TSS) also hold for tag and tag-matched SNPs, with smaller absolute differences in C-scores that are more highly significant owing to the substantially increased SNP counts (not shown).
Thus, while clear single-annotation contributors to the GWAS SNP C-score increase can be found, no individual annotation differences are as statistically strong as that seen for C-scores and none can fully explain the observations described above. These observations suggest that CADD is able to effectively exploit multiple information sources and prioritize causal variants across a diverse range of functional and evolutionary categories.
The references, patents and published patent applications listed below, and all references cited in the specification above are hereby incorporated by reference in their entirety, as if fully set forth herein.
This application claims priority to U.S. Provisional Application No. 61/880,286, filed Sep. 20, 2013, the subject matter of which is hereby incorporated by reference as if fully set forth herein.
This invention was made with government support under Grant No. U54HG006493, awarded by the National Human Genome Research Institute and Grant Nos. DP1HG007811 and DP50D009145, awarded by the National Institutes of Health. The government has certain rights in the invention
Filing Document | Filing Date | Country | Kind |
---|---|---|---|
PCT/US2014/056701 | 9/20/2014 | WO | 00 |
Number | Date | Country | |
---|---|---|---|
61880286 | Sep 2013 | US |