Understanding the impact of genotypic (e.g., sequence) variants within functional elements in the genome—such as protein coding genes, non-coding genes, and regulatory elements—is critical to a diverse array of life sciences applications. Today, nearly half of all disease-associated genes harbor a higher number of uncharacterized variants in the general population than variants of known clinical significance. This poses significant challenges for both diagnostic and screening tests evaluating genetic and genomic sequences (Landrum et al. 2015; Lek et al. 2016). A high number of novel variants of unknown clinical significance is a feature of nearly all genes (e.g., for both germline and somatic variants in the population) and affects even the most frequently tested genes. For example, tests that evaluate gene-panels for cancer predisposing mutations report finding as many as 95 uncharacterized variants per known disease-causing variant (Maxwell et al. 2016). As such, predicting the phenotypic (e.g., cellular, organismal, clinical, or otherwise) consequences of genotypic variants is a hurdle to leveraging genetic and genomic information in a wide array of clinical settings.
Genotypic (e.g., sequence) variants within genomically-encoded functional elements can affect diverse biophysical processes, altering distinct molecular functions within each element, and resulting in varied clinical and non-clinical phenotypes. For example, in an established tumor suppressor protein coding gene, phosphatase and tensin homolog (PTEN), genotypic variants affecting transcription (e.g., −903G>A, −975G>C, and −1026C>A), protein stability (e.g., C136R), phosphatase catalytic activity (e.g., C124S, H93R), and substrate recognition (e.g., G129E) have all been associated with Cowden Syndrome (CS), which presents high risks of breast, thyroid, endometrial, kidney, and colorectal cancers and melanoma (Heikkinen et al. 2011; He et al. 2013; Myers et al. 1997; Myers et al. 1998). Variants affecting the same biophysical processes and molecular functions can lead to co-morbidities between distinct disorders, as exemplified by PTEN variants affecting phosphatase activity (e.g., H93R), which have been additionally implicated in autism spectrum disorder (ASD) (Johnston and Raines 2015), leading to frequent co-morbidities between ASD and cancers (Markkanen et al. 2016). Moreover, variants affecting distinct biophysical processes and molecular mechanisms within a functional element can present stereotypic, differentiated clinical and non-clinical phenotypes. Mutations in the lamin A/C gene (LMNA) cause a compendium of more than fifteen diseases collectively known as “laminopathies,” which include A-EDMD (autosomal Emery-Dreifuss muscular dystrophy), DCM (dilated cardiomyopathy), LGMD1B (limb-girdle muscular dystrophy 1B), L-CMD (LMNA-related congenital muscular dystrophy), FPLD2 (familial partial lipodystrophy 2), HGPS (Hutchinson-Gilford progeria syndrome), atypical WRN (Werner syndrome), MAD (mandibuloacral dysplasia) and CMT2B (Charcot-Marie-Tooth disorder type 2B) (Scharner et al. 2010).
In LMNA, genotypic (e.g., sequence) variants leading to HGPS create a cryptic splice donor site in the lamin A-specific exon 11 that results in a truncated form of lamin A, whereas variants leading to FPLD2 alter the surface charge of the Ig-like domain without changing the crystal structure of the mutant protein (Scharner et al. 2010). Thus, disentangling the complexity of genotype-phenotype relationships across a wide array of variant types, functional elements, molecular systems, and cellular effects is an outstanding challenge to robust, scalable interpretation of the phenotypic consequences of variants discovered in clinical and non-clinical genetic and genomic tests.
Indeed, assessment of the significance of genotypic (e.g., sequence) variants can be a complex and challenging task. As recently as 2015, a survey of variant classifications demonstrated that as many as 17% (e.g., 2,229/12,895) of variant classifications were inconsistent among classification submitters (Rehm et al. 2015). Between clinical testing laboratories, the concordance in interpretations has been measured to be as low as 34%, though specific recommendations can increase inter-laboratory concordance to 71% (Amendola et al. 2016).
With greater than 5,300 genes evaluated by genetic tests (e.g., according to the NCBI Genetic Test Registry) in the market, scalable solutions for interpreting (e.g., classifying) genotypic (e.g., sequence) variants in a broad array of genes, diseases, and contexts (e.g., clinical and non-clinical) are critical to the efforts in the precision medicine and life sciences industries. With greater than 14,000,000 possible (e.g., unique) molecular variants within the subset of molecular variants corresponding to single nucleotide variants (SNVs), within the subset of coding sequences, and within the subset of protein-coding genes in the clinical testing market, effective solutions for molecular variant classification need to be robust and scalable.
While multiple strategies exist for identifying the phenotypic impacts of molecular variants (including but not limited to family segregation, functional assays, and case-control studies), at present only computational variant impact predictors are able to provide supporting evidence at the required scale. Indeed, an analysis of clinical variant classifications from practitioners following the joint guidelines for clinical variant interpretation from the American College of Medical Genetics and Genomics (ACMG) and the Association for Molecular Pathology (AMP) demonstrates that ~50% of clinical variant classifications rely on the use of computational variant impact predictors. Yet, despite their wide use, benchmarking studies indicate that computational variant impact prediction algorithms (such as SIFT, PolyPhen (v2), GERP++, Condel, CADD, REVEL, and others) have demonstrably low performance, with accuracies (AUC) in the 0.52-0.75 range (Mahmood et al. 2017).
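To make the quoted AUC range concrete, the following is a minimal sketch of how such a figure can be computed, assuming that AUC here denotes the area under the receiver operating characteristic curve. The function name `auc` and the toy scores and labels are illustrative assumptions and are not taken from the cited benchmark.

```python
def auc(scores, labels):
    """Area under the ROC curve, computed as the fraction of
    (pathogenic, benign) pairs in which the pathogenic variant
    receives the higher score; ties count as half a win."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one variant of each class")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical predictor scores; 1 = pathogenic, 0 = benign.
toy_scores = [0.9, 0.8, 0.4, 0.6, 0.3, 0.7]
toy_labels = [1, 1, 1, 0, 0, 0]
toy_auc = auc(toy_scores, toy_labels)  # 7 of 9 pairs ranked correctly
```

Under this reading, a value of 0.5 corresponds to a predictor that ranks pathogenic and benign variants no better than chance, which is why values near 0.52 indicate little usable signal.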
Direct assays of molecular function may provide a basis for the accurate interpretation of the clinical and non-clinical impacts of genotypic (e.g., sequence) variants (Shendure and Fields 2016; Araya and Fowler 2011). To date, a diverse spectrum of assays has been devised to directly assess the impact of variants on a wide array of molecular functions. However, existing methods require a priori knowledge or assumptions of the mechanism of action of variants associated with the clinical (and non-clinical) phenotypes under investigation to define the molecular functions to assay (Shendure and Fields 2016). These methods are often limited to capturing the effects of, and informing on, only variants affecting the specific molecular functions assayed, imposing limitations on the types of variants, types of molecular functions, and types of functional elements and genes that can be assayed at large scale. Thus, while a phosphatase assay, for example, can nominate (e.g., rule-in) potential disease-associations for variants affecting catalytic activity of the PTEN tumor suppressor, such an assay may not be able to exclude (e.g., rule-out) potential disease-associations for variants affecting protein stability, as these variants may increase risk of developing disease without observable defects in catalytic activity. Conversely, while a protein stability assay, for example, can nominate (e.g., rule-in) potential disease-associations for variants leading to stability defects in the PTEN tumor suppressor, such an assay may not be able to exclude (e.g., rule-out) potential disease-associations for variants affecting catalytic activity. The potential need for a priori knowledge or assumptions of the mechanism of action (and hence relevant molecular functions to assay) may limit the application of these methods to well-characterized functional elements (e.g., genes) and phenotypes, which may prevent their application to poorly understood disease-associated genes.
Building on the technological foundations of high-throughput DNA sequencing platforms, recently developed large-scale functional assays—such as Deep Mutational Scanning (DMS), HITS-KIN, RNA-MAP, and others—have enabled comprehensive or near-comprehensive coverage of the possible sequence variants of distinct sequence classes, including single-nucleotide variants (SNVs) and non-synonymous variants (NSVs, missense variants) in coding, non-coding, and regulatory elements (Fowler et al. 2010; Araya et al. 2012; Guenther et al. 2013; Buenrostro et al. 2014; Kelsic et al. 2016; Patwardhan et al. 2009). Such methods may serve as the basis for robust, statistically-validated interpretation of the impact of molecular variants—such as genotypic (e.g., sequence) variants—on patient phenotypes (Starita et al. 2015; Majithia et al. 2016), including clinical phenotypes such as lipodystrophy and increased risk of type 2 diabetes (T2D) in patients with variants in PPARG, or increased risk of breast and ovarian cancers in patients with variants in BRCA1. While such methods may provide robust variant interpretation in clinical and non-clinical testing settings, these methods may require significant development and customization to assay each molecular function and each functional element. This may limit their utility as a generalizable, scalable solution to systematically assess the clinical and non-clinical consequences of molecular variants—such as genotypic (e.g., sequence) variants—across diverse types of variants, biophysical processes, molecular functions, functional elements, genes, and ultimately, pathways. Thus, there is a need for a multi-functional platform and methods for variant impact assessment.
The accompanying drawings are incorporated herein and form a part of the specification.
In the drawings, like reference numbers generally indicate identical or similar elements. Additionally, generally, the left-most digit(s) of a reference number identifies the drawing in which the reference number first appears.
Provided herein are system, apparatus, device, method and/or computer program product embodiments, and/or combinations and sub-combinations thereof, for enabling multi-functional, multi-element, and multi-gene (e.g., pathway-scale) assessment of the phenotypic impact of variants across a wide array of variant types, biophysical processes, molecular functions, and phenotypes.
The present disclosure provides system, apparatus, device, method and/or computer program product embodiments that can leverage high-throughput molecular measurements (e.g., next-generation sequencing), single-cell manipulation, molecular biology, computational modeling, and statistical learning techniques and can enable multi-functional, multi-element, and multi-gene (pathway-scale) assessment of the phenotypic impact of variants across a wide array of variant types, biophysical processes, molecular functions, and phenotypes.
The present disclosure provides system, apparatus, device, method and/or computer program product embodiments for systematically determining and statistically validating one or more phenotypic (e.g., clinical or non-clinical) impacts (e.g., pathogenicity, functionality, or relative effect) of molecular variants identified—such as genotypic (e.g., sequence) variants—in one or more (e.g., coding or non-coding) functional elements (e.g., protein-coding genes, non-coding genes, molecular domains such as protein or RNA domains, promoters, enhancers, silencers, regulatory binding sites, origins of replication, etc.) in the (e.g., nuclear, mitochondrial, etc.) genome(s), or their derivative molecules—within a biological sample or record thereof of a subject.
The present disclosure provides system, apparatus, device, method and/or computer program product embodiments for the classification (or regression) of likely phenotypic impacts in a subject on the basis of one or more molecular signals, phenotype signals, or population signals measured in in vivo or in vitro functional model systems. The derived regressions or classifications can be referred to as functional scores or functional classifications.
Embodiments herein represent a departure from existing computational or functional evidence support systems for molecular variant classification, as for example utilized in clinical genetic and genomic diagnostics.
First, while existing computational methods and systems for variant classification rely on a wide array of populational, evolutionary, physico-chemical, structural, and/or molecular annotations and properties for the classification of variants, existing computational methods and systems do not employ information pertaining to the impacts of molecular variants on cellular biology. As a consequence, such computational methods are unable to capture phenotypic impacts acting through variation in molecular properties within cells or variation in cellular populations and cellular heterogeneity.
Second, existing large-scale functional assays and solutions that are capable of assaying the activity of thousands of molecular variants provide activity measurements along a single dimension per molecular variant, and often require a priori knowledge or assumptions of the mechanism of action through which molecular variants exert phenotypic impacts.
Owing to these limitations, while conventional computational methods and systems for variant classification can access data across a multiplicity of annotations and parameters, these conventional approaches have demonstrably poor performance in classification (and regression) tasks for the phenotypic impact of molecular variants. Similarly, these conventional approaches require a priori knowledge or assumptions of the mechanism of action (and hence relevant molecular functions to assay), which limits their application to well-characterized functional elements (e.g., genes). This further precludes their application to poorly understood disease-associated genes. Finally, these conventional approaches require significant development and customization to assay each molecular function and each functional element.
In embodiments herein, a technological solution to overcome these technological problems involves data structures providing multi-dimensional characterization of cells and cellular populations harboring specific genotypes (e.g., molecular variants) in one or more functional elements (e.g., genes) and in one or more contexts (e.g., cell-types, drug treatments, genotypic backgrounds). Such data structures enable systems and methods for statistical learning to achieve improved accuracy in the classification tasks pertaining to the phenotypic impacts of genotypes (e.g., molecular variants or combinations thereof).
Embodiments herein enable robust, scalable, multi-dimensional classification of molecular variants (and combinations thereof) across a wide array of functional elements and phenotypes through the acquisition of hundreds to tens of thousands (~10^2 to 10^4) of molecular measurements per model system (e.g., cell), the construction of molecular profiles for tens to thousands (~10^1 to 10^3) of model systems per molecular variant, thousands (~10^3) of molecular variants per functional element (e.g., genes), and a single or a multiplicity of functional elements in parallel.
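As a rough worked example of the data volumes these orders of magnitude imply, the sketch below multiplies the quoted ranges together. The constant and function names are illustrative assumptions; only the orders of magnitude come from the text.

```python
# Orders of magnitude quoted in the text, as (lower, upper) bounds.
MEASUREMENTS_PER_MODEL_SYSTEM = (10**2, 10**4)  # e.g., per cell
MODEL_SYSTEMS_PER_VARIANT = (10**1, 10**3)      # e.g., cells per variant
VARIANTS_PER_ELEMENT = 10**3                    # e.g., variants per gene


def measurements_per_element(n_elements=1):
    """Return (lower, upper) bounds on the total number of molecular
    measurements collected for n_elements functional elements."""
    low = (MEASUREMENTS_PER_MODEL_SYSTEM[0]
           * MODEL_SYSTEMS_PER_VARIANT[0]
           * VARIANTS_PER_ELEMENT
           * n_elements)
    high = (MEASUREMENTS_PER_MODEL_SYSTEM[1]
            * MODEL_SYSTEMS_PER_VARIANT[1]
            * VARIANTS_PER_ELEMENT
            * n_elements)
    return low, high
```

For a single functional element this gives between roughly 10^6 and 10^10 individual measurements, which scales linearly with the number of functional elements assayed in parallel.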
As illustrated in
In some embodiments, the present disclosure describes the use of these series of methods and technologies of
In some embodiments, the present disclosure describes the determination of functional scores or functional classifications of molecular variants by applying statistical (e.g., machine) learning approaches that associate molecular signals, phenotype signals, or population signals with the phenotypic impacts of the molecular variants. In some embodiments, the present disclosure describes the determination of evidence scores or evidence classifications of the molecular variants based on functional scores, functional classifications, predictor scores, predictor classifications, hotspot scores, or hotspot classifications. In some embodiments, the present disclosure describes the determination of the phenotypic impacts of the molecular variants identified within biological samples on the basis of the functional scores, the functional classifications, the evidence scores, or the evidence classifications of the identified molecular variants.
Embodiments herein integrate methods, techniques, and technologies from a multiplicity of domains. While statistical, machine learning techniques leveraging single-cell molecular measurements have been developed and applied for the classification of model systems (e.g., cells) originating from tens (e.g., less than 10^2) of different tissues or developmental stages, the requirements for achieving accurate genotype-specific (e.g., molecular variant-specific) classifications among thousands of cells with subtle differences (such as a single nucleotide difference in a genomic background defined by greater than 3×10^9 nucleotides) within the same cell-lines, tissues, or developmental stages can present substantial challenges.
The present disclosure provides Deep Mutational Learning (DML) system, apparatus, device, method and/or computer program product embodiments, and/or combinations and sub-combinations thereof for overcoming challenges in the identification (e.g., classification) of the phenotypic impact of molecular variants identified in subjects on the basis of biological signals assayed in single and populations of model systems (e.g., cells).
The present disclosure provides system, apparatus, device, method and/or computer program product embodiments, and/or combinations and sub-combinations thereof that improve cost-efficiency in the classification of molecular variants through (i) the directed deployment of DML processes and systems with lower-cost prediction models (see
The present disclosure provides system, apparatus, device, method and/or computer program product embodiments, and/or combinations and sub-combinations thereof that improve the scalability and performance across functional elements (e.g., genes) through DML processes and systems that leverage information between functional elements (see
The present disclosure provides system, apparatus, device, method and/or computer program product embodiments, and/or combinations and sub-combinations thereof for assessing the phenotypic impacts (e.g., pathogenicity, functionality, or relative effect) of one or more molecular (e.g., genotypic) variants in one or more (e.g., coding or non-coding) functional elements (e.g., protein-coding genes, non-coding genes, molecular domains such as protein or RNA domains, promoters, enhancers, silencers, regulatory binding sites, origins of replication, etc.) in the (e.g., nuclear, mitochondrial, etc.) genome(s), or their derivative molecules. As would be appreciated by a person of ordinary skill in the art, a molecular variant may be a genotypic (e.g., sequence) variant such as a single-nucleotide variant (SNV), a copy-number variant (CNV), or an insertion or deletion affecting a coding or non-coding sequence (or both) in the nuclear, mitochondrial, or episomal genome, whether natural or synthetic. As would be appreciated by a person of ordinary skill in the art, a molecular variant may also be a single-amino acid substitution in a protein molecule, a single-nucleotide substitution in an RNA molecule, a single-nucleotide substitution in a DNA molecule, or any other molecular alteration to the cognate sequence of a polymeric biological molecule.
In some embodiments, the classification (or regression) may relate to (e.g., likely) disease-causing (e.g., pathogenic) and neutral (e.g., benign) variants for disorders with genetic components, or predictions of the severity thereof, on the basis of the molecular variants identified within a biological sample or record thereof of a subject. In some other embodiments, the classification (or regression) may relate to molecular impacts (e.g., loss-of-function, gain-of-function or neutral) on the basis of molecular variants of probable molecular consequence (e.g., nonsense or insertion and deletion mutations) and probable molecular neutrality (e.g., synonymous). In some other embodiments, the classification (or regression) may relate to variation in the response to therapeutic treatments (e.g., chemical, biochemical, physical, behavioral, digital, or otherwise) on the basis of molecular variants identified within a biological sample or record thereof of a subject. In some embodiments, phenotypic impacts may refer to phenotype classes (e.g., neutral, pathogenic, benign, high-risk, low-risk, positive response variants, negative response variants) and phenotype scores (e.g., a probability of developing specific clinical and non-clinical phenotypes, the levels of metabolites in blood, and the rate at which specific compounds are absorbed or metabolized).
In some embodiments, the present disclosure provides systems and methods for modeling the diversity and prevalence of phenotypic properties within a population on the basis of the diversity and prevalence of molecular variants in representative populations. In some embodiments, the present disclosure provides systems and methods for modeling the diversity and prevalence of phenotypic properties within a population on the basis of the phenotypic impacts of molecular variants—with known or expected diversity and prevalence—where the phenotypic impacts may be modeled from one or more molecular signals, phenotype signals, or population signals, previously associated with variants in an in vivo or in vitro functional model system. In some embodiments, such modeling may be used to inform on the diversity and prevalence of mechanisms of drug-resistance in a population.
In some embodiments, the present disclosure describes the use of models of the diversity and prevalence of phenotypic properties within a population of individuals (e.g., as informed by the phenotypic impacts of molecular variants modeled from one or more molecular signals, phenotype signals, or population signals in a functional model system) to construct cohorts of subjects (e.g., patients) and to investigate the efficacy of therapeutic and non-therapeutic interventions.
In some embodiments, the present disclosure provides systems and methods for the classification (or regression) of the phenotypic impact of molecular variants on the basis of functional scores or functional classifications derived from one or more molecular signals, phenotype signals, or population signals associated with variants as assayed in a functional model system. In some embodiments, molecular variants may be functionally modeled within cells, cellular compartments or synthetic compartments as in vivo or in vitro model systems.
In some embodiments, the molecular variants modeled (e.g., in vivo or in vitro) may be identified directly within the nucleic acid sequence of the functional elements modeled via library preparation, sequencing, and characterization of nucleic acids or nucleic acid fragments within single-cells, cellular compartments, subcellular compartments, or synthetic compartments (e.g., collectively termed model systems). In some other embodiments, the molecular variants modeled (e.g., in vivo or in vitro) may be inferred from barcode sequences associated with individual variants in the functional elements via library preparation, sequencing, and characterization of nucleic acids or nucleic acid fragments within model systems (e.g., single-cells, cellular compartments, subcellular compartments, or synthetic compartments), using a pre-assembled database of associated barcodes and variants. As would be appreciated by a person of ordinary skill in the art, molecular variants may be produced via a diversity of techniques, such as direct (e.g., chemical) synthesis, error-prone PCR, oligonucleotide-directed mutagenesis, nicking mutagenesis, or Saturation Genome Editing (SGE), among others (Firnberg et al. 2012; Kitzman et al. 2014; Wrenbeck et al. 2016; and Findlay et al. 2014). As would be appreciated by a person of ordinary skill in the art, variant libraries can then be introduced (e.g., added) into model systems (e.g., cells, cellular compartments, subcellular compartments, or synthetic compartments) using a variety of approaches, such as but not limited to homologous recombination (e.g., Cas9-mediated or Adenovirus-mediated), site-specific recombination (e.g., Flp-mediated), or viral transduction (e.g., lentiviral-mediated) (Findlay et al. 2018; Wissink et al. 2016; and Macosko et al. 2015).
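The barcode-based inference described above can be sketched as a simple lookup against a pre-assembled barcode-to-variant table. In this minimal illustration the barcode sequences, the `min_reads` threshold, and the tie-handling rule are all hypothetical assumptions; the variant labels reuse the PTEN examples given earlier only as placeholders.

```python
from collections import Counter

# Hypothetical pre-assembled database: barcode sequence -> variant.
BARCODE_TO_VARIANT = {
    "ACGTACGT": "C124S",
    "TTGACCAA": "G129E",
    "GGCATTCA": "H93R",
}


def assign_variants(cell_reads, min_reads=2):
    """Infer one variant per model system (e.g., cell).

    cell_reads maps a cell identifier to the list of barcode
    sequences observed in that cell. A cell is left unassigned
    (None) when no known barcode reaches min_reads or when the
    two best-supported variants are tied.
    """
    assignments = {}
    for cell_id, barcodes in cell_reads.items():
        counts = Counter(
            BARCODE_TO_VARIANT[b] for b in barcodes
            if b in BARCODE_TO_VARIANT  # ignore unknown barcodes
        )
        ranked = counts.most_common(2)
        if not ranked or ranked[0][1] < min_reads:
            assignments[cell_id] = None
        elif len(ranked) == 2 and ranked[1][1] == ranked[0][1]:
            assignments[cell_id] = None  # ambiguous, e.g., a doublet
        else:
            assignments[cell_id] = ranked[0][0]
    return assignments
```

A practical pipeline would likely also tolerate sequencing errors in the barcode (for example, by matching within a small edit distance), which this sketch omits for brevity.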
In some embodiments, functional scores and functional classifications associated with individual molecular variants may be derived from measurements of molecules and/or chemical modifications present within in vivo or in vitro model systems harboring the variant within the functional element, including but not limited to DNA, RNA, and protein molecules or modifications thereof. For example, in some embodiments, measurements or models of molecular signals, cellular signals, or population signals may be made and used to learn the functional scores and/or functional classifications. In some embodiments, the functional scores and functional classifications may be derived from molecular measurements obtained via nucleic acid barcoding, isolation, enrichment, library preparation, sequencing, and characterization of a plurality of nucleic acids or nucleic acid fragments within single-cells, cellular compartments, subcellular compartments, or synthetic compartments including, but not limited to, RNA molecules, genomic DNA, chromatin-associated DNA, protein-associated DNA, accessible DNA fragments, or chemically-modified nucleic acids. In some embodiments, these procedures may utilize molecular barcoding techniques to uniquely identify or associate nucleic acids, nucleic acid fragments, or nucleic acid sequences stemming from individual single-cells, cellular compartments, subcellular compartments, or synthetic compartments (Macosko et al. 2015; Buenrostro et al. 2015; Cusanovich et al. 2015; Dixit et al. 2016; Adamson et al. 2016; Jaitin et al. 2016; Datlinger et al. 2017; Zheng et al. 2017; Cao et al. 2017). These methods may build on developments from the field of single-cell genomics (Schwartzman and Tanay 2015; Tanay and Regev 2017; Gawad et al. 2016).
In some embodiments, the systems and methods of the present disclosure may apply methods for single-cell RNA sequencing to derive molecular measurements from single-cells, cellular compartments, subcellular compartments, or synthetic compartments. These methods include but are not limited to single-cell sequencing library generation, high-throughput nucleic acid sequencing, sequencing read quality control, barcode identification (e.g., of single-cell, cellular compartment, subcellular compartment, or synthetic compartment) and quality control, sequencing read unique molecular barcode identification and quality control, sequencing read alignments, as well as read alignment filtering and quality control. In some embodiments, molecular measurements may correspond to locus-specific measurements of gene expression (e.g., RNA transcript abundance), protein abundance or modifications (e.g., phospho-protein abundance), chromatin accessibility (e.g., nucleosome occupancy), epigenetic modification (e.g., DNA methylation), regulatory activity (e.g., transcription factor binding), post-transcriptional processing (e.g., splicing), post-translational modification (e.g., ubiquitination), mutation burden (e.g., count), mutation rate (e.g., frequency), mutation signatures (e.g., count or frequency per type of mutation), or various other types of measurements of molecules within single-cells, cellular compartments, subcellular compartments, or synthetic compartments as would be appreciated by a person of ordinary skill in the art. In some embodiments, the present disclosure describes systems and methods for augmenting the quality of the molecular measurements for specific target genes and functional elements via the use of targeted enrichment or targeted capture techniques (via hybridization- or amplicon-based techniques and probes) either before, during, or after single-cell RNA library processing.
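One step in the single-cell RNA sequencing workflow listed above, turning aligned reads into locus-specific expression measurements, can be sketched by counting distinct unique molecular barcodes per cell and gene. This is a minimal illustration under the assumption that reads have already been aligned, filtered, and tagged with a cell barcode and a molecular barcode; the function name and tuple layout are hypothetical, and barcode error correction is deliberately omitted.

```python
from collections import defaultdict


def umi_counts(aligned_reads):
    """Collapse reads into molecule counts.

    aligned_reads is an iterable of (cell_barcode, gene, umi)
    tuples, one per read that passed quality control. Reads that
    share all three fields are treated as PCR duplicates of one
    molecule. Returns {cell_barcode: {gene: n_distinct_umis}}.
    """
    seen = defaultdict(set)
    for cell, gene, umi in aligned_reads:
        seen[(cell, gene)].add(umi)

    counts = {}
    for (cell, gene), umis in seen.items():
        counts.setdefault(cell, {})[gene] = len(umis)
    return counts
```

The resulting per-cell, per-gene counts are one example of the locus-specific measurements of gene expression (e.g., RNA transcript abundance) referred to in the text.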
In some embodiments, molecular measurements from single-cells, cellular (or subcellular) compartments or synthetic compartments may be utilized to derive multi-locus measurements of molecular processes. For example, these measurements of molecular processes may include multi-locus measurements of gene expression, chromatin accessibility, epigenetic modification, regulatory activity, transcriptional activity, translational activity, signaling activity, pathway activity, mutation burden, mutation rate, mutation signatures, and various other measurements as would be appreciated by a person of ordinary skill in the art.
In some embodiments, molecular measurements and molecular processes from single-cells, cellular (or subcellular) compartments or synthetic compartments may be utilized to derive global (e.g., pan-locus or locus-independent) measurements of molecular features. For example, these measurements of molecular features may include global measurements of gene expression, chromatin accessibility, epigenetic modification, regulatory activity, transcriptional activity, translational activity, signaling activity, pathway activity, mutation burden, mutation rate, mutation signatures, and various other measurements as would be appreciated by a person of ordinary skill in the art.
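The relationship between locus-specific, multi-locus, and global measurements can be illustrated with a small sketch operating on one cell's expression counts. The aggregation rules used here (a mean over a gene set, and total counts plus number of detected genes) are assumptions chosen for simplicity, and the gene identifiers are placeholders.

```python
def multi_locus_score(cell_expr, gene_set):
    """Multi-locus measurement: mean expression over the members
    of gene_set that were measured in this cell (0.0 if none)."""
    values = [cell_expr[g] for g in gene_set if g in cell_expr]
    return sum(values) / len(values) if values else 0.0


def global_features(cell_expr):
    """Global (locus-independent) measurements for one cell."""
    return {
        "total_counts": sum(cell_expr.values()),
        "genes_detected": sum(1 for v in cell_expr.values() if v > 0),
    }


# Locus-specific measurements for one hypothetical cell.
toy_cell = {"GENE_A": 4, "GENE_B": 0, "GENE_C": 2, "GENE_D": 6}
# Hypothetical pathway; GENE_X was not measured and is skipped.
toy_pathway = ["GENE_A", "GENE_C", "GENE_X"]

toy_pathway_score = multi_locus_score(toy_cell, toy_pathway)
toy_global = global_features(toy_cell)
```

Other aggregation rules (sums, medians, or normalized scores) would fit the same pattern; the choice depends on the molecular process being summarized.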
In some embodiments, molecular measurements, molecular processes, or molecular features of single-cells, cellular compartments, subcellular compartments, or synthetic compartments may serve directly as (e.g., lower-order) molecular scores. In some embodiments, a (e.g., higher-order) molecular score may be derived by applying pre-existing models that associate multiple lower-order molecular scores (e.g., molecular measurements, molecular processes, or molecular features) to regulatory, signaling, pathway, processing, cell-cycle activities, alterations, defects, or states. In some embodiments, such methods may apply gene set enrichment analysis or other derivative methods as would be appreciated by a person of ordinary skill in the art. In some embodiments, as illustrated in
In some embodiments, as illustrated in
As would be appreciated by a person of ordinary skill in the art, the molecular measurements, processes, features, and scores from model systems (e.g., single-cells, cellular compartments, subcellular compartments, or synthetic compartments) may define or correspond to distinct molecular states or specific subpopulations of model systems (e.g., single-cells, cellular compartments, subcellular compartments or synthetic compartments) with similar molecular properties. As would be appreciated by a person of ordinary skill in the art and as shown in
For example, the molecular states of model systems can be identified on the basis of cell-cycle signatures derived from gene-expression molecular scores (Macosko et al. 2015). As would be appreciated by a person of ordinary skill in the art, molecular states can be derived via scoring using previously-derived models—for example, scoring gene-expression signatures of previously characterized molecular states such as gene-expression signatures reflecting distinct phases of the cell-cycle previously characterized in chemically synchronized cells (Whitfield et al. 2002). As would be appreciated by a person of ordinary skill in the art, molecular states may also be derived via scoring using internally-derived models from partitions of model systems within which characteristic correlations between molecular signals can be detected or expected (e.g., as is the case with gene expression variation throughout distinct stages of cell-cycle). As would be appreciated by a person of ordinary skill in the art, the internally-derived models may be generated using a variety of statistical techniques (e.g., machine learning techniques).
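Scoring with previously-derived gene-expression signatures, as described above for cell-cycle phases, can be sketched as follows. This is a simplified illustration and not the procedure of the cited studies: the two short marker lists are commonly used S-phase and G2/M-phase genes chosen here only as examples, the score is an assumed "signature mean minus cell-wide mean," and the fallback to G1 when no signature is enriched is likewise an assumption.

```python
# Illustrative signatures (assumed for this sketch only).
SIGNATURES = {
    "S": ["PCNA", "MCM5"],
    "G2M": ["CCNB1", "CDK1"],
}


def phase_scores(cell_expr):
    """Score each signature as its mean expression minus the mean
    expression across all genes measured in the cell."""
    if not cell_expr:
        raise ValueError("cell has no measurements")
    baseline = sum(cell_expr.values()) / len(cell_expr)
    scores = {}
    for phase, genes in SIGNATURES.items():
        values = [cell_expr.get(g, 0.0) for g in genes]
        scores[phase] = sum(values) / len(values) - baseline
    return scores


def assign_state(cell_expr):
    """Assign the best-scoring phase, or G1 if none is enriched."""
    scores = phase_scores(cell_expr)
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else "G1"
```

The assigned state can then be used as a categorical feature alongside the molecular scores, or to partition model systems into subpopulations with similar molecular properties.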
In some embodiments, as illustrated in
In some embodiments, a Phenotype Model (mP) and database of phenotype scores (or phenotype classifications) is generated by accessing a database of features describing (e.g., lower- and higher-order) molecular scores and molecular states 704 of single-cells 702, and input labels 708 (e.g., a database) describing the phenotypic impact 706 of molecular variants identified within single-cells 702. In some embodiments, a training/validation layer 710 generates and quality-controls Phenotype Models (mP) that can predict the phenotypic impact 706 of individual single-cells 702. In some embodiments, a database of features describing the molecular scores and molecular states 716 of single-cells (testing) 714 is provided to the generated Phenotype Models (mP) to calculate and create a database of phenotype scores 720 describing the predicted phenotypic impact 718 of molecular variants in single-cells (testing) 714. As would be appreciated by a person of ordinary skill in the art, the performance (e.g., accuracy) of the predicted phenotypic impacts 718 in each cell (e.g., phenotype scores 720) can be determined against the known phenotypic impact of molecular variants in single-cells (testing) 714 within a testing layer 712. As would be appreciated by a person of ordinary skill in the art, the Phenotype Models (mP) can be applied to pre-compute or compute, on demand, the phenotype scores of single cells not included in training, validation, or testing. In some embodiments, such scoring and evaluation can occur in a phenotype scoring and classification layer 722. Phenotype scoring and classification layer 722 can examine the phenotype impact classification accuracy permitted on the basis of phenotype scores 720.
In some embodiments, summary statistics relating to the tendency, dispersion, shape, range, or error of phenotype scores may be used to create a database of (e.g., quality-controlled) phenotype signals associated with individual molecular variants.
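As a minimal sketch of this aggregation, per-cell phenotype scores can be grouped by molecular variant and reduced to summary statistics. The function name, the `min_cells` quality-control threshold, and the example scores are assumptions introduced for illustration only.

```python
from collections import defaultdict
from statistics import mean, median, stdev

def phenotype_signals(cell_scores, min_cells=3):
    """Summarize per-cell phenotype scores into per-variant phenotype signals.

    cell_scores: iterable of (variant, phenotype_score) pairs, one per cell.
    Variants observed in fewer than min_cells cells are dropped (a simple,
    assumed quality-control rule).
    """
    by_variant = defaultdict(list)
    for variant, score in cell_scores:
        by_variant[variant].append(score)

    signals = {}
    for variant, scores in by_variant.items():
        if len(scores) < min_cells:
            continue
        sd = stdev(scores)
        signals[variant] = {
            "n": len(scores),
            "mean": mean(scores),                 # tendency
            "median": median(scores),             # tendency
            "sd": sd,                             # dispersion
            "range": max(scores) - min(scores),   # range
            "sem": sd / len(scores) ** 0.5,       # error
        }
    return signals

# Hypothetical per-cell phenotype scores.
cell_scores = [
    ("G12V", 0.9), ("G12V", 0.8), ("G12V", 1.0),
    ("WT", 0.1), ("WT", 0.2), ("WT", 0.3),
    ("E76K", 0.7),  # only one cell: removed by the quality-control rule
]
signals = phenotype_signals(cell_scores)
```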
In some embodiments, and as illustrated in
In some embodiments, the molecular, phenotype signals matrix 1012 may be pre-computed or computed on demand. In some embodiments, the molecular, phenotype signals matrix 1012 may be pre-computed or computed on demand by a molecular state, variant-specific scoring layer 1016 yielding matrices that are molecular state-specific. In some embodiments, the molecular, phenotype signals matrix 1012 may be pre-computed or computed on demand by a multi-state, variant-specific scoring layer 1014, yielding matrices that contain data from multiple molecular states.
In some embodiments, as illustrated in
In some embodiments, subsampling of molecular measurements, molecular processes, molecular features, molecular scores, or phenotype scores from model systems (e.g., single-cells, cellular compartments, subcellular compartments, or synthetic compartments) harboring the same molecular variant may be applied to generate independent or disjoint estimates of summary statistics relating to the tendency, dispersion, shape, probability, range, covariation, or error of molecular measurements, molecular processes, molecular features, or molecular scores or phenotype scores associated with individual molecular variants.
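One simple way to obtain disjoint estimates, sketched here under stated assumptions, is to shuffle the per-cell scores of a single molecular variant and split them into non-overlapping subsets, computing a summary statistic on each. The function name, the fixed random seed, and the example scores are hypothetical.

```python
import random
from statistics import mean

def disjoint_subsets(scores, n_splits=2, seed=0):
    """Shuffle one variant's per-cell scores and split them into disjoint subsets."""
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    shuffled = list(scores)
    rng.shuffle(shuffled)
    # Striding guarantees that no cell appears in more than one subset.
    return [shuffled[i::n_splits] for i in range(n_splits)]

# Hypothetical molecular scores from ten cells harboring the same variant.
scores = [0.1 * i for i in range(10)]
subsets = disjoint_subsets(scores, n_splits=2)
estimates = [mean(s) for s in subsets]  # one independent estimate per subset
```

Because the subsets share no cells, agreement between the entries of `estimates` gives a rough indication of how stable the signal is for that variant.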
In some embodiments, independent or disjoint estimates of summary statistics relating to the tendency, dispersion, shape, probability, range, covariation, or error of molecular measurements, molecular processes, molecular features, molecular scores or phenotype scores may be used to create a database of (quality-controlled) independent or disjoint estimates of molecular signals or phenotype signals associated with individual molecular variants. As would be appreciated by a person of ordinary skill in the art, independent or disjoint estimates of molecular signals or phenotype signals can be used to create a database of (quality-controlled) molecular or phenotype signals associated with individual molecular variants.
In some embodiments, the present disclosure describes systems and methods for deriving independent or disjoint estimates of summary statistics relating to the tendency, dispersion, shape, probability, range, covariation, or error of molecular measurements, molecular processes, molecular features, or molecular scores or phenotype scores associated with individual molecular variants within subpopulations of model systems (e.g., single-cells, cellular compartments, subcellular compartments, or synthetic compartments) from specific molecular states. As would be appreciated by a person of ordinary skill in the art, these methods may leverage a plurality of statistical techniques (e.g., machine learning techniques).
In some embodiments, molecular state-specific independent or disjoint estimates of summary statistics relating to the tendency, dispersion, shape, probability, range, covariation, or error of molecular measurements, molecular processes, molecular features, molecular scores or phenotype scores may be used to create a database of (e.g., quality-controlled) molecular state-specific, independent and disjoint estimates of molecular signals and phenotype signals associated with individual molecular variants in specific molecular states.
In some embodiments, independent or disjoint estimates of summary statistics relating to the tendency, dispersion, shape, probability, range, covariation, or error of population signals associated with individual molecular variants may be used to create a database of (e.g., quality-controlled) population signals associated with individual molecular variants.
In some embodiments, as illustrated in
In some embodiments, the present disclosure describes the use of molecular state-specific, lower-order molecular signals or phenotype signals for the derivation of molecular state-specific higher-order molecular signals or phenotype signals. In some embodiments, the present disclosure describes the use of multi-state matrices of lower-order molecular, phenotype, or population signals to derive multi-state higher-order molecular, phenotype, or population signals, leveraging structured relationships between molecular signals across molecular states, such as structured gene expression patterns (e.g., molecular signals) across cell-cycle stages (e.g., molecular states). In some embodiments, the present disclosure describes the use of Convolutional Neural Networks (CNNs) to learn patterned-associations in molecular, phenotype, or population signals (and annotation features) across molecular states.
In some embodiments, and as illustrated in
In some embodiments, a Functional Model (mF) and a database of functional scores (or functional classifications) is generated by accessing a database of features describing molecular (e.g., lower-order or higher-order), phenotype, or population signals 1304 of molecular variants 1302 for training/validation, and a set of input labels 1310 (e.g., a database) describing the phenotypic impacts 1308 of molecular variants 1302. The generating is further performed by applying statistical (e.g., machine) learning techniques to associate molecular, phenotype, or population signals 1304 (e.g., features) to phenotypic impacts (e.g., labels).
In some embodiments, a training/validation layer 1312 performs training and validation to generate and quality-control Functional Models (mF) that can predict the phenotypic impacts 1308 of molecular variants 1302. In some embodiments, training/validation layer 1312 can deploy cross-validation techniques, such as, but not limited to, K-fold or Leave-One-Out Cross-Validation (LOOCV). In some embodiments, a database of features describing the molecular, phenotype, or population signals 1318 of molecular variants (testing) 1316 can be provided to the generated Functional Models (mF) to calculate and create a database of functional scores 1324 describing the predicted phenotypic impact 1322 of molecular variants (testing) 1316. As would be appreciated by a person of ordinary skill in the art, the performance (e.g., accuracy) of the predicted phenotypic impacts 1322 (e.g., functional score 1324) of molecular variants can be determined against known phenotypic impacts of molecular variants, such as testing molecular variants 1316. As would be appreciated by a person of ordinary skill in the art, the Functional Models (mF) can be applied to pre-compute, or compute on demand, the functional scores of molecular variants not included in training, validation, or testing phases within a testing layer 1314. In some embodiments, such scoring and evaluation can occur in a functional scoring and classification layer 1326 to, for example, examine the phenotype impact classification accuracy permitted on the basis of functional scores 1324.
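For illustration only, K-fold cross-validation (and LOOCV, when K equals the number of variants) can be sketched with a deliberately simple nearest-centroid classifier standing in for a Functional Model. The classifier choice, the function names, and the feature values are assumptions; any of the learning techniques named in this disclosure could be substituted. The sketch assumes K does not exceed the number of variants and that every class appears in each training split.

```python
from statistics import mean

def k_fold_indices(n, k):
    """Yield (train_idx, test_idx) pairs for k contiguous folds over n items."""
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = [i for i in range(n) if i < start or i >= start + size]
        yield train, test
        start += size

def fit_centroids(X, y):
    """Stand-in model: one mean feature vector (centroid) per label."""
    centroids = {}
    for label in set(y):
        rows = [x for x, lab in zip(X, y) if lab == label]
        centroids[label] = [mean(col) for col in zip(*rows)]
    return centroids

def predict(centroids, x):
    """Predict the label whose centroid is closest in squared Euclidean distance."""
    def sq_dist(c):
        return sum((a - b) ** 2 for a, b in zip(x, c))
    return min(centroids, key=lambda label: sq_dist(centroids[label]))

def cross_validate(X, y, k):
    """Return the held-out accuracy of each of the k folds."""
    accuracies = []
    for train, test in k_fold_indices(len(X), k):
        model = fit_centroids([X[i] for i in train], [y[i] for i in train])
        hits = sum(predict(model, X[i]) == y[i] for i in test)
        accuracies.append(hits / len(test))
    return accuracies

# Hypothetical molecular signals (features) and phenotypic impacts (labels).
X = [[0.1, 0.2], [0.9, 1.0], [0.2, 0.1], [1.0, 0.8], [0.0, 0.3], [0.8, 0.9]]
y = ["benign", "pathogenic"] * 3
fold_accuracies = cross_validate(X, y, k=3)        # 3-fold
loocv_accuracies = cross_validate(X, y, k=len(X))  # leave-one-out
```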
In some embodiments, additional annotation features 1306, 1320 may be provided during training and testing (prediction generation) of Functional Models (mF). In some embodiments, the annotation features 1306 and 1320 may encompass a plurality of independent (e.g., non-assayed) features (e.g., evolutionary, population, functional (e.g., annotation-based), structural, dynamical, and physicochemical features associated with variants, genomic coordinates, transcript (e.g., RNA) coordinates, translated (e.g., protein) coordinates, amino acids, and various others as would be appreciated by a person of ordinary skill in the art), describing changes associated with the changes in genotype (e.g., sequence, molecular variants).
As would be appreciated by a person of ordinary skill in the art, a diverse array of sources for phenotypic impacts (e.g., labels) of molecular variants can be used to define Truth Sets, including (e.g., public and or private) clinical and non-clinical variant databases (e.g., ClinVar, HumVar, VariBench, SwissVar, PhenCode, PharmGKB, or locus-specific databases), and outcome databases.
In some other embodiments, the present disclosure provides systems and methods for deriving functional scores and functional classifications via statistical (e.g., machine) learning to generate a Functional Model (mF) that associates molecular, phenotype, or population signals (e.g., features)—derived from one or more molecular measurements, molecular processes, molecular features, and/or molecular scores—with phenotypic impacts (e.g., labels) of molecular variants computed directly from distinct molecular, phenotype, or population signals, via regression and classification techniques. In some embodiments, this approach may permit, for example, deriving functional scores and functional classifications that predict the relative mutation burden, mutation rate, or mutation signatures of samples from subjects harboring specific molecular variants. In some embodiments, functional scores or functional classifications from such assays may permit informing on the lifetime risk of developing cancer in test subjects.
As would be appreciated by a person of ordinary skill in the art, regression and classification to generate Functional Models (mFs) may rely on various statistical (e.g., machine) learning techniques for semi-supervised or supervised learning, including, but not limited to, Random Forests (RFs), Gradient Boosted Trees (GBTs), Zero Rules (ZRs), Naive Bayesian (NBs), Simple Logistic Regression (LRs), Support Vector Machines (SVMs), k-Nearest Neighbors (kNNs), and approaches deploying a wide array of Artificial Neural Network (ANN) architectures and techniques. In some embodiments, the present disclosure describes the use of molecular state-specific molecular signals for the derivation of molecular state-specific functional scores or functional classifications. In some other embodiments, the present disclosure describes the use of multi-state matrices of molecular signals for the derivation of molecular state-aware functional scores or functional classifications. In some embodiments, the present disclosure describes the use of Convolutional Neural Networks (CNNs) to learn patterned associations between functional scores or functional classifications and molecular signals distributed across molecular states.
Embodiments can use wildtype, somatic, and germline molecular variants of key RAS/MAPK pathway constituents, such as HRAS (e.g., G12V), PTPN11 (e.g., E76K and N308D), and MAP2K2 (e.g., F57C and P128Q), that are constructed and overexpressed in HEK293 cells. Embodiments can select cells with 1 mg/ml puromycin to ensure expression of the exogenously introduced functional elements (e.g., genes), and RAS/MAPK pathway activation can be verified using enzyme-linked immunosorbent assays (ELISAs) for phospho-ERK protein and total ERK protein abundances (see
As illustrated in
According to some embodiments,
To evaluate the performance of DML processes and systems as a scalable solution for the accurate identification of disease-associated (e.g., pathogenic) molecular variants across multiple genes and disorders, a uniform, distributed DML processing pipeline can be deployed for the pre-processing, scaling, normalization, dimensionality reduction, and computation of molecular and population signals on, for example, three genes of the RAS/MAPK pathway: HRAS, PTPN11, and MAP2K2. Applying a similar training/testing schema for the evaluation of classification accuracies as above, the DML processes can achieve (e.g., median) raw classification accuracies 202 of ~99.9% and ~100% in the analysis of somatic cancer-driving molecular variants in HRAS (e.g., G12V) and PTPN11 (e.g., E76K), respectively, and (e.g., median) raw classification accuracies 204 of ~98.5% and ~96.1% in the analysis of molecular variants from germline (e.g., inherited) disorders in PTPN11 (e.g., N308D) and MAP2K2 (e.g., F57C, P128Q), respectively, as demonstrated in
In some embodiments, the present disclosure provides systems and methods for the derivation of model system-level (e.g., cell-level) phenotypic scores through application of statistical machine learning models to associate lower-order and higher-order molecular scores with the known phenotypic impacts of variants harbored within model systems (e.g., cells).
In
In some embodiments, the present disclosure describes the learning and prediction of the phenotypic consequences of molecular variants on the basis of molecular, phenotype, or population signals assayed in multiple genes or molecular elements within the same, related, or interacting pathways. As shown in
In some embodiments, the present disclosure provides systems and methods for deriving functional scores and functional classifications for individual functional elements (e.g., individual genes). In some embodiments, the present disclosure provides methods for deriving functional scores and functional classifications across a multitude of functional elements leveraging concordant molecular signals across molecular variants within a plurality of functional elements. In some embodiments, the present disclosure describes systems and methods combining the use of mutagenesis, molecular barcoding, molecular cloning, and cellular pooling techniques to generate populations of cells in which molecular variants in distinct functional elements are uniquely created, barcoded, or both.
In some embodiments, independent or disjoint estimates of molecular, phenotype, or population signals (e.g., features) may be used to derive independent or disjoint functional scores and functional classifications via statistical (e.g., machine) learning to associate molecular signals (e.g., features) with phenotypic impacts (e.g., labels) of molecular variants via regression and classification techniques, respectively.
In some embodiments, feature weights from statistical (e.g., machine) learning models generated using independent or disjoint estimates of each molecular, phenotype, or population signal are computed, collected and utilized for robust feature selection using techniques as would be appreciated by a person of ordinary skill in the art. In some embodiments, the present disclosure provides methods for deriving functional scores and functional classifications via statistical (e.g., machine) learning to associate the identified robust molecular, phenotype, or population signals (e.g., robust features) with phenotypic impacts (e.g., labels) of molecular variants via regression and classification techniques, respectively.
In some embodiments, the present disclosure describes systems and methods for deriving functional scores and functional classifications from a plurality of statistical (e.g., machine) learning models generated using independent or disjoint estimates of molecular signals, applying either model selection or model combination (e.g., mixing) techniques (Pan et al. 2006).
In some embodiments applying model selection techniques, a model selection criterion measuring the predictive performance of a model or the probability of it being the true model may be used to compare the models and selection can be applied to maximize an estimate of the selection criterion. As would be appreciated by a person of ordinary skill in the art, a diversity of model selection criteria can be applied, including (but not limited to) the Akaike Information Criterion (AIC), Bayesian Information Criterion (BIC), Cross-Validation (CV), Bootstrap (Efron 1983; Efron 1986; Efron and Tibshirani 1997), or adaptive model selection criteria (George and Foster 2000; Shen and Ye 2002; Shen et al. 2004) computed on the training data or input test data, as exemplified by test input-dependent weights (IDWs). The IDW for a candidate model may be defined as the probability of the model giving a correct prediction for a given input or a reasonable measure to quantify the predictive performance of the model for the input test data (Pan et al. 2006).
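As a worked illustration of criterion-based model selection, the AIC and BIC can be computed from each candidate model's maximized log-likelihood and parameter count, with the lowest value selected. The formulas are the standard definitions (AIC = 2k − 2 ln L; BIC = k ln n − 2 ln L); the candidate names, log-likelihoods, and sample size are hypothetical.

```python
import math

def aic(log_likelihood, n_params):
    """Akaike Information Criterion: 2k - 2 ln L."""
    return 2 * n_params - 2 * log_likelihood

def bic(log_likelihood, n_params, n_obs):
    """Bayesian Information Criterion: k ln n - 2 ln L."""
    return n_params * math.log(n_obs) - 2 * log_likelihood

def select_model(candidates, n_obs, criterion="aic"):
    """Return the name of the candidate with the lowest criterion value.

    candidates: {name: (log_likelihood, n_params)}
    """
    def score(item):
        log_likelihood, n_params = item[1]
        if criterion == "aic":
            return aic(log_likelihood, n_params)
        return bic(log_likelihood, n_params, n_obs)
    return min(candidates.items(), key=score)[0]

# Hypothetical candidates fitted on 200 training variants.
candidates = {"model_a": (-120.0, 3), "model_b": (-115.0, 6)}
best_by_aic = select_model(candidates, n_obs=200, criterion="aic")
best_by_bic = select_model(candidates, n_obs=200, criterion="bic")
```

With these placeholder values the two criteria disagree: the AIC prefers the better-fitting six-parameter model, while the BIC, which penalizes parameters more heavily at this sample size, prefers the three-parameter model.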
In some other embodiments applying model combination techniques, a combined model can be generated by applying ensemble methods, by taking an equally or unequally weighted average of the outputs from individual models (Ripley 2008; Hastie et al. 2001). For example, ensemble methods can include but are not limited to Bayesian model averaging, stacking, bagging, random forests, boosting, ARM, and using performance metrics (e.g., AIC and BIC) as weights computed on training data (Burnham and Anderson 2003; Hastie et al. 2001) or computed on input test data (Pan et al. 2006). In some other embodiments applying model combination techniques, a combined model can be generated applying an Artificial Neural Network (ANN) architecture. In some embodiments, the present disclosure describes systems and methods for deriving functional scores and functional classifications from a plurality of statistical (e.g., machine) learning models generated using independent or disjoint estimates of molecular signals that involve applying various noise-control techniques (e.g., a Bootstrap Ensemble with Noise Algorithm (Yuval Raviv 1996)).
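A minimal sketch of model combination by unequally weighted averaging follows, using Akaike weights computed on training data as one example of performance-metric weights. The AIC values and per-model predictions are hypothetical, and other weighting schemes named above (e.g., stacking or input-dependent weights) would follow the same pattern with different weights.

```python
import math

def akaike_weights(aic_values):
    """Convert AIC values into normalized weights (lower AIC, higher weight)."""
    best = min(aic_values)
    raw = [math.exp(-0.5 * (a - best)) for a in aic_values]
    total = sum(raw)
    return [r / total for r in raw]

def combine_predictions(predictions, weights):
    """Weighted average of the per-model predicted scores for one variant."""
    return sum(p * w for p, w in zip(predictions, weights)) / sum(weights)

# Hypothetical AIC values for three models and their scores for one variant.
aics = [246.0, 242.0, 250.0]
weights = akaike_weights(aics)
combined_score = combine_predictions([0.2, 0.4, 0.6], weights)
equal_weight_score = combine_predictions([0.2, 0.4, 0.6], [1, 1, 1])
```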
In some embodiments, the present disclosure describes systems and methods for estimating functional scores and functional classifications for molecular variants applying statistical (e.g., machine) learning techniques to generate an Inference Model (mI) that models the relationship between (e.g., assay end-points) functional scores or functional classifications and a plurality of dependent (e.g., assayed) features (e.g., molecular, phenotype, or population signals) or independent (e.g., non-assay) features (e.g., evolutionary, population, functional (e.g., annotation-based), structural, dynamical, and physicochemical features associated with variants, genomic coordinates, transcript (e.g., RNA) coordinates, translated (e.g., protein) coordinates, amino acids, and various others as would be appreciated by a person of ordinary skill in the art). As would be appreciated by a person of ordinary skill in the art, such Inference Model (mI) may permit estimating functional scores and functional classifications for molecular variants with or without the explicit use of molecular, phenotype, or population signals, molecular measurements, molecular processes, molecular features, or molecular scores. In some embodiments, such methods may permit inferring sequence-function maps describing functional scores and functional classifications for molecular variants beyond those for which the functional scores and functional classifications were directly assayed. In some embodiments, as illustrated in
In some embodiments, and as illustrated in
In some embodiments, and as illustrated in
In some embodiments, the present disclosure describes the use of statistical (e.g., machine) learning to generate a Functional Model (mF) 1607 that associates molecular, phenotype, or population signals and functional scores and functional classifications as learned from molecular variants in the Truth Set (e.g., from genotypes 1606) to predict the functional scores and functional classifications of molecular variants in the Target Set (e.g., from genotypes 1606), thereby yielding a sequence-function map of functional scores 1608.
In some embodiments, as illustrated in
In some embodiments, the phenotypic impacts of known molecular variants, high-confidence predicted molecular variants, and functionally-modeled molecular variants can be leveraged by an Inference Model (mI) 1609 that models the relationship between phenotypic impacts and a plurality of dependent (e.g., assayed) features (e.g., molecular, phenotype, or population signals) or independent (e.g., non-assay) features (e.g., evolutionary, population, functional (e.g., annotation-based), structural, dynamical, and physicochemical features associated with variants, genomic coordinates, transcript (e.g., RNA) coordinates, translated (e.g., protein) coordinates, amino acids, and various others, as would be appreciated by a person of ordinary skill in the art) to yield an augmented sequence-function map of functional scores 1610. As would be appreciated by a person of ordinary skill in the art, such Inference Model (mI) 1609 may permit estimating the phenotypic impacts of molecular variants with or without the explicit use of molecular, phenotype, or population signals.
In some embodiments, the present disclosure describes systems and methods for the optimization of cost-efficiency of molecular variant classification through the staged deployment of Deep Mutational Learning (DML) processes and systems on Truth and Target (Query) Sets of molecular variants. Some embodiments include a Stage I Optimization 610 step as illustrated in, for example,
In some embodiments, the present disclosure describes systems and methods for determining the phenotypic impact (e.g., pathogenicity, functionality, or relative effect) of molecular variants identified within a biological sample or record of a subject on the basis of the functional scores and functional classifications determined as described above. In some embodiments, time-stamped records of incorporation of functional scores and functional classifications for a set of (e.g., a plurality of unique) molecular variants may be created, evaluated, validated, selected, and applied to determine the phenotypic impact of molecular variants identified within a biological sample or record of a subject.
In some embodiments, the present disclosure describes systems and methods for determining the phenotypic impact (e.g., pathogenicity, functionality, or relative effect) of molecular variants identified within a biological sample or record of a subject on the basis of the predictor scores or predictor classifications from computational predictors generated by applying statistical (e.g., machine) learning methods to leverage the functional scores and functional classifications.
In some embodiments, and as illustrated in
In some embodiments, the present disclosure further describes systems and methods for generating pathway- and condition-specific (higher-order) Variant Interpretation Engines (VIEs) applying model combination techniques that integrate (lower-order) gene- and condition-specific Variant Interpretation Engines (VIEs) from a plurality of genes in target pathways of interest. In other embodiments, the present disclosure further describes systems and methods for generating pathway- and condition-specific (higher-order) Variant Interpretation Engines (VIEs) through statistical (e.g., machine) learning techniques that model the phenotypic impacts of molecular variants on the basis of their functional scores, functional classifications, and other features commonly used in the creation of the computational predictors, including but not limited to evolutionary, population, functional (annotation-based), structural, dynamical, and physicochemical features associated with variants and residues of functional elements.
In some embodiments, the present disclosure describes systems and methods for determining the phenotypic impact (e.g., pathogenicity, functionality, or relative effect) of molecular variants identified within a biological sample or record thereof of a subject on the basis of the hotspot scores and hotspot classifications from mutational hotspots computed by applying spatial clustering techniques to identify networks of residues with specific phenotypic impacts leveraging the herein-described and enabled functional scores, functional classifications, and molecular signals associated with molecular variants and residues.
In some embodiments, the present disclosure describes systems and methods for deriving a matrix of functional distances between molecular variants or their corresponding residues by computing a distance metric between molecular variants projected in an N-dimensional space (1≤N≤M) defined by a set of M functional scores, functional classifications, and molecular signals (as described above), where N<M when dimensionality-reduction techniques are applied to reduce the feature-space of molecular variants. As would be appreciated by a person of ordinary skill in the art, various dimensionality-reduction techniques may be applied, including but not limited to techniques reliant on linear transformations, as in principal component analysis (PCA), or non-linear transformations, as in manifold learning techniques (e.g., t-distributed stochastic neighbor embedding (tSNE) and kernel principal component analysis (kPCA)). As would be appreciated by a person of ordinary skill in the art, various distance metrics can be utilized, including but not limited to the Euclidean distance, Manhattan distance (e.g., City-Block), Mahalanobis distance, Chebyshev distance, and various others.
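By way of example, a matrix of functional distances can be computed as sketched below for three of the named metrics. The variant names and two-dimensional feature vectors are hypothetical, no dimensionality reduction is applied (N = M = 2), and the Mahalanobis distance is omitted because it additionally requires an estimate of the feature covariance.

```python
def euclidean(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v)) ** 0.5

def manhattan(u, v):
    return sum(abs(a - b) for a, b in zip(u, v))

def chebyshev(u, v):
    return max(abs(a - b) for a, b in zip(u, v))

def distance_matrix(features, metric=euclidean):
    """Return {(variant_i, variant_j): distance} for every ordered pair.

    features: {variant: [functional score or molecular signal, ...]}
    """
    names = list(features)
    return {
        (i, j): metric(features[i], features[j])
        for i in names
        for j in names
    }

# Hypothetical feature vectors for three variants.
features = {"v1": [0.0, 0.0], "v2": [3.0, 4.0], "v3": [3.0, 0.0]}
d_euclidean = distance_matrix(features, euclidean)
d_manhattan = distance_matrix(features, manhattan)
d_chebyshev = distance_matrix(features, chebyshev)
```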
In some embodiments, the present disclosure describes systems and methods for the identification of Significantly Mutated Regions (SMRs) and Networks (SMNs) by measuring and scoring the phenotype-associated mutation density (e.g., number of observed phenotype-associated variants per residue) within spatially-proximal residues of functional elements (e.g., protein-coding genes) through the application of spatial clustering techniques across a plurality of spatial distance metrics, including the herein described and enabled functional distances, sequence distances, structure distances, (co)evolutionary distances, and combinations thereof.
In some embodiments, and as illustrated in
In some embodiments, and as illustrated in
SMR/SMN-detection techniques 1805 can comprise a series of steps including but not limited to: (1.1) projection 1810 of phenotype-associated molecular variants 1806 in functional, sequence, structural, or (co)evolutionary dimensions (or combinations thereof), (1.2) application of spatial clustering techniques 1812 (e.g., DBSCAN) to detect clusters of spatially-proximal phenotype-associated variants, and (1.3) measurement of mutation density, scoring number of phenotype-associated variants per residue in cluster.
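Steps (1.1) through (1.3) can be sketched, under simplifying assumptions, for a projection onto the sequence dimension alone: variants are represented by residue positions, clustered with a small from-scratch DBSCAN, and each cluster's mutation density is measured as variants per residue spanned. The positions and the `eps` and `min_samples` settings are hypothetical, and a production system would more likely rely on an established clustering library.

```python
def dbscan_1d(positions, eps, min_samples):
    """Label each position with a cluster id (0, 1, ...) or -1 for noise."""
    n = len(positions)
    labels = [None] * n

    def neighbors(i):
        # A point counts as its own neighbor, as in standard DBSCAN.
        return [j for j in range(n) if abs(positions[i] - positions[j]) <= eps]

    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        nb = neighbors(i)
        if len(nb) < min_samples:
            labels[i] = -1  # provisionally noise; may become a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in nb if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # noise reachable from a core point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            nb_j = neighbors(j)
            if len(nb_j) >= min_samples:
                queue.extend(nb_j)  # j is a core point: keep expanding
    return labels

def mutation_density(positions, labels):
    """Variants per residue within the span of each cluster."""
    clusters = {}
    for pos, lab in zip(positions, labels):
        if lab != -1:
            clusters.setdefault(lab, []).append(pos)
    return {lab: len(ps) / (max(ps) - min(ps) + 1) for lab, ps in clusters.items()}

# Hypothetical residue positions of phenotype-associated variants.
positions = [10, 11, 12, 13, 50, 90, 92, 94]
labels = dbscan_1d(positions, eps=2, min_samples=3)
density = mutation_density(positions, labels)
```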
SMR/SMN-detection techniques 1805 can further comprise the steps denoted in 1814 including, but not limited to: (1.4) scoring of mutation density probability by, for example, computing the (e.g., binomial) probability of obtaining k-or-more (e.g., greater than or equal to k) observed phenotype-associated variants per cluster, given the per-residue mutation rate within each functional element (e.g., protein-coding gene), (1.5) applying multiple hypothesis correction (MHC) across mutation density probabilities of discovered clusters, and (1.6) computing false-discovery rates (FDRs) for the observed (e.g., raw or corrected) mutation density probabilities using background models of mutation density probabilities derived by randomizing positions of the observed phenotype-associated variants within each functional element.
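Steps (1.4) and (1.5) can be illustrated as follows. The sketch assumes, for simplicity, that each of the N phenotype-associated variants in a functional element falls within a given cluster with probability equal to the cluster length divided by the element length (i.e., a uniform per-residue rate), and it uses the Benjamini-Hochberg procedure as one possible multiple hypothesis correction. The cluster sizes and counts are hypothetical, and the randomization-based FDR of step (1.6) is not shown.

```python
from math import comb

def binomial_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p ** i * (1 - p) ** (n - i) for i in range(k, n + 1))

def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

# Hypothetical element: 400 residues, 20 phenotype-associated variants.
element_length, n_variants = 400, 20
# Hypothetical clusters: (cluster length in residues, variants observed).
clusters = [(10, 6), (20, 2), (8, 4)]
raw_p = [
    binomial_tail(k, n_variants, length / element_length)
    for length, k in clusters
]
corrected_p = benjamini_hochberg(raw_p)
```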
Training/Validation Layer 1804 can further perform the SMR/SMN-selection techniques 1815. SMR/SMN-selection techniques can comprise the steps of (2.1) defining (e.g., raw or corrected) mutation density probabilities and/or false discovery rates (FDRs) as hotspot scores and applying cutoffs to statistically define hotspot classifications, thereby nominating residues in candidate clusters (e.g., sequence 1816, function 1818, and sequence 1820), (2.2) detecting residues in candidate clusters from multiple, distinct projections/spaces, (2.3) assigning residues to individual clusters applying an assignment heuristic (e.g., selecting the cluster largest in size (e.g., cluster with the highest number of residues)), and (2.4) identifying SMRs/SMNs as the final set of clusters meeting these criteria. The final set of SMRs/SMNs can be derived from multiple, distinct projections (e.g., sequence 1820, function 1818, or sequence, function (combined) 1822).
In some embodiments, the present disclosure describes systems and methods for the identification of SMRs/SMNs by measuring and scoring the phenotype-associated mutation density (e.g., number of observed phenotype-associated variants per residue) within spatially-proximal residues of functional elements (e.g., protein-coding genes) through the application of spatial clustering techniques across a plurality of spatial distance metrics, where the phenotype-associated variants may be defined on the basis of the functional scores and functional classifications herein described. As would be appreciated by a person of ordinary skill in the art, these methods may allow the determination of clusters of residues in which variants with specifically-defined phenotypic impacts occur.
In some embodiments, the present disclosure describes systems and methods for evaluating the accuracy, performance, or robustness of independent evidence datasets for the interpretation of molecular variants, such as quantitative (e.g., scores) or qualitative (e.g., classifications) evidence from computational predictors (e.g., M-CAP, REVEL, SIFT, and PolyPhen2), as well as gene-specific predictors (e.g., PON-P2), mutational hotspots, and population genomics metrics (e.g., allele frequency-based variant classifications) (Amendola et al. 2016), against the herein described functional scores and functional classifications.
In some embodiments, the present disclosure describes systems and methods for computing evaluation metrics to assess concordance between an evidence dataset and the herein described functional scores and functional classifications, and, based on these evaluation metrics, selecting the best-performing evidence dataset for use in variant interpretation and prioritization. As would be appreciated by a person of ordinary skill in the art, various evaluation metrics can be used to assess the concordance of an evidence dataset against the herein described functional scores or functional classifications. For quantitative evidence (e.g., scores), these may include Pearson's correlation coefficient, Spearman's rank-order correlation, Kendall correlation, and various others as would be appreciated by a person of ordinary skill in the art. For qualitative evidence (e.g., classifications), these may include accuracy, Matthews correlation coefficient, Cohen's kappa coefficient, Youden's index (e.g., informedness), F-measure (e.g., F1 score), true positive rate (e.g., sensitivity or recall), true negative rate (e.g., specificity), positive predictive value (e.g., precision), negative predictive value, positive likelihood ratio, negative likelihood ratio, diagnostic odds ratio, and various others as would be appreciated by a person of ordinary skill in the art.
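Two of the named metrics are sketched below for illustration: Spearman's rank-order correlation for quantitative evidence and the Matthews correlation coefficient for qualitative evidence. The predictor outputs, functional scores, and class labels are hypothetical, and tied scores receive average ranks.

```python
from math import sqrt

def pearson(x, y):
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average
        i = j + 1
    return result

def spearman(x, y):
    """Spearman's rank-order correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

def matthews_cc(predicted, truth, positive="pathogenic"):
    """Matthews correlation coefficient for a binary classification."""
    pairs = list(zip(predicted, truth))
    tp = sum(p == positive and t == positive for p, t in pairs)
    tn = sum(p != positive and t != positive for p, t in pairs)
    fp = sum(p == positive and t != positive for p, t in pairs)
    fn = sum(p != positive and t == positive for p, t in pairs)
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0

# Hypothetical evidence dataset compared against functional scores/classes.
predictor_scores = [0.1, 0.4, 0.35, 0.8]
functional_scores = [0.2, 0.5, 0.6, 0.9]
rho = spearman(predictor_scores, functional_scores)

predicted = ["pathogenic", "pathogenic", "benign", "benign", "pathogenic", "benign"]
functional = ["pathogenic", "pathogenic", "benign", "benign", "benign", "pathogenic"]
mcc = matthews_cc(predicted, functional)
```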
In some embodiments, the present disclosure describes systems and methods that may continuously evaluate, validate, and optimize (e.g., select, remove, or modify) diverse evidence datasets on the basis of the above described evaluation metrics, and distribute the best-performing (e.g., independent) evidence datasets to client systems via an Application Program Interface (API) for use in variant interpretation and prioritization practices determining the phenotypic impact (e.g., pathogenicity, functionality, or relative effect) of molecular variants identified within a biological sample or record thereof of a subject.
In some embodiments, the present disclosure describes systems and methods for determining the degree of ascertainment bias, reporting bias, or outcome bias present within a dataset of variants, including clinical datasets (e.g., ClinVar, HumVar, VariBench, SwissVar, PhenCode, or locus-specific databases), population datasets (e.g., ExAC, GnomAD, and 1000 Genomes), or independent evidence datasets for the interpretation of molecular variants, such as but not limited to computational predictors (e.g., M-CAP, REVEL, SIFT, PolyPhen2, and PON-P2). In some embodiments, the present disclosure describes systems and methods for determining biases on the basis of the expected distributions of the herein described functional scores, functional classifications, and molecular signals associated with molecular variants and residues.
In some embodiments, the present disclosure describes systems and methods for the evaluation of a target variant dataset by measuring and scoring the difference between the distributions of functional scores, functional classifications, and molecular signals of molecular variants and residues within the target dataset against the expected distributions of functional scores, functional classifications, and molecular signals of molecular variants from a reference dataset. In some embodiments, the measurement of inherent biases within a target variant dataset may comprise a series of steps, including but not limited to: (1) collection of functional scores, functional classifications, and molecular signals associated with molecular variants in the target and reference datasets, (2) estimating the probability density function of functional scores, functional classifications, or molecular signals associated with molecular variants within the reference dataset, (3) estimating the probability density function of functional scores, functional classifications, or molecular signals associated with molecular variants within the target dataset, and (4) measuring the statistical distance between the target dataset-derived probability density function and the reference dataset-derived probability density function of functional scores, functional classifications, or molecular signals. 
In some embodiments, the measurement of inherent biases within a target variant dataset comprises a series of steps, including: (5) sampling variants from the reference dataset (e.g., to match the sample population size of the target dataset), (6) estimating the probability density function of functional scores, functional classifications, or molecular signals of the reference dataset sampled in step 5, (7) measuring the statistical distance between the target dataset-derived probability density function and the sampled reference dataset-derived probability density function of functional scores, functional classifications, or molecular signals, and (8) iterating steps 5-7 to obtain a robust estimate and confidence intervals of the statistical distance between the probability density functions of functional scores, functional classifications, or molecular signals of the target and reference datasets. In some embodiments, the above systems and methods for the detection and statistical evaluation of bias permit the identification of clinical datasets, population datasets, or evidence datasets in which the contained variants have different functional scores, functional classifications, or molecular signals from those expected in a reference dataset.
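By way of non-limiting illustration, the sampling-based bias measurement described above can be sketched as follows. This is a minimal sketch under stated assumptions: the probability density functions are approximated by normalized histograms, the statistical distance is the total variation distance, sampling is performed with replacement, and the confidence interval is a simple percentile interval. None of these choices is specified by the disclosure, and the function names and toy scores are hypothetical.

```python
import random

def histogram_pdf(scores, n_bins, lo, hi):
    """Normalized histogram as a discrete stand-in for a density estimate."""
    counts = [0] * n_bins
    width = (hi - lo) / n_bins
    for s in scores:
        idx = min(int((s - lo) / width), n_bins - 1)
        counts[idx] += 1
    return [c / len(scores) for c in counts]

def total_variation(p, q):
    """Total variation distance between two discrete distributions."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))

def bootstrap_distance(target, reference, n_bins=10, n_iter=200, seed=0):
    """Repeatedly sample the reference at the target's size and measure
    the distance to the target; return the mean and a 95% interval."""
    rng = random.Random(seed)
    lo = min(min(target), min(reference))
    hi = max(max(target), max(reference))
    if hi == lo:          # avoid zero-width bins for constant scores
        hi = lo + 1.0
    target_pdf = histogram_pdf(target, n_bins, lo, hi)
    distances = []
    for _ in range(n_iter):
        sample = rng.choices(reference, k=len(target))      # sampling step
        sample_pdf = histogram_pdf(sample, n_bins, lo, hi)   # density step
        distances.append(total_variation(target_pdf, sample_pdf))
    distances.sort()
    mean = sum(distances) / n_iter
    lower = distances[int(0.025 * n_iter)]
    upper = distances[min(n_iter - 1, int(0.975 * n_iter))]
    return mean, (lower, upper)

# Hypothetical functional scores: a broad reference and a target dataset
# enriched for high scores (e.g., as ascertainment bias might produce).
reference = [i / 100 for i in range(100)]
target = [0.8 + i / 250 for i in range(50)]

mean, (lower, upper) = bootstrap_distance(target, reference)
self_mean, _ = bootstrap_distance(reference[::2], reference)
print(mean, (lower, upper), self_mean)
```

Comparing the target's distance with the distance obtained for a subset of the reference itself gives a rough baseline for how much distance sampling noise alone produces.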
In some other embodiments, the present disclosure describes systems and methods for evaluating underlying biases within evidence datasets by a series of steps, including but not limited to: (1) partitioning evidence and reference datasets into matching sets of quantiles (e.g., for quantitative evidence scores) or classes (e.g., qualitative evidence classifications); (2) scoring variants within each set (e.g., evidence vs. reference) across a plurality of properties (e.g., evolutionary, population, functional (e.g., annotation-based), structural, dynamical, and physicochemical features associated with variants); (3) estimating the probability density function of each property score within each set (e.g., evidence vs. reference); (4) measuring the statistical distance between the evidence set-derived probability density function and the reference set-derived probability density function of each property score; and (5) identifying properties with statistically significant differences in scores between reference and evidence sets.
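By way of non-limiting illustration, the quantile-partitioned comparison described above can be sketched as follows. This is a minimal sketch under stated assumptions: the statistical distance is the two-sample Kolmogorov-Smirnov statistic computed on empirical distributions, significance is assessed with a permutation test, and the property names `conservation` and `hydrophobicity`, the toy values, and the 0.05 threshold are hypothetical.

```python
import random

def ks_statistic(a, b):
    """Two-sample Kolmogorov-Smirnov statistic: the largest gap between
    the empirical cumulative distributions of `a` and `b`."""
    a, b = sorted(a), sorted(b)
    i = j = 0
    gap = 0.0
    for x in sorted(set(a) | set(b)):
        while i < len(a) and a[i] <= x:
            i += 1
        while j < len(b) and b[j] <= x:
            j += 1
        gap = max(gap, abs(i / len(a) - j / len(b)))
    return gap

def permutation_p_value(a, b, n_perm=500, seed=0):
    """Fraction of label shufflings whose KS statistic reaches the observed one."""
    rng = random.Random(seed)
    observed = ks_statistic(a, b)
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        if ks_statistic(pooled[:len(a)], pooled[len(a):]) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)

def split_by_quantile(variants, n_quantiles):
    """Sort variants by their 'score' and cut them into equal-sized groups."""
    ranked = sorted(variants, key=lambda v: v["score"])
    size = len(ranked) / n_quantiles
    return [ranked[round(k * size):round((k + 1) * size)]
            for k in range(n_quantiles)]

def biased_properties(evidence, reference, properties, n_quantiles=2, alpha=0.05):
    """Return (quantile, property, p-value) for each property whose values
    differ significantly between matching evidence and reference quantiles."""
    flagged = []
    matched = zip(split_by_quantile(evidence, n_quantiles),
                  split_by_quantile(reference, n_quantiles))
    for q, (ev, ref) in enumerate(matched):
        for prop in properties:
            p = permutation_p_value([v[prop] for v in ev],
                                    [v[prop] for v in ref])
            if p < alpha:
                flagged.append((q, prop, p))
    return flagged

# Hypothetical variants: the evidence set is skewed toward conserved sites,
# while hydrophobicity is distributed identically in both sets.
evidence = [{"score": i / 40, "conservation": 0.8 + (i % 10) / 100,
             "hydrophobicity": (i % 7) / 7} for i in range(40)]
reference = [{"score": i / 40, "conservation": 0.2 + (i % 10) / 100,
              "hydrophobicity": (i % 7) / 7} for i in range(40)]

flagged = biased_properties(evidence, reference,
                            ["conservation", "hydrophobicity"])
print(flagged)
```

When many properties and quantiles are tested, a multiple-testing correction would normally be applied to the p-values; that step is omitted here for brevity.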
In some embodiments, the present disclosure describes systems and methods that may continuously evaluate and select diverse evidence datasets on the basis of the above described bias metrics, and distribute the least-biased (e.g., independent) evidence datasets to client systems via an Application Program Interface (API) for use in variant interpretation and prioritization practices determining the phenotypic impact (e.g., pathogenicity, functionality, or relative effect) of molecular variants identified within a biological sample or record thereof of a subject.
In some embodiments, the present disclosure describes systems and methods for determining the phenotypic impact (e.g., pathogenicity, functionality, or relative effect) of molecular variants identified within a biological sample or record of a subject on the basis of herein described functional scores, functional classifications, predictor scores, predictor classifications, hotspot scores, and hotspot classifications, in functional elements (e.g., genes) and pathways associated with Mendelian disorders (e.g., Table 1), that are known cancer-drivers (e.g., Table 2), pharmacogenomic genes in which genotypic (e.g., sequence) variation is associated with variation in drug response (e.g., Table 3), or other clinically-valuable genes (e.g., Table 4).
In some embodiments, the present disclosure describes systems and methods for evaluating, selecting, distributing and utilizing independent evidence—determined to be the best-performing and least biased on the basis of the herein described functional scores and classifications—for the interpretation and prioritization of variants in functional elements (e.g., genes) and pathways associated with Mendelian disorders (e.g., Table 1), that are known cancer-drivers (e.g., Table 2), pharmacogenomic genes in which genotypic (e.g., sequence) variation is associated with variation in drug response (e.g., Table 3), or other clinically-valuable genes (e.g., Table 4).
As discussed above, Table 1 is an example table of functional elements and pathways associated with Mendelian disorders, according to some embodiments. Table 2 is an example table of functional elements and pathways that are known cancer-drivers, according to some embodiments. Table 3 is an example table of pharmacogenomic genes in which genotypic (e.g., sequence) variation is associated with variation in drug response, according to some embodiments. Table 4 is an example table of other clinically-valuable genes, according to some embodiments. Tables 1-4 may be found on page 49 of the specification.
In some embodiments, the present disclosure describes systems and methods for determining the phenotypic impact (e.g., pathogenicity, functionality, or relative effect) of molecular variants identified within a biological sample or record of a subject on the basis of herein described and enabled functional scores, functional classifications, predictor scores, or predictor classifications of variants within known targets of pathogenic variation, including (but not limited to) mutational hotspots, or for variants within, for example, 50, 100, 500, or 1,000 base pairs (bp) of such hotspots. In some embodiments, the present disclosure describes systems and methods for determining the phenotypic impact (e.g., pathogenicity, functionality, or relative effect) of molecular variants identified within a biological sample or record of a subject on the basis of functional scores, functional classifications, predictor scores, or predictor classifications of variants within regions of constrained variation in a population, or for variants within, for example, 50, 100, 500, or 1,000 bp of such regions. As would be appreciated by a person of ordinary skill in the art, a variety of methods for determining mutational hotspots and regions of constrained variation can be applied.
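By way of non-limiting illustration, assigning a variant to the smallest of the example windows (50, 100, 500, or 1,000 bp) around a hotspot or constrained region can be sketched as follows. This is a minimal sketch: the function name `smallest_window`, the representation of regions as inclusive (start, end) coordinates on a single sequence, and the toy positions are assumptions made for the example.

```python
def smallest_window(position, regions, windows=(50, 100, 500, 1000)):
    """Return 0 if `position` lies inside a region, otherwise the smallest
    window (in bp) that reaches a region, or None if no window does."""
    if any(start <= position <= end for start, end in regions):
        return 0
    for window in sorted(windows):
        if any(start - window <= position <= end + window
               for start, end in regions):
            return window
    return None

# Hypothetical hotspot spanning positions 1000-1010 and four variant positions.
hotspots = [(1000, 1010)]
assignments = {pos: smallest_window(pos, hotspots)
               for pos in (1005, 1040, 1600, 3000)}
print(assignments)
```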
Various embodiments can be implemented, for example, using one or more computer systems, such as computer system 1900 shown in
Computer system 1900 can be any well-known computer capable of performing the functions described herein.
Computer system 1900 includes one or more processors (also called central processing units, or CPUs), such as a processor 1904. Processor 1904 is connected to a communication infrastructure or bus 1906.
One or more processors 1904 may each be a graphics processing unit (GPU). In an embodiment, a GPU is a processor that is a specialized electronic circuit designed to process mathematically intensive applications. The GPU may have a parallel structure that is efficient for parallel processing of large blocks of data, such as mathematically intensive data common to computer graphics applications, images, videos, etc.
Computer system 1900 also includes user input/output device(s) 1903, such as monitors, keyboards, pointing devices, etc., that communicate with communication infrastructure 1906 through user input/output interface(s) 1902.
Computer system 1900 also includes a main or primary memory 1908, such as random access memory (RAM). Main memory 1908 may include one or more levels of cache. Main memory 1908 has stored therein control logic (e.g., computer software) and/or data.
Computer system 1900 may also include one or more secondary storage devices or memory 1910. Secondary memory 1910 may include, for example, a local, network, or cloud-accessible hard disk drive 1912 and/or a removable storage device or drive 1914. Removable storage drive 1914 may be a floppy disk drive, a magnetic tape drive, a compact disk drive, an optical storage device, tape backup device, and/or any other storage device/drive.
Removable storage drive 1914 may interact with a removable storage unit 1918. Removable storage unit 1918 includes a computer usable or readable storage device having stored thereon computer software (control logic) and/or data. Removable storage unit 1918 may be a floppy disk, magnetic tape, compact disk, DVD, optical storage disk, and/or any other computer data storage device. Removable storage drive 1914 reads from and/or writes to removable storage unit 1918 in a well-known manner.
According to an exemplary embodiment, secondary memory 1910 may include other means, instrumentalities or other approaches for allowing computer programs and/or other instructions and/or data to be accessed by computer system 1900. Such means, instrumentalities or other approaches may include, for example, a removable storage unit 1922 and an interface 1920. Examples of the removable storage unit 1922 and the interface 1920 may include a program cartridge and cartridge interface (such as that found in video game devices), a removable memory chip (such as an EPROM or PROM) and associated socket, a memory stick and USB port, a memory card and associated memory card slot, and/or any other removable storage unit and associated interface.
Computer system 1900 may further include a communication or network interface 1924. Communication interface 1924 enables computer system 1900 to communicate and interact with any combination of remote devices, remote networks, remote entities, etc. (individually and collectively referenced by reference number 1928). For example, communication interface 1924 may allow computer system 1900 to communicate with remote devices 1928 over communications path 1926, which may be wired and/or wireless, and which may include any combination of LANs, WANs, the Internet, etc. Control logic and/or data may be transmitted to and from computer system 1900 via communication path 1926.
In an embodiment, a tangible apparatus or article of manufacture comprising a tangible computer useable or readable medium having control logic (software) stored thereon is also referred to herein as a computer program product or program storage device. This includes, but is not limited to, computer system 1900, main memory 1908, secondary memory 1910, and removable storage units 1918 and 1922, as well as tangible articles of manufacture embodying any combination of the foregoing. Such control logic, when executed by one or more data processing devices (such as computer system 1900), causes such data processing devices to operate as described herein.
Based on the teachings contained in this disclosure, it will be apparent to persons skilled in the relevant art(s) how to make and use embodiments of this disclosure using data processing devices, computer systems and/or computer architectures other than that shown in
It is to be appreciated that the Detailed Description section, and not any other section, is intended to be used to interpret the claims. Other sections can set forth one or more but not all exemplary embodiments as contemplated by the inventor(s), and thus, are not intended to limit this disclosure or the appended claims in any way.
While this disclosure describes exemplary embodiments for exemplary fields and applications, it should be understood that the disclosure is not limited thereto. Other embodiments and modifications thereto are possible, and are within the scope and spirit of this disclosure. For example, and without limiting the generality of this paragraph, embodiments are not limited to the software, hardware, firmware, and/or entities illustrated in the figures and/or described herein. Further, embodiments (whether or not explicitly described herein) have significant utility to fields and applications beyond the examples described herein.
Embodiments have been described herein with the aid of functional building blocks illustrating the implementation of specified functions and relationships thereof. The boundaries of these functional building blocks have been arbitrarily defined herein for the convenience of the description. Alternate boundaries can be defined as long as the specified functions and relationships (or equivalents thereof) are appropriately performed. Also, alternative embodiments can perform functional blocks, steps, operations, methods, etc. using orderings different than those described herein.
References herein to “one embodiment,” “an embodiment,” “an example embodiment,” or similar phrases, indicate that the embodiment described can include a particular feature, structure, or characteristic, but not every embodiment necessarily includes the particular feature, structure, or characteristic. Moreover, such phrases are not necessarily referring to the same embodiment. Further, when a particular feature, structure, or characteristic is described in connection with an embodiment, it would be within the knowledge of persons skilled in the relevant art(s) to incorporate such feature, structure, or characteristic into other embodiments whether or not explicitly mentioned or described herein. Additionally, some embodiments can be described using the expressions “coupled” and “connected” along with their derivatives. These terms are not necessarily intended as synonyms for each other. For example, some embodiments can be described using the terms “connected” and/or “coupled” to indicate that two or more elements are in direct physical or electrical contact with each other. The term “coupled,” however, can also mean that two or more elements are not in direct contact with each other, but yet still co-operate or interact with each other.
The breadth and scope of this disclosure should not be limited by any of the above-described exemplary embodiments, but should be defined only in accordance with the following claims and their equivalents.
This application claims priority to U.S. Provisional Patent Application No. 62/521,759, filed on Jun. 19, 2017, now pending, and U.S. Provisional Patent Application No. 62/640,432, filed on Mar. 8, 2018, now pending, both of which are herein incorporated by reference in their entireties.
Number | Date | Country
---|---|---
62640432 | Mar 2018 | US
62521759 | Jun 2017 | US