SYSTEMS AND METHODS FOR THE INTERPRETATION OF GENETIC AND GENOMIC VARIANTS VIA AN INTEGRATED COMPUTATIONAL AND EXPERIMENTAL DEEP MUTATIONAL LEARNING FRAMEWORK

Information

  • Patent Application
  • 20230187016
  • Publication Number
    20230187016
  • Date Filed
    December 14, 2022
  • Date Published
    June 15, 2023
  • CPC
    • G16B5/00
    • G16B20/00
    • G16B40/00
    • G16B40/30
    • G16B40/20
    • G16B20/20
  • International Classifications
    • G16B5/00
    • G16B20/00
    • G16B40/00
    • G16B40/30
    • G16B40/20
    • G16B20/20
Abstract
Disclosed herein are system, method, and computer program product embodiments for determining phenotypic impacts of molecular variants identified within a biological sample. Embodiments include receiving molecular variants associated with functional elements within a model system. The embodiments then determine molecular scores associated with the model system. The embodiments then determine molecular signals and population signals associated with the molecular variants based on the molecular scores. The embodiments then determine functional scores for the molecular variants based on statistical learning. The embodiments then derive evidence scores of the molecular variants based on the functional scores. The embodiments then determine phenotypic impacts of the molecular variants based on the functional scores or evidence scores.
Description
OVERVIEW

Understanding the impact of genotypic (e.g., sequence) variants within functional elements in the genome—such as protein coding genes, non-coding genes, and regulatory elements—is critical to a diverse array of life sciences applications. Today, nearly half of all disease-associated genes harbor a higher number of uncharacterized variants in the general population than variants of known clinical significance. This poses significant challenges for both diagnostic and screening tests evaluating genetic and genomic sequences (Landrum et al. 2015; Lek et al. 2016). A high number of novel variants of unknown clinical significance is a feature of nearly all genes (e.g., for both germline and somatic variants in the population) and affects even the most frequently tested genes. For example, tests that evaluate gene-panels for cancer predisposing mutations report finding as many as 95 uncharacterized variants per known disease-causing variant (Maxwell et al. 2016). As such, predicting the phenotypic (e.g., cellular, organismal, clinical, or otherwise) consequences of genotypic variants is a hurdle to leveraging genetic and genomic information in a wide array of clinical settings.


Genotypic (e.g., sequence) variants within genomically-encoded functional elements can affect diverse biophysical processes, altering distinct molecular functions within each element, and resulting in varied clinical and non-clinical phenotypes. For example, in an established tumor suppressor protein coding gene, phosphatase and tensin homolog (PTEN), genotypic variants affecting transcription (e.g., −903G>A, −975G>C, and −1026C>A), protein stability (e.g., C136R), phosphatase catalytic activity (e.g., C124S, H93R), and substrate recognition (e.g., G129E), have all been associated with Cowden Syndrome (CS), presenting high risks of breast, thyroid, endometrial, kidney, and colorectal cancers and melanoma (Heikkinen et al. 2011; He et al. 2013; Myers et al. 1997; Myers et al. 1998). Variants affecting the same biophysical processes and molecular functions can lead to co-morbidities between distinct disorders, as exemplified by PTEN variants affecting phosphatase activity (e.g., H93R) which have been additionally implicated in autism spectrum disorder (ASD) (Johnston and Raines 2015), leading to frequent co-morbidities between ASD and cancers (Markkanen et al. 2016). Moreover, variants affecting distinct biophysical processes and molecular mechanisms within a functional element can present stereotypic, differentiated clinical and non-clinical phenotypes. Mutations in the lamin A/C gene (LMNA) cause a compendium of more than fifteen diseases collectively known as “laminopathies,” which include A-EDMD (autosomal Emery-Dreifuss muscular dystrophy), DCM (dilated cardiomyopathy), LGMD1B (limb-girdle muscular dystrophy 1B), L-CMD (LMNA-related congenital muscular dystrophy), FPLD2 (familial partial lipodystrophy 2), HGPS (Hutchinson-Gilford progeria syndrome), atypical WRN (Werner syndrome), MAD (mandibuloacral dysplasia) and CMT2B (Charcot-Marie-Tooth disorder type 2B) (Scharner et al. 2010).
In LMNA, genotypic (e.g., sequence) variants leading to HGPS create a cryptic splice site donor in the lamin A-specific exon 11 that results in a truncated form of lamin A, whereas variants leading to FPLD2 alter surface charge of the Ig-like domain and do not change the crystal structure of the mutant protein (Scharner et al. 2010). Thus, disentangling the complexity of genotype-phenotype relationships across a wide array of variant types, functional elements, molecular systems, and cellular effects is an outstanding challenge to robust, scalable interpretation of the phenotypic consequences of variants discovered in clinical and non-clinical genetic and genomic tests.


Indeed, assessment of the significance of genotypic (e.g., sequence) variants can be a complex and challenging task. As recently as 2015, a survey of variant classifications demonstrated that as many as 17% (e.g., 2,229/12,895) of variant classifications were inconsistent among classification submitters (Rehm et al. 2015). Between clinical testing laboratories, the concordance in interpretations has been measured to be as low as 34% though specific recommendations can increase inter-laboratory concordance to 71% (Amendola et al. 2016).


With greater than 5,300 genes evaluated by genetic tests (e.g., according to the NCBI Genetic Test Registry) in the market, scalable solutions for interpreting (e.g., classifying) genotypic (e.g., sequence) variants in a broad array of genes, diseases, and contexts (e.g., clinical and non-clinical) are critical to the efforts in the precision medicine and life sciences industries. With greater than 14,000,000 possible (e.g., unique) molecular variants within the subset of molecular variants corresponding to single nucleotide variants (SNVs), within the subset of coding sequences, and within the subset of protein-coding genes in the clinical testing market, effective solutions for molecular variant classification need to be robust and scalable.


While multiple strategies exist for identifying the phenotypic impacts of molecular variants (including but not limited to family segregation, functional assays, and case-control studies), at present only computational variant impact predictors are able to provide supporting evidence at the required scale. In effect, an analysis of clinical variant classifications from practitioners following the joint guidelines for clinical variant interpretation from the American College of Medical Genetics and Genomics (ACMG) and the Association for Molecular Pathology (AMP) demonstrates that ~50% of clinical variant classifications rely on the use of computational variant impact predictors. Yet, despite their wide use, benchmarking studies indicate that computational variant impact prediction algorithms, such as SIFT, PolyPhen (v2), GERP++, Condel, CADD, REVEL, and others, have demonstrably low performance, with accuracies (AUC) in the 0.52-0.75 range (Mahmood et al. 2017).
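For orientation, an AUC can be read as the probability that a predictor scores a randomly chosen pathogenic variant above a randomly chosen benign one, so values near 0.5 are close to chance. The following is a minimal sketch of that rank-based definition; the function name and the scores are illustrative assumptions and are not taken from the cited benchmark.

```python
def auc(pos_scores, neg_scores):
    """Rank-based AUC: P(score_pos > score_neg), counting ties as 0.5."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


# Hypothetical predictor scores (illustrative only, not real data).
pathogenic = [0.9, 0.6, 0.4]
benign = [0.5, 0.3, 0.7]
print(round(auc(pathogenic, benign), 3))
```

With these made-up scores, six of the nine pathogenic/benign pairs are ranked correctly, which falls inside the range quoted above.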


Direct assays of molecular function may provide a basis for the accurate interpretation of the clinical and non-clinical impacts of genotypic (e.g., sequence) variants (Shendure and Fields 2016; Araya and Fowler 2011). To date, a diverse spectrum of assays has been devised to directly assess the impact of variants on a wide array of molecular functions. However, existing methods require a priori knowledge or assumptions of the mechanism of action of variants associated with the clinical (and non-clinical) phenotypes under investigation to define the molecular functions to assay (Shendure and Fields 2016). These methods are often limited to capturing the effects of, and informing on, only variants affecting the specific molecular functions assayed, imposing limitations on the types of variants, types of molecular functions, and types of functional elements and genes which can be assayed at large scale. Thus, while a phosphatase assay, for example, can nominate (e.g., rule-in) potential disease-associations for variants affecting catalytic activity of the PTEN tumor suppressor, such an assay may not be able to exclude (e.g., rule-out) potential disease-associations for variants affecting protein stability, as these variants may increase risk of developing disease without observable defects in catalytic activity. Conversely, while a protein stability assay, for example, can nominate (e.g., rule-in) potential disease-associations for variants leading to stability defects in the PTEN tumor suppressor, such an assay may not be able to exclude (e.g., rule-out) potential disease-associations for variants affecting catalytic activity. The potential need for a priori knowledge or assumptions of the mechanism of action (and hence relevant molecular functions to assay) may limit the application of these methods to well-characterized functional elements (e.g., genes) and phenotypes, which may prevent their application to poorly understood disease-associated genes.


Building on the technological foundations of high-throughput DNA sequencing platforms, recently developed large-scale functional assays, such as Deep Mutational Scanning (DMS), HITS-KIN, RNA-MAP, and others, have enabled comprehensive or near-comprehensive coverage of the possible sequence variants of distinct sequence classes, including single-nucleotide variants (SNVs) and non-synonymous variants (NSVs, missense variants) in coding, non-coding, and regulatory elements (Fowler et al. 2010; Araya et al. 2012; Guenther et al. 2013; Buenrostro et al. 2014; Kelsic et al. 2016; Patwardhan et al. 2009). Such methods may serve as the basis for robust, statistically-validated interpretation of the impact of molecular variants, such as genotypic (e.g., sequence) variants, on patient phenotypes (Starita et al. 2015; Majithia et al. 2016), including clinical phenotypes such as lipodystrophy and increased risk of type 2 diabetes (T2D) in patients with variants in PPARG, or increased risk of breast and ovarian cancers in patients with variants in BRCA1. While such methods may provide robust variant interpretation in clinical and non-clinical testing settings, these methods may require significant development and customization to assay each molecular function and each functional element. This may limit their utility as a generalizable, scalable solution to systematically assess the clinical and non-clinical consequences of molecular variants, such as genotypic (e.g., sequence) variants, across diverse types of variants, biophysical processes, molecular functions, functional elements, genes, and ultimately, pathways. Thus, there is a need for a multi-functional platform and methods for variant impact assessment.





BRIEF DESCRIPTION OF THE DRAWINGS

The accompanying drawings are incorporated herein and form a part of the specification.



FIGS. 1A-1C illustrate integrated functional assay and computational Deep Mutational Learning (DML) processes and systems for determining the phenotypic impact of molecular variants, as well as example (e.g., intermediate) data generated from the application of processes and systems in two genes of the RAS/MAPK family of disorders, according to some embodiments.



FIGS. 2A-2B illustrate the performance of Deep Mutational Learning (DML) processes and systems in the identification (e.g., binary classification) of disease-causing (e.g., pathogenic) and neutral (e.g., benign) molecular variants for germline (e.g., inherited) and somatic disorders in three genes of the RAS/MAPK pathway, HRAS, PTPN11, and MAP2K2, according to some embodiments.



FIGS. 3A-3B illustrate the performance of Deep Mutational Learning (DML) processes and systems in the identification (e.g., binary classification) of cells harboring germline disease-causing (e.g., pathogenic) or neutral (e.g., benign) molecular variants in MAP2K2, according to some embodiments.



FIG. 4 illustrates an architecture of a neural network-based Denoising Autoencoder trained and applied to generate robust, reduced representations of molecular scores, according to some embodiments.



FIG. 5 illustrates normalized ERK pathway activation measured as the fraction of total ERK protein phosphorylated through enzyme-linked immunosorbent assays of cellular extracts from H293 cells harboring control, wildtype, and mutant versions of MAP2K2 and PTPN11, according to some embodiments.



FIG. 6 illustrates an example of a method for reducing the costs of deploying Deep Mutational Learning (DML) to identify the phenotypic impact of molecular variants through the staged optimization and deployment of assays with varying cell-number, read-depth, Dimensionality Reduction Models (mDR), and Functional Models (mF), whereby optimization is first carried out on a (reduced) Truth Set of molecular variants, and deployment includes a Target Set of molecular variants, according to some embodiments.



FIG. 7 illustrates an example of a method for computing phenotype scores, according to some embodiments.



FIG. 8 illustrates an example of a method for computing molecular scores, according to some embodiments.



FIG. 9 illustrates methods for computing molecular signals associated with individual molecular variants, according to some embodiments.



FIG. 10 illustrates methods for computing molecular state-specific independent or disjoint estimates of molecular signals, according to some embodiments.



FIG. 11 illustrates methods for characterizing the distribution of cells with specific molecular variants across molecular states or phenotype scores, and deriving population signals, according to some embodiments.



FIG. 12 illustrates an example of a method for leveraging unsupervised learning techniques for identification of higher-order molecular signals from lower-order molecular signals associated with individual molecular variants, according to some embodiments.



FIG. 13 illustrates an example of a method for deriving functional scores and functional classifications via machine learning to associate molecular, phenotype, or population signals with phenotypic impacts of molecular variants via regression and classification techniques, according to some embodiments.



FIGS. 14A-14B illustrate an example of the performance of methods and systems for the binomial classification of molecular variants with two distinct phenotypic impacts as trained using varying numbers of cells, according to some embodiments.



FIG. 15 illustrates an example of a method that permits inferring sequence-function maps describing the functional scores or functional classifications for all possible non-synonymous variants in a protein coding gene using functional scores and functional classifications from a subset of the possible non-synonymous variants, according to some embodiments.



FIG. 16 illustrates an example of systems and methods for reducing the costs and increasing the scope of DML processes to determine the phenotypic impact of molecular variants through a series of modeling layers, according to some embodiments.



FIG. 17 illustrates an example of a method for generating lower-order Variant Interpretation Engines (VIEs) that can be gene and condition-specific using machine learning techniques, according to some embodiments.



FIG. 18 illustrates an example of a method for identification of Significantly Mutated Regions (SMRs) and Networks (SMNs), according to some embodiments.



FIG. 19 is an example computer system useful for implementing various embodiments.





In the drawings, like reference numbers generally indicate identical or similar elements. Additionally, generally, the left-most digit(s) of a reference number identifies the drawing in which the reference number first appears.


DETAILED DESCRIPTION

Provided herein are system, apparatus, device, method and/or computer program product embodiments, and/or combinations and sub-combinations thereof, for enabling multi-functional, multi-element, and multi-gene (e.g., pathway-scale) assessment of the phenotypic impact of variants across a wide array of variant types, biophysical processes, molecular functions, and phenotypes.


The present disclosure provides system, apparatus, device, method and/or computer program product embodiments that can leverage high-throughput molecular measurements (e.g., next-generation sequencing), single-cell manipulation, molecular biology, computational modeling, and statistical learning techniques and can enable multi-functional, multi-element, and multi-gene (pathway-scale) assessment of the phenotypic impact of variants across a wide array of variant types, biophysical processes, molecular functions, and phenotypes.


The present disclosure provides system, apparatus, device, method and/or computer program product embodiments for systematically determining and statistically validating one or more phenotypic (e.g., clinical or non-clinical) impacts (e.g., pathogenicity, functionality, or relative effect) of molecular variants identified—such as genotypic (e.g., sequence) variants— in one or more (e.g., coding or non-coding) functional elements (e.g., protein-coding genes, non-coding genes, molecular domains such as protein or RNA domains, promoters, enhancers, silencers, regulatory binding sites, origins of replication, etc.) in the (e.g., nuclear, mitochondrial, etc.) genome(s), or their derivative molecules—within a biological sample or record thereof of a subject.


The present disclosure provides system, apparatus, device, method and/or computer program product embodiments for the classification (or regression) of likely phenotypic impacts in a subject on the basis of one or more molecular signals, phenotype signals, or population signals measured in in vivo or in vitro functional model systems. The derived regressions or classifications can be referred to as functional scores or functional classifications.


Embodiments herein represent a departure from existing computational or functional evidence support systems for molecular variant classification, as for example utilized in clinical genetic and genomic diagnostics.


First, while existing computational methods and systems for variant classification rely on a wide array of populational, evolutionary, physico-chemical, structural, and/or molecular annotations and properties for the classification of variants, they do not employ information pertaining to the impacts of molecular variants on cellular biology. As a consequence, such computational methods are unable to capture phenotypic impacts acting through variation in molecular properties within cells or variation in cellular populations and cellular heterogeneity.


Second, existing large-scale functional assays and solutions that are capable of assaying the activity of thousands of molecular variants provide activity measurements along a single dimension per molecular variant, and often require a priori knowledge or assumptions of the mechanism of action through which molecular variants exert phenotypic impacts.


Owing to these limitations, while conventional computational methods and systems for variant classification can access data across a multiplicity of annotations and parameters, these conventional approaches have demonstrably poor performance in classification (and regression) tasks for the phenotypic impact of molecular variants. Similarly, these conventional approaches require a priori knowledge or assumptions of the mechanism of action (and hence relevant molecular functions to assay), which limits their application to well-characterized functional elements (e.g., genes). This further precludes their application to poorly understood disease-associated genes. Finally, these conventional approaches require significant development and customization to assay each molecular function and each functional element.


In embodiments herein, a technological solution to overcome these technological problems involves data structures providing multi-dimensional characterization of cells and cellular populations harboring specific genotypes (e.g., molecular variants) in one or more functional elements (e.g., genes) and in one or more contexts (e.g., cell-types, drug treatments, genotypic backgrounds). Such data structures enable systems and methods for statistical learning to achieve improved accuracy in the classification tasks pertaining to the phenotypic impacts of genotypes (e.g., molecular variants or combinations thereof).


Embodiments herein enable robust, scalable, multi-dimensional classification of molecular variants (and combinations thereof) across a wide array of functional elements and phenotypes through the acquisition of hundreds to tens of thousands (~10^2-10^4) of molecular measurements per model system (e.g., cell), the construction of molecular profiles for tens to thousands (~10^1-10^3) of model systems per molecular variant, thousands (~10^3) of molecular variants per functional element (e.g., genes), and a single or a multiplicity of functional elements in parallel.
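As a back-of-envelope illustration of the scale these ranges imply, the sketch below multiplies the order-of-magnitude figures from the preceding paragraph (hundreds to tens of thousands of measurements per model system, tens to thousands of model systems per variant, thousands of variants per functional element). The variable and function names are illustrative assumptions.

```python
# Order-of-magnitude ranges read from the paragraph above (low, high).
measurements_per_cell = (10**2, 10**4)
cells_per_variant = (10**1, 10**3)
variants_per_element = 10**3


def total_measurements(per_cell, cells, variants, elements=1):
    """Total molecular measurements across all cells, variants, and elements."""
    return per_cell * cells * variants * elements


low = total_measurements(measurements_per_cell[0], cells_per_variant[0], variants_per_element)
high = total_measurements(measurements_per_cell[1], cells_per_variant[1], variants_per_element)
print(f"{low:.0e} to {high:.0e} measurements per functional element")
```

Under these assumptions, a single functional element corresponds to roughly a million to ten billion individual measurements, which is why dimensionality reduction and cost-staged deployment are treated as first-class concerns later in the disclosure.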


As illustrated in FIG. 1A, an embodiment of the present disclosure integrates Variant Library Generation 102 and Cellular Library Generation 104 methods for high-throughput mutagenesis and cellular engineering techniques to create compendiums of model systems (e.g., cells) harboring distinct molecular variants in target functional elements (e.g., genes). The embodiment provides Treatment, Single-Cell Capture, Library Preparation, Sequencing 106 methods utilizing cellular, molecular biology, and genomics techniques and technologies for treatment and capture of model systems, preparation of libraries of molecular entities, and for measuring diverse molecular entities (e.g., transcripts) within model systems. The embodiment provides Mapping, Normalization 108 bioinformatics, computational biology, and statistical techniques for mapping, quantifying, and normalizing associations between molecular variants, model systems, and molecular entities within each model system. The embodiment provides Feature Selection, Dimensionality Reduction 110 and Context Labeling, Training, Classification 112 statistical (e.g., machine) learning, distributed and high-performance computing, systems biology, population and clinical genomics techniques for label generation, feature selection, dimensionality reduction, training, and classification of molecular variants.


In some embodiments, the present disclosure describes the use of these series of methods and technologies of FIG. 1A to determine the phenotypic impacts of molecular variants identified within a biological sample. In some embodiments, the present disclosure describes the introduction of molecular variants into one or more functional elements within a model system. The model system can include single-cells, cellular compartments, subcellular compartments, or synthetic compartments. In some embodiments, the present disclosure describes the determination of molecular scores or phenotype scores of the single-cells, the cellular compartments, the subcellular compartments, or the synthetic compartments. In some embodiments, the present disclosure describes the identification of molecular variants within the single-cells, the cellular compartments, the subcellular compartments, or the synthetic compartments. As would be appreciated by a person of ordinary skill in the art, various methods can be utilized to identify molecular variants within the single-cells, the cellular compartments, the subcellular compartments, or the synthetic compartments. This may be on the basis of molecular measurements of the single-cells, the cellular compartments, the subcellular compartments, or the synthetic compartments. In some embodiments, the present disclosure describes the determination of molecular signals or phenotype signals associated with individual molecular variants on the basis of molecular scores or phenotype scores, respectively, from the single-cells, the cellular compartments, the subcellular compartments, or the synthetic compartments associated with specific molecular variants. 
In some embodiments, the present disclosure describes the determination of population signals associated with molecular variants on the basis of molecular scores or phenotype scores of the single-cells, the cellular compartments, subcellular compartments, or the synthetic compartments associated with specific molecular variants.
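The distinction between molecular signals and population signals can be sketched as two different aggregations over the same per-cell records. The example below is a minimal sketch only: the record layout, the use of a mean as the molecular signal, and the use of state fractions as the population signal are assumptions for illustration, and the variant and state names are hypothetical.

```python
from collections import Counter, defaultdict
from statistics import mean

# Hypothetical per-cell records: (variant, molecular_state, molecular_score).
cells = [
    ("variant_A", "active", 0.9),
    ("variant_A", "active", 0.7),
    ("variant_A", "resting", 0.2),
    ("wildtype", "resting", 0.1),
    ("wildtype", "resting", 0.3),
]


def molecular_signals(records):
    """Average the per-cell molecular scores of each variant (assumed summary)."""
    scores = defaultdict(list)
    for variant, _state, score in records:
        scores[variant].append(score)
    return {v: mean(s) for v, s in scores.items()}


def population_signals(records):
    """Fraction of each variant's cells found in each molecular state."""
    counts = defaultdict(Counter)
    for variant, state, _score in records:
        counts[variant][state] += 1
    return {
        v: {state: n / sum(c.values()) for state, n in c.items()}
        for v, c in counts.items()
    }


print(molecular_signals(cells))
print(population_signals(cells))
```

The first function summarizes what cells carrying a variant look like on average; the second describes how those cells are distributed across states, which can differ between variants even when the averages are similar.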


In some embodiments, the present disclosure describes the determination of functional scores or functional classifications of molecular variants by applying statistical (e.g., machine) learning approaches that associate molecular signals, phenotype signals, or population signals with the phenotypic impacts of the molecular variants. In some embodiments, the present disclosure describes the determination of evidence scores or evidence classifications of the molecular variants based on functional scores, functional classifications, predictor scores, predictor classifications, hotspot scores, or hotspot classifications. In some embodiments, the present disclosure describes the determination of the phenotypic impacts of the molecular variants identified within biological samples on the basis of the functional scores, the functional classifications, the evidence scores, or the evidence classifications of the identified molecular variants.
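One simple way to picture the step from signals to functional scores and functional classifications is a supervised classifier trained on variants of known impact. The sketch below uses plain logistic regression as a stand-in for the statistical learning approach, which the disclosure does not restrict to any one model; the feature layout, training data, learning rate, and 0.5 threshold are all illustrative assumptions.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic(features, labels, lr=0.5, epochs=2000):
    """Fit weights and bias by batch gradient descent on the log-loss."""
    n_features = len(features[0])
    weights = [0.0] * n_features
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for x, y in zip(features, labels):
            err = sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias) - y
            for j, xi in enumerate(x):
                grad_w[j] += err * xi
            grad_b += err
        weights = [w - lr * g / len(features) for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / len(features)
    return weights, bias


def functional_score(x, weights, bias):
    """Score in (0, 1); higher means more similar to the pathogenic training set."""
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)


# Hypothetical training set: [molecular signal, population signal] per variant,
# labeled 1 (pathogenic) or 0 (benign). Values are made up for illustration.
X = [[0.9, 0.8], [0.8, 0.6], [0.2, 0.1], [0.1, 0.3]]
y = [1, 1, 0, 0]
w, b = train_logistic(X, y)

score = functional_score([0.85, 0.7], w, b)
classification = "pathogenic" if score >= 0.5 else "benign"
print(classification)
```

Here the continuous output plays the role of a functional score and the thresholded label plays the role of a functional classification; evidence scores would combine such outputs with predictor or hotspot scores in a further step not shown here.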


Embodiments herein integrate methods, techniques, and technologies from a multiplicity of domains. While statistical, machine learning techniques leveraging single-cell molecular measurements have been developed and applied for the classification of model systems (e.g., cells) originating from tens (e.g., less than 10^2) of different tissues or developmental stages, the requirements for achieving accurate genotype-specific (e.g., molecular variant-specific) classifications among thousands of cells with subtle differences, such as a single nucleotide difference in a genomic background defined by greater than 3×10^9 nucleotides, within the same cell-lines, tissues, or developmental stages, can present substantial challenges.


The present disclosure provides Deep Mutational Learning (DML) system, apparatus, device, method and/or computer program product embodiments, and/or combinations and sub-combinations thereof for overcoming challenges in the identification (e.g., classification) of the phenotypic impact of molecular variants identified in subjects on the basis of biological signals assayed in single and populations of model systems (e.g., cells).


The present disclosure provides system, apparatus, device, method and/or computer program product embodiments, and/or combinations and sub-combinations thereof that improve cost-efficiency in the classification of molecular variants through (i) the directed deployment of DML processes and systems with lower-cost prediction models (see FIG. 16), and (ii) tiered deployment of DML processes and systems that allow robust reconstruction of molecular signals at reduced costs (see FIG. 6).


The present disclosure provides system, apparatus, device, method and/or computer program product embodiments, and/or combinations and sub-combinations thereof that improve the scalability and performance across functional elements (e.g., genes) through DML processes and systems that leverage information between functional elements (see FIGS. 3A and 3B).


The present disclosure provides system, apparatus, device, method and/or computer program product embodiments, and/or combinations and sub-combinations thereof for assessing the phenotypic impacts (e.g., pathogenicity, functionality, or relative effect) of one or more molecular (e.g., genotypic) variants in one or more (e.g., coding or non-coding) functional elements (e.g., protein-coding genes, non-coding genes, molecular domains such as protein or RNA domains, promoters, enhancers, silencers, regulatory binding sites, origins of replication, etc.) in the (e.g., nuclear, mitochondrial, etc.) genome(s), or their derivative molecules. As would be appreciated by a person of ordinary skill in the art, a molecular variant may be a genotypic (e.g., sequence) variant such as a single-nucleotide variant (SNV), a copy-number variant (CNV), or an insertion or deletion affecting a coding or non-coding sequence (or both) in the nuclear, mitochondrial, or episomal genome, whether natural or synthetic. As would be appreciated by a person of ordinary skill in the art, a molecular variant may also be a single-amino acid substitution in a protein molecule, a single-nucleotide substitution in an RNA molecule, a single-nucleotide substitution in a DNA molecule, or any other molecular alteration to the cognate sequence of a polymeric biological molecule.


In some embodiments, the classification (or regression) may relate to (e.g., likely) disease-causing (e.g., pathogenic) and neutral (e.g., benign) variants for disorders with genetic components, or predictions of the severity thereof, on the basis of the molecular variants identified within a biological sample or record thereof of a subject. In some other embodiments, the classification (or regression) may relate to molecular impacts (e.g., loss-of-function, gain-of-function or neutral) on the basis of molecular variants of probable molecular consequence (e.g., nonsense or insertion and deletion mutations) and probable molecular neutrality (e.g., synonymous). In some other embodiments, the classification (or regression) may relate to variation in the response to therapeutic treatments (e.g., chemical, biochemical, physical, behavioral, digital, or otherwise) on the basis of molecular variants identified within a biological sample or record thereof of a subject. In some embodiments, phenotypic impacts may refer to phenotype classes (e.g., neutral, pathogenic, benign, high-risk, low-risk, positive response variants, negative response variants) and phenotype scores (e.g., a probability of developing specific clinical and non-clinical phenotypes, the levels of metabolites in blood, and the rate at which specific compounds are absorbed or metabolized).


In some embodiments, the present disclosure provides systems and methods for modeling the diversity and prevalence of phenotypic properties within a population on the basis of the diversity and prevalence of molecular variants in representative populations. In some embodiments, the present disclosure provides systems and methods for modeling the diversity and prevalence of phenotypic properties within a population on the basis of the phenotypic impacts of molecular variants—with known or expected diversity and prevalence— where the phenotypic impacts may be modeled from one or more molecular signals, phenotype signals, or population signals, previously associated with variants in an in vivo or in vitro functional model system. In some embodiments, such modeling may be used to inform on the diversity and prevalence of mechanisms of drug-resistance in a population.
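As a rough illustration of how variant prevalence and modeled phenotypic impact might combine into a population-level estimate, the sketch below computes the probability that an individual shows a phenotype through at least one variant. This is an assumed simplification, not the disclosed method: it treats variants as acting independently, and the carrier frequencies and phenotype probabilities are made-up numbers.

```python
def expected_prevalence(variants):
    """Probability that an individual shows the phenotype via at least one variant.

    `variants` is a list of (carrier_frequency, phenotype_probability) pairs.
    Assumes variants act independently, which is a simplification.
    """
    p_none = 1.0
    for carrier_freq, phenotype_prob in variants:
        p_none *= 1.0 - carrier_freq * phenotype_prob
    return 1.0 - p_none


# Hypothetical variants: (fraction of population carrying it, modeled phenotype score).
variants = [(0.01, 0.9), (0.02, 0.5)]
print(round(expected_prevalence(variants), 5))
```

The same structure could be applied, for instance, to drug-resistance mechanisms by replacing the phenotype probability with a modeled probability of resistance per variant.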


In some embodiments, the present disclosure describes the use of models of the diversity and prevalence of phenotypic properties within a population of individuals (e.g., as informed by the phenotypic impacts of molecular variants modeled from one or more molecular signals, phenotype signals, or population signals in a functional model system) to construct cohorts of subjects (e.g., patients) and to investigate the efficacy of therapeutic and non-therapeutic interventions.


In some embodiments, the present disclosure provides systems and methods for the classification (or regression) of the phenotypic impact of molecular variants on the basis of functional scores or functional classifications derived from one or more molecular signals, phenotype signals, or population signals associated with variants as assayed in a functional model system. In some embodiments, molecular variants may be functionally modeled within cells, cellular compartments or synthetic compartments as in vivo or in vitro model systems.


In some embodiments, the molecular variants modeled (e.g., in vivo or in vitro) may be identified directly within the nucleic acid sequence of the functional elements modeled via library preparation, sequencing, and characterization of nucleic acids or nucleic acid fragments within single-cells, cellular compartments, subcellular compartments, or synthetic compartments (e.g., collectively termed model systems). In some other embodiments, the molecular variants modeled (e.g., in vivo or in vitro) may be inferred from barcode sequences associated with individual variants in the functional elements via library preparation, sequencing, and characterization of nucleic acids or nucleic acid fragments within model systems (e.g., single-cells, cellular compartments, subcellular compartments, or synthetic compartments), using a pre-assembled database of associated barcodes and variants. As would be appreciated by a person of ordinary skill in the art, molecular variants may be produced via a diversity of techniques, such as direct (e.g., chemical) synthesis, error-prone PCR, oligonucleotide-directed mutagenesis, nicking mutagenesis, or Saturation Genome Editing (SGE), among others (Firnberg et al. 2012; Kitzman et al. 2014; Wrenbeck et al. 2016; and Findlay et al. 2014). As would be appreciated by a person of ordinary skill in the art, variant libraries can then be introduced (e.g., added) into model systems (e.g., cells, cellular compartments, subcellular compartments, or synthetic compartments) using a variety of approaches, such as but not limited to homologous recombination (e.g., Cas9-mediated or Adenovirus-mediated), site-specific recombination (e.g., Flp-mediated), or viral transduction (e.g., lentiviral-mediated) (Findlay et al. 2018; Wissink et al. 2016; and Macosko et al. 2015).
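By way of non-limiting illustration, the barcode-based inference of molecular variants may be sketched as a lookup against a pre-assembled barcode-to-variant table. The barcode sequences, the table contents, the majority-vote rule, and the function name `assign_variants` are hypothetical and are assumed here only for the purpose of the example.

```python
from collections import Counter

# Hypothetical pre-assembled database associating barcodes with variants.
BARCODE_TO_VARIANT = {
    "ACGTACGT": "PTPN11:E76K",
    "TTGACCGA": "PTPN11:N308D",
    "GGCATTCA": "MAP2K2:F57C",
}


def assign_variants(cell_barcode_reads):
    """Assign each cell the variant supported by most of its barcode reads.

    cell_barcode_reads maps a cell identifier to the list of variant-barcode
    sequences observed in that cell. Cells with no known barcode, or with a
    tie between the two best-supported variants, are left unassigned (None).
    """
    assignments = {}
    for cell, barcodes in cell_barcode_reads.items():
        counts = Counter(
            BARCODE_TO_VARIANT[b] for b in barcodes if b in BARCODE_TO_VARIANT
        )
        top = counts.most_common(2)
        if not top or (len(top) == 2 and top[0][1] == top[1][1]):
            assignments[cell] = None  # ambiguous or unrecognized
        else:
            assignments[cell] = top[0][0]
    return assignments


reads = {
    "cell_1": ["ACGTACGT", "ACGTACGT", "NNNNNNNN"],  # one unknown barcode
    "cell_2": ["TTGACCGA", "GGCATTCA"],              # tie between two variants
    "cell_3": [],                                    # no barcode reads
}
result = assign_variants(reads)
```

Leaving ambiguous cells unassigned, rather than guessing, is one possible quality-control choice; other embodiments may apply different thresholds.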


In some embodiments, functional scores and functional classifications associated with individual molecular variants may be derived from measurements of molecules and/or chemical modifications present within in vivo or in vitro model systems harboring the variant within the functional element, including but not limited to DNA, RNA, and protein molecules or modifications thereof. For example, in some embodiments, measurements or models of molecular signals, cellular signals, or population signals may be made and used to learn the functional scores and/or functional classifications. In some embodiments, the functional scores and functional classifications may be derived from molecular measurements obtained via nucleic acid barcoding, isolation, enrichment, library preparation, sequencing, and characterization of a plurality of nucleic acids or nucleic acid fragments within single-cells, cellular compartments, subcellular compartments, or synthetic compartments including, but not limited to, RNA molecules, genomic DNA, chromatin-associated DNA, protein-associated DNA, accessible DNA fragments, or chemically-modified nucleic acids. In some embodiments, these procedures may utilize molecular barcoding techniques to uniquely identify or associate nucleic acids, nucleic acid fragments, or nucleic acid sequences stemming from individual single-cells, cellular compartments, subcellular compartments, or synthetic compartments (Macosko et al. 2015; Buenrostro et al. 2015; Cusanovich et al. 2015; Dixit et al. 2016; Adamson et al. 2016; Jaitin et al. 2016; Datlinger et al. 2017; Zheng et al. 2017; Cao et al. 2017). These methods may build on developments from the field of single-cell genomics (Schwartzman and Tanay 2015; Tanay and Regev 2017; Gawad et al. 2016).
In some embodiments, the systems and methods of the present disclosure may apply methods for single-cell RNA sequencing to derive molecular measurements from single-cells, cellular compartments, subcellular compartments, or synthetic compartments. These methods include but are not limited to single-cell sequencing library generation, high-throughput nucleic acid sequencing, sequencing read quality control, barcode identification (e.g., of single-cell, cellular compartment, subcellular compartment, or synthetic compartment) and quality control, sequencing read unique molecular barcode identification and quality control, sequencing read alignments, as well as read alignment filtering and quality control. In some embodiments, molecular measurements may correspond to locus-specific measurements of gene expression (e.g., RNA transcript abundance), protein abundance or modifications (e.g., phospho-protein abundance), chromatin accessibility (e.g., nucleosome occupancy), epigenetic modification (e.g., DNA methylation), regulatory activity (e.g., transcription factor binding), post-transcriptional processing (e.g., splicing), post-translational modification (e.g., ubiquitination), mutation burden (e.g., count), mutation rate (e.g., frequency), mutation signatures (e.g., count or frequency per type of mutation), or various other types of measurements of molecules within single-cells, cellular compartments, subcellular compartments, or synthetic compartments as would be appreciated by a person of ordinary skill in the art. In some embodiments, the present disclosure describes systems and methods for augmenting the quality of the molecular measurements for specific target genes and functional elements via the use of targeted enrichment or targeted capture techniques (via hybridization- or amplicon-based techniques and probes), either before, during, or after single-cell RNA library processing.


In some embodiments, molecular measurements from single-cells, cellular (or subcellular) compartments or synthetic compartments may be utilized to derive multi-locus measurements of molecular processes. For example, these measurements of molecular processes may include multi-locus measurements of gene expression, chromatin accessibility, epigenetic modification, regulatory activity, transcriptional activity, translational activity, signaling activity, pathway activity, mutation burden, mutation rate, mutation signatures, and various other measurements as would be appreciated by a person of ordinary skill in the art.


In some embodiments, molecular measurements and molecular processes from single-cells, cellular (or subcellular) compartments or synthetic compartments may be utilized to derive global (e.g., pan-locus or locus-independent) measurements of molecular features. For example, these measurements of molecular features may include global measurements of gene expression, chromatin accessibility, epigenetic modification, regulatory activity, transcriptional activity, translational activity, signaling activity, pathway activity, mutation burden, mutation rate, mutation signatures, and various other measurements as would be appreciated by a person of ordinary skill in the art.


In some embodiments, molecular measurements, molecular processes, or molecular features of single-cells, cellular compartments, subcellular compartments, or synthetic compartments may serve directly as (e.g., lower-order) molecular scores. In some embodiments, a (e.g., higher-order) molecular score may be derived by applying pre-existing models that associate multiple (e.g., lower-order) molecular scores (e.g., molecular measurements, molecular processes, or molecular features) with regulatory, signaling, pathway, processing, or cell-cycle activities, alterations, defects, or states. In some embodiments, such methods may apply gene set enrichment analysis or other derivative methods as would be appreciated by a person of ordinary skill in the art. In some embodiments, as illustrated in FIG. 8, the molecular measurements, molecular processes, molecular features, or (e.g., lower-order) molecular scores 806 from single-cells, cellular compartments, subcellular compartments, or synthetic compartments harboring the same molecular variants 802 may be fed through a series of artificial neuron layers (e.g., convolutional or perceptron layers) in an Artificial Neural Network 804 (ANN) to derive increasingly complex (e.g., higher-order) molecular scores 806, and to generate autoencoders with learned features. In some embodiments, methods for computing molecular scores, such as pathway-level analyses, may be used to preserve information about biological function while allowing for dimensionality reduction.


In some embodiments, as illustrated in FIG. 9, a database of molecular scores may be constructed via a cell scoring layer 902 from a plurality of individual single-cells, cellular compartments, subcellular compartments, or synthetic compartments. In some embodiments, the molecular scores from a plurality of single-cells, cellular compartments, subcellular compartments, or synthetic compartments, harboring the same molecular variants 906 (e.g., v1, v2, and v3) may be accessed with a variant sampling layer 908 and analyzed in a variant scoring layer 910 to derive (e.g., directly measure or model) summary statistics relating to the tendency (e.g., mean, median, mode), dispersion (e.g., variance, standard deviation), shape (e.g., skewness, kurtosis), probability (e.g., quantiles), range (e.g., confidence interval, minimum, maximum), error (e.g., standard error), or covariation (e.g., covariance) of molecular scores associated with individual molecular variants. In some embodiments, as illustrated in FIG. 9, summary statistics relating to the tendency, dispersion, shape, range, or error of molecular scores may be used to create a database of (e.g., quality-controlled) molecular signals 912 associated with individual molecular variants 906. In some embodiments, molecular measurements, molecular processes, molecular features, and molecular scores 904 may be properties of individual single-cells, cellular compartments, subcellular compartments, or synthetic compartments. In some embodiments, molecular signals may be a property of molecular variants.
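By way of non-limiting illustration, a variant scoring step of this kind may be sketched as a grouping of per-cell molecular scores by variant followed by the computation of summary statistics. The function name `variant_signals`, the choice of statistics, and the toy input values are assumptions made for the example and are not measurements.

```python
import statistics


def variant_signals(cell_scores):
    """Summarize per-cell molecular scores into per-variant molecular signals.

    cell_scores is a list of (variant, score) pairs, one pair per cell.
    Returns a dictionary keyed by variant holding tendency, dispersion,
    range, and error statistics.
    """
    by_variant = {}
    for variant, score in cell_scores:
        by_variant.setdefault(variant, []).append(score)

    signals = {}
    for variant, scores in by_variant.items():
        n = len(scores)
        # Sample variance and standard deviation need at least two cells.
        variance = statistics.variance(scores) if n > 1 else 0.0
        sd = statistics.stdev(scores) if n > 1 else 0.0
        signals[variant] = {
            "n_cells": n,
            "mean": statistics.mean(scores),
            "median": statistics.median(scores),
            "variance": variance,
            "std_error": sd / n ** 0.5,
            "min": min(scores),
            "max": max(scores),
        }
    return signals


# Toy per-cell scores for two variants (illustrative values only).
cells = [("v1", 1.0), ("v1", 2.0), ("v1", 3.0), ("v2", 4.0), ("v2", 6.0)]
signals = variant_signals(cells)
```

The per-cell scores remain properties of individual cells, while the returned statistics are properties of the variants, mirroring the distinction drawn above between molecular scores and molecular signals.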


As would be appreciated by a person of ordinary skill in the art, the molecular measurements, processes, features, and scores from model systems (e.g., single-cells, cellular compartments, subcellular compartments, or synthetic compartments) may define or correspond to distinct molecular states or specific subpopulations of model systems (e.g., single-cells, cellular compartments, subcellular compartments or synthetic compartments) with similar molecular properties. As would be appreciated by a person of ordinary skill in the art and as shown in FIG. 10, a cell scoring layer 1002 can be applied to determine the molecular states, phenotype scores 1006 (e.g., s1, s2, s3) of model systems on the basis of a variety of methods.


For example, the molecular states of model systems can be identified on the basis of cell-cycle signatures derived from gene-expression molecular scores (Macosko et al. 2015). As would be appreciated by a person of ordinary skill in the art, molecular states can be derived via scoring using previously-derived models, for example, scoring gene-expression signatures of previously characterized molecular states such as gene-expression signatures reflecting distinct phases of the cell-cycle previously characterized in chemically synchronized cells (Whitfield et al. 2002). As would be appreciated by a person of ordinary skill in the art, molecular states may also be derived via scoring using internally-derived models from partitions of model systems within which characteristic correlations between molecular signals can be detected or expected (e.g., as is the case with gene expression variation throughout distinct stages of cell-cycle). As would be appreciated by a person of ordinary skill in the art, the internally-derived models may be generated using a variety of statistical techniques (e.g., machine learning techniques).
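By way of non-limiting illustration, scoring with a previously-derived model may be sketched as averaging the expression of signature genes for each candidate state and assigning the highest-scoring state. The gene names, the signature sets, and the function name `assign_state` are placeholders assumed for the example; actual signatures would be taken from previously characterized data.

```python
# Placeholder signature gene sets (hypothetical names, not real signatures).
SIGNATURES = {
    "G1/S": ["GENE_A", "GENE_B"],
    "G2/M": ["GENE_C", "GENE_D"],
}


def assign_state(expression, signatures=SIGNATURES):
    """Score one cell against each signature and return the best state.

    expression maps gene name to a normalized expression value for one cell.
    Each state's score is the mean expression of its signature genes, with
    genes missing from the cell counted as zero.
    """
    scores = {
        state: sum(expression.get(gene, 0.0) for gene in genes) / len(genes)
        for state, genes in signatures.items()
    }
    best_state = max(scores, key=scores.get)
    return best_state, scores


cell = {"GENE_A": 0.2, "GENE_B": 0.4, "GENE_C": 2.0, "GENE_D": 1.0}
state, state_scores = assign_state(cell)
```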


In some embodiments, as illustrated in FIG. 7, the present disclosure provides systems and methods to generate a Phenotype Model (mP) for deriving phenotype scores through the use of statistical techniques (e.g., machine learning techniques) that associate molecular scores and molecular states of model systems (e.g., single-cells, cellular compartments, subcellular compartments or synthetic compartments) with the phenotypic impacts of molecular variants within each model system. Whereas molecular scores can relate directly to molecular, biological, or physical properties within individual model systems, phenotype scores can describe the (e.g., likely) phenotypic associations of molecular variants. In some embodiments, the phenotype scores are derived by applying supervised learning techniques to associate the phenotypic impacts (e.g., labels) of molecular variants within model systems with the molecular scores or molecular states (e.g., features) of model systems.


In some embodiments, a Phenotype Model (mP) and a database of phenotype scores (or phenotype classifications) are generated by accessing a database of features describing (e.g., lower- and higher-order) molecular scores and molecular states 704 of single-cells 702, and input labels 708 (e.g., a database) describing the phenotypic impact 706 of molecular variants identified within single-cells 702. In some embodiments, a training/validation layer 710 generates and quality-controls Phenotype Models (mP) that can predict the phenotypic impact 706 of individual single-cells 702. In some embodiments, a database of features describing the molecular scores and molecular states 716 of single-cells (testing) 714 is provided to the generated Phenotype Models (mP) to calculate and create a database of phenotype scores 720 describing the predicted phenotypic impact 718 of molecular variants in single-cells (testing) 714. As would be appreciated by a person of ordinary skill in the art, the performance (e.g., accuracy) of the predicted phenotypic impacts 718 in each cell (e.g., phenotype scores 720) can be determined against the known phenotypic impact of molecular variants in single-cells (testing) 714 within a testing layer 712. As would be appreciated by a person of ordinary skill in the art, the Phenotype Models (mP) can be applied to pre-compute, or compute on demand, the phenotype scores of single cells not included in training, validation, or testing. In some embodiments, such scoring and evaluation can occur in a phenotype scoring and classification layer 722. Phenotype scoring and classification layer 722 can examine the phenotype impact classification accuracy permitted on the basis of phenotype scores 720.
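By way of non-limiting illustration, the train-then-test flow described above may be sketched with a deliberately simple supervised learner. A nearest-centroid classifier is substituted here purely for brevity and is not asserted to be the learner used in any embodiment; the feature vectors, the labels, and the function names are hypothetical.

```python
import math


def train_centroids(features, labels):
    """Compute one mean feature vector (centroid) per phenotype label."""
    sums, counts = {}, {}
    for x, y in zip(features, labels):
        acc = sums.setdefault(y, [0.0] * len(x))
        for i, value in enumerate(x):
            acc[i] += value
        counts[y] = counts.get(y, 0) + 1
    return {y: [s / counts[y] for s in acc] for y, acc in sums.items()}


def predict(centroids, x):
    """Predict the label whose centroid is nearest in Euclidean distance."""
    return min(centroids, key=lambda y: math.dist(centroids[y], x))


def accuracy(centroids, features, labels):
    """Fraction of held-out cells whose predicted label matches the known one."""
    hits = sum(predict(centroids, x) == y for x, y in zip(features, labels))
    return hits / len(labels)


# Toy molecular-score features per cell and known labels (illustrative only).
train_x = [[0.1, 0.2], [0.0, 0.3], [0.9, 1.1], [1.0, 0.8]]
train_y = ["benign", "benign", "pathogenic", "pathogenic"]
test_x = [[0.2, 0.1], [0.8, 1.0]]
test_y = ["benign", "pathogenic"]

model = train_centroids(train_x, train_y)         # training step
test_accuracy = accuracy(model, test_x, test_y)   # testing step
```

Keeping the testing cells out of `train_centroids` is what allows `test_accuracy` to serve as an estimate of performance on cells not seen during training.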


In some embodiments, summary statistics relating to the tendency, dispersion, shape, range, or error of phenotype scores may be used to create a database of (e.g., quality-controlled) phenotype signals associated with individual molecular variants.


In some embodiments, and as illustrated in FIG. 10, the present disclosure describes the use of molecular state-specific molecular signals for subsequent rounds of unsupervised and supervised learning, in either the generation of molecular state-specific models or multi-state models. In some embodiments and as illustrated in FIG. 10, the present disclosure describes the use of a molecular state-, variant-specific sampling layer 1008 to access the molecular measurements, processes, features, and scores 1004 and the molecular states, phenotype scores 1006 of model systems with specific molecular variants 1010 (e.g., v1, v2, v3) and in specific molecular states, with characteristic phenotype scores, or combinations thereof. In some embodiments, the molecular measurements, processes, features, and scores 1004 or the molecular states, phenotype scores 1006 may be pre-computed or computed on demand by a cell scoring layer 1002. In some embodiments, data, summary statistics, descriptive statistics (e.g., univariate, bivariate, or multivariate analysis), inferential statistics, Bayesian inference models (e.g., variational Bayesian inference models), Dirichlet processes, or other models of the data accessed by the molecular state-, variant-specific sampling layer 1008 are used to construct a molecular, phenotype signals matrix 1012, describing molecular signals and phenotype signals in each molecular state for each molecular variant.
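By way of non-limiting illustration, a signals matrix indexed by molecular variant and molecular state may be sketched as a nested grouping of per-cell scores. The mean is used as the only summary statistic for brevity; the record layout, the function name `signals_matrix`, and the toy values are assumptions made for the example.

```python
from collections import defaultdict


def signals_matrix(cells):
    """Build {variant: {state: mean score}} from per-cell records.

    Each record is a (variant, molecular_state, score) tuple describing one
    cell. The result holds one signal per variant per molecular state.
    """
    grouped = defaultdict(lambda: defaultdict(list))
    for variant, state, score in cells:
        grouped[variant][state].append(score)
    return {
        variant: {state: sum(vals) / len(vals) for state, vals in states.items()}
        for variant, states in grouped.items()
    }


# Toy records for variants v1, v2 in states s1, s2 (illustrative only).
cells = [
    ("v1", "s1", 1.0), ("v1", "s1", 3.0), ("v1", "s2", 5.0),
    ("v2", "s1", 2.0), ("v2", "s2", 4.0), ("v2", "s2", 8.0),
]
matrix = signals_matrix(cells)
```

Selecting a single state's column across variants would correspond to a molecular state-specific matrix, while retaining all columns would correspond to a multi-state matrix.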


In some embodiments, the molecular, phenotype signals matrix 1012 may be pre-computed or computed on demand. In some embodiments, the molecular, phenotype signals matrix 1012 may be pre-computed or computed on demand by a molecular state, variant-specific scoring layer 1016 yielding matrices that are molecular state-specific. In some embodiments, the molecular, phenotype signals matrix 1012 may be pre-computed or computed on demand by a multi-state, variant-specific scoring layer 1014, yielding matrices that contain data from multiple molecular states.


In some embodiments, as illustrated in FIG. 11, the present disclosure provides methods for characterizing the distribution of cells with specific molecular variants across molecular states (e.g., sub-populations) or phenotype scores 1106, as produced by a cell scoring layer 1102 using molecular measurements, processes, features and scores 1104 as inputs. These molecular states (e.g., sub-populations) or phenotype scores may be associated with, but not limited to, subpopulations of cells defined by (a) characteristic levels of or correlations between molecular signals (e.g., cyclin-dependent kinases across cell-cycle stages), whether determined by the application of pre-existing or internally-derived models, (b) characteristic levels of or correlations between phenotype scores, or (c) unsupervised or supervised machine learning methods, including but not limited to dimensionality reduction techniques, examples of which include but are not limited to Principal Component Analysis (PCA), Independent Component Analysis (ICA), and t-distributed Stochastic Neighbor Embedding (tSNE). In some embodiments, as illustrated in FIG. 11, for each individual molecular variant 1110, a population sampling layer 1108 produces metrics of the relative representation (e.g., distribution, probability, etc.) of cells across molecular states (e.g., the proportion or the probability of variant-harboring cells residing in a molecular state) or phenotype scores (e.g., the proportion or the probability of variant-harboring cells having a particular score), and may serve to provide a population signals matrix 1112 describing how molecular variants affect cells at the population level. The population signals matrix 1112 may contain a plurality of population signals for a plurality of molecular variants.
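By way of non-limiting illustration, a population signals matrix of proportions may be sketched by counting, for each variant, the fraction of its cells residing in each molecular state. The state names, the variant names, the function name `population_signals`, and the toy counts are assumptions made for the example and are not measurements.

```python
from collections import Counter


def population_signals(cells, states):
    """Return {variant: [fraction of that variant's cells in each state]}.

    cells is an iterable of (variant, state) pairs, one pair per cell, and
    states fixes the column order of the resulting matrix.
    """
    counts = {}
    for variant, state in cells:
        counts.setdefault(variant, Counter())[state] += 1

    matrix = {}
    for variant, state_counts in counts.items():
        total = sum(state_counts.values())
        # A Counter returns 0 for states in which no cell was observed.
        matrix[variant] = [state_counts[s] / total for s in states]
    return matrix


states = ["G1", "S", "G2/M"]
cells = [
    ("v1", "G1"), ("v1", "G1"), ("v1", "S"), ("v1", "G2/M"),
    ("v2", "S"), ("v2", "G2/M"), ("v2", "G2/M"), ("v2", "G2/M"),
]
pop = population_signals(cells, states)
```

Each row sums to one, so rows can be compared across variants even when the number of captured cells per variant differs.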


In some embodiments, subsampling of molecular measurements, molecular processes, molecular features, molecular scores, or phenotype scores from model systems (e.g., single-cells, cellular compartments, subcellular compartments, or synthetic compartments) harboring the same molecular variant may be applied to generate independent or disjoint estimates of summary statistics relating to the tendency, dispersion, shape, probability, range, covariation, or error of molecular measurements, molecular processes, molecular features, or molecular scores or phenotype scores associated with individual molecular variants.
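By way of non-limiting illustration, disjoint estimates of a summary statistic may be sketched by shuffling the cells harboring one variant, partitioning them into non-overlapping subsets, and computing the statistic within each subset. The function name `disjoint_estimates`, the fixed random seed, and the use of the mean are assumptions made for the example.

```python
import random
import statistics


def disjoint_estimates(scores, n_splits=2, seed=0):
    """Return one mean per disjoint subset of the scores for a single variant.

    The scores are shuffled with a seeded generator for reproducibility and
    then dealt into n_splits non-overlapping subsets.
    """
    if n_splits < 1 or len(scores) < n_splits:
        raise ValueError("need at least one score per subset")
    rng = random.Random(seed)
    shuffled = list(scores)
    rng.shuffle(shuffled)
    # Every n_splits-th element goes to the same subset, so no cell is shared.
    subsets = [shuffled[i::n_splits] for i in range(n_splits)]
    return [statistics.mean(subset) for subset in subsets]


scores_v1 = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]  # toy per-cell scores
estimates = disjoint_estimates(scores_v1, n_splits=2)
```

Agreement between the resulting estimates can be used as a simple quality-control check on the signal for that variant.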


In some embodiments, independent or disjoint estimates of summary statistics relating to the tendency, dispersion, shape, probability, range, covariation, or error of molecular measurements, molecular processes, molecular features, molecular scores or phenotype scores may be used to create a database of (quality-controlled) independent or disjoint estimates of molecular signals or phenotype signals associated with individual molecular variants. As would be appreciated by a person of ordinary skill in the art, independent or disjoint estimates of molecular signals or phenotype signals can be used to create a database of (quality-controlled) molecular or phenotype signals associated with individual molecular variants.


In some embodiments, the present disclosure describes systems and methods for deriving independent or disjoint estimates of summary statistics relating to the tendency, dispersion, shape, probability, range, covariation, or error of molecular measurements, molecular processes, molecular features, or molecular scores or phenotype scores associated with individual molecular variants within subpopulations of model systems (e.g., single-cells, cellular compartments, subcellular compartments, or synthetic compartments) from specific molecular states. As would be appreciated by a person of ordinary skill in the art, these methods may leverage a plurality of statistical techniques (e.g., machine learning techniques).


In some embodiments, molecular state-specific independent or disjoint estimates of summary statistics relating to the tendency, dispersion, shape, probability, range, covariation, or error of molecular measurements, molecular processes, molecular features, molecular scores or phenotype scores may be used to create a database of (e.g., quality-controlled) molecular state-specific, independent and disjoint estimates of molecular signals and phenotype signals associated with individual molecular variants in specific molecular states.


In some embodiments, independent or disjoint estimates of summary statistics relating to the tendency, dispersion, shape, probability, range, covariation, or error of population signals associated with individual molecular variants may be used to create a database of (e.g., quality-controlled) population signals associated with individual molecular variants.


In some embodiments, as illustrated in FIG. 12, the present disclosure provides systems and methods leveraging a feature extraction layer 1208 (e.g., unsupervised learning techniques) for the identification of higher-order molecular signals, phenotype signals, or population signals from lower-order molecular signals, phenotype signals, or population signals 1204 associated with individual molecular variants 1202, including but not limited to feature learning (or representation learning) techniques deploying Artificial Neural Networks (ANNs) 1210 to generate auto-encoders capable of leveraging subjacent associations to yield higher-order representations of lower-order molecular, phenotype, or population signals. In some embodiments, these methods allow the construction of databases of lower- and higher-order molecular signals, phenotype signals, and population signals 1214. In some embodiments, the feature extraction layer 1208 may access or receive data from annotation features 1206, in addition to the lower-order molecular signals, phenotype signals, or population signals 1204. In some embodiments, the annotation features 1206 may encompass a plurality of independent (e.g., non-assayed) features (e.g., evolutionary, population, functional (e.g., annotation-based), structural, dynamical, and physicochemical features associated with variants, genomic coordinates, transcript (e.g., RNA) coordinates, translated (e.g., protein) coordinates, amino acids, and various others as would be appreciated by a person of ordinary skill in the art), describing changes associated with the changes in genotype (e.g., sequence, molecular variants, etc.).


In some embodiments, the present disclosure describes the use of molecular state-specific, lower-order molecular signals or phenotype signals for the derivation of molecular state-specific higher-order molecular signals or phenotype signals. In some embodiments, the present disclosure describes the use of multi-state matrices of lower-order molecular, phenotype, or population signals to derive multi-state higher-order molecular, phenotype, or population signals, leveraging structured relationships between molecular signals across molecular states, such as structured gene expression patterns (e.g., molecular signals) across cell-cycle stages (e.g., molecular states). In some embodiments, the present disclosure describes the use of Convolutional Neural Networks (CNNs) to learn patterned-associations in molecular, phenotype, or population signals (and annotation features) across molecular states.


In some embodiments, and as illustrated in FIG. 13, the present disclosure provides systems and methods for deriving functional scores and functional classifications via statistical (e.g., machine) learning to generate a Functional Model (mF) that associates molecular, phenotype, or population signals (e.g., features), being a single or plurality of molecular measurements, molecular processes, molecular features, and molecular scores, with phenotypic impacts (e.g., labels) of molecular variants via regression and classification techniques, respectively.


In some embodiments, a Functional Model (mF) and a database of functional scores (or functional classifications) are generated by accessing a database of features describing molecular (e.g., lower-order or higher-order), phenotype, or population signals 1304 of molecular variants 1302 for training/validation, and a set of input labels 1310 (e.g., a database) describing the phenotypic impacts 1308 of molecular variants 1302. The generating is further performed by applying statistical (e.g., machine) learning techniques to associate molecular, phenotype, or population signals 1304 (e.g., features) with phenotypic impacts (e.g., labels).


In some embodiments, a training/validation layer 1312 performs training and validation to generate and quality-control Functional Models (mF) that can predict the phenotypic impacts 1308 of molecular variants 1302. In some embodiments, training/validation layer 1312 can deploy cross-validation techniques, such as, but not limited to, K-fold or Leave-One-Out Cross-Validation (LOOCV). In some embodiments, a database of features describing the molecular, phenotype, or population signals 1318 of molecular variants (testing) 1316 can be provided to the generated Functional Models (mF) to calculate and create a database of functional scores 1324 describing the predicted phenotypic impact 1322 of molecular variants (testing) 1316. As would be appreciated by a person of ordinary skill in the art, the performance (e.g., accuracy) of the predicted phenotypic impacts 1322 (e.g., functional score 1324) of molecular variants can be determined against known phenotypic impacts of molecular variants, such as testing molecular variants 1316. As would be appreciated by a person of ordinary skill in the art, the Functional Models (mF) can be applied to pre-compute, or compute on demand, the functional scores of molecular variants not included in training, validation, or testing phases within a testing layer 1314. In some embodiments, such scoring and evaluation can occur in a functional scoring and classification layer 1326 to, for example, examine the phenotype impact classification accuracy permitted on the basis of functional scores 1324.
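By way of non-limiting illustration, the partitioning step of K-fold cross-validation may be sketched as follows, with Leave-One-Out Cross-Validation obtained as the special case in which the number of folds equals the number of variants. The function name `k_fold_indices` and the use of contiguous, unshuffled folds are assumptions made for the example; in practice the variants would typically be shuffled first.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, validation_indices) for k contiguous folds.

    Fold sizes differ by at most one. Setting k equal to n_samples gives
    Leave-One-Out Cross-Validation.
    """
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    base, extra = divmod(n_samples, k)
    start = 0
    for fold in range(k):
        size = base + (1 if fold < extra else 0)
        stop = start + size
        validation = list(range(start, stop))
        train = [i for i in range(n_samples) if i < start or i >= stop]
        yield train, validation
        start = stop


folds = list(k_fold_indices(5, 2))   # two folds over five variants
loocv = list(k_fold_indices(4, 4))   # leave-one-out over four variants
```

Each variant appears in exactly one validation fold, so every labeled variant contributes once to the validation estimate.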


In some embodiments, additional annotation features 1306, 1320 may be provided during training and testing (prediction generation) of Functional Models (mF). In some embodiments, the annotation features 1306 and 1320 may encompass a plurality of independent (e.g., non-assayed) features (e.g., evolutionary, population, functional (e.g., annotation-based), structural, dynamical, and physicochemical features associated with variants, genomic coordinates, transcript (e.g., RNA) coordinates, translated (e.g., protein) coordinates, amino acids, and various others as would be appreciated by a person of ordinary skill in the art), describing changes associated with the changes in genotype (e.g., sequence, molecular variants).


As would be appreciated by a person of ordinary skill in the art, a diverse array of sources for phenotypic impacts (e.g., labels) of molecular variants can be used to define Truth Sets, including (e.g., public and/or private) clinical and non-clinical variant databases (e.g., ClinVar, HumVar, VariBench, SwissVar, PhenCode, PharmGKB, or locus-specific databases), and outcome databases.


In some other embodiments, the present disclosure provides systems and methods for deriving functional scores and functional classifications via statistical (e.g., machine) learning to generate a Functional Model (mF) that associates molecular, phenotype, or population signals (e.g., features), derived from one or more molecular measurements, molecular processes, molecular features, and/or molecular scores, with phenotypic impacts (e.g., labels) of molecular variants computed directly from distinct molecular, phenotype, or population signals, via regression and classification techniques. In some embodiments, this approach may permit, for example, deriving functional scores and functional classifications that predict the relative mutation burden, mutation rate, or mutation signatures of samples from subjects harboring specific molecular variants. In some embodiments, functional scores or functional classifications from such assays may permit informing on the lifetime risk of developing cancer in test subjects.


As would be appreciated by a person of ordinary skill in the art, regression and classification to generate Functional Models (mF's) may rely on various statistical (e.g., machine) learning techniques for semi-supervised or supervised learning, including, but not limited to, Random Forests (RFs), Gradient Boosted Trees (GBTs), Zero Rules (ZRs), Naive Bayes classifiers (NBs), Simple Logistic Regression (LRs), Support Vector Machines (SVMs), k-Nearest Neighbors (kNNs), and approaches deploying a wide array of Artificial Neural Network (ANN) architectures and techniques. In some embodiments, the present disclosure describes the use of molecular state-specific molecular signals for the derivation of molecular state-specific functional scores or functional classifications. In some other embodiments, the present disclosure describes the use of multi-state matrices of molecular signals for the derivation of molecular state-aware functional scores or functional classifications. In some embodiments, the present disclosure describes the use of Convolutional Neural Networks (CNNs) to learn patterned associations between functional scores or functional classifications and molecular signals distributed across molecular states.
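By way of non-limiting illustration, one of the listed techniques, k-Nearest Neighbors, may be sketched as a majority vote among the labeled variants whose signal vectors lie closest to a query variant. The feature values, the labels, and the function name `knn_predict` are hypothetical, and the choice of Euclidean distance is an assumption made for the example.

```python
import math
from collections import Counter


def knn_predict(train_features, train_labels, query, k=3):
    """Classify a variant by majority vote among its k nearest neighbors.

    train_features holds one signal vector per labeled variant and query is
    the signal vector of the variant to classify.
    """
    ranked = sorted(
        zip(train_features, train_labels),
        key=lambda pair: math.dist(pair[0], query),
    )
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Toy per-variant signal vectors and labels (illustrative only).
features = [
    [0.1, 0.0], [0.2, 0.1], [0.0, 0.2],
    [1.0, 0.9], [0.9, 1.1], [1.1, 1.0],
]
labels = ["neutral"] * 3 + ["loss-of-function"] * 3
prediction = knn_predict(features, labels, [0.95, 1.0], k=3)
```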



FIG. 1A illustrates the application of DML processes and systems in genes of the RAS/MAPK pathway, according to some embodiments. The RAS/mitogen-activated protein kinase (MAPK) pathway can play a role in cellular proliferation, differentiation, survival and death, and somatic mutations in RAS/MAPK genes can have a role in the development, progression, and therapeutic response of diverse cancer types through the activation and dysregulation of MAPK/ERK signaling. In addition, inherited (e.g., germline) mutations in RAS/MAPK genes have been associated with multiple autosomal dominant congenital syndromes, including but not limited to Noonan syndrome (NS), Costello syndrome (CS), cardio-facio-cutaneous (CFC) syndrome, and LEOPARD syndrome (LS), which present in patients with characteristic facial appearances, heart defects, musculocutaneous abnormalities, and mental retardation, as well as abnormalities of the skin, inner ears and genitalia (Aoki et al. 2008). For example, mutations in the protein tyrosine phosphatase, non-receptor type 11 (PTPN11) and the dual specificity mitogen-activated protein kinase kinase 1/2 genes (MAP2K1, MAP2K2) have been recurrently observed in Noonan and CFC patients, with PTPN11 mutations present in as many as 50% of Noonan patients (Aoki et al. 2008).


Embodiments can use wildtype, somatic, and germline molecular variants of key RAS/MAPK pathway constituents, such as HRAS (e.g., G12V), PTPN11 (e.g., E76K and N308D), and MAP2K2 (e.g., F57C and P128Q), that are constructed and overexpressed in HEK293 cells. Embodiments can select cells with 1 mg/ml puromycin to ensure expression of the exogenously introduced functional elements (e.g., genes), and RAS/MAPK pathway activation can be verified using an enzyme-linked immunosorbent assay (ELISA) for phospho-ERK protein and total ERK protein abundances (see FIG. 5). To generate single-cell RNA-seq data, embodiments can target for capture 500 cells for each molecular variant using a 10x Genomics Chromium system. Capture and subsequent single-cell library generation can be performed according to manufacturer's recommendations. The resultant libraries for each functional element (e.g., gene) can be pooled and sequenced on an Illumina MiniSeq sequencer until the average reads per cell for each genotype exceeds 30,000 reads/cell. Single-cell RNA-seq processing (e.g., single cell quality control, normalizations, transcriptome counts, etc.) can be performed using the 10x Genomics Cell Ranger 2.1.0 pipeline and default settings.



FIGS. 1B and 1C illustrate the projection of mammalian cells (e.g., HEK293) harboring wildtype and mutant PTPN11 and MAP2K2, for molecular variants associated with germline disorders (F57C, P128Q, and N308D) as well as somatic disorders (E76K), according to some embodiments. Cells can be projected on a two-dimensional plane derived by t-distributed Stochastic Neighbor Embedding (tSNE) on the basis of molecular scores (e.g., lower-order) determined from scaled, normalized unique molecular identifier (UMI) counts of single-cell gene expression, according to some embodiments. For each gene, tSNE projections are shown based on higher-order molecular scores derived via application of broad, generalized algorithms standard in the field (e.g., Principal Component Analysis, PCA) and custom-developed solutions, including cell-type, gene- or pathway-specific Autoencoders (AE) trained for robust, compressed representation of lower-order molecular scores. In some embodiments, the Autoencoder can be constructed as a neural network with fully connected layers, containing symmetric numbers of neurons (e.g., across layers) around the middle layer, and with rectified linear units (ReLU) for activation. In some embodiments, the Autoencoder can be trained using an Adam optimizer and optimized against a mean-squared error (MSE) loss function.
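
As a non-limiting illustration of the Autoencoder structure described above, the following minimal sketch builds a fully connected network whose layer widths are symmetric around the middle layer, applies ReLU activations, and evaluates an MSE reconstruction loss. The layer widths, the random weight initialization, the linear output layer, and all function names are assumptions made for illustration only; training (e.g., with an Adam optimizer) is omitted, and the architecture actually used is the one shown in FIG. 4.

```python
import random

def relu(values):
    """Rectified linear unit applied element-wise."""
    return [v if v > 0.0 else 0.0 for v in values]

def symmetric_layer_sizes(n_inputs, encoder_sizes):
    """Mirror the encoder widths around the middle (bottleneck) layer."""
    # Example: n_inputs=8, encoder_sizes=[4, 2] gives [8, 4, 2, 4, 8].
    return [n_inputs] + encoder_sizes + encoder_sizes[-2::-1] + [n_inputs]

def init_layers(sizes, rng):
    """Create (weights, biases) for each pair of consecutive layers."""
    layers = []
    for n_in, n_out in zip(sizes, sizes[1:]):
        weights = [[rng.gauss(0.0, 0.1) for _ in range(n_in)] for _ in range(n_out)]
        layers.append((weights, [0.0] * n_out))
    return layers

def forward(layers, x):
    """Forward pass; ReLU on hidden layers, linear output (an assumption)."""
    h = x
    for i, (weights, biases) in enumerate(layers):
        h = [sum(w * v for w, v in zip(row, h)) + b for row, b in zip(weights, biases)]
        if i < len(layers) - 1:
            h = relu(h)
    return h

def mse(x, y):
    """Mean-squared error between an input and its reconstruction."""
    return sum((a - b) ** 2 for a, b in zip(x, y)) / len(x)

rng = random.Random(0)
sizes = symmetric_layer_sizes(8, [4, 2])        # hypothetical toy widths
layers = init_layers(sizes, rng)
x = [rng.random() for _ in range(8)]            # stand-in for lower-order scores
reconstruction = forward(layers, x)
loss = mse(x, reconstruction)
```

In this sketch, the activations of the middle layer would play the role of the compressed, higher-order molecular scores.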


As illustrated in FIGS. 1B and 1C, cellular projections from customized, cell-type and pathway-specific Autoencoders (AEs) can improve the hyperdimensional separation between model systems (e.g., cells) harboring neutral (e.g., wildtype) and disease-associated molecular variants (e.g., N308D, E76K), relative to generalized dimensionality reduction algorithms. A Denoising Autoencoder (AE) was trained on 8.3 million lower-order molecular scores from greater than 18,800 genes detected in 3,495 single HEK293 cells harboring wildtype and mutant versions of RAS/MAPK genes. Training was performed in 30 epochs with a mini-batch size of 10, with noise simulations following a randomized 5% reduction in the sampling of UMI counts between epochs. The architecture of the utilized fully-connected, symmetric Autoencoder is shown in FIG. 4. Whereas conventional approaches in the domain for the scaling, normalization, and dimensionality reduction of lower-order molecular scores can fail to separate the tSNE-projections of cells harboring Noonan syndrome (NS; N308D) molecular variants and wildtype PTPN11, customized cell-type and pathway-specific Autoencoders can show a robust separation of cells harboring somatic (E76K) and germline (N308D) disorder molecular variants from wildtype cells in PTPN11.
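
By way of example only, the noise simulation described above (a randomized 5% reduction in the sampling of UMI counts between epochs) could be approximated by binomial thinning, in which every UMI is retained independently with probability 0.95. This interpretation, the function name `downsample_umis`, and the toy count vector are assumptions and are not taken from the disclosure.

```python
import random

def downsample_umis(counts, drop_fraction, rng):
    """Keep each UMI independently with probability 1 - drop_fraction."""
    keep = 1.0 - drop_fraction
    return [sum(1 for _ in range(c) if rng.random() < keep) for c in counts]

rng = random.Random(42)
cell_counts = [0, 3, 12, 1, 250, 0, 7]   # hypothetical UMI counts for one cell

# One freshly perturbed copy of the cell per training epoch (3 epochs shown).
noisy_epochs = [downsample_umis(cell_counts, 0.05, rng) for _ in range(3)]
```

Each epoch then sees a slightly different, never larger, version of the same expression profile, which is the usual setup for a denoising objective.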


According to some embodiments, FIGS. 14A and 14B illustrate the performance of systems and methods for the binomial classification of molecular variants with two distinct phenotypic impacts, as determined in mammalian cells harboring either disease-associated (e.g., pathogenic) genotypic (e.g., sequence) variants (e.g., G12V) or a wild-type (e.g., benign) genotypic (e.g., sequence) version of the human HRAS gene, a third member of the RAS/MAPK pathway, which encodes the oncoprotein h-Ras (also known as transforming protein p21). A small G protein in the Ras subfamily of the Ras superfamily of small GTPases, h-Ras can, once bound to guanosine triphosphate, activate RAF-family kinases (e.g., c-Raf), leading to cellular activation of the MAPK/ERK pathway.



FIG. 14A illustrates the projection 1402 of wildtype and mutant mammalian cells (HEK293) on the two-dimensional plane derived by t-distributed Stochastic Neighbor Embedding (tSNE) of cells on the basis of their normalized, single-cell gene expression measurements. As indicated in FIG. 14A, lower-order molecular scores can be derived from the molecular measurements of greater than 33,500 genes, with an average of ~3,500 molecular measurements made per cell. Principal Component Analysis (PCA) can be applied to derive higher-order molecular scores that reduce the dimensionality of the lower-order molecular scores. Gaussian Mixture Models (GMMs) can be applied to assign the projected cells to molecular states 1404, defining, for example, N=6 sub-populations of cells on the basis of the lower-order molecular scores derived from their normalized, single-cell gene expression measurements (e.g., UMI counts). Pseudo disease-associated genotypes and benign genotypes can be generated by randomly assigning mutant and wildtype cells to, for example, kP=15 disease-associated and kB=15 benign pseudo-populations, respectively. To train and test a machine learning Functional Model (mF) capable of discriminating between disease-associated and benign genotypes, pseudo-populations (kP1-15, kB1-15) can be divided into training and testing sets applying, for example, an 80/20 cross-validation scheme, resulting in, for example, kTRAIN=12 training and kTEST=3 testing genotypes of each class label (e.g., disease-associated and benign), collectively termed a Truth Set. This procedure can be repeated, for example, i=25 iterations in each of f=5 folds, wherein within each fold the cells within the pseudo-population (e.g., kP1-15, kB1-15) can be sampled with replacement to retain, for example, 20%, 40%, 60%, 80%, or 100% of the cells.
In each iteration, fold, and sampling, lower-order molecular signals and higher-order molecular signals for disease-associated and benign genotypes can be computed as the mean of the lower-order molecular scores and higher-order molecular scores, respectively. In each iteration, fold, and sampling, population signals for disease-associated and benign genotypes can be determined as the fraction of cells corresponding to each of the, for example, N=6 sub-populations. In each iteration, fold, and sampling, a machine learning Functional Model (mF) can partition disease-associated and benign genotypes from the Truth Set on the basis of the lower-order molecular signals, higher-order molecular signals, or population signals observed in the kTRAIN data. This Functional Model (mF) can be trained utilizing a 10× cross-validation strategy as well as a Random Forest estimator to partition variants. In each iteration, fold, and sampling, the trained Functional Model (mF) can predict the class label (e.g., disease-associated or benign) of the kTEST pseudo-populations on the basis of their lower-order molecular signals, higher-order molecular signals, or population signals. As illustrated in FIG. 14B, this approach can result in robust discrimination between disease-associated and benign genotypes on the basis of the lower-order molecular signals, higher-order molecular signals, and population signals determined within populations of mutant and wildtype cells.
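
The pseudo-population procedure above can be sketched as follows, purely for illustration. The sketch randomly assigns cells to k=15 pseudo-populations, holds out 3 of them (an 80/20 split), resamples cells with replacement, and computes molecular signals (feature means) and population signals (fractions of cells per molecular state). The cell identifiers, the random scores, the random state assignments, and all function names are placeholders; in practice the states would come from, for example, a GMM, and the resulting features would be passed to an estimator such as a Random Forest.

```python
import random
from collections import Counter

def make_pseudo_populations(cell_ids, k, rng):
    """Randomly assign cells to k pseudo-populations of near-equal size."""
    shuffled = cell_ids[:]
    rng.shuffle(shuffled)
    return [shuffled[i::k] for i in range(k)]

def resample(cells, fraction, rng):
    """Sample with replacement, retaining `fraction` of the cells."""
    n = max(1, round(fraction * len(cells)))
    return [rng.choice(cells) for _ in range(n)]

def molecular_signal(cells, scores):
    """Mean of each molecular score across the cells of one pseudo-population."""
    n_features = len(scores[cells[0]])
    return [sum(scores[c][j] for c in cells) / len(cells) for j in range(n_features)]

def population_signal(cells, state_of, n_states):
    """Fraction of cells assigned to each of n_states molecular states."""
    counts = Counter(state_of[c] for c in cells)
    return [counts.get(s, 0) / len(cells) for s in range(n_states)]

rng = random.Random(0)
mutant_cells = [f"mut{i}" for i in range(150)]                       # placeholder ids
scores = {c: [rng.random() for _ in range(3)] for c in mutant_cells}  # placeholder scores
state_of = {c: rng.randrange(6) for c in mutant_cells}               # placeholder states

pseudo_pops = make_pseudo_populations(mutant_cells, 15, rng)
train_pops, test_pops = pseudo_pops[:12], pseudo_pops[12:]           # 80/20 split
sampled = resample(train_pops[0], 0.8, rng)
features = molecular_signal(sampled, scores) + population_signal(sampled, state_of, 6)
```

The same steps would be repeated for wildtype cells to produce the benign pseudo-populations, giving one feature vector and one class label per pseudo-population.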


To evaluate the performance of DML processes and systems as a scalable solution for the accurate identification of disease-associated (e.g., pathogenic) molecular variants across multiple genes and disorders, a uniform, distributed DML processing pipeline can be deployed for the pre-processing, scaling, normalization, dimensionality reduction, and computation of molecular and population signals on, for example, three genes of the RAS/MAPK pathway, HRAS, PTPN11, and MAP2K2. Applying a similar training/testing schema for the evaluation of classification accuracies as above, the DML processes can achieve (e.g., median) raw classification accuracies 202 of ~99.9% and ~100% in the analysis of somatic cancer-driving molecular variants in HRAS (e.g., G12V) and PTPN11 (e.g., E76K), respectively, and (e.g., median) raw classification accuracies 204 of ~98.5% and ~96.1% in the analysis of molecular variants from germline (e.g., inherited) disorders in PTPN11 (e.g., N308D) and MAP2K2 (e.g., F57C, P128Q), respectively, as demonstrated in FIG. 2A. The balanced accuracies 206, 208 (e.g., Matthews Correlation Coefficient, MCC) in the classification of molecular variants known to cause somatic disorders in HRAS, somatic disorders in PTPN11, germline disorders in PTPN11, and germline disorders in MAP2K2, can be ~99.4%, ~100%, ~95.2%, and ~90.1%, respectively, as shown in FIG. 2B. The raw classification accuracies (e.g., ACC) and balanced classification accuracies (e.g., MCC) in the analysis of disease-associated (e.g., somatic and germline, combined) molecular variants can be ~98.4% and ~95.6%, respectively, on the basis of the herein described molecular and population signals.
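
For reference, the two metrics named above can be computed from a binary confusion matrix as sketched below. The standard definitions are used: ACC = (TP + TN) / (TP + TN + FP + FN), and MCC = (TP·TN − FP·FN) / sqrt((TP + FP)(TP + FN)(TN + FP)(TN + FN)), which ranges from −1 to 1. The labels and predictions in the example are made-up toy values, not results from the disclosure.

```python
import math

def confusion(y_true, y_pred, positive="disease"):
    """Count true/false positives and negatives for a binary problem."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    return tp, tn, fp, fn

def accuracy(tp, tn, fp, fn):
    """Raw classification accuracy (ACC)."""
    return (tp + tn) / (tp + tn + fp + fn)

def mcc(tp, tn, fp, fn):
    """Matthews Correlation Coefficient; 0.0 when the denominator vanishes."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom

# Toy example: three disease-associated and three benign pseudo-populations.
y_true = ["disease"] * 3 + ["benign"] * 3
y_pred = ["disease", "disease", "benign", "benign", "benign", "benign"]
counts = confusion(y_true, y_pred)
acc_value = accuracy(*counts)
mcc_value = mcc(*counts)
```

Unlike ACC, MCC uses all four cells of the confusion matrix, so it is less sensitive to class imbalance between disease-associated and benign genotypes.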


In some embodiments, the present disclosure provides systems and methods for the derivation of model system-level (e.g., cell-level) phenotypic scores through application of statistical machine learning models to associate lower-order and higher-order molecular scores with the known phenotypic impacts of variants harbored within model systems (e.g., cells). FIGS. 3A and 3B illustrate the cell-level raw classification accuracy of machine learning models trained to derive phenotypic scores in cells harboring wildtype and mutant versions of MAP2K2, according to some embodiments.


In FIG. 3A, germline and enhanced bars can indicate the average classification accuracy of test cells harboring MAP2K2 germline-disorder molecular variants excluded from training, on the basis of cell phenotype scores, where training was exclusively based on MAP2K2 neutral and germline-disorder molecular variants (e.g., germline 302) or included data from PTPN11 germline-disorder molecular variants (e.g., enhanced 304). Germline 302 and enhanced 304 bars in FIG. 3B indicate the average classification accuracy of test MAP2K2 germline-disorder molecular variants excluded from training, as determined on the basis of the predominant cell phenotype scores for populations of cells with varying numbers of cells. As in FIG. 3A, germline and enhanced bars can correspond to the raw accuracies in classification of test molecular variants where training was exclusively based on MAP2K2 neutral and germline-disorder molecular variants (e.g., germline) or included data from PTPN11 germline-disorder molecular variants (e.g., enhanced).



FIGS. 3A and 3B illustrate data obtained with a logistic regression (LR) classifier trained for binary classification of cells harboring disease-associated molecular variants and cells harboring wildtype MAP2K2, on the basis of higher-order molecular scores computed as the top 100 principal components from (e.g., scaled and/or normalized) lower-order molecular scores. Sets of cells for training and testing can be created by partitioning molecular variants into training and testing bins, and partitioning cells into corresponding training and testing sets according to their molecular variant genotypes, such that specific sets of cells with specific disease-associated molecular variants are excluded from training. As such, classification test performance can be computed on complete populations of cells harboring variants excluded from training. As shown in FIGS. 3A and 3B, the average per-cell classification accuracy across molecular variants associated with germline (e.g., inherited) disorders in MAP2K2 can be ~80.3%.
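
The variant-level partitioning described above can be sketched as follows, as one possible reading. Cells are routed to the training or testing set according to the genotype they harbor, so that every cell of a held-out variant is unseen during training. The cell identifiers and the function name are hypothetical, and the treatment of wildtype cells (kept in training here) is an assumption, since the text does not specify it.

```python
def split_cells_by_variant(cell_variants, test_variants):
    """Route each cell to train or test according to the variant it harbors.

    cell_variants: dict mapping cell id -> variant label (e.g., "WT", "F57C").
    test_variants: set of variant labels to exclude entirely from training.
    """
    train, test = [], []
    for cell, variant in cell_variants.items():
        (test if variant in test_variants else train).append(cell)
    return train, test

# Hypothetical MAP2K2 cells; P128Q is held out as the test variant.
cell_variants = {"c1": "WT", "c2": "F57C", "c3": "P128Q", "c4": "F57C", "c5": "WT"}
train_cells, test_cells = split_cells_by_variant(cell_variants, {"P128Q"})
```

Splitting by variant rather than by cell avoids leakage in which cells of the same genotype appear on both sides of the split.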


In some embodiments, the present disclosure describes the learning and prediction of the phenotypic consequences of molecular variants on the basis of molecular, phenotype, or population signals assayed in multiple genes or molecular elements within the same, related, or interacting pathways. As shown in FIGS. 3A and 3B, inclusion of data from PTPN11 molecular variants associated with germline (e.g., inherited) disorders can increase the average per-cell classification accuracy across germline-disorder molecular variants in MAP2K2 from ~80.3% (e.g., germline 302) to ~92.8% (e.g., enhanced 304), thereby demonstrating the ability of the disclosed DML processes and systems to identify and leverage coherent cellular properties for accurate classification of the phenotypic impacts of molecular variants across multiple functional elements. As shown in FIGS. 3A and 3B, the increased performance in per-cell classification can result in increases in classification of molecular variants on the basis of the majority-type classification from populations of cells harboring molecular variants.
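
The majority-type classification mentioned above can be illustrated with a short sketch: per-cell predicted labels for the cells harboring one molecular variant are tallied, and the predominant label is assigned to the variant. The function name and the example labels are illustrative assumptions; tie handling is not specified in the text, and this sketch simply returns the tied label encountered first.

```python
from collections import Counter

def classify_variant(cell_labels):
    """Return the predominant per-cell label and the fraction of cells voting for it."""
    counts = Counter(cell_labels)
    label, n_votes = counts.most_common(1)[0]
    return label, n_votes / len(cell_labels)

# Hypothetical per-cell predictions for one held-out variant.
per_cell = ["disease"] * 8 + ["benign"] * 2
variant_label, support = classify_variant(per_cell)
```

Because the variant-level call aggregates many cells, a per-cell accuracy well below 100% can still yield a correct variant-level classification when most cells vote for the right label.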


In some embodiments, the present disclosure provides systems and methods for deriving functional scores and functional classifications for individual functional elements (e.g., individual genes). In some embodiments, the present disclosure provides methods for deriving functional scores and functional classifications across a multitude of functional elements leveraging concordant molecular signals across molecular variants within a plurality of functional elements. In some embodiments, the present disclosure describes systems and methods combining the use of mutagenesis, molecular barcoding, molecular cloning, and cellular pooling techniques to generate populations of cells in which molecular variants in distinct functional elements are uniquely created, barcoded, or both.


In some embodiments, independent or disjoint estimates of molecular, phenotype, or population signals (e.g., features) may be used to derive independent or disjoint functional scores and functional classifications via statistical (e.g., machine) learning to associate molecular signals (e.g., features) with phenotypic impacts (e.g., labels) of molecular variants via regression and classification techniques, respectively.


In some embodiments, feature weights from statistical (e.g., machine) learning models generated using independent or disjoint estimates of each molecular, phenotype, or population signal are computed, collected and utilized for robust feature selection using techniques as would be appreciated by a person of ordinary skill in the art. In some embodiments, the present disclosure provides methods for deriving functional scores and functional classifications via statistical (e.g., machine) learning to associate the identified robust molecular, phenotype, or population signals (e.g., robust features) with phenotypic impacts (e.g., labels) of molecular variants via regression and classification techniques, respectively.


In some embodiments, the present disclosure describes systems and methods for deriving functional scores and functional classifications from a plurality of statistical (e.g., machine) learning models generated using independent or disjoint estimates of molecular signals, applying either model selection or model combination (e.g., mixing) techniques (Pan et al. 2006).


In some embodiments applying model selection techniques, a model selection criterion measuring the predictive performance of a model or the probability of it being the true model may be used to compare the models and selection can be applied to maximize an estimate of the selection criterion. As would be appreciated by a person of ordinary skill in the art, a diversity of model selection criteria can be applied, including (but not limited to) the Akaike Information Criterion (AIC), Bayesian Information Criterion (BIC), Cross-Validation (CV), Bootstrap (Efron 1983; Efron 1986; Efron and Tibshirani 1997), or adaptive model selection criteria (George and Foster 2000; Shen and Ye 2002; Shen et al. 2004) computed on the training data or input test data, as exemplified by test input-dependent weights (IDWs). The IDW for a candidate model may be defined as the probability of the model giving a correct prediction for a given input or a reasonable measure to quantify the predictive performance of the model for the input test data (Pan et al. 2006).
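
As a brief, non-limiting example of criterion-based selection, AIC and BIC can be computed from a model's maximized log-likelihood ln L, its number of fitted parameters k, and the number of observations n, as AIC = 2k − 2 ln L and BIC = k ln(n) − 2 ln L. Under these standard definitions the preferred model is the one with the lowest value, so the sketch below selects the minimum. The candidate names, log-likelihoods, and parameter counts are invented for illustration.

```python
import math

def aic(log_likelihood, n_params):
    """Akaike Information Criterion: 2k - 2 ln L."""
    return 2 * n_params - 2 * log_likelihood

def bic(log_likelihood, n_params, n_obs):
    """Bayesian Information Criterion: k ln(n) - 2 ln L."""
    return n_params * math.log(n_obs) - 2 * log_likelihood

# Hypothetical candidate Functional Models: name -> (log-likelihood, n_params).
candidates = {"m1": (-120.0, 3), "m2": (-116.0, 8), "m3": (-119.5, 4)}
n_obs = 100

aic_scores = {name: aic(ll, k) for name, (ll, k) in candidates.items()}
bic_scores = {name: bic(ll, k, n_obs) for name, (ll, k) in candidates.items()}
best_by_aic = min(aic_scores, key=aic_scores.get)
best_by_bic = min(bic_scores, key=bic_scores.get)
```

BIC penalizes additional parameters more heavily than AIC whenever ln(n) exceeds 2, so the two criteria can disagree on larger candidate models.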


In some other embodiments applying model combination techniques, a combined model can be generated by applying ensemble methods, by taking an equally or unequally weighted average of the outputs from individual models (Ripley 2008; Hastie et al. 2001). For example, ensemble methods can include but are not limited to Bayesian model averaging, stacking, bagging, random forests, boosting, ARM, and using performance metrics (e.g., AIC and BIC) as weights computed on training data (Burnham and Anderson 2003; Hastie et al. 2001) or computed on input test data (Pan et al. 2006). In some other embodiments applying model combination techniques, a combined model can be generated applying an Artificial Neural Network (ANN) architecture. In some embodiments, the present disclosure describes systems and methods for deriving functional scores and functional classifications from a plurality of statistical (e.g., machine) learning models generated using independent or disjoint estimates of molecular signals that involve applying various noise-control techniques (e.g., a Bootstrap Ensemble with Noise Algorithm (Yuval Raviv 1996)).
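
One way an unequally weighted average could use AIC as a performance-based weight is through Akaike weights, w_i = exp(−Δ_i / 2) / Σ_j exp(−Δ_j / 2), where Δ_i is the difference between model i's AIC and the lowest AIC among the candidates. The sketch below is a minimal illustration of that choice; the AIC values and per-model functional scores are invented, and other weighting schemes named above (e.g., stacking or IDWs) would differ.

```python
import math

def akaike_weights(aic_values):
    """Convert AIC values into normalized model weights."""
    best = min(aic_values)
    raw = [math.exp(-0.5 * (a - best)) for a in aic_values]
    total = sum(raw)
    return [r / total for r in raw]

def combine(predictions, weights):
    """Weighted average of per-model outputs for a single molecular variant."""
    return sum(w * p for w, p in zip(weights, predictions))

weights = akaike_weights([246.0, 248.0, 247.0])      # hypothetical AIC values
combined_score = combine([0.90, 0.70, 0.80], weights)  # hypothetical model outputs
```

Setting all weights to 1/len(predictions) recovers the equally weighted average mentioned in the text.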


In some embodiments, the present disclosure describes systems and methods for estimating functional scores and functional classifications for molecular variants applying statistical (e.g., machine) learning techniques to generate an Inference Model (mI) that models the relationship between (e.g., assay end-points) functional scores or functional classifications and a plurality of dependent (e.g., assayed) features (e.g., molecular, phenotype, or population signals) or independent (e.g., non-assay) features (e.g., evolutionary, population, functional (e.g., annotation-based), structural, dynamical, and physicochemical features associated with variants, genomic coordinates, transcript (e.g., RNA) coordinates, translated (e.g., protein) coordinates, amino acids, and various others as would be appreciated by a person of ordinary skill in the art). As would be appreciated by a person of ordinary skill in the art, such Inference Model (mI) may permit estimating functional scores and functional classifications for molecular variants with or without the explicit use of molecular, phenotype, or population signals, molecular measurements, molecular processes, molecular features, or molecular scores. In some embodiments, such methods may permit inferring sequence-function maps describing functional scores and functional classifications for molecular variants beyond those for which the functional scores and functional classifications were directly assayed. In some embodiments, as illustrated in FIG. 15, such systems and methods may permit inferring a sequence-function map 1514 describing the functional scores or functional classifications for all possible non-synonymous variants in a protein coding gene using functional scores and functional classifications from a sequence-function map 1502, representing a subset of the possible non-synonymous variants.
In some embodiments, this inference can utilize a score regression layer 1504 that accesses an annotation matrix 1506, consisting of annotation features 1508, labels 1510, and functional scores 1512 as inputs. As would be appreciated by a person of ordinary skill in the art, a multiplicity of statistical validation and cross-validation techniques can be applied to monitor and ensure the accuracy of estimated functional scores and functional classifications.


In some embodiments, and as illustrated in FIG. 16, the present disclosure describes systems and methods for determining the phenotypic impact (e.g., pathogenicity, functionality, or relative effect) of molecular variants through a series of modeling layers that (a) collect or generate existing knowledge or reliable predictions of the phenotypic impacts of molecular variants, (b) enlarge the set of molecular variants with known or predicted phenotypic impacts through functional modeling (e.g., performed via a Functional Modeling Engine (FME)) of sampled molecular variants of known, high-confidence predicted, and unknown phenotypic impacts, and (c) further complete the set of molecular variants with known or predicted phenotypic impacts through inference modeling. In combination, these layers can expand (or optimize) the scope of the Truth Sets available for Functional Model (mF) 1607 generation and reduce (or optimize) the required scope of Functional Model (mF) 1607 generated support for Inference Model (mI) 1609 generation. In some embodiments, these systems and methods can overcome limitations in training, validation, and testing for functional elements (e.g., genes) and contexts with limited availability of molecular variants of known phenotypic impact (e.g., pathogenicity, functionality, or relative effect). Such systems and methods thereby enable elucidating the phenotypic impacts of molecular variants for functional elements (e.g., genes) with otherwise limited data for model generation and can reduce overall costs.


In some embodiments, and as illustrated in FIG. 16, such systems and methods may combine one or more of the following modeling layers to achieve this: (1) a Prediction Model (mP) 1603, (2) a Sampling Model (mS) 1605, (3) a Functional Model (mF) 1607, and (4) an Inference Model (mI) 1609. In some embodiments, the present disclosure describes systems and methods that access molecular variants with known phenotypic impacts (e.g., pathogenic or benign) from pre-existing sources to populate a sequence-function map 1602 describing the phenotypic impacts of molecular variants in a gene/functional element. In some embodiments, a well-characterized Prediction Model (mP) 1603 can be used to generate an enhanced sequence-function map 1604, incorporating the phenotypic impacts of molecular variants with high-confidence predictions. In some embodiments, a Sampling Model (mS) 1605 is applied to generate a set of genotypes (e.g., molecular variants) 1606 containing (a) a Truth Set by selecting or sub-sampling molecular variants with known or high-confidence, predicted phenotypic impacts, and (b) a Target Set of molecular variants of unknown phenotypic impacts.


In some embodiments, the present disclosure describes the use of statistical (e.g., machine) learning to generate a Functional Model (mF) 1607 that associates molecular, phenotype, or population signals and functional scores and functional classifications as learned from molecular variants in the Truth Set (e.g., from genotypes 1606) to predict the functional scores and functional classifications of molecular variants in the Target Set (e.g., from genotypes 1606), thereby yielding a sequence-function map of functional scores 1608.


In some embodiments, as illustrated in FIG. 16, the Functional Model (mF) 1607 accesses enhanced Truth Sets 1611 and 1612 that include molecular and population signals from a plurality of functional elements (e.g., genes) in the same, related, or interacting pathways. This capability can allow the system to generate a Functional Model (mF) 1607 for functional elements (e.g., genes) with limited availability of, or devoid of, molecular variants with known or high-confidence, predicted phenotypic impacts, on the basis of molecular, phenotype, or population signals from functional elements (e.g., genes) with coherent mechanisms of action. FIGS. 3A and 3B illustrate an example of this.


In some embodiments, the phenotypic impacts of known molecular variants, high-confidence predicted molecular variants, and functionally-modeled molecular variants can be leveraged by an Inference Model (mI) 1609 that models the relationship between phenotypic impacts and a plurality of dependent (e.g., assayed) features (e.g., molecular, phenotype, or population signals) or independent (e.g., non-assay) features (e.g., evolutionary, population, functional (e.g., annotation-based), structural, dynamical, and physicochemical features associated with variants, genomic coordinates, transcript (e.g., RNA) coordinates, translated (e.g., protein) coordinates, amino acids, and various others, as would be appreciated by a person of ordinary skill in the art) to yield an augmented sequence-function map of functional scores 1610. As would be appreciated by a person of ordinary skill in the art, such Inference Model (mI) 1609 may permit estimating the phenotypic impacts of molecular variants with or without the explicit use of molecular, phenotype, or population signals.


In some embodiments, the present disclosure describes systems and methods for the optimization of cost-efficiency of molecular variant classification through the staged deployment of Deep Mutational Learning (DML) processes and systems on Truth and Target (Query) Sets of molecular variants. Some embodiments include a Stage I Optimization 610 step (as illustrated in, for example, FIG. 6), where model systems (e.g., cells) harboring Truth Set variants are assayed at high model system (e.g., cell) number and read-depth (in Cell Number, Read-Depth Optimization 612) to generate high-quality data for Dimensionality Reduction Model (mDR) 614 (such as an Autoencoder (mAE)) and Functional Model (mF) 616 optimizations. In this first stage, dimensionality reduction and classification accuracies for the target phenotypic impacts of molecular variants can be optimized to identify combinations of Dimensionality Reduction Models (614), Functional Models (616), and Cell-Numbers, Read-Depths (612) that guarantee robust target performance. In some embodiments, subsampling and noise simulations can be utilized to train and model performance of Dimensionality Reduction Models and Functional Models. As illustrated in FIG. 6, some embodiments include a Stage II Production 620 step, where model systems (e.g., cells) harboring Target Set variants (and, optionally, Truth Set variants) can be assayed in deployments with (e.g., optimal or minimal) Cell-Numbers and/or Read-Depths 622 identified as robust when specific Dimensionality Reduction Models 624 and Functional Models 626 are deployed.
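
To make the hand-off from Stage I to Stage II concrete, the following sketch selects, from a table of Stage I performance estimates, the lowest-cost combination of cell number and read depth that still meets a target accuracy. The table values are placeholders rather than measured results, the field names are hypothetical, and approximating cost as total reads (cells × reads per cell) is an assumption of this sketch.

```python
def cheapest_robust_deployment(stage1_results, target_accuracy):
    """Return the lowest-cost setting whose estimated accuracy meets the target.

    stage1_results: list of dicts with keys 'cells', 'reads_per_cell', 'accuracy'.
    Cost is approximated as cells * reads_per_cell (an assumption).
    Returns None when no setting reaches the target.
    """
    robust = [r for r in stage1_results if r["accuracy"] >= target_accuracy]
    if not robust:
        return None
    return min(robust, key=lambda r: r["cells"] * r["reads_per_cell"])

# Placeholder Stage I estimates, e.g., obtained by subsampling Truth Set data.
stage1_results = [
    {"cells": 100, "reads_per_cell": 10_000, "accuracy": 0.91},
    {"cells": 200, "reads_per_cell": 10_000, "accuracy": 0.95},
    {"cells": 100, "reads_per_cell": 30_000, "accuracy": 0.96},
    {"cells": 500, "reads_per_cell": 30_000, "accuracy": 0.99},
]
stage2_setting = cheapest_robust_deployment(stage1_results, target_accuracy=0.95)
```

In a fuller treatment, each table row would also record which Dimensionality Reduction Model and Functional Model produced the estimate, so that the same pairing is deployed in Stage II.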


In some embodiments, the present disclosure describes systems and methods for determining the phenotypic impact (e.g., pathogenicity, functionality, or relative effect) of molecular variants identified within a biological sample or record of a subject on the basis of the functional scores and functional classifications determined as described above. In some embodiments, time-stamped records of incorporation of functional scores and functional classifications for a set of (e.g., a plurality of unique) molecular variants may be created, evaluated, validated, selected, and applied to determine the phenotypic impact of molecular variants identified within a biological sample or record of a subject.


In some embodiments, the present disclosure describes systems and methods for determining the phenotypic impact (e.g., pathogenicity, functionality, or relative effect) of molecular variants identified within a biological sample or record of a subject on the basis of the predictor scores or predictor classifications from computational predictors generated by applying statistical (e.g., machine) learning methods to leverage the functional scores and functional classifications.


In some embodiments, and as illustrated in FIG. 17, the present disclosure describes methods for generating (e.g., lower-order) Variant Interpretation Engines (VIEs) that can be gene- and condition-specific, through statistical (e.g., machine) learning techniques that model the phenotypic impacts 1712 of molecular variants on the basis of input labels 1714 and an annotation matrix 1706 comprising their functional scores 1702, 1708 (or functional classifications) and other annotation features 1710, including commonly used features in the creation of the computational predictors, including but not limited to evolutionary, population, functional (e.g., annotation-based), structural, dynamical, and physicochemical features associated with variants and residues of functional elements. In some embodiments, the training and validation layer 1704 may employ cross-validation techniques 1716 (e.g., K-fold or LOOCV) to train and quality control VIEs that are subsequently evaluated by a testing layer 1718 to derive predictor scores 1720 used in molecular variant classification.
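
The cross-validation techniques 1716 named above can be sketched in a few lines. K-fold cross-validation partitions the variant indices into K disjoint test folds and trains on the remainder each time; leave-one-out cross-validation (LOOCV) is the special case K = n. The interleaved fold assignment and the function name below are illustrative choices, not details from the disclosure.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) for each of k folds."""
    folds = [list(range(i, n_samples, k)) for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [j for fold in folds[:i] + folds[i + 1:] for j in fold]
        yield train, test

five_fold = list(k_fold_indices(10, 5))   # K-fold with K=5 over 10 variants
loocv = list(k_fold_indices(4, 4))        # LOOCV over 4 variants
```

In practice the indices would typically be shuffled before fold assignment, and the rows of the annotation matrix 1706 would be selected with the train and test index lists.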


In some embodiments, the present disclosure further describes systems and methods for generating pathway- and condition-specific (higher-order) Variant Interpretation Engines (VIEs) applying model combination techniques that integrate (lower-order) gene- and condition-specific Variant Interpretation Engines (VIEs) from a plurality of genes in target pathways of interest. In other embodiments, the present disclosure further describes systems and methods for generating pathway- and condition-specific (higher-order) Variant Interpretation Engines (VIEs) through statistical (e.g., machine) learning techniques that model the phenotypic impacts of molecular variants on the basis of their functional scores, functional classifications, and other features commonly used in the creation of the computational predictors, including but not limited to evolutionary, population, functional (annotation-based), structural, dynamical, and physicochemical features associated with variants and residues of functional elements.


In some embodiments, the present disclosure describes systems and methods for determining the phenotypic impact (e.g., pathogenicity, functionality, or relative effect) of molecular variants identified within a biological sample or record thereof of a subject on the basis of the hotspot scores and hotspot classifications from mutational hotspots computed by applying spatial clustering techniques to identify networks of residues with specific phenotypic impacts leveraging the herein-described and enabled functional scores, functional classifications, and molecular signals associated with molecular variants and residues.


In some embodiments, the present disclosure describes systems and methods for deriving a matrix of functional distances between molecular variants or their corresponding residues by computing a distance metric between molecular variants projected in the N-dimensional space (1≤N≤M) defined by a set of M functional scores, functional classifications, and molecular signals (as described above), where N<M when dimensionality-reduction techniques are applied to reduce the feature-space of molecular variants. As would be appreciated by a person of ordinary skill in the art, various dimensionality-reduction techniques may be applied, including but not limited to techniques reliant on linear transformations, as in principal component analysis (PCA), or non-linear transformations, as in manifold learning techniques (e.g., t-distributed stochastic neighbor embedding (tSNE) and kernel principal component analysis (kPCA)). As would be appreciated by a person of ordinary skill in the art, various distance metrics can be utilized, including but not limited to the Euclidean distance, Manhattan distance (e.g., City-Block), Mahalanobis distance, Chebyshev distance, and various others.
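
A minimal sketch of the functional distance matrix follows, using three of the metrics listed above. Each molecular variant is represented by a vector of N features (for example, functional scores and molecular signals after optional dimensionality reduction). The variant vectors are invented for illustration, and the Mahalanobis distance is omitted here because it additionally requires an inverse covariance matrix.

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))

def chebyshev(a, b):
    return max(abs(x - y) for x, y in zip(a, b))

def distance_matrix(vectors, metric):
    """Symmetric matrix of pairwise distances between variant feature vectors."""
    return [[metric(u, v) for v in vectors] for u in vectors]

# Hypothetical N=2 feature vectors for three molecular variants.
variant_vectors = [[0.0, 0.0], [3.0, 4.0], [3.0, 0.0]]
functional_distances = distance_matrix(variant_vectors, euclidean)
```

The resulting matrix can be passed to any clustering routine that accepts precomputed distances.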


In some embodiments, the present disclosure describes systems and methods for the identification of Significantly Mutated Regions (SMRs) and Networks (SMNs) by measuring and scoring the phenotype-associated mutation density (e.g., number of observed phenotype-associated variants per residue) within spatially-proximal residues of functional elements (e.g., protein-coding genes) through the application of spatial clustering techniques across a plurality of spatial distance metrics, including the herein described and enabled functional distances, sequence distances, structure distances, (co)evolutionary distances, and combinations thereof.


In some embodiments, and as illustrated in FIG. 18, the identification of SMRs/SMNs may apply a Training/Validation Layer 1804 to identify spatial clustering among phenotypically-related or functionally-related molecular variants 1806 as determined on the basis of commonalities in the functional scores of molecular variants. In some embodiments, these commonalities may be identified from the functional scores of molecular variants in a sequence-function map of a protein-coding gene 1802.


In some embodiments, and as illustrated in FIG. 18, the identification of SMRs/SMNs in the Training/Validation Layer 1804 may comprise a series of steps, including but not limited to: (1) SMR/SMN-detection techniques 1805 for the identification of single-residues or networks of residues that are enriched in molecular variants with specific phenotypic associations as have been previously described (Araya et al. 2016, U.S. Patent Application 20160378915A1), and (2) SMR/SMN-selection techniques 1815.


SMR/SMN-detection techniques 1805 can comprise a series of steps including but not limited to: (1.1) projection 1810 of phenotype-associated molecular variants 1806 in functional, sequence, structural, or (co)evolutionary dimensions (or combinations thereof), (1.2) application of spatial clustering techniques 1812 (e.g., DBSCAN) to detect clusters of spatially-proximal phenotype-associated variants, and (1.3) measurement of mutation density by scoring the number of phenotype-associated variants per residue in each cluster.
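Steps (1.2) and (1.3) can be illustrated with the following Python sketch, which applies a small DBSCAN implementation to variants projected onto a single sequence dimension and then scores mutation density per cluster. The positions, the `eps` and `min_samples` values, and the definition of density as variants divided by the residue span of the cluster are assumptions made for illustration, not parameters taken from this disclosure.

```python
from typing import Dict, List, Optional

NOISE = -1  # label for variants that belong to no cluster


def dbscan_1d(positions: List[int], eps: int, min_samples: int) -> List[int]:
    """Label each variant position with a cluster id (0, 1, ...) or NOISE."""
    n = len(positions)
    labels: List[Optional[int]] = [None] * n

    def neighbors(i: int) -> List[int]:
        # All variants (including i itself) within eps residues of variant i.
        return [j for j in range(n) if abs(positions[i] - positions[j]) <= eps]

    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_samples:
            labels[i] = NOISE  # may later be claimed as a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: joins but does not expand
            if labels[j] is not None:
                continue
            labels[j] = cluster
            reachable = neighbors(j)
            if len(reachable) >= min_samples:  # core point: expand the cluster
                queue.extend(reachable)
    return [label if label is not None else NOISE for label in labels]


def mutation_density(positions: List[int], labels: List[int]) -> Dict[int, float]:
    """Variants per residue, using the residue span of each cluster."""
    members: Dict[int, List[int]] = {}
    for pos, label in zip(positions, labels):
        if label != NOISE:
            members.setdefault(label, []).append(pos)
    return {c: len(ps) / (max(ps) - min(ps) + 1) for c, ps in members.items()}


# Hypothetical residue positions of phenotype-associated variants in one gene.
positions = [10, 11, 11, 12, 13, 50, 120, 121, 122, 122]
labels = dbscan_1d(positions, eps=2, min_samples=3)
density = mutation_density(positions, labels)
```

The same routine extends to functional, structural, or (co)evolutionary projections by replacing the absolute position difference with the corresponding distance.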


SMR/SMN-detection techniques 1805 can further comprise the steps denoted in 1814 including, but not limited to: (1.4) scoring of mutation density probability by, for example, computing the (e.g., binomial) probability of obtaining k-or-more (e.g., greater than or equal to k) observed phenotype-associated variants per cluster, given the per-residue mutation rate within each functional element (e.g., protein-coding gene), (1.5) applying multiple hypothesis correction (MHC) across mutation density probabilities of discovered clusters, and (1.6) computing false-discovery rates (FDRs) for the observed (e.g., raw or corrected) mutation density probabilities using background models of mutation density probabilities derived by randomizing positions of the observed phenotype-associated variants within each functional element.
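Steps (1.4) and (1.5) might be sketched as follows. The example assumes that variants fall uniformly across the functional element, so that the chance of any one variant landing in a cluster equals the cluster length divided by the element length, and it uses the Benjamini-Hochberg procedure as one possible multiple hypothesis correction. Both choices, along with the function names and the example counts, are illustrative assumptions; the randomization-based background model of step (1.6) is not shown.

```python
import math
from typing import List


def binomial_tail(k: int, n: int, p: float) -> float:
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(
        math.comb(n, i) * p ** i * (1.0 - p) ** (n - i) for i in range(k, n + 1)
    )


def cluster_pvalue(
    k_in_cluster: int, cluster_len: int, n_variants: int, element_len: int
) -> float:
    """Probability of seeing k-or-more variants in a cluster by chance.

    Assumes a uniform per-residue mutation rate across the element.
    """
    p_hit = cluster_len / element_len
    return binomial_tail(k_in_cluster, n_variants, p_hit)


def benjamini_hochberg(pvalues: List[float]) -> List[float]:
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical clusters: (variants in cluster, cluster length in residues).
clusters = [(5, 4), (4, 3), (1, 10)]
pvalues = [cluster_pvalue(k, length, n_variants=10, element_len=200)
           for k, length in clusters]
adjusted = benjamini_hochberg(pvalues)
```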


Training/Validation Layer 1804 can further perform the SMR/SMN-selection techniques 1815. SMR/SMN-selection techniques can comprise the steps of (2.1) defining (e.g., raw or corrected) mutation density probabilities and/or false discovery rates (FDRs) as hotspot scores and applying cutoffs to statistically define hotspot classifications, thereby nominating residues in candidate clusters (e.g., sequence 1816, function 1818, and sequence 1820), (2.2) detecting residues in candidate clusters from multiple, distinct projections/spaces, (2.3) assigning residues to individual clusters by applying an assignment heuristic (e.g., selecting the cluster largest in size, such as the cluster with the highest number of residues), and (2.4) identifying SMRs/SMNs as the final set of clusters meeting these criteria. The final set of SMRs/SMNs can be derived from multiple, distinct projections (e.g., sequence 1820, function 1818, or sequence, function (combined) 1822).
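The assignment heuristic of step (2.3) can be illustrated with a short Python sketch in which each residue found in several candidate clusters is assigned to the largest of them. The cluster names and residue numbers are hypothetical, and breaking ties in favor of the first cluster listed is an assumption of this sketch.

```python
from typing import Dict, Set


def assign_residues(candidates: Dict[str, Set[int]]) -> Dict[int, str]:
    """Assign every residue to the largest candidate cluster that contains it.

    Ties are resolved in favor of the cluster listed first.
    """
    assignment: Dict[int, str] = {}
    for name, residues in candidates.items():
        for residue in residues:
            current = assignment.get(residue)
            if current is None or len(residues) > len(candidates[current]):
                assignment[residue] = name
    return assignment


def final_clusters(assignment: Dict[int, str]) -> Dict[str, Set[int]]:
    """Invert the residue assignment into the final cluster membership."""
    clusters: Dict[str, Set[int]] = {}
    for residue, name in assignment.items():
        clusters.setdefault(name, set()).add(residue)
    return clusters


# Hypothetical candidate clusters nominated from three distinct projections.
candidates = {
    "sequence_1": {10, 11, 12},
    "function_1": {11, 12, 13, 14, 15},
    "structure_1": {40, 41},
}
assignment = assign_residues(candidates)
selected = final_clusters(assignment)
```

Here residues 11 and 12 appear in both the sequence and function projections and are assigned to the larger function-derived cluster, leaving residue 10 alone in the sequence-derived cluster.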


In some embodiments, the present disclosure describes systems and methods for the identification of SMRs/SMNs by measuring and scoring the phenotype-associated mutation density (e.g., number of observed phenotype-associated variants per residue) within spatially-proximal residues of functional elements (e.g., protein-coding genes) through the application of spatial clustering techniques across a plurality of spatial distance metrics, where the phenotype-associated variants may be defined on the basis of the functional scores and functional classifications herein described. As would be appreciated by a person of ordinary skill in the art, these methods may allow the determination of clusters of residues in which variants with specifically-defined phenotypic impacts occur.


In some embodiments, the present disclosure describes systems and methods for evaluating the accuracy, performance, or robustness of independent evidence datasets for the interpretation of molecular variants, such as quantitative (e.g., scores) or qualitative (e.g., classifications) evidence from computational predictors (e.g., M-CAP, REVEL, SIFT, and PolyPhen2), as well as gene-specific predictors (e.g., PON-P2), mutational hotspots, and population genomics metrics (e.g., allele frequency-based variant classifications) (Amendola et al. 2016), against the herein described functional scores and functional classifications.


In some embodiments, the present disclosure describes systems and methods for computing evaluation metrics to assess concordance between an evidence dataset and the herein described functional scores and functional classifications, and based on these evaluation metrics selecting the best-performing evidence dataset for use in variant interpretation and prioritization. As would be appreciated by a person of ordinary skill in the art, various evaluation metrics can be used to assess the concordance of an evidence dataset against the herein described functional scores or functional classifications. For quantitative evidence (e.g., scores), these may include Pearson's correlation coefficient, Spearman's rank-order correlation, Kendall correlation, and various others as would be appreciated by a person of ordinary skill in the art. For qualitative evidence (e.g., classifications), these may include accuracy, Matthews correlation coefficient, Cohen's kappa coefficient, Youden's index (e.g., informedness), F-measure (e.g., F1 score), true positive rate (e.g., sensitivity or recall), true negative rate (e.g., specificity), positive predictive value (e.g., precision), negative predictive value, positive likelihood ratio, negative likelihood ratio, and diagnostic odds ratio, and various others as would be appreciated by a person of ordinary skill in the art.
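By way of a non-limiting sketch, several of the evaluation metrics named above can be computed in Python as follows. The boolean labels, the function names, and the example values are hypothetical, with `True` standing for a damaging classification. The sketch assumes that both classes occur in the truth labels and in the predictions, so that no denominator is zero.

```python
import math
from typing import Dict, Sequence


def classification_metrics(
    truth: Sequence[bool], predicted: Sequence[bool]
) -> Dict[str, float]:
    """Concordance of qualitative evidence against functional classifications."""
    pairs = list(zip(truth, predicted))
    tp = sum(1 for t, p in pairs if t and p)
    tn = sum(1 for t, p in pairs if not t and not p)
    fp = sum(1 for t, p in pairs if not t and p)
    fn = sum(1 for t, p in pairs if t and not p)
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    precision = tp / (tp + fp)
    mcc_denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "accuracy": (tp + tn) / len(pairs),
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
        "f1": 2 * precision * sensitivity / (precision + sensitivity),
        "youden_j": sensitivity + specificity - 1,
        "mcc": (tp * tn - fp * fn) / mcc_denominator,
    }


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson's correlation between quantitative evidence and functional scores."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return covariance / (spread_x * spread_y)


# Hypothetical functional classifications (truth) and predictor calls.
truth = [True, True, True, True, False, False, False, False]
predicted = [True, True, True, False, False, False, False, True]
metrics = classification_metrics(truth, predicted)
```

Selecting the best-performing evidence dataset then reduces to ranking datasets by whichever of these metrics is chosen.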


In some embodiments, the present disclosure describes systems and methods that may continuously evaluate, validate, and optimize (e.g., select, remove, or modify) diverse evidence datasets on the basis of the above described evaluation metrics, and distribute the best-performing (e.g., independent) evidence datasets to client systems via an Application Program Interface (API) for use in variant interpretation and prioritization practices determining the phenotypic impact (e.g., pathogenicity, functionality, or relative effect) of molecular variants identified within a biological sample or record thereof of a subject.


In some embodiments, the present disclosure describes systems and methods for determining the degree of ascertainment bias, reporting bias, or outcome bias present within a dataset of variants, including clinical datasets (e.g., ClinVar, HumVar, VariBench, SwissVar, PhenCode, or locus-specific databases), population datasets (e.g., ExAC, GnomAD, and 1000 Genomes), or independent evidence datasets for the interpretation of molecular variants, such as but not limited to computational predictors (e.g., M-CAP, REVEL, SIFT, PolyPhen2, and PON-P2). In some embodiments, the present disclosure describes systems and methods for determining biases on the basis of the expected distributions of the herein described functional scores, functional classifications, and molecular signals associated with molecular variants and residues.


In some embodiments, the present disclosure describes systems and methods for the evaluation of a target variant dataset by measuring and scoring the difference between the distributions of functional scores, functional classifications, and molecular signals of molecular variants and residues within the target dataset against the expected distributions of functional scores, functional classifications, and molecular signals of molecular variants from a reference dataset. In some embodiments, the measurement of inherent biases within a target variant dataset may comprise a series of steps, including but not limited to: (1) collection of functional scores, functional classifications, and molecular signals associated with molecular variants in the target and reference datasets, (2) estimating the probability density function of functional scores, functional classifications, or molecular signals associated with molecular variants within the reference dataset, (3) estimating the probability density function of functional scores, functional classifications, or molecular signals associated with molecular variants within the target dataset, and (4) measuring the statistical distance between the target dataset-derived probability density function and the reference dataset-derived probability density function of functional scores, functional classifications, or molecular signals. 
In some embodiments, the measurement of inherent biases within a target variant dataset comprises a series of steps, including: (5) sampling variants from the reference dataset (e.g., to match the sample population size of the target dataset), (6) estimating the probability density function of functional scores, functional classifications, or molecular signals of the sampled reference dataset in step 5, (7) measuring the statistical distance between the target dataset-derived probability density function and the sampled reference dataset-derived probability density function of functional scores, functional classifications, or molecular signals, and (8) iterating steps 5-7 to obtain a robust estimate and confidence intervals of the statistical distance between the probability density function of functional scores, functional classifications, or molecular signals of the target and reference datasets. In some embodiments, the above systems and methods for the detection and statistical evaluation of bias permit the identification of clinical datasets, population datasets, or evidence datasets in which the contained variants have different functional scores, functional classifications, or molecular signals from that expected in a reference dataset.
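The resampling procedure of steps (5) through (8) can be sketched in Python as follows. The sketch estimates each probability density function with a fixed-width histogram and uses the total variation distance as the statistical distance; both are illustrative assumptions, as are the score range of 0 to 1, the number of bins, the number of iterations, and the example datasets.

```python
import random
from typing import List, Sequence, Tuple


def histogram_density(
    scores: Sequence[float], bins: int, lo: float, hi: float
) -> List[float]:
    """Fraction of scores in each of `bins` equal-width bins over [lo, hi]."""
    counts = [0] * bins
    width = (hi - lo) / bins
    for score in scores:
        index = int((score - lo) / width)
        counts[max(0, min(index, bins - 1))] += 1  # clamp out-of-range scores
    return [count / len(scores) for count in counts]


def total_variation(p: Sequence[float], q: Sequence[float]) -> float:
    """Total variation distance between two discrete distributions."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))


def bias_distance(
    target: Sequence[float],
    reference: Sequence[float],
    bins: int = 10,
    lo: float = 0.0,
    hi: float = 1.0,
    iterations: int = 200,
    seed: int = 0,
) -> Tuple[float, Tuple[float, float]]:
    """Mean distance and a 95% percentile interval over resampled references."""
    rng = random.Random(seed)
    p_target = histogram_density(target, bins, lo, hi)
    distances = []
    for _ in range(iterations):
        # Step 5: sample the reference to match the target dataset size.
        sample = rng.choices(reference, k=len(target))
        # Steps 6 and 7: density of the sample, then distance to the target.
        p_sample = histogram_density(sample, bins, lo, hi)
        distances.append(total_variation(p_target, p_sample))
    distances.sort()
    lower = distances[int(0.025 * (iterations - 1))]
    upper = distances[int(0.975 * (iterations - 1))]
    return sum(distances) / iterations, (lower, upper)


# Hypothetical functional scores: a uniform reference and a skewed target.
reference = [i / 100 for i in range(100)]
target = [0.9 + i / 1000 for i in range(50)]
mean_distance, interval = bias_distance(target, reference)
```

A mean distance near zero would suggest that the target dataset resembles the reference, while a value near one, as in this skewed example, would flag a strongly biased dataset.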


In some other embodiments, the present disclosure describes systems and methods for evaluating underlying biases within evidence datasets by a series of steps, including but not limited to: (1) partitioning evidence and reference datasets into matching sets of quantiles (e.g., for quantitative evidence scores) or classes (e.g., qualitative evidence classifications); (2) scoring variants within each set (e.g., evidence vs. reference) across a plurality of properties (e.g., evolutionary, population, functional (e.g., annotation-based), structural, dynamical, and physicochemical features associated with variants); (3) estimating the probability density function of each property score within each set (e.g., evidence vs. reference); (4) measuring the statistical distance between the evidence set-derived probability density function and the reference set-derived probability density function of each property score; and (5) identifying properties with statistically significant differences in scores between reference and evidence sets.
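Steps (1) through (4) of this alternative procedure might be sketched as follows. The example partitions variants into rank-based quantiles and uses the two-sample Kolmogorov-Smirnov statistic, computed on empirical cumulative distributions, as the statistical distance. That choice, the property names, and the example values are assumptions made for illustration, and the significance testing of step (5) is not shown.

```python
from bisect import bisect_right
from typing import Dict, List, Sequence


def quantile_bins(scores: Sequence[float], n_quantiles: int) -> List[int]:
    """Return the rank-based quantile index (0..n_quantiles-1) of each score."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    bins = [0] * len(scores)
    for rank, i in enumerate(order):
        bins[i] = rank * n_quantiles // len(scores)
    return bins


def ks_statistic(a: Sequence[float], b: Sequence[float]) -> float:
    """Largest gap between the empirical cumulative distributions of a and b."""
    a_sorted, b_sorted = sorted(a), sorted(b)
    points = sorted(set(a_sorted) | set(b_sorted))
    return max(
        abs(bisect_right(a_sorted, x) / len(a_sorted)
            - bisect_right(b_sorted, x) / len(b_sorted))
        for x in points
    )


def property_distances(
    evidence: Sequence[Dict[str, float]],
    reference: Sequence[Dict[str, float]],
    properties: Sequence[str],
) -> Dict[str, float]:
    """Distance between evidence and reference sets for each property score."""
    return {
        prop: ks_statistic([v[prop] for v in evidence],
                           [v[prop] for v in reference])
        for prop in properties
    }


# Hypothetical variants from one matched quantile of each dataset.
evidence_set = [{"conservation": 0.9, "burial": 0.2},
                {"conservation": 0.8, "burial": 0.6}]
reference_set = [{"conservation": 0.1, "burial": 0.2},
                 {"conservation": 0.3, "burial": 0.6}]
distances = property_distances(evidence_set, reference_set,
                               ["conservation", "burial"])
```

In this example the conservation scores of the two sets do not overlap at all, while the burial scores are identical, so only conservation would be flagged as a property on which the evidence set differs from the reference.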


In some embodiments, the present disclosure describes systems and methods that may continuously evaluate and select diverse evidence datasets on the basis of the above described bias metrics, and distribute the least-biased (e.g., independent) evidence datasets to client systems via an Application Program Interface (API) for use in variant interpretation and prioritization practices determining the phenotypic impact (e.g., pathogenicity, functionality, or relative effect) of molecular variants identified within a biological sample or record thereof of a subject.


In some embodiments, the present disclosure describes systems and methods for determining the phenotypic impact (e.g., pathogenicity, functionality, or relative effect) of molecular variants identified within a biological sample or record of a subject on the basis of herein described functional scores, functional classifications, predictor scores, predictor classifications, hotspot scores, and hotspot classifications, in functional elements (e.g., genes) and pathways associated with Mendelian disorders (e.g., Table 1), that are known cancer-drivers (e.g., Table 2), pharmacogenomic genes in which genotypic (e.g., sequence) variation is associated with variation in drug response (e.g., Table 3), or other clinically-valuable genes (e.g., Table 4).


In some embodiments, the present disclosure describes systems and methods for evaluating, selecting, distributing and utilizing independent evidence—determined to be the best-performing and least biased on the basis of the herein described functional scores and classifications— for the interpretation and prioritization of variants in functional elements (e.g., genes) and pathways associated with Mendelian disorders (e.g., Table 1), that are known cancer-drivers (e.g., Table 2), pharmacogenomic genes in which genotypic (e.g., sequence) variation is associated with variation in drug response (e.g., Table 3), or other clinically-valuable genes (e.g., Table 4).


As discussed above, Table 1 is an example table of functional elements and pathways associated with Mendelian disorders, according to some embodiments. Table 2 is an example table of functional elements and pathways that are known cancer-drivers, according to some embodiments. Table 3 is an example table of pharmacogenomic genes in which genotypic (e.g., sequence) variation is associated with variation in drug response, according to some embodiments. Table 4 is an example table of other clinically-valuable genes, according to some embodiments. Tables 1-4 may be found on page 49 of the specification.


In some embodiments, the present disclosure describes systems and methods for determining the phenotypic impact (e.g., pathogenicity, functionality, or relative effect) of molecular variants identified within a biological sample or record of a subject on the basis of herein described and enabled functional scores, functional classifications, predictor scores, or predictor classifications of variants within known targets of pathogenic variation, including (but not limited to) mutational hotspots, or for variants within, for example, 50, 100, 500, or 1,000 base pairs (bp) of such hotspots. In some embodiments, the present disclosure describes systems and methods for determining the phenotypic impact (e.g., pathogenicity, functionality, or relative effect) of molecular variants identified within a biological sample or record of a subject on the basis of functional scores, functional classifications, predictor scores, or predictor classifications of variants within regions of constrained variation in a population, or for variants within, for example, 50, 100, 500, or 1,000 bp of such regions. As would be appreciated by a person of ordinary skill in the art, a variety of methods for determining mutational hotspots and regions of constrained variation can be applied.
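The proximity criterion above can be illustrated with a short Python sketch that tests whether a variant lies within a given number of base pairs of any hotspot or constrained region. The coordinates, the representation of regions as inclusive (start, end) pairs on a single sequence, and the function names are hypothetical.

```python
from typing import Dict, Sequence, Tuple

Interval = Tuple[int, int]  # inclusive (start, end) coordinates in bp


def distance_to_interval(position: int, start: int, end: int) -> int:
    """Distance in bp from a position to an interval (0 if inside it)."""
    if position < start:
        return start - position
    if position > end:
        return position - end
    return 0


def within_window(
    position: int, regions: Sequence[Interval], window_bp: int
) -> bool:
    """True if the position is inside, or within window_bp of, any region."""
    return any(
        distance_to_interval(position, start, end) <= window_bp
        for start, end in regions
    )


# Hypothetical hotspot coordinates and a variant 70 bp past the first hotspot.
hotspots = [(1000, 1050), (5000, 5010)]
variant_position = 1120
flags: Dict[int, bool] = {
    window: within_window(variant_position, hotspots, window)
    for window in (50, 100, 500, 1000)
}
```

With these illustrative values the variant falls outside the 50 bp window but inside the 100, 500, and 1,000 bp windows.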


Various embodiments can be implemented, for example, using one or more computer systems, such as computer system 1900 shown in FIG. 19. Computer system 1900 can be used, for example, to implement methods of FIGS. 1A, 6-13, and 15-18. Computer system 1900 can be any computer capable of performing the functions described herein.


Computer system 1900 includes one or more processors (also called central processing units, or CPUs), such as a processor 1904. Processor 1904 is connected to a communication infrastructure or bus 1906.


One or more processors 1904 may each be a graphics processing unit (GPU). In an embodiment, a GPU is a processor that is a specialized electronic circuit designed to process mathematically intensive applications. The GPU may have a parallel structure that is efficient for parallel processing of large blocks of data, such as mathematically intensive data common to computer graphics applications, images, videos, etc.


Computer system 1900 also includes user input/output device(s) 1903, such as monitors, keyboards, pointing devices, etc., that communicate with communication infrastructure 1906 through user input/output interface(s) 1902.


Computer system 1900 also includes a main or primary memory 1908, such as random access memory (RAM). Main memory 1908 may include one or more levels of cache. Main memory 1908 has stored therein control logic (e.g., computer software) and/or data.


Computer system 1900 may also include one or more secondary storage devices or memory 1910. Secondary memory 1910 may include, for example, a local, network, or cloud-accessible hard disk drive 1912 and/or a removable storage device or drive 1914. Removable storage drive 1914 may be a floppy disk drive, a magnetic tape drive, a compact disk drive, an optical storage device, tape backup device, and/or any other storage device/drive.


Removable storage drive 1914 may interact with a removable storage unit 1918. Removable storage unit 1918 includes a computer usable or readable storage device having stored thereon computer software (control logic) and/or data. Removable storage unit 1918 may be a floppy disk, magnetic tape, compact disk, DVD, optical storage disk, and/or any other computer data storage device. Removable storage drive 1914 reads from and/or writes to removable storage unit 1918 in a well-known manner.


According to an exemplary embodiment, secondary memory 1910 may include other means, instrumentalities or other approaches for allowing computer programs and/or other instructions and/or data to be accessed by computer system 1900. Such means, instrumentalities or other approaches may include, for example, a removable storage unit 1922 and an interface 1920. Examples of the removable storage unit 1922 and the interface 1920 may include a program cartridge and cartridge interface (such as that found in video game devices), a removable memory chip (such as an EPROM or PROM) and associated socket, a memory stick and USB port, a memory card and associated memory card slot, and/or any other removable storage unit and associated interface.


Computer system 1900 may further include a communication or network interface 1924. Communication interface 1924 enables computer system 1900 to communicate and interact with any combination of remote devices, remote networks, remote entities, etc. (individually and collectively referenced by reference number 1928). For example, communication interface 1924 may allow computer system 1900 to communicate with remote devices 1928 over communications path 1926, which may be wired and/or wireless, and which may include any combination of LANs, WANs, the Internet, etc. Control logic and/or data may be transmitted to and from computer system 1900 via communication path 1926.


In an embodiment, a tangible apparatus or article of manufacture comprising a tangible computer useable or readable medium having control logic (software) stored thereon is also referred to herein as a computer program product or program storage device. This includes, but is not limited to, computer system 1900, main memory 1908, secondary memory 1910, and removable storage units 1918 and 1922, as well as tangible articles of manufacture embodying any combination of the foregoing. Such control logic, when executed by one or more data processing devices (such as computer system 1900), causes such data processing devices to operate as described herein.


Based on the teachings contained in this disclosure, it will be apparent to persons skilled in the relevant art(s) how to make and use embodiments of this disclosure using data processing devices, computer systems and/or computer architectures other than that shown in FIG. 19. In particular, embodiments can operate with software, hardware, and/or operating system implementations other than those described herein.


It is to be appreciated that the Detailed Description section, and not any other section, is intended to be used to interpret the claims. Other sections can set forth one or more but not all exemplary embodiments as contemplated by the inventor(s), and thus, are not intended to limit this disclosure or the appended claims in any way.


While this disclosure describes exemplary embodiments for exemplary fields and applications, it should be understood that the disclosure is not limited thereto. Other embodiments and modifications thereto are possible, and are within the scope and spirit of this disclosure. For example, and without limiting the generality of this paragraph, embodiments are not limited to the software, hardware, firmware, and/or entities illustrated in the figures and/or described herein. Further, embodiments (whether or not explicitly described herein) have significant utility to fields and applications beyond the examples described herein.


Embodiments have been described herein with the aid of functional building blocks illustrating the implementation of specified functions and relationships thereof. The boundaries of these functional building blocks have been arbitrarily defined herein for the convenience of the description. Alternate boundaries can be defined as long as the specified functions and relationships (or equivalents thereof) are appropriately performed. Also, alternative embodiments can perform functional blocks, steps, operations, methods, etc. using orderings different than those described herein.


References herein to “one embodiment,” “an embodiment,” “an example embodiment,” or similar phrases, indicate that the embodiment described can include a particular feature, structure, or characteristic, but every embodiment may not necessarily include the particular feature, structure, or characteristic. Moreover, such phrases are not necessarily referring to the same embodiment. Further, when a particular feature, structure, or characteristic is described in connection with an embodiment, it would be within the knowledge of persons skilled in the relevant art(s) to incorporate such feature, structure, or characteristic into other embodiments whether or not explicitly mentioned or described herein. Additionally, some embodiments can be described using the expressions “coupled” and “connected” along with their derivatives. These terms are not necessarily intended as synonyms for each other. For example, some embodiments can be described using the terms “connected” and/or “coupled” to indicate that two or more elements are in direct physical or electrical contact with each other. The term “coupled,” however, can also mean that two or more elements are not in direct contact with each other, but yet still co-operate or interact with each other.


The breadth and scope of this disclosure should not be limited by any of the above-described exemplary embodiments, but should be defined only in accordance with the following claims and their equivalents.









TABLE 1
Mendelian Disorders
Gene (HGNC Symbol)

BRCA1



BRCA2



APOB



LDLR



PCSK9



SCN5A



APC



MLH1



MSH2



MSH6



STK11



MUTYH



MYH7



LMNA



MYBPC3



TNNI3



TNNT2



KCNQ1



KCNH2



SDHB



ACTA2



MYH11



VHL



RET



SDHAF2



SDHC



SDHD



TP53



TSC1



TSC2



NF2



PTEN



RB1



RYR1



GLA



RYR2



TGFBR1



TGFBR2



ACTC1



CACNA1S



COL3A1



DSC2



DSG2



DSP



FBN1



MEN1



MYL2



MYL3



PKP2



PMS2



PRKAG2



SMAD3



TMEM43



TPM1



WT1



BMPR1A



SMAD4



ATP7B



OTC

















TABLE 2
Cancer Drivers (CCG La)
Gene (HGNC Symbol)

TP53



PIK3CA



ARID1A



RB1



PTEN



KRAS



BRAF



CDKN2A



NRAS



FBXW7



STAG2



NFE2L2



NF1



IDH1



ATM



PIK3R1



CASP8



HRAS



MLL2



SF3B1



ERBB2



CREBBP



AKT1



HLA-A



CTCF



ERBB3



CTNNB1



RUNX1



MYD88



SMARCA4



EP300



SETD2



SMARCB1



EGFR



TBL1XR1



U2AF1



EZH2



RAC1



MLL3



IL7R



CD79B



POU2AF1



MAP2K1



PTPN11



CCND1



MAP2K4



TCF7L2



KIT



CDK4



FOXA1



TSC1



FAT1



WT1



BCOR



XPO1



PRDM1



KEAP1



NSD1



PPP2R1A



CDKN1B



ASXL1



MET



RPL5



MYCN



TNFRSF14



FLT3



ALK



KDM5C



KDM6A



APC



PBRM1



STK11



RAD21



EZR



SPOP



TET2



PHF6



IRF4



DDX5



CCDC6



HIST1H3B



CARD11



IDH2



MLL



FGFR2



CDK12



ERCC2



B2M



MED12



CEBPA



NOTCH1



BRCA1



MAP3K1



VHL



DNMT3A



FGFR3



NPM1



FAM46C



CBFB



GATA3



MYB



CDH1



BAP1



ELF3



ZNF198



MALT1



WIF1



KDR



SFRS3



MXRA5



SS18



TAL1



RXRA



TCEA1



HEAB



THRAP3



RUNDC2A



SLC44A3



TNF



TAL2



FLJ27352



LAF4



STK19



DDX10



MSI2



NUTM2A



POU5F1



TRIP11



STAT5B



NCOA2



AZGP1



NCOA1



STAT3



NCOA4



OR52N1



CDKN2a(p14)



CEP1



TFPT



SUFU



HOXA13



DDB2



HOXA11



P2RY8



ECT2L



TRD@



IGH@



SMAD4



RBM10



LASP1



ROS1



KMT2D



WASF3



RBM15



PRKAR1A



KCNJ5



ATRX



EPHA2



BIRC3



HNRNPA2B1



OR4A16



NUTM2B



KLF4



MAP2K2



C15orf21



ERG



CD79A



SRGAP3



MLLT3



MITF



MN1



MLLT2



MLLT7



MLLT6



FAS



C15orf55



POU2F2



EIF2S2



MLLT4



EPS15



HERPUD1



TBC1D12



MLLT1



ALO17



CNOT3



FIP1L1



CBL



OLIG2



HOXC13



NT5C2



ABL1



ZNF521



PLAG1



TPM4



LMO1



LMO2



BLM



NTN4



SLC4A5



IRTA1



JAK3



PMS2



ATP1A1



TERT



CDH11



PTCH



DDX3X



HEY1



MORC4



TLX3



PALB2



BCR



BRCA2



MDM4



MDM2



BRD4



TFG



CSF3R



RPL10



PER1



ITPKB



PDSS2



CREB1



AF3p21



TRIM27



WRN



KIF5B



CHD8



RAB40A



GATA1



ATIC



CD1D



SETBP1



CRTC3



TNFRSF17



COL1A1



DUX4



ACVR1B



C16orf75



NIN



ZNF278



MAF



NF2



AKAP9



CCND2



MAX



MECT1



ARHGEF12



SEPT6



CBLB



FACL6



ALKBH6



CHN1



CBFA2T1



IL6ST



TCEB1



MEN1



FBXO11



HIST1H4I



RALGDS



BUB1B



FHIT



CRLF2



RASA1



TLX1



IGK@



SELP



TXNDC8



CACNA1D



GUSB



NUP214



NKX2-1



INPPL1



CBFA2T3



BCLAF1



TSC2



SDH5



CDC73



ZNF384



CDC27



OTUD7A



SIL



RANBP17



NDRG1



SMC3



FH



PAX7



CD273



HLA-B



PHOX2B



CD274



GNAS



GNAQ



PSIP1



ASPSCR1



GPHN



XIRP2



PAX8



MYOCD



FRMD7



RAP1GDS1



PAX3



AJUBA



SLC34A2



HLF



UBR5



REL



RPS2



GNA11



LHFP



TBX3



SMO



RET



PAPD5



RPS15



SS18L1



MYH11



EIF4A2



LCK



XPA



HSPCA



PPARG



CHIC2



HOXC11



H3F3B



JAK2



TFRC



ZNF620



SOX17



MTCP1



JUN



LCTL



TAF15



NONO



SRSF2



CHCHD7



MAML2



PPM1D



DAXX



H3F3A



JAK1



RIT1



CCND3



TRRAP



MED23



IGL@



SPEN



DIAPH1



CMKOR1



ZNF471



STL



POLE



MAP4K3



ING1



FOXO1A



LIFR



CHEK2



LCP1



AKT2



TPR



NFKB2



FOXL2



COL5A1



FEV



HMGA1



BCL3



HMGA2



CARS



PCSK7



ELL



GMPS



LYL1



BMPR1A



TGFBR2



SLC45A3



GRAF



HLXB9



HIST1H1E



DIS3



WWTR1



PDGFRA



PDE4DIP



ARID5B



ALDH2



STX2



SACS



ARNT



GOPC



SOS1



ITK



DICER1



KEL



CIC



RAB5EP



FVT1



PML



ADNP



FANCA



ABL2



C12orf9



BRIP1



MALAT1



FANCD2



PAFAH1B2



MUTYH



POT1



JAZF1



GNPTAB



FGFR1OP



RAD51L1



DNER



ZNF331



CD70



IKZF1



NCOR1



MLF1



MYH9



SYK



HCMOGT-1



FANCE



FANCF



FANCG



TPM3



NUP210L



INTS12



SDHC



RUNXBP2



BTG1



TTLL9



EML4



SDHB



CDK6



PMX1



PDGFRB



FOXO3A



NTRK1



CLTCL1



SH2B3



EBF1



GPC3



FGFR1



ETV6



NR4A3



SBDS



PIM1



ALPK2



PDGFB



CUL4B



YWHAE



ETV1



BCL10



PBX1



IL21R



CREB3L1



ATF1



FANCC



C2orf44



HSPCB



CANT1



PTPRC



WAS



NFIB



CREB3L2



AF1Q



NOTCH2



ABI1



SH3GL1



NBS1



OMD



SUZ12



TRA@



AF5q31



RSBN1L



BCL11B



MSH6



ERCC5



BCL11A



ERCC3



MSH2



NUMA1



KTN1



TFE3



IL2



MYCL1



LPP



HOXA9



RPL22



MSN



EVI1



BCL7A



AXIN1



NBPF1



ZNF9



MLH1



SFRS2



TRIM33



SIRT4



AXIN2



CIITA



ARHGAP35



SET



ELF4



HIP1



MSF



SOX2



FNBP1



CD74



TCL1A



RAF1



MADH4



COPEB



FLI1



CBLC



GATA2



EXT1



EXT2



MICALCL



DDIT3



D10S170



CDKN2C



MYC



GOLGA5



TRIM23



NTRK3



KLK2



SLC1A3



PRF1



ACSL3



NUP98



ELK4



CYLD



TMPRSS2



DDX6



CCNB1IP1



TTL



ZNF750



TIF1



SOCS1



PNUTL1



FOXQ1



ATP2B3



PMS1



FSTL3



PCBP1



KDM5A



ZNF145



PICALM



EWSR1



AF15Q14



BCL6



GNA13



BCL5



BCL9



ANK3



RHEB



BHD



QKI



PPP6C



CALR



PRCC



FCGR2B



BCL2



RPN1



SSX4



MDS2



TPX2



RARA



ZFHX3



TRB@



MDS1



MAFB



SLC26A3



SGK1



SDHD



CDX2



SSX1



ZRANB3



KIAA1549



SSX2



HOOK3



MTOR



SNX25



TCF1



MGA



LRIG3



PRDM16



ELKS



RHOA



ACO1



ELN



VTI1A



BRD3



MLLT10



RNF43



CDKN1A



ARID2



LCX



TFEB



WHSC1L1



ETV5



ETV4



HOXD11



GAS7



ARHH



IPO7



GOT1



SMAD2



WHSC1



TNFAIP3



TCL6



HOXD13



SDC4



PAX5



MPL



MPO



SFPQ



TCF3



NACA



RECQL4



SMC1A



ERCC4



TCF12



KLHL8



DNM2



CLTC



SMARCE1



DEK



XPC



USP6



FUBP1



PCM1



TRAF7



ZRSR2



FUS



FOXP1



FLG



TOP1



MUC1



TCP11L2



COX6C



MYST4



MUC17



CAMTA1



C3orf70



CUX1



CAP2



TRAF3



MKL1



CCNE1



TSHR



AMER1



CCDC120



CHD4



TAP1

















TABLE 3
Pharmacogenomics (Pharm)
Gene (HGNC Symbol)

A2M



ABAT



ABCA1



ABCA12



ABCA3



ABCA8



ABCB1



ABCB11



ABCB4



ABCB5



ABCB6



ABCB9



ABCC1



ABCC10



ABCC11



ABCC2



ABCC3



ABCC4



ABCC5



ABCC6



ABCC8



ABCC9



ABCD1



ABCD2



ABCG1



ABCG2



ABCG8



ABL1



ABO



ACBD4



ACE



ACE2



ACHE



ACP5



ACSS2



ACTG1



ACY3



ACYP2



ADA



ADAM12



ADAM33



ADAMTS1



ADAMTS14



ADCK4



ADCY2



ADCY9



ADD1



ADH1A



ADH1B



ADH1C



ADH7



ADIPOQ



ADK



ADM



ADORA1



ADORA2A



ADORA2A-AS1



ADRA1A



ADRA2A



ADRA2B



ADRA2C



ADRB1



ADRB2



ADRB3



ADRBK2



AFAP1L1



AGAP1



AGBL4



AGO1



AGT



AGTR1



AGXT



AHR



AIDA



AK4



AKR1C3



AKR1C4



AKR7A2



AKT1



AKT2



ALDH1A1



ALDH1A2



ALDH2



ALDH3A1



ALDH5A1



ALG10



ALOX12



ALOX15



ALOX5



ALOX5AP



AMHR2



AMPD1



ANGPT2



ANGPTL4



ANKFN1



ANKK1



ANKRD55



ANKS1B



ANXA11



AOX1



APBB1



APEH



APLF



APOA1



APOA4



APOA5



APOB



APOBEC2



APOC1



APOC3



APOE



APOH



AQP2



AQP9



ARAP1



ARAP2



AREG



ARG1



ARHGEF10



ARHGEF4



ARID5B



ARMS2



ARNT



ARNTL



ARRB2



ARVCF



AS3MT



ASIC2



ASPH



ASS1



ATF3



ATG16L1



ATG5



ATIC



ATM



ATP2B1



ATP5E



ATP7A



ATP7B



AXIN2



B4GALT2



BACH1



BAD



BAG6



BAZ2B



BCAP31



BCHE



BCL2



BCL2L11



BCR



BDKRB1



BDKRB2



BDNF



BDNF-AS



BGLAP



BLK



BLMH



BMP5



BMP7



BRAF



BRD2



BTG4



BTRC



C10orf107



C10orf11



C11orf30



C11orf65



C12orf40



C17orf51



C18orf21



C18orf56



C1orf167



C2



C20orf194



C3



C5



C5orf22



C8orf34



C9orf72



CA10



CA12



CACNA1A



CACNA1C



CACNA1E



CACNA1H



CACNA1S



CACNB2



CACNG2



CALU



CAMK1D



CAMK2N1



CAMK4



CAP2



CAPG



CAPN10



CAPZA1



CARD16



CARTPT



CASP1



CASP3



CASP7



CASP9



CASR



CAT



CBR1



CBR3



CBS



CCDC22



CCHCR1



CCL2



CCL21



CCND1



CCNH



CCNY



CCR5



CD14



CD28



CD38



CD3EAP



CD40



CD58



CD69



CD74



CD84



CDA



CDC5L



CDCA3



CDH13



CDH4



CDK1



CDK4



CDK9



CDKAL1



CDKN2B-AS1



CELF4



CELSR2



CEP68



CEP72



CERKL



CERS6



CES1



CES1P1



CES2



CETP



CFAP44



CFB



CFH



CFI



CFLAR



CFTR



CHAT



CHIA



CHIC2



CHL1



CHRM2



CHRM3



CHRM4



CHRNA1



CHRNA3



CHRNA4



CHRNA5



CHRNA7



CHRNB1



CHRNB2



CHRNB3



CHRNB4



CHST13



CHST3



CHUK



CLASP1



CLCN6



CLMN



CLNK



CLOCK



CMPK1



CNKSR3



CNOT1



CNPY4



CNR1



CNTF



CNTN4



CNTN5



CNTNAP2



COL18A1



COL1A1



COL1A2



COL22A1



COL26A1



COLEC10



COMT



COQ2



CPA2



CPS1



CR1



CR1L



CREB1



CRH



CRHR1



CRHR2



CRP



CRTC2



CRY1



CSK



CSMD1



CSMD2



CSMD3



CSNK1E



CSPG4



CSRNP3



CSRP3



CST5



CTH



CTLA4



CTNNA2



CTNNA3



CTNNB1



CUX1



CUX2



CXCL10



CXCL12



CXCL5



CXCL8



CXCR2



CXCR4



CXXC4



CYB5A



CYB5R3



CYBA



CYCSP5



CYP11B2



CYP19A1



CYP1A1



CYP1A2



CYP1B1



CYP24A1



CYP27B1



CYP2A6



CYP2B6



CYP2B7P1



CYP2C18



CYP2C19



CYP2C8



CYP2C9



CYP2D6



CYP2E1



CYP2J2



CYP2R1



CYP39A1



CYP3A



CYP3A4



CYP3A43



CYP3A5



CYP3A7



CYP4A11



CYP4B1



CYP4F11



CYP4F2



CYP51A1



CYP7A1



DAOA



DAPK1



DBH



DCAF4



DCBLD1



DCK



DCP1B



DCTD



DDC



DDHD1



DDRGK1



DDX20



DDX53



DDX58



DEAF1



DGCR5



DGKH



DGKI



DHFR



DHODH



DIAPH3



DIO1



DIO2



DKK1



DLEU7



DLG5



DLGAP1



DMPK



DNAH12



DNAJB13



DNMT3A



DOCK4



DOK5



DOT1L



DPP4



DPYD



DPYS



DRD1



DRD2



DRD3



DRD4



DROSHA



DSCAM



DTNBP1



DUSP1



DUX1



DYNC2H1



E2F7



EBF1



ECT2L



EDN1



EGF



EGFR



EGLN3



EHF



EIF2AK4



EIF3A



EIF4E2



ENG



ENOSF1



EPAS1



EPB41



EPHA5



EPHA6



EPHA8



EPHX1



EPM2A



EPM2AIP1



EPO



ERAP1



ERBB2



ERCC1



ERCC2



ERCC3



ERCC4



ERCC5



ERCC6L2



EREG



ERICH3



ESR1



ESR2



ETS2



EXO1



F11



F12



F13A1



F2



F3



F5



F7



FAAH



FABP1



FABP2



FADS1



FAM19A5



FAM65B



FARS2



FAS



FASLG



FASTKD3



FAT1



FBXL17



FBXL19



FCAR



FCER1A



FCER1G



FCER2



FCGR2A



FCGR2B



FCGR3A



FDPS



FEN1



FGD4



FGF2



FGF5



FGFBP1



FGFBP2



FGFR2



FGFR4



FHIT



FKBP5



FLOT1



FLT1



FLT3



FLT4



FMO1



FMO2



FMO3



FMO5



FNTB



FOLH1



FOLR3



FOXC1



FOXP3



FPGS



FSHR



FSIP1



FSTL5



FTO



FYN



FZD3



FZD4



G6PD



GABRA1



GABRA3



GABRA6



GABRB1



GABRB2



GABRG2



GABRG3



GABRP



GABRQ



GAD2



GADL1



GAL



GALNT14



GALNT18



GALNT2



GALR1



GAPDHP64



GAPVD1



GATA3



GATA4



GATM



GBP6



GCG



GCKR



GCLC



GDNF



GEMIN4



GFRA2



GGCX



GGH



GHSR



GIPR



GJA1



GLCCI1



GLDC



GLP1R



GLRB



GNAS



GNB3



GNMT



GP1BA



GP6



GPR1



GPR83



GPX1



GPX3



GPX5



GRIA1



GRIA3



GRID2



GRIK1



GRIK2



GRIK3



GRIK4



GRIN1



GRIN2A



GRIN2B



GRIN3A



GRK4



GRK5



GRM3



GRM7



GSK3B



GSR



GSTA1



GSTA2



GSTA5



GSTM1



GSTM3



GSTM4



GSTP1



GSTT1



GSTZ1



H19



HAS3



HCG22



HCP5



HDAC1



HES6



HFE



HIF1A



HLA-A



HLA-B



HLA-C



HLA-DOB



HLA-DPA1



HLA-DPB1



HLA-DPB2



HLA-DQA1



HLA-DQB1



HLA-DRA



HLA-DRB1



HLA-DRB3



HLA-DRB5



HLA-E



HLA-G



HMGB1



HMGB2



HMGCR



HNF1A



HNF1B



HNF4A



HNMT



HOMER1



HOTAIR



HOTTIP



HRH1



HRH2



HRH3



HRH4



HS3ST4



HSD11B1



HSD3B1



HSPA1A



HSPA1L



HSPA5



HSPG2



HTR1A



HTR1B



HTR1D



HTR2A



HTR2C



HTR3A



HTR3B



HTR5A



HTR6



HTR7



HTRA1



HUS1



HYKK



IBA57



IDO1



IFIT1



IFNAR1



IFNB1



IFNG



IFNGR1



IFNGR2



IFNL3



IFNL4



IGF1



IGF1R



IGF2BP2



IGF2R



IGFBP3



IGFBP7



IKBKG



IKZF3



IL10



IL11



IL12A



IL12B



IL13



IL16



IL17A



IL17F



IL17RA



IL18



IL1A



IL1B



IL1RN



IL2



IL21R



IL23R



IL27



IL2RA



IL2RB



IL3



IL4



IL4R



IL6



IL6R



IL6ST



IL7R



ILKAP



IMPA2



IMPDH1



IMPDH2



INSIG2



INSR



IP6K2



IRS1



ITGA1



ITGA2



ITGA9



ITGB1



ITGB3



ITGBL1



ITIH3



ITPA



ITPKC



JAK2



KANSL1



KCNE1



KCNH2



KCNH7



KCNIP1



KCNIP4



KCNJ1



KCNJ11



KCNJ6



KCNMA1



KCNMB1



KCNQ1



KCNQ5



KCNT1



KCNT2



KDM4A



KDR



KIAA0391



KIF6



KIR2DL2



KIRREL2



KIT



KL



KLC1



KLC3



KLRC1



KLRD1



KLRK1



KRAS



KYNU



LAMB3



LARP1B



LCE3B



LCE3C



LDLR



LECT2



LEP



LEPR



LGALS3



LGR5



LIG3



LINC00251



LINC00478



LIPC



LPA



LPHN3



LPIN1



LPL



LRP1



LRP1B



LRP2



LRP5



LRRC15



LST1



LTA



LTA4H



LTB



LTC4S



LUC7L2



LYN



LYRM5



MAD1L1



MAFB



MAFK



MALAT1



MAML3



MAN1B1



MAP3K1



MAP3K5



MAP4K4



MAPK1



MAPK14



MAPT



MARCH1



MC1R



MC4R



MCPH1



MDGA2



MDM2



MDM4



MECP2



MED12L



MEG3



MET



METTL21A



MEX3C



MGAT4A



MGMT



MIA3



MICA



MICB



MIR1206



MIR1307



MIR133B



MIR146A



MIR2053



MIR27A



MIR300



MIR423



MIR4278



MIR449B



MIR492



MIR577



MIR595



MIR604



MIR611



MIR618



MIR7-2



MISP



MLLT3



MLN



MME



MMP1



MMP10



MMP2



MMP3



MMP9



MOB3B



MOCOS



MOV10



MPO



MPZ



MS4A2



MSH2



MSH3



MSH6



MT-RNR1



MTCL1



MTHFD1



MTHFR



MTMR12



MTOR



MTR



MTRF1L



MTRR



MTTP



MUC5B



MUTYH



MVK



MYC



MYLIP



MYOCD



N6AMT1



NALCN



NANOGP6



NAT1



NAT2



NAV2



NBAS



NBEA



NCF4



NCOA1



NCOA3



NEDD4



NEDD4L



NEFM



NELFCD



NELL1



NEUROD1



NFATC1



NFATC2



NFE2L2



NFKB1



NFKBIA



NGF



NGFR



NLGN1



NLRP3



NLRP8



NOD2



NOS1AP



NOS2



NOS3



NPAS3



NPC1L1



NPHS1



NPPA



NPPA-AS1



NQO1



NQO2



NR1D1



NR1H3



NR1I2



NR1I3



NR3C1



NR3C2



NRAS



NRG1



NRG3



NRP1



NRP2



NRXN1



NT5C1A



NT5C2



NT5C3A



NT5E



NTRK1



NTRK2



NUBPL



NUDT15



NUMA1



OAS1



OASL



OCRL



OPN1SW



OPRD1



OPRK1



OPRM1



OR10AE3P



OR4D6



OR52E2



OR52J3



ORM1



ORM2



ORMDL3



OSMR



OTOS



OXT



P2RY1



P2RY12



PACSIN2



PADI4



PAPD7



PAPLN



PAPPA2



PARD3B



PARP11



PAX4



PCK1



PCSK9



PDCD1LG2



PDE4B



PDE4C



PDE4D



PDGFRA



PDGFRB



PDLIM5



PDZRN3



PEAR1



PEMT



PER2



PER3



PGLYRP4



PGR



PHACTR1



PHB2



PHTF1



PI4KA



PICALM



PICK1



PIGB



PIK3CA



PIK3R1



PITPNM2



PKLR



PLA2G4A



PLAGL1



PLCB1



PLCD3



PLCG1



PLEKHH2



PLEKHN1



PLG



PLXNB3



PMCH



POLA2



POLG



POLR3G



POMT2



PON1



PON2



POR



POU2F1



POU2F2



POU5F1



PPARA



PPARD



PPARG



PPARGC1A



PPFIA1



PPM1A



PPP1R13L



PPP1R1C



PPP2R5E



PRB2



PRCP



PRDM1



PRDM16



PRDX4



PRIMPOL



PRKAA1



PRKAA2



PRKCA



PRKCB



PRKCE



PRKCQ



PRKG1



PROC



PROCR



PROM1



PROS1



PROX1



PRRC2A



PRSS53



PSMA4



PSMB3P



PSMB4



PSMB8



PSMD14



PSORS1C1



PSORS1C3



PSRC1



PTCHD1



PTEN



PTGER2



PTGER3



PTGER4



PTGES



PTGFR



PTGIR



PTGS1



PTGS2



PTH



PTH1R



PTPN22



PTPRC



PTPRD



PTPRM



PTPRN2



PYGL



RAB27A



RABEPK



RAC2



RAD18



RAD52



RAF1



RALBP1



RAPGEF5



RARG



RARS



RBFOX1



RBMS3



REEP5



REL



REN



REPS1



RET



REV1



REV3L



RFK



RGS17



RGS2



RGS4



RGS5



RHBDF2



RHOA



RICTOR



RND1



RNFT2



RORA



RPL13



RRAS2



RRM1



RRM2



RRM2B



RSBN1



RSRP1



RUNX1



RXRA



RYR1



RYR2



RYR3



SACM1L



SCAP



SCARB1



SCGB3A1



SCN10A



SCN1A



SCN2A



SCN4A



SCN5A



SCN8A



SCN9A



SCNN1B



SCNN1G



SELE



SELP



SEMA3C



SERPINA3



SERPINA6



SERPINE1



SERPINF1



SERPING1



SETD4



SFRP5



SH2B3



SH2D5



SH3BP2



SHMT1



SIK3



SIN3A



SKIV2L



SKOR2



SLC10A2



SLC12A3



SLC12A8



SLC14A2



SLC15A1



SLC15A2



SLC16A5



SLC16A7



SLC17A3



SLC18A2



SLC19A1



SLC1A1



SLC1A2



SLC1A3



SLC1A4



SLC22A1



SLC22A11



SLC22A12



SLC22A16



SLC22A17



SLC22A2



SLC22A3



SLC22A4



SLC22A5



SLC22A6



SLC22A7



SLC22A8



SLC24A4



SLC25A13



SLC25A14



SLC25A27



SLC25A31



SLC26A9



SLC28A1



SLC28A2



SLC28A3



SLC29A1



SLC2A1



SLC2A2



SLC2A9



SLC30A8



SLC30A9



SLC31A1



SLC37A1



SLC39A14



SLC47A1



SLC47A2



SLC5A2



SLC5A7



SLC6A12



SLC6A2



SLC6A3



SLC6A4



SLC6A5



SLC6A9



SLC7A5



SLC7A8



SLCO1A2



SLCO1B1



SLCO1B3



SLCO1C1



SLCO2B1



SLCO3A1



SLCO4C1



SLCO6A1



SLIT1



SMARCAD1



SMYD3



SNAP25



SNORA59B



SNORD68



SOCS3



SOD2



SOD3



SORT1



SOX10



SP1



SPARC



SPATS2L



SPECC1L



SPG7



SPIDR



SPINK5



SPP1



SPTA1



SQSTM1



SREBF1



SREBF2



SRP19



SRR



ST13



STAT3



STAT4



STAT6



STIM1



STIP1



STK39



STMN1



STMN2



STX1B



STX4



SUGCT



SULT1A1



SULT1A2



SULT1C4



SULT1E1



SULT2B1



SV2C



SYN3



SYNE3



SZRD1



T



TAAR6



TAC1



TAGAP



TANC1



TANC2



TAP1



TAP2



TAPBP



TAS2R16



TBC1D1



TBC1D32



TBX21



TBXA2R



TBXAS1



TCF19



TCF7L2



TCL1A



TDP1



TDRD6



TERT



TET2



TF



TGFB1



TGFBR2



TGFBR3



TH



THBD



THRA



THRB



TIGD1



TK1



TLR2



TLR3



TLR4



TLR5



TLR7



TLR9



TMCC1



TMCO6



TMEFF2



TMEM205



TMEM258



TMEM57



TMPRSS11E



TNF



TNFAIP3



TNFRSF10A



TNFRSF11A



TNFRSF11B



TNFRSF1A



TNFRSF1B



TNFSF10



TNFSF11



TNFSF13B



TNRC6A



TNRC6B



TOLLIP



TOMM40



TOMM40L



TOP1



TOP2B



TP53



TPH1



TPH2



TPMT



TRAF1



TRAF3IP2



TRIB3



TRIM5



TRPM6



TSC1



TSPAN5



TTC6



TUBB1



TUBB2A



TXNRD2



TYMP



TYMS



UBASH3B



UBE2I



UCP2



UCP3



UGGT2



UGT1A



UGT1A1



UGT1A10



UGT1A3



UGT1A4



UGT1A5



UGT1A6



UGT1A7



UGT1A8



UGT1A9



UGT2B10



UGT2B15



UGT2B17



UGT2B4



UGT2B7



ULK3



UMPS



UPB1



USH2A



USP24



USP5



UST



VAC14



VASP



VDR



VEGFA



VKORC1



WBP2NL



WBSCR17



WDR7



WIF1



WNK1



WNT5B



WT1



WWOX



XBP1



XDH



XPA



XPC



XPO1



XPO5



XRCC1



XRCC3



XRCC4



XRCC5



YAP1



YBX1



YEATS4



ZBTB22



ZBTB4



ZCCHC6



ZFP91-CNTF



ZMAT4



ZNF100



ZNF215



ZNF423



ZNF432



ZNF652



ZNF697



ZNF804A



ZNF816



ZNRD1-AS1



ZSCAN25

















TABLE 4

Clinical Testing Genes

Gene (HGNC Symbol)

















LMNA



PTEN



TP53



BRCA2



MLH1



MSH2



BRCA1



MSH6



FGFR3



MECP2



CFTR



RET



PTPN11



SCN5A



MYH7



CAV3



PMS2



KRAS



APC



ATM



ARX



DMD



DES



STK11



POLG



NF1



BRAF



TSC1



CDKL5



TSC2



TTN



COL2A1



FMR1



FKTN



KCNQ1



VHL



SLC2A1



FBN1



EPCAM



HRAS



PALB2



RAF1



TNNT2



CEP290



SMAD4



MUTYH



SCN1A



SCN1B



KCNJ2



RYR2



GLA



CDH1



NRAS



FKRP



KCNH2



LDB3



CACNA1A



MYBPC3



FGFR2



UBE3A



CACNA1C



GJB2



TAZ



SDHB



TNNI3



ACTC1



GAA



TCAP



CHEK2



LAMP2



COL1A1



TTR



DSP



HBB



SDHD



SOS1



NBN



COL1A2



TGFBR2



POMT1



TPM1



FLNA



KCNE1



PCDH19



MAP2K1



CHD7



FOXG1



SDHC



TGFBR1



RYR1



MTHFR



SGCD



CDKN2A



PMP22



POMT2



FH



WT1



EMD



SCN4A



FGFR1



PLP1



PAX6



POMGNT1



TMEM43



MEN1



PKP2



SLC9A6



RHO



F5



GCK



BRIP1



TRIM32



DSG2



RAD51C



TRPV4



SCN2A



CPT2



KCNE2



GJB6



COL3A1



MAP2K2



NPHP1



DNM2



BMPR1A



PRKAG2



ACADM



OFD1



MYOT



CASQ2



HEXA



DSC2



MEF2C



HFE



CLN3



PTCH1



CRYAB



JUP



PLN



MED12



ZEB2



FHL1



ABCC8



F2



ACADVL



BAG3



ATP7A



CASR



SCN9A



BSCL2



PDHA1



SHOC2



ETFDH



KCNQ2



HADHA



TNNC1



PRRT2



TPP1



ANO5



COL5A1



ETFB



MPZ



ETFA



ACTA1



PPT1



CASK



STXBP1



ABCD1



KCNJ11



ATRX



GNAS



ABCA4



DYSF



ABCC9



TCF4



BLM



SLC22A5



SDHA



MYH6



HCN4



ATP7B



PLA2G6



FANCC



MYL2



CBS



ANK2



KCNE3



MYL3



CLN5



DCX



PANK2



ALDH7A1



NKX2-5



GBA



TIMM8A



PNKP



ACTA2



WFS1



MFN2



FOLR1



JAG1



SMN1



SMARCB1



L1CAM



GPC3



KIT



NSD1



OPA1



DHCR7



NF2



SGCA



MITF



CLRN1



TPM2



SPRED1



MKS1



NIPBL



AGL



OTC



RB1



CSRP3



GLB1



TMEM67



CLN6



HNF1B



SMC1A



SCN4B



CACNB2



ACVRL1



DLD



CBL



FXN



ARSA



PSEN1



COL6A3



LAMA2



SMAD3



ENG



PRPS1



ACTN2



TWNK



CAPN3



GDAP1



COL5A2



EYA1



PCDH15



GCH1



SURF1



SGCB



SCN3B



TMEM216



PITX2



COL6A1



PEX1



MYH11



VCL



NOTCH3



LARGE1



SLC26A4



CLN8



BTD



GAMT



USH2A



MYH9



AR



NPC1



TERT



GABRG2



GCDH



HNF1A



FLNC



IDS



COL6A2



BBS1



RPGR



FLCN



GNE



RPGRIP1L



MEFV



CALM1



CDKN1C



MFSD8



PRPH2



SMPD1



OPHN1



CNTNAP2



BCKDHB



PLOD1



PLEC



CREBBP



SDHAF2



ARHGEF9



AKAP9



RAD51D



NEB



OPA3



MBD5



NPC2



MYO7A



CTSD



VPS13B



GALC



KCNJ5



PAFAH1B1



PYGM



GRN



ASPA



CDK4



PEX7



MET



FBN2



CC2D2A



GARS



NRXN1



PIK3CA



COL11A2



HTT



SLC26A2



SETX



NEXN



TGFB3



SELENON



KCNJ10



CPT1A



HPRT1



ELN



UGT1A1



WAS



OCRL



KCND3



MUT



VCP



HADHB



GPD1L



KCNQ3



SUCLA2



SCO2



FTL



EGR2



PMM2



ALPL



SNTA1



BBS2



G6PC



HADH



PKD2



PKHD1



COQ2



MMACHC



GJB1



BEST1



SGCG



BCKDHA



LDLR



NPHP3



SLC25A20



ACADS



DYNC1H1



KCTD7



MAPT



FIG4



TREX1



MMAB



PQBP1



GRIN2A



COL4A5



MMAA



MKKS



RPE65



GBE1



NDP



HSD17B10



GATA1



APOB



TTC8



SPG7



PDX1



GABRA1



APTX



IKBKAP



NEFL



PEX6



COL11A1



TBC1D24



TGFB2



CRX



APOE



GUCY2D



PHOX2B



ISPD



ATP1A2



ATP13A2



ATL1



SYNE1



ATXN2



SLC6A8



ALMS1



HNF4A



AHI1



ACAD9



PRKAR1A



SNRPN



COL4A1



NOTCH1



SLC25A22



GLDC



ADGRV1



GALT



PEX26



TRDN



PHF6



PNPO



KCNT1



MTM1



COX15



SLC4A1



RRM2B



PRSS1



TPM3



BBS10



BAP1



BCS1L



CDH23



MRE11



PCCA



TBX5



MPL



PAH



SPTAN1



SCN8A



AMT



ASS1



PSEN2



CACNA1S



USH1C



FANCA



CYP21A2



FGD1



PEX12



SLC2A10



WDR62



FAH



GLI3



RUNX1



ANKRD1



GNPTAB



SLC25A4



SERPINA1



RELN



BARD1



RAPSN



DKC1



CSTB



SGCE



F8



KCNJ8



MYPN



MVK



PEX10



REEP1



CRB1



CHRNA1



RBM20



PCCB



BCOR



NLRP3



HBA1



EPM2A



SKI



GATA2



MYLK



FANCB



TYR



ABCB4



C12orf65



PEX2



LRP5



TTC21B



SLC25A13



HSPB1



HSPB8



MPV17



SPAST



SLC37A4



IQCB1



IDUA



EYA4



KCNA1



PGK1



CYP1B1



WHRN



SMARCA4



TERC



ADSL



DMPK



ATXN1



ATP6AP2



SYNGAP1



RDH12



TARDBP



KMT2D



PRKN



NPHP4



TK2



NHLRC1



GJA1



SUCLG1



GATA4



NDUFA1



COL4A3



ATXN3



VWF



TH



DBT



KIF1A



MMADHC



MID1



PKD1



AP3B1



CHRNA4



DNAJB6



APP



SHH



FA2H



CHRNB2



EDN3



SLC16A2



ELANE



FUS



INS



RPS6KA3



INVS



MYOZ2



TNNT1



ALK



TMEM70



CACNB4



JAK2



CNGB3



SPINK1



AGXT



PAX3



MCOLN1



PEX5



ASPM



DGUOK



IGHMBP2



CFH



SOD1



TUBA1A



DOLK



PROM1



SYN1



HMGCL



KDM5C



RAB39B



DNAJC5



AUH



SHOX



ATXN7



CENPJ



SRPX2



SOX10



CYP2D6



DCTN1



TBX1



ALDOB



ARL6



BBS12



COQ8A



TWIST1



RECQL4



OTX2



PC



DPAGT1



TP63



GP1BA



ARG1



POLD1



SACS



AKT1



PEX3



SMC3



OCA2



CYP2C19



RMRP



IL2RG



DNAH5



SPG11



NDRG1



COL4A4



FOXC1



BMPR2



MCCC2



MAX



F9



ERCC6



C9orf72



TYMP



RAI1



AIPL1



MCCC1



SLC25A19



COL9A1



BTK



P3H1



PDSS2



PCNT



NOTCH2



ATP8B1



ATP1A3



ETHE1



HEXB



SLC25A15



CP



COL9A2



CHRNA2



CHRNE



CUL4B



DOK7



CHRND



GUSB



SLC19A3



IVD



SH3TC2



EFHC1



IMPDH1



CRTAP



CYP27A1



HSPD1



SOX2



SDCCAG8



CYP2C9



ALS2



RPS19



GOSR2



RARS2



GFAP



PEX14



CYP11B1



GMPPB



BBS4



SGSH



GJC2



GLUD1



GATM



TMEM127



RPGRIP1



PDGFRA



LGI1



MT-ATP6



ADAMTS13



BBS5



WDR45



MTMR2



GATA6



BBS7



LITAF



POLG2



ABCB11



PRX



ALG2



ABCC6



RNASEH2B



FANCG



ADA



SIL1



RP2



RASA1



NTRK1



TNFRSF1A



SCNN1B



CHAT



USH1G



FLNB



DNAI1



CFL2



OPTN



NDUFS4



ARL13B



BBS9



TOR1A



LRPPRC



ATPAF2



SAMHD1



TSEN54



NPHS2



TSFM



HBA2



GALNS



FKBP14



CHST14



FOXRED1



TRPM4



NHS



RNASEH2A



RNASEH2C



ADGRG1



MT-RNR1



AGK



CEP152



ASL



SNCA



GRIN2B



DTNA



SIX1



CPS1



KIF7



AIFM1



PDHX



NAGLU



MT-TL1



NSDHL



HDAC8



HGSNAT



LRRK2



SBF2



RAB7A



SCNN1G



LRAT



DARS2



KIF5A



RIT1



PCSK9



GFM1



PINK1



NPHS1



ARSB



NDUFS7



POLE



PFKM



SCN2B



IDH2



FBLN5



INPP5E



PDSS1



GABRD



ATP6V0A2



PRICKLE1



ACAT1



SOX9



CACNA2D1



G6PD



SPG20



SCARB2



NLGN3



ANOS1



NLGN4X



GABRB3



HAX1



AFG3L2



GJB3



TINF2



KRIT1



GPR143



CDC73



EDNRB



MLYCD



AARS2



JAK3



SDHAF1



JPH2



NDUFV1



PEX13



PLCB1



ABHD12



PEX16



IRF6



SUMF1



BSND



DAG1



HLCS



ATR



EGFR



AFF2



EZH2



PEX19



ABCA3



PAK3



NDUFS1



PHYH



PRKCG



TMPO



TULP1



COMP



MPI



MYLK2



HESX1



YARS



BIN1



DPM3



LYST



AARS



SIX3



ACTG1



C19orf12



PDHB



COQ9



MLC1



NODAL



DPYD



CHM



DPM1



LIPA



SFTPC



DLAT



VRK1



TUBB2B



ATP6V1B1



HSD17B4



CERKL



EP300



SLC12A3



GATA3



FANCE



FGD4



CFI



SCN10A



COLQ



COX6B1



FKBP10



EXT1



ADAMTS2



SBDS



CD46



TGIF1



SALL1



ERCC4



KIF1B



SLC17A5



WNK1



KCNA5



ARFGEF2



FANCF



ELOVL4



SALL4



CYP7B1



KARS



GRIA3



ALDH5A1



SPR



CLCN1



HCCS



GNS



EIF2AK3



PUS1



PDE6B



PLOD2



PAX2



DHDDS



WDR19



ALG6



PPARG



VAPB



CHD2



RP1



PSAP



WRN



LMBRD1



INSR



CEBPA



LPIN1



SMS



MT-TK



PARK7



SUFU



UMOD



PRNP



AGA



RAD50



FUCA1



SLC39A13



NDUFA2



ISCU



MT-TS1



SEMA4A



FOXP3



TACO1



LIG4



AIRE



SRY



KBTBD13



EIF2B5



MT-ND1



IKBKG



DICER1



TRMU



MUSK



SLC25A3



OTOF



POMK



TBP



RAG2



UPF3B



EDA



RLBP1



RAB3GAP1



LAMB2



CEP41



RAD21



KDM6A



MCPH1



CABP4



SPATA7



MTRR



LAMA4



EFEMP2



NDUFS8



GALK1



SAG



LCA5



NR2E3



EXT2



GCSH



PPIB



PORCN



EHMT1



CTNNB1



CTNS



TFR2



C3



HCN1



EIF2B1



SLX4



POU3F4



WDPCP



INF2



LIAS



CHRNB1



ACTB



AP1S2



PHEX



SPTB



NEUROD1



RS1



NPPA



SOX3



FGF23



MAN2B1



DNAH11



ERCC2



DGKE



CCM2



NDUFAF2



EVC



RAG1



HPS1



NDUFS3



NDUFS2



ZIC2



FGF8



LPL



FASTKD2



TCTN2



CACNA1D



HPS4



CACNA1F



CLCN5



GJA5



SYP



GP1BB



FANCL



ACSL4



IDH1



CLCNKB



CISD2



ROR2



NEU1



GATAD1



MYH3



NDE1



PRPF31



ABCG5



NKX2-1



PGM1



TMEM237



FBP1



CDK5RAP2



NDUFAF5



ZFYVE26



DPM2



PHKA1



MT-ND6



STIL



TUBB3



BICD2



IQSEC2



SPTA1



ITGA7



QDPR



TJP2



PTS



EIF2B3



NOD2



GLRA1



CSF1R



PRF1



ATN1



PAX4



GPSM2



CHMP2B



CFB



EYS



FANCI



ST3GAL3



AGPAT2



PDP1



IL7R



HK1



PNPLA2



RAB27A



DCLRE1C



MC4R



GYS2



B9D1



SCNN1A



ANG



ENPP1



PRPF8



SFTPB



FANCM



AXIN2



LMX1B



NHEJ1



SYNE2



TTC19



PROP1



MAGT1



COL7A1



FANCD2



FSCN2



NDUFAF1



MT-ND4



KCNJ1



COL12A1



CNGA3



STAT3



TYRP1



NDUFS6



GUCA1B



SLC2A2



SIX5



ADAR



SLC33A1



CCDC39



AMACR



GAN



HFE2



B3GLCT



EFNB1



UQCRB



SLC12A6



FGA



HPS3



XRCC2



MTR



C8orf37



ACTN4



EVC2



THAP1



TRPS1



IDH3B



RUNX2



LAMB3



SH2D1A



GDI1



TMC1



DNMT1



PDCD10



MRPS22



LAMA3



TOPORS



CHKB



MTPAP



CYP17A1



POMGNT2



SLC12A1



ZIC3



GLI2



RD3



ALAS2



RPL35A



CNGB1



LDLRAP1



DEPDC5



THBD



DYRK1A



SLC19A2



DNAI2



PGAM2



PNKD



ASAH1



WDR35



VKORC1



DOCK8



PHGDH



SLC45A2



GP9



CCDC78



SPTLC1



IL1RAPL1



SLC35C1



UBE2A



NR0B1



CAVIN1



ACOX1



AGRN



CA4



COL9A3



CNGA1



LAMC2



DTNBP1



EIF2B2



TTPA



FLVCR1



MYH14



ERBB2



ITGB3



VLDLR



WASHC5



NDUFA11



C2orf71



PTCHD1



NRL



ALDH4A1



RSPH9



ATP5E



GK



CTDP1



ABL1



TCTN1



ANK1



CTSA



SLC40A1



AKT3



B4GAT1



ZMPSTE24



MERTK



EIF2B4



ERCC8



NUBPL



PPOX



PDLIM3



PNPLA6



TNXB



PRKG1



FOXH1



COG7



RPL11



GPHN



ABCG8



PDE6C



B4GALT7



G6PC3



GNA11



CLCN2



NME8



KCNJ13



HEPACAM



SLCO1B1



UQCRQ



NDUFAF4



TMEM138



MT-ND5



NDUFAF3



HMBS



NHP2



IFITM5



MBTPS2



SMN2



PDE6A



VSX2



MYO6



CPOX



ALG13



CCDC40



ALDH3A2



NIPA1



TSHR



ZNF423



SQSTM1



MOCS2



L2HGDH



SCO1



TUBB4A



TCOF1



MOCS1



MTO1



CIB2



HINT1



KIAA2022



ERCC3



PITX3



PRPF3



DNM1L



TCTN3



FHL2



CA2



GRHPR



PLEKHG5



CDON



KLHL40



TSEN2



SLC1A3



RGR



NEBL



C5orf42



HPS6



GFI1



MYCN



LZTR1



BRWD3



TSEN34



F11



SNRNP200



GNAT2



ALG1



TMEM126A



SP7



KLHL7



TUFM



DLG3



DNAAF2



DNAAF1



VPS13A



NOP10



TMEM5



MCEE



STXBP2



MED25



SHANK3



SLC3A1



TECTA



COX10



CHRNG



RDH5



CDHR1



PHF8



RPL5



MAOA



GFPT1



RAB3GAP2



CALM2



NAGS



POLR1C



HSD3B2



AMPD1



BUB1B



NEK8



TUBA8



B3GALNT2



FLT3



MATR3



KRT5



GDF6



GREM1



AVPR2



DNAL1



ZDHHC9



CTC1



ALDOA



NR5A1



CYBB



FTSJ1



BLOC1S3



EBP



DCAF17



SPG21



ACAD8



ABCB7



F12



GLRB



GLIS2



EXOSC3



HUWE1



BMP4



TMIE



GNPTG



RPS26



ITGA2B



LRSAM1



SLC6A3



ALDH18A1



SERPINC1



KLF11



F7



RPS10



WNT10A



NFIX



MGAT2



ACSF3



RBBP8



CFHR5



COQ6



UBQLN2



CDKN1B



SUOX



FAM126A



COG8



NDUFA10



SMARCE1



ALG8



GSS



EPB42



RPL10



DNAJC19



NAA10



KCNMA1



RPS24



STX11



ALG3



XK



MFRP



TMPRSS3



TSPAN7



SERPINH1



IMPG2



ALG12



SERPINE1



SLC16A1



TCIRG1



STIM1



ETV6



CLCN7



GDF2



SLC35A1



FAM161A



ARID1B



TMEM231



SLC35A2



NGF



COX4I2



POU1F1



GLIS3



TAF1



PNP



POMC



KIF1BP



BLK



YARS2



TCN2



UNC13D



HAMP



HOGA1



ACADSB



B4GALT1



MANBA



KAT6B



RSPH4A



ACE



EDAR



WWOX



FARS2



GNAQ



GNPAT



ANKH



ENO3



FRAS1



RANGRF



GALE



TREM2



CD3D



LEP



TFG



IER3IP1



DYNC2H1



NPM1



KMT2A



CD40LG



PYGL



MT-CYB



DFNB59



MRPS16



RTN2



KCNE5



MATN3



TAT



NDUFV2



CDAN1



STS



CAV1



B3GALT6



CTSK



CALR3



KCNV2



AP4M1



SERPING1



GYS1



HPS5



ST3GAL5



SLC6A5



ARID1A



PRKRA



COG1



COL4A2



EFEMP1



PIK3R2



MTFMT



SEPT9



FOXP1



NDUFAF6



ROM1



KRT14



SLC25A12



SEC23B



TNNI2



CD3E



HPD



PHKB



AIP



FZD4



XPNPEP3



CEP164



ITGB4



SLMAP



PABPN1



TBCE



GHR



NOG



CACNA2D4



ALG9



FOXL2



TYROBP



THRB



AP4E1



BDNF



AKT2



DSPP



MPDU1



EDARADD



TPMT



SPTBN2



BLOC1S6



FGF14



CTSF



PRCD



SRD5A3



PRPF6



TRAPPC11



PHKA2



COCH



AGPS



EARS2



FOXE3



IGBP1



RBP3



PKLR



PIGA



MAT1A



SPTLC2



CEP63



FBXO7



SETBP1



OTOA



RTEL1



PTF1A



LEPR



SMARCAL1



SCP2



PCBD1



DMP1



MOGS



CNTN1



TNPO3



POLR3A



SLC46A1



FOXI1



MYO15A



KCNQ4



MYOC



PYCR1



APOA5



GRHL2



POR



AICDA



KISS1R



PRDM16



ARSE



LHFPL5



PDE6G



HARS



SNAI2



VCAN



SMPX



CSF3R



COL17A1



LOXHD1



MTTP



SERPINF1



PROKR2



GNRHR



D2HGDH



B9D2



ZAP70



AP5Z1



CTNNA3



CSF2RA



SLC34A3



ZNF513



TNFRSF11A



CTRC



RP9



HSPG2



KANSL1



RPS7



TRIOBP



CEL



SHROOM4



SLC7A7



RFT1



ADAMTSL4



ABCA12



ABAT



LPIN2



ERCC5



HGF



PROC



LHX4



ROGDI



ABCA1



DIABLO



ESCO2



PRDM5



PHKG2



FREM1



PRODH



DIS3L2



RDX



WRAP53



MC1R



ACVR1



ZNF711



IFT80



ACVR2B



EFTUD2



LTBP2



MEGF10



RAB18



CLDN14



FLT4



CCT5



SRCAP



ESRRB



PDZD7



NEK1



NR3C2



TBX20



DNAJB2



FAS



ATXN10



CFHR1



GDF5



PSTPIP1



ARHGEF6



TDP1



GUCA1A



OXCT1



PPP2R2B



AQP2



TRPC6



MARVELD2



FECH



OAT



PEX11B



PRICKLE2



APOC2



PDGFRB



CACNA1H



LHCGR



SARS2



LRTOMT



COL10A1



XIAP



UNG



MGME1



SLC26A5



CYBA



PITPNM3



PTH1R



TIMP3



DRD2



PDE6H



ALX4



TXNRD2



OBSL1



ORC1



GH1



CSPP1



LEFTY2



CCDC50



ABCD4



DIAPH1



CDH3



CHCHD10



PAX8



GDNF



MT-CO1



HARS2



HTRA1



BMP1



MSRB3



ZDHHC15



CAVIN4



AP4S1



CFHR3



ACADL



NDUFA9



MSX1



MYO3A



CYP11B2



CTF1



MAK



AP4B1



IFT122



ABHD5



MARS



A2ML1



CHST3



CYLD



GDF1



XPA



MT-TH



TPRN



MT-TQ



POU4F3



XPC



GRIN1



GIPC3



CYP27B1



POLR1D



LHX3



TGFB1



TOR1AIP1



CNBP



GM2A



DDHD2



TRPM1



BCKDK



DNAAF3



HSD11B2



ADAM9



CLCNKA



NDUFB3



LAS1L



MAGI2



ANKRD11



NMNAT1



ZFYVE27



DNMT3A



PROK2



SMARCA2



GFER



POLR3B



NDUFA12



PLCE1



STRA6



EMX2



HMGCS2



ASCL1



COMT



PROS1



KCNC3



ILK



FGB



C10orf11



ILDR1



ANKRD26



GRXCR1



SZT2



HNRNPDL



KIF11



FGG



DDC



TTBK2



FREM2



ZNF469



TUSC3



TFAP2A



DLL3



CLIC2



GDF3



MT-TS2



CYP3A5



AHCY



LDHA



SLC52A3



PRKCSH



ACY1



ACO2



KCNK3



AMER1



WNT1



MARS2



NYX



VPS35



UROS



COG6



REN



AVP



MTOR



TBX3



RBM10



PFN1



TPO



MYBPC1



SERPINB6



PTPRC



H19



ABCB6



WNT7A



MYO5A



CCDC88C



ATP6V0A4



OSTM1



SRD5A2



CDT1



DFNA5



ESPN



MYF6



USB1



DDOST



CRYM



APOA1



ATXN8OS



AGTR2



SLC17A8



MSX2



DST



LTBP4



KLHL3



AAAS



RFX6



LBR



CYP3A4



F13A1



RAX2



RAC2



PREPL



ERLIN2



ANK3



NFU1



LRP4



TNFRSF13B



TNFSF11



SNAP29



LAMC3



RBM8A



ORC6



GRM6



COG5



ORC4



PDYN



CRELD1



SLC5A7



ITGA3



SPINK5



WNT4



ENAM



C1QTNF5



PDK3



HTRA2



GNB4



WNK4



COG4



MT-TI



HSPB3



MT-TL2



HCFC1



POT1



ICOS



SIGMAR1



ATP2A1



GNAT1



SOS2



CTSC



FOXP2



TMEM165



CXCR4



SH3BP2



TACR3



CFC1



ABCC2



DNAJC6



DHODH



CPA6



AK2



HOXD13



VPS45



PLOD3



KRT1



MT-ATP8



DNAAF5



TGM1



TSPAN12



IFT172



CD2AP



MRPL3



LIFR



RIMS1



CNNM4



CDC6



F10



FOXC2



STAT5B



PIK3R1



ORAI1



ZNF81



ZFP57



CYP24A1



GLE1



COL18A1



TIA1



RPL26



GNAO1



LCAT



VDR



ANO10



TNNT3



LZTFL1



COL4A6



SHANK2










REFERENCES



  • AOKI et al., “The RAS/MAPK Syndromes: Novel Roles of the RAS Pathway in Human Genetic Disorders,” Human Mutation, 2008.

  • KARCZEWSKI et al., “Analysis of protein-coding genetic variation in 60,706 humans,” Nature, 2016.

  • LANDRUM et al., “ClinVar: public archive of interpretations of clinically relevant variants,” Nucleic Acids Res., 2015.

  • MAXWELL et al., “Evaluation of ACMG-Guideline-Based Variant Classification of Cancer Susceptibility and Non-Cancer-Associated Genes in Families Affected by Breast Cancer,” Am. J. Hum. Genet., 2016.

  • MYERS et al., “The lipid phosphatase activity of PTEN is critical for its tumor supressor function,” Proc. Natl. Acad. Sci. U.S.A., 1998.

  • MYERS et al., “P-TEN, the tumor suppressor from human chromosome 10q23, is a dual-specificity phosphatase,” Proc. Natl. Acad. Sci. U.S.A., 1997.

  • HE et al., “Cowden syndrome-related mutations in PTEN associate with enhanced proteasome activity,” Cancer Res., 2013.

  • HEIKKINEN et al., “Variants on the promoter region of PTEN affect breast cancer progression and patient survival,” Breast Cancer Res., 2011.

  • JOHNSTON et al., “Conformational stability and catalytic activity of PTEN variants linked to cancers and autism spectrum disorders,” Biochemistry, 2015.

  • MARKKANEN et al., “DNA Damage and Repair in Schizophrenia and Autism: Implications for Cancer Comorbidity and Beyond,” Int. J. Mol. Sci., 2016.

  • SCHARNER et al., “Genotype-phenotype correlations in laminopathies: how does fate translate?,” Biochem. Soc. Trans., 2010.

  • ARAYA et al., “Deep mutational scanning: assessing protein function on a massive scale,” Trends Biotechnol., 2011.

  • SHENDURE et al., “Massively Parallel Genetics,” Genetics, 2016.

  • KELSIC et al., “RNA Structural Determinants of Optimal Codons Revealed by MAGE-Seq,” Cell Syst, 2016.

  • PATWARDHAN et al., “High-resolution analysis of DNA regulatory elements by synthetic saturation mutagenesis,” Nat. Biotechnol., 2009.

  • BUENROSTRO et al., “Quantitative analysis of RNA-protein interactions on a massively parallel array reveals biophysical and evolutionary landscapes,” Nat. Biotechnol., 2014.

  • GUENTHER et al., “Hidden specificity in an apparently nonspecific RNA-binding protein,” Nature, 2013.

  • ARAYA et al., “A fundamental protein property, thermodynamic stability, revealed solely from large-scale measurements of protein function,” Proc. Natl. Acad. Sci. U.S.A., 2012.

  • FOWLER et al., “High-resolution mapping of protein sequence-function relationships,” Nat. Methods, 2010.

  • MAJITHIA et al., “Prospective functional classification of all possible missense variants in PPARG,” Nat. Genet., 2016.

  • STARITA et al., “Massively Parallel Functional Analysis of BRCA1 RING Domain Variants,” Genetics, 2015.

  • BUENROSTRO et al., “Single-cell chromatin accessibility reveals principles of regulatory variation,” Nature, 2015.

  • CUSANOVICH et al., “Multiplex single-cell profiling of chromatin accessibility by combinatorial cellular indexing,” Science, 2015.

  • CAO et al., “Comprehensive single cell transcriptional profiling of a multicellular organism by combinatorial indexing,” bioRxiv, 2017.

  • ZHENG et al., “Massively parallel digital transcriptional profiling of single cells,” Nat. Commun., 2017.

  • DATLINGER et al., “Pooled CRISPR screening with single-cell transcriptome readout,” Nat. Methods, 2017.

  • JAITIN et al., “Dissecting Immune Circuits by Linking CRISPR-Pooled Screens with Single-Cell RNA-Seq,” Cell, 2016.

  • ADAMSON et al., “A Multiplexed Single-Cell CRISPR Screening Platform Enables Systematic Dissection of the Unfolded Protein Response,” Cell, 2016.

  • DIXIT et al., “Perturb-Seq: Dissecting Molecular Circuits with Scalable Single-Cell RNA Profiling of Pooled Genetic Screens,” Cell, 2016.

  • MACOSKO et al., “Highly Parallel Genome-wide Expression Profiling of Individual Cells Using Nanoliter Droplets,” Cell, 2015.

  • GAWAD et al., “Single-cell genome sequencing: current state of the science,” Nat. Rev. Genet., 2016.

  • TANAY et al., “Scaling single-cell genomics from phenomenology to mechanism,” Nature, 2017.

  • SCHWARTZMAN et al., “Single-cell epigenomics: techniques and emerging applications,” Nat. Rev. Genet., 2015.

  • BUZDIN et al., “The OncoFinder algorithm for minimizing the errors introduced by the high-throughput methods of transcriptome analysis,” Front Mol Biosci, 2014.

  • WHITFIELD et al., “Identification of genes periodically expressed in the human cell cycle and their expression in tumors,” Mol. Biol. Cell, 2002.

  • PAN et al., “Using input dependent weights for model combination and model selection with multiple sources of data,” Stat. Sin., 2006.

  • EFRON et al., “Improvements on Cross-Validation: The .632+ Bootstrap Method,” J. Am. Stat. Assoc., 1997.

  • EFRON, “How Biased is the Apparent Error Rate of a Prediction Rule?,” J. Am. Stat. Assoc., 1986.

  • EFRON, “Estimating the Error Rate of a Prediction Rule: Improvement on Cross-Validation,” J. Am. Stat. Assoc., 1983.

  • SHEN et al., “Adaptive Model Selection and Assessment for Exponential Family Distributions,” Technometrics, 2004.

  • SHEN et al., “Adaptive Model Selection,” J. Am. Stat. Assoc., 2002.

  • GEORGE et al., “Calibration and Empirical Bayes Variable Selection,” Biometrika, 2000.

  • RIPLEY et al., “Pattern Recognition and Neural Networks,” Cambridge University Press, 2008.

  • HASTIE et al., “The Elements of Statistical Learning. Data Mining, Inference, and Prediction,” Springer, 2001.

  • BURNHAM et al., “Model Selection and Multimodel Inference: A Practical Information-Theoretic Approach,” Springer, 2003.

  • YUVAL, “Bootstrapping with Noise: An Effective Regularization Technique,” Connection Science, 1996.

  • AMENDOLA et al., “Performance of ACMG-AMP Variant-Interpretation Guidelines among Nine Laboratories in the Clinical Sequencing Exploratory Research Consortium,” Am. J. Hum. Genet., 2016.

  • BERGER et al., “High-throughput Phenotyping of Lung Cancer Somatic Mutations,” Cancer Cell, 2016 30(2); pp. 214-228.

  • MACOSKO et al., “Highly Parallel Genome-wide Expression Profiling of Individual Cells Using Nanoliter Droplets,” Cell, 2015 161(5); pp. 1202-1214.

  • STARITA et al., “Deep Mutational Scanning: A Highly Parallel Method to Measure the Effects of Mutation on Protein Function,” Cold Spring Harb Protoc, 2015(8); pp. 711-714.

  • SHENDURE et al., “A framework for determining the relative effect of genetic variants,” U.S. patent application Ser. No. 15/023,355, filed Mar. 18, 2016.

  • REGEV et al., “A droplet-based method and apparatus for composite single-cell nucleic acid analysis,” International Patent Publication No. WO 2016/040476, published Mar. 17, 2016.

  • KALIA S S, et al., “Recommendations for reporting of secondary findings in clinical exome and genome sequencing, 2016 update (ACMG SF v2.0): a policy statement of the American College of Medical Genetics and Genomics,” Genet Med., 2016.

  • FUTREAL A P, et al., “A census of human cancer genes,” Nat Rev Cancer, 2004 4(3); pp. 177-183.

  • LAWRENCE M S, et al., “Discovery and saturation analysis of cancer genes across 21 tumour types,” Nature, 2014 505(7484); pp. 495-501.

  • WHIRL-CARRILLO et al., “Pharmacogenomics knowledge for personalized medicine,” Clin Pharmacol Ther, 2012 92(4); pp. 414-417.

  • RUBINSTEIN et al., “The NIH genetic testing registry: a new, centralized database of genetic tests to enable access to comprehensive information and improve transparency,” Nucleic Acids Res, 2013 4; pp. D925-35.

  • SAMOCHA K E, et al. (2017) “Regional missense constraint improves variant deleteriousness prediction,” bioRxiv:148353.

  • Kitzman, J. O., Starita, L. M., Lo, R. S., Fields, S. & Shendure, J. Massively parallel single-amino-acid mutagenesis. Nat. Methods 12, 203-206 (2015).

  • Findlay, G. M., Boyle, E. A., Hause, R. J., Klein, J. C., and Shendure, J. (2014). Saturation editing of genomic regions by multiplex homology-directed repair. Nature 513, 1-2.

  • Firnberg, E. & Ostermeier, M. PFunkel: Efficient, Expansive, User-Defined Mutagenesis. PLoS One 7, 1-10 (2012).

  • Wrenbeck, E. E. et al. Plasmid-based one-pot saturation mutagenesis. Nat. Methods 13, 928-930 (2016).

  • Wissink, E. M., Fogarty, E. A. & Grimson, A. High-throughput discovery of post-transcriptional cis-regulatory elements. BMC Genomics 17, 1-14 (2016).

  • Araya et al. 2016, U.S. Patent Application 20160378915A1.


Claims
  • 1.-137. (canceled)
  • 138. A method for determining a phenotypic impact of a target molecular variant, the method comprising:
      receiving a plurality of samples, wherein the plurality of samples comprises a plurality of molecular variants and each sample comprises a variant in a gene, wherein the plurality of molecular variants is divided into two groups:
        a. a Truth Set comprising molecular variants with known phenotypic impacts, and
        b. a Target Set comprising molecular variants with unknown phenotypic impacts, wherein the Target Set comprises the target molecular variant;
      training a machine learning model using a known association between the molecular variants in the Truth Set and the known phenotypic impacts, wherein the known association is based on a plurality of dependent features assayed using a functional assay, the functional assay generating a molecular measurement or a derivative of the molecular measurement for each molecular variant in the Truth Set; and
      determining the phenotypic impact of the target molecular variant using the trained machine learning model.
  • 139. The method of claim 138, wherein the plurality of samples comprises single cells, cellular compartments, subcellular compartments, or synthetic compartments.
  • 140. The method of claim 138, wherein the plurality of molecular variants comprises coding or non-coding variants within previously identified mutational hotspots of functional elements, genes, and pathways associated with other clinically valuable genes, mutational hotspots of functional elements, genes, and pathways associated with Mendelian disorders, pathways associated with known cancer drivers, or pathways associated with variation in drug response.
  • 141. The method of claim 138, wherein the plurality of molecular variants is derived based on clinical databases, phenotype databases, population databases, molecular annotation databases, or functional databases of variants, subjects, or populations or produced using a mutagenesis assay.
  • 142. The method of claim 138, wherein the known phenotypic impacts of the molecular variants in the Truth Set and the unknown phenotypic impacts of the target molecular variants in the Target Set measure pathogenicity, functionality, or relative effect of the molecular variant.
  • 143. The method of claim 138, wherein the molecular measurement further comprises locus-specific measurements of gene expression, protein expression, chromatin accessibility, epigenetic modification, regulatory activity, post-transcriptional processing, post-translational modification, mutation status, mutation burden, or mutation rate of molecules within each sample in the plurality of samples.
  • 144. The method of claim 138, wherein the machine learning model is a supervised learning model.
  • 145. The method of claim 138, wherein the derivative of the molecular measurement is generated using a plurality of Artificial Neural Networks (ANNs), wherein the plurality of ANNs comprises:
        a. a first ANN to generate a database of molecular measurements for the Truth Set,
        b. a second ANN to generate a plurality of associations between each of the molecular measurements in the database and one or more from the group consisting of molecular states, phenotypes, and genomics metrics using statistical methods, and
        c. a third ANN to generate the derivative of the molecular measurement by reducing dimensionality and removing noise from an association corresponding to the molecular measurement,
      wherein the derivative of the molecular measurement is used to determine the phenotypic impact of the target molecular variant.
  • 146. The method of claim 138, wherein the known association is based on a plurality of independent features that are not assayed for each molecular variant in the Truth Set and wherein the plurality of independent features comprises one or more of evolutionary, population, annotation-based, structural, dynamical, physicochemical features associated with variants, genomic coordinates, transcript coordinates, translated coordinates, and amino acids.
  • 147. The method of claim 138, wherein the method is used to inform a test subject's lifetime risk of developing cancer, wherein the test subject has the target molecular variant.
  • 148. The method of claim 138, wherein the method is used to identify significantly mutated regions and significantly mutated networks by identifying phenotype-associated mutation density.
  • 149. A system for determining a phenotypic impact of a target molecular variant, the system comprising:
      at least one computer hardware processor; and
      at least one non-transitory computer readable storage medium storing processor executable instructions that, when executed by the at least one computer hardware processor, cause the at least one computer hardware processor to perform:
        training a machine learning model using a known association between molecular variants in a Truth Set and known phenotypic impacts, wherein the known association is based on a plurality of dependent features assayed using a functional assay, the functional assay generating a molecular measurement or a derivative of the molecular measurement for each sample in the Truth Set; and
        determining the phenotypic impact of the target molecular variant using the trained machine learning model.
  • 150. The system of claim 149, wherein the known phenotypic impacts of the molecular variants in the Truth Set and the phenotypic impact of the target molecular variant measure pathogenicity, functionality, or relative effect of the molecular variant.
  • 151. The system of claim 149, wherein the molecular measurement further comprises locus-specific measurements of gene expression, protein expression, chromatin accessibility, epigenetic modification, regulatory activity, post-transcriptional processing, post-translational modification, mutation status, mutation burden, or mutation rate of molecules within each sample in the plurality of samples.
  • 152. The system of claim 149, wherein the machine learning model is a supervised learning model.
  • 153. The system of claim 149, wherein the derivative of the molecular measurement is generated using a plurality of Artificial Neural Networks (ANNs), wherein the plurality of ANNs comprises:
        a. a first ANN to generate a database of molecular measurements for the Truth Set,
        b. a second ANN to generate a plurality of associations between each of the molecular measurements in the database and one or more from the group consisting of molecular states, phenotypes, and genomics metrics using statistical methods, and
        c. a third ANN to generate the derivative of the molecular measurement by reducing dimensionality and removing noise from an association corresponding to the molecular measurement,
      wherein the derivative of the molecular measurement is used to determine the phenotypic impact of the target molecular variant.
  • 154. The system of claim 149, wherein the known association is based on a plurality of independent features that are not assayed for each sample in the Truth Set and wherein the plurality of independent features comprises one or more of evolutionary, population, annotation-based, structural, dynamical, physicochemical features associated with variants, genomic coordinates, transcript coordinates, translated coordinates, and amino acids.
  • 155. The system of claim 149, wherein the system is used to inform a test subject's lifetime risk of developing cancer, wherein the test subject has the target molecular variant.
  • 156. The system of claim 149, wherein the system is used to identify significantly mutated regions and significantly mutated networks by identifying phenotype-associated mutation density.
  • 157. At least one non-transitory computer readable storage medium storing processor executable instructions that, when executed by at least one computer hardware processor, cause the at least one computer hardware processor to perform:
      training a machine learning model using a known association between molecular variants in a Truth Set and known phenotypic impacts, wherein the known association is based on a plurality of dependent features assayed using a functional assay, the functional assay generating a molecular measurement or derivatives of the molecular measurement for each sample in the Truth Set; and
      determining a phenotypic impact of a target molecular variant using the trained machine learning model.
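
By way of non-limiting illustration only, the training and determining steps recited in claims 138, 149, and 157 may be sketched as follows. The sketch assumes a supervised model in the form of a logistic regression fitted by batch gradient descent; the claims require only a machine learning model, so this choice is an assumption made for brevity. The variant identifiers, the two assay-derived feature values per variant, the labels (1 for a deleterious impact, 0 for a neutral impact), and the function names are all hypothetical and do not come from the disclosure.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def train_model(features, labels, lr=0.5, epochs=500):
    """Fit logistic-regression weights by batch gradient descent."""
    n = len(features)
    weights = [0.0] * len(features[0])
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * len(weights)
        grad_b = 0.0
        for x, y in zip(features, labels):
            err = sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias) - y
            for j, xi in enumerate(x):
                grad_w[j] += err * xi
            grad_b += err
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias


def predict_impact(model, x):
    """Return the modeled probability of a deleterious impact."""
    weights, bias = model
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)


# Hypothetical Truth Set: assay-derived features with known impacts.
truth_set = {
    "var_A": ([-1.2, -0.9], 1),
    "var_B": ([-1.0, -1.1], 1),
    "var_C": ([-0.8, -1.3], 1),
    "var_D": ([0.9, 1.0], 0),
    "var_E": ([1.1, 0.8], 0),
    "var_F": ([1.0, 1.2], 0),
}

# Hypothetical Target Set: assay-derived features with unknown impacts.
target_set = {"var_X": [-1.1, -1.0], "var_Y": [1.0, 0.9]}

model = train_model(
    [feats for feats, _ in truth_set.values()],
    [label for _, label in truth_set.values()],
)
scores = {name: predict_impact(model, feats) for name, feats in target_set.items()}
```

In this sketch, `scores` maps each hypothetical Target Set variant to a probability-like value that could be thresholded or calibrated downstream. Any other supervised learner could be substituted for the logistic regression without changing the Truth Set / Target Set structure.
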
CROSS REFERENCE TO RELATED APPLICATIONS

This application is a continuation of U.S. patent application Ser. No. 16/011,753, filed Jun. 19, 2018, which claims priority to U.S. Provisional Patent Application No. 62/521,759, filed on Jun. 19, 2017, now expired, and U.S. Provisional Patent Application No. 62/640,432, filed on Mar. 8, 2018, now expired, all of which are herein incorporated by reference in their entireties.

Provisional Applications (2)
Number      Date        Country
62640432    Mar 2018    US
62521759    Jun 2017    US

Continuations (1)
Relation    Number      Date        Country
Parent      16011753    Jun 2018    US
Child       18081459                US