The technology disclosed relates to artificial intelligence type computers and digital data processing systems and corresponding data processing methods and products for emulation of intelligence (i.e., knowledge based systems, reasoning systems, and knowledge acquisition systems); and including systems for reasoning with uncertainty (e.g., fuzzy logic systems), adaptive systems, machine learning systems, and artificial neural networks. In particular, the technology disclosed relates to using deep convolutional neural networks to analyze ordered data.
The following are incorporated by reference for all purposes as if fully set forth herein:
Sundaram, L. et al. Predicting the clinical impact of human mutation with deep neural networks. Nat. Genet. 50, 1161-1170 (2018);
Jaganathan, K. et al. Predicting splicing from primary sequence with deep learning. Cell 176, 535-548 (2019);
U.S. Patent Application No. 62/573,144, titled “TRAINING A DEEP PATHOGENICITY CLASSIFIER USING LARGE-SCALE BENIGN TRAINING DATA,” filed Oct. 16, 2017 (Attorney Docket No. ILLM 1000-1/IP-1611-PRV);
U.S. Patent Application No. 62/573,149, titled “PATHOGENICITY CLASSIFIER BASED ON DEEP CONVOLUTIONAL NEURAL NETWORKS (CNNs),” filed Oct. 16, 2017 (Attorney Docket No. ILLM 1000-2/IP-1612-PRV);
U.S. Patent Application No. 62/573,153, titled “DEEP SEMI-SUPERVISED LEARNING THAT GENERATES LARGE-SCALE PATHOGENIC TRAINING DATA,” filed Oct. 16, 2017 (Attorney Docket No. ILLM 1000-3/IP-1613-PRV);
U.S. Patent Application No. 62/582,898, titled “PATHOGENICITY CLASSIFICATION OF GENOMIC DATA USING DEEP CONVOLUTIONAL NEURAL NETWORKS (CNNs),” filed Nov. 7, 2017 (Attorney Docket No. ILLM 1000-4/IP-1618-PRV);
U.S. patent application Ser. No. 16/160,903, titled “DEEP LEARNING-BASED TECHNIQUES FOR TRAINING DEEP CONVOLUTIONAL NEURAL NETWORKS,” filed on Oct. 15, 2018 (Attorney Docket No. ILLM 1000-5/IP-1611-US);
U.S. patent application Ser. No. 16/160,986, titled “DEEP CONVOLUTIONAL NEURAL NETWORKS FOR VARIANT CLASSIFICATION,” filed on Oct. 15, 2018 (Attorney Docket No. ILLM 1000-6/IP-1612-US);
U.S. patent application Ser. No. 16/160,968, titled “SEMI-SUPERVISED LEARNING FOR TRAINING AN ENSEMBLE OF DEEP CONVOLUTIONAL NEURAL NETWORKS,” filed on Oct. 15, 2018 (Attorney Docket No. ILLM 1000-7/IP-1613-US);
U.S. patent application Ser. No. 16/160,978, titled “DEEP LEARNING-BASED SPLICE SITE CLASSIFICATION,” filed on Oct. 15, 2018 (Attorney Docket No. ILLM 1001-4/IP-1680-US);
U.S. patent application Ser. No. 16/407,149, titled “DEEP LEARNING-BASED TECHNIQUES FOR PRE-TRAINING DEEP CONVOLUTIONAL NEURAL NETWORKS,” filed May 8, 2019 (Attorney Docket No. ILLM 1010-1/IP-1734-US);
U.S. patent application Ser. No. 17/232,056, titled “DEEP CONVOLUTIONAL NEURAL NETWORKS TO PREDICT VARIANT PATHOGENICITY USING THREE-DIMENSIONAL (3D) PROTEIN STRUCTURES,” filed on Apr. 15, 2021, (Atty. Docket No. ILLM 1037-2/IP-2051-US);
U.S. Patent Application No. 63/175,495, titled “MULTI-CHANNEL PROTEIN VOXELIZATION TO PREDICT VARIANT PATHOGENICITY USING DEEP CONVOLUTIONAL NEURAL NETWORKS,” filed on Apr. 15, 2021, (Atty. Docket No. ILLM 1047-1/IP-2142-PRV);
U.S. Patent Application No. 63/175,767, titled “EFFICIENT VOXELIZATION FOR DEEP LEARNING,” filed on Apr. 16, 2021, (Atty. Docket No. ILLM 1048-1/IP-2143-PRV);
U.S. patent application Ser. No. 17/468,411, titled “ARTIFICIAL INTELLIGENCE-BASED ANALYSIS OF PROTEIN THREE-DIMENSIONAL (3D) STRUCTURES,” filed on Sep. 7, 2021, (Atty. Docket No. ILLM 1037-3/IP-2051A-US);
U.S. patent application Ser. No. 17/975,536, titled “MASK PATTERN FOR PROTEIN LANGUAGE MODELS,” filed on Oct. 27, 2022, (Atty. Docket No. IP-2296-US1);
U.S. patent application Ser. No. 17/947,049, titled “DEEP LEARNING NETWORK FOR EVOLUTIONARY CONSERVATION,” filed on Sep. 16, 2022, (Atty. Docket No. IP-2299-US);
U.S. Provisional Patent Application No. 63/253,122, titled “PROTEIN STRUCTURE-BASED PROTEIN LANGUAGE MODELS,” filed Oct. 6, 2021 (Attorney Docket No. ILLM 1050-1/IP-2164-PRV);
U.S. Provisional Patent Application No. 63/281,579, titled “PREDICTING VARIANT PATHOGENICITY FROM EVOLUTIONARY CONSERVATION USING THREE-DIMENSIONAL (3D) PROTEIN STRUCTURE VOXELS,” filed Nov. 19, 2021 (Attorney Docket No. ILLM 1060-1/IP-2270-PRV); and
U.S. Provisional Patent Application No. 63/281,592, titled “COMBINED AND TRANSFER LEARNING OF A VARIANT PATHOGENICITY PREDICTOR USING GAPED AND NON-GAPED PROTEIN SAMPLES,” filed Nov. 19, 2021 (Attorney Docket No. ILLM 1061-1/IP-2271-PRV).
The subject matter discussed in this section should not be assumed to be prior art merely as a result of its mention in this section. Similarly, a problem mentioned in this section or associated with the subject matter provided as background should not be assumed to have been previously recognized in the prior art. The subject matter in this section merely represents different approaches, which in and of themselves can also correspond to implementations of the claimed technology.
Interpreting the effects of human genetic variants and their impact on disease risk is a foundational component of personalized genomic medicine. Out of more than 70 million possible nonsynonymous variants in the human genome, fewer than one percent are annotated and the remainder are variants of unknown clinical significance. In particular, approaches for the classification of human genetic variants with rare frequency are of consequence, due to the correlation between variant rarity and variant pathogenicity. However, the analysis of rare variants is intrinsically limited by low frequency in the human population. One strategy for interpreting the clinical significance of human genetic variants involves the use of information from closely related primate species to infer the pathogenicity of orthologous human variants. By conducting population sequencing studies in closely related non-human primate species, some models can catalog common variants and rule these out as pathogenic in humans, analogous to how sequencing more diverse human populations has helped to advance clinical variant interpretation.
The overlap of evolutionary biology and genetic medicine includes the detection of species-specific and lineage-specific evolutionary adaptation and selective pressures. As whole genome sequencing has increased in efficiency, biotechnology firms and researchers have sequenced whole genomes (or large percentages of genomes) for thousands of humans and several samples from primate and other vertebrate species. This data can be processed to uncover evidence for the origin of modern human genetic traits and the variation of these traits. In comparison to human populations, for instance, primates exhibit relatively fewer missense mutations, consistent with the majority of newly-arising human missense mutations being removed by natural selection due to their deleteriousness. Despite a common evolutionary lineage, humans and primates (or various other species) comprise genes that are subject to varying degrees of selective constraint. Consequently, the comparative study of evolutionary differences, such as comparative selective constraint, has numerous practical applications to augment the characterization of human genetic variants.
Because humans and certain primate species exhibit closely related genetics and protein sequences, some pathogenicity models have been used to analyze related human and primate protein sequences to infer the pathogenicity of orthologous human variants. For example, end-to-end deep learning approaches for variant effect predictions are applied to predict the pathogenicity of missense variants from protein sequence and sequence conservation data (see Sundaram, L. et al. Predicting the clinical impact of human mutation with deep neural networks. Nat. Genet. 50, 1161-1170 (2018), referred to herein as “PrimateAI”). PrimateAI uses deep neural networks trained on variants of known pathogenicity with data augmentation using cross-species information. PrimateAI in particular uses sequences of wild-type and mutant proteins to compare the difference and decide the pathogenicity of mutations using the trained deep neural networks. Such an approach that utilizes the protein sequences for pathogenicity prediction is promising because it can avoid the problem of circularity and overfitting to previous knowledge. However, relative to the amount of data needed to train the deep neural networks effectively, the amount of clinical data available in ClinVar is small. To overcome this data scarcity, PrimateAI uses common human variants and variants from primates as benign data, while simulated variants based on trinucleotide context are used as unlabeled data.
PrimateAI outperforms prior methods when trained directly upon sequence alignments. PrimateAI learns important protein domains, conserved amino acid positions, and sequence dependencies directly from the training data consisting of about 120,000 human samples. PrimateAI substantially exceeds the performance of other variant pathogenicity prediction tools in differentiating benign and pathogenic de-novo mutations in candidate developmental disorder genes, and in reproducing prior knowledge in ClinVar. These results suggest that PrimateAI is an important step forward for variant classification tools that may lessen the reliance of clinical reporting on prior knowledge.
Despite the utility of PrimateAI in predicting variant pathogenicity, PrimateAI's pathogenicity predictions do not directly exhibit or indicate different selective constraints exerted on genes of human and non-human primate (or other) species. Therefore, an opportunity arises to compare natural selection acting on individual genes across the primate lineage and identify genes with differential selective constraint in humans and non-human primates. Such a comparison may yield more accurate variant pathogenicity predictions or may be used to better evaluate existing pathogenicity predictions.
In the drawings, like reference characters generally refer to like parts throughout the different views. Also, the drawings are not necessarily to scale, with an emphasis instead generally being placed upon illustrating the principles of the technology disclosed. In the following description, various implementations of the technology disclosed are described with reference to the following drawings, in which:
The following discussion is presented to enable any person skilled in the art to make and use the technology disclosed and is provided in the context of a particular application and its requirements. Various modifications to the disclosed implementations will be readily apparent to those skilled in the art, and the general principles defined herein may be applied to other implementations and applications without departing from the spirit and scope of the technology disclosed. Thus, the technology disclosed is not intended to be limited to the implementations shown but is to be accorded the widest scope consistent with the principles and features disclosed herein.
The detailed description of various implementations will be better understood when read in conjunction with the appended drawings. To the extent that the figures illustrate diagrams of the functional blocks of the various implementations, the functional blocks are not necessarily indicative of the division between hardware circuitry. Thus, for example, one or more of the functional blocks (e.g., modules, processors, or memories) may be implemented in a single piece of hardware (e.g., a general purpose signal processor or a block of random access memory, hard disk, or the like) or multiple pieces of hardware. Similarly, the programs may be stand-alone programs, may be incorporated as subroutines in an operating system, may be functions in an installed software package, and the like. It should be understood that the various implementations are not limited to the arrangements and instrumentality shown in the drawings.
The processing engines and databases of the figures, designated as modules, can be implemented in hardware or software, and need not be divided up in precisely the same blocks as shown in the figures. Some of the modules can also be implemented on different processors, computers, or servers, or spread among a number of different processors, computers, or servers. In addition, it will be appreciated that some of the modules can be combined, operated in parallel or in a different sequence than that shown in the figures without affecting the functions achieved. The modules in the figures can also be thought of as flowchart steps in a method. A module also need not necessarily have all its code disposed contiguously in memory; some parts of the code can be separated from other parts of the code with code from other modules or other functions disposed in between.
The technologies disclosed can be used to identify genes with differential selective constraint between orthologous species. The technology disclosed can be used to augment the study of clinical disease in a target species by comparing the selective forces acting on individual genes in the target species and other closely-related non-target species. Identifying genes with differential selective constraint between species is complicated by species-specific differences in diversity, demographic factors, allelic frequency, and sample size between species cohorts. To address these challenges, the methods and systems disclosed comprise a population genetic-based approach to estimate statistical measures of selective constraint and leverage the estimated statistical measures of selective constraint to determine if purifying selection against nonsynonymous mutations in a target species is stronger or weaker than observed in closely-related non-target species. Furthermore, the methods and systems disclosed comprise strategies to control for outlier genes that have an abnormal distribution of variants because of data quality issues.
As mentioned above, in some embodiments, a PrimateAI model (e.g., PrimateAI 1D, PrimateAI 2D, or PrimateAI 3D) or another variant pathogenicity prediction model comprises neural networks trained on variants of known pathogenicity for learning important protein domains, conserved amino acid positions, and sequence dependencies directly from the training data. In some cases, the training data comprises about 120,000 human samples. To substantiate or validate predictions from PrimateAI or another variant pathogenicity prediction model, the disclosed methods and systems can compare selective constraint across species. For instance, the disclosed systems can compare selective constraint for a target species (e.g., humans) against selective constraint for one or more non-target species (e.g., primates) to detect a difference between the species-specific measures of selective constraint (e.g., differential selective constraint). When the disclosed systems can detect genes with high measures of selective constraint (e.g., constrained genes) in both target (e.g., human) and non-target (e.g., primate) species, in some embodiments, the disclosed systems can validate the pathogenicity predictions of PrimateAI. In some embodiments, the disclosed methods and systems perform or utilize the techniques described in U.S. patent application Ser. No. 17/975,536 and/or U.S. patent application Ser. No. 17/947,049, both of which are incorporated by reference above.
To validate a pathogenicity prediction or a pathogenicity score for a variant, in some embodiments, the disclosed systems determine an indicator of a differential selective constraint for a gene of interest between a target species and one or more non-target species. Additionally, or alternatively, the disclosed systems determine a deviation metric to approximate a magnitude and a direction of differential selective constraint between a target species and one or more non-target species. By determining an indicator of differential selective constraint or a deviation metric, the disclosed systems provide a supplementary metric indicating a confidence or reliability of a pathogenicity prediction or pathogenicity score for a variant amino-acid sequence output by a PrimateAI model or another variant pathogenicity prediction model.
New applications of evolutionary biology and population genetics inform the characterization of genetic mechanisms contributing to human health outcomes. New, rapidly developing methods augment traditional genetic medicine with new contextual data on the evolutionary history of human traits to shed light on the phylogeny (e.g., evolutionary origin and comparative adaptation of traits throughout speciation) and adaptive significance (e.g., mechanism and interaction dynamics that aid in explaining why natural selection has not further depleted pathogenic alleles). The ability of these interdisciplinary studies to produce useful insights on phenomena that have long confounded the scientific community has established evolutionary medicine as a field of study in its own right.
While the increase in available biological sequence data has advanced the accuracy and efficiency of clinical variant interpretation, there are still tens of millions of possible protein variants in the human genome with unknown clinical significance, many of which are rare variants with the potential to be highly penetrant and deleterious to health. Despite the high penetrance of rare deleterious variants, their rarity limits their ability to explain phenotypic variance to a small fraction of the population, since the majority of individuals do not carry a rare deleterious variant for a given phenotype. Further complicating matters, genetic disorders caused by variants in multiple genes (polygenic genetic disorders) outnumber those caused by variants in a single gene (monogenic genetic disorders). Hence, a polygenic genetic disorder characterized by a particular disease phenotype can often arise through a range of many non-overlapping genotype combinations. Similarly, countless genetic disorders are observed within the human population with severely limited insight as to their genetic and molecular mechanism of pathogenesis. Characterization of some genetic disorders, such as major depressive disorder, is limited by extensive polygenic complexity. Other genetic disorders, such as ectodermal dysplasia, are challenging to characterize due to their rarity in the population. By contrast, in some cases, outlier genes can be an indication of a monogenic genetic disorder. By determining an indicator of differential selective constraint or a deviation metric, the disclosed systems provide an indicator or metric that facilitates determining recessive inheritance. For instance, the disclosed indicator of differential selective constraint or deviation metric can indicate when a gene of interest is not (or is less) subject to selective constraint and, therefore, more likely to exhibit recessive alleles that can be inherited.
Evolutionary perspectives on genetic disease may provide scalable approaches for interpreting the effects of human genetic variants and their impact on disease risk by using information from closely-related primate species to infer the pathogenicity of orthologous human variants. In one potential application of the technology disclosed, human genetic variants can be mapped to a reference genome from a closely-related primate species. The pathogenicity of these human genetic variants can be inferred based on their allele frequencies in other primate populations. This technique can be applied to systematically catalog common variants and rule these out as unlikely to be pathogenic in humans, analogous to how sequencing more diverse human populations has advanced clinical variant interpretation.
However, pathogenicity within a closely-related primate species does not always correlate with pathogenicity in humans due to a plethora of differences between species, including both factors that are quite easy to observe (e.g., differences in environment, demography, and non-overlapping traits inherent to speciation) and more complex genetic factors (e.g., rates of background mutation or surrounding sequence context for a particular allele) that require more complex analysis. Thus, the extent to which evolutionarily-related species can be used to infer knowledge about a proximate species is limited by the extent to which one understands the evolutionary relationship between the cohort of species under analysis. Understanding how forces of natural selection exerted upon species influence the similarity and divergence of orthologous traits between species is a key component for understanding the evolutionary context for these traits. Hence, the comparison of selective constraint on a particular gene between orthologous species provides a rich source of information that is applicable to understanding the clinical significance of that particular gene.
In one or more embodiments, the methods and systems disclosed herein provide improvement or advantages over prior systems. For example, the disclosed methods and systems are more computationally efficient than prior systems. In some cases, to identify differential selection between a target species and a non-target species, some prior systems determine or calculate mutation rates for every gene of a target species and/or a non-target species. For genetically complex species, such as humans and other primates, calculating mutation rates for such large numbers of genes can be computationally expensive, requiring substantial processing power, memory, and computing time. By contrast, the methods and systems disclosed herein use an empirical phase approach to share information across multiple genes and/or species. Indeed, as described below, the disclosed methods and systems can implement a population genetic model that includes two phases: 1) modeling counts of synonymous segregating sites to learn a neutral background distribution of mutation rates per gene per species, and 2) applying the learned neutral background distribution to estimate the average selection per gene across species. By using the empirical phase approach, the disclosed methods and systems are more computationally efficient than prior systems, expending less processing power, memory, and computing time.
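By way of non-limiting illustration, the two phases described above can be sketched in Python as follows. The sketch is a simplification and is not taken from this disclosure: it assumes a constant population size, assumes equal mutation rates at synonymous and nonsynonymous sites, and produces a single point estimate of the neutral rate rather than a learned background distribution. All function names, parameter names, and input values are hypothetical.

```python
def harmonic_number(n_chromosomes):
    """Sum of 1/i for i = 1 .. n-1 (Watterson's correction for sample size)."""
    return sum(1.0 / i for i in range(1, n_chromosomes))


def neutral_rate_per_site(syn_segregating, syn_sites, n_chromosomes):
    """Phase 1: per-site neutral rate estimated from synonymous sites."""
    return syn_segregating / (syn_sites * harmonic_number(n_chromosomes))


def expected_neutral_count(rate_per_site, nonsyn_sites, n_chromosomes):
    """Phase 2 input: nonsynonymous segregating sites expected without selection."""
    return rate_per_site * nonsyn_sites * harmonic_number(n_chromosomes)


def depletion(observed_nonsyn, expected_nonsyn):
    """Fraction of expected nonsynonymous sites that is missing (0 = none missing)."""
    return 1.0 - observed_nonsyn / expected_nonsyn


# Hypothetical gene in one species: 30 synonymous segregating sites over 300
# synonymous sites, 35 nonsynonymous segregating sites over 700 nonsynonymous
# sites, sampled from 10 chromosomes.
rate = neutral_rate_per_site(30, 300, 10)
expected = expected_neutral_count(rate, 700, 10)
gene_depletion = depletion(35, expected)
```

In this sketch, the synonymous sites of a gene serve as the neutral reference for the same gene in the same species, so the neutral expectation is computed once per gene per species and then reused when selection is estimated.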
As set forth below, in certain implementations, the methods and systems disclosed comprise a technology for identifying differential selection between a target species and closely-related non-target species at gene resolution. In certain cases, the methods and systems disclosed further comprise a model configured to quantify the broad-scale similarity of natural selection between a target species and one or more closely-related non-target species and to identify genes evolving subject to significantly different selective pressure in the target species as compared to the cohort of closely-related non-target species. Moreover, the methods and systems disclosed may be configured to aggregate a plurality of non-target species into a cohort that can be described by a shared single statistical metric or a distribution of statistical metrics in some implementations. In other implementations, the disclosed model can be applied to a plurality of genes at species resolution to share information about mutation rate and data quality across genes. By sharing information about mutation rate and data quality across genes at species resolution, as well as sharing information about selective constraint and data quality across species at gene resolution, in some embodiments, the disclosed methods and systems can control for outlier genes that have an abnormal distribution of variants.
Consequently, the disclosed methods and systems can improve data quality over prior systems that make assumptions about the quality of data fed into predictive models, such as PrimateAI. Indeed, because different species have different distributions of mutation rates, and because the disclosed systems and methods account for multiple species, the techniques employed by the disclosed methods and systems address data quality issues for data input into a model, resulting in more accurate predictions from models such as PrimateAI. In some embodiments, the disclosed methods and systems perform or utilize the techniques described in U.S. patent application Ser. No. 17/975,536 and/or U.S. patent application Ser. No. 17/947,049, both of which are incorporated by reference above. Thus, the methods and systems disclosed are applicable to a range of species populations of varying sample size, cohort size, and data quality. Indeed, in some embodiments, the disclosed methods and systems generate and/or apply a different, unique model for each different target species, such as humans, chimpanzees, orangutans, or some other primate (or non-primate) target species. For instance, if the disclosed methods and systems are applied to generate selection coefficients (or to determine selective constraint) for sixty different target species, the methods and systems would generate and apply sixty different population genetic models.
In one implementation of the technology disclosed, differential selective constraint per gene between the target species and non-target species is determined by estimating the population-scaled selection coefficient per gene in the target species and the average population-scaled selection coefficient per gene across a cohort of non-target species, followed by applying a likelihood ratio test to determine if selection against protein-changing mutations is different between the target species and the non-target species. In another implementation of the technology disclosed, differential selective constraint per gene between the target species and non-target species is determined by estimating the relationship between the depletion of protein-changing mutations in the target species and the depletion of protein-changing mutations in the cohort of non-target species, followed by analysis to determine if the depletion of protein-changing mutations in the target species is significantly different from what is expected based on the depletion of protein-changing mutations in the cohort of non-target species. In many implementations, a combination of both described approaches for estimating selective constraint is used for more robust validation.
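By way of non-limiting illustration, a likelihood ratio test of the kind referred to in the first implementation can be sketched in Python as follows. The sketch substitutes a simplified model for the population genetic model of this disclosure: each observed count of protein-changing segregating sites is treated as a Poisson random variable whose mean is a neutral expected count multiplied by a depletion factor, the null hypothesis is that the target species and the non-target cohort share one depletion factor, and the alternative is that each has its own. The test statistic is compared against a chi-square distribution with one degree of freedom. All names and input values are hypothetical.

```python
import math


def poisson_loglik(observed, mean):
    """Log-likelihood of a Poisson count; a zero mean only allows a zero count."""
    if mean <= 0.0:
        return 0.0 if observed == 0 else float("-inf")
    return observed * math.log(mean) - mean - math.lgamma(observed + 1)


def differential_constraint_lrt(obs_target, exp_target, obs_cohort, exp_cohort):
    """Return (statistic, p_value, target_factor, cohort_factor) for one gene."""
    # Maximum-likelihood depletion factors under each hypothesis.
    shared = (obs_target + obs_cohort) / (exp_target + exp_cohort)
    target_factor = obs_target / exp_target
    cohort_factor = obs_cohort / exp_cohort

    ll_null = (poisson_loglik(obs_target, exp_target * shared)
               + poisson_loglik(obs_cohort, exp_cohort * shared))
    ll_alt = (poisson_loglik(obs_target, exp_target * target_factor)
              + poisson_loglik(obs_cohort, exp_cohort * cohort_factor))

    statistic = max(0.0, 2.0 * (ll_alt - ll_null))
    # Survival function of a chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(statistic / 2.0))
    return statistic, p_value, target_factor, cohort_factor


# Hypothetical gene: strongly depleted in the target species, mildly depleted
# in the non-target cohort.
stat, p, f_target, f_cohort = differential_constraint_lrt(5, 50.0, 80, 100.0)
```

A small p-value together with a target factor below the cohort factor would suggest, under these assumptions, stronger purifying selection in the target species; the reverse ordering would suggest weaker selection.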
The following disclosure is organized as follows. First, a brief summary of the terminology used herein is given as a guide to aid in navigating the description of the technology disclosed. Next, a first system that can be used to implement the technology disclosed is introduced, according to one embodiment of the technology disclosed. To further elaborate upon the input and output data types described, a plurality of exemplary genetic sequences and related sample calculations are presented, followed by a description of one approach by which the disclosed technology employs these genetic concepts and methodologies to detect differential selective constraint between species, in accordance with the first system presented.
A second system is then introduced that can also be used to implement the technology disclosed, according to another embodiment of the technology disclosed. Again, a plurality of exemplary genetic sequences and related sample calculations are presented as a briefing prior to describing another approach by which the disclosed technology is used to detect differential selective constraint between species, in accordance with the second system presented.
Finally, a representative sample of performance results obtained by various embodiments of the technology disclosed is detailed as objective indicia of inventiveness and non-obviousness.
The description herein refers to a plurality of implementations of the disclosed methods and systems, wherein a target species and a cohort of non-target species are compared to identify differential selective constraint between orthologous genes. Definitions of terminology, terminology to be understood as synonymous, and terminology not to be understood as synonymous will now be presented.
A target species is a species of interest under analysis by the technology disclosed. A variety of objectives may influence the selection of a species as the target species, such as the relevance of the selected species to a particular field of study within medicine, ecology, and so on. In many implementations, the target species is selected to be human due to the significant clinical implication of human genetic variants. However, the target species need not always be human and neither the selected species nor the reasoning for selection of a target species should be considered a limitation of the technology disclosed. Correspondingly, a non-target species is a species of interest under comparative analysis to the target species by the technology disclosed, wherein the target species and a particular non-target species are non-overlapping. In many implementations, the non-target species is selected to be an evolutionarily proximate species relative to the target species. However, a particular species need not be evolutionarily proximate to the target species to be selected as a non-target species and evolutionary distance to the selected target species should not be considered a limitation of the technology disclosed.
Herein, an evolutionarily proximate species may also be referred to synonymously as a closely-related species, orthologous species, or homologous species, wherein homology refers to similarity of the structure, physiology, or development of different species of organisms based on their shared evolutionary ancestry and orthology refers to a particular class of homology wherein homology is retained following speciation. Analogous to the homology of two or more species, genes may also be classified as homologous or orthologous. Herein, the terms “homologous genes”, “orthologous genes”, “homologs”, and “orthologs” may be used synonymously.
The technology disclosed generally relates to the comparison of target species and non-target species. In some implementations, the target species may be compared to a single non-target species. In other implementations, the target species may be compared to an aggregated cohort of two or more non-target species. A cohort of non-target species may be aggregated by a variety of methods (e.g., pooling values, averaging values, fitting a distribution of values, and many other descriptive statistics and functions) without deviating from the scope of the technology disclosed and as such, should not be considered a limitation. Frequently, the terms “cohort of non-target species”, “one or more non-target species”, “plurality of non-target species”, or simply “non-target species” are used interchangeably herein to describe a plurality of embodiments of the disclosed technology. Moreover, many implementations of one or more components of the disclosed technology are explained in the context of a plurality of non-target species for simplicity. Nonetheless, it is to be understood that the technology disclosed may be implemented for any number of non-target species (for example, calculations wherein non-target species data is pooled into a single cohort, a single non-target species forms a cohort of one).
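By way of non-limiting illustration, two of the aggregation methods mentioned above (pooling values and averaging values) can be sketched in Python as follows. The sketch assumes that each non-target species contributes, for a given gene, a hypothetical pair consisting of an observed count and an expected count; the ratio of observed to expected is used only as an example of a per-species value. The function names and the input values are hypothetical.

```python
from statistics import mean


def pooled_ratio(cohort):
    """Pool counts across species first, then form a single ratio."""
    total_observed = sum(observed for observed, _ in cohort)
    total_expected = sum(expected for _, expected in cohort)
    return total_observed / total_expected


def averaged_ratio(cohort):
    """Form one ratio per species first, then average the ratios."""
    return mean(observed / expected for observed, expected in cohort)


# Hypothetical (observed, expected) pairs for one gene in two non-target species.
two_species = [(10, 20.0), (30, 40.0)]
# A single non-target species forms a cohort of one.
one_species = [(10, 20.0)]

pooled = pooled_ratio(two_species)
averaged = averaged_ratio(two_species)
```

Pooling weights each species by the size of its counts, whereas averaging weights each species equally. For a cohort of one, both methods return the value of the single non-target species.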
In some embodiments, the technology disclosed further comprises a plurality of statistical models and probability distributions applied to the genetic data of species under analysis. A number of systems, algorithms, models, et cetera may be leveraged to the same end; however, in the interest of conciseness, a limited number of example models are described in detail herein as a user skilled in the art will recognize the ways in which the disclosed models may be altered or augmented in accordance with the technology disclosed. These models include, but are not limited to, population genetics models, regression models, classification models, significance tests, machine learning classifiers, or correction methods.
The statistical models disclosed within various implementations of the technology disclosed process data related to genes and genetic variation within genetic sequences. A particular gene is said to exist in a number of genotypes, or variations of the gene with non-identical sequences (also referred to as variants). Genotypes may also be referred to synonymously as alleles. Scientific literature sometimes uses a narrower definition in which a genotype comprises a pair of alleles. Herein, the broader definition of an allele is used such that “genotype”, “allele”, “variant”, and “genetic variant” are all used synonymously.
Variants are differentiated from one another by their respective mutations, or changes to the nucleic acid sequence within that gene or genomic region. For simplicity, this disclosure provides single nucleotide polymorphisms as an example of a type of mutation, but other classes of mutation exist. The terms “single nucleotide polymorphism”, “single nucleotide variant”, “segregating site”, “segregating variant”, “mutation”, or “segregating mutation” may all be used herein and should be understood as equivalent terminology. Mutations that do not change the protein generated by a particular gene are also referred to as synonymous mutations, whereas protein-changing mutations are referred to as nonsynonymous. For a position within a genetic sequence that does not have a known variation within a population, the position may be described as “fixed” or “conserved”, contrasted by “non-conserved” positions within a genetic sequence with known variation.
System 100 comprises a process 101 configured to obtain counts of segregating sites per gene per species and fit a Poisson Random Field (PRF) model 104 to the observed count of segregating sites, wherein the observed count of segregating sites is a Poisson random variable with a mean value determined by a mutation rate, a demography, and a sample size. The system 100 uses the PRF model 104 to learn a neutral background distribution of mutation rates per gene per species. The system 100 further applies the neutral background distribution of mutation rates per gene per species to estimate the average population-scaled selection coefficient per gene across species, wherein the average population-scaled selection coefficient per gene across species for a single species is equal to the estimated population-scaled selection coefficient per gene learned from the PRF model per gene, and wherein the average population-scaled selection coefficient per gene across species for two or more species is equal to the average value of the sum of each estimated population-scaled selection coefficient per gene per species learned from the PRF model per gene per species.
In some embodiments, the system 100 further applies process 101 to a target species i to determine an observed count of segregating sites in gene g and species i from database 102. As part of the process 101, the system 100 also fits the PRF model 104 to the count of segregating sites in gene g and species i. In addition, the system 100 applies or employs the PRF model 104 to learn the background distribution of mutation rates for gene g and species i 106. Based on the background distribution of mutation rates for gene g and species i 106, the system 100 further estimates an average selection coefficient for gene g and species i 108.
Additionally, in some cases, the system 100 applies process 101 to a plurality of non-target species {s1 . . . sn} to determine an observed count of segregating sites in gene g and non-target species s1 from database 122. The system 100 fits PRF model 123 to the observed count of segregating sites in gene g and non-target species s1. The system 100 further applies or employs the PRF model 123 to learn the background distribution of mutation rates for gene g and non-target species s1 124. Based on the background distribution of mutation rates for gene g and non-target species s1 124, the system 100 further estimates a selection coefficient for gene g and non-target species s1 125. In a similar fashion, the system 100 applies the process 101 to non-target species s2 (wherein process 101 is followed as previously described, correspondingly to database 132, PRF model 133, background distribution 134, and selection coefficient 135) through non-target species sn (wherein process 101 is followed as previously described, correspondingly to database 142, PRF model 143, background distribution 144, and selection coefficient 145).
Next, the system 100 determines or computes the average selection coefficient for gene g across non-target species {s1 . . . sn} 148. For instance, the system 100 determines an average of the sum of each estimated selection coefficient for gene g per species (e.g., coefficients 125 through 145).
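The averaging step described above can be sketched as follows. This is a minimal illustration only; the function name, the species labels, and the numeric estimates are hypothetical and do not come from the disclosure.

```python
from statistics import fmean


def average_selection_coefficient(per_species_estimates):
    """Average the per-species selection coefficient estimates for one gene.

    `per_species_estimates` maps a species label to the coefficient estimated
    for that species (hypothetical structure, for illustration only).
    """
    values = list(per_species_estimates.values())
    if not values:
        raise ValueError("at least one non-target species is required")
    # Sum of the per-species estimates divided by the number of species.
    return fmean(values)


# Hypothetical estimates for gene g in three non-target species.
estimates = {"s1": -2.0, "s2": -3.0, "s3": -4.0}
cohort_average = average_selection_coefficient(estimates)

# A cohort of one simply returns that species' own estimate.
single_species_average = average_selection_coefficient({"s1": -2.0})
```

With a single non-target species the average reduces to that species' estimate, which matches the single-species case described for process 101.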
In some embodiments, the system 100 tests a pair of average population-scaled selection coefficients per gene across species, such as the selection coefficient for gene g in target species i 108 and the average selection coefficient for gene g across non-target species {s1 . . . sn} 148, for significance via significance test 128. Significance test 128 may be a likelihood-ratio test, a chi-squared test, a t-test, and so on. A user skilled in the art will recognize that the system 100 may implement a plurality of significance tests (with the appropriate data standardizations to meet necessary statistical assumptions) as significance test 128.
In applying the significance test 128, the system 100 can utilize a null model (or null hypothesis) such that the selection coefficient for gene g in target species i 108 and the average selection coefficient for gene g across non-target species {s1 . . . sn} 148 are not significantly different. Conversely, in some embodiments, the system 100 can utilize an alternative model (or alternative hypothesis) when applying the significance test 128 such that the selection coefficient for gene g in target species i 108 and the average selection coefficient for gene g across non-target species {s1 . . . sn} 148 are significantly different. For the significance test 128, significance is defined by a pre-determined alpha level. If the system 100 determines a difference between the selection coefficient for gene g in target species i 108 and the average selection coefficient for gene g across non-target species {s1 . . . sn} 148 to be significantly different (e.g., to satisfy a threshold alpha level), the system 100 can thus determine that there is differential selective constraint for gene g between target species i and non-target species {s1 . . . sn} 129. If the system 100 determines a difference between the selection coefficient for gene g in target species i 108 and the average selection coefficient for gene g across non-target species {s1 . . . sn} 148 to not be significantly different (e.g., to not satisfy a threshold alpha level), the system 100 can thus determine that there is no differential selective constraint for gene g between target species i and non-target species {s1 . . . sn} 139.
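As one possible illustration of significance test 128, the following sketch applies a likelihood-ratio test with a pre-determined alpha level. It assumes, hypothetically, that the null and alternative models differ by exactly one free parameter, so that the test statistic is compared against a chi-squared distribution with one degree of freedom. The function names and the log-likelihood values are illustrative assumptions, not part of the disclosure.

```python
import math


def lrt_p_value(loglik_null, loglik_alt):
    """P-value of a likelihood-ratio test, assuming one degree of freedom."""
    statistic = max(2.0 * (loglik_alt - loglik_null), 0.0)
    # Survival function of a chi-squared variable with 1 degree of freedom:
    # P(Z^2 > x) = erfc(sqrt(x / 2)) for standard normal Z.
    return math.erfc(math.sqrt(statistic / 2.0))


def differential_constraint(loglik_null, loglik_alt, alpha=0.05):
    """True when the null model (equal coefficients) is rejected at `alpha`."""
    return lrt_p_value(loglik_null, loglik_alt) < alpha


# Hypothetical maximized log-likelihoods for one gene.
strong_signal = differential_constraint(loglik_null=-100.0, loglik_alt=-90.0)
weak_signal = differential_constraint(loglik_null=-100.0, loglik_alt=-99.5)
```

Rejecting the null model corresponds to outcome 129 (differential selective constraint), and failing to reject it corresponds to outcome 139.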
Further continuing with the description of system 100, the components of the system 100 illustrated in
While system 100 is described herein with reference to particular blocks, it is to be understood that the blocks are defined for convenience of description and are not intended to require a particular physical arrangement of component parts. Further, the blocks need not correspond to physically distinct components. To the extent that physically distinct components are used, connections between components can be wired and/or wireless as desired. The different elements or components can be combined into single software modules and multiple software modules can run on the same hardware.
Prior to the discussion of the series of computations necessary to obtain an estimate of the neutral background distribution of mutation rates from a count of segregating sites and consequently obtain a selection coefficient for a particular gene, this disclosure describes the structure and implications of genetic variants, segregating sites, mutation, and selection.
Genetic Variation
The disclosed methods and systems of predicting differential selective constraint on a gene-by-gene basis involve sequenced genomic data from individuals belonging to a particular species. The discussion now turns to details of genomic data, their associated features, and the relationship between genotype and phenotype for a particular individual. The following description of
The system 100 can digitally transcribe and translate reference genetic sequence B 310 into protein B 314, the wild type protein composition for the particular gene comprising reference genetic sequence B 310. In some cases, individuals possessing wild type protein B 314 will present with phenotype B 316, a healthy phenotype. The system 100 further determines that variant B.1 320 results in mutant protein B.1 324, where mutant protein B.1 324 comprises a missense mutation resulting in a differing protein structure and function from the wild type protein B 314. In some cases, individuals possessing missense protein B.1 324 will present with phenotype B.1 326, a disease phenotype. In addition, the system 100 determines that variant B.2 330 results in mutant protein B.2 334, where mutant protein B.2 334 comprises a synonymous mutation resulting in no change to protein structure and function as compared to the wild type protein B 314 (i.e., no amino acid changes occurred in response to the single nucleotide variant in position five of variant B.2 330). In some cases, individuals possessing synonymous protein B.2 334 will present with phenotype B.2 336, a healthy phenotype similar to phenotype B 316. Further, the system 100 determines that variant B.3 340 results in mutant protein B.3 344, where mutant protein B.3 344 comprises a nonsense mutation resulting in a truncated, nonfunctional protein as compared to wild type protein B 314. In some cases, individuals possessing nonsense protein B.3 344 will present with phenotype B.3 346, likely resulting in a nonviable embryo or significantly reduced lifespan.
A person skilled in the art will recognize that variants B.1 320, B.2 330, and B.3 340 are listed as simplified examples and potential phenotypic representations span a wide spectrum rather than a limited number of discrete representations. Moreover, a person skilled in the art will also recognize that many phenotypic responses occur due to a plurality of variants within a single gene, or a combined polygenic variant effect. Many implementations of the technology disclosed specifically address polygenic risk scoring for a particular phenotypic response associated with severe genetic disorders known to cause substantial detriment to an individual's quality of life and/or life expectancy.
Additionally, the system 100 determines that variant C.3 380 also comprises a single nucleotide polymorphism in position b5 and results in synonymous protein C.3 382. Although variant C.3 380 has a single nucleotide polymorphism in the same position as variant C.2 370, the system 100 classifies them as separate variants because the substituted nucleic acid base is not the same. However, because protein C.3 382 and protein C.2 372 are both synonymous to wild type, the system 100 also determines that b5 is in a “wobble” position within a codon that is translated into an amino acid within protein C 352. In some cases, the wobble effect refers to the redundancy within the genetic code that allows for a certain degree of variation within the third position of a codon that will not affect the amino acid that will be translated from the codon.
As further illustrated in
Continuing the discussion of
As further illustrated in
As illustrated in
The system 100 can further make determinations (or inferences) about the health impact 404 of mutation 401 in genomic region D 400. For example, the system 100 determines, detects, or identifies a lack of tolerance for mutation 401 from wild type variant D.1 410 to nonsynonymous variant D.2 414. As a result of the lack of tolerance, the system 100 can determine that genomic region D 400 is implicated in a particular disease or health outcome. In some embodiments, the system 100 can determine or estimate that nonsynonymous variant D.2 414 is a pathogenic variant for genomic region D 400.
While the discussion of genetic variation in the context of modern medicine often focuses on the deleterious effect of pathogenic variants versus benign variants, beneficial variants also exist in the broader context of evolutionary processes.
As further illustrated in
The system 100 can further make inferences about the health impact 424 of mutation 421 in genomic region E 420. The advantage gained via mutation 421 from wild type variant E.1 430 to nonsynonymous variant E.2 434 suggests that if genomic region E 420 is implicated in a particular disease or health outcome, the system 100 can determine or estimate that nonsynonymous variant E.2 434 is a protective, beneficial variant for genomic region E 420.
In some embodiments, the system 100 can further make inferences about the health impact 444 of mutation 441 in genomic region F 440. In some cases, the lack of phenotype change resulting from mutation 441 from wild type variant F.1 450 to nonsynonymous variant F.2 454 makes it difficult to determine or estimate if genomic region F 440 is implicated in a particular disease or health outcome without further information. In the event that genomic region F 440 is shown to be implicated in a particular disease or health outcome, the system 100 can determine or estimate that nonsynonymous variant F.2 454 is a benign variant for genomic region F 440.
As further illustrated in
Existing systems are similarly limited in their ability to make inferences about the health impact 464 of mutation 461 in genomic region G 460. In the event that genomic region G 460 is shown to be implicated in a particular disease or health outcome, the system 100 can estimate that synonymous variant G.2 474 is likely to be a benign variant for genomic region G 460. However, newly developing research is changing the paradigm for the impact of synonymous variation and suggests that certain genomic diseases, such as cystic fibrosis, may be associated with splicing dysfunction caused by variants that otherwise do not change the amino acid sequence of a resulting protein.
Concepts within evolutionary genetics, such as fitness and selective pressure, can also be quantified, and this quantification is a substantial component of the analysis performed within evolutionary medicine technologies.
Relative Fitness and Selection Coefficient as Measures of Selective Constraint
As further shown in
The system 100 can further apply the relative fitness of each variant to determine the selection coefficient 562 of each variant, where the selection coefficient 562 is equal to 1 minus the relative fitness of a particular variant. In some cases, the selection coefficient of a particular variant for a particular gene is a measure of differences in relative fitness 582. Thus, given the relative fitness values, the system 100 can determine or compute that the selection coefficient of variant V1 504 is equal to 0, the selection coefficient of variant V2 506 is equal to 0.13, the selection coefficient of variant V3 508 is equal to 0.88, and the selection coefficient of variant V4 510 is equal to −0.05. Because the system 100 set the wild type allele 502 as the point of reference, the system 100 determines that its selection coefficient is 0, thus identifying that there is no selective pressure acting on wild type allele 502. Similarly, because the selection coefficient of variant V1 504 is also equal to 0, the system 100 can determine or infer that no selective pressure is acting on variant V1 504.
As shown, negative selective pressure is acting on both variant V2 506 and variant V3 508, but the system 100 determines, based on the selection coefficients, that the negative selective pressure acting on variant V3 508 is stronger and will deplete the frequency of variant V3 508 within the genetic population faster than that of variant V2 506. In contrast, the system 100 determines, based on the selection coefficient of variant V4 510, that a weak positive selective pressure will gradually amplify the frequency of variant V4 510 within the genetic population.
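The arithmetic in the two preceding paragraphs can be sketched as follows. The relative fitness values below are back-computed from the stated selection coefficients (selection coefficient = 1 minus relative fitness) and are therefore assumptions, as are the function names.

```python
def selection_coefficient(relative_fitness):
    """Selection coefficient as 1 minus relative fitness (wild type = 1)."""
    return 1.0 - relative_fitness


def direction_of_selection(s, tolerance=1e-9):
    """Under this convention, s > 0 means reduced fitness (negative selection)."""
    if abs(s) <= tolerance:
        return "neutral"
    return "negative" if s > 0 else "positive"


# Relative fitness values implied by the coefficients 0, 0.13, 0.88 and -0.05.
relative_fitness = {"V1": 1.00, "V2": 0.87, "V3": 0.12, "V4": 1.05}

coefficients = {v: selection_coefficient(w) for v, w in relative_fitness.items()}
directions = {v: direction_of_selection(s) for v, s in coefficients.items()}

# Variants under negative selection, ordered from strongest to weakest pressure.
depleted = sorted(
    (v for v, d in directions.items() if d == "negative"),
    key=lambda v: coefficients[v],
    reverse=True,
)
```

The larger the positive coefficient, the faster the variant is expected to be depleted, which is why V3 is ordered ahead of V2.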
To determine the differential selective constraint acting on a gene between homologous species, the system 100 can determine or approximate a single selection coefficient value for all variants within the particular gene per species. The system 100 can further compare the selection coefficients to identify the extent of differential selective constraint between species.
The above examples in
Some implementations of the system 100 comprise an explicit population genetic model configured to further perform operations as part of a method that includes two phases: First, modelling the counts of synonymous segregating sites to learn a neutral background distribution of mutation rates per gene per species. Second, applying that neutral background distribution to estimate the average selection per gene across species.
Population Genetic Model to Estimate Mutation Rates Per Gene Per Species
As shown in
As shown, the system 100 determines or generates a best fitting demographic model where X_{igk} is the number of mutations of type k (k=0 is synonymous, k=1 is missense) in gene g of species i, where θ_{ig} = 4N_i μ_g is the per-site population-scaled mutation rate, and where L_{gk} is the number of sites of type k in gene g. Next, the system 100 can determine or compute p_i(γ_{ig}). In some cases, the system 100 determines θ_{ig} p_i(γ_{ig}) as approximately the probability that a site in gene g with population-scaled selection coefficient γ_{ig} = 2N_i s_g is segregating in a sample from species i.
Accordingly, the system 100 determines the distribution of X_{igk} to be Poisson with mean θ_{ig} L_{gk} p_i(γ_{ig}), such that the Poisson distribution of the observed number of segregating synonymous variants per gene per species is as shown in distribution 702.
In certain embodiments, due to a combination of true variation in mutation rate and data quality across the genome, the system 100 determines that different genes have a different effective per base-pair mutation rate. The system 100 accordingly implements or utilizes strategies to generate or create a robust estimate with few parameters per species. To accommodate this, the system 100 adopts a Gamma distributed prior on θ_{ig}, applies the Gamma distributed prior to synonymous sites (i.e., k=0, γ_{ig}=0), and integrates over the Gamma distributed prior to obtain a scaled negative-binomial distribution 722.
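As an illustration of the Poisson model just described, the following sketch computes the expected count of segregating sites as the product of the population-scaled mutation rate, the number of sites, and the segregation factor, and then evaluates the Poisson log-probability of an observed count. The function names and the numeric inputs are hypothetical and chosen only to make the arithmetic easy to follow.

```python
import math


def expected_segregating_sites(theta_ig, n_sites_gk, p_i_gamma):
    """Poisson mean: mutation rate x number of sites x segregation factor."""
    mean = theta_ig * n_sites_gk * p_i_gamma
    if mean <= 0:
        raise ValueError("the Poisson mean must be positive")
    return mean


def poisson_log_pmf(count, mean):
    """log P(X = count) for X ~ Poisson(mean)."""
    return count * math.log(mean) - mean - math.lgamma(count + 1)


# Hypothetical inputs: per-site rate 0.001, 1000 sites, segregation factor 5.
mean_count = expected_segregating_sites(theta_ig=0.001, n_sites_gk=1000, p_i_gamma=5.0)
log_prob_of_five = poisson_log_pmf(5, mean_count)
```

Working in log space keeps the computation stable for genes with large counts, where the factorial term would otherwise overflow.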
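The step of integrating over the Gamma prior can be checked numerically. The sketch below assumes a shape/rate parameterization of the Gamma prior (the disclosure does not specify one) and writes the Poisson mean as a scale factor times the mutation rate, where the scale stands in for the product of the number of sites and the segregation factor. It compares the closed-form negative-binomial probability with a direct numerical integration of the Poisson-Gamma mixture. All names and values are illustrative.

```python
import math


def neg_binomial_log_pmf(count, shape, rate, scale):
    """log P(X = count) after integrating theta ~ Gamma(shape, rate)
    out of X | theta ~ Poisson(scale * theta)."""
    success = rate / (rate + scale)
    return (
        math.lgamma(count + shape) - math.lgamma(shape) - math.lgamma(count + 1)
        + shape * math.log(success)
        + count * math.log1p(-success)
    )


def mixture_pmf_numeric(count, shape, rate, scale, steps=20_000):
    """Midpoint-rule integral of Poisson(count | scale * theta) x Gamma pdf.

    Adequate for shape >= 1, where the integrand is bounded near zero.
    """
    upper = (shape + 12.0 * math.sqrt(shape) + 12.0) / rate  # far into the tail
    width = upper / steps
    total = 0.0
    for j in range(steps):
        theta = (j + 0.5) * width
        log_poisson = (
            count * math.log(scale * theta) - scale * theta - math.lgamma(count + 1)
        )
        log_gamma_pdf = (
            shape * math.log(rate) + (shape - 1.0) * math.log(theta)
            - rate * theta - math.lgamma(shape)
        )
        total += math.exp(log_poisson + log_gamma_pdf)
    return total * width


# Hypothetical prior (mean mutation rate 0.001) and scale factor.
shape, rate, scale = 2.0, 2000.0, 3000.0
closed_form = math.exp(neg_binomial_log_pmf(3, shape, rate, scale))
numeric = mixture_pmf_numeric(3, shape, rate, scale)
```

The two values agree to within the integration error, which illustrates why only the two prior parameters per species need to be estimated rather than one mutation rate per gene.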
As further illustrated in
Modeling Selection Across Species
As further shown in
Estimation of Average Selection Coefficients Per Gene
As just mentioned, and as illustrated in
In some embodiments, the system 100 determines the difference in estimated γ_{ig} = 2N_i s_{ig} between species to be due to the N_i being different. In these or other embodiments, the system 100 determines that s_{ig} ≡ s_g is identical across species for gene g. For each gene, the system 100 performs step 728 to estimate γ_{ig} across species to obtain
As further illustrated in
Identifying Differential Selective Constraint Between Target and Non-Target Species Leveraging Selection Coefficients
In some embodiments, the system 100 utilizes the alternative model illustrated in step 812 to determine γgh≠
As further illustrated in
Computer System for Implementation of an Explicit Population Genetic Model of Selection
In one implementation, the system 100 is communicably linked to the storage subsystem 910 and the user interface input devices 920 (and/or other components of the computer system 900).
User interface input devices 920 can include a keyboard; pointing devices such as a mouse, trackball, touchpad, or graphics tablet; a scanner; a touch screen incorporated into the display; audio input devices such as voice recognition systems and microphones; and other types of input devices. In general, use of the term “input device” is intended to include all possible types of devices and ways to input information into computer system 900.
User interface output devices 928 can include a display subsystem, a printer, a fax machine, or non-visual displays such as audio output devices. The display subsystem can include an LED display, a cathode ray tube (CRT), a flat-panel device such as a liquid crystal display (LCD), a projection device, or some other mechanism for creating a visible image. The display subsystem can also provide a non-visual display such as audio output devices. In general, use of the term “output device” is intended to include all possible types of devices and ways to output information from computer system 900 to the user or to another machine or computer system.
As further shown in
Deep learning processors 930 can be graphics processing units (GPUs), field-programmable gate arrays (FPGAs), application-specific integrated circuits (ASICs), and/or coarse-grained reconfigurable architectures (CGRAs). Deep learning processors 930 can be hosted by a deep learning cloud platform such as Google Cloud Platform™, Xilinx™, and Cirrascale™. Examples of deep learning processors 930 include Google's Tensor Processing Unit (TPU)™, rackmount solutions like GX4 Rackmount Series™, GX9 Rackmount Series™, NVIDIA DGX-1™, Microsoft's Stratix V FPGA™, Graphcore's Intelligent Processor Unit (IPU)™, Qualcomm's Zeroth Platform™ with Snapdragon Processors™, NVIDIA's Volta™, NVIDIA's DRIVE PX™, NVIDIA's JETSON TX1/TX2 MODULE™, Intel's Nirvana™, Movidius VPU™, Fujitsu DPI™, ARM's DynamIQ™, IBM TrueNorth™, Lambda GPU Server with Tesla V100s™, and others.
As further illustrated in
Additionally, the bus subsystem 922 provides a mechanism for letting the various components and subsystems of computer system 900 communicate with each other as intended. Although bus subsystem 922 is shown schematically as a single bus, alternative implementations of the bus subsystem can use multiple busses.
Computer system 900 itself can be of varying types including a personal computer, a portable computer, a workstation, a computer terminal, a network computer, a television, a mainframe, a server farm, a widely-distributed set of loosely networked computers, or any other data processing system or user device. Due to the ever-changing nature of computers and networks, the description of computer system 900 depicted in
System Overview: Poisson Generalized Linear Mixed Model of Selection
As illustrated in
As further illustrated in
In some embodiments, the system 1000 tests the deviation metric 1008 for significance via significance test 1028. Significance test 1028 may be a Z-test, likelihood-ratio test, a t-test, or some other test, as indicated above in relation to
In applying the significance test 1028, the system 1000 can utilize a null model such that the deviation metric 1008 for target species i and the cohort of non-target species {s1 . . . sn} is equal to zero. Conversely, when applying the significance test 1028, in some embodiments, the system 1000 can utilize an alternative model such that the deviation metric 1008 for target species i and the cohort of non-target species {s1 . . . sn} is not equal to zero. As mentioned above, significance is defined by a pre-determined alpha level. If the system 1000 determines that the deviation metric 1008 for target species i and the cohort of non-target species {s1 . . . sn} is not equal to zero (i.e., the null model is rejected at the pre-determined alpha level), the system 1000 can detect or determine differential selective constraint for gene g between target species i and non-target species {s1 . . . sn} 1048. However, if the system 1000 determines that the deviation metric 1008 for target species i and the cohort of non-target species {s1 . . . sn} is equal to zero (i.e., the null model is not rejected), the system 1000 may detect or determine no differential selective constraint for gene g between target species i and non-target species {s1 . . . sn} 1049.
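One way to picture significance test 1028 is a two-sided Z-test of the deviation metric against zero. The sketch below assumes that an estimate of the deviation metric and its standard error are available for the gene; both inputs, the function names, and the default alpha of 0.05 are hypothetical.

```python
import math


def two_sided_z_test(estimate, std_error):
    """Return (z, p) for the null model that the true deviation is zero."""
    if std_error <= 0:
        raise ValueError("standard error must be positive")
    z = estimate / std_error
    # Two-sided tail probability of a standard normal variable.
    p = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p


def has_differential_constraint(estimate, std_error, alpha=0.05):
    """True when the deviation differs significantly from zero at `alpha`."""
    _, p = two_sided_z_test(estimate, std_error)
    return p < alpha


# Hypothetical deviation estimates for two genes.
gene_with_signal = has_differential_constraint(estimate=-0.9, std_error=0.2)
gene_without_signal = has_differential_constraint(estimate=0.1, std_error=0.2)
```

A rejected null model corresponds to outcome 1048, and a null model that is not rejected corresponds to outcome 1049.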
Further continuing with the description of system 1000, the components of the system 1000 illustrated in
While system 1000 is described herein with reference to particular blocks, it is to be understood that the blocks are defined for convenience of description and are not intended to require a particular physical arrangement of component parts. Further, the blocks need not correspond to physically distinct components. To the extent that physically distinct components are used, connections between components can be wired and/or wireless as desired. The different elements or components can be combined into single software modules and multiple software modules can run on the same hardware.
Missense-to-Synonymous Ratio as a Measure of Selective Constraint
The following description of
As further illustrated in
Additionally, genomic region J 1140 possesses a length of N wherein each sequence position bn comprises a nucleotide base of either thymine, adenine, guanine, or cytosine. The system 1000 can receive data indicating, or can determine, that the allelic count of segregating sites 1141 within genomic region J 1140 includes 8 missense variants and 8 synonymous variants. Thus, system 1000 can also determine that the MSR 1142 of genomic region J 1140 is equal to 1. Using the MSR 1142 as the measure of selection, the system 1000 can determine or infer 1143 that genomic region J 1140 is either under neutral selection (i.e., not affected by selective pressure or constraint), or that genomic region J 1140 experiences a combination of positive and negative selective pressures at distinct sites within the region that have a balancing effect on one another.
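The MSR computation for genomic region J can be written out as a short sketch. The function name is hypothetical; the counts of 8 missense and 8 synonymous variants are taken from the example above.

```python
def missense_to_synonymous_ratio(n_missense, n_synonymous):
    """Ratio of missense to synonymous segregating sites in a region."""
    if n_missense < 0 or n_synonymous <= 0:
        raise ValueError("counts must be non-negative, with at least one synonymous site")
    return n_missense / n_synonymous


# Genomic region J: 8 missense variants and 8 synonymous variants.
msr_region_j = missense_to_synonymous_ratio(8, 8)
```

The guard against a zero synonymous count is a design choice of this sketch; a region with no synonymous variants has an undefined ratio and would need separate handling.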
As in above examples in
Some implementations of the system 1000 comprise an explicit population genetic model configured to fit a curve approximating the relationship between human and primate MSR using a Poisson generalized linear mixed model and further configured to identify genes where the observed human MSR deviates significantly from what would be expected given the gene's MSR in primates. These implementations are further configured to adjust for gene length to account for shorter genes having more variability in their MSR than longer genes.
Poisson Generalized Linear Mixed Model to Estimate Depletion of Missense Variation Per Gene
As further illustrated in
As shown, the system 1000 can determine and apply the estimates of ε_g to build model 1242 for the human data X_{gk}^{(H)} ≡ X_{hgk}. In some embodiments, the terms or parameters of the model 1242, β_1^{(H)}, β_2^{(H)}, δ_g^{(H)}, and L_{gk}, have the same interpretation as described above for primate model 1222. However, the system 1000 applies the parameters of the model 1242 to human data. As shown, the system 1000 determines, detects, or identifies additional fixed effects β_2, β_3, and β_4 in 1242, where the additional fixed effects model a nonlinear relationship between the missense depletion in primates and the missense depletion in humans. Thus, the system 1000 can utilize the remaining random effect, η_g, as the deviation of the observed depletion of missense variants in humans compared to what would be expected based on the primate depletion. In particular, the system 1000 determines that η_g<0 indicates that a gene has even fewer missense variants than would be expected based on primates, and the system 1000 thus determines that there is stronger constraint in humans than in primates. Conversely, the system 1000 can determine that η_g>0 indicates an excess of missense variants compared to the expectation based on primates, and the system 1000 thus determines that there is relaxed constraint in humans compared to primates.
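The sign convention for the remaining random effect η_g can be summarized in a short sketch. The gene labels and the η_g values are hypothetical, and in practice the sign would be read together with the outcome of a significance test rather than on its own.

```python
def interpret_deviation(eta_g):
    """Map the sign of the human-versus-primate deviation to an interpretation."""
    if eta_g < 0:
        return "stronger constraint in humans"
    if eta_g > 0:
        return "relaxed constraint in humans"
    return "no deviation from primate expectation"


# Hypothetical per-gene deviations.
eta_by_gene = {"GENE_A": -0.8, "GENE_B": 0.0, "GENE_C": 0.5}

interpretations = {gene: interpret_deviation(eta) for gene, eta in eta_by_gene.items()}

# Genes ordered from most depleted of missense variants to most enriched.
ranked_genes = sorted(eta_by_gene, key=eta_by_gene.get)
```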
Identifying Differential Selective Constraint Between Target and Non-Target Species Leveraging Depletion of Missense Variation
As illustrated in
Computer System for Implementation of a Poisson Generalized Linear Mixed Model of Selection
In one implementation, the system 1000 is communicably linked to the storage subsystem 1410 and the user interface input devices 1420 (and/or other components of the computer system 1400).
User interface input devices 1420 can include a keyboard; pointing devices such as a mouse, trackball, touchpad, or graphics tablet; a scanner; a touch screen incorporated into the display; audio input devices such as voice recognition systems and microphones; and other types of input devices. In general, use of the term “input device” is intended to include all possible types of devices and ways to input information into computer system 1400.
User interface output devices 1428 can include a display subsystem, a printer, a fax machine, or non-visual displays such as audio output devices. The display subsystem can include an LED display, a cathode ray tube (CRT), a flat-panel device such as a liquid crystal display (LCD), a projection device, or some other mechanism for creating a visible image. The display subsystem can also provide a non-visual display such as audio output devices. In general, use of the term “output device” is intended to include all possible types of devices and ways to output information from computer system 1400 to the user or to another machine or computer system.
As further shown in
Deep learning processors 1430 can be graphics processing units (GPUs), field-programmable gate arrays (FPGAs), application-specific integrated circuits (ASICs), and/or coarse-grained reconfigurable architectures (CGRAs). Deep learning processors 1430 can be hosted by a deep learning cloud platform such as Google Cloud Platform™, Xilinx™, and Cirrascale™. Examples of deep learning processors 1430 include Google's Tensor Processing Unit (TPU)™, rackmount solutions like GX4 Rackmount Series™, GX14 Rackmount Series™, NVIDIA DGX-1™, Microsoft's Stratix V FPGA™, Graphcore's Intelligent Processor Unit (IPU)™, Qualcomm's Zeroth Platform™ with Snapdragon Processors™, NVIDIA's Volta™, NVIDIA's DRIVE PX™, NVIDIA's JETSON TX1/TX2 MODULE™, Intel's Nirvana™, Movidius VPU™, Fujitsu DPI™, ARM's DynamIQ™, IBM TrueNorth™, Lambda GPU Server with Tesla V100s™, and others.
As further illustrated in
Additionally, the bus subsystem 1422 provides a mechanism for letting the various components and subsystems of computer system 1400 communicate with each other as intended. Although bus subsystem 1422 is shown schematically as a single bus, alternative implementations of the bus subsystem can use multiple busses.
Computer system 1400 itself can be of varying types including a personal computer, a portable computer, a workstation, a computer terminal, a network computer, a television, a mainframe, a server farm, a widely-distributed set of loosely networked computers, or any other data processing system or user device. Due to the ever-changing nature of computers and networks, the description of computer system 1400 depicted in
Performance Measure Results as Objective Indicia of Non-Obviousness and Inventiveness
The discussion thus far has covered a plurality of implementations of the technology disclosed for identifying genes with differential selective constraint between a target species and a plurality of non-target species. The discussion now turns to performance results of various implementations of the technology disclosed.
The technology disclosed, in particular the clauses disclosed in this section, can be practiced as a system, method, or article of manufacture. One or more features of an implementation can be combined with the base implementation. Implementations that are not mutually exclusive are taught to be combinable. One or more features of an implementation can be combined with other implementations. This disclosure periodically reminds the reader of these options. Omission from some implementations of recitations that repeat these options should not be taken as limiting the combinations taught in the preceding sections; these recitations are hereby incorporated forward by reference into each of the following implementations.
One or more implementations and clauses of the technology disclosed or elements thereof can be implemented in the form of a computer product, including a non-transitory computer readable storage medium with computer usable program code for performing the method steps indicated. Furthermore, one or more implementations and clauses of the technology disclosed or elements thereof can be implemented in the form of an apparatus including a memory and at least one processor that is coupled to the memory and operative to perform exemplary method steps. Yet further, in another aspect, one or more implementations and clauses of the technology disclosed or elements thereof can be implemented in the form of means for carrying out one or more of the method steps described herein; the means can include (i) hardware module(s), (ii) software module(s) executing on one or more hardware processors, or (iii) a combination of hardware and software modules; any of (i)-(iii) implement the specific techniques set forth herein, and the software modules are stored in a computer readable storage medium (or multiple such media).
The clauses described in this section can be combined as features. In the interest of conciseness, the combinations of features are not individually enumerated and are not repeated with each base set of features. The reader will understand how features identified in the clauses described in this section can readily be combined with sets of base features identified as implementations in other sections of this application. These clauses are not meant to be mutually exclusive, exhaustive, or restrictive; and the technology disclosed is not limited to these clauses but rather encompasses all possible combinations, modifications, and variations within the scope of the claimed technology and its equivalents.
Other implementations of the clauses described in this section can include a non-transitory computer readable storage medium storing instructions executable by a processor to perform any of the clauses described in this section. Yet another implementation of the clauses described in this section can include a system including memory and one or more processors operable to execute instructions, stored in the memory, to perform any of the clauses described in this section.
We disclose the following clauses:
1. A computer-implemented method of detecting differential selective constraint for a gene of interest between a target species and one or more non-target species, the method comprising:
This application claims the benefit of and priority to the following: U.S. Provisional Patent Application No. 63/294,813, titled “PERIODIC MASK PATTERN FOR REVELATION LANGUAGE MODELS,” filed Dec. 29, 2021 (Attorney Docket No. ILLM 1063-1/IP-2296-PRV); U.S. Provisional Patent Application No. 63/294,816, titled “CLASSIFYING MILLIONS OF VARIANTS OF UNCERTAIN SIGNIFICANCE USING PRIMATE SEQUENCING AND DEEP LEARNING,” filed Dec. 29, 2021 (Attorney Docket No. ILLM 1064-1/IP-2297-PRV); U.S. Provisional Patent Application No. 63/294,820, titled “IDENTIFYING GENES WITH DIFFERENTIAL SELECTIVE CONSTRAINT BETWEEN HUMANS AND NON-HUMAN PRIMATES,” filed Dec. 29, 2021 (Attorney Docket No. ILLM 1065-1/IP-2298-PRV); U.S. Provisional Patent Application No. 63/294,827, titled “DEEP LEARNING NETWORK FOR EVOLUTIONARY CONSERVATION,” filed Dec. 29, 2021 (Attorney Docket No. ILLM 1066-1/IP-2299-PRV); U.S. Provisional Patent Application No. 63/294,828, titled “INTER-MODEL PREDICTION SCORE RECALIBRATION,” filed Dec. 29, 2021 (Attorney Docket No. ILLM 1067-1/IP-2301-PRV); U.S. Provisional Patent Application No. 63/294,830, titled “SPECIES-DIFFERENTIABLE EVOLUTIONARY PROFILES,” filed Dec. 29, 2021 (Attorney Docket No. ILLM 1068-1/IP-2302-PRV); and U.S. Provisional Patent Application No. 63/294,710, titled “APPROPRIATE REFERENCE GENOMES FOR TARGET SPECIES,” filed Dec. 29, 2021 (Attorney Docket No. IP-2300-PRV). The priority applications are incorporated by reference as if fully set forth herein.
Number | Date | Country
---|---|---
63294820 | Dec 2021 | US