The human genome is inherited in large blocks from parental genomes, generated through a DNA-sequence-dependent shuffling process called recombination. The non-random nature of recombination breakpoints producing these genomic blocks results in correlative genotype relationships across genetic variants, known as linkage disequilibrium. Thus, genotypes for a small subset (1%-10%) of observed common genetic variants can be used to infer the genotype status of unobserved but known genetic variation sites across the genome (on the order of ~1M of >10M sites)1,2. This process, called genotype imputation, allows for the generation of nearly the full complement of known common genetic variation at a fraction of the cost of direct genotyping or sequencing. Given the massive scale of genotyping required for genome-wide association studies or implementation of genetically-informed population health initiatives, genotype imputation is an essential approach in population genetics.
Standard approaches to genotype imputation utilize Hidden Markov Models (HMMs)3-5 distributed alongside large whole genome sequencing (WGS)-based reference panels6. In general terms, these imputation algorithms use the genetic variants shared between to-be-imputed genomes and the reference panel and apply an HMM to impute the missing genotypes per sample7. Genotyped variants are the observed states of the HMM, whereas the to-be-imputed genetic variants present in the reference panel are the hidden states. The HMM algorithm depends on recombination rates, mutation rates, and/or genotype error rates that must be fit by a Markov chain Monte Carlo (MCMC) or expectation-maximization algorithm. Thus, HMM-based imputation is a computationally intensive process, requiring access to both high-performance computing environments and large, privacy-sensitive, WGS reference panels8. Often, investigators outside of large consortia will resort to submitting genotype data to imputation servers3, resulting in privacy and scalability concerns9.
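By way of illustration only, the following is a minimal, haploid, Li-Stephens-style sketch of this class of algorithm, assuming fixed (rather than fitted) switch and error rates; it is not the implementation used by any production imputation tool. The hidden state is the reference haplotype being copied, the typed variants are the observations, and untyped variants are imputed from the resulting posterior over reference haplotypes.

```python
import numpy as np

def impute_haploid(ref, obs, typed_idx, switch=0.01, err=0.01):
    """Toy Li-Stephens-style HMM imputation (haploid, fixed rates).

    ref:       (n_hap, n_sites) 0/1 reference haplotype matrix
    obs:       alleles observed at the typed sites (0/1)
    typed_idx: indices of the typed sites within the n_sites grid
    Returns P(allele == 1) for every site, typed or untyped.
    """
    n_hap, n_sites = ref.shape
    fwd = np.full(n_hap, 1.0 / n_hap)            # prior over haplotypes
    posts = np.zeros((len(typed_idx), n_hap))    # filtered posteriors
    for k, j in enumerate(typed_idx):
        emit = np.where(ref[:, j] == obs[k], 1 - err, err)  # emission probs
        fwd = fwd * emit
        fwd /= fwd.sum()
        posts[k] = fwd
        fwd = (1 - switch) * fwd + switch / n_hap  # recombination switch
    # (A full implementation adds a backward pass and fits switch/err by
    #  EM or MCMC; forward-only posteriors suffice to illustrate the idea.)
    out = np.zeros(n_sites)
    typed_idx = np.asarray(typed_idx)
    for j in range(n_sites):
        k = np.abs(typed_idx - j).argmin()       # nearest typed site
        out[j] = posts[k] @ ref[:, j]            # posterior-weighted allele
    return out

# Tiny example: 4 reference haplotypes, 6 sites, 3 of them genotyped.
ref = np.array([[0, 0, 1, 0, 1, 1],
                [0, 1, 1, 0, 0, 1],
                [1, 0, 0, 1, 1, 0],
                [1, 1, 0, 1, 0, 0]])
print(np.round(impute_haploid(ref, np.array([0, 1, 1]), [0, 2, 5]), 2))
```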
Embodiments of the invention relate to a system for dynamically producing predictive data using varying data. In some embodiments, the system can be structured to generate imputed genomic sequence by inputting actual genomic sequence into an artificial neural network. In some embodiments, the system can include one or more of: an artificial neural network; a communication device; and/or a processing device communicably coupled to the communication device.
The processing device can be configured to: (i) receive actual genetic information; (ii) generate an input vector, wherein the input vector can include probabilistic or binary data converted from allelic data; (iii) access the neural network including a dynamic reservoir containing units, wherein each of the units can be connected to at least one other unit in the dynamic reservoir and the connections between the units are weighted; (iv) input the input vector into the neural network; (v) generate, via the neural network, an imputed sequence as an initial prediction, wherein the imputed sequence is an output of the neural network and at least partially based on the input vector; (vi) modify the initial prediction to generate a final prediction; (vii) present the final prediction to a user; and/or (viii) export the final prediction to a computer system.
In some embodiments, the variables can include data relating to at least one genetic variant.
In some embodiments, the genetic variant can be at least one of: a single-nucleotide variant, a multi-nucleotide variant, an insertion, a deletion, a structural variation, an inversion, a copy-number change, or any other genetic variation relative to a reference genome, or the like.
In some embodiments, the units can include nodes within at least three layers.
In some embodiments, the dynamic reservoir can include a plurality of layers.
In some embodiments, the at least three layers can include an input layer, a hidden layer, and an output layer.
In some embodiments, the system can include a plurality of dynamic reservoirs, wherein each reservoir can be adapted to impute sequence of a selected portion of genetic data, and wherein the plurality of reservoirs can be adapted to cover all desired portions of a genome or fragment thereof.
In some embodiments, the initial prediction can be provided in binary form, and the final prediction can be provided as genetic sequence.
Some embodiments of the invention relate to a method of obtaining, from an input of incomplete genomic information from an individual or population of an organism, an output of more complete genomic information for the individual or population within a desired accuracy cutoff. The method can include one or more of the steps of: (a) providing a genetic inference model comprising encoded complex genotype relationships, the model having been encoded by a system described herein; (b) inputting incomplete genomic information from the individual or population into the model, in or mediated by the system, wherein the incomplete genetic information comprises at least a sparse genotyping or sequencing, as defined and described herein, of a randomly sampled genome of the individual or population; (c) applying the model to the information by operation of the system; and/or (d) obtaining the output of more complete genomic information for the individual or population, wherein the more complete genomic information can include genotypes for genetic variants observed in a reference population used to define the weights of the neural network.
In some embodiments, the accuracy cutoff can be an accuracy level as in any of the ranges depicted in any of the figures, or any other useful accuracy. For example, the accuracy level can be 99.9% or more, 99.5%, 99%, 98%, 97%, 96%, 95%, 92%, 90%, 85%, 80%, 75%, 70%, 65%, or 60%.
In some embodiments, the organism can be selected from an animal, a plant, a fungus, a chromist, a protozoan, a bacterium, an archaeon, and the like.
Some embodiments of the invention relate to a method of training the system described herein. The method can include one or more of the steps of: (a) providing training genetic sequence data; (b) generating a first training input vector, wherein the first training input vector can include binary or probabilistic data converted from allelic data of the training genetic sequence data; (c) generating a second training input vector from the first training input vector, wherein generating the second training input vector can include reducing a number of variables of the first training input vector; (d) inputting the second training input vector to a neural network comprising a dynamic reservoir containing units, wherein each of the units can be connected to at least one other unit in the dynamic reservoir and wherein the connections between the units can be initially weighted according to a training weighting; (e) generating, via the neural network, imputed training sequence data, wherein the imputed training sequence data are an output of the neural network and at least partially based on the second training input vector; (f) comparing the imputed training sequence data from step (e) to the original training genetic sequence data from step (a) to obtain a training result reflecting a degree of correspondence between the imputed training sequence data and the original training genetic sequence data; (g) modifying at least one of the generating of step (c) and the weighting of step (d), based upon the training result of step (f); (h) repeating steps (c) through (g) until the correspondence reaches a predetermined value; and thereby (i) completing training of the system.
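By way of illustration only, a minimal sketch of steps (a) through (i) follows, assuming a PyTorch environment; the placeholder data, architecture, masking rate, loss, and stop criterion are illustrative rather than part of the claimed system, and the variable reduction of step (c) is folded into the encoder's hidden layer.

```python
import torch
import torch.nn as nn

# (a) Training data: allele-presence encodings, shape (samples, allele nodes).
x_train = torch.randint(0, 2, (512, 200)).float()      # placeholder data

# (c)/(d) An autoencoder whose 64-node hidden layer reduces dimensionality;
# connections (weights) are randomly initialized as the training weighting.
model = nn.Sequential(nn.Linear(200, 64), nn.Tanh(),
                      nn.Linear(64, 200), nn.Sigmoid())
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.BCELoss()

target_corr = 0.95                                     # (h) stop criterion
for epoch in range(5_000):
    mask = (torch.rand_like(x_train) < 0.2).float()    # mask 20% of inputs
    out = model(x_train * (1 - mask))                  # (e) imputed output
    loss = loss_fn(out, x_train)                       # (f) compare to truth
    opt.zero_grad(); loss.backward(); opt.step()       # (g) adjust weights
    corr = torch.corrcoef(torch.stack([out.detach().flatten(),
                                       x_train.flatten()]))[0, 1]
    if corr >= target_corr:                            # (h) repeat until met
        break                                          # (i) training complete
```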
In some embodiments, the training genetic sequence can include at least one full genomic sequence of an organism.
In some embodiments, the training genetic sequence can include a plurality of full genomic sequences from a selected population of an organism.
Embodiments of the invention relate to the use of artificial neural networks, such as autoencoders, in population genomics and individual genome processing as a way to fill in missing data from genomic assays with significant sparsity, like low-pass whole genome sequencing or array-based genotyping10-12. Autoencoders are neural networks tasked with reconstructing their original input data, with constraints applied to the network architecture or transformations applied to the input data in order to achieve a desired goal like dimensionality reduction or compression, and de-noising or de-masking13-15. In the de-noising or de-masking case, stochastic noise or masking is used to modify or remove data inputs, training the autoencoder to reconstruct the original uncorrupted data from corrupted inputs16. These autoencoder characteristics are well-suited for genotype imputation and may address some of the limitations of HMM-based imputation by eliminating the need for dissemination of reference panels and allowing the capture of non-linear relationships in genomic regions with complex linkage disequilibrium structures. Some attempts at genotype imputation using neural networks have been reported previously, though only for specific genomic contexts17, at genotype masking levels (5%-20%) not applicable in typical real-world population genetics scenarios18-21, or in contexts where the neural network must be re-trained on different input variant sets20.
The instant invention relates to an autoencoder-based neural network approach that can simultaneously execute genotype imputation together with feature extraction and dimensionality reduction for downstream tasks. An autoencoder is a neural network whose target output (x̂) is the same as its input (x).
To accomplish this task, autoencoders can be tiled across the genome. The boundaries of the tiling are determined by a combination of video memory limitations and minima in the linkage disequilibrium patterns of known genetic variants—i.e., minima in the correlation structure of genetic variants associated with regions of high recombination.
All genotypes of each genetic variant can be converted to binary values representing the presence (1) or absence (0) of the reference allele A and alternative allele B, respectively:

$$x_i = \begin{cases} [1,0] & \text{if } G_i = [A,A] \\ [1,1] & \text{if } G_i = [A,B] \\ [0,1] & \text{if } G_i = [B,B] \\ [0,0] & \text{if } G_i \text{ is missing} \end{cases} \tag{1}$$
Where G can represent the original genotypes and x can represent the encoded genotype for variant i that is used as input for the autoencoder. The output of the autoencoder can follow the same format: the activation function returns a continuous value for each allele that can be rescaled to these values; for example, the tanh function returns a continuous value for each allele that ranges between −1 and 1, which is then rescaled to the 0-1 range, where 1 represents allele presence and 0 represents allele absence. The scaled outputs can also be regarded as probabilities and can be combined for the calculation of alternative allele dosage and genotype probabilities. This binary representation extends one-hot encoding from the genotype level to the allele level to take into account interdependencies among classes. For example, the [A,A] class shares one allele with [A,B], so the binary representation makes these classes share one common value ([1,0] and [1,1], respectively), whereas one-hot encoding at the genotype level would mistakenly regard these values as totally independent classes, causing unnecessary additional allocation of memory ([1,0,0] as [A,A], [0,1,0] as [A,B] or [B,A], and [0,0,1] as [B,B]). Missing genotypes can be represented as absence, or 0 probability, for all alleles ([0,0]), similar to the one-hot encoding representation of missing values (e.g. [0,0,0]).
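By way of illustration only, this allele-presence encoding can be implemented as follows (the genotype-tuple input format is assumed for illustration):

```python
def encode_genotype(genotype):
    """Map an unphased genotype to allele-presence values [ref, alt].

    [A,A]   -> [1, 0]  (reference allele only)
    [A,B]   -> [1, 1]  (both alleles present)
    [B,B]   -> [0, 1]  (alternative allele only)
    missing -> [0, 0]  (no allele evidence)
    """
    table = {("A", "A"): [1, 0], ("A", "B"): [1, 1],
             ("B", "A"): [1, 1], ("B", "B"): [0, 1]}
    return table.get(genotype, [0, 0])

print(encode_genotype(("A", "B")))  # [1, 1]
print(encode_genotype(None))        # [0, 0] for a missing genotype
```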
The autoencoder can be trained using randomly masked whole-genome sequence data from the Haplotype Reference Consortium or any other large collection of genetic data. This autoencoder can take raw genotype data as input and perform imputation as part of its autoencoding function. Random masking of input genetic data can be performed at a low masking rate, which is incremented in subsequent training rounds until a maximum masking rate is achieved. The training performance and autoencoder complexity can determine the maximum masking rate, which can reach up to N_i − 5, where N_i is the total number of input variants of autoencoder i. In order for the autoencoder to provide interesting latent representations that are useful for feature extraction and dimensionality reduction, constraints can be placed on the autoencoder. In one embodiment, sparsity (both node activation sparsity and edge weight regularization) and de-masking constraints can be used to find non-trivial latent genetic factors (h). Many activation functions can be used. In some embodiments, the system uses the commonly used tanh function, though reasonable accuracy can be achieved with leaky rectified linear units and sigmoid activation functions. Many loss functions can be used. In some embodiments, the system can use a modified version of cross entropy as the loss function, which can be customized for improving performance when training on highly sparse data with extreme class imbalance, a typical characteristic of genetic variants. Dimensionality reduction can be achieved by using a hidden layer that has a smaller number of nodes than the input/output layers, or by training an autoencoder with the same or larger number of hidden nodes than input/output layers with forced sparsity and trimming of dead nodes that are not used or are activated infrequently.
The loss function is a measure of the dissimilarity between the predicted (x̂) and ground truth (x) genotype values. In other words, it is a measure of how accurately the hidden representation (h) can be decoded to reconstruct the original genotype data. The goal when training an autoencoder is to minimize the value of the loss function. In some embodiments, the system uses Sparse Focal Loss (SFL), whose base focal term is:

$$FL(x_t) = \alpha_t\,(1 - x_t)^{\gamma}\, CE(x_t)$$
Where the class imbalance weighting factor α_t is represented as the inverse form of the class frequency (F) of a genetic variant j:

$$\alpha_t = \frac{1}{F_j}$$
Where F_j is derived from f(·), which can represent the relative frequencies of each allele across the genotypes observed for genetic variant j. CE can represent the classic cross entropy loss, or log loss, of x_t, the predicted probability of the truth class:

$$CE(x_t) = -\log(x_t)$$
The focal loss element (1−x_t)^γ amplifies the effect of the weights as the class probability values become more imbalanced.
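By way of illustration only, a minimal sketch of this weighted focal cross entropy for a single truth-class probability follows; the variable names are illustrative, and α_t is taken as the inverse class frequency per the description above.

```python
import numpy as np

def focal_loss(x_t, class_freq, gamma=2.0, eps=1e-8):
    """Focal cross entropy for the truth-class probability x_t.

    x_t:        predicted probability assigned to the true class
    class_freq: frequency F_j of the true class for variant j
    alpha_t is taken as the inverse class frequency, so rare classes
    (e.g. rare alternative alleles) are up-weighted.
    """
    alpha_t = 1.0 / max(class_freq, eps)        # class-imbalance weight
    ce = -np.log(max(x_t, eps))                 # classic cross entropy
    return alpha_t * (1.0 - x_t) ** gamma * ce  # modulated, weighted loss

# A well-classified common allele contributes far less to the loss
# than a poorly classified rare one:
print(focal_loss(x_t=0.95, class_freq=0.99))  # small
print(focal_loss(x_t=0.30, class_freq=0.01))  # large
```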
Both the de-noising and sparse autoencoder subtypes can have useful features for the generation of latent genetic factors. De-noising autoencoders add noise to the input (x*), but calculate the loss by comparing the output (x̂) to the original input (x). In some embodiments, de-masking (setting genotype values to missing rather than random values) can be used at an escalating corruption rate (c) to force the autoencoder to learn the underlying linkage disequilibrium structure, perform imputation, and create accurate reconstructions from noisy genotype data. De-noising can also be useful for the correction of genotyping errors, much in the same way imputation could be used to detect potential variant calling errors. Sparse autoencoders can accomplish the goal of dimensionality reduction by explicitly forcing sparsity in the hidden representation (h) through the addition of a sparsity penalty (S) to the loss function:

$$S = \sum_{i=1}^{n} \left[\rho \log\frac{\rho}{\hat{\rho}_i} + (1-\rho)\log\frac{1-\rho}{1-\hat{\rho}_i}\right]$$
where S can be the sparsity loss between the sparsity parameter (ρ), which represents the target probability that a hidden node in the hidden layer will be activated, and ρ̂, which represents the observed mean activation over a training batch i; n represents the total number of batches. In addition, a regularization term (L1 and/or L2) can be introduced to both autoencoder subtypes to prevent overfitting, which can introduce sparsity to the weight matrix (W):

$$L_1 = \lambda \sum_{h=1}^{n} |W_h|, \qquad L_2 = \lambda \sum_{h=1}^{n} W_h^2$$
where the L1 and L2 penalties can encourage weights equal to zero in the weight matrix (W), with their contribution set by the weight decay hyperparameter λ; this calculation can be performed for each hidden node (h = 1 through n hidden nodes). L1 can penalize the absolute values of the weights, which can reduce weights to exactly zero and therefore compress the model, whereas L2 penalizes their squared form, reducing the weights to values near zero. The final Sparse Focal Loss function, including the L1 and L2 regularizers, can be represented as:

$$SFL = FL(x_t) + L_1 + L_2 + \beta S$$
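By way of illustration only, a minimal sketch of the sparsity penalty (assuming the KL-divergence form described above) and the L1/L2 weight penalties, composed into the final regularized loss, follows; the hyperparameter values are placeholders.

```python
import numpy as np

def sparsity_penalty(rho, rho_hat, eps=1e-8):
    """KL divergence between the target activation rho and observed
    mean activations rho_hat (one value per hidden node)."""
    rho_hat = np.clip(rho_hat, eps, 1 - eps)
    return np.sum(rho * np.log(rho / rho_hat)
                  + (1 - rho) * np.log((1 - rho) / (1 - rho_hat)))

def regularized_loss(base_loss, W, rho, rho_hat, lam1, lam2, beta):
    """Final loss: focal loss + L1/L2 weight penalties + sparsity."""
    l1 = np.sum(np.abs(W))   # drives weights exactly to zero (compression)
    l2 = np.sum(W ** 2)      # shrinks weights toward zero
    return (base_loss + lam1 * l1 + lam2 * l2
            + beta * sparsity_penalty(rho, rho_hat))

W = np.random.randn(64, 200) * 0.1
print(regularized_loss(base_loss=0.42, W=W, rho=0.05,
                       rho_hat=np.random.rand(64) * 0.1,
                       lam1=1e-5, lam2=1e-4, beta=0.1))
```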
Many combinations of hyperparameters can produce high-accuracy genetic imputation. In some embodiments, a random grid search approach can be used to evaluate the reconstruction quality of autoencoders across a range of model parameters: corruption parameters (c), regularization parameters (λ for L1 vs. L2), and sparsity parameters (ρ and β), and across all model frameworks, i.e., the combinations of hidden node activation functions (rectified linear unit, tanh, sigmoid) and loss functions (squared error or multinomial logistic loss). 10-fold cross-validation can be used to combat overfitting. Parameter tuning can be performed with a greedy genetic search algorithm starting from the top 10 parameter sets from the grid search. Final models for each region can be selected based on best performance on an independent set of 1,000 Genomes individuals with both genotype and WGS data available.
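By way of illustration only, the random portion of this search can be sketched as follows; the grid values shown are placeholders rather than the values actually searched.

```python
import random

grid = {  # placeholder values; actual ranges come from the search space
    "activation": ["tanh", "relu", "sigmoid"],
    "loss": ["squared_error", "multinomial_logistic"],
    "corruption_c": [0.8, 0.9, 0.95, 0.99],
    "l1_vs_l2": ["L1", "L2"],
    "rho": [0.01, 0.05, 0.1],
    "beta": [0.1, 1.0, 10.0],
}

def sample_combinations(grid, n, seed=42):
    """Draw n random hyperparameter combinations from the grid."""
    rng = random.Random(seed)
    return [{k: rng.choice(v) for k, v in grid.items()} for _ in range(n)]

# e.g. 100 random combinations evaluated per genomic region
for params in sample_combinations(grid, n=100):
    pass  # train an autoencoder with `params`; record validation r-squared
```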
Sample-level reconstruction quality can be quantified using a number of similarity metrics, including identity-by-state, concordance, imputation quality score (IQS), r-squared, and F-score. The amount of training required can be guided by these accuracy measures: training continues until the accuracy measures level off per region of the genome.
Embodiments of the invention relate to a generalized approach to unphased human genotype imputation using sparse, de-noising autoencoders capable of highly accurate genotype imputation at genotype masking levels (98%+) appropriate for array-based genotyping and low-pass sequencing-based population genetics initiatives. The Examples describe the initial training and implementation of autoencoders spanning all of human chromosome 22, achieving equivalent or superior accuracy relative to modern HMM-based methods, and dramatically improving computational efficiency at deployment without the need to distribute reference panels.
The unphased human genotype imputation approach can be extended to phased human genotype imputation by pre-phasing input genotypes using available algorithms and modifying the encoding to reflect allele presence at each phased allele (e.g., encoding the phased genotype A|B as per-haplotype allele-presence pairs, [1,0] for the first haplotype and [0,1] for the second).
In some embodiments of the invention, the approach can similarly be extended to different organisms with different levels of ploidy by extending the encoding further.
Further information can be found in the reference: Dias R, Evans D, Chen S-F, Chen K-Y, Loguercio S, Chan L, Torkamani A (2022). Rapid, reference-free human genotype imputation with denoising autoencoders. eLife 11:e75600, which is hereby incorporated by reference in its entirety herein.
Sparse, de-noising autoencoders spanning all bi-allelic SNPs observed in the Haplotype Reference Consortium were developed and optimized. Each bi-allelic SNP was encoded as two binary input nodes, representing the presence or absence of each allele.
Genotypes for all bi-allelic SNPs were converted to binary values representing the presence (1) or absence (0) of the reference allele A and alternative allele B, respectively, as shown in Equation 1.
Where x is a vector containing the two allele presence input nodes to the autoencoder and their encoded allele presence values derived from the original genotype, G, of variant i. The output nodes of the autoencoder, regardless of activation function, are similarly rescaled to 0-1. The scaled outputs can also be regarded as probabilities and can be combined for the calculation of alternative allele dosage and/or genotype probabilities. This representation maintains the interdependencies among classes, is extensible to other classes of genetic variation, and allows for the use of probabilistic loss functions.
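By way of illustration only, one plausible way to combine the two scaled allele-presence outputs into genotype probabilities and an alternative allele dosage is sketched below; the independence assumption used here is an illustrative choice, not a formula specified by this description.

```python
import numpy as np

def genotype_probs_and_dosage(p_ref, p_alt, eps=1e-8):
    """Combine allele-presence probabilities into genotype probabilities.

    Assumes (for illustration only) that the two outputs can be treated
    as independent presence probabilities, then renormalizes over the
    three valid genotypes.
    """
    p = np.array([p_ref * (1 - p_alt),   # A/A: reference allele only
                  p_ref * p_alt,         # A/B: both alleles present
                  (1 - p_ref) * p_alt])  # B/B: alternative allele only
    p = p / max(p.sum(), eps)
    dosage = p[1] + 2 * p[2]             # expected count of alt alleles
    return p, dosage

probs, dosage = genotype_probs_and_dosage(0.9, 0.4)
print(np.round(probs, 3), round(dosage, 3))
```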
Training Data. Whole-genome sequence data from the Haplotype Reference Consortium (HRC) was used for training and as the reference panel for comparison to HMM-based imputation23. The dataset consists of 27,165 samples and 39,235,157 biallelic SNPs generated using whole-genome sequence data from 20 studies of predominantly European ancestry (HRC Release 1.1): 83.92% European, 2.33% East Asian, 1.63% Native American, 2.17% South Asian, 2.96% African, and 6.99% admixed ancestry individuals. Genetic ancestry was determined using continental population classification from the 1000 Genomes Phase 3 v5 (1000G) reference panel and a 95% cutoff using Admixture software24. Genotype imputation autoencoders were trained for all 510,442 unique SNPs observed in HRC on human chromosome 22. For additional comparisons, whole-genome sequence data from 31 studies available through the NHLBI Trans-Omics for Precision Medicine (TOPMed) program were used as an alternative reference panel for HMM-based imputation tools (Taliun et al., 2021). Freeze 8 of TOPMed was downloaded, which is the latest version with all consent groups genotyped across the same set of jointly called variants. GRCh38 TOPMed cohorts were converted to hg19 with Picard 2.25 ('Picard toolkit', 2019), and multi-allelic SNPs were removed with bcftools v1.10.2 (Danecek et al., 2021). Any variants with missing genotypes were excluded as well, yielding a final reference panel for chr22 consisting of 73,586 samples and 11,089,826 biallelic SNPs. Since the ARIC and MESA cohorts are used for model selection and validation, they were excluded from the TOPMed reference panel. Relatedness analysis using the KING robust kinship estimator revealed significant data leakage boosting HMM-based imputation performance through individuals directly participating in the MESA and other TOPMed cohorts, as well as through numerous first- and second-degree familial relationships spanning MESA individuals and individuals in other TOPMed cohorts.
Validation and Testing Data. A balanced (50%:50% European and African genetic ancestry) subset of 796 whole genome sequences from the Atherosclerosis Risk in Communities cohort (ARIC)25 was used for model validation and selection. The Wellderly26, Human Genome Diversity Panel (HGDP)27, and Multi-Ethnic Study of Atherosclerosis (MESA)28 cohorts were used for model testing. The Wellderly cohort consisted of 961 whole genomes of predominantly European genetic ancestry. HGDP consisted of 929 individuals across multiple ancestries: 11.84% European, 14.64% East Asian, 6.57% Native American, 10.98% African, and 55.97% admixed. MESA consisted of 5,370 whole genomes across multiple ancestries: 27.62% European, 11.25% East Asian, 4.99% Native American, 5.53% African, and 50.61% admixed.
GRCh38-mapped cohorts (HGDP and MESA) were converted to hg19 using Picard v2.2529. All other datasets were originally mapped and called against hg19. Multi-allelic SNPs, SNPs with >10% missingness, and SNPs not observed in HRC were removed with bcftools v1.10.230. Mock genotype array data was generated from these WGS cohorts by restricting genotypes to those present on commonly used genotyping arrays (Affymetrix 6.0, UKB Axiom, and Omni 1.5M). For chromosome 22, intersection with HRC and this array-like masking respectively resulted in: 9,025, 10,615, and 14,453 out of 306,812 SNPs observed in ARIC; 8,630, 10,325, and 12,969 out of 195,148 SNPs observed in the Wellderly; 10,176, 11,086, and 14,693 out of 341,819 SNPs observed in HGDP; and 9,237, 10,428, and 13,677 out of 445,839 SNPs observed in MESA.
Data Augmentation. The inventors employed two strategies for data augmentation: random variant masking and simulating further recombination via offspring formation. During training, random masking of input genotypes was performed at escalating rates, starting with a relatively low masking rate (80% of variants) that was gradually incremented in subsequent training rounds until only 5 variants remained unmasked per autoencoder. Masked variants are encoded as the null case in Equation 1. During fine-tuning, the inventors used sim1000G to simulate offspring formation using the default genetic map and HRC genomes as parents. A total of 30,000 offspring genomes were generated and merged with the original HRC dataset, for a total of 57,165 genomes.
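By way of illustration only, a minimal sketch of the escalating masking schedule follows; the increment size is a placeholder, and masked variants receive the [0,0] null encoding of Equation 1.

```python
import numpy as np

def mask_batch(x, mask_rate, rng):
    """x: (n_samples, n_variants, 2) allele-presence encodings.
    Masked variants are set to [0, 0], the null case of Equation 1."""
    masked = x.copy()
    hide = rng.random(x.shape[:2]) < mask_rate
    masked[hide] = 0.0
    return masked

rng = np.random.default_rng(0)
x = rng.integers(0, 2, (256, 1000, 2)).astype(float)  # placeholder data
n_variants = x.shape[1]
rate = 0.80                                  # start: 80% of variants masked
while True:
    batch = mask_batch(x, rate, rng)
    # ... run one masking/training cycle (e.g. 500 epochs) on `batch` ...
    if round(n_variants * (1 - rate)) <= 5:  # until only 5 remain unmasked
        break
    rate = min(rate + 0.02, 1 - 5 / n_variants)  # escalate the masking rate
```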
In order to account for the overwhelming abundance of rare variants, the accuracy of allele presence reconstruction was scored using an adapted version of focal loss (FL) [32], shown in Equation 2:

$$FL = -\alpha_t\,(1-p_t)^{\gamma}\,\big[x_t\log(p_t) + (1-x_t)\log(1-p_t)\big] \tag{2}$$
where the classic cross entropy (shown as binary log loss in brackets) of the truth class (x_t) predicted probability (p_t) is weighted by the class imbalance factor α_t and a modulating factor (1−p_t)^γ. The modulating factor is the standard focal loss factor with hyperparameter γ, which amplifies the focal loss effect by down-weighting the contributions of well-classified alleles to the overall loss (especially abundant reference alleles for rare variant sites). α_t is an additional balancing hyperparameter set to the truth class frequency.
This base focal loss function is further penalized and regularized to encourage simple and sparse models in terms of edge-weight and hidden layer activation complexity. These additional penalties result in the final loss function shown in Equation 3:

$$Loss = FL + \lambda_1\|W\|_1 + \lambda_2\|W\|_2 + \beta S \tag{3}$$
where L1 and L2 are the standard L1 and L2 norms of the autoencoder weight matrix (W), with their contributions mediated by the hyperparameters λ1 and λ2. S is a sparsity penalty, with its contribution mediated by the hyperparameter β, which penalizes deviation of the observed mean activation ρ̂ over a training batch j from a target hidden node activation set by the hyperparameter ρ, summed over the total number of batches n, as shown in Equation 4:

$$S = \sum_{j=1}^{n}\left[\rho\log\frac{\rho}{\hat{\rho}_j} + (1-\rho)\log\frac{1-\rho}{1-\hat{\rho}_j}\right] \tag{4}$$
All model training tasks were distributed across a diversified set of NVIDIA graphics processing units (GPUs) with different video memory limits: 5× Titan V (12 GB), 8× A100 (40 GB), and 60× V100 (32 GB). Given computational complexity and GPU memory limitations, individual autoencoders were designed to span approximately independent genomic segments with boundaries defined by computationally identified recombination hotspots.
These segments were defined using an adaptation of the LDetect algorithm [33]. First, the inventors calculated an n×n matrix of pairwise SNP correlations using all common genetic variation (≥5% minor allele frequency) from HRC. Correlation values were thresholded at 0.45. For each SNP, the inventors calculated a boxsum of all pairwise SNP correlations spanning 500 common SNPs upstream and downstream of the index SNP. This moving boxsum quantifies the overall local LD strength centered at each SNP. Local minima in this moving boxsum were used to split the genome into approximately independent genomic segments of two types: large segments of high LD interlaced with short segments of weak LD corresponding to recombination hotspot regions. Individual autoencoders were designed to span the entirety of a single high-LD segment plus its adjacent upstream and downstream weak-LD regions. Thus, adjacent autoencoders overlap at their weak-LD ends. If an independent genomic segment exceeded the threshold number of SNPs amenable to deep learning given GPU memory limitations, internal local minima within the high-LD regions were used to split the genomic segments further, to a maximum of 6,000 SNPs per autoencoder. Any remaining genomic segments still exceeding 6,000 SNPs were further split into 6,000-SNP segments with large overlaps of 2,500 SNPs given the high degree of informative LD split across these regions. This tiling process resulted in 256 genomic segments: 188 independent LD segments, 32 high-LD segments resulting from internal local-minima splits, and 36 segments further split due to GPU memory limitations.
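By way of illustration only, a minimal sketch of the moving-boxsum segmentation follows; the submatrix-sum reading of the boxsum is an assumption, the 0.45 threshold follows the text, and the toy example uses a much smaller window than the ±500-SNP window described above.

```python
import numpy as np
from scipy.signal import argrelextrema

def segment_boundaries(corr, window=500, threshold=0.45):
    """corr: (n_snp, n_snp) pairwise correlation matrix of common SNPs.
    Returns indices of local minima in the moving boxsum of local LD."""
    c = np.where(np.abs(corr) >= threshold, np.abs(corr), 0.0)
    n = c.shape[0]
    boxsum = np.empty(n)
    for i in range(n):                       # sum correlations within a
        lo, hi = max(0, i - window), min(n, i + window + 1)  # +/- window box
        boxsum[i] = c[lo:hi, lo:hi].sum()    # local LD strength at SNP i
    # local minima = candidate recombination-hotspot breakpoints
    return argrelextrema(boxsum, np.less, order=window // 2)[0]

# Toy example: two high-LD blocks separated by a low-LD boundary.
rng = np.random.default_rng(1)
corr = (np.kron(np.eye(2), np.full((100, 100), 0.6))
        + rng.random((200, 200)) * 0.05)
print(segment_boundaries(corr, window=20))   # boundary near index 100
```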
The inventors first used a random grid search approach to define initial hyperparameter combinations producing generally accurate genotype imputation results. The hyperparameters and their potential starting values are listed in Table 1. This coarse-grained grid search was performed on all 256 genomic segments of chromosome 22, each tested with 100 randomly selected hyperparameter combinations, with a batch size of 256 samples, training for 500 epochs without any stop criteria, and validating on an independent dataset (ARIC). To evaluate the performance of each hyperparameter combination, the inventors calculated the average coefficient of determination (r-squared) comparing the predicted and observed alternative allele dosages per variant. Concordance and F1-score were also calculated to screen for anomalies but were not ultimately used for model selection.
In order to avoid locally optimal solutions and reduce the hyperparameter search space, an ensemble-based machine learning approach (Extreme Gradient Boosting, XGBoost) was developed to predict the expected performance (r-squared) of each hyperparameter combination per genomic segment, using the results of the coarse-grid search and predictive features calculated for each genomic segment. These features include the number of variants, average recombination rate and average pairwise Pearson correlation across all SNPs, proportion of rare and common variants across multiple minor allele frequency (MAF) bins, number of principal components necessary to explain at least 90% of variance, and the total variance explained by the first 2 principal components. The observed accuracies from the coarse-grid search, numbering 25,600 training inputs, were used to predict the accuracy of 500,000 new hyperparameter combinations selected from Table 1 without training. All categorical predictors (activation function name, optimizer type, loss function type) were one-hot encoded. The model was implemented using XGBoost package v1.4.1 in Python v3.8.3 with 10-fold cross-validation and default settings.
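By way of illustration only, a minimal sketch of this meta-model follows; the random feature matrix is a placeholder standing in for the genomic segment features and one-hot encoded hyperparameters described above.

```python
import numpy as np
import xgboost as xgb
from sklearn.model_selection import cross_val_score

# Rows: one per (genomic segment, hyperparameter combination) from the
# coarse grid search; columns: segment features + one-hot hyperparameters.
rng = np.random.default_rng(0)
X = rng.random((25_600, 40))          # placeholder feature matrix
y = rng.random(25_600)                # observed validation r-squared

model = xgb.XGBRegressor(n_estimators=100)
print(cross_val_score(model, X, y, cv=10, scoring="r2").mean())

# Predict performance of new, untrained hyperparameter combinations
# and keep only the most promising candidates.
model.fit(X, y)
X_new = rng.random((500_000, 40))     # candidate combinations
top = np.argsort(model.predict(X_new))[::-1][:10]  # top-10 candidates
```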
The inventors then ranked all hyperparameter combinations by their predicted performance and selected the top 10 candidates per genomic segment, along with the single best initially tested hyperparameter combination per genomic segment, for further consideration. All other hyperparameter combinations were discarded. Genomic segments with sub-optimal performance relative to Minimac were subjected to tuning with simulated offspring formation. For tuning, the maximum number of epochs was increased (35,000) with automatic stop criteria: if there is no improvement in the average loss value of the current masking/training cycle versus the previous one, the training is interrupted; otherwise training continues until the maximum epoch limit is reached. Each masking/training cycle consisted of 500 epochs. Final hyperparameter selection was based on performance on the validation dataset (ARIC).
This process resulted in 256 unique autoencoders spanning the genomic segments of chromosome 22. Each genomic segment has a different number of input variables (genetic variants), sparsity, and correlation structure. Thus, 256 unique autoencoder models span the entirety of chromosome 22 (e.g., each autoencoder has different edge weights, number of layers, loss function, and regularization and optimization parameters).
Performance was compared to Minimac4 [34], Beagle5 [5], and Impute5 [4] using default parameters. Population-level reconstruction accuracy is quantified by measuring r-squared across multiple strata of data: per genomic segment, at the whole-chromosome level, and stratified across multiple minor allele frequency bins: [0.001-0.005), [0.005-0.01), [0.01-0.05), [0.05-0.1), [0.1-0.2), [0.2-0.3), [0.3-0.4), [0.4-0.5). While r-squared is the primary comparison metric, sample-level and population-level model performance is also evaluated with concordance and the F1-score. Wilcoxon rank-sum testing was used to assess the significance of the accuracy differences observed. Spearman correlations were used to evaluate the relationships between genomic segment features and observed imputation accuracy differences. Standard errors for per-variant imputation accuracy (r-squared) are equal to or less than 0.001 where not specified. Performance is reported only for the independent test datasets (Wellderly, MESA, and HGDP).
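By way of illustration only, a minimal sketch of the per-variant dosage r-squared computation, stratified by the MAF bins above, follows, using simulated dosages as placeholder data.

```python
import numpy as np

def per_variant_r2(dosage_true, dosage_pred):
    """Squared Pearson correlation per variant (columns = variants)."""
    t = dosage_true - dosage_true.mean(axis=0)
    p = dosage_pred - dosage_pred.mean(axis=0)
    num = (t * p).sum(axis=0)
    den = np.sqrt((t ** 2).sum(axis=0) * (p ** 2).sum(axis=0))
    return (num / np.where(den == 0, np.nan, den)) ** 2

bins = [0.001, 0.005, 0.01, 0.05, 0.1, 0.2, 0.3, 0.4, 0.5]
rng = np.random.default_rng(0)
true = rng.integers(0, 3, (1000, 5000)).astype(float)  # 0/1/2 alt dosages
pred = np.clip(true + rng.normal(0, 0.5, true.shape), 0, 2)
maf = np.minimum(true.mean(axis=0) / 2, 1 - true.mean(axis=0) / 2)
r2 = per_variant_r2(true, pred)
for lo, hi in zip(bins[:-1], bins[1:]):
    sel = (maf >= lo) & (maf < hi)
    if sel.any():
        print(f"MAF [{lo}-{hi}): mean r2 = {np.nanmean(r2[sel]):.3f}")
```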
The inventors used the MESA cohort for inference runtime comparisons. Runtime was determined using the average and standard error of three imputation replicates. Two hardware configurations were used for the tests: 1) a low-end environment: 16-core Intel Xeon CPU (E5-2640 v2, 2.00 GHz), 250 GB RAM, and one GPU (NVIDIA GTX 1080); 2) a high-end environment: 24-core AMD CPU (EPYC 7352, 2.3 GHz), 250 GB RAM, using one NVIDIA A100 GPU. The inventors report computation time only; input/output (I/O) reading/writing times are excluded as separately optimized functions.
The data that support the findings of this study are available from dbGaP and the European Genome-phenome Archive (EGA), but restrictions apply to the availability of these data, which were used under ethics approval for the current study, and so are not openly available to the public. The computational pipeline for autoencoder training and validation is available at https://github.com/TorkamaniLab/Imputation_Autoencoder/tree/master/autoencoder_tuning_pipeline. The Python script for calculating imputation accuracy is available at https://github.com/TorkamaniLab/imputation_accuracy_calculator.
A preliminary comparison of the best performing autoencoder per genomic segment vs HMM-based imputation was made after the initial grid search.
In order to understand the relationship between genomic segment features, hyperparameter values, and imputation performance, the inventors calculated predictive features for each genomic segment and determined their Spearman correlation with the differences in r-squared observed for the autoencoder vs Minimac4.
The inventors then used the genomic features significantly correlated with imputation performance to predict the performance of hyperparameter combinations and to select the hyperparameter values to advance to fine-tuning. An ensemble model inference approach was able to predict the genomic segment-specific performance of hyperparameter combinations with high accuracy.
After tuning, autoencoder performance surpassed HMM-based imputation performance across all imputation methods, independent test datasets, and genotyping array marker sets. At a minimum, autoencoders surpassed HMM-based imputation performance in >62% of chromosome 22 genomic segments (two-proportion test p=1.02×10⁻¹¹).
Tuning improved the performance of the autoencoders across all genomic segments, generally increasing the superiority of autoencoders relative to HMM-based approaches in genomic segments with complex haplotype structures, while equalizing performance relative to HMM-based approaches in genomic segments with simpler LD structures (as described herein, by the number of unique haplotypes).
After merging the results from all genomic segments, the whole-chromosome accuracy of autoencoder-based imputation remained superior to all HMM-based imputation tools, across all independent test datasets and all genotyping array marker sets (Wilcoxon rank-sum test p≤5.55×10⁻⁶⁷). The autoencoder's mean r-squared per variant ranged from 0.363 for HGDP to 0.605 for the Wellderly, vs 0.340 to 0.557 for Minimac4, 0.326 to 0.549 for Beagle5, and 0.314 to 0.547 for Impute5, respectively. Detailed comparisons are presented in Table 6 and Table 7.
Further, when imputation accuracy is stratified by MAF bins, the autoencoders maintain superiority across all MAF bins for nearly all test datasets and genotyping array marker sets.
When the reference panel of the HMM-based tools was upgraded with the more expansive TOPMed cohort, the superior performance of the HRC-trained autoencoder was still sustained across all datasets except for MESA.
Finally, the inventors evaluated ancestry-specific imputation accuracy. As before, overall autoencoder-based imputation maintains superiority across all continental populations present in MESA.
Further stratification of ancestry-specific imputation accuracy results by MAF continues to support autoencoder superiority across all ancestries, MAF bins, and nearly all test datasets and genotyping array marker sets.
Thus, with training on equivalent reference cohorts, autoencoder performance was superior across all variant allele frequencies and ancestries, with the primary source of superiority arising from hard-to-impute regions with complex LD structures. When the reference panel of the HMM-based tools is upgraded to the more diverse TOPMed dataset, the HRC-trained autoencoder remains superior across all ancestry groups of HGDP.
Inference runtimes for the autoencoder vs HMM-based methods were compared in a low-end and a high-end computational environment as described herein. In the low-end environment, the autoencoder's inference time is at least ~4× faster than all HMM-based inference times (summing all inference times from all genomic segments of chromosome 22, the inference time for the autoencoder was 2.4±1.1×10⁻³ seconds versus 1,754±3.2, 583.3±0.01, and 8.4±4.3×10⁻³ seconds for Minimac4, Beagle5, and Impute5, respectively).
Artificial neural network-based data mining techniques are revolutionizing biomedical informatics and analytics35,36. Here, the inventors have demonstrated the potential for these techniques to execute a fundamental analytical task in population genetics, genotype imputation, producing superior results in a computationally efficient and portable framework. The trained autoencoders can be transferred easily and execute their functions rapidly, even in modest computing environments, obviating the need to transfer private genotype data to external imputation servers or services. Furthermore, the fully trained autoencoders robustly surpass the performance of all modern HMM-based imputation approaches across all tested independent datasets, genotyping array marker sets, minor allele frequency spectra, and diverse ancestry groups. This superiority was most apparent in genomic regions with low LD and/or high complexity in their linkage disequilibrium structure.
Superior imputation accuracy is expected to improve GWAS power, enable more complete coverage in meta-analyses, and improve causal variant identification through fine-mapping. Moreover, superior imputation accuracy in low LD regions may enable the more accurate interrogation of specific classes of genes under a greater degree of selective pressure and involved in environmental sensing. For example, promoter regions of genes associated with inflammatory immune responses, response to pathogens, environmental sensing, and neurophysiological processes (including sensory perception genes) are often located in regions of low LD36,37. These are known disease-associated biological processes that are critical to interrogate accurately in GWAS. Thus, the autoencoder-based imputation approach improves both the statistical power and the biological coverage of individual GWAS and downstream meta-analyses.
The various methods and techniques described above provide a number of ways to carry out the application. Of course, it is to be understood that not necessarily all objectives or advantages described are achieved in accordance with any particular embodiment described herein. Thus, for example, those skilled in the art will recognize that the methods can be performed in a manner that achieves or optimizes one advantage or group of advantages as taught herein without necessarily achieving other objectives or advantages as taught or suggested herein. A variety of alternatives are mentioned herein. It is to be understood that some embodiments specifically include one, another, or several features, while others specifically exclude one, another, or several features, while still others mitigate a particular feature by including one, another, or several other features.
Furthermore, the skilled artisan will recognize the applicability of various features from different embodiments. Similarly, the various elements, features and steps discussed above, as well as other known equivalents for each such element, feature or step, can be employed in various combinations by one of ordinary skill in this art to perform methods in accordance with the principles described herein. Among the various elements, features, and steps some will be specifically included and others specifically excluded in diverse embodiments.
Although the application has been disclosed in the context of certain embodiments and examples, it will be understood by those skilled in the art that the embodiments of the application extend beyond the specifically disclosed embodiments to other alternative embodiments and/or uses and modifications and equivalents thereof.
In some embodiments, any numbers expressing quantities of ingredients, properties such as molecular weight, reaction conditions, and so forth, used to describe and claim certain embodiments of the disclosure are to be understood as being modified in some instances by the term “about.” Accordingly, in some embodiments, the numerical parameters set forth in the written description and any included claims are approximations that can vary depending upon the desired properties sought to be obtained by a particular embodiment. In some embodiments, the numerical parameters should be construed in light of the number of reported significant digits and by applying ordinary rounding techniques. Notwithstanding that the numerical ranges and parameters setting forth the broad scope of some embodiments of the application are approximations, the numerical values set forth in the specific examples are usually reported as precisely as practicable.
In some embodiments, the terms “a” and “an” and “the” and similar references used in the context of describing a particular embodiment of the application (especially in the context of certain claims) are construed to cover both the singular and the plural. The recitation of ranges of values herein is merely intended to serve as a shorthand method of referring individually to each separate value falling within the range. Unless otherwise indicated herein, each individual value is incorporated into the specification as if it were individually recited herein. All methods described herein can be performed in any suitable order unless otherwise indicated herein or otherwise clearly contradicted by context. The use of any and all examples, or exemplary language (for example, “such as”) provided with respect to certain embodiments herein is intended merely to better illuminate the application and does not pose a limitation on the scope of the application otherwise claimed. No language in the specification should be construed as indicating any non-claimed element essential to the practice of the application.
Variations on preferred embodiments will become apparent to those of ordinary skill in the art upon reading the foregoing description. It is contemplated that skilled artisans can employ such variations as appropriate, and the application can be practiced otherwise than specifically described herein. Accordingly, many embodiments of this application include all modifications and equivalents of the subject matter recited in the claims appended hereto as permitted by applicable law. Moreover, any combination of the above-described elements in all possible variations thereof is encompassed by the application unless otherwise indicated herein or otherwise clearly contradicted by context.
All patents, patent applications, publications of patent applications, and other material, such as articles, books, specifications, publications, documents, things, and/or the like, referenced herein are hereby incorporated herein by this reference in their entirety for all purposes, excepting any prosecution file history associated with same, any of same that is inconsistent with or in conflict with the present document, or any of same that may have a limiting effect as to the broadest scope of the claims now or later associated with the present document. By way of example, should there be any inconsistency or conflict between the description, definition, and/or the use of a term associated with any of the incorporated material and that associated with the present document, the description, definition, and/or the use of the term in the present document shall prevail.
In closing, it is to be understood that the embodiments of the application disclosed herein are illustrative of the principles of the embodiments of the application. Other modifications that can be employed can be within the scope of the application. Thus, by way of example, but not of limitation, alternative configurations of the embodiments of the application can be utilized in accordance with the teachings herein. Accordingly, embodiments of the present application are not limited to that precisely as shown and described.
The present application for patent claims benefit of Provisional Application No. 63/304,427 entitled “A NEURAL NETWORK APPROACH TO GENETIC IMPUTATION, FEATURE EXTRACTION, AND DIMENSIONALITY REDUCTION” filed Jan. 28, 2022 and hereby expressly incorporated by reference herein.
Filing Document | Filing Date | Country | Kind
---|---|---|---
PCT/US2023/061455 | 1/27/2023 | WO |

Number | Date | Country
---|---|---
63304427 | Jan 2022 | US