The technology disclosed relates to artificial intelligence type computers and digital data processing systems and corresponding data processing methods and products for emulation of intelligence (i.e., knowledge based systems, reasoning systems, and knowledge acquisition systems); and including systems for reasoning with uncertainty (e.g., fuzzy logic systems), adaptive systems, machine learning systems, and artificial neural networks. In particular, the technology disclosed relates to using techniques for converting context of an artificial neural network (ANN) or another type of computing system that is trainable through machine learning.
The following are incorporated by reference for all purposes as if fully set forth herein:
The subject matter discussed in this section should not be assumed to be prior art merely as a result of its mention in this section. Similarly, a problem mentioned in this section or associated with the subject matter provided as background should not be assumed to have been previously recognized in the prior art. The subject matter in this section merely represents different approaches, which in and of themselves can also correspond to implementations of the claimed technology.
Described herein are technologies for converting context of an ANN or context of another type of computing system that is trainable through machine learning. In some implementations, the technologies convert a first context of a computing system (such as an ANN), which is to provide pathogenicity of variants (e.g., missense variants) of genomes of a population, to a second context of the computing system, which is to provide pathogenicity of indels of the genomes of the population.
In providing such technologies, the systems and methods described herein overcome some technical problems in obtaining scores from a computing system in which the context of the computing system is changed. Also, the techniques disclosed herein provide specific technical solutions to at least overcome the technical problems mentioned herein as well as other technical problems not described herein but recognized by those skilled in the art.
With respect to some implementations, disclosed herein are computerized methods for converting context of an ANN or context of another type of computing system, as well as a non-transitory computer-readable storage medium for carrying out technical operations of the computerized methods. The non-transitory computer-readable storage medium has tangibly stored thereon, or tangibly encoded thereon, computer readable instructions that when executed by one or more devices (e.g., one or more personal computers or servers) cause at least one processor to perform a method for converting context of an ANN or context of another type of computing system.
In the drawings, like reference characters generally refer to like parts throughout the different views. Also, the drawings are not necessarily to scale, with an emphasis instead generally being placed upon illustrating the principles of the technology disclosed. In the following description, various implementations of the technology disclosed are described with reference to the following drawings.
The following discussion is presented to enable any person skilled in the art to make and use the technology disclosed and is provided in the context of a particular application and its requirements. Various modifications to the disclosed implementations will be readily apparent to those skilled in the art, and the general principles defined herein may be applied to other implementations and applications without departing from the spirit and scope of the technology disclosed. Thus, the technology disclosed is not intended to be limited to the implementations shown but is to be accorded the widest scope consistent with the principles and features disclosed herein.
PrimateAI is a deep residual neural network for classifying the pathogenicity of missense mutations. In at least one version, PrimateAI is trained on a dataset of ~380,000 common variants from humans and six non-human primate species, using a semi-supervised benign vs. unlabeled training regimen. In such version(s), the input to the network is the amino acid sequence flanking the variant of interest and the orthologous sequence alignments in other species, without any additional human-engineered features, and the output is the pathogenicity score from 0 (less pathogenic) to 1 (more pathogenic). In such version(s), to incorporate information about protein structure, PrimateAI can learn to predict secondary structure and solvent accessibility from amino acid sequence and includes these as sub-networks in the full model. Also, in such version(s), the total size of the network, with protein structure included, is 36 layers of convolutions, including roughly 400,000 trainable parameters.
gnomAD
The Genome Aggregation Database (gnomAD) is a resource developed by an international coalition of investigators, with the goal of aggregating and harmonizing both exome and genome sequencing data from a wide variety of large-scale sequencing projects and making summary data available for the wider scientific community. Multiple versions of the gnomAD have been released.
Described herein are techniques for converting context of an artificial neural network or another type of computing system that is trainable through machine learning. Examples of the techniques disclosed herein convert a first context for a computing system (such as an ANN) to a second context for the computing system. Specifically, the first context for the computing system is pathogenicity of variants (e.g., missense variants) of genomes of a population, and the second context for the computing system is pathogenicity of indels of the genomes of the population. To put it another way, some of the techniques disclosed herein provide operations for converting a computing system or the output of the computing system, which is initially meant to provide pathogenicity of variants (e.g., missense variants) of genomes of a population, to a computing system or the output of the computing system that provides pathogenicity of indels of the genomes of the population.
The actions of
In one implementation, the ANN is a multilayer perceptron (MLP). In another implementation, the ANN is a feedforward neural network. In yet another implementation, the ANN is a fully-connected neural network. In a further implementation, the ANN is a fully convolutional neural network. In a yet further implementation, the ANN is a semantic segmentation neural network. In yet another further implementation, the ANN is a generative adversarial network (GAN) (e.g., CycleGAN, StyleGAN, pixelRNN, text-2-image, DiscoGAN, IsGAN).
In one implementation, the ANN is a convolutional neural network (CNN) with a plurality of convolution layers. In another implementation, the ANN is a recurrent neural network (RNN) such as a long short-term memory network (LSTM), bi-directional LSTM (Bi-LSTM), or a gated recurrent unit (GRU). In yet another implementation, the ANN includes both a CNN and an RNN.
In yet other implementations, the ANN can use 1D convolutions, 2D convolutions, 3D convolutions, 4D convolutions, 5D convolutions, dilated or atrous convolutions, transpose convolutions, depthwise separable convolutions, pointwise convolutions, 1×1 convolutions, group convolutions, flattened convolutions, spatial and cross-channel convolutions, shuffled grouped convolutions, spatial separable convolutions, and deconvolutions. The ANN can use one or more loss functions such as logistic regression/log loss, multi-class cross-entropy/softmax loss, binary cross-entropy loss, mean-squared error loss, L1 loss, L2 loss, smooth L1 loss, and Huber loss. The ANN can use any parallelism, efficiency, and compression schemes such as TFRecords, compressed encoding (e.g., PNG), sharding, parallel calls for map transformation, batching, prefetching, model parallelism, data parallelism, and synchronous/asynchronous stochastic gradient descent (SGD). The ANN can include upsampling layers, downsampling layers, recurrent connections, gates and gated memory units (like an LSTM or GRU), residual blocks, residual connections, highway connections, skip connections, peephole connections, activation functions (e.g., non-linear transformation functions like rectified linear unit (ReLU), leaky ReLU, exponential linear unit (ELU), sigmoid and hyperbolic tangent (tanh)), batch normalization layers, regularization layers, dropout, pooling layers (e.g., max or average pooling), global average pooling layers, and attention mechanisms (e.g., self-attention).
The ANN can be a rule-based model, a linear regression model, a logistic regression model, an Elastic Net model, a support vector machine (SVM), a random forest (RF), a decision tree, or a boosted decision tree (e.g., XGBoost), or some other tree-based logic (e.g., metric trees, kd-trees, R-trees, universal B-trees, X-trees, ball trees, locality sensitive hashes, and inverted indexes). The ANN can be an ensemble of multiple models, in some implementations.
The ANN is trained using backpropagation-based gradient update techniques. Example gradient descent techniques that can be used for training the ANN include stochastic gradient descent, batch gradient descent, and mini-batch gradient descent. Some examples of gradient descent optimization algorithms that can be used to train the ANN are Momentum, Nesterov accelerated gradient, Adagrad, Adadelta, RMSprop, Adam, AdaMax, Nadam, and AMSGrad.
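As an illustration of one of the listed techniques, the following minimal Python sketch applies mini-batch gradient descent with a momentum term. It is not the training procedure of any particular ANN described herein: a single logistic unit stands in for the network, and the function names, learning rate, momentum coefficient, batch size, and toy data are all assumptions chosen for illustration.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def log_loss(data, w, b):
    """Mean binary cross-entropy of a single logistic unit over (x, y) pairs."""
    eps = 1e-12  # guards against log(0)
    total = 0.0
    for x, y in data:
        p = sigmoid(w * x + b)
        total -= y * math.log(p + eps) + (1 - y) * math.log(1 - p + eps)
    return total / len(data)

def train(data, lr=0.1, beta=0.9, batch_size=8, epochs=30, seed=0):
    """Mini-batch gradient descent with momentum (hypothetical hyperparameters)."""
    rng = random.Random(seed)
    data = list(data)
    w = b = 0.0
    vw = vb = 0.0  # momentum (velocity) terms
    for _ in range(epochs):
        rng.shuffle(data)
        for start in range(0, len(data), batch_size):
            batch = data[start:start + batch_size]
            # Gradient of the cross-entropy loss with respect to w and b.
            gw = sum((sigmoid(w * x + b) - y) * x for x, y in batch) / len(batch)
            gb = sum((sigmoid(w * x + b) - y) for x, y in batch) / len(batch)
            # Momentum update: accumulate a decaying sum of past gradients.
            vw = beta * vw - lr * gw
            vb = beta * vb - lr * gb
            w += vw
            b += vb
    return w, b

# Toy data: label is 1 when x is positive.
xs = [i / 32 - 1 + 1 / 64 for i in range(64)]
data = [(x, 1.0 if x > 0 else 0.0) for x in xs]
w, b = train(data)
```

Setting `batch_size` to 1 gives stochastic gradient descent, and setting it to `len(data)` gives batch gradient descent.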
In different implementations, the ANN includes self-attention mechanisms like Transformer, Vision Transformer (ViT), Bidirectional Transformer (BERT), Detection Transformer (DETR), Deformable DETR, UP-DETR, DeiT, Swin, GPT, iGPT, GPT-2, GPT-3, SpanBERT, RoBERTa, XLNet, ELECTRA, UniLM, BART, T5, ERNIE (THU), KnowBERT, DeiT-Ti, DeiT-S, DeiT-B, T2T-ViT-14, T2T-ViT-19, T2T-ViT-24, PVT-Small, PVT-Medium, PVT-Large, TNT-S, TNT-B, CPVT-S, CPVT-S-GAP, CPVT-B, Swin-T, Swin-S, Swin-B, Twins-SVT-S, Twins-SVT-B, Twins-SVT-L, Shuffle-T, Shuffle-S, Shuffle-B, XCiT-S12/16, CMT-S, CMT-B, VOLO-D1, VOLO-D2, VOLO-D3, VOLO-D4, MoCo v3, ACT, TSP, Max-DeepLab, VisTR, SETR, Hand-Transformer, HOT-Net, METRO, Image Transformer, Taming transformer, TransGAN, IPT, TTSR, STTN, Masked Transformer, CLIP, DALL-E, Cogview, UniT, ASH, TinyBert, FullyQT, ConvBert, FCOS, Faster R-CNN+FPN, DETR-DC5, TSP-FCOS, TSP-RCNN, ACT+MKDD (L=32), ACT+MKDD (L=16), SMCA, Efficient DETR, ViT-B/16-FRCNN, PVT-Small+RetinaNet, Swin-T+RetinaNet, Swin-T+ATSS, PVT-Small+DETR, TNT-S+DETR, YOLOS-Ti, YOLOS-S, and YOLOS-B.
For the purpose of this disclosure, it is to be understood that a plurality of indels, in general, includes a plurality of insertions and/or a plurality of deletions. Also, for the purpose of this disclosure, it is to be understood that a variant is a generic term for a variant or an indel variant (i.e., an indel). And, for the purpose of this disclosure, it is to be understood that an indel variant (i.e., an indel) is a generic term for an insertion variant (i.e., an insertion) or a deletion variant (i.e., a deletion). And, unless specified otherwise herein, the term “variant” refers to a nucleic acid sequence that is different from a nucleic acid reference. Typical nucleic acid sequence variants include, without limitation, single nucleotide polymorphisms (SNPs), short deletion and insertion polymorphisms (indels), copy number variations (CNVs), microsatellite markers or short tandem repeats, and structural variations. Somatic variant calling is the effort to identify variants present at low frequency in the DNA sample. Somatic variant calling is of interest in the context of cancer treatment. Cancer is caused by an accumulation of mutations in DNA. A DNA sample from a tumor is generally heterogeneous, including some normal cells, some cells at an early stage of cancer progression (with fewer mutations), and some late-stage cells (with more mutations). Because of this heterogeneity, when sequencing a tumor (e.g., from an FFPE sample), somatic mutations will often appear at a low frequency. For example, an SNV might be seen in only 10% of the reads covering a given base. A variant that is to be classified as somatic or germline by the variant classifier is also referred to herein as the “variant under test.”
Method 100 commences with step 102, which includes processing a plurality of first variations of an object to generate a plurality of first scores pertaining to a respective quantifiable attribute for a variation of the plurality of first variations of the object. Method 100 then continues with step 104, which includes generating, according to one or more curve-forming functions, a first-context curve based on the plurality of first scores.
“Function” or “logic” (e.g., curve-forming functions), as used herein, can be implemented in the form of a computer product including a non-transitory computer readable storage medium with computer usable program code for performing the method steps described herein. The “logic” can be implemented in the form of an apparatus including a memory and at least one processor that is coupled to the memory and operative to perform exemplary method steps. The “logic” can be implemented in the form of means for carrying out one or more of the method steps described herein; the means can include (i) hardware module(s), (ii) software module(s) executing on one or more hardware processors, or (iii) a combination of hardware and software modules; any of (i)-(iii) implement the specific techniques set forth herein, and the software modules are stored in a computer readable storage medium (or multiple such media). In one implementation, the logic implements a data processing function. The logic can be a general purpose, single core or multicore, processor with a computer program specifying the function, a digital signal processor with a computer program, configurable logic such as an FPGA with a configuration file, a special purpose circuit such as a state machine, or any combination of these. Also, a computer program product can embody the computer program and configuration file portions of the logic.
Also, the method 100 commences with step 106, which includes processing a plurality of second variations of the object to generate a plurality of second scores pertaining to a respective quantifiable attribute for a variation of the plurality of second variations of the object. Method 100 then continues with step 108, which includes generating, according to one or more curve-forming functions, a second-context curve based on the plurality of second scores.
Next, the method 100 continues with step 110, which includes determining selection pattern differences between the first-context curve and the second-context curve. Then, the method 100 continues with step 112, which includes determining one or more scaling functions to reduce the selection pattern differences between the first-context curve and the second-context curve. Finally, at step 114, the method 100 continues with enhancing/calibrating/recalibrating/updating/optimizing/modifying the plurality of second scores according to the scaling function(s) to provide increased accuracy of the respective quantifiable attribute for each second variation of the plurality of second variations of the object.
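Steps 102 through 114 can be illustrated with a minimal Python sketch. It is not the disclosed implementation: the per-bin averaging used as the curve-forming function, the `population_fraction` inputs, the nearest-value bin remapping used as the scaling function, the toy data, and every function name are assumptions made here for illustration only.

```python
def bin_index(score, n_bins):
    # Scores are assumed to lie in [0, 1]; a score of exactly 1.0 goes in the last bin.
    return min(int(score * n_bins), n_bins - 1)

def form_curve(scores, population_fraction, n_bins=5):
    """Hypothetical curve-forming function: mean population fraction per score bin."""
    totals = [0.0] * n_bins
    counts = [0] * n_bins
    for s, f in zip(scores, population_fraction):
        b = bin_index(s, n_bins)
        totals[b] += f
        counts[b] += 1
    # Empty bins have no data point and are reported as None.
    return [t / c if c else None for t, c in zip(totals, counts)]

def curve_differences(first_curve, second_curve):
    """Per-bin difference (second minus first), skipping bins missing in either curve."""
    return [None if a is None or b is None else b - a
            for a, b in zip(first_curve, second_curve)]

def total_difference(diffs):
    return sum(abs(d) for d in diffs if d is not None)

def fit_scaling(first_curve, second_curve):
    """Hypothetical scaling function: move each second-context bin to the
    first-context bin whose curve value is closest."""
    n_bins = len(first_curve)
    usable = [i for i, v in enumerate(first_curve) if v is not None]
    mapping = []
    for b, value in enumerate(second_curve):
        if value is None or not usable:
            mapping.append(b)  # nothing to match against; leave the bin in place
        else:
            mapping.append(min(usable, key=lambda i: (abs(first_curve[i] - value), abs(i - b))))

    def scale(score):
        b = bin_index(score, n_bins)
        offset = score * n_bins - b  # position of the score inside its bin
        return min((mapping[b] + offset) / n_bins, 1.0)

    return scale

# Toy data: one variation per bin, chosen so the second context under-scores.
first_scores = [0.1, 0.3, 0.5, 0.7, 0.9]
first_fraction = [0.9, 0.7, 0.5, 0.3, 0.1]
second_scores = [0.1, 0.3, 0.5, 0.7, 0.9]
second_fraction = [0.7, 0.5, 0.3, 0.1, 0.1]

first_curve = form_curve(first_scores, first_fraction)      # step 104
second_curve = form_curve(second_scores, second_fraction)   # step 108
before = curve_differences(first_curve, second_curve)       # step 110
scale = fit_scaling(first_curve, second_curve)              # step 112
recalibrated = [scale(s) for s in second_scores]            # step 114
after = curve_differences(first_curve, form_curve(recalibrated, second_fraction))
```

In this toy case each second-context score is shifted up by one bin, after which the re-formed second-context curve matches the first-context curve in every bin that still holds data.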
In some implementations of the method 100 (such as method 200 shown in
Method 200 commences with step 202, which includes processing a plurality of variants to generate a plurality of missense pathogenicity scores for each variant of the plurality of variants. Method 200 then continues with step 204, which includes generating, according to one or more curve-forming functions, a missense curve based on the plurality of missense pathogenicity scores.
Also, the method 200 commences with step 206, which includes processing a plurality of indels to generate a plurality of indel pathogenicity scores for each indel of the plurality of indels. Method 200 then continues with step 208, which includes generating, according to the curve-forming function(s), an indel curve based on the plurality of indel pathogenicity scores.
Next, the method 200 continues with step 210, which includes determining selection pattern differences between the indel curve and the missense curve. Then, the method 200 continues with step 212, which includes determining one or more scaling functions to reduce the selection pattern differences between the missense curve and the indel curve. Finally, at step 214, the method 200 continues with enhancing/calibrating/recalibrating/updating/optimizing/modifying the plurality of indel pathogenicity scores according to the scaling function(s) to provide a recalibrated accuracy of indel pathogenicity score for each indel of the plurality of indels.
In some implementations of the aforesaid methods, the curve-forming function(s) include a function that accounts for proportions of different indels and proportions of different variants in genomes of a population. For instance, in some implementations, the curve-forming function(s) include a function that accounts for natural selection of different indels and natural selection of different variants in the genomes of the population. See
In more generic implementations, the curve-forming function(s) include a function that accounts for proportions of the first variations of the object and proportions of the second variations of the object in populations of the object.
In some implementations of the aforesaid methods (such as method 300 shown in
Method 300 commences with step 302, which includes processing a plurality of variants to generate a plurality of missense pathogenicity scores for each variant of the plurality of variants. Method 300 then continues with step 304, which includes generating, according to one or more curve-forming functions, a missense curve based on the plurality of missense pathogenicity scores.
Also, the method 300 commences with step 306a, which includes processing a plurality of insertions to generate a plurality of insertion scores for each insertion of the plurality of insertions. The method 300 also commences with step 306b, which includes processing a plurality of deletions to generate a plurality of deletion scores for each deletion of the plurality of deletions. Method 300 then continues with step 308a, which includes generating, according to the curve-forming function(s), an insertion curve based on the plurality of insertion scores. Also, method 300 continues with step 308b, which includes generating, according to the curve-forming function(s), a deletion curve based on the plurality of deletion scores.
Next, the method 300 continues with step 310a, which includes determining selection pattern differences between the insertion curve and the missense curve. Also, the method 300 continues with step 310b, which includes determining selection pattern differences between the deletion curve and the missense curve. Then, the method 300 continues with step 312a, which includes determining one or more scaling functions to reduce the selection pattern differences between the missense curve and the insertion curve. Further, the method 300 continues with step 312b, which includes determining additional one or more scaling functions to reduce the selection pattern differences between the missense curve and the deletion curve. Finally, at step 314, the method 300 continues with enhancing/calibrating/recalibrating/updating/optimizing/modifying the plurality of insertion scores and the plurality of deletion scores according to the respective scaling function(s) to provide a recalibrated accuracy of insertion pathogenicity score for each insertion of the plurality of insertions and each deletion of the plurality of deletions.
In some implementations of the aforesaid methods (e.g., see
In more generic examples (e.g., see
In some implementations of the aforesaid methods (e.g., see
Analogous techniques to the techniques shown in
In some implementations, measuring the central tendency distribution of the indel pathogenicity scores includes determining a mean of the indel pathogenicity scores. For example, measuring the central tendency distribution of the insertion scores includes determining a mean of the insertion scores and measuring the central tendency distribution of the deletion scores includes determining a mean of the deletion scores. Also, in some implementations, measuring the central tendency distribution of the missense pathogenicity scores includes determining a mean of the missense pathogenicity scores. In some implementations, measuring the central tendency distribution of the indel pathogenicity scores includes determining a mode of the indel pathogenicity scores. For example, measuring the central tendency distribution of the insertion scores includes determining a mode of the insertion scores and measuring the central tendency distribution of the deletion scores includes determining a mode of the deletion scores. Also, in some implementations, measuring the central tendency distribution of the missense pathogenicity scores includes determining a mode of the missense pathogenicity scores. In some implementations, measuring the central tendency distribution of the indel pathogenicity scores includes determining a median of the indel pathogenicity scores. For example, measuring the central tendency distribution of the insertion scores includes determining a median of the insertion scores and measuring the central tendency distribution of the deletion scores includes determining a median of the deletion scores. Also, in some implementations, measuring the central tendency distribution of the missense pathogenicity scores includes determining a median of the missense pathogenicity scores. Also, such techniques apply to even more generic implementations as well. 
For example, measuring the central tendency distribution of the first-context scores includes determining a mean, mode, or median of the first-context scores. And, measuring the central tendency distribution of the second-context scores includes determining a mean, mode, or median of the second-context scores.
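The three measures named above can be computed with the Python standard library, as in the following minimal sketch. The function name `central_tendency` and the example scores are hypothetical; the sketch only shows the mean, median, and mode of a list of scores and does not represent how the disclosed system stores or processes them.

```python
import statistics

def central_tendency(scores, measure="mean"):
    """Return the mean, median, or mode of a non-empty list of scores."""
    if measure == "mean":
        return statistics.mean(scores)
    if measure == "median":
        return statistics.median(scores)
    if measure == "mode":
        # For continuous scores, a mode is only informative after rounding
        # or binning, since exact repeats are otherwise rare.
        return statistics.mode(scores)
    raise ValueError(f"unknown measure: {measure}")

# Hypothetical indel pathogenicity scores.
indel_scores = [0.2, 0.4, 0.4, 0.9]
indel_mean = central_tendency(indel_scores, "mean")
indel_median = central_tendency(indel_scores, "median")
indel_mode = central_tendency(indel_scores, "mode")
```

The same function can be applied separately to insertion scores, deletion scores, missense pathogenicity scores, or generic first-context and second-context scores.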
In some implementations of the aforesaid methods (e.g., see
In more generic examples (e.g., see
In even more generic examples (e.g., see
In some implementations of the aforesaid methods (e.g., see
In some implementations of the aforesaid methods (e.g., see
In some implementations (e.g., see
In some implementations, comparable variants of the variants are synonymous mutations for variants. And, in some of such examples, the enhancing/calibrating/recalibrating/updating/optimizing/modifying of the plurality of missense pathogenicity scores includes calibrating missense propensity scores based on the synonymous mutations for variants.
In some implementations, the comparable variants of the indels are indels in coding and noncoding regions of the genomes of the population. And, in some of such examples, the enhancing/calibrating/recalibrating/updating/optimizing/modifying of the plurality of insertion scores includes calibrating insertion propensity scores based on an observed versus expected ratio based on insertions occurring in coding regions versus noncoding regions of the genomes of the population, respectively. Also, in such instances, the enhancing/calibrating/recalibrating/updating/optimizing/modifying of the plurality of deletion scores includes calibrating deletion propensity scores based on an observed versus expected ratio based on deletions occurring in coding regions versus noncoding regions of the genomes of the population, respectively.
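One way to picture an observed versus expected ratio is the minimal Python sketch below. It assumes, purely for illustration, that the expected number of coding-region indels is obtained by applying the per-site rate seen in noncoding regions to the number of coding sites; the function name, its parameters, and that normalization are hypothetical and are not stated in this disclosure.

```python
def observed_expected_ratio(observed_coding, coding_sites,
                            observed_noncoding, noncoding_sites):
    """Ratio of observed coding indels to the count expected from the noncoding rate."""
    if coding_sites <= 0 or noncoding_sites <= 0:
        raise ValueError("site counts must be positive")
    # Assumption: noncoding regions supply the baseline per-site rate.
    noncoding_rate = observed_noncoding / noncoding_sites
    expected_coding = noncoding_rate * coding_sites
    if expected_coding == 0:
        raise ValueError("expected count is zero; ratio is undefined")
    return observed_coding / expected_coding

# Hypothetical counts for the insertions whose scores fall within one bin.
ratio = observed_expected_ratio(observed_coding=5, coding_sites=100,
                                observed_noncoding=200, noncoding_sites=1000)
```

Under these assumptions, a ratio below 1 means fewer insertions are observed in coding regions than the noncoding baseline predicts. The same function applies unchanged to deletion counts.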
In some implementations, the group of bins represents all the scores, and each bin of the group of bins represents a different range of scores in all the scores. Also, in some examples, all the scores include the plurality of insertion scores, the plurality of deletion scores, and the plurality of missense pathogenicity scores. In some examples, all the scores include the plurality of indel pathogenicity scores and the plurality of missense pathogenicity scores. And, in even more generic examples, all the scores include the plurality of first-context scores and the plurality of second-context scores.
In some implementations, each bin of the group of bins is associated with a certain amount of the plurality of insertions that have scores within a respective range of scores associated with the bin. Also, each bin of the group of bins is associated with a certain amount of the plurality of deletions that have scores within a respective range of scores associated with the bin. And, each bin of the group of bins is associated with a certain amount of the plurality of variants that have scores within a respective range of scores associated with the bin. The same can be said for more generic examples, with the bins being associated with indel and missense pathogenicity scores. And, the same can be said for even more generic examples, with the bins being associated with first-context scores and second-context scores.
In some implementations, the group of bins includes a group of percentile bins. For instance, the group of percentile bins includes one hundred bins. And, with the one hundred bins, a first bin of the one hundred bins represents scores that range from 0 to 0.01 and a one hundredth bin represents scores that range from 0.99 to 1. The bins between the first bin and the one hundredth bin each include a range of scores of a percentile.
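The one hundred percentile bins, and the amount of scores associated with each bin, can be sketched in Python as follows. The function names and example scores are hypothetical, and the bins are 0-indexed here, so the first bin is index 0 and the one hundredth bin is index 99.

```python
def percentile_bin(score, n_bins=100):
    """Return the 0-based bin index for a score in [0, 1]."""
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must lie in [0, 1]")
    # Bin 0 covers scores from 0 to 0.01 and bin 99 covers 0.99 to 1; a score
    # of exactly 1.0 is clamped into the last bin. Scores that sit on a bin
    # edge can land on either side because of floating-point rounding.
    return min(int(score * n_bins), n_bins - 1)

def count_per_bin(scores, n_bins=100):
    """Count how many scores fall into each bin."""
    counts = [0] * n_bins
    for s in scores:
        counts[percentile_bin(s, n_bins)] += 1
    return counts

# Hypothetical insertion scores.
insertion_counts = count_per_bin([0.005, 0.0075, 0.995])
```

Calling `count_per_bin` once each for insertion scores, deletion scores, and missense pathogenicity scores yields three count lists over the same group of bins.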
In some implementations (e.g., see
In some implementations, (e.g., see
As mentioned herein and with respect to
In some implementations, the first genome database includes a version of a Genome Aggregation Database (gnomAD). In some of such implementations, the second genome database includes a version of the gnomAD. In some instances, the second genome database and the first genome database are the same version of the gnomAD; and in some other implementations, the second and first genome databases are different versions of the gnomAD.
Method 800 commences with step 802, which includes identifying a plurality of variants in a first genome database. Also, method 800 starts with step 804, which includes identifying a plurality of indels in a second genome database. Method 800 continues with an artificial neural network (ANN) generating a plurality of missense pathogenicity scores for each variant of a plurality of variants (at step 806). Also, method 800 continues with the ANN generating a plurality of indel pathogenicity scores for each indel of a plurality of indels (at step 808). At step 810, the method 800 continues with further processing the plurality of indel pathogenicity scores and the plurality of missense pathogenicity scores to be applied to one or more curve-forming functions. At step 812, the method 800 continues with applying the further processed scores to the curve-forming function(s) to generate an indel curve and a missense curve. At step 814, the method 800 continues with determining selection pattern differences between the indel curve and the missense curve. At step 816, the method 800 continues with determining one or more scaling functions to reduce the selection pattern differences between the curves. At step 818, the method 800 continues with updating coefficients of the ANN according to the scaling function(s). In some implementations, updating the coefficients of the ANN according to the scaling function(s) includes enhancing/calibrating/recalibrating/updating/optimizing/modifying the plurality of indel pathogenicity scores according to the scaling function(s) to provide a recalibrated accuracy of indel pathogenicity score for each indel of the plurality of indels.
With respect to
With respect to
With respect to
In some of such implementations, the insertion curve includes a first plurality of data points including an insertion propensity score for each bin of a group of bins. Also, the deletion curve includes a second plurality of data points including a deletion propensity score for each bin of the group of bins. And, the missense curve includes a third plurality of data points including a missense propensity score for each bin of the group of bins. For an example of such data points being displayed on a graph, see
With respect to
Also, with respect to
Also, with respect to
Not depicted in
Also, with respect to
Also, with respect to
Also, with respect to
Also, with respect to
Also, with respect to
Also, with respect to
Also, with respect to
In some implementations, the computing system 900 corresponds to a host system that includes, is coupled to, or utilizes memory or is used to perform the operations performed by any one of the computing devices, data processors, and user interface devices described herein. In alternative implementations, the machine is connected (e.g., networked) to other machines in a LAN, an intranet, an extranet, or the Internet. In some implementations, the machine operates in the capacity of a server or a client machine in a client-server network environment, as a peer machine in a peer-to-peer (or distributed) network environment, or as a server or a client machine in a cloud computing infrastructure or environment. In some implementations, the machine is a personal computer (PC), a tablet PC, a cellular telephone, a web appliance, a server, or any machine capable of executing a set of instructions (sequential or otherwise) that specify actions to be taken by that machine. Further, while a single machine is illustrated, the term “machine” shall also be taken to include any collection of machines that individually or jointly execute a set (or multiple sets) of instructions to perform any one or more of the methodologies discussed herein.
The computing system 900 includes a processing device 902, a main memory 904 (e.g., read-only memory (ROM), flash memory, dynamic random-access memory (DRAM), etc.), a static memory 906 (e.g., flash memory, static random-access memory (SRAM), etc.), and a data storage system 910, which communicate with each other via a bus 930. The processing device 902 represents one or more general-purpose processing devices such as a microprocessor, a central processing unit, or the like. More particularly, the processing device is a microprocessor or a processor implementing other instruction sets, or processors implementing a combination of instruction sets. Or, the processing device 902 is one or more special-purpose processing devices such as an application specific integrated circuit (ASIC), a field programmable gate array (FPGA), a digital signal processor (DSP), network processor, or the like. The processing device 902 is configured to execute instructions 914 for performing the operations or steps discussed herein. In some implementations, the computing system 900 includes a network interface device 908 to communicate over a communications network 940 shown in
The data storage system 910 includes a machine-readable storage medium 912 (also known as a computer-readable medium) on which is stored one or more sets of instructions 914 or software embodying any one or more of the methodologies or functions described herein. The instructions 914 also reside, completely or at least partially, within the main memory 904 or within the processing device 902 during execution thereof by the computing system 900, the main memory 904 and the processing device 902 also constituting machine-readable storage media.
In some implementations, the instructions 914 include instructions to implement functionality corresponding to any one of the computing devices, data processors, user interface devices, and I/O devices described herein. While the machine-readable storage medium 912 is shown in an example implementation to be a single medium, the term “machine-readable storage medium” should be taken to include a single medium or multiple media that store the one or more sets of instructions. The term “machine-readable storage medium” shall also be taken to include any medium that is capable of storing or encoding a set of instructions for execution by the machine and that cause the machine to perform any one or more of the methodologies of the present disclosure. The term “machine-readable storage medium” shall accordingly be taken to include solid-state memories, optical media, magnetic media, or the like.
Also, as shown, the computing system 900 includes a user interface 920 that, in some implementations, includes a display and, for example, implements functionality corresponding to any one of the user interface devices disclosed herein. A user interface, such as the user interface 920, or a user interface device described herein includes any space or equipment where interactions between humans and machines occur. A user interface described herein allows operation and control of the machine by a human user, while the machine simultaneously provides feedback information to the user. Examples of a user interface (UI) or user interface device include the interactive aspects of computer operating systems (such as graphical user interfaces (GUIs)), machinery operator controls, and process controls. A UI described herein includes one or more layers, including a human-machine interface (HMI) that interfaces machines with physical input hardware and output hardware.
Also, it is to be understood that the methodologies discussed herein are computer-implemented methods and, in some implementations, are implementable by the computing system 900. For instance, a computer-implemented method includes an artificial neural network (ANN) generating a plurality of missense pathogenicity scores for each variant of a plurality of variants. Also, the computer-implemented method includes the ANN generating a plurality of indel pathogenicity scores for each indel of a plurality of indels. Further, the computer-implemented method includes applying the plurality of indel pathogenicity scores and the plurality of missense pathogenicity scores to one or more curve-forming functions. And, the computer-implemented method includes further processing the scores using the curve-forming function(s) to generate an indel curve and a missense curve and determining selection pattern differences between the indel curve and the missense curve. Also, the computer-implemented method includes determining one or more scaling functions to reduce the selection pattern differences between the curves and updating coefficients of the ANN according to the scaling function(s). The updating of the coefficients of the ANN according to the scaling function(s) includes enhancing/calibrating/recalibrating/updating/optimizing/modifying the plurality of indel pathogenicity scores according to the scaling function(s) to provide a recalibrated accuracy of indel pathogenicity score for each indel of the plurality of indels.
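As a non-limiting illustration only, the score-binning, curve-forming, and scaling steps of such a method can be sketched as follows. The sketch is not the disclosed implementation. It assumes that pathogenicity scores lie between 0 and 1, that the bins are one hundred equal-width score ranges, that a propensity score for a bin is the fraction of variants in the bin that are observed in the population, and that a scaling function is a nearest-propensity remapping of indel bins onto missense bins. The names bin_index, form_curve, fit_scaling, and recalibrate are hypothetical, and the updating of ANN coefficients is outside the scope of the sketch.

```python
from statistics import mean

N_BINS = 100  # assumption: one hundred bins of width 0.01 over scores in [0, 1]


def bin_index(score, n_bins=N_BINS):
    """Map a score in [0, 1] to a bin index; a score of exactly 1.0 goes in the last bin."""
    return min(int(score * n_bins), n_bins - 1)


def form_curve(scores, observed, n_bins=N_BINS):
    """Hypothetical curve-forming function.

    Returns one propensity value per bin: the fraction of variants whose score
    falls in the bin and that are observed in the population (None for empty bins).
    """
    hits = [[] for _ in range(n_bins)]
    for score, seen in zip(scores, observed):
        hits[bin_index(score, n_bins)].append(1.0 if seen else 0.0)
    return [mean(h) if h else None for h in hits]


def fit_scaling(indel_curve, missense_curve):
    """Hypothetical scaling function.

    For each non-empty indel bin, pick the missense bin whose propensity is
    closest (ties broken by bin distance). Returns {indel_bin: missense_bin}.
    """
    candidates = [(b, p) for b, p in enumerate(missense_curve) if p is not None]
    mapping = {}
    for b, p in enumerate(indel_curve):
        if p is None or not candidates:
            continue
        best = min(candidates, key=lambda c: (abs(c[1] - p), abs(c[0] - b)))
        mapping[b] = best[0]
    return mapping


def recalibrate(scores, mapping, n_bins=N_BINS):
    """Move each indel score to the center of its matched missense bin.

    Scores in bins without a match are moved to the center of their own bin.
    """
    recalibrated = []
    for score in scores:
        b = bin_index(score, n_bins)
        target = mapping.get(b, b)
        recalibrated.append((target + 0.5) / n_bins)
    return recalibrated
```

In this sketch, the selection pattern difference is reduced because, after recalibration, an indel score shares a bin with missense scores that show a similar observed fraction. Other curve-forming functions and scaling functions, such as those described elsewhere herein, can be substituted without changing the overall flow.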
Some portions of the preceding detailed descriptions have been presented in terms of algorithms and symbolic representations of operations on data bits within a computer memory. These algorithmic descriptions and representations are the ways used by those skilled in the data processing arts to most effectively convey the substance of their work to others skilled in the art. An algorithm is here, and generally, conceived to be a self-consistent sequence of operations leading to a predetermined result. The operations are those requiring physical manipulations of physical quantities. Usually, though not necessarily, these quantities take the form of electrical or magnetic signals capable of being stored, combined, compared, and otherwise manipulated. It has proven convenient at times, principally for reasons of common usage, to refer to these signals as bits, values, elements, symbols, characters, terms, numbers, or the like.
It should be borne in mind, however, that these and similar terms are to be associated with the appropriate physical quantities and are merely convenient labels applied to these quantities. The present disclosure can refer to the action and processes of a computing system, or similar electronic computing device, which manipulates and transforms data represented as physical (electronic) quantities within the computing system's registers and memories into other data similarly represented as physical quantities within the computing system memories or registers or other such information storage systems.
The present disclosure also relates to an apparatus for performing the operations herein. This apparatus can be specially constructed for the intended purposes, or it can include a general-purpose computer selectively activated or reconfigured by a computer program stored in the computer. Such a computer program can be stored in a computer-readable storage medium, such as any type of disk including optical disks, CD-ROMs, and magneto-optical disks, read-only memories (ROMs), random access memories (RAMs), EPROMs, EEPROMs, magnetic or optical cards, or any type of media suitable for storing electronic instructions, each coupled to a computing system bus.
The algorithms and displays presented herein are not inherently related to any particular computer or other apparatus. Various general-purpose systems can be used with programs in accordance with the teachings herein, or it can prove convenient to construct a more specialized apparatus to perform the method. The structure for a variety of these systems will appear as set forth in the description below. In addition, the present disclosure is not described with reference to any particular programming language. It will be appreciated that a variety of programming languages can be used to implement the teachings of the disclosure as described herein.
The present disclosure can be provided as a computer program product, or software, which can include a machine-readable medium having stored thereon instructions, which can be used to program a computing system (or other electronic devices) to perform a process according to the present disclosure. A machine-readable medium includes any mechanism for storing information in a form readable by a machine (e.g., a computer). In some implementations, a machine-readable (e.g., computer-readable) medium includes a machine (e.g., a computer) readable storage medium such as a read only memory (“ROM”), random access memory (“RAM”), magnetic disk storage media, optical storage media, flash memory components, etc.
While the invention has been described in conjunction with the specific implementations described herein, it is evident that many alternatives, combinations, modifications, and variations are apparent to those skilled in the art. Accordingly, the example implementations of the invention, as set forth herein, are intended to be illustrative only and not limiting. Various changes can be made without departing from the spirit and scope of the invention.
We disclose the following clauses:
1. A computer-implemented method, comprising:
processing a plurality of variants to generate a plurality of missense pathogenicity scores for each variant of the plurality of variants;
generating, according to one or more curve-forming functions, a missense curve based on the plurality of missense pathogenicity scores;
processing a plurality of indels to generate a plurality of indel pathogenicity scores for each indel of the plurality of indels;
generating, according to the one or more curve-forming functions, an indel curve based on the plurality of indel pathogenicity scores;
determining selection pattern differences between the indel curve and the missense curve;
determining one or more scaling functions to reduce the selection pattern differences between the missense curve and the indel curve; and
enhancing/calibrating/recalibrating/updating/optimizing/modifying the plurality of indel pathogenicity scores according to the one or more scaling functions to provide a recalibrated accuracy of indel pathogenicity score for each indel of the plurality of indels.
2. The computer-implemented method of clause 1, wherein the one or more curve-forming functions comprise a function that accounts for proportions of different indels and proportions of different variants in genomes of a population.
3. The computer-implemented method of clause 2, wherein the one or more curve-forming functions comprise a function that accounts for natural selection of different indels and natural selection of different variants in the genomes of the population.
4. The computer-implemented method of clause 2, wherein the plurality of indels comprises a plurality of insertions and a plurality of deletions, and wherein the plurality of indel pathogenicity scores comprises a plurality of insertion scores and a plurality of deletion scores, respectively.
5. The computer-implemented method of clause 4, comprising:
generating, according to the one or more curve-forming functions, an insertion curve based on the plurality of insertion scores; and
generating, according to the one or more curve-forming functions, a deletion curve based on the plurality of deletion scores.
6. The computer-implemented method of clause 5,
wherein the insertion curve comprises a first plurality of data points comprising an insertion propensity score for each bin of a group of bins,
wherein the deletion curve comprises a second plurality of data points comprising a deletion propensity score for each bin of the group of bins, and
wherein the missense curve comprises a third plurality of data points comprising a missense propensity score for each bin of the group of bins.
7. The computer-implemented method of clause 6,
wherein the insertion propensity score for a bin of the group of bins relates to a proportion of different insertions in the genomes of the population that have insertion scores of the plurality of insertion scores that are associated with the bin,
wherein the deletion propensity score for a bin of the group of bins relates to a proportion of different deletions in the genomes of the population that have deletion scores of the plurality of deletion scores that are associated with the bin, and
wherein the missense propensity score for a bin of the group of bins relates to a proportion of variants in the genomes of the population that have missense pathogenicity scores of the plurality of missense pathogenicity scores that are associated with the bin.
8. The computer-implemented method of clause 7, wherein the generating of the insertion curve comprises:
grouping the plurality of insertions into the group of bins; and
for each bin of the group of bins:
9. The computer-implemented method of clause 7, wherein the generating of the deletion curve comprises:
grouping the plurality of deletions into the group of bins; and
for each bin of the group of bins:
10. The computer-implemented method of clause 7, wherein the generating of the missense curve comprises:
grouping the plurality of variants into the group of bins; and
for each bin of the group of bins:
11. The computer-implemented method of clause 7,
wherein the insertion propensity score for a bin of the group of bins represents a probability that one of the plurality of insertions associated with the bin occurs in the genomes of the population given a set of observed insertions,
wherein the deletion propensity score for the bin represents a probability that one of the plurality of deletions associated with the bin occurs in the genomes of the population given a set of observed deletions, and
wherein the missense propensity score for the bin represents a probability that one of the plurality of variants associated with the bin occurs in the genomes of the population given a set of observed variants.
12. The computer-implemented method of clause 11, wherein the insertion propensity score, the deletion propensity score, and the missense propensity score reduce selection bias by equating groups based on covariates, and wherein the covariates are the set of observed insertions, the set of observed deletions, and the set of observed variants, respectively.
13. The computer-implemented method of clause 7,
wherein the insertion curve is generated when the first plurality of data points is plotted on a two-dimensional graph with one axis for propensity scores and the other axis for the group of bins,
wherein the deletion curve is generated when the second plurality of data points is plotted on the two-dimensional graph, and
wherein the missense curve is generated when the third plurality of data points is plotted on the two-dimensional graph.
14. The computer-implemented method of clause 13,
wherein the two-dimensional graph comprises a set of ordered pairs (x, y),
wherein f(x)=y,
wherein x is the group of bins, and
wherein y is the propensity scores.
15. The computer-implemented method of clause 13, comprising:
determining selection pattern differences between the insertion curve and the missense curve;
determining one or more second scaling functions to reduce the selection pattern differences between the insertion curve and the missense curve; and
enhancing/calibrating/recalibrating/updating/optimizing/modifying the plurality of insertion scores according to the one or more second scaling functions to provide a recalibrated accuracy of insertion pathogenicity score for each insertion of the plurality of insertions.
16. The computer-implemented method of clause 15, comprising:
determining selection pattern differences between the deletion curve and the missense curve;
determining one or more third scaling functions to reduce the selection pattern differences between the deletion curve and the missense curve; and
enhancing/calibrating/recalibrating/updating/optimizing/modifying the plurality of deletion scores according to the one or more third scaling functions to provide a recalibrated accuracy of deletion pathogenicity score for each deletion of the plurality of deletions.
17. The computer-implemented method of clause 16,
wherein the one or more second scaling functions and the one or more third scaling functions are part of the one or more scaling functions, and
wherein the one or more scaling functions comprise functions to scale the proportions of different insertions, different deletions, and different variants in the genomes of the population, respectively, since indels and single-nucleotide variants have different mutability.
18. The computer-implemented method of clause 17, wherein the one or more scaling functions obtain scaling factors from comparable variants under natural selection.
19. The computer-implemented method of clause 18,
wherein the enhancing/calibrating/recalibrating/updating/optimizing/modifying of the plurality of insertion scores comprises scaling a plurality of insertion propensity scores according to first scaling factors of the scaling factors that are associated with insertions in the genomes of the population,
wherein the enhancing/calibrating/recalibrating/updating/optimizing/modifying of the plurality of deletion scores comprises scaling a plurality of deletion propensity scores according to second scaling factors of the scaling factors that are associated with deletions in the genomes of the population, and
wherein the enhancing/calibrating/recalibrating/updating/optimizing/modifying of the plurality of missense pathogenicity scores comprises scaling a plurality of missense propensity scores according to third scaling factors of the scaling factors that are associated with variants in the genomes of the population.
20. The computer-implemented method of clause 19, wherein comparable variants of the variants are synonymous mutations for variants.
21. The computer-implemented method of clause 20, wherein the enhancing/calibrating/recalibrating/updating/optimizing/modifying of the plurality of missense pathogenicity scores comprises calibrating missense propensity scores based on the synonymous mutations for variants.
22. The computer-implemented method of clause 19, wherein the comparable variants of the indels are indels in coding and noncoding regions of the genomes of the population.
23. The computer-implemented method of clause 22,
wherein the enhancing/calibrating/recalibrating/updating/optimizing/modifying of the plurality of insertion scores comprises calibrating insertion propensity scores based on an observed versus expected ratio based on insertions occurring in coding regions versus noncoding regions of the genomes of the population, respectively, and
wherein the enhancing/calibrating/recalibrating/updating/optimizing/modifying of the plurality of deletion scores comprises calibrating deletion propensity scores based on an observed versus expected ratio based on deletions occurring in coding regions versus noncoding regions of the genomes of the population, respectively.
24. The computer-implemented method of clause 6,
wherein the group of bins represents all scores,
wherein each bin of the group of bins represents a different range of scores in all the scores, and
wherein all the scores comprise the plurality of insertion scores, the plurality of deletion scores, and the plurality of missense pathogenicity scores.
25. The computer-implemented method of clause 24,
wherein each bin of the group of bins is associated with a certain amount of the plurality of insertions that have scores within a respective range of scores associated with the bin,
wherein each bin of the group of bins is associated with a certain amount of the plurality of deletions that have scores within a respective range of scores associated with the bin, and
wherein each bin of the group of bins is associated with a certain amount of the plurality of variants that have scores within a respective range of scores associated with the bin.
26. The computer-implemented method of clause 25, wherein the group of bins comprises a group of percentile bins.
27. The computer-implemented method of clause 26, wherein the group of percentile bins comprises one hundred bins, wherein a first bin of the one hundred bins represents scores that range from 0 to 0.01 and a one hundredth bin represents scores that range from 0.99 to 1, and wherein bins between the first bin and the one hundredth bin each comprise a range of scores of a percentile.
28. The computer-implemented method of clause 1, wherein the plurality of indel pathogenicity scores is generated by an artificial neural network (ANN), and wherein the processing of the plurality of indels is implemented by the ANN.
29. The computer-implemented method of clause 28, wherein the ANN is configured to classify pathogenicity of variants.
30. The computer-implemented method of clause 29, wherein the ANN comprises a deep residual neural network for classifying pathogenicity of missense mutations.
31. The computer-implemented method of clause 30, wherein the ANN comprises a version of PrimateAI.
32. The computer-implemented method of clause 1, further comprising identifying the plurality of variants in a first genome database.
33. The computer-implemented method of clause 32, wherein the first genome database comprises a version of a Genome Aggregation Database (gnomAD).
34. The computer-implemented method of clause 32, further comprising identifying the plurality of indels in a second genome database.
35. The computer-implemented method of clause 34, wherein the second genome database comprises a version of the gnomAD.
36. The computer-implemented method of clause 1, further comprising:
identifying the plurality of variants in a first genome database; and
identifying the plurality of indels in a second genome database,
37. The computer-implemented method of clause 1, wherein the generating of the indel curve comprises grouping the plurality of indels into a group of bins.
38. The computer-implemented method of clause 37, wherein the generating of the missense curve comprises grouping the plurality of variants into the group of bins.
39. The computer-implemented method of clause 38, wherein the generating of the indel curve comprises, for each bin of the group of bins:
measuring a central tendency distribution of indel pathogenicity scores in the bin; and
applying the central tendency distribution of the indel pathogenicity scores in the bin to identify an indel propensity score for the bin.
40. The computer-implemented method of clause 39, wherein the generating of the missense curve comprises, for each bin of the group of bins:
measuring a central tendency distribution of missense pathogenicity scores in the bin; and
applying the central tendency distribution of the missense pathogenicity scores in the bin to identify a missense propensity score for the bin.
41. The computer-implemented method of clause 39, wherein measuring the central tendency distribution of the indel pathogenicity scores comprises determining a mean of the indel pathogenicity scores.
42. The computer-implemented method of clause 40, wherein measuring the central tendency distribution of the missense pathogenicity scores comprises determining a mean of the missense pathogenicity scores.
43. The computer-implemented method of clause 39, wherein measuring the central tendency distribution of the indel pathogenicity scores comprises determining a median of the indel pathogenicity scores.
44. The computer-implemented method of clause 40, wherein measuring the central tendency distribution of the missense pathogenicity scores comprises determining a median of the missense pathogenicity scores.
45. The computer-implemented method of clause 39, wherein measuring the central tendency distribution of the indel pathogenicity scores comprises determining a mode of the indel pathogenicity scores.
46. The computer-implemented method of clause 40, wherein measuring the central tendency distribution of the missense pathogenicity scores comprises determining a mode of the missense pathogenicity scores.
47. The computer-implemented method of clause 8, wherein measuring the central tendency distribution of the insertion scores comprises determining a mean of the insertion scores.
48. The computer-implemented method of clause 9, wherein measuring the central tendency distribution of the deletion scores comprises determining a mean of the deletion scores.
49. The computer-implemented method of clause 10, wherein measuring the central tendency distribution of the missense pathogenicity scores comprises determining a mean of the missense pathogenicity scores.
50. The computer-implemented method of clause 10, wherein measuring the central tendencies of the insertion scores, the deletion scores, and the missense pathogenicity scores comprises determining a mode or a median of the scores.
51. A computer-implemented method, comprising:
identifying a plurality of variants in a first genome database;
identifying a plurality of indels in a second genome database;
generating, by an artificial neural network (ANN), a plurality of missense pathogenicity scores for each variant of the plurality of variants;
generating, by the ANN, a plurality of indel pathogenicity scores for each indel of the plurality of indels;
applying the plurality of indel pathogenicity scores and the plurality of missense pathogenicity scores to one or more curve-forming functions;
further processing the plurality of missense pathogenicity scores and the plurality of indel pathogenicity scores using the one or more curve-forming functions to generate an indel curve and a missense curve;
determining selection pattern differences between the indel curve and the missense curve;
determining one or more scaling functions to reduce the selection pattern differences between the indel curve and the missense curve; and
updating coefficients of the ANN according to the one or more scaling functions.
52. The computer-implemented method of clause 51,
wherein further processing the plurality of indel pathogenicity scores and the plurality of missense pathogenicity scores using the one or more curve-forming functions comprises:
53. The computer-implemented method of clause 51, wherein the one or more curve-forming functions comprise a function that accounts for proportions of different indels and proportions of different variants in genomes of a population.
54. The computer-implemented method of clause 53, wherein the one or more curve-forming functions comprise a function that accounts for natural selection of different indels and natural selection of different variants in the genomes of the population.
55. The computer-implemented method of clause 53, wherein the plurality of indels comprises a plurality of insertions and a plurality of deletions, and wherein the plurality of indel pathogenicity scores comprises a plurality of insertion scores and a plurality of deletion scores, respectively.
56. The computer-implemented method of clause 55, comprising:
generating, according to the one or more curve-forming functions, an insertion curve based on the plurality of insertion scores;
generating, according to the one or more curve-forming functions, a deletion curve based on the plurality of deletion scores; and
generating, according to the one or more curve-forming functions, the missense curve based on the plurality of missense pathogenicity scores.
57. The computer-implemented method of clause 56,
wherein the insertion curve comprises a first plurality of data points comprising an insertion propensity score for each bin of a group of bins,
wherein the deletion curve comprises a second plurality of data points comprising a deletion propensity score for each bin of the group of bins, and
wherein the missense curve comprises a third plurality of data points comprising a missense propensity score for each bin of the group of bins.
58. The computer-implemented method of clause 57,
wherein the insertion propensity score for a bin of the group of bins relates to a proportion of different insertions in the genomes of the population that have insertion scores of the plurality of insertion scores that are associated with the bin,
wherein the deletion propensity score for a bin of the group of bins relates to a proportion of different deletions in the genomes of the population that have deletion scores of the plurality of deletion scores that are associated with the bin, and
wherein the missense propensity score for a bin of the group of bins relates to a proportion of variants in the genomes of the population that have missense pathogenicity scores of the plurality of missense pathogenicity scores that are associated with the bin.
59. The computer-implemented method of clause 58, wherein the generating of the insertion curve comprises:
grouping the plurality of insertions into the group of bins; and
for each bin of the group of bins:
60. The computer-implemented method of clause 58, wherein the generating of the deletion curve comprises:
grouping the plurality of deletions into the group of bins; and
for each bin of the group of bins:
61. The computer-implemented method of clause 58, wherein the generating of the missense curve comprises:
grouping the plurality of variants into the group of bins; and
for each bin of the group of bins:
62. The computer-implemented method of clause 58,
wherein the insertion propensity score for a bin of the group of bins represents a probability that one of the plurality of insertions associated with the bin occurs in the genomes of the population given a set of observed insertions,
wherein the deletion propensity score for the bin represents a probability that one of the plurality of deletions associated with the bin occurs in the genomes of the population given a set of observed deletions, and
wherein the missense propensity score for the bin represents a probability that one of the plurality of variants associated with the bin occurs in the genomes of the population given a set of observed variants.
63. The computer-implemented method of clause 62, wherein the insertion propensity score, the deletion propensity score, and the missense propensity score reduce selection bias by equating groups based on covariates, and wherein the covariates are the set of observed insertions, the set of observed deletions, and the set of observed variants, respectively.
64. The computer-implemented method of clause 58,
wherein the insertion curve is generated when the first plurality of data points is plotted on a two-dimensional graph with one axis for propensity scores and the other axis for the group of bins,
wherein the deletion curve is generated when the second plurality of data points is plotted on the two-dimensional graph, and
wherein the missense curve is generated when the third plurality of data points is plotted on the two-dimensional graph.
65. The computer-implemented method of clause 64,
wherein the two-dimensional graph comprises a set of ordered pairs (x, y),
wherein f(x)=y,
wherein x is the group of bins, and
wherein y is the propensity scores.
66. The computer-implemented method of clause 64, comprising:
determining selection pattern differences between the insertion curve and the missense curve;
determining one or more second scaling functions to reduce the selection pattern differences between the insertion curve and the missense curve; and
enhancing/calibrating/recalibrating/updating/optimizing/modifying the plurality of insertion scores according to the one or more second scaling functions to change the output of the ANN.
67. The computer-implemented method of clause 66, comprising:
determining selection pattern differences between the deletion curve and the missense curve;
determining one or more third scaling functions to reduce the selection pattern differences between the deletion curve and the missense curve; and
enhancing/calibrating/recalibrating/updating/optimizing/modifying the plurality of deletion scores according to the one or more third scaling functions to change the output of the ANN.
68. The computer-implemented method of clause 67,
wherein the one or more second scaling functions and the one or more third scaling functions are part of the one or more scaling functions, and
wherein the one or more scaling functions comprise functions to scale the proportions of different insertions, different deletions, and different variants in the genomes of the population, respectively, since indels and single-nucleotide variants have different mutability.
69. The computer-implemented method of clause 68, wherein the one or more scaling functions obtain scaling factors from comparable variants under natural selection.
70. The computer-implemented method of clause 69,
wherein the enhancing/calibrating/recalibrating/updating/optimizing/modifying of the plurality of insertion scores comprises scaling the plurality of insertion propensity scores according to first scaling factors of the scaling factors that are associated with insertions in the genomes of the population,
wherein the enhancing/calibrating/recalibrating/updating/optimizing/modifying of the plurality of deletion scores comprises scaling the plurality of deletion propensity scores according to second scaling factors of the scaling factors that are associated with deletions in the genomes of the population, and
wherein the enhancing/calibrating/recalibrating/updating/optimizing/modifying of the plurality of missense pathogenicity scores comprises scaling the plurality of missense propensity scores according to third scaling factors of the scaling factors that are associated with variants in the genomes of the population.
71. The computer-implemented method of clause 68, wherein comparable variants of the variants are synonymous mutations for variants.
72. The computer-implemented method of clause 71, wherein the enhancing/calibrating/recalibrating/updating/optimizing/modifying of the plurality of missense pathogenicity scores comprises calibrating missense propensity scores based on the synonymous mutations for variants.
73. The computer-implemented method of clause 72, wherein comparable variants of the indels are indels in coding and noncoding regions of the genomes of the population.
74. The computer-implemented method of clause 73,
wherein the enhancing/calibrating/recalibrating/updating/optimizing/modifying of the plurality of insertion scores comprises calibrating insertion propensity scores based on an observed versus expected ratio based on insertions occurring in coding regions versus noncoding regions of the genomes of the population, respectively, and
wherein the enhancing/calibrating/recalibrating/updating/optimizing/modifying of the plurality of deletion scores comprises calibrating deletion propensity scores based on an observed versus expected ratio based on deletions occurring in coding regions versus noncoding regions of the genomes of the population, respectively.
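By way of non-limiting illustration of clause 74, the following Python sketch computes an observed versus expected ratio for insertions or deletions and applies it to a set of propensity scores. The function names, the per-site rate model used to derive the expected count, and the use of one minus the ratio as a multiplicative factor are assumptions made for this example only; the clauses do not specify them.

```python
def observed_expected_ratio(coding_count, coding_sites,
                            noncoding_count, noncoding_sites):
    """Observed coding indel count divided by the count expected if coding
    regions accumulated indels at the noncoding rate (assumed near neutral)."""
    if min(coding_sites, noncoding_sites, noncoding_count) <= 0:
        raise ValueError("site totals and noncoding count must be positive")
    # Expected coding count under the noncoding per-site rate (assumption).
    expected = noncoding_count * coding_sites / noncoding_sites
    return coding_count / expected


def scale_propensity_scores(scores, ratio):
    """Multiply each score by the depletion 1 - ratio, clamped to [0, 1].

    Hypothetical calibration rule: a ratio near 0 (strong depletion in
    coding regions) leaves scores nearly unchanged, while a ratio at or
    above 1 (no depletion) drives them to zero.
    """
    factor = max(0.0, min(1.0, 1.0 - ratio))
    return [score * factor for score in scores]


# Example: 20 coding insertions observed where 100 would be expected.
ratio = observed_expected_ratio(20, 1000, 100, 1000)
calibrated = scale_propensity_scores([0.5, 1.0], ratio)
```

The same two functions can be called once with insertion counts and once with deletion counts, so that insertion and deletion propensity scores are calibrated with separate ratios.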
75. The computer-implemented method of clause 52,
wherein the group of bins represents all scores,
wherein each bin of the group of bins represents a different range of scores in all the scores, and
wherein all the scores comprise the plurality of indel pathogenicity scores and the plurality of missense pathogenicity scores.
76. The computer-implemented method of clause 75,
wherein each bin of the group of bins is associated with a certain amount of the plurality of indels that have scores within a respective range of scores associated with the bin, and
wherein each bin of the group of bins is associated with a certain amount of the plurality of variants that have scores within a respective range of scores associated with the bin.
77. The computer-implemented method of clause 76, wherein the group of bins comprises a group of percentile bins.
78. The computer-implemented method of clause 77, wherein the group of percentile bins comprises one hundred bins, wherein a first bin of the one hundred bins represents scores that range from 0 to 0.01 and a one hundredth bin of the one hundred bins represents scores that range from 0.99 to 1, and wherein each bin between the first bin and the one hundredth bin represents a range of scores spanning one percentile.
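By way of non-limiting illustration of clauses 75 to 78, the following Python sketch assigns scores in the range 0 to 1 to one hundred bins of width 0.01 and counts how many scores fall in each bin. The names `bin_index` and `count_per_bin` are hypothetical, and the choice to place a score that falls exactly on an edge into the upper bin is an assumption; the clauses do not state how edge values are handled.

```python
from bisect import bisect_right
from collections import Counter

NUM_BINS = 100
# Interior bin edges 0.01, 0.02, ..., 0.99.
EDGES = [i / NUM_BINS for i in range(1, NUM_BINS)]


def bin_index(score):
    """Return the 0-based bin for a score in [0, 1].

    Bin 0 covers 0 to 0.01 and bin 99 covers 0.99 to 1; a score of
    exactly 1 falls in bin 99.
    """
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must lie in [0, 1]")
    return bisect_right(EDGES, score)


def count_per_bin(scores):
    """Return a list of NUM_BINS counts, one per bin."""
    counts = Counter(bin_index(s) for s in scores)
    return [counts.get(b, 0) for b in range(NUM_BINS)]


# Separate counts for indels and for missense variants over the same bins.
indel_counts = count_per_bin([0.005, 0.007, 0.995])
missense_counts = count_per_bin([0.29, 0.5, 1.0])
```

Searching a precomputed list of edges avoids the rounding errors that can arise when a bin is derived by multiplying a floating-point score by one hundred and truncating.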
79. The computer-implemented method of clause 51, wherein the ANN is configured to classify pathogenicity of variants.
80. The computer-implemented method of clause 79, wherein the ANN comprises a deep residual neural network for classifying pathogenicity of missense mutations.
81. The computer-implemented method of clause 80, wherein the ANN comprises a version of PrimateAI.
82. The computer-implemented method of clause 51, wherein the first genome database comprises a version of a Genome Aggregation Database (gnomAD).
83. The computer-implemented method of clause 82, wherein the second genome database comprises a version of the gnomAD.
84. The computer-implemented method of clause 52, wherein measuring the central tendency distribution of the indel pathogenicity scores in the bin comprises determining a mean of the indel pathogenicity scores.
85. The computer-implemented method of clause 52, wherein measuring the central tendency distribution of the missense pathogenicity scores in the bin comprises determining a mean of the missense pathogenicity scores.
86. The computer-implemented method of clause 52, wherein measuring the central tendencies of the indel pathogenicity scores and the missense pathogenicity scores in the bin comprises determining a mode or a median of the scores.
87. The computer-implemented method of clause 59, wherein measuring the central tendency distribution of the insertion scores comprises determining a mean of the insertion scores.
88. The computer-implemented method of clause 60, wherein measuring the central tendency distribution of the deletion scores comprises determining a mean of the deletion scores.
89. The computer-implemented method of clause 61, wherein measuring the central tendency distribution of the missense pathogenicity scores in the bin comprises determining a mean of the missense pathogenicity scores.
90. The computer-implemented method of clause 59, wherein measuring the central tendency distribution of the insertion scores comprises determining a mode or a median of the insertion scores.
91. The computer-implemented method of clause 60, wherein measuring the central tendency distribution of the deletion scores comprises determining a mode or a median of the deletion scores.
92. The computer-implemented method of clause 61, wherein measuring the central tendency distribution of the missense pathogenicity scores in the bin comprises determining a mode or a median of the missense pathogenicity scores.
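By way of non-limiting illustration of clauses 84 to 92, the following Python sketch measures a central tendency of the scores in a bin as a mean, a median, or a mode, using the Python standard library. The function names and the `measure` parameter are hypothetical, and the subtraction used to compare indel and missense scores within a bin is an assumed form of comparison rather than one recited in the clauses.

```python
import statistics

# Supported measures of central tendency.
_MEASURES = {
    "mean": statistics.mean,
    "median": statistics.median,
    "mode": statistics.mode,
}


def central_tendency(scores, measure="mean"):
    """Return the mean, median, or mode of a non-empty list of scores."""
    if not scores:
        raise ValueError("scores must be non-empty")
    try:
        fn = _MEASURES[measure]
    except KeyError:
        raise ValueError(f"unknown measure: {measure!r}") from None
    return fn(scores)


def central_tendency_gap(indel_scores, missense_scores, measure="mean"):
    """Difference between the indel and missense central tendencies in a
    bin (assumed comparison, for illustration only)."""
    return (central_tendency(indel_scores, measure)
            - central_tendency(missense_scores, measure))


# Example: median gap between indel and missense scores in one bin.
gap = central_tendency_gap([0.6, 0.8, 0.9], [0.1, 0.5, 0.9], "median")
```

The same `central_tendency` function can be applied separately to insertion scores, deletion scores, and missense pathogenicity scores.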
93. The computer-implemented method of clause 51, wherein the updating of the coefficients of the ANN according to the one or more scaling functions comprises enhancing/calibrating/recalibrating/updating/optimizing/modifying the plurality of indel pathogenicity scores according to the one or more scaling functions to provide a recalibrated accuracy of the indel pathogenicity score for each indel of the plurality of indels.
94. A computer-implemented method, comprising:
generating, by an artificial neural network (ANN), a plurality of missense pathogenicity scores for each variant of a plurality of variants;
generating, by the ANN, a plurality of indel pathogenicity scores for each indel of a plurality of indels;
applying the plurality of indel pathogenicity scores and the plurality of missense pathogenicity scores to one or more curve-forming functions;
further processing the plurality of indel pathogenicity scores and the plurality of missense pathogenicity scores using the one or more curve-forming functions to generate an indel curve and a missense curve;
determining selection pattern differences between the indel curve and the missense curve;
determining one or more scaling functions to reduce the selection pattern differences between the indel curve and the missense curve; and
updating coefficients of the ANN according to the one or more scaling functions, and
This application claims the benefit of, and priority to, U.S. Provisional Application No. 63/304,308, entitled “INDEL PATHOGENICITY DETERMINATION,” filed Jan. 28, 2022. The aforementioned application is hereby incorporated by reference in its entirety.
Number | Date | Country
---|---|---
63/304,308 | Jan 2022 | US