Embodiments of the subject matter disclosed herein generally relate to a system, method, and neural network for predicting a usage level of an alternative polyadenylation site, and more particularly, to simultaneously and quantitatively predicting the usage of all alternative polyadenylation sites for a given gene by using deep learning methodologies.
In eukaryotic cells' genes 100, as illustrated in
The process of polyadenylation begins as the transcription of a gene terminates. The 3′-most segment of the newly made pre-mRNA is first cleaved off by a set of proteins. These proteins then synthesize the poly(A) tail at the RNA's 3′ end. In some genes, these proteins add a poly(A) tail at one of several possible sites. Therefore, polyadenylation can produce more than one transcript from a single gene (alternative polyadenylation).
Often, one gene 100 could have multiple polyadenylation sites (PAS). The so-called alternative polyadenylation (APA) could generate from the same gene locus different transcript isoforms with different 3′-UTRs and sometimes even different protein coding sequences. The diverse 3′-UTRs generated by APA may contain different sets of cis-regulatory elements, thereby modulating the mRNA stability, translation, subcellular localization of mRNAs, or even the subcellular localization and function of the encoded proteins. It has been shown that dysregulation of APA could result in various human diseases. Thus, knowing the PAS and their usage probabilities helps in determining the likelihood of a potential disease.
APA is regulated by the interaction between cis-elements located in the vicinity of PAS and the associated trans-factors. The most well-known cis-element that defines a PAS is the hexamer AAUAAA (or one of its variants), located 15-30 nt upstream of the cleavage site, which is directly recognized by the cleavage and polyadenylation specificity factor (CPSF) components CPSF30 and WDR33. Other auxiliary cis-elements located upstream or downstream of the cleavage site include upstream UGUA motifs bound by the cleavage factor Im (CFIm) and downstream U-rich or GU-rich elements targeted by the cleavage stimulation factor (CstF). The usage of individual PAS for a multi-PAS gene depends on how efficiently each alternative PAS is recognized by these 3′ end processing machineries, which is further regulated by additional RNA binding proteins (RBPs) that could enhance or repress the usage of distinct PAS signals through binding in their proximity.
In addition, the usage of alternative PAS is mutually exclusive. In particular, once an upstream PAS is utilized, all the downstream ones would have no chance to be used no matter how strong their PAS signals are. Therefore, proximal PAS, which are transcribed first, have positional advantage over the distal competing PAS. Indeed, it has been observed that the terminal PAS more often contain the canonical AAUAAA hexamer, which is considered to have a higher affinity than the other variants, which possibly compensates for their positional disadvantage.
There has been a long-standing interest in predicting PAS based on genomic sequences using purely computational approaches. The so-called “PAS recognition problem” aims to discriminate between nucleotide sequences that contain a PAS and those that do not. A variety of hand-crafted features have been proposed and statistical learning algorithms, e.g., random forest (RF), support vector machines (SVM) and hidden Markov models (HMM), are then applied on these features to solve the binary classification problem [1-3]. Very recently, researchers started investigating the “PAS quantification problem”, which aims to predict a score that represents the strength of a PAS [4, 5]. However, the quantification problem is much more difficult than the recognition one.
Recent developments in deep learning have brought great improvements on many tasks, including bioinformatics tasks such as protein-DNA binding, RNA splicing pattern prediction, enzyme function prediction, Nanopore sequencing, and promoter prediction. Deep learning is favored due to its automatic feature extraction ability and good scalability with large amounts of data. As for polyadenylation prediction, deep learning models have been applied to the PAS recognition problem and have outperformed existing feature-based methods by a large margin [6].
Recently, deep learning models have also been applied on the PAS quantification problem, where Polyadenylation Code [4] was developed to predict the stronger one from a given pair of two competing PAS. Very recently, another model, DeepPASTA [5] has been proposed. DeepPASTA contains four different modules that deal with both the PAS recognition problem and the PAS quantification problem. Similar to the Polyadenylation Code, the DeepPASTA also casts the PAS quantification problem into a pairwise comparison task.
Thus, there is a need for a new system and method that are capable of quantitatively predicting the usage of all the competing PAS from a same gene simultaneously, regardless of the number of possible PAS.
According to an embodiment, there is a method for calculating usage of all alternative polyadenylation sites (PAS) in a genomic sequence. The method includes receiving plural genomic sub-sequences centered on corresponding PAS; processing each genomic sub-sequence of the plural genomic sequences, with a corresponding neural network of plural neural networks; supplying plural outputs of the plural neural networks to an interaction layer that includes plural forward Bidirectional Long Short Term Memory Network (Bi-LSTM) cells and plural backward Bi-LSTM cells, wherein each pair of a forward Bi-LSTM cell and a backward Bi-LSTM cell uniquely receives a corresponding output, of the plural outputs, from a corresponding neural network; and generating a scalar value for each PAS, based on an output from a corresponding pair of the forward Bi-LSTM cell and the backward Bi-LSTM cell.
According to another embodiment, there is a computing device for calculating usage of all alternative polyadenylation sites (PAS) in a genomic sequence. The computing device includes an interface configured to receive plural genomic sub-sequences centered on corresponding PAS; and a processor connected to the interface and configured to process each genomic sub-sequence of the plural genomic sequences, with a corresponding neural network of plural neural networks; supply plural outputs of the plural neural networks to an interaction layer that includes plural forward Bidirectional Long Short Term Memory Network (Bi-LSTM) cells and plural backward Bi-LSTM cells, wherein each pair of a forward Bi-LSTM cell and a backward Bi-LSTM cell uniquely receives a corresponding output of the plural outputs; and generate a scalar value for each PAS, based on an output from a corresponding pair of the forward Bi-LSTM cell and the backward Bi-LSTM cell.
According to still another embodiment, there is a neural network system for calculating usage of all alternative polyadenylation sites (PAS) in a genomic sequence. The system includes plural neural networks configured to receive plural genomic sub-sequences centered on corresponding PAS, wherein the plural neural networks are configured to process the genomic sub-sequences such that each neural network processes only a corresponding genomic sub-sequence of the genomic sequence; an interaction layer configured to receive plural outputs of the plural neural networks, wherein the interaction layer includes plural forward Bidirectional Long Short Term Memory Network (Bi-LSTM) cells and plural backward Bi-LSTM cells, and wherein each pair of a forward Bi-LSTM cell and a backward Bi-LSTM cell uniquely receives a corresponding output of the plural outputs; and an output layer configured to generate a scalar value for each PAS, based on an output from a corresponding pair of the forward Bi-LSTM cell and the backward Bi-LSTM cell.
For a more complete understanding of the present invention, reference is now made to the following descriptions taken in conjunction with the accompanying drawings, in which:
The following description of the embodiments refers to the accompanying drawings. The same reference numbers in different drawings identify the same or similar elements. The following detailed description does not limit the invention. Instead, the scope of the invention is defined by the appended claims.
Reference throughout the specification to “one embodiment” or “an embodiment” means that a particular feature, structure or characteristic described in connection with an embodiment is included in at least one embodiment of the subject matter disclosed. Thus, the appearance of the phrases “in one embodiment” or “in an embodiment” in various places throughout the specification is not necessarily referring to the same embodiment. Further, the particular features, structures or characteristics may be combined in any suitable manner in one or more embodiments.
According to an embodiment, a novel deep learning model, called herein DeeReCT-APA (Deep Regulatory Code and Tools for Alternative Polyadenylation), for the PAS quantification problem is introduced. The DeeReCT-APA model can simultaneously, quantitatively predict the usage of all the competing PAS from a same gene, regardless of the number of PAS. The model is trained and evaluated based on a dataset from a previous study, which consists of a genome-wide PAS measurement of two different mouse strains (C57BL/6J (BL) and SPRET/EiJ (SP)), and their F1 hybrid. After training the novel model on the dataset, the novel model is evaluated based on a number of criteria. The novel model demonstrates the necessity of simultaneously modeling the competition among multiple PAS. The novel model is found to predict the effect of genetic variations on APA patterns, visualize APA regulatory motifs and potentially facilitate the mechanistic understanding of APA regulation.
The novel DeeReCT-APA model (also called method herein) is based on a deep learning architecture 200 (or neural network system), as shown in
There are two types of base networks that can be used in the architecture 200, and these base networks include: (1) a hand-engineered feature extractor 310, as schematically illustrated in
The hand-engineered feature extractor 310 shown in
The CNN 320 in
The three different base network configurations are deep neural network architectures. In one embodiment, the base network in
The Feature-Net 310 only consists of multiple fully-connected layers 316 and takes as input multiple types of features extracted from the sub-sequences 204I of interest. The features, described in more detail in [4], may include polyadenylation signals, auxiliary upstream elements, core upstream elements, core downstream elements, auxiliary downstream elements, RNA-binding protein motifs, as well as 1-mer, 2-mer, 3-mer and 4-mer features, which are illustrated in
The output of the lower-level base network 210I is then passed to the upper-level interaction layers 220, which computationally model the process of choosing the competing PAS. The interaction layers 220 of the DeeReCT-APA neural network system 200 are based in this embodiment on the Long Short Term Memory Networks (LSTM) [7], which are designed to handle natural language processing and can naturally handle sentences with an arbitrary length, which makes them suitable for handling any number of alternative PAS from a same gene locus. The interaction layers then output the percentage values of all the competing PAS of the gene, as illustrated in
The utilization of alternative PAS is intrinsically competitive. On one hand, as a multi-PAS gene is transcribed, any one of its PAS along the already transcribed region is possible to be used. However, if one of the PAS has already been used, it will make other PAS impossible to be chosen. On the other hand, given that the same polyadenylation machinery is used by all the alternative PAS, such competition of resources also contributes to the competitiveness of this process.
Previous work [4, 5] in polyadenylation usage prediction did not take this important point into account. Both models introduced in [4, 5], Polyadenylation Code and DeepPASTA (tissue-specific relatively dominant poly(A) sites prediction model), can only handle two PAS at a time, ignoring the competition with the other PAS. To overcome this limitation of the traditional methods, the DeeReCT-APA neural network system considers all the competing PAS at the same time and simultaneously takes as input all the PAS in a gene, thus jointly predicting the usage levels of all of the PAS. This fundamental difference between the existing models and the DeeReCT-APA model makes this model advantageous.
To fulfil this simultaneous condition, the interaction layers 220 above the base networks 210I are configured to model the interaction between different PAS. In neural networks, a way to model interactions among inputs is to introduce a recurrent neural network (RNN) layer, which can capture the interdependencies among inputs corresponding to each time step. In this embodiment, the LSTM layer [7] was selected to be the foundation of the interaction layers 220. The LSTM layer is a type of RNN that has hidden memory cells which are able to remember a state for an arbitrary length of time steps. To fit into the PAS usage level prediction task, each time step of the LSTM layer corresponds to one PAS, at which the LSTM takes the extracted features of that PAS from the lower-level base network 210I. As there is both an influence from upstream PAS to downstream PAS and vice versa, in this embodiment, the DeeReCT-APA model uses a bidirectional LSTM (Bi-LSTM), in which one LSTM's time step goes from the upstream PAS to the downstream one and the other from the downstream to the upstream, as shown by the arrows 611 in
More specifically,
The outputs of the forward LSTM cell 612I and the backward LSTM cell 614I at the same PAS are then concatenated and sent as output 616I to the upper fully-connected layer 620. The fully-connected layer 620 transforms the LSTM output to a scalar value representing the log-probability 630I of that PAS to be used. After the log-probabilities of all competing PAS pass through a final SoftMax layer 640, they are transformed to properly normalized percentage scores, which sum up to one, representing the probability of each PAS of being chosen. Although the DeepPASTA method also contains a Bi-LSTM component, their Bi-LSTM layer is configured to process the sequence of one of the two competing PAS that are given as input. The time steps of the Bi-LSTM in the DeepPASTA method correspond to different positions in one particular sequence rather than to different PAS as in the case for the configuration of
In other words, the DeeReCT-APA model shown in
Two experiments were performed with the DeeReCT-APA model to evaluate the effect of the Bi-LSTM interaction layer 610. The first experiment removes the Bi-LSTM layer from the interaction layers 220 and only keeps the fully-connected layer 620 and the SoftMax layer 640. In this scenario, the modified network still simultaneously considers all PAS of a gene, but with a non-RNN interaction layer 220. The second experiment removes the interaction layer 220 altogether and uses a comparison-based training (like in Polyadenylation Code) to train the base networks 210I. As shown in the table of
A genome-wide PAS quantification dataset derived from fibroblast cells of C57BL/6J (BL) and SPRET/EiJ (SP) mouse, as well as their F1 hybrid is obtained from a previous study [8]. In the F1 cells, the two alleles have the same trans environment and the PAS usage difference between two alleles can only be due to the sequence variants between their genome sequences, making it a valuable system for APA cis-regulation study. Apart from APA, this kind of systems have also been used in the study of alternative splicing and translational regulation.
The detailed description of the sequencing protocol and data analysis procedure can be found in [8]. As a brief summary, the study uses fibroblast cell lines from BL, SP and their F1 hybrids. The total RNA extracted from fibroblast cells of BL and SP undergoes 3′-Region Extraction and Deep Sequencing (3′READS) to build a good PAS reference of the two strains. The 3′-mRNA sequencing is then performed in all three cell lines to quantify those PAS in the reference. In the F1 hybrid cell, reads are assigned to BL and SP alleles according to their strain specific SNPs. The PAS usage values are then computed by counting the sequencing reads assigned to each PAS. The sequence centered on each PAS cleavage site (448 nt in total) is extracted and undergoes feature extraction or one-hot encoding before training the model. The extracted features are then input to the Feature-Net 310, while the one-hot encoded sequences are input to the Single-Conv-Net 320 and the Multi-Conv-Net 330.
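The one-hot encoding step mentioned above can be sketched as follows. This is a minimal illustration, not the encoding used in the described embodiment: the channel order `ACGT`, the all-zero treatment of ambiguous bases, the function name `one_hot_encode`, and the toy 448-nt window are all assumptions introduced here.

```python
import numpy as np

NUCLEOTIDES = "ACGT"  # assumed channel order; the source does not specify one


def one_hot_encode(seq: str) -> np.ndarray:
    """Return a (len(seq), 4) one-hot matrix.

    Bases outside ACGT (e.g. N) are encoded as all-zero rows (an assumption).
    """
    index = {base: i for i, base in enumerate(NUCLEOTIDES)}
    encoded = np.zeros((len(seq), len(NUCLEOTIDES)), dtype=np.float32)
    for pos, base in enumerate(seq.upper()):
        if base in index:
            encoded[pos, index[base]] = 1.0
    return encoded


# Hypothetical 448-nt window centered on a cleavage site (made-up sequence).
window = "AATAAA" + "ACGT" * 110 + "NN"
x = one_hot_encode(window)  # shape (448, 4)
```

The resulting matrix has one row per nucleotide and one column per base, which is the usual input layout for a one-dimensional convolution over sequence positions.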
The DeeReCT-APA model is trained based on the parental BL/SP PAS usage level dataset. For the F1 hybrid data, however, it was chosen to start from the pre-trained parental model (for which either the BL parental model or the SP parental model was used and the results are shown separately) and fine-tune the model on the F1 dataset. This is because, due to the read assignment problem, the usage of many PAS in F1 cannot be unambiguously characterized by 3′-mRNA sequencing [8]. As a result, the F1 dataset does not contain a sufficient number of PAS to train the DeeReCT-APA model from scratch. At the training stage, genes are randomly selected from the training set and the sequences of their PAS flanking regions are fed into the network. Each sequence of PAS in a gene passes through one Base-Net 210I. The parameters of the Base-Nets 210I that are responsible for each PAS are all shared. Each Base-Net 210I then outputs a vector representing the distilled features for each PAS, which is then sent to the interaction layers 220. The interaction layers 220 generate a percentage score of each PAS of this gene. Cross-entropy loss between the predicted usage and the actual usage is used as the training target. During back-propagation, the gradients are back-propagated through the path originating from each PAS. As the model parameters are shared between the base networks 210I, the gradients are then summed up to update the model parameters.
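The cross-entropy training target described above can be illustrated for a single gene. The sketch below assumes that the measured usage values of a gene sum to one and that the network emits one unnormalized score per PAS; the names `softmax` and `usage_cross_entropy` and the numbers are hypothetical.

```python
import numpy as np


def softmax(logits: np.ndarray) -> np.ndarray:
    """Numerically stable softmax over the PAS of one gene."""
    exp = np.exp(logits - np.max(logits))
    return exp / exp.sum()


def usage_cross_entropy(logits, true_usage, eps: float = 1e-12) -> float:
    """Cross-entropy between measured usage and predicted usage for one gene."""
    predicted = softmax(np.asarray(logits, dtype=float))
    true_usage = np.asarray(true_usage, dtype=float)
    return float(-np.sum(true_usage * np.log(predicted + eps)))


# Toy gene with three PAS (made-up scores and usage fractions).
logits = np.array([2.0, 0.5, -1.0])
true_usage = np.array([0.7, 0.2, 0.1])
loss = usage_cross_entropy(logits, true_usage)

# When true_usage sums to one, the gradient of the loss with respect to the
# per-PAS scores is simply (predicted - true).
grad = softmax(logits) - true_usage
```

Because every PAS of the gene contributes one term to this gradient and the base networks share parameters, the per-PAS contributions are summed when the shared weights are updated.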
Several techniques are used to reduce the overfitting: (1) Weight decay is applied on weight parameters of CNN and all fully-connected layers; (2) Dropout is applied on the Bi-LSTM layer; (3) The training is stopped as soon as the mean absolute error of the predicted usage value does not improve on the validation set; (4) While fine-tuning the model on the F1 dataset, a learning rate that is about 100 times smaller than the one used when training from scratch is used.
The neural network system 200 is trained with the adaptive moment estimation (Adam) optimizer. A detailed list of hyperparameters used for the training is shown in
To evaluate the performance of the DeeReCT-APA model, a 5-fold cross-validation is performed at the gene level using all the genes in the dataset for each strain. That is, if a gene is selected as a training (testing) sample, all of its PAS are in the training (test) set. In each round, four folds are used for training and the remaining one is used for testing. To make a fair comparison with the existing methods Polyadenylation Code and DeepPASTA previously discussed, the two models are also trained (fine-tuned) and their model parameters are optimized on the parental and F1 datasets introduced above.
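A gene-level split of this kind can be sketched as below. The data layout (a dictionary mapping a gene identifier to its list of PAS identifiers), the function name `gene_level_folds`, and the random seed are assumptions made for illustration only.

```python
import random


def gene_level_folds(gene_to_pas, n_folds=5, seed=0):
    """Yield (train_pas, test_pas) pairs so that all PAS of a gene share a fold."""
    genes = sorted(gene_to_pas)
    random.Random(seed).shuffle(genes)
    folds = [genes[i::n_folds] for i in range(n_folds)]
    for k in range(n_folds):
        test_genes = set(folds[k])
        train = [p for g in genes if g not in test_genes for p in gene_to_pas[g]]
        test = [p for g in folds[k] for p in gene_to_pas[g]]
        yield train, test


# Toy dataset: ten hypothetical genes with two to four PAS each.
gene_to_pas = {
    f"gene{g}": [f"gene{g}_pas{p}" for p in range(2 + g % 3)] for g in range(10)
}
splits = list(gene_level_folds(gene_to_pas))
```

Splitting by gene rather than by PAS keeps competing sites of one gene from leaking between the training and test sets.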
The following measures are used for evaluating the DeeReCT-APA model against its baseline and also against state-of-the-art models: mean absolute error (MAE), comparison accuracy, highest usage prediction accuracy, and average Spearman's correlation. The MAE metric is defined as the mean absolute error of the usage prediction of each PAS, which is given by:

$$\mathrm{MAE} = \frac{1}{M} \sum_{i=1}^{M} \left| p_i - t_i \right|,$$

where $p_i$ stands for the predicted usage, $t_i$ stands for the experimentally determined ground truth usage for PAS $i$, and $M$ is the total number of PAS across all genes in the test set. This is the most intuitive way of measuring the performance of the DeeReCT-APA model. However, this measure is not applicable to the Polyadenylation Code or DeepPASTA methods as they do not have quantitative outputs that can be interpreted as the PAS usage values. For the same reason, this measure is not applicable to the DeeReCT-APA model either, when its interaction layers are removed and the comparison-based training is used.
The Comparison Accuracy is based on the Pairwise Comparison Task, which is defined as listing all the pairs of PAS in a given gene and keeping only those pairs with a PAS usage level difference greater than 5%. The model is asked to predict which PAS in the pair is of the higher usage level. The accuracy is defined as

$$\text{Comparison Accuracy} = \frac{\text{number of retained pairs predicted correctly}}{\text{number of retained pairs}}.$$
The primary reason that this metric is used to compare the DeeReCT-APA model with the Polyadenylation Code and DeepPASTA models, is that these traditional models were designed for predicting which one is stronger between the two competing PAS.
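The pairwise comparison measure can be sketched as follows. The input layout (one `(scores, true_usage)` tuple per gene), the function name `comparison_accuracy`, and the toy values are assumptions; the scores may be predicted usage values or, for models without percentage outputs, any quantity that orders the PAS.

```python
from itertools import combinations


def comparison_accuracy(genes, min_diff=0.05):
    """Fraction of retained PAS pairs whose order is predicted correctly.

    A pair is retained only if its true usage difference exceeds min_diff.
    """
    correct = total = 0
    for scores, truth in genes:
        for a, b in combinations(range(len(truth)), 2):
            if abs(truth[a] - truth[b]) <= min_diff:
                continue  # pair too close to call, skipped
            total += 1
            if (scores[a] > scores[b]) == (truth[a] > truth[b]):
                correct += 1
    return correct / total if total else float("nan")


genes = [
    ([0.6, 0.3, 0.1], [0.5, 0.4, 0.1]),  # three retained pairs, all ordered correctly
    ([0.2, 0.8], [0.7, 0.3]),            # one retained pair, ordered incorrectly
    ([0.5, 0.5], [0.52, 0.48]),          # difference below 5%, pair skipped
]
accuracy = comparison_accuracy(genes)  # 3 correct out of 4 retained pairs
```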
The Highest Usage Prediction Accuracy measure aims to test the model's ability of predicting which PAS is of the highest usage level in a single gene. For this measure, all the genes are selected that have their highest PAS usage level greater than their second highest one by at least 15% in the test set for evaluation. For the DeeReCT-APA model, the predicted usage in percentage is used for ranking the PAS. For the Polyadenylation Code and DeepPASTA models, as they do not provide a predicted value in percentage, the logit value before the SoftMax layer is used. The logit values, though not in the scale of real usage percentage values, can at least give a ranking of different PAS sites. The highest usage prediction accuracy is the percentage of genes whose highest-usage PAS are correctly predicted.
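A minimal sketch of this measure is given below, assuming the same hypothetical per-gene `(scores, true_usage)` layout and treating the 15% margin as an absolute difference in usage fraction; the function name `highest_usage_accuracy` is introduced here for illustration.

```python
import numpy as np


def highest_usage_accuracy(genes, min_margin=0.15):
    """Fraction of evaluated genes whose top-usage PAS is predicted correctly.

    A gene is evaluated only if its highest true usage exceeds the second
    highest by at least min_margin.
    """
    hits = evaluated = 0
    for scores, truth in genes:
        truth = np.asarray(truth, dtype=float)
        if truth.size < 2:
            continue
        second, first = np.sort(truth)[-2:]
        if first - second < min_margin:
            continue
        evaluated += 1
        hits += int(np.argmax(scores) == np.argmax(truth))
    return hits / evaluated if evaluated else float("nan")


genes = [
    ([1.2, 0.1, -0.5], [0.6, 0.3, 0.1]),    # evaluated, top PAS predicted correctly
    ([0.3, 0.9, 0.1], [0.45, 0.40, 0.15]),  # margin below 15%, gene skipped
    ([0.9, 0.1], [0.2, 0.8]),               # evaluated, top PAS predicted incorrectly
]
accuracy = highest_usage_accuracy(genes)  # 1 hit out of 2 evaluated genes
```

Because only the position of the maximum matters, unnormalized scores such as logits can be passed in place of predicted percentages.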
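The Averaged Spearman's Correlation is defined as follows. The usage levels predicted by each model are converted into a ranking of the PAS in that gene. The Spearman's correlation is computed between the predicted ranking and the ground truth ranking. The correlation values for all genes are then averaged together to give an aggregated score. In other words,

$$\bar{\rho} = \frac{1}{N} \sum_{i=1}^{N} \frac{\sum_{p=1}^{P_i} \left( pr_{ip} - \overline{pr}_i \right) \left( gr_{ip} - \overline{gr}_i \right)}{\sqrt{\sum_{p=1}^{P_i} \left( pr_{ip} - \overline{pr}_i \right)^2} \, \sqrt{\sum_{p=1}^{P_i} \left( gr_{ip} - \overline{gr}_i \right)^2}},$$

where $N$ is the total number of genes, $P_i$ is the number of PAS in gene $i$, $pr_{ip}$ is the predicted rank of PAS $p$ in gene $i$, $gr_{ip}$ is the ground truth rank of PAS $p$ in gene $i$, and $\overline{pr}_i$ and $\overline{gr}_i$ are the averaged predicted and ground truth ranks in gene $i$, respectively.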
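The averaged rank correlation can be sketched as below. The per-gene `(predicted, true_usage)` layout and the helper names are assumptions, and the simple ranking used here does not average tied ranks, which a production implementation would likely need to handle.

```python
import numpy as np


def ranks(values):
    """Rank values from 1 (smallest) to n (largest); ties are broken arbitrarily."""
    order = np.argsort(values)
    result = np.empty(len(values), dtype=float)
    result[order] = np.arange(1, len(values) + 1)
    return result


def spearman(predicted, truth):
    """Pearson correlation between predicted ranks and ground truth ranks."""
    pr, gr = ranks(predicted), ranks(truth)
    pr_c, gr_c = pr - pr.mean(), gr - gr.mean()
    return float(np.sum(pr_c * gr_c) / np.sqrt(np.sum(pr_c**2) * np.sum(gr_c**2)))


def average_spearman(genes):
    """Mean of the per-gene Spearman correlations."""
    return float(np.mean([spearman(p, t) for p, t in genes]))


genes = [
    ([0.6, 0.3, 0.1], [0.5, 0.4, 0.1]),  # identical ranking, correlation +1
    ([0.2, 0.8], [0.7, 0.3]),            # reversed ranking, correlation -1
]
score = average_spearman(genes)  # the two genes cancel out
```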
Based on these measures, the performance of various Base-Net designs is first evaluated. The DeeReCT-APA model is evaluated first with the Feature-Net module 310, then with the Single-Conv-Net module 320, and finally with the Multi-Conv-Net 330. As shown in the table of
Then, the performance of the DeeReCT-APA model with the Multi-Conv-Net module was compared to the Polyadenylation Code and DeepPASTA. As shown in
The improvement made by the DeeReCT-APA model is statistically significant in terms of comparison accuracy, even though the performance improvement is not numerically substantial. To verify this, the experiment is repeated five times, with each repeat having the dataset randomly split in a different way, and the accuracy of the DeeReCT-APA model (using the Multi-Conv-Net), Polyadenylation Code, and DeepPASTA is reported after 5-fold cross validation. The performance of the three tools is then compared with the p-value computed by t-test. As shown in
To demonstrate that the results of the above comparison are independent of the datasets, the DeeReCT-APA model was trained and tested on another dataset used in [4]. Because this dataset consists of polyadenylation quantification data from multiple human tissues, the performance (comparison accuracy) of the DeeReCT-APA model is reported for each tissue separately, as illustrated in
The benefits of jointly modelling all the PAS, as implemented in the DeeReCT-APA model of
To test this explanation, an in-silico experiment was designed by constructing a hypothetical allele of gene Srr (hereafter referred to as “mixed allele”) that has the BL sequence of PAS 1, PAS 2 and PAS 3, and SP sequence of PAS 4. Then, the DeeReCT-APA model was asked to predict the usage level of each PAS in the “mixed allele,” where the usage differences between the BL allele and the “mixed allele” should then be purely due to the sequence variants in PAS 4 because the two alleles are exactly the same on the other PAS. As shown in
A goal of the DeeReCT-APA model is to determine the effect of sequence variants on the APA patterns. The F1 hybrid dataset chosen here is ideal to test how well such a goal is achieved, since in the F1 cells, the allelic difference in PAS usage can only be due to the sequence variants between their genome sequences. In this regard,
To check whether the novel DeeReCT-APA model could be used to identify the effects of these variants, a “mutation map” was plotted for the two genes. In brief, for each gene, given the sequence around the most distal PAS (suppose it is of length L), 3L “mutated sequences” were generated. Each one of the 3L sequences has exactly one nucleotide mutated from the original sequence. These 3L sequences are then fed into the model along with other PAS sequences from that gene and the model then predicts usage for all sites and for each of the 3L sequences, separately. The predicted usage values of the original sequence are then subtracted from each of the 3L predictions and plotted in a heatmap, the “mutation map.” The obtained heatmap entries that correspond to the sequence variants between BL and SP are consistent with experimental findings from [8]. In addition, the mutation maps can also show the predicted effect of sequence variants other than those between BL and SP, giving an overview of the effects from all potential mutations.
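The construction of the 3L mutated sequences and of the resulting map can be sketched as follows. The trained network is replaced here by `toy_predict_usage`, a hypothetical stand-in that merely counts AATAAA hexamers; the function names, the choice to record only the usage change of the mutated site, and the toy sequences are all assumptions.

```python
import numpy as np

BASES = "ACGT"


def single_nucleotide_mutants(seq):
    """Yield (position, new_base, mutated_sequence) for all 3L substitutions."""
    for pos, ref in enumerate(seq):
        for alt in BASES:
            if alt != ref:
                yield pos, alt, seq[:pos] + alt + seq[pos + 1:]


def mutation_map(pas_seqs, target, predict_usage):
    """Return a (4, L) map of predicted usage change of the target PAS."""
    seq = pas_seqs[target]
    baseline = predict_usage(pas_seqs)[target]
    heatmap = np.zeros((len(BASES), len(seq)))  # reference-base entries stay 0
    for pos, alt, mutant in single_nucleotide_mutants(seq):
        mutated_gene = list(pas_seqs)
        mutated_gene[target] = mutant  # the other PAS keep their sequences
        heatmap[BASES.index(alt), pos] = predict_usage(mutated_gene)[target] - baseline
    return heatmap


def toy_predict_usage(pas_seqs):
    """Hypothetical stand-in for the model: usage grows with the hexamer count."""
    scores = np.array([1.0 + s.count("AATAAA") for s in pas_seqs])
    return scores / scores.sum()


pas_seqs = ["CCGTCCGT", "GAATAAAC"]  # made-up sequences; the last PAS is mutated
heatmap = mutation_map(pas_seqs, target=1, predict_usage=toy_predict_usage)
```

In this toy setting every substitution inside the hexamer lowers the predicted usage of the mutated site, while substitutions in the flanking positions leave it unchanged.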
The two examples described above involve sequence variants disrupting PAS signals, which makes the prediction relatively simple. To check whether the DeeReCT-APA model could be used for the variants with more subtle effects, a third example, gene Alg10b, was selected for some tests. Previous experiments showed that the usage of the most distal PAS of its BL allele is higher than its SP allele, as illustrated in
To globally evaluate the performance of the DeeReCT-APA model on predicting the allelic difference in PAS usage, the predicted allelic differences were compared to the experimentally measured allelic differences in a genome-wide manner. As a baseline control, the same was performed for the prediction made by the Polyadenylation Code where logit values before the SoftMax layer were used as surrogates for the predicted allelic difference in PAS usage. Here, the F1 dataset fine-tuned from the BL parental model is used. It is worth noting that this is a very challenging task because the training data do not well represent the complete landscape of genetic mutations. That is, the BL dataset only contains invariant sequences from different PAS, and the F1 dataset contains a limited number of genetic variants.
The Pearson correlation between the experimentally measured allelic usage difference and the ones predicted by the two models was computed as illustrated in the table in
To illustrate the knowledge learned by the convolutional filters of the DeeReCT-APA model, the convolutional filters of the model are visualized. The aim of visualization is to reveal the important subsequences around polyadenylation sites that activate a specific convolutional filter. In contrast to existing methods, in which the researchers only used sequences in the test set for visualization, the inventors used all sequences in the train and test dataset of F1 for visualization due to the smaller size of the dataset. For visualization, neither the model parameters nor the hyperparameters are tuned on the test set. The learned filters in layer 1 were convolved with all the sequences in the above dataset, and for each sequence, its subsequence (having the same size as the filters) with the highest activation on that filter is extracted and accumulated in a position frequency matrix (PFM). The PFM is then ready for visualization as the knowledge learned by that specific filter. For layer 2 convolutional filters, as they do not convolve with raw sequences during training and testing, directly convolving them with the sequences in the dataset, as was done for layer 1, would be undesirable. Instead, the layer 2 activations are calculated by a partial forward pass in the network, and the subsequences of the input sequences in the receptive field of the maximally-activated neuron are extracted and accumulated in a PFM.
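The layer 1 procedure described above can be sketched for a single filter. The helper names, the `ACGT` channel order, and the hand-made filter and sequences are assumptions; a real filter would come from the trained network.

```python
import numpy as np

BASES = "ACGT"  # assumed channel order


def one_hot(seq):
    """Encode a DNA string as a (len(seq), 4) one-hot matrix."""
    return np.array([[1.0 if b == base else 0.0 for base in BASES] for b in seq])


def filter_pfm(filter_weights, encoded_seqs):
    """Accumulate, over all sequences, the window that maximally activates a filter.

    filter_weights has shape (k, 4); each encoded sequence has shape (L, 4).
    """
    k = filter_weights.shape[0]
    pfm = np.zeros((k, 4))
    for x in encoded_seqs:
        activations = [np.sum(x[i:i + k] * filter_weights) for i in range(len(x) - k + 1)]
        best = int(np.argmax(activations))
        pfm += x[best:best + k]  # add the one-hot counts of the best window
    return pfm


# Hypothetical filter that responds to "AAT", applied to three toy sequences.
toy_filter = one_hot("AAT")
sequences = [one_hot(s) for s in ("CCAATG", "AATCCC", "GGGAAT")]
pfm = filter_pfm(toy_filter, sequences)
```

Each row of the PFM counts how often each base occurs at that filter position among the maximally activating windows, so it can be rendered directly as a sequence logo.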
As shown in
The above embodiments disclose a novel way to simultaneously predict the usage of all competing PAS within a gene. The novel DeeReCT-APA method incorporates both sequence-specific information through automatic feature extraction by CNN and multiple PAS competition through interaction modeling by the RNN layers. The novel model was trained and evaluated on the genome-wide PAS usage measurement obtained from 3′-mRNA sequencing of fibroblast cells from two mouse strains as well as their F1 hybrid. The DeeReCT-APA model was shown to outperform the state-of-the-art PAS quantification methods on the tasks that they are trained for, including pairwise comparison, highest usage prediction, and ranking task. In addition, it was shown that simultaneously modeling all the PAS of a gene captures the mechanistic competition among the PAS and reveals the genetic variants with regulatory effects on PAS usage.
A method for using a Bi-LSTM layer to model competitive biological processes was proposed recently in [9]. The researchers in [9] used the Bi-LSTM layer to model the usage level of competitive alternative 5′/3′ splice sites. Although the DeeReCT-APA model provides the first-of-its-kind way to model all the PAS of a gene, it still can be improved as all the existing genome-wide PAS quantification datasets used as training data could only sample the limited number of naturally occurring sequence variants. Although in the experiments noted above the two parental strains from which the F1 hybrid mouse was derived are already the evolutionarily most distant ones among all the 17 mouse strains with complete genomic sequences, the number of genetic variants is still rather limited. Thus, it will be desirable to provide a complementary dataset, i.e., to establish a large-scale synthetic APA mini-gene reporter-based system which samples the regulatory effect of millions of random sequences.
A method for calculating usage of all alternative PAS in a genomic sequence is now discussed with regard to
The method may further include simultaneously considering the plural outputs 206I from the plural neural networks 210I to jointly calculate the plural outputs 206I, where the plural forward Bi-LSTM cells 612I are connected to each other in a given sequence and the plural backward Bi-LSTM cells 614I are connected to each other in a reverse sequence. In one application, each neural network 210I of the plural neural networks 210I includes a convolutional neural network. The convolutional neural network may include a convolution layer, a ReLU layer, and a max-pooling layer. In another application, each neural network 210I of the plural neural networks 210I further includes a fully connected layer. In one embodiment, the outputs 206I of the plural neural networks 210I include a sequence motif associated with a corresponding genomic sub-sequence 204I of the plural genomic sub-sequences 204I.
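A convolution layer followed by a ReLU layer and a max-pooling layer can be sketched in plain NumPy as below. This is an illustration of the layer types named above, not the configuration of the described base networks: the filter size, pooling width, bias, and function name `conv_relu_maxpool` are assumptions.

```python
import numpy as np

BASES = "ACGT"  # assumed channel order


def one_hot(seq):
    """Encode a DNA string as a (len(seq), 4) one-hot matrix."""
    return np.array([[1.0 if b == base else 0.0 for base in BASES] for b in seq])


def conv_relu_maxpool(x, filters, bias, pool=2):
    """Apply a 1-D convolution, ReLU, and non-overlapping max-pooling.

    x: (L, 4) one-hot sequence; filters: (F, k, 4); bias: (F,).
    Returns an array of shape ((L - k + 1) // pool, F).
    """
    n_filters, k, _ = filters.shape
    n_windows = x.shape[0] - k + 1
    conv = np.empty((n_windows, n_filters))
    for i in range(n_windows):
        # Dot product of every filter with the window starting at position i.
        conv[i] = np.tensordot(filters, x[i:i + k], axes=([1, 2], [0, 1])) + bias
    relu = np.maximum(conv, 0.0)
    usable = (n_windows // pool) * pool  # drop a trailing window that cannot be pooled
    return relu[:usable].reshape(-1, pool, n_filters).max(axis=1)


# One hypothetical filter matching "AAT", with a bias that suppresses partial matches.
filters = one_hot("AAT")[np.newaxis, :, :]
features = conv_relu_maxpool(one_hot("CCAATG"), filters, bias=np.array([-2.0]))
```

In this toy case only the window that matches the filter exactly survives the ReLU, and the pooled output keeps that activation in the second pooled position.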
The method may further include a step of generating within the interaction layer 220 a scalar value representing a log-probability 630I for each corresponding PAS, and/or applying a soft-max layer to the scalar values of the PAS to generate a usage percentage value of each PAS, where a sum of all the percentage values is 100%. In one application, the plural forward Bi-LSTM cells 612I and the plural backward Bi-LSTM cells 614I form a recurrent neural network layer, which is configured to capture interdependencies among inputs corresponding to each time step. The recurrent neural network layer has hidden memory cells configured to remember a state for an arbitrary length of time steps, and each time step corresponds to a single PAS.
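The interaction layer described above (forward and backward LSTM passes over the PAS of a gene, concatenation, a fully-connected layer producing one scalar per PAS, and a soft-max over those scalars) can be sketched in NumPy. The weights are random and untrained, and the feature dimension, hidden size, gate ordering, and all names are assumptions; a practical implementation would use a deep learning framework.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def lstm_pass(inputs, W, U, b):
    """Run one LSTM over (T, D) inputs; W: (4H, D), U: (4H, H), b: (4H,)."""
    hidden = U.shape[1]
    h, c = np.zeros(hidden), np.zeros(hidden)
    outputs = []
    for x in inputs:  # one time step per PAS
        i, f, o, g = np.split(W @ x + U @ h + b, 4)
        c = sigmoid(f) * c + sigmoid(i) * np.tanh(g)
        h = sigmoid(o) * np.tanh(c)
        outputs.append(h)
    return np.stack(outputs)


def predict_usage(features, params):
    """Map per-PAS feature vectors (T, D) to usage fractions that sum to one."""
    fwd = lstm_pass(features, *params["forward"])               # upstream to downstream
    bwd = lstm_pass(features[::-1], *params["backward"])[::-1]  # downstream to upstream
    joint = np.concatenate([fwd, bwd], axis=1)                  # (T, 2H)
    log_scores = joint @ params["fc_w"] + params["fc_b"]        # one scalar per PAS
    exp = np.exp(log_scores - log_scores.max())
    return exp / exp.sum()


def random_params(feature_dim, hidden, rng):
    """Create untrained toy weights for both directions and the output layer."""
    def lstm():
        return (rng.normal(0, 0.1, (4 * hidden, feature_dim)),
                rng.normal(0, 0.1, (4 * hidden, hidden)),
                np.zeros(4 * hidden))
    return {"forward": lstm(), "backward": lstm(),
            "fc_w": rng.normal(0, 0.1, 2 * hidden), "fc_b": 0.0}


rng = np.random.default_rng(0)
params = random_params(feature_dim=8, hidden=4, rng=rng)
usage_3 = predict_usage(rng.normal(size=(3, 8)), params)  # gene with three PAS
usage_5 = predict_usage(rng.normal(size=(5, 8)), params)  # gene with five PAS
```

The same parameters handle genes with three and with five PAS, which reflects why a recurrent layer suits an arbitrary number of competing sites.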
The above-discussed procedures and methods may be implemented in a computing device as illustrated in
Computing device 1900 suitable for performing the activities described in the embodiments may include a server 1901. Such a server 1901 may include a central processor (CPU) 1902 coupled to a random access memory (RAM) 1904 and to a read-only memory (ROM) 1906. ROM 1906 may also be other types of storage media to store programs, such as programmable ROM (PROM), erasable PROM (EPROM), etc. Processor 1902 may communicate with other internal and external components through input/output (I/O) circuitry 1908 and bussing 1910 to provide control signals and the like. Processor 1902 carries out a variety of functions as are known in the art, as dictated by software and/or firmware instructions.
Server 1901 may also include one or more data storage devices, including hard drives 1912, CD-ROM drives 1914 and other hardware capable of reading and/or storing information, such as DVD, etc. In one embodiment, software for carrying out the above-discussed steps may be stored and distributed on a CD-ROM or DVD 1916, a USB storage device 1918 or other form of media capable of portably storing information. These storage media may be inserted into, and read by, devices such as CD-ROM drive 1914, disk drive 1912, etc. Server 1901 may be coupled to a display 1920, which may be any type of known display or presentation screen, such as LCD, plasma display, cathode ray tube (CRT), etc. A user input interface 1922 is provided, including one or more user interface mechanisms such as a mouse, keyboard, microphone, touchpad, touch screen, voice-recognition system, etc.
Server 1901 may be coupled to other devices, such as sources, sensors, microscopes, etc. The server may be part of a larger network configuration as in a global area network (GAN) such as the Internet 1928, which allows ultimate connection to various landline and/or mobile computing devices.
The disclosed embodiments provide a system, neural network system, and method for deep-learning-based PAS usage prediction in a gene. It should be understood that this description is not intended to limit the invention. On the contrary, the embodiments are intended to cover alternatives, modifications and equivalents, which are included in the spirit and scope of the invention as defined by the appended claims. Further, in the detailed description of the embodiments, numerous specific details are set forth in order to provide a comprehensive understanding of the claimed invention. However, one skilled in the art would understand that various embodiments may be practiced without such specific details.
Although the features and elements of the present embodiments are described in the embodiments in particular combinations, each feature or element can be used alone without the other features and elements of the embodiments or in various combinations with or without other features and elements disclosed herein.
This written description uses examples of the subject matter disclosed to enable any person skilled in the art to practice the same, including making and using any devices or systems and performing any incorporated methods. The patentable scope of the subject matter is defined by the claims, and may include other examples that occur to those skilled in the art. Such other examples are intended to be within the scope of the claims.
This application claims priority to U.S. Provisional Patent Application No. 62/851,898, filed on May 23, 2019, entitled “DEERECT-APA: ALTERNATIVE POLYADENYLATION SITE USAGE PREDICTION THROUGH DEEP LEARNING,” the disclosure of which is incorporated herein by reference in its entirety.
Filing Document | Filing Date | Country | Kind |
---|---|---|---|
PCT/IB2020/053867 | 4/23/2020 | WO | |

Number | Date | Country |
---|---|---|
62851898 | May 2019 | US |