Embodiments of the subject matter disclosed herein generally relate to a system and method for prioritizing candidate genes in the genome-based diagnostics of a range of genetic diseases and, more particularly, to a novel graph convolutional network-based disease-gene prioritization method, PGCN, which systematically embeds a heterogeneous network made of genes and diseases, as well as their individual features.
The last decade has seen a rapid increase in the adoption of whole-exome sequencing in the clinical diagnosis of genetic diseases. However, the success rate of such genome-based diagnostics still remains far from perfect, with reported yields for a range of Mendelian diseases ranging from ~20% to ~50%. This relatively low yield is largely attributed to the considerable difficulty of differentiating disease-causing variants from a large pool of rare genetic variants that are not pathogenic and do not play roles in the expression of the disease phenotype.
To efficiently detect pathogenic variants and to improve the diagnostic rate of the genome-based approach, it is necessary to have disease-gene prioritization that substantially reduces the number of candidate causal variants and ranks them for further interrogations based on the association of the corresponding genes with the disease phenotype. In other words, the disease-gene prioritization is the process of assigning a likelihood of gene involvement in generating a disease phenotype.
A number of computational methods have been developed to tackle the disease-gene prioritization problem and have been shown to be useful. For example, Endeavour was able to associate GATA4 with congenital diaphragmatic hernia; GeneDistiller discovered the role of MED17 mutations in infantile cerebral and cerebellar atrophy. Based on the underlying computational techniques, existing disease-gene prioritization methods can be categorized into five types.
The first type is the filter methods, which sift the candidate list of genes into a smaller one according to the properties that associated genes should have. The second type of methods is based on text mining. Such methods score the candidate genes using the co-occurrence evidence with a certain disease from the literature. Thus, these methods can only detect associations that are already known. The third type is similarity profiling and data fusion methods. This is the dominant type in the disease gene prioritization community and includes the famous Endeavour method. These methods are based on the idea that similar genes should be associated with similar sets of diseases and vice versa. The similarity measurement can be defined using different data sources, such as Gene Ontology (GO) or the BLAST score. After obtaining the similarity scores from each data source, such methods apply data fusion to aggregate these scores into a global ranking. The fourth type is network-based methods, which are discussed in [1] to [8]. Such methods represent diseases and genes as nodes in a heterogeneous network, in which the edge weight represents their similarities. The last type is based on matrix completion techniques in recommender systems. These methods represent the disease-gene association as an incomplete matrix and solve the disease-gene prioritization problem by filling the missing values of the matrix. This category of methods has been shown to be the state-of-the-art at present.
Despite the advances of the existing methods, they have the following problems. Firstly, the similarity-based methods, which are rooted in the “guilt-by-association” principle, often fail to handle new diseases whose associated genes are completely unknown. Secondly, although the performance of the network-based methods is reasonable, they are biased by the network topology and cannot easily integrate multiple sources of information about genes and diseases. Thirdly, the matrix completion methods assume and look for a weighted linear relationship between genes and diseases, which, in reality, is most likely to be highly nonlinear. In addition, most of the existing methods rely heavily on manually-crafted features or pre-defined rules of data fusion.
Therefore, the disease-gene prioritization problem remains open. On the other hand, the recent success of graphical models and deep learning in bioinformatics [10] to [14] suggests the possibility of systematically incorporating multiple sources of information in the heterogeneous network and learning the highly nonlinear relationship between diseases and genes.
Thus, there is a need for a new method and system that prioritizes the disease-gene link and avoids the problems mentioned above.
According to an embodiment, there is a method for disease-gene prioritization, and the method includes building a heterogenous network to include gene nodes gj and disease nodes di; supplying additional information (xdi, xgj) related to the gene nodes gj and the disease nodes di to generate embeddings zk associated with the gene nodes gj and the disease nodes di; applying a graph convolutional neural network model G to the heterogenous network and to the embeddings zk to calculate aggregated embeddings zk+1; and estimating, with an edge decoder model ED, a probability P of an edge (di, gj), between a selected gene node gj and a selected disease node di. The edge (di, gj) between the selected gene node gj and the selected disease node di is the disease-gene prioritization.
According to another embodiment, there is a computing device for producing a disease-gene prioritization, and the device includes an input/output interface for receiving additional information (xdi, xgj) related to gene nodes gj and disease nodes di to generate embeddings zk associated with the gene nodes gj and the disease nodes di; and a processor connected to the input/output interface and configured to, build a heterogenous network made by the gene nodes gj and the disease nodes di; apply a graph convolutional neural network model G to the heterogenous network and the embeddings zk to calculate aggregated embeddings zk+1; and estimate, with an edge decoder model ED, a probability P of an edge (di, gj), between a selected gene node gj and a selected disease node di. The edge (di, gj) between the selected gene node gj and the selected disease node di is the disease-gene prioritization.
According to still another embodiment, there is a method for training a graph convolutional neural network model G for disease-gene prioritization. The method includes building a heterogenous network from gene nodes gj and disease nodes di; supplying additional information (xdi, xgj) related to the gene nodes gj and the disease nodes di to generate embeddings zk associated with the gene nodes gj and the disease nodes di; applying the graph convolutional neural network model G to the heterogenous network and the embeddings zk to calculate aggregated embeddings zk+1; estimating, with an edge decoder model ED, a probability P of an edge (di, gj), between a selected gene node gj and a selected disease node di; and repeating the above steps until the probability P is one for a known connection between the selected gene node gj and the selected disease node di. The edge (di, gj) between the selected gene node gj and the selected disease node di is the disease-gene prioritization.
For a more complete understanding of the present invention, reference is now made to the following descriptions taken in conjunction with the accompanying drawings, in which:
The following description of the embodiments refers to the accompanying drawings. The same reference numbers in different drawings identify the same or similar elements. The following detailed description does not limit the invention. Instead, the scope of the invention is defined by the appended claims. The following embodiments are discussed, for simplicity, with regard to a system and method that casts the disease-gene prioritization problem as a link prediction problem.
Reference throughout the specification to “one embodiment” or “an embodiment” means that a particular feature, structure or characteristic described in connection with an embodiment is included in at least one embodiment of the subject matter disclosed. Thus, the appearance of the phrases “in one embodiment” or “in an embodiment” in various places throughout the specification is not necessarily referring to the same embodiment. Further, the particular features, structures or characteristics may be combined in any suitable manner in one or more embodiments.
According to an embodiment, a novel disease-gene prioritization method, called herein “PGCN,” is developed based on graph convolutional neural networks (GCN) introduced by [10] and [15]-[17]. Starting from a heterogeneous network, which is composed of a genetic interaction network, a human disease similarity network, and a known disease-gene association network, to which additional information about genes and diseases from multiple sources is added, the novel method first learns embeddings for genes and diseases through graph convolutional neural networks, by considering both the network topology and the additional information of diseases and genes. Such embeddings are fed into an edge decoding (edge prediction) model to make predictions for disease-gene associations. Although this method is described in two steps, the model used by the method is trained in an end-to-end manner so that the model can jointly learn the embedding and the decoding.
In one embodiment, the disease-gene prioritization problem is treated as a link prediction problem. Unlike previous studies which solve the problem with matrix factorization, the novel method uses graph convolutional neural networks. The method compiles the disease similarities, genetic interactions, and disease-gene associations into a multi-nodal heterogeneous network 100, as shown in
In this heterogeneous network 100, the potential disease-gene associations or links 140 can be considered as missing links and the goal of this method is to predict (i.e., calculate a probability for) these links. Thus, according to one embodiment, the method to be discussed next learns the nodes' latent representations (embeddings 118 and 128) from their initial raw representations (information encoded from different sources), considering the graph's topological structure and the nodes' neighborhood, after which the method makes predictions from the learned embeddings using the edge decoding model. Both the embedding model and the decoding model (which are discussed later) are trained in an end-to-end manner so that each model is optimized while being regularized by the other one. The components of the proposed method are discussed now in more detail.
Recent studies have formulated the disease-gene prioritization problem as a matrix completion problem and applied the recently developed methods in recommender systems, resulting in better performance than the previous state-of-the-art methods. Although the method proposed herein also considers the problem as a recommender system problem, the novel method treats the entire data structure as a heterogeneous network 100 as shown in
One component of the novel method is the graph convolutional encoder, which can learn the embeddings 118 and 128 from the nodes' neighborhood, node-specific information, and the topology of the heterogeneous network 100. A problem in learning the embeddings 118 and 128 from the graph data is how to propagate and transform the associated information along the network 100. As shown in
Because the method instantiates the graph convolution operation using a fully-connected neural network, the model G can seamlessly integrate information from different sources. The embeddings are fed into the link decoding model as discussed later. Thus, the proposed method can achieve problem-specific data integration systematically, whose parameters are learned from the data in an end-to-end manner.
As previously discussed, the network 100 in the model of
For the gene network 110, the method used the HumanNet database. This large-scale functional gene network was constructed by considering multiple sources of information, including human mRNA co-expression, protein-protein interactions, protein complex, and comparative genomics information. In total, it incorporated 21 genomics and proteomics datasets from four species. Compared to a network built from a single dataset, such as a protein-protein interaction network, it has higher accuracy and genome coverage. The usefulness of HumanNet in disease-gene prioritization has been demonstrated by previous studies. In summary, the gene network 110 is composed of 12,331 genes and 733,836 edges with positive weights. Those skilled in the art will understand that more or less information can be used for any of the three networks 110, 120, and 130.
The disease similarity network 120 used the MimMiner network. This network was built by using text mining analysis on the OMIM database 210. For each disease, the anatomy and disease sections of the medical subject headings were used to extract terms from the OMIM database 210, whose frequencies were used as the feature vectors of the disease. After further refinement, the feature vectors were used to compute the pairwise similarities between the diseases, which resulted in the MimMiner network. Although the construction process did not involve gene information, the similarities were shown to be positively correlated with a number of measures of gene function. This network has also been used as a feature input in the previous disease-gene prioritization methods [8]. After setting the similarity threshold to 0.2, a disease similarity network with 3,215 diseases and 645,945 edges was obtained.
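The thresholding step described above can be illustrated with a minimal sketch. The function name `threshold_similarity` and the toy similarity values are assumptions made for illustration only; just the 0.2 threshold is taken from the description. The description does not state whether a similarity of exactly 0.2 is kept, so the sketch assumes an inclusive comparison.

```python
def threshold_similarity(sim, threshold=0.2):
    """Return undirected weighted edges (i, j, w) with i < j and w >= threshold.

    `sim` is a symmetric matrix (list of lists) of pairwise disease similarities.
    """
    edges = []
    n = len(sim)
    for i in range(n):
        for j in range(i + 1, n):  # upper triangle only, so each pair is seen once
            if sim[i][j] >= threshold:  # inclusive comparison is an assumption
                edges.append((i, j, sim[i][j]))
    return edges


# Toy similarities for three hypothetical diseases (not real MimMiner values).
toy_sim = [
    [1.00, 0.35, 0.05],
    [0.35, 1.00, 0.20],
    [0.05, 0.20, 1.00],
]
toy_edges = threshold_similarity(toy_sim)
```

In this toy case the pair (0, 2) falls below the threshold and is dropped, which is how a dense similarity matrix becomes a sparser network.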
In contrast to the existing network-based methods, the model 100 can naturally incorporate additional information about the nodes from different sources, i.e., the novel method is generic and can take any source of information for diseases and genes. In one implementation, the model 100 incorporated, as illustrated in
The second data source is the clinical text from the OMIM webpages. The Clinical Feature and Clinical Management sections were collected from the OMIM webpages for each disease, and the most frequent and most rare words were removed. Then, the frequency of each unique word in the corpus related to each disease was counted. To remove the bias of the relatively frequent words, the method applied the TF-IDF scheme 212 to the term frequency matrix and obtained the corresponding row as the feature vector xdi for a disease. Finally, the two vectors were concatenated as the additional information for the disease.
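As an illustration of the TF-IDF weighting mentioned above, the following sketch computes one common variant, term count multiplied by log(N/df). The exact TF-IDF variant, the tokenization, and the word-filtering rules used in the described embodiment are not specified, so the function `tfidf_matrix` and the toy documents are assumptions.

```python
import math
from collections import Counter


def tfidf_matrix(docs):
    """Return (vocab, matrix) with matrix[d][t] = count(t in d) * log(N / df(t)).

    This is one common TF-IDF variant, chosen here only for illustration.
    """
    counts = [Counter(doc.split()) for doc in docs]
    vocab = sorted({word for c in counts for word in c})
    n_docs = len(docs)
    # Document frequency: number of documents that contain each word.
    df = {w: sum(1 for c in counts if w in c) for w in vocab}
    matrix = [[c[w] * math.log(n_docs / df[w]) for w in vocab] for c in counts]
    return vocab, matrix


# Two hypothetical clinical-text snippets, one per disease.
toy_docs = ["seizure ataxia seizure", "ataxia hypotonia"]
vocab, features = tfidf_matrix(toy_docs)
```

Each row of `features` plays the role of a disease feature vector. A word that appears in every document (here "ataxia") receives zero weight, which is the bias-removal effect the paragraph refers to.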
The method also used two kinds of features as the additional information for the gene nodes of the gene network 110. The method collected the microarray measurements of the gene expression level in different tissue samples from BioGPS and Connectivity Map. Since some genes are missing in the probes, the method obtained 4,536 features for 8,755 genes. It is well known that samples from the same cell type of different individuals tend to have a similar expression pattern, which results in redundant information in the obtained feature matrix. To eliminate the redundancy and reduce the dimensionality, the method applied principal component analysis (PCA) to the features and used the first 100 eigenvectors as the feature representations from the gene expression microarray.
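The dimensionality reduction can be sketched with a small pure-Python PCA that recovers only the leading component by power iteration. This is a simplification: the embodiment uses the first 100 components, and a numerical library would normally be used for that. The function name, the iteration count, and the toy matrix are assumptions.

```python
import math


def first_principal_component(rows, iterations=100):
    """Return (unit_vector, scores) for the leading principal component.

    Uses power iteration on the (unnormalized) covariance of the centered data.
    Assumes the data are not constant, otherwise the norm below would be zero.
    """
    n, m = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(m)]
    centered = [[r[j] - means[j] for j in range(m)] for r in rows]
    v = [1.0] * m  # arbitrary start; assumed not orthogonal to the answer
    for _ in range(iterations):
        proj = [sum(a * b for a, b in zip(row, v)) for row in centered]
        w = [sum(centered[i][j] * proj[i] for i in range(n)) for j in range(m)]
        norm = math.sqrt(sum(a * a for a in w))
        v = [a / norm for a in w]
    # Scores are the coordinates of each row along the component.
    scores = [sum(a * b for a, b in zip(row, v)) for row in centered]
    return v, scores


# Toy matrix: three hypothetical genes, two perfectly correlated features.
toy_expression = [[0.0, 0.0], [1.0, 2.0], [2.0, 4.0]]
component, scores = first_principal_component(toy_expression)
```

Because the two toy features are redundant, a single component captures all of the variance, which mirrors the motivation for applying PCA to the expression features.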
The second type of additional information for genes is derived from the gene-phenotype associations 230 of other species. Following the previous studies [8], the method used the phenotypes from eight species. As a result, the method obtained eight matrices, whose rows represent different genes and the columns represent the phenotypes of different species. The method concatenated those gene-phenotype matrices together with the microarray matrix 232 along the gene dimension, resulting in the additional information xgi of the genes. The additional information xdi and xgi was added to each corresponding node in the disease network and the gene network, respectively, as schematically illustrated in
Based on this additional information xdi and xgi, the embeddings 118 and 128 are now constructed using graph convolutional neural networks, by taking into account the network topology, the nodes' neighborhood, and the additional information associated with each node. Formally, the embeddings are constructed by considering a graph 𝒢=(V, ε), where V represents the set of nodes and ε represents the set of edges, with the adjacency matrix being A. The additional information of a node i ∈ V is denoted as xi ∈ ℝ^m.
A problem of learning the embeddings (or embedding vector z) with the graph convolutional neural network is to figure out how to transform and propagate information (the additional information and intermediate embeddings of each node) across the entire network. In this embodiment, the GCN module defines the information propagation architecture (the local computational graph) for each node using the node's neighborhood in the graph corresponding to the network 100. Note that
In this regard,
If there is only one layer of the graph convolution model G, as illustrated in
Formally, in each layer k of the model G, for each node i, the information aggregation and transformation model hi,k illustrated in
where zi,k ϵc
In this method, summation is used as the information aggregation method in the GCN model. Different information aggregation methods result in different GCN variants. However, no matter which method is chosen, the aggregation and transformation layer converts the hidden representation of node i in layer k, zi,k, into the hidden representation in the next layer, zi,k+1. The output of the last graph convolutional layer, zi,N, is used as the final embedding 118 or 128 for that node, zi. With these selections, the input of the first convolutional layer is the original feature vector of each node, i.e., zi,0=xi.
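The layer-by-layer propagation described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the disclosed implementation: a single weight matrix `W` per layer, a ReLU activation, and the helper names `gcn_layer` and `encode` are all assumed here. Only the summation aggregation, the initialization zi,0=xi, and the use of the last layer's output as the final embedding come from the description.

```python
def relu(v):
    return [max(0.0, x) for x in v]


def matvec(W, v):
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def gcn_layer(z, neighbors, W):
    """One layer: z_{i,k+1} = relu(W (z_{i,k} + sum of neighbor z_{j,k}))."""
    out = []
    for i, z_i in enumerate(z):
        h = list(z_i)  # start from the node's own hidden representation
        for j in neighbors[i]:
            h = [a + b for a, b in zip(h, z[j])]  # summation aggregation
        out.append(relu(matvec(W, h)))
    return out


def encode(x, neighbors, weights):
    """Stack layers: z_{i,0} = x_i; the output of the last layer is the embedding."""
    z = x
    for W in weights:
        z = gcn_layer(z, neighbors, W)
    return z


# Toy graph with three nodes on a path 0 - 1 - 2 and 2-dimensional features.
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
neighbors = {0: [1], 1: [0, 2], 2: [1]}
W1 = [[1.0, -1.0], [0.5, 0.5]]
embeddings = encode(x, neighbors, [W1])
```

With N weight matrices in `weights`, each node's embedding depends on nodes up to N hops away, which is why the number of layers controls how far information propagates.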
Having described how to construct the embedding 118 or 128 of each node in
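$$P(d_i, g_j) = \sigma\left(z_{d_i}^{\top} W z_{g_j}\right),$$
where $z_{d_i}$ and $z_{g_j}$ are the aggregated embeddings of the disease node di and of the gene node gj, respectively, $W$ is a weight of the edge decoder model ED, and $\sigma$ is the sigmoid function.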
The edge decoder ED is illustrated in
Taking together the GCN model G illustrated in
The hyper-parameters when building and training the model are now discussed. The cross-entropy loss L was used as the loss function to train the entire model G and ED, as schematically illustrated in
L(di, gj)=−log P (di, gj)−log(1−P(di, gn)), (4)
where (di, gj) defines an edge in the training data and the second term, −log(1−P(di, gn)), is an ensemble of losses related to a negative training set (that includes random linkages between two nodes). The second term is incorporated into equation (4) to force the model to recover the non-edges in the original graph. This means that the ground truth value Y(di, gj)=1 in
where εdg represents all the edges connecting the diseases and genes nodes shown in the network 100 in
In one embodiment, the above model has been implemented with two layers, with the dimension of the hidden representation being 64 and the final embedding dimension being 32. The model was trained using an Adam optimizer with a learning rate of 0.001. To reduce overfitting, this embodiment combined dropout on the hidden layer units, with a dropout rate of 0.1, and the weight decay method. The model's parameters were initialized using the Xavier initializer. During training, mini-batches of edges were fed to the model, with a batch size of 512. This can reduce the memory requirement and serve as an additional regularizer that further alleviates overfitting. In total, the model was trained for 300 epochs. With the help of a Titan Xp card, the training of the model was performed in 10 hours.
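The per-edge loss of equation (4), with one randomly sampled negative gene gn, can be sketched as follows. The helper names `sample_negative_gene` and `edge_loss`, the use of a single negative sample per positive edge, and the uniform sampling over non-linked genes are assumptions made for illustration; the description only states that the negative set includes random linkages.

```python
import math
import random


def sample_negative_gene(disease, n_genes, known_edges, rng):
    """Pick a gene index that is NOT linked to `disease` in the known edges."""
    candidates = [g for g in range(n_genes) if (disease, g) not in known_edges]
    return rng.choice(candidates)


def edge_loss(p_pos, p_neg):
    """Equation (4): -log P(d_i, g_j) - log(1 - P(d_i, g_n))."""
    return -math.log(p_pos) - math.log(1.0 - p_neg)


# Toy setting: disease 0 is linked to genes 0 and 1 out of three genes.
known_edges = {(0, 0), (0, 1)}
rng = random.Random(0)
negative_gene = sample_negative_gene(0, 3, known_edges, rng)

# Hypothetical decoder outputs for the positive and the negative pair.
confident_loss = edge_loss(p_pos=0.9, p_neg=0.1)
uninformed_loss = edge_loss(p_pos=0.5, p_neg=0.5)
```

The loss is small when the model assigns a high probability to the known edge and a low probability to the sampled non-edge, and it grows as either prediction degrades.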
A method for disease-gene prioritization is now discussed with regard to
In one application, the step of applying a graph convolutional neural network model G includes aggregating, for the selected gene node, (1) embeddings zgk of all gene nodes linked to the selected gene node, (2) an embedding zgk of the selected gene node, and (3) embeddings zdk of all disease nodes linked to the selected gene node to obtain a gene feature vector hgk; and activating the gene feature vector hgk with an activation function ϕ to obtain the aggregated embedding zg(k+1) for the selected gene node. The step of applying a graph convolutional neural network model G may further include aggregating, for the selected disease node, (1) embeddings zdk of all disease nodes linked to the selected disease node, (2) an embedding zdk of the selected disease node, and (3) embeddings zgk of all gene nodes linked to the selected disease node to obtain a disease feature vector hdk; and activating the disease feature vector hdk with an activation function ϕ to obtain the aggregated embedding zd(k+1) for the selected disease node.
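The three-part aggregation for a selected gene node can be sketched as follows. This is an illustrative sketch under assumptions: the weight names `W_gg`, `W_self` and `W_dg`, the use of ReLU as the activation function ϕ, and equal embedding dimensions for gene and disease nodes are not taken from the description.

```python
def matvec(W, v):
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def vec_sum(vectors, dim):
    """Element-wise sum of a list of vectors; returns zeros for an empty list."""
    total = [0.0] * dim
    for vec in vectors:
        total = [a + b for a, b in zip(total, vec)]
    return total


def update_gene(z_self, z_gene_nbrs, z_disease_nbrs, W_gg, W_self, W_dg):
    """Aggregate (1) linked genes, (2) the node itself, (3) linked diseases."""
    dim = len(z_self)
    parts = [
        matvec(W_gg, vec_sum(z_gene_nbrs, dim)),     # (1) gene neighbors
        matvec(W_self, z_self),                      # (2) the selected gene
        matvec(W_dg, vec_sum(z_disease_nbrs, dim)),  # (3) disease neighbors
    ]
    h = vec_sum(parts, len(parts[0]))  # gene feature vector
    return [max(0.0, x) for x in h]    # activation (ReLU assumed)


# Toy inputs: one gene with two gene neighbors and one disease neighbor.
W_gg = [[1.0, 0.0], [0.0, 1.0]]
W_self = [[2.0, 0.0], [0.0, 2.0]]
W_dg = [[0.0, 1.0], [-1.0, 0.0]]
z_next = update_gene(
    z_self=[1.0, 2.0],
    z_gene_nbrs=[[1.0, 0.0], [0.0, 1.0]],
    z_disease_nbrs=[[2.0, 0.0]],
    W_gg=W_gg, W_self=W_self, W_dg=W_dg,
)
```

The update for a disease node would be symmetric, with the roles of gene and disease neighbors exchanged.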
In another application, the step of aggregating, for a selected gene node or for a selected disease node, uses a different weight for each type of embedding. The method may also include training the graph convolutional neural network model G and the edge decoder model ED for each of the different weights. The step of estimating may include calculating the probability P as a sigmoid function applied to a product of (1) the aggregated embedding of the selected gene node, (2) a weight of the edge decoder model, and (3) the aggregated embedding of the selected disease node.
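The estimating step can be sketched as a bilinear form passed through a sigmoid. The function name `edge_probability` and the toy embeddings and weight matrix are assumptions for illustration; the structure (disease embedding, decoder weight, gene embedding, sigmoid) follows the description above.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def edge_probability(z_d, W, z_g):
    """P(d_i, g_j) = sigmoid(z_d^T W z_g) for one disease-gene pair."""
    Wz_g = [sum(w * g for w, g in zip(row, z_g)) for row in W]
    score = sum(d * v for d, v in zip(z_d, Wz_g))
    return sigmoid(score)


# Toy 2-dimensional embeddings and decoder weight (hypothetical values).
z_disease = [1.0, 0.0]
z_gene = [0.0, 1.0]
W_decoder = [[0.0, 2.0], [0.0, 0.0]]
p = edge_probability(z_disease, W_decoder, z_gene)
```

Ranking all candidate genes for a disease by this probability gives the prioritized gene list. An all-zero weight matrix yields a probability of 0.5 for every pair, i.e., no preference.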
In one embodiment, the method may include applying a cross-entropy loss function L to the edge decoder model ED to calculate a final probability Pf of the edge (di, gj). The additional information includes one or more of an Online Mendelian Inheritance in Man, disease ontology, associations in other species, human mRNA co-expressions, protein-protein interactions, protein complex, comparative genomics interaction, and disease similarity network. The heterogenous network includes a gene network, a disease network, and a gene-disease network.
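A heterogeneous network made of a gene network, a disease network, and a gene-disease network can be represented, for illustration, as a single undirected neighbor map with typed nodes. The function name `build_heterogeneous_network` and the placeholder node identifiers are assumptions; edge weights are omitted to keep the sketch short.

```python
def build_heterogeneous_network(gene_edges, disease_edges, disease_gene_edges):
    """Return an undirected neighbor map keyed by ("gene", id) or ("disease", id)."""
    neighbors = {}

    def link(a, b):
        # Store the edge in both directions because the network is undirected.
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)

    for g1, g2 in gene_edges:
        link(("gene", g1), ("gene", g2))
    for d1, d2 in disease_edges:
        link(("disease", d1), ("disease", d2))
    for d, g in disease_gene_edges:
        link(("disease", d), ("gene", g))
    return neighbors


# Placeholder identifiers only; these are not real genes or diseases.
toy_network = build_heterogeneous_network(
    gene_edges=[("gene_1", "gene_2")],
    disease_edges=[("disease_A", "disease_B")],
    disease_gene_edges=[("disease_A", "gene_1")],
)
```

Keeping the node type in the key makes it straightforward to separate gene neighbors from disease neighbors when different weights are applied to each type of embedding.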
In one application, the step of building includes linking each gene node gj to other known gene nodes; linking each disease node di to other known disease nodes; and linking each gene node gj to the disease node di if such a link is known. The method may also include initializing the embeddings with the additional information. All the steps and features discussed above with regard to the method of
To evaluate this novel method versus the traditional methods, the following criteria have been used: Area Under the Receiver Operating Characteristic curve (AUROC), Area Under the Precision-Recall Curve (AUPRC), Boltzmann-Enhanced Discrimination of ROC (BEDROC), Average Precision at K (AP@K), and Recall at K (R@K) score. AUROC is a commonly used criterion in machine learning, which computes the area under the ROC curve. In the disease-gene prioritization problem, it can be interpreted as the probability that a true disease-associated gene is ranked higher than a false one selected randomly from a uniform distribution. Similar to AUROC, AUPRC computes the area under the precision-recall curve. BEDROC, proposed to solve the “early recognition” problem, can be interpreted as the probability of a disease-associated gene being ranked higher than a gene selected randomly following a distribution in which top-ranked genes have a higher probability of being chosen. AP@K computes the precision of the prediction if one considers the top K predicted associations. Recall at K considers the recall score within the top K predictions. These five criteria can provide a comprehensive evaluation of the proposed novel method.
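Three of these criteria can be sketched directly from their definitions. The function names, the toy scores, and the tie handling (a tie counts as one half in AUROC) are assumptions; AP@K is implemented here as the precision within the top K predictions, following the wording above. AUPRC and BEDROC are omitted from the sketch.

```python
def rank_genes(scores):
    """Gene identifiers sorted by decreasing predicted score."""
    return sorted(scores, key=scores.get, reverse=True)


def precision_at_k(ranked, positives, k):
    hits = sum(1 for g in ranked[:k] if g in positives)
    return hits / k


def recall_at_k(ranked, positives, k):
    hits = sum(1 for g in ranked[:k] if g in positives)
    return hits / len(positives)


def auroc(scores, positives):
    """Probability that a true gene outscores a uniformly chosen false gene."""
    pos = [s for g, s in scores.items() if g in positives]
    neg = [s for g, s in scores.items() if g not in positives]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical predicted scores for one disease; g1 and g3 are the true genes.
toy_scores = {"g1": 0.9, "g2": 0.8, "g3": 0.4, "g4": 0.1}
true_genes = {"g1", "g3"}
ranked = rank_genes(toy_scores)
p_at_2 = precision_at_k(ranked, true_genes, 2)
r_at_2 = recall_at_k(ranked, true_genes, 2)
auc = auroc(toy_scores, true_genes)
```

In the toy case one of the two true genes is in the top two, so both precision and recall at K=2 are one half, while three of the four positive-negative pairs are ordered correctly.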
Prior to showing and comparing the results obtained with the novel method and the five traditional methods, the five competing methods are briefly introduced. The first method is Katz [8], which is a typical network-based method. It computes the node similarity based on the network topology. The similarity matrix is then used to make predictions for disease-gene associations. The second method is Catapult [8], another network-based method. It combines supervised learning with social network analysis, and has been shown to be the state-of-the-art network-based method. This method deploys a biased support vector machine (SVM) as the classifier while the features are derived from random walks in the heterogeneous gene-trait network. This method significantly outperformed the previous network-based methods, such as PRINCE and RWRH. The third method is a recent network-based method, the Graph Convolution-based Association Scoring (GCAS) method [9]. This method used the GCN as a pure network analysis tool which can perform information propagation on the similarity and association networks. The novel method discussed in
To compare all these methods, a dataset was built from the OMIM database (Nov. 26, 2017). After preprocessing, a dataset with 12,331 genes, 3,215 diseases, and 3,988 associations was constructed. With this dataset, 10% of the associations were randomly hidden as the testing set and the remaining 90% of the edges were used as the training data to evaluate the overall performance of the different methods in recovering the hidden associations. The performance of the different methods discussed above is summarized in the table in
On the other hand, because the proposed method, PGCN, can utilize both the network topology information and the additional information of the nodes in a systematic and natural way, it can outperform all the state-of-the-art methods significantly and consistently across different criteria with a large margin. In terms of AUPRC, PGCN can outperform the second-best method by around 10%. The ROC curves and the PRC curves are shown in
For disease-gene prioritization, the Recall at K criterion is an important indicator because the top-ranked genes are candidates for further investigation.
Following the idea of [8], the performance of different methods on predicting the associations of singleton genes, which are defined as those genes with only one link in the database, was checked. In the experiment performed by the inventors, the only links for the singleton genes were removed from training, which means that the methods needed to predict the associations “from scratch.” This test used the recall at K to evaluate the various methods, which is a difficult measurement because each test gene has one and only one true association. As shown in
Next, the inventors evaluated the ability of the various methods to predict associations for novel diseases for which no associated genes are known. For a novel disease, all of its associations with genes were removed during training and the various methods were challenged to recover those missing associations. This task is considerably less difficult in terms of recall than recovering the associations for singleton genes because a disease can be associated with more than one gene. At the same time, this task is practically important because it is directly related to the molecular diagnosis for human diseases. As shown in
To further understand how the novel method of
The inventors also evaluated the prediction performance of different methods for novel associations, which are defined to be the association between a disease and a gene, both of which have no association in the training set. This is the most stringent and challenging requirement. In order for a method to recover such associations, neither the disease end nor the gene end of the association can be directly used. The method must be powerful enough to effectively use the disease- and gene-specific information, and propagate the information through other diseases, genes, and their associations in the heterogeneous network. The results for this experiment are shown in
As a case study, the inventors have investigated the top 10 associations for breast cancer. Among these 10 genes, other than the four ground-truth breast cancer-related genes reported in the OMIM dataset, the novel model also predicted three interesting genes, Axin2, TLR4, and PTPRJ, which were reported to be related to breast cancer. For example, Axin2 was found to be included in the Wnt/β-catenin/Axin2 pathway, which can regulate breast cancer invasion and metastasis; TLR4 was found to be overexpressed in the majority of the breast cancer samples and also to be related to the metastasis of breast cancer; and PTPRJ (DEP-1/PTPRJ/CD148), which is a receptor-like protein tyrosine phosphatase (PTP), was found to be mutated or deleted in human breast cancer. These results suggest the potential application of the PGCN method to discovering new genes related to complex human diseases.
The above-discussed procedures and methods may be implemented in a computing device as illustrated in
Exemplary computing device 1000 suitable for performing the activities described in the embodiments discussed above may include a server 1001. Such a server 1001 may include a central processor (CPU) 1002 coupled to a random access memory (RAM) 1004 and to a read-only memory (ROM) 1006. ROM 1006 may also be other types of storage media to store programs, such as programmable ROM (PROM), erasable PROM (EPROM), etc. Processor 1002 may communicate with other internal and external components through input/output (I/O) circuitry 1008 and bussing 1010 to provide control signals and the like. Processor 1002 carries out a variety of functions as are known in the art, as dictated by software and/or firmware instructions.
Server 1001 may also include one or more data storage devices, including hard drives 1012, CD-ROM drives 1014 and other hardware capable of reading and/or storing information, such as DVD, etc. In one embodiment, software for carrying out the above-discussed steps may be stored and distributed on a CD-ROM or DVD 1016, a USB storage device 1018 or other form of media capable of portably storing information. These storage media may be inserted into, and read by, devices such as CD-ROM drive 1014, disk drive 1012, etc. Server 1001 may be coupled to a display 1020, which may be any type of known display or presentation screen, such as LCD, plasma display, cathode ray tube (CRT), etc. A user input interface 1022 is provided, including one or more user interface mechanisms such as a mouse, keyboard, microphone, touchpad, touch screen, voice-recognition system, etc.
Server 1001 may be coupled to other devices, such as various databases, etc. The server may be part of a larger network configuration as in a global area network (GAN) such as the Internet 1028, which allows ultimate connection to various landline and/or mobile computing devices.
The disclosed embodiments provide a method for disease-gene prioritization by disease and gene embedding through graph convolutional neural networks. It should be understood that this description is not intended to limit the invention. On the contrary, the embodiments are intended to cover alternatives, modifications and equivalents, which are included in the spirit and scope of the invention as defined by the appended claims. Further, in the detailed description of the embodiments, numerous specific details are set forth in order to provide a comprehensive understanding of the claimed invention. However, one skilled in the art would understand that various embodiments may be practiced without such specific details.
Although the features and elements of the present embodiments are described in the embodiments in particular combinations, each feature or element can be used alone without the other features and elements of the embodiments or in various combinations with or without other features and elements disclosed herein.
This written description uses examples of the subject matter disclosed to enable any person skilled in the art to practice the same, including making and using any devices or systems and performing any incorporated methods. The patentable scope of the subject matter is defined by the claims, and may include other examples that occur to those skilled in the art. Such other examples are intended to be within the scope of the claims.
This application claims priority to U.S. Provisional Patent Application No. 62/808,581, filed on Feb. 21, 2019, entitled “DEEP LEARNING-BASED DISEASE-GENE PRIORITIZATION METHOD,” the disclosure of which is incorporated herein by reference in its entirety.
Filing Document | Filing Date | Country | Kind |
---|---|---|---|
PCT/IB2020/050614 | 1/27/2020 | WO | 00 |
Number | Date | Country | |
---|---|---|---|
62808581 | Feb 2019 | US |