Embodiments of the invention generally relate to machine learning, and more particularly to neural networks.
Medical and computer scientists and researchers in the biomedical domain increasingly rely on computer technology to perform new tasks, to perform old tasks in new and better ways, or to tackle previously-known (but unsolved) or newly-discovered challenges. Conventional computers and computing techniques, and human ingenuity alone, are inadequate to perform these tasks or to address these challenges.
Several important tasks in the biomedical domain may be described as link prediction tasks. Link prediction is the task of inferring missing links between two or more entities in a network of entities (for example, as represented by a knowledge graph), by learning from observed links between those entities. In the biomedical context, link prediction may be used to perform drug-drug interaction prediction, disease-gene prioritization, and drug-target interaction prediction.
In these link prediction tasks, one objective may be to identify links between two biomedical entities. A biomedical entity generally refers to any composition of matter that is related to the fields of biology and medicine. In the context of computing technology, a biomedical entity is generally representable using a data type, structure, or pattern. Examples of biomedical entities representable via a computer are genes, proteins, amino acids, diseases, and drugs. These are merely examples; other biomedical entities are possible.
Embodiments of the invention provide for methods, computer program products, and systems for using a neural network model for determining an association between biomedical entities in a biomedical entity pair. For example, the method, according to an embodiment, generates vector representations of respective tokens of biomedical entities of the biomedical entity pair. The method generates, using a neural network, hidden vectors for the vector representations to generate hidden matrices. The method concatenates the hidden matrices to generate respective concatenated matrices, and correlates the concatenated matrices. The method predicts a probability of an association between the biomedical entities of the biomedical entity pair based at least in part on respective attention vectors generated using the concatenated matrices.
According to an embodiment, the method generates vector representations of biomedical entities of the biomedical entity pairs by processing tokens of the biomedical entities via an embedding lookup layer.
According to an embodiment, a biomedical entity refers to a data representation of a composition of matter that is related to the fields of biology and medicine.
According to an embodiment, the neural network is a Long Short Term Memory (LSTM) recurrent neural network (RNN).
According to an embodiment, correlating the concatenated matrices refers to performing attentive pooling on the concatenated matrices.
According to an embodiment, performing attentive pooling is done using attentive pooling.
According to an embodiment, the attentive pooling comprises row-wise attentive pooling and column-wise attentive pooling.
According to an embodiment, the method generates attention vectors corresponding to the biomedical entity pairs.
According to an embodiment, the steps of the method are repeated iteratively using a training dataset; and the method optimizes parameters of the neural network to maximize the predicted probability of an association for the training dataset.
According to an embodiment, the method processes a new biomedical entity pair that does not appear in the training set and for which a prior association is not known, and determines a probability of association between the biomedical entities of the new biomedical entity pair.
The task of biomedical link prediction generally involves answering the question of whether (or predicting the likelihood that) two biomedical entities under consideration are associated in some way, where the answer is not directly known in a knowledge source (such as a knowledge graph). In this context, a given biomedical entity may be taken as the reference point, and compared against one or more “targets,” i.e., biomedical entities with which the given biomedical entity might be associated. For example, in the more specific task of determining drug-target interactions (DTIs), a question that might be answered is whether a given drug (a chemical compound) is associated with a protein (the target).
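As a minimal illustration of the link prediction setting described above, the following sketch stores observed drug-target links as a set of pairs and enumerates the unobserved pairs whose association would have to be predicted. The entity names and the `candidate_pairs` helper are hypothetical and are not taken from any embodiment.

```python
from itertools import product

# Hypothetical toy knowledge graph: only observed (drug, protein) links are stored.
observed_links = {
    ("drug_A", "protein_X"),
    ("drug_B", "protein_Y"),
}

def candidate_pairs(links):
    """Return the drug-protein pairs that are not observed in the graph, sorted."""
    drugs = sorted({d for d, _ in links})
    proteins = sorted({p for _, p in links})
    return [(d, p) for d, p in product(drugs, proteins) if (d, p) not in links]

# These are the pairs for which a link prediction model would be queried.
unknown = candidate_pairs(observed_links)
print(unknown)
```

In this toy graph, the absence of a pair means only that no link has been observed, not that the entities are known to be unassociated.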
Previous approaches to link prediction for pairs of biomedical entities either cannot sufficiently use the rich features of the relevant domain (as reflected, for example, in the entities' matrix factorization), or require extensive domain expertise for feature engineering (for example, similarity-based prediction). More specifically, prior art solutions cannot use both linkage information and content information at the same time. Moreover, prior art solutions do not utilize basic entity information in the general training phase of a neural network, and cannot handle unobserved entities at inference time. Additionally, the prior art does not extend to biomedical entities such as gene sequences, protein sequences, or chemical structures.
Some embodiments of the invention will generally be described in the context of the following three processing phases: a specific neural network training phase (“specific training phase”); a general neural network training phase (“general training phase”); and a neural network inference phase (“inference phase”).
The specific training phase generally refers to a set of functions that receive, as their inputs, a biomedical entity pair; process them using various machine learning techniques including those that use a neural network; and generate an output that represents a likelihood that the two biomedical entities in the biomedical entity pair are associated with one another (the output may also be considered a measure of their association). This process is referred to as “specific” because its output is based on a given biomedical entity pair, and because iterative execution of this specific process forms part of the general training phase (along with other processes).
The general training phase generally refers to a set of functions that process multiple biomedical entity pairs (a training set) and a knowledge graph containing the biomedical entity pairs, where the knowledge graph may include known associations (or lack of associations) between the various biomedical entities that the knowledge graph represents. The biomedical entities and their known associations (or lack of associations) are used, in the general training phase, to train parameters of a link prediction neural network, through iterative execution of the specific training phase and use of machine learning techniques such as gradient descent. Through these processes, the general training phase derives and optimizes the neural network's parameters.
The inference phase generally refers to a set of functions that evaluate a given biomedical entity pair's level of association (whether as a scale or as a binary value) by using the given biomedical entity pair as inputs to the trained neural network, and by receiving an output of the trained neural network. The output represents a measure of association between the biomedical entities of the biomedical entity pair. In this context, the biomedical entity pair under consideration may be new biomedical entities or newly paired biomedical entities, for which a prior association measure is not yet known or observed.
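The relationship between the three phases can be sketched with a deliberately simplified stand-in model. The sketch below assumes each biomedical entity is already a fixed-length feature vector and scores a pair with a diagonal bilinear form followed by a sigmoid; this toy model, the function names, and the hyperparameter values are assumptions made for illustration only and do not represent the neural network of the embodiments.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def specific_step(w, a, b):
    """Specific phase: score one entity pair with the current parameters."""
    return sigmoid(sum(wk * ak * bk for wk, ak, bk in zip(w, a, b)))

def general_training(pairs, labels, dim, lr=0.5, epochs=200):
    """General phase: iterate the specific step and update w by gradient descent."""
    w = [0.0] * dim
    for _ in range(epochs):
        for (a, b), y in zip(pairs, labels):
            p = specific_step(w, a, b)
            # Gradient of the cross-entropy loss with respect to w for this toy model.
            w = [wk - lr * (p - y) * ak * bk for wk, ak, bk in zip(w, a, b)]
    return w

def infer(w, a, b):
    """Inference phase: apply the trained parameters to a previously unseen pair."""
    return specific_step(w, a, b)

# Hypothetical training pairs with known association labels (1 = associated).
train_pairs = [([1.0, 0.0], [1.0, 0.0]), ([1.0, 0.0], [-1.0, 0.0]),
               ([0.0, 1.0], [0.0, 1.0]), ([0.0, 1.0], [0.0, -1.0])]
train_labels = [1, 0, 1, 0]

w = general_training(train_pairs, train_labels, dim=2)
p_new = infer(w, [1.0, 1.0], [1.0, 1.0])  # pair not present in the training set
```

The point of the sketch is the control flow: the general phase repeatedly calls the specific step and adjusts the parameters, and the inference phase reuses the same scoring function with the parameters frozen.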
Embodiments of the invention will now be described with greater specificity, in connection with the Figures.
According to the depicted embodiment, link prediction system 100 includes a link prediction program 102 having one or more modules, including a specific training module 103, a general training module 104, and an inference module 112. Other components of link prediction system 100 include one or more biomedical entity pairs 108, one or more knowledge graphs 109, and one or more trained neural networks 116, stored in one or more databases (not shown). General properties of these components and their interactions are described in more detail below.
Specific training module 103: Generally, specific training module 103 receives as its input a biomedical entity pair 108, processes that input using a neural network (which may be, for example, the trained neural network 116, if that neural network already exists), and generates an output that represents a measure of association between the biomedical entities in the biomedical entity pair. In this context, biomedical entity pair 108 may be any pairing of biomedical entities from any source. While biomedical entity pairs 108 and knowledge graph 109 are shown separately in
General training module 104: generally, general training module 104 receives as inputs one or more biomedical entity pairs 108 from one or more knowledge graphs 109; that is, general training module 104 generates or receives a training data set containing pairings of biomedical entities from among the set of biomedical entities represented in knowledge graph 109. For each biomedical entity pair 108 in the training data set, general training module 104 processes the biomedical entities of that pair using known associations (as represented in the knowledge graph) between the two biomedical entities. The processing results in general training module 104 generating and optimizing parameters of trained neural network 116. According to an embodiment of the invention, the processing may be performed through successive iterations of specific training module 103. Additional details of the operation of general training module 104, as well as the components with which it operates, are provided in connection with
Inference module 112: generally, inference module 112 receives as input a biomedical entity pair 108 and trained neural network 116, processes the biomedical entity pair 108 using trained neural network 116, and generates link predictions 108. In this context, biomedical entity pair 108 represents a pairing of biomedical entities whose association is not known, and whose association is being predicted. Additional details of inference module 112 and components with which it operates are discussed in connection with
Referring now to
With continued reference to
With continued reference to
With continued reference to
With continued reference to
Referring now to
With continued reference to
With continued reference to
With continued reference to
Referring now to
P(y=1|rg,rd)=σ(g,d)=(1+e−r
Some embodiments of the invention have been tested using a testing dataset. The testing dataset was constructed to simulate practical situations in which, given a drug-protein pair at testing time, the drug, the protein, or both may not have been observed at training time. Such an experimental setting demands strong generalization ability from the underlying model. Evaluated against prior art solutions, embodiments of the invention use less feature engineering and require less domain expertise, and therefore present superior results in the difficult cases that are not covered well by human-designed features, and in which neither the drug nor the protein of a testing pair has been observed.
With continued reference to
With continued reference to M and the cell states from the previous time step $h_{t-1} \in \mathbb{R}^{H}$; $c_{t-1} \in \mathbb{R}^{H}$ and produces a hidden state $h_t \in \mathbb{R}^{H}$. Here, M and H are two hyperparameters that specify the dimension of the embedding space and the dimension of the hidden space, respectively. The variant of LSTM used is defined as:
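$$i_t = \sigma(W_{ii} x_t + W_{hi} h_{t-1} + b_{hi}) \qquad (1)$$
$$f_t = \sigma(W_{if} x_t + b_{if} + W_{hf} h_{t-1} + b_{hf}) \qquad (2)$$
$$g_t = \tanh(W_{ig} x_t + b_{ig} + W_{hc} h_{t-1} + b_{hg}) \qquad (3)$$
$$o_t = \sigma(W_{io} x_t + b_{io} + W_{ho} h_{t-1} + b_{ho}) \qquad (4)$$
$$c_t = f_t * c_{t-1} + i_t * g_t \qquad (5)$$
$$h_t = o_t * \tanh(c_t) \qquad (6)$$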
where $W_i$, $W_h$, $b_i$, and $b_h$ are learnable parameters, and where $h_0 = 0_H$ is initialized as a vector of zeros. Suppose now that the input tokens belong to a vocabulary $V = \{t_1, \ldots, t_{|V|}\}$; the input embeddings are obtained as:
$$x_i = W_v^{\top} I_i \qquad (7)$$
where $W_v \in \mathbb{R}^{|V| \times M}$ is a learnable parameter and $I_i \in \mathbb{R}^{|V| \times 1}$ is a vector whose i-th value is one and all other values are zero.
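The embedding lookup of equation (7) and one LSTM time step following equations (1)-(6) can be sketched in plain Python as follows. This is a minimal sketch under stated assumptions: the function names (`embed`, `lstm_step`, `encode`), the three-token vocabulary, the dimensions M = 4 and H = 3, and the random initialization are hypothetical; the two bias vectors of each gate are merged into a single vector, which is mathematically equivalent; and the initial cell state is assumed to be zero like the initial hidden state.

```python
import math
import random

def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def embed(W_v, index):
    """Equation (7): W_v^T I_i with a one-hot I_i simply selects row i of W_v."""
    return list(W_v[index])

def lstm_step(x, h_prev, c_prev, p):
    """One LSTM time step; the gates follow equations (1)-(4)."""
    def gate(name, act):
        pre = [a + b + c for a, b, c in zip(matvec(p["W_i" + name], x),
                                            matvec(p["W_h" + name], h_prev),
                                            p["b_" + name])]
        return [act(v) for v in pre]
    i, f = gate("i", sigmoid), gate("f", sigmoid)
    g, o = gate("g", math.tanh), gate("o", sigmoid)
    c = [fk * ck + ik * gk for fk, ck, ik, gk in zip(f, c_prev, i, g)]  # eq. (5)
    h = [ok * math.tanh(ck) for ok, ck in zip(o, c)]                    # eq. (6)
    return h, c

def init_params(M, H, rng):
    """Small random weights and zero biases (an arbitrary choice for the sketch)."""
    def mat(rows, cols):
        return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]
    p = {}
    for name in "ifgo":
        p["W_i" + name] = mat(H, M)
        p["W_h" + name] = mat(H, H)
        p["b_" + name] = [0.0] * H
    return p

def encode(tokens, vocab, W_v, p, H):
    """Run the LSTM over a token sequence and collect one hidden vector per token."""
    h, c = [0.0] * H, [0.0] * H
    hidden = []
    for t in tokens:
        h, c = lstm_step(embed(W_v, vocab[t]), h, c, p)
        hidden.append(h)
    return hidden

rng = random.Random(0)
vocab = {"M": 0, "K": 1, "V": 2}  # hypothetical amino-acid tokens
M, H = 4, 3
W_v = [[rng.uniform(-0.1, 0.1) for _ in range(M)] for _ in range(len(vocab))]
params = init_params(M, H, rng)
hidden = encode(["M", "K", "V", "K"], vocab, W_v, params, H)
```

Stacking the vectors in `hidden` as columns gives a matrix with one column per input token, which is the shape later used for the context matrices.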
With continued reference to
[Algorithm 1: neural fingerprint pseudo-code listing]
Algorithm 1 shows the pseudo-code of the neural fingerprint algorithm, which produces a dense vector representation from the input molecule graph and, as a side effect, also assigns a dense vector representation to each atom in the molecule. In the initialization phase (lines 1-2 in Algorithm 1), the atom features are initialized as a 62-dimensional sparse vector that indicates both chemical and topological properties of the atom. The algorithm then iteratively applies a convolutional operation on the graph (lines 4-10 in Algorithm 1) R times and updates the fingerprint at the end of each iteration. The radius parameter R controls how many hops information can be propagated, and it is set to 3 in this instance.
While a CNN is usually applied to a matrix, such as an image, Algorithm 1 is convolutional in the sense that it applies filters to each atom and its neighborhood to capture a local signal, and the aggregated local signals are then pooled to obtain the final vector representation. In contrast to an image, in which each pixel always has 8 neighbor pixels, an atom can have from one to five neighbor atoms. Therefore, instead of using one convolutional filter, Algorithm 1 uses 5 linear filters $H_1, \ldots, H_5$, one for each possible number of neighbors. At the end of each iteration, the fingerprint is updated by adding the softmax of a linear transformation of each atom vector, and the linear transformation for each layer is defined by learnable parameters $W_L \in \mathbb{R}^{62 \times H}$, $L = 1, \ldots, R$.
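The graph convolution described above can be sketched as follows. This is a sketch under stated assumptions and is not a reproduction of Algorithm 1: the `tanh` nonlinearity, the synchronous update of all atom vectors within a layer, the sharing of the degree-specific filters across layers, the toy three-atom chain, and the tiny dimensions (2 features per atom instead of 62) are all choices made for illustration, and the names `neural_fingerprint`, `filters`, and `out_weights` are hypothetical.

```python
import math

def vecmat(v, W):
    """Row vector v (length n) times matrix W (n x m)."""
    return [sum(v[k] * W[k][j] for k in range(len(v))) for j in range(len(W[0]))]

def softmax(x):
    m = max(x)
    e = [math.exp(v - m) for v in x]
    s = sum(e)
    return [v / s for v in e]

def neural_fingerprint(atom_feats, neighbors, filters, out_weights, radius):
    """atom_feats: {atom: vector}; neighbors: {atom: [atoms]};
    filters: {number of neighbors: F x F matrix}; out_weights: one F x H matrix per layer."""
    fingerprint = [0.0] * len(out_weights[0][0])
    r = {a: list(v) for a, v in atom_feats.items()}
    for layer in range(radius):
        new_r = {}
        for a, vec in r.items():
            nbrs = neighbors[a]
            # Sum the atom's own vector with the vectors of its neighbors.
            v = [vec[k] + sum(r[n][k] for n in nbrs) for k in range(len(vec))]
            # Apply the filter that matches the atom's number of neighbors.
            new_r[a] = [math.tanh(x) for x in vecmat(v, filters[len(nbrs)])]
        r = new_r
        # Add the softmax of a linear transformation of each atom vector.
        for a in r:
            contrib = softmax(vecmat(r[a], out_weights[layer]))
            fingerprint = [f + c for f, c in zip(fingerprint, contrib)]
    return fingerprint, r

# Hypothetical three-atom chain 0 - 1 - 2 with 2 features per atom.
atom_feats = {0: [1.0, 0.0], 1: [1.0, 0.0], 2: [0.0, 1.0]}
neighbors = {0: [1], 1: [0, 2], 2: [1]}
filters = {1: [[0.5, 0.1], [-0.2, 0.3]], 2: [[0.2, -0.1], [0.4, 0.1]]}
out_weights = [[[0.1, 0.2, 0.3], [0.0, -0.1, 0.2]],
               [[0.3, 0.1, -0.2], [0.2, 0.0, 0.1]]]
fp, atom_vecs = neural_fingerprint(atom_feats, neighbors, filters, out_weights, radius=2)
```

The function returns both the molecule-level fingerprint and the per-atom vectors, mirroring the side effect noted in the text.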
With continued reference to
For example, suppose $P \in \mathbb{R}^{H_p \times L_p}$ is the context matrix of a given protein, where $H_p$ and $L_p$ are the dimension of the protein hidden space and the number of inputs, respectively. Because proteins have two input sources, P can be formed in three ways: (1) the concatenation of LSTM hidden vectors with amino acid sequence input, so that $L_p$ equals the number of amino acids in the sequence; (2) the concatenation of GO annotation embeddings, so that $L_p$ equals the number of GO terms for the protein; and (3) the concatenation of both (1) and (2).
Similarly, suppose $D \in \mathbb{R}^{H_d \times L_d}$ is the context matrix of a given drug, where $H_d$ and $L_d$ are the dimension of the drug hidden space and the number of inputs, respectively. D can be (1) the concatenation of LSTM hidden vectors with SMILES string input, so that $L_d$ equals the number of tokens in the SMILES string, or (2) the concatenation of atom vectors obtained from the graph CNN, so that $L_d$ equals the number of atoms in the molecule.
A soft alignment matrix $A \in \mathbb{R}^{L_p \times L_d}$ is calculated as $A = \tanh(P^{\top} U D)$, where $U \in \mathbb{R}^{H_p \times H_d}$ is a trainable parameter. As an intuitive example, when proteins are represented by amino acid sequences and drugs by chemical structure graphs, A empirically represents the interaction between each amino acid and each atom.
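Next, the attention weights $\alpha_p \in \mathbb{R}^{L_p}$ and $\alpha_d \in \mathbb{R}^{L_d}$, which can be interpreted as importance scores on the input units, are calculated by applying row-wise and column-wise max-pooling operations to A:
$$[\alpha_p]_i = \max_{1 \le j \le L_d} A_{i,j} \qquad (8)$$
$$[\alpha_d]_j = \max_{1 \le i \le L_p} A_{i,j} \qquad (9)$$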
Finally, $\alpha_p$ and $\alpha_d$ are exponentially normalized by a softmax function, and the results are used as weights to form weighted sums of the columns of the context matrices:
$$r_p = P \cdot \mathrm{softmax}(\alpha_p) \qquad (10)$$
$$r_d = D \cdot \mathrm{softmax}(\alpha_d) \qquad (11)$$
where the softmax function is defined as:
$$\mathrm{softmax}(x)_i = \frac{e^{x_i}}{\sum_{j} e^{x_j}}$$
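The softmax normalization and the attentive pooling steps above (soft alignment matrix, row-wise and column-wise max-pooling, and the weighted sums of equations (10) and (11)) can be sketched in plain Python as follows. The function name `attentive_pooling` and the small example matrices are hypothetical; matrices are stored as lists of rows, so each column of P or D corresponds to one input unit.

```python
import math

def matmul(X, Y):
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]

def transpose(X):
    return [list(col) for col in zip(*X)]

def softmax(x):
    m = max(x)  # subtract the maximum for numerical stability
    e = [math.exp(v - m) for v in x]
    s = sum(e)
    return [v / s for v in e]

def attentive_pooling(P, D, U):
    """P: H_p x L_p, D: H_d x L_d, U: H_p x H_d. Returns r_p, r_d and the weights."""
    A = [[math.tanh(v) for v in row]
         for row in matmul(matmul(transpose(P), U), D)]    # L_p x L_d
    alpha_p = [max(row) for row in A]                      # row-wise max, length L_p
    alpha_d = [max(col) for col in zip(*A)]                # column-wise max, length L_d
    w_p, w_d = softmax(alpha_p), softmax(alpha_d)
    r_p = [sum(P[h][i] * w_p[i] for i in range(len(w_p))) for h in range(len(P))]
    r_d = [sum(D[h][j] * w_d[j] for j in range(len(w_d))) for h in range(len(D))]
    return r_p, r_d, w_p, w_d

# Hypothetical context matrices: 3 protein inputs and 2 drug inputs, hidden size 2.
P = [[1.0, 0.0, -1.0],
     [0.0, 1.0, 0.0]]
D = [[1.0, 0.0],
     [0.0, 1.0]]
U = [[1.0, 0.0],
     [0.0, 1.0]]
r_p, r_d, w_p, w_d = attentive_pooling(P, D, U)
```

In this example the third protein input aligns poorly with every drug input, so it receives the smallest attention weight in `w_p`.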
With continued reference to
The attention-based vector representations rp and rd are fed separately into the two networks. Then the inner product of the outputs may be taken, and a sigmoid function may be used to predict the probability that a binding exists between a pair of protein and drug:
where $f_p$ and $f_d$ are the transformations of the siamese networks for proteins and drugs, respectively.
In a classification scenario, a hyper-parameter threshold δ is selected as classification boundary:
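The prediction and thresholding steps can be sketched as follows. The sketch assumes that each branch of the siamese networks is a single linear layer followed by `tanh`, and that a pair is classified as positive when the predicted probability strictly exceeds δ; both are assumptions made for illustration, since the text specifies only an inner product, a sigmoid, and a threshold. The names `transform`, `predict_probability`, and `classify` are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def transform(W, r):
    """Hypothetical one-layer stand-in for one branch of the siamese networks."""
    return [math.tanh(sum(w * v for w, v in zip(row, r))) for row in W]

def predict_probability(r_p, r_d, W_p, W_d):
    """Sigmoid of the inner product of the two transformed representations."""
    out_p, out_d = transform(W_p, r_p), transform(W_d, r_d)
    return sigmoid(sum(a * b for a, b in zip(out_p, out_d)))

def classify(prob, delta=0.5):
    """Return 1 (association predicted) if prob exceeds the threshold delta, else 0."""
    return 1 if prob > delta else 0

identity = [[1.0, 0.0], [0.0, 1.0]]
prob = predict_probability([1.0, 0.5], [0.5, 1.0], identity, identity)
label = classify(prob, delta=0.5)
```

Raising δ trades recall for precision: the same pair that is labeled positive at δ = 0.5 may be labeled negative at a stricter threshold.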
With continued reference to
where Θ is the set of neural network parameters described above. However, although the discussed examples use a dataset with both positive and negative pairs, negative pairs are usually not available for similar tasks, especially when a dataset comes from a knowledge graph that stores only existing triples. Therefore, a pairwise ranking loss may be employed, which, for each given protein p, maximizes the margin between interacting drugs and non-interacting drugs, i.e., ranks positive drugs higher than negative drugs as much as possible.
where γ>0 is a hyper-parameter that specifies the width of the margin, and N+(p) and N−(p) give the set of drugs that interact with p and those that do not interact with p, respectively. In this setting, the training only emphasizes the observed positive examples so that negative examples can be generated by sampling pseudo-negative drugs with heuristic criteria, if a dataset does not have any.
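A pairwise ranking loss of this kind, together with sampling of pseudo-negative drugs, can be sketched as follows. The hinge form max(0, γ − s(p, d⁺) + s(p, d⁻)) used here is the common formulation of a margin ranking loss and is an assumption, as is uniform sampling as the heuristic for choosing pseudo-negatives; the function names, the scores, and the entity names are hypothetical.

```python
import random

def pairwise_ranking_loss(scores, positives, negatives, gamma=0.1):
    """scores: {(protein, drug): float}; positives / negatives: {protein: [drugs]}.

    Sums a hinge penalty over every (positive drug, negative drug) combination
    of each protein; a term is zero once the positive outscores the negative by gamma.
    """
    loss = 0.0
    for p, pos_drugs in positives.items():
        for d_pos in pos_drugs:
            for d_neg in negatives.get(p, []):
                loss += max(0.0, gamma - scores[(p, d_pos)] + scores[(p, d_neg)])
    return loss

def sample_pseudo_negatives(protein, positives, all_drugs, k, rng):
    """Uniformly sample up to k drugs with no observed link to `protein`."""
    pool = sorted(set(all_drugs) - set(positives.get(protein, [])))
    return rng.sample(pool, min(k, len(pool)))

rng = random.Random(0)
positives = {"p1": ["d1"]}            # N+(p1): observed interacting drugs
all_drugs = ["d1", "d2", "d3"]
negatives = {"p1": sample_pseudo_negatives("p1", positives, all_drugs, 2, rng)}
scores = {("p1", "d1"): 0.9, ("p1", "d2"): 0.2, ("p1", "d3"): 0.85}
loss = pairwise_ranking_loss(scores, positives, negatives, gamma=0.1)
```

Only the pair (d1, d3) contributes to the loss here, because d3 is scored within the margin γ of the positive drug d1, while d2 is already ranked far enough below it.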
Additional neural network training and parameter optimization 750 may be performed according to any known method in the art of neural network optimization (for example, at step 316 shown in
In computing device 10, there is a computer system/server 12, which is operational with numerous other general purpose or special purpose computing system environments or configurations. Examples of well-known computing systems, environments, and/or configurations that may be suitable for use with computer system/server 12 include, but are not limited to, personal computer systems, server computer systems, thin clients, thick clients, hand-held or laptop devices, multiprocessor systems, microprocessor-based systems, set top boxes, programmable consumer electronics, network PCs, minicomputer systems, mainframe computer systems, and distributed cloud computing environments that include any of the above systems or devices, and the like.
Computer system/server 12 may be described in the general context of computer system-executable instructions, such as program modules, being executed by a computer system. Generally, program modules may include routines, programs, objects, components, logic, data structures, and so on that perform particular tasks or implement particular abstract data types. Computer system/server 12 may be practiced in distributed cloud computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed cloud computing environment, program modules may be located in both local and remote computer system storage media including memory storage devices.
As shown in
Bus 18 represents one or more of any of several types of bus structures, including a memory bus or memory controller, a peripheral bus, an accelerated graphics port, and a processor or local bus using any of a variety of bus architectures. By way of example, and not limitation, such architectures include Industry Standard Architecture (ISA) bus, Micro Channel Architecture (MCA) bus, Enhanced ISA (EISA) bus, Video Electronics Standards Association (VESA) local bus, and Peripheral Component Interconnect (PCI) bus.
Computer system/server 12 typically includes a variety of computer system readable media. Such media may be any available media that is accessible by computer system/server 12, and it includes both volatile and non-volatile media, removable and non-removable media.
System memory 28 can include computer system readable media in the form of volatile memory, such as random access memory (RAM) 30 and/or cache memory 32. Computer system/server 12 may further include other removable/non-removable, volatile/non-volatile computer system storage media. By way of example only, storage system 34 can be provided for reading from and writing to a non-removable, non-volatile magnetic media (not shown and typically called a “hard drive”). Although not shown, a magnetic disk drive for reading from and writing to a removable, non-volatile magnetic disk (e.g., a “floppy disk”), and an optical disk drive for reading from or writing to a removable, non-volatile optical disk such as a CD-ROM, DVD-ROM or other optical media can be provided. In such instances, each can be connected to bus 18 by one or more data media interfaces. As will be further depicted and described below, memory 28 may include at least one program product having a set (e.g., at least one) of program modules that are configured to carry out the functions of embodiments of the invention.
Program/utility 40, having a set (at least one) of program modules 42, may be stored in memory 28 by way of example, and not limitation, as well as an operating system, one or more application programs, other program modules, and program data. Each of the operating system, one or more application programs, other program modules, and program data or some combination thereof, may include an implementation of a networking environment. Program modules 42 generally carry out the functions and/or methodologies of embodiments of the invention as described herein.
Computer system/server 12 may also communicate with one or more external devices 14 such as a keyboard, a pointing device, a display 24, etc.; one or more devices that enable a user to interact with computer system/server 12; and/or any devices (e.g., network card, modem, etc.) that enable computer system/server 12 to communicate with one or more other computing devices. Such communication can occur via Input/Output (I/O) interfaces 22. Still yet, computer system/server 12 can communicate with one or more networks such as a local area network (LAN), a general wide area network (WAN), and/or a public network (e.g., the Internet) via network adapter 20. As depicted, network adapter 20 communicates with the other components of computer system/server 12 via bus 18. It should be understood that although not shown, other hardware and/or software components could be used in conjunction with computer system/server 12. Examples include, but are not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, and data archival storage systems.
Referring now generally to embodiments of the present invention, the embodiments may be a system, a method, and/or a computer program product at any possible technical detail level of integration. The computer program product may include a computer readable storage medium (or media) having computer readable program instructions thereon for causing a processor to carry out aspects of the present invention.
The computer readable storage medium can be a tangible device that can retain and store instructions for use by an instruction execution device. The computer readable storage medium may be, for example, but is not limited to, an electronic storage device, a magnetic storage device, an optical storage device, an electromagnetic storage device, a semiconductor storage device, or any suitable combination of the foregoing. A non-exhaustive list of more specific examples of the computer readable storage medium includes the following: a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or Flash memory), a static random access memory (SRAM), a portable compact disc read-only memory (CD-ROM), a digital versatile disk (DVD), a memory stick, a floppy disk, a mechanically encoded device such as punch-cards or raised structures in a groove having instructions recorded thereon, and any suitable combination of the foregoing. A computer readable storage medium, as used herein, is not to be construed as being transitory signals per se, such as radio waves or other freely propagating electromagnetic waves, electromagnetic waves propagating through a waveguide or other transmission media (e.g., light pulses passing through a fiber-optic cable), or electrical signals transmitted through a wire.
Computer readable program instructions described herein can be downloaded to respective computing/processing devices from a computer readable storage medium or to an external computer or external storage device via a network, for example, the Internet, a local area network, a wide area network and/or a wireless network. The network may comprise copper transmission cables, optical transmission fibers, wireless transmission, routers, firewalls, switches, gateway computers and/or edge servers. A network adapter card or network interface in each computing/processing device receives computer readable program instructions from the network and forwards the computer readable program instructions for storage in a computer readable storage medium within the respective computing/processing device.
Computer readable program instructions for carrying out operations of the present invention may be assembler instructions, instruction-set-architecture (ISA) instructions, machine instructions, machine dependent instructions, microcode, firmware instructions, state-setting data, configuration data for integrated circuitry, or either source code or object code written in any combination of one or more programming languages, including an object oriented programming language such as Smalltalk, C++, or the like, and procedural programming languages, such as the “C” programming language or similar programming languages. The computer readable program instructions may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server. In the latter scenario, the remote computer may be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet Service Provider). In some embodiments, electronic circuitry including, for example, programmable logic circuitry, field-programmable gate arrays (FPGA), or programmable logic arrays (PLA) may execute the computer readable program instructions by utilizing state information of the computer readable program instructions to personalize the electronic circuitry, in order to perform aspects of the present invention.
Aspects of the present invention are described herein with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer readable program instructions.
These computer readable program instructions may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks. These computer readable program instructions may also be stored in a computer readable storage medium that can direct a computer, a programmable data processing apparatus, and/or other devices to function in a particular manner, such that the computer readable storage medium having instructions stored therein comprises an article of manufacture including instructions which implement aspects of the function/act specified in the flowchart and/or block diagram block or blocks.
The computer readable program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other device to cause a series of operational steps to be performed on the computer, other programmable apparatus or other device to produce a computer implemented process, such that the instructions which execute on the computer, other programmable apparatus, or other device implement the functions/acts specified in the flowchart and/or block diagram block or blocks.
The flowchart and block diagrams in the Figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods, and computer program products according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of instructions, which comprises one or more executable instructions for implementing the specified logical function(s). In some alternative implementations, the functions noted in the blocks may occur out of the order noted in the Figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems that perform the specified functions or acts or carry out combinations of special purpose hardware and computer instructions.