The present invention relates generally to systems and methods for predicting adverse drug reactions, and particularly to a framework for predicting potential adverse drug reactions (ADRs) for drug candidates and undetected ADRs for marketed drugs, and for identifying the relevant targets. Further aspects enable use of the framework to assess the mechanisms of action underlying certain ADRs.
Machine learning models have been developed to predict adverse drug reactions and improve drug safety. Though some prediction methods are effective, most machine learning models do not provide sufficient, if any, biological explanation for the prediction results, especially information relevant to target binding.
Adverse drug reactions (ADRs) are complicated and can vary from individual to individual. Identification of relevant targets can not only help to understand the mechanisms of ADRs, but also help to focus on potentially causative aspects, such as genetic mutations, thus helping with the improvement of precision medicine.
While computational methods have been developed to predict adverse drug reactions using a variety of features (e.g., chemical structures, binding assays and phenotypic information) and models (e.g., logistic regression, random forest and support vector machine), most of the studies focus on feature variety and model performance rather than on generating hypotheses that explain the underlying mechanisms.
A system, method and computer program product are provided for predicting possible ADRs for a new or candidate drug, requiring only the structural input of a drug molecule. Additionally, the relevant binding targets that may play a key role in causing such ADRs can be identified and highlighted.
According to one embodiment, there is provided a method to automatically predict an adverse drug reaction for a new drug or predict an undetected adverse drug reaction for a currently marketed drug.
The method comprises: receiving, at a processor, data regarding a molecular structure of a drug; computing for the drug, using the processor, a plurality of drug-target interaction features, each drug-target interaction feature being between the drug molecular structure and each of a plurality of unique, high-resolution target protein structures; running, at the processor, one or more classifier models associated with a corresponding one or more known adverse drug reactions (ADRs); predicting, using each classifier model, one or more ADRs based on the drug-target interaction features involving the drug and known drug-ADR relationships; and generating, by the processor, an output indicating the predicted one or more ADRs.
In a further embodiment, there is provided a system to automatically predict an adverse drug reaction for a drug. The system comprises: at least one memory storage device; and one or more hardware processors operatively connected to the at least one memory storage device, the one or more hardware processors configured to: receive data regarding a molecular structure of a drug; compute, for the drug, a plurality of drug-target interaction features, each drug-target interaction feature being between the drug molecular structure and each of a plurality of unique, high-resolution target protein structures; run one or more classifier models associated with a corresponding one or more known adverse drug reactions (ADRs); predict, using each classifier model, one or more ADRs based on the drug-target interaction features involving the drug and known drug-ADR relationships; and generate an output indicating the predicted one or more ADRs.
In a further aspect, there is provided a computer program product for performing operations. The computer program product includes a storage medium readable by a processing circuit and storing instructions run by the processing circuit for running a method. The method is the same as listed above.
A system, method and computer program product are provided for predicting adverse drug reactions (ADRs) from the structural input of a drug molecule. The systems and methods further generate hypotheses by highlighting the relevant binding targets that may play a key role in causing ADRs. More specifically, a system framework is provided for implementing methods for automatically generating interaction scores associated with the 3D structure of the drug and conforming such scores from a structural library.
In a further embodiment, for the drug molecules in the drug set 104, the computer system may access tools for generating associated 3D molecular structures based on an input chemical formula or drawing representing a 2-D molecule, e.g., using the “molconvert” command line via an interface generated by the program tool “MolConverter” available in Marvin Beans (e.g., ChemAxon Marvin Beans 6.0.1). In one embodiment, Marvin Beans is an application and API for chemical sketching and visualization that includes the MolConverter tool for converting files between various formats, e.g., 2-D and 3-D molecule file formats, graphics formats, etc.
Further, in one embodiment, for the 3-D drug molecules in the drug set 104, the system may first remove the drug molecules that do not have rotatable bonds (e.g., calcium acetate) or that are too large (having a molecular weight >1200, e.g., cisatracurium besylate), since such molecules may not generate meaningful docking scores, e.g., because they are too large to fit into protein pockets.
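By way of a non-limiting illustration, the pre-docking filter described above may be sketched as follows. The record type `DrugRecord`, its field names, and the example entries are hypothetical and are not part of the described system; the sketch assumes the molecular weight and rotatable-bond count have already been computed by a cheminformatics tool.

```python
from dataclasses import dataclass

MAX_MOLECULAR_WEIGHT = 1200.0  # size threshold stated in the text


@dataclass
class DrugRecord:
    # Hypothetical record; descriptors are assumed to be precomputed elsewhere.
    name: str
    molecular_weight: float
    rotatable_bonds: int


def is_dockable(drug: DrugRecord) -> bool:
    """Keep drugs that have at least one rotatable bond and are not too large."""
    return drug.rotatable_bonds > 0 and drug.molecular_weight <= MAX_MOLECULAR_WEIGHT


def filter_drug_set(drugs):
    """Return only the drugs that pass the pre-docking filter."""
    return [d for d in drugs if is_dockable(d)]


# Illustrative entries only; the values are made up for the example.
example = [
    DrugRecord("small_rigid", 158.2, 0),
    DrugRecord("typical", 521.4, 6),
    DrugRecord("very_large", 1243.5, 20),
]
kept = filter_drug_set(example)
```

In this sketch a molecule weighing exactly 1200 is retained, since the text excludes only molecular weights greater than 1200.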
As further shown in
In one embodiment, data representing unique human protein targets are extracted from the PDBBind database 112. The target proteins are selected from the PDBBind database 112 according to the following criteria: (1) High-quality: all the protein structures extracted are to have high resolutions on the order of 1.98±0.47 Å; (2) Targetable: the structures have experimental ligand binding data available; (3) Unique human proteins: the structures represent unique human proteins, i.e., for one protein, the one of the many available crystal structures that has the highest resolution is selected; and (4) Well-defined binding pockets: the structures have embedded ligands to define binding pockets.
After the selection and extraction of the drug molecules set 104 and the unique target proteins set 114, the method prepares structure files using an automated docking tool such as AutoDock Tools 1.5.6 (e.g., available at autodock.scripps.edu). In one embodiment, Gasteiger charges are added to both the drug and target structures using the preparation scripts of AutoDock Tools. As known, the AutoDock Tools are software programs configured to prepare files that are needed to predict how small molecules, such as substrates or drug candidates, bind to a receptor of a known 3D (e.g., target protein) structure. In one embodiment, the binding pockets of the proteins are centered at the original embedded ligands, with a fixed size of 25×25×25 Å³ to reduce pocket-based variation.
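As an illustrative sketch only, criterion (3) above, i.e., keeping one structure per protein with the highest resolution, may be expressed as follows. The tuple layout and the identifiers in the example are hypothetical. The sketch relies on the convention that a smaller value in Å denotes a higher resolution.

```python
def select_best_structures(structures):
    """Pick one crystal structure per protein.

    structures: iterable of (protein_id, structure_id, resolution_angstrom).
    Returns {protein_id: (structure_id, resolution_angstrom)}, keeping the
    entry with the smallest resolution value (i.e., the highest resolution).
    """
    best = {}
    for protein_id, structure_id, resolution in structures:
        current = best.get(protein_id)
        if current is None or resolution < current[1]:
            best[protein_id] = (structure_id, resolution)
    return best


# Hypothetical identifiers and values, for illustration only.
candidates = [
    ("protein_A", "struct_1", 2.4),
    ("protein_A", "struct_2", 1.8),
    ("protein_B", "struct_3", 2.0),
]
unique_targets = select_best_structures(candidates)
```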
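A minimal sketch of how a fixed-size search box centered on an embedded ligand might be derived is given below. It assumes, as one possible reading of "centered at the original embedded ligands", that the center is the geometric centroid of the ligand atom coordinates; the function names are hypothetical and do not correspond to any AutoDock Tools API.

```python
BOX_EDGE_ANGSTROM = 25.0  # fixed edge length stated in the text


def ligand_center(coords):
    """Geometric centroid of ligand atom coordinates [(x, y, z), ...]."""
    if not coords:
        raise ValueError("ligand has no atoms")
    n = len(coords)
    return tuple(sum(atom[axis] for atom in coords) / n for axis in range(3))


def docking_box(coords, edge=BOX_EDGE_ANGSTROM):
    """Return the center and the min/max corners of a cubic search box."""
    center = ligand_center(coords)
    half = edge / 2.0
    return {
        "center": center,
        "min": tuple(c - half for c in center),
        "max": tuple(c + half for c in center),
    }


# Two made-up atom positions, for illustration only.
box = docking_box([(0.0, 0.0, 0.0), (2.0, 4.0, 6.0)])
```

Using the same edge length for every target keeps the search volume constant, which is the stated purpose of reducing pocket-based variation.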
Continuing in the method 100 of
Based on the method steps of
Returning to
In one embodiment, based on the method steps of
In one embodiment, the method may first include a filtering step to filter out the ADRs that have fewer than a predetermined number of positive drugs, e.g., five positive drugs, since such ADRs have too few positive samples.
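The filtering step may be sketched, for illustration only, as follows. The dictionary layout (one list of 0/1 labels per ADR, one entry per drug) and the ADR names are assumptions made for the example.

```python
MIN_POSITIVE_DRUGS = 5  # example threshold given in the text


def filter_adrs(adr_labels, min_positive=MIN_POSITIVE_DRUGS):
    """Drop ADRs with too few positive drugs.

    adr_labels: {adr_name: [0 or 1 per drug]}, where 1 marks a known
    drug-ADR association.
    """
    return {
        adr: labels
        for adr, labels in adr_labels.items()
        if sum(labels) >= min_positive
    }


# Hypothetical label columns for eight drugs.
labels_by_adr = {
    "adr_common": [1, 1, 1, 0, 1, 1, 0, 0],  # five positives: kept
    "adr_rare": [1, 0, 0, 0, 1, 0, 0, 0],    # two positives: removed
}
trainable = filter_adrs(labels_by_adr)
```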
Returning to
In one embodiment, one logistic classifier model is generated for each ADR. In one embodiment, training an ADR model includes, for a specific ADR, obtaining one ADR column at a time, e.g., column 118 in
In one embodiment, for a specific ADR model, these inputs are received in one logistic regression function such as:
Given drug x, the molecular docking scores towards 600 proteins are a vector of (x1, x2, . . . , x600). The coefficients (b1, b2, . . . , b600) along with the value for constant α were obtained during the model training process. The methods include calculating f(x) as the predicted confidence score (range: 0% to 100%) that drug x may cause this specific ADR.
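Under the definitions above, a standard logistic regression function would take the following form. This is offered as an assumed form inferred from the stated symbols (docking scores x1, ..., x600, coefficients b1, ..., b600 and constant α), not as a reproduction of the original expression.

```latex
f(x) = \frac{1}{1 + e^{-\left(\alpha + \sum_{i=1}^{600} b_i x_i\right)}}
```

In this form f(x) lies between 0 and 1, which corresponds to the stated confidence range of 0% to 100%.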
In one embodiment, the sklearn package (the Scikit-learn machine learning library for the Python programming language), e.g., as provided in Anaconda® Python, may be used on the computer system to develop the logistic regression ADR model. In one embodiment, the coefficients are determined by minimizing a cost function (the aggregated difference between predictions and actual values). Use of L2 regularization may yield the coefficients with the best prediction performance.
In one embodiment, the coefficients calculated in a logistic regression ADR model built using the machine learning techniques are subject to relevant target analysis to understand the ADR mechanism.
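To illustrate what minimizing an L2-regularized cost function involves, a self-contained sketch in plain Python is given below. It is not the Scikit-learn implementation: it assumes a mean log-loss cost with an L2 penalty on the coefficients (the intercept is not penalized) and uses plain gradient descent, whereas library solvers differ. The function names, the penalty weight `lam`, and the toy data are hypothetical.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict(alpha, b, x):
    """Confidence that feature vector x is a positive sample."""
    return sigmoid(alpha + sum(bi * xi for bi, xi in zip(b, x)))


def cost(alpha, b, X, y, lam):
    """Mean log-loss plus an L2 penalty on the coefficients."""
    eps = 1e-12
    loss = 0.0
    for xi, yi in zip(X, y):
        p = min(max(predict(alpha, b, xi), eps), 1.0 - eps)
        loss -= yi * math.log(p) + (1 - yi) * math.log(1.0 - p)
    return loss / len(X) + 0.5 * lam * sum(bi * bi for bi in b)


def fit(X, y, lam=0.1, lr=0.1, steps=500):
    """Minimize the cost above by batch gradient descent."""
    n, d = len(X), len(X[0])
    alpha, b = 0.0, [0.0] * d
    for _ in range(steps):
        grad_a, grad_b = 0.0, [0.0] * d
        for xi, yi in zip(X, y):
            err = predict(alpha, b, xi) - yi
            grad_a += err
            for j in range(d):
                grad_b[j] += err * xi[j]
        alpha -= lr * grad_a / n
        b = [bj - lr * (gj / n + lam * bj) for bj, gj in zip(b, grad_b)]
    return alpha, b


# Toy one-feature data set, for illustration only.
X_toy = [[-1.0], [-0.5], [0.5], [1.0]]
y_toy = [1, 1, 0, 0]
alpha_hat, b_hat = fit(X_toy, y_toy)
```

The penalty term shrinks the coefficients toward zero, which limits overfitting when the number of features (here, docking scores) is large relative to the number of drugs.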
In one embodiment, to select the best parameters for a model, different combinations of regularization types (L1 and L2) and parameters (C=0.001, 0.01, 0.1, 1, 10, 100 and 1000) during 10-fold cross-validations may be explored and the best parameters may be selected based on a best area under the receiver operating characteristic curve (AUROC). To demonstrate the ADR prediction performance of molecular docking, seven different types of structural fingerprints were generated for the drugs in the training set for feature comparison. The seven structural fingerprints are E-state, Extended Connectivity Fingerprint (ECFP)-6, Functional-Class Fingerprints (FCFP)-6, FP4, Klekota-Roth method, MACCS and PubChem structural descriptors (called E-state, ECFP6, FCFP6, FP4, KR, MACCS and PubChem, respectively). After comparing the prediction performance of molecular docking against these structural fingerprints via 10-fold cross-validations on both AUROC and area under the precision-recall curve (AUPR) values, the final models 130 were developed based on molecular docking features with the optimal parameters.
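The selection procedure above depends on three pieces: a parameter grid, a 10-fold split, and the AUROC metric. A plain-Python sketch of these pieces is given below for illustration. The helper names are hypothetical, the fold assignment is a simple unshuffled, unstratified scheme chosen for brevity, and the model fitting step itself is omitted.

```python
# Parameter grid as enumerated in the text: two penalty types, seven C values.
PARAM_GRID = [
    (penalty, c)
    for penalty in ("l1", "l2")
    for c in (0.001, 0.01, 0.1, 1, 10, 100, 1000)
]


def k_fold_indices(n_samples, k=10):
    """Yield (train_idx, test_idx); sample i is assigned to fold i % k."""
    for fold in range(k):
        test = [i for i in range(n_samples) if i % k == fold]
        train = [i for i in range(n_samples) if i % k != fold]
        yield train, test


def auroc(labels, scores):
    """Probability that a random positive outscores a random negative.

    Ties count as 0.5. Quadratic in the number of samples, which is
    acceptable for a sketch.
    """
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


example_auroc = auroc([1, 1, 0, 0], [0.9, 0.4, 0.6, 0.1])
folds = list(k_fold_indices(20, k=10))
```

In a full search, each (penalty, C) pair would be trained on the training indices of every fold, scored on the held-out indices, and the pair with the best mean AUROC retained.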
It should be understood that different types of prediction models can be developed to predict ADRs. For example, while a separate model is built for each ADR as described, a single model that predicts all ADRs may be developed instead. This alternative approach requires harvesting features for the ADRs, such that each row in the training set represents a drug-ADR pair and contains both the drug features and the ADR features. The label for each such row is either positive (representing a known drug-ADR association) or negative (representing an unknown drug-ADR association).
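For the single-model alternative, the construction of the pairwise training set may be sketched as follows. The text does not specify what the ADR features are, so the feature vectors, names, and values below are purely hypothetical placeholders.

```python
def build_pair_rows(drug_features, adr_features, known_pairs):
    """Build one training row per drug-ADR pair.

    drug_features: {drug_name: [feature, ...]}
    adr_features:  {adr_name: [feature, ...]}  (hypothetical ADR descriptors)
    known_pairs:   set of (drug_name, adr_name) with a known association
    Returns a list of (drug_name, adr_name, combined_features, label).
    """
    rows = []
    for drug, d_feat in drug_features.items():
        for adr, a_feat in adr_features.items():
            label = 1 if (drug, adr) in known_pairs else 0
            rows.append((drug, adr, list(d_feat) + list(a_feat), label))
    return rows


# Made-up names and values, for illustration only.
rows = build_pair_rows(
    drug_features={"drug_1": [-7.2, -5.1], "drug_2": [-6.0, -8.3]},
    adr_features={"adr_a": [1.0], "adr_b": [0.0]},
    known_pairs={("drug_1", "adr_a")},
)
```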
As further shown in
Then, interaction results are used to predict ADRs via the machine learning models f(x). Additionally, feature analysis may be implemented to understand the underlying mechanisms of the ADRs.
Thus, as shown in
In one embodiment, the ADRs are ranked by confidence scores. For example, the top binding targets for Drug X may be used to study the mechanisms underlying the drug-ADR relationship. See, for example, a first case study Example 1 herein below.
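A minimal sketch of scoring one drug against several per-ADR models and ranking the ADRs by confidence is given below. It assumes each model is stored as a (constant, coefficient list) pair and applies the standard logistic function; the model names and values are hypothetical.

```python
import math


def confidence(alpha, coefficients, docking_scores):
    """Logistic confidence score in [0, 1] for one ADR model."""
    z = alpha + sum(b * x for b, x in zip(coefficients, docking_scores))
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def rank_adrs(models, docking_scores, top_n=None):
    """Return [(adr_name, confidence), ...] sorted from highest to lowest."""
    scored = [
        (adr, confidence(alpha, coefs, docking_scores))
        for adr, (alpha, coefs) in models.items()
    ]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored if top_n is None else scored[:top_n]


# Two made-up models over two docking features, for illustration only.
toy_models = {
    "adr_a": (0.0, [1.0, 0.0]),
    "adr_b": (0.0, [-1.0, 0.0]),
}
ranking = rank_adrs(toy_models, [2.0, 5.0])
```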
Alternatively, the top relevant targets for the ADRs may be identified via model-based feature/coefficient analysis to understand the mechanisms of the ADRs. See, for example, a second case study Example 2 herein below.
In
In an alternate embodiment, as shown in
Whether obtained in a first instance by selecting and inputting a known drug formula from a pre-existing list and obtaining a corresponding SMILES code representation as described at 402 in
In a first example case study, it was determined that the drug Mometasone induces dermatitis acneiform, an ADR. Thus, using the exemplary method 400 of
Then, as described at step 410 of
In the first illustrative example, as an output of running each ADR model against the interaction scores 530 for each input drug, there is generated a confidence score that the drug will provide a drug-protein interaction that is associated with the current ADR. As shown in the chart 600 of
As is known, dermatitis acneiform (Unified Medical Language System Concept ID: C0234708) refers to acne-like cutaneous eruptions. As shown in
To understand the potential mechanisms of this ADR, a target binding analysis for drug X and an ADR-specific feature analysis may be conducted. In one embodiment, the method accesses binding scores for the new drug against all target proteins. For this first case study example, processes are invoked for determining the top binding proteins for Mometasone and ranking them by their binding scores.
In one embodiment, to avoid this ADR interaction, a drug modification or a new drug may be developed to minimize or avoid the binding with the 3B0W protein. Alternatively, the existing drug structure may be re-designed or modified to minimize or avoid the binding with the 3B0W protein. Such modifications include those known in the art, including, without limitation, altering ligand length, size and/or shape, altering spatial configuration, polarity and hydrogen bonding aspects, e.g., adding a heteroatom (oxygen, nitrogen, etc.) or groups that affect hydrogen bonding to avoid interaction with a protein determined as the underlying cause of the ADR.
As mentioned above with respect to
In a second example case study, the computer system performs a model based feature analysis, i.e., a coefficient analysis, including analyzing the feature coefficients of the ADR model and ranking the target according to the coefficients to understand the mechanisms relevant to the ADR.
In the second example case study, there may be determined a drug that may induce cataract subcapsular—an ADR. Thus, in accordance with a further analysis step 133 of
As a result of the analysis, the methods determine the top protein features related to a subject ADR as weighted by the corresponding ADR model.
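The coefficient analysis may be sketched as follows. The text does not state whether targets are ranked by signed coefficient or by magnitude; this sketch assumes ranking by absolute value, and the target identifiers and coefficient values are hypothetical.

```python
def rank_targets_by_coefficient(target_ids, coefficients, top_n=10):
    """Rank protein targets by the magnitude of their model coefficient.

    Returns [(target_id, coefficient), ...], largest |coefficient| first.
    The signed coefficient is kept so its direction can still be inspected.
    """
    if len(target_ids) != len(coefficients):
        raise ValueError("one coefficient is required per target")
    pairs = list(zip(target_ids, coefficients))
    pairs.sort(key=lambda pair: abs(pair[1]), reverse=True)
    return pairs[:top_n]


# Made-up targets and coefficients, for illustration only.
top_targets = rank_targets_by_coefficient(
    ["target_1", "target_2", "target_3"], [0.2, -1.5, 0.7], top_n=2
)
```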
In the analysis shown in table 800 of
Thus, from this feature-based analysis, it is possible to find protein targets that are associated with ADRs, thus generating hypotheses that help to explore and understand the mechanisms of ADRs.
From the above case studies, the methods can not only predict ADRs for drug molecules, but also provide possible mechanism explanations via the binding targets. Since ADRs are complicated and differ from individual to individual, such explanations could potentially provide clues for toxicology researchers to generate hypotheses and help with the design of wet-lab experiments about ADR mechanisms, thus improving the safety evaluation of drugs. As the methods require only the structural information of the drug molecules to predict ADRs, it is feasible to use them in the early drug development stage, when other types of information about the drug candidates are limited.
Referring now to
Computing system 200 includes at least one processor 252, a memory 254, e.g., for storing an operating system and/or program instructions, a network interface 256, a display device 258, an input device 259, and any other features common to a computing device. In some aspects, computing system 200 may, for example, be any computing device that is configured to communicate with a database 230, web-site 225, or web- or cloud-based server 220 over a public or private communications network 99. Further, shown as part of system 200 is a further memory 260 for temporarily storing extracted Drug-Target interaction features and drug-ADR information, e.g., used for building the ADR model(s). For example, in one embodiment, further memory 260 may provide the structural library including a database of identified drugs and human protein targets and their interaction profiles calculated via molecular docking.
In one embodiment, as shown in
In
In one embodiment, the computer system 200 is a machine implementing multiple processors. Molecular docking is the most time-consuming process, since each new drug to be processed must be docked to 600 proteins; multiple central processing units, e.g., CPUs 252A, 252B, 252C, can speed this up by running the docking computations in parallel. For example, instead of docking the 600 proteins one by one, a 50-core machine can perform 50 dockings at a time. In one embodiment, computer system 200 may be a multi-core machine, whereby the more cores are available, the faster the computation. For ADR model development, multiple cores help to speed up the parameter testing. For example, if it is desired to test 10 sets of parameters, a 10-core machine can test them in one batch.
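The parallel scheme may be sketched with the Python standard library as shown below. The function `dock` is a hypothetical placeholder that returns a dummy value rather than performing docking, and the identifiers are made up. A thread pool is used here for simplicity; for CPU-bound Python code a process pool (`concurrent.futures.ProcessPoolExecutor`, which offers the same `map` interface) would be the more suitable choice.

```python
import math
from concurrent.futures import ThreadPoolExecutor


def dock(drug_id, protein_id):
    """Hypothetical placeholder for one docking run; returns a dummy score."""
    return float(len(drug_id) + len(protein_id))


def dock_against_all(drug_id, protein_ids, workers=50):
    """Dock one drug against every protein, up to `workers` runs at a time."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map preserves the order of protein_ids in its results.
        scores = list(pool.map(lambda p: dock(drug_id, p), protein_ids))
    return dict(zip(protein_ids, scores))


def rounds_needed(n_jobs, workers):
    """Number of sequential batches when `workers` jobs run concurrently."""
    return math.ceil(n_jobs / workers)


proteins = [f"protein_{i:03d}" for i in range(600)]
profile = dock_against_all("drug_x", proteins, workers=50)
```

With 600 targets and 50 workers, the docking completes in 12 rounds rather than 600 sequential runs, assuming each run takes a similar amount of time.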
Memory 254 may include, for example, non-transitory computer readable media in the form of volatile memory, such as random access memory (RAM) and/or cache memory or others. Memory 254 may include, for example, other removable/non-removable, volatile/non-volatile storage media. By way of non-limiting examples only, memory 254 may include a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or Flash memory), a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
Network interface 256 is configured to transmit and receive data or information to and from a database web-site server 220, e.g., via wired or wireless connections. For example, network interface 256 may utilize wireless technologies and communication protocols such as Bluetooth®, WiFi (e.g., 802.11a/b/g/n), cellular networks (e.g., CDMA, GSM, M2M, and 3G/4G/4G LTE), near-field communications systems, satellite communications, via a local area network (LAN), via a wide area network (WAN), or any other form of communication that allows computing device 200 to transmit information to or receive information from the server 220, e.g., to select particular Target protein structures data or specify small molecule drug structure data from respective databases.
Display device 258 may include, for example, a computer monitor, television, smart television, a display screen integrated into a personal computing device such as, for example, laptops, smart phones, smart watches, virtual reality headsets, smart wearable devices, or any other mechanism for displaying information to a user. In some aspects, display 258 may include a liquid crystal display (LCD), an e-paper/e-ink display, an organic LED (OLED) display, or other similar display technologies. In some aspects, display 258 may be touch-sensitive and may also function as an input device.
Input device 259 may include, for example, a keyboard, a mouse, a touch-sensitive display, a keypad, a microphone, or other similar input devices or any other input devices that may be used alone or together to provide a user with the capability to interact with the computing device 200.
In an early drug development stage, pharmaceutical companies can use this system framework 200 to predict potential ADRs for drug candidates and identify the relevant targets. They can therefore choose other candidates that are predicted to be safer or less likely to bind with the risky targets, thereby avoiding ADRs. Further, in a post-market stage, pharmaceutical companies can use this system framework 200 to identify the mechanisms of action underlying certain ADRs. By studying the relevant targets identified by the framework, they may find genetic mutations that alter the susceptibility to ADRs involving these targets. They can therefore advise patients with the specific genetic mutations to adjust the usage of the risky drugs (i.e., precision medicine).
In some embodiments, the computer system may be described in the general context of computer system executable instructions, embodied as program modules stored in memory 16, being executed by the computer system. Generally, program modules may include routines, programs, objects, components, logic, data structures, and so on that perform particular tasks and/or implement particular input data and/or data types in accordance with the present invention (see e.g.,
The components of the computer system may include, but are not limited to, one or more processors or processing units 12, a memory 16, and a bus 14 that operably couples various system components, including memory 16 to processor 12. In some embodiments, the processor 12 may execute one or more modules 10 that are loaded from memory 16, where the program module(s) embody software (program instructions) that cause the processor to perform one or more method embodiments of the present invention. In some embodiments, module 10 may be programmed into the integrated circuits of the processor 12, loaded from memory 16, storage device 18, network 24 and/or combinations thereof.
Bus 14 may represent one or more of any of several types of bus structures, including a memory bus or memory controller, a peripheral bus, an accelerated graphics port, and a processor or local bus using any of a variety of bus architectures. By way of example, and not limitation, such architectures include Industry Standard Architecture (ISA) bus, Micro Channel Architecture (MCA) bus, Enhanced ISA (EISA) bus, Video Electronics Standards Association (VESA) local bus, and Peripheral Component Interconnect (PCI) bus.
The computer system may include a variety of computer system readable media. Such media may be any available media that is accessible by computer system, and it may include both volatile and non-volatile media, removable and non-removable media.
Memory 16 (sometimes referred to as system memory) can include computer readable media in the form of volatile memory, such as random access memory (RAM), cache memory and/or other forms. Computer system may further include other removable/non-removable, volatile/non-volatile computer system storage media. By way of example only, storage system 18 can be provided for reading from and writing to a non-removable, non-volatile magnetic medium (e.g., a “hard drive”). Although not shown, a magnetic disk drive for reading from and writing to a removable, non-volatile magnetic disk (e.g., a “floppy disk”), and an optical disk drive for reading from or writing to a removable, non-volatile optical disk such as a CD-ROM, DVD-ROM or other optical media can be provided. In such instances, each can be connected to bus 14 by one or more data media interfaces.
The computer system may also communicate with one or more external devices 26 such as a keyboard, a pointing device, a display 28, etc.; one or more devices that enable a user to interact with the computer system; and/or any devices (e.g., network card, modem, etc.) that enable the computer system to communicate with one or more other computing devices. Such communication can occur via Input/Output (I/O) interfaces 20.
Still yet, the computer system can communicate with one or more networks 24 such as a local area network (LAN), a general wide area network (WAN), and/or a public network (e.g., the Internet) via network adapter 22. As depicted, network adapter 22 communicates with the other components of computer system via bus 14. It should be understood that although not shown, other hardware and/or software components could be used in conjunction with the computer system. Examples include, but are not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, and data archival storage systems, etc.
The present invention may be a system, a method, and/or a computer program product at any possible technical detail level of integration. The computer program product may include a computer readable storage medium (or media) having computer readable program instructions thereon for causing a processor to carry out aspects of the present invention.
The computer readable storage medium can be a tangible device that can retain and store instructions for use by an instruction execution device. The computer readable storage medium may be, for example, but is not limited to, an electronic storage device, a magnetic storage device, an optical storage device, an electromagnetic storage device, a semiconductor storage device, or any suitable combination of the foregoing. A non-exhaustive list of more specific examples of the computer readable storage medium includes the following: a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or Flash memory), a static random access memory (SRAM), a portable compact disc read-only memory (CD-ROM), a digital versatile disk (DVD), a memory stick, a floppy disk, a mechanically encoded device such as punch-cards or raised structures in a groove having instructions recorded thereon, and any suitable combination of the foregoing. A computer readable storage medium, as used herein, is not to be construed as being transitory signals per se, such as radio waves or other freely propagating electromagnetic waves, electromagnetic waves propagating through a waveguide or other transmission media (e.g., light pulses passing through a fiber-optic cable), or electrical signals transmitted through a wire.
Computer readable program instructions described herein can be downloaded to respective computing/processing devices from a computer readable storage medium or to an external computer or external storage device via a network, for example, the Internet, a local area network, a wide area network and/or a wireless network. The network may comprise copper transmission cables, optical transmission fibers, wireless transmission, routers, firewalls, switches, gateway computers and/or edge servers. A network adapter card or network interface in each computing/processing device receives computer readable program instructions from the network and forwards the computer readable program instructions for storage in a computer readable storage medium within the respective computing/processing device.
Computer readable program instructions for carrying out operations of the present invention may be assembler instructions, instruction-set-architecture (ISA) instructions, machine instructions, machine dependent instructions, microcode, firmware instructions, state-setting data, configuration data for integrated circuitry, or either source code or object code written in any combination of one or more programming languages, including an object oriented programming language such as Smalltalk, C++, or the like, and procedural programming languages, such as the “C” programming language or similar programming languages. The computer readable program instructions may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server. In the latter scenario, the remote computer may be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet Service Provider). In some embodiments, electronic circuitry including, for example, programmable logic circuitry, field-programmable gate arrays (FPGA), or programmable logic arrays (PLA) may execute the computer readable program instructions by utilizing state information of the computer readable program instructions to personalize the electronic circuitry, in order to perform aspects of the present invention.
Aspects of the present invention are described herein with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer readable program instructions.
These computer readable program instructions may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks. These computer readable program instructions may also be stored in a computer readable storage medium that can direct a computer, a programmable data processing apparatus, and/or other devices to function in a particular manner, such that the computer readable storage medium having instructions stored therein comprises an article of manufacture including instructions which implement aspects of the function/act specified in the flowchart and/or block diagram block or blocks.
The computer readable program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other device to cause a series of operational steps to be performed on the computer, other programmable apparatus or other device to produce a computer implemented process, such that the instructions which execute on the computer, other programmable apparatus, or other device implement the functions/acts specified in the flowchart and/or block diagram block or blocks.
The flowcharts and block diagrams in the Figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods, and computer program products according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of instructions, which comprises one or more executable instructions for implementing the specified logical function(s). In some alternative implementations, the functions noted in the blocks may occur out of the order noted in the Figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems that perform the specified functions or acts or carry out combinations of special purpose hardware and computer instructions.
The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. As used herein, the singular forms “a”, “an” and “the” are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will be further understood that the terms “comprises” and/or “comprising,” when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof. The corresponding structures, materials, acts, and equivalents of all elements in the claims below are intended to include any structure, material, or act for performing the function in combination with other claimed elements as specifically claimed. The description of the present invention has been presented for purposes of illustration and description, but is not intended to be exhaustive or limited to the invention in the form disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the invention. The embodiment was chosen and described in order to best explain the principles of the invention and the practical application, and to enable others of ordinary skill in the art to understand the invention for various embodiments with various modifications as are suited to the particular use contemplated.