The present disclosure generally relates to a device for predicting protein-protein interaction using protein complex surface information based on artificial intelligence and a method using the same. More specifically, some embodiments of the present disclosure may predict the structure of a protein complex from the output of an artificial intelligence model that takes a protein sequence as an input.
A new type of antigen composed of protein complexes may be derived from somatic mutations, and may be formed by the cancer cells and antigen-presenting cells of the subject. Immunogenicity refers to the activation of T cells recognizing a specific antigen. T cells may be activated by recognizing protein complexes in which major histocompatibility complex (MHC) proteins and peptide antigens are bound. For this mechanism to occur, T cell receptors and MHC-peptide protein complexes may be physically bound to each other.
Recently, the performance of various learning models for predicting unknown structures of protein complexes has improved. However, due to noise and high cost that may occur during the learning process, it is challenging to directly utilize these models for predicting the binding of protein complexes.
Therefore, a learning method that can more accurately predict the physical structure of a protein capable of binding through the inputs of the given protein sequence may be needed.
Some embodiments of the present disclosure may extract the structure of a protein complex that may be derived from a given random protein sequence to provide immunogenicity data corresponding to that protein sequence.
According to certain embodiments of the present disclosure, a prediction device may provide accurate data on atoms constituting proteins by predicting possible 3D structures of protein complexes through iterative learning.
The problems solved by the present disclosure are not limited to those mentioned above, and other problems not mentioned may be clearly understood by those skilled in the art from the description below.
A prediction device for predicting protein-protein interactions using protein complex surface information based on artificial intelligence according to an embodiment of the present disclosure may include: a memory; a communication unit; and a processor electrically connected to the memory and the communication unit, wherein the processor may be configured to: predict the structure of a protein complex based on a first model, extract surface information of the protein complex, and provide interaction prediction data for the protein complex and an external protein based on the extracted surface information.
A method for predicting protein-protein interaction using protein complex surface information based on artificial intelligence, performed by a processor of a computer device according to an embodiment of the present disclosure, may include: a step for predicting the structure of a protein complex based on a first model; a step for extracting surface information of the protein complex; and a step for providing interaction prediction data for the protein complex and an external protein based on the extracted surface information.
In addition, other methods and systems for implementing the present disclosure, and a computer-readable recording medium storing a computer program for executing the method, may further be provided.
Furthermore, a computer program stored on a medium that allows the method of implementing some embodiments of the present disclosure to be performed on a computer may further be provided.
According to certain embodiments of the present disclosure, a device for predicting protein interaction may extract the structure of a protein complex, which may be physically bound to a T cell receptor from a protein sequence, based on surface information of the protein complex.
In addition, various embodiments of the present disclosure may improve the performance of a prediction model for more precisely and efficiently searching for a binding site of a T cell receptor by using surface information of a protein complex as an input.
The effects of the present disclosure are not limited to those mentioned above, and other effects not mentioned may be clearly understood by those skilled in the art from the description below.
Throughout the present disclosure, the same reference numerals designate the same components. The present disclosure does not describe all elements of the embodiments, and general contents in the technical field of the present disclosure or duplicated content among the embodiments are omitted. The terms “unit, module, member, block” used in the specification may be implemented in software or hardware, and depending on the embodiments, multiple “units, modules, members, blocks” may be implemented as a single component, or a single “unit, module, member, block” may also include multiple components.
Throughout the specification, when a part is described as being “connected” to another part, it includes not only cases where they are directly connected but also cases where they are indirectly connected, and indirect connection includes being connected by a wireless communication network.
Furthermore, when a part is described as “comprising” a certain component, it does not mean that it excludes other components unless explicitly stated otherwise but means that it may further include other components.
Throughout the specification, when one member is described as being “on” another member, it includes not only cases where the members are in contact but also cases where another member exists between them.
Terms such as “first” and “second” are used to distinguish one component from another, and are not intended to limit the components by the aforementioned terms.
Singular expressions include plural expressions unless the context clearly indicates otherwise.
Identification codes used for each step are provided for convenience in description and do not specify the order of the steps, and each step may be carried out in a different order unless a specific order is explicitly described.
The operating principles and embodiments of the present disclosure will be described below with reference to the accompanying drawings.
The term “device according to the present disclosure” in the present disclosure encompasses various devices capable of performing operations and providing results to users. For example, the device according to the present disclosure may include a computer, server device, and portable terminal, or it may take any one of these forms.
Here, the computer may include, for example, a notebook, desktop, laptop, tablet personal computer (PC), or slate PC, equipped with a web browser.
The server device is a server that processes information by communicating with an external device, and may include an application server, a computing server, a database server, a file server, a game server, a mail server, a proxy server, and a web server.
The portable terminal is, for example, a wireless communication device ensuring portability and mobility, and may include all kinds of handheld-based wireless communication devices such as Personal Communication System (PCS), Global System for Mobile communications (GSM), Personal Digital Cellular (PDC), Personal Handyphone System (PHS), Personal Digital Assistant (PDA), International Mobile Telecommunication (IMT)-2000, Code Division Multiple Access (CDMA)-2000, W-Code Division Multiple Access (W-CDMA), Wireless Broadband Internet (WiBro) terminal, smart phone, and wearable devices such as watches, rings, bracelets, anklets, necklaces, glasses, contact lenses, or head-mounted devices (HMD).
A device 100 for predicting protein interaction may include one or more processors 110, memory 120, a communicator or communication unit 130, an input and output interface 140, and the like. However, these internal components included in the device 100 for predicting protein interaction are provided for illustration purposes only, and the device 100 is not limited thereto. Alternatively or additionally, the device 100 for predicting protein interaction according to an embodiment of the present disclosure may perform the functions of the processor 110 through a separate processing server or a cloud server instead of a processor.
Referring to
The processor 110 may control one or more of the components described above to implement various embodiments according to the present disclosure in the device 100 for predicting protein interaction, which are described in
The memory 120 according to an embodiment may store data performing or supporting various functions of the device for predicting protein interaction 100, programs for the operations of the processor 110, input and/or output data (e.g., images, videos, etc.), multiple application programs or applications running on the device for predicting protein interaction 100, and data or instructions for operating the device for predicting protein interaction 100. At least some of these application programs may be downloaded from an external server via wired or wireless communication.
The memory 120 may include at least one of the following types of storage medium: flash memory type, hard disk type, Solid State Disk (SSD) type, Silicon Disk Drive (SDD) type, multimedia card micro type, card type memory (e.g., SD or XD memory), random access memory (RAM), static random access memory (SRAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), programmable read-only memory (PROM), magnetic memory, magnetic disk, and optical disk. Additionally, the memory may be a database that is separate from the device for predicting protein interaction 100 but connected to it by wire or wirelessly.
The communicator or communication unit 130 according to an embodiment may include one or more components configured to communicate with external devices, including, for example, at least one of a broadcast receiving module or broadcast receiver, wired communication module or wired communicator, wireless communication module or wireless communicator, short-range communication module or short-range communicator, or location information module.
The wired communication module may include various wired communication modules such as a Local Area Network (LAN) module, a Wide Area Network (WAN) module, or a Value Added Network (VAN) module, as well as various cable communication modules such as Universal Serial Bus (USB), High Definition Multimedia Interface (HDMI), Digital Visual Interface (DVI), recommended standard 232 (RS-232), power line communication, or plain old telephone service (POTS).
The wireless communication module may include not only a WiFi module and a Wireless Broadband (WiBro) module, but also a wireless communication module supporting various wireless communication methods such as Global System for Mobile Communication (GSM), Code Division Multiple Access (CDMA), Wideband Code Division Multiple Access (WCDMA), Universal Mobile Telecommunications System (UMTS), Time Division Multiple Access (TDMA), Long Term Evolution (LTE), 4th Generation (4G), 5G, and 6G.
The short-range communication module is for short-range communication and may support short-range communication using at least one of the following technologies: Bluetooth™, Radio Frequency Identification (RFID), Infrared Data Association (IrDA), Ultra Wideband (UWB), ZigBee, Near Field Communication (NFC), Wireless-Fidelity (Wi-Fi), Wi-Fi Direct, or Wireless Universal Serial Bus (Wireless USB).
The input and/or output interface 140 according to an embodiment serves as a channel for connecting various types of external devices to the device for predicting protein interaction 100. The input and/or output interface 140 may include, for example, but not limited to, at least one of the following: a wired and/or wireless headset port, an external charger port, a wired and/or wireless data port, a memory card port, a port for connecting a device equipped with a Subscriber Identification Module (SIM), an audio Input/Output (I/O) port, a video Input/Output (I/O) port, or an earphone port. The device for predicting protein interaction 100 may perform control related to the external device connected to the input and/or output interface 140.
At least one component may be added or omitted in accordance with the performance of the components shown in
Meanwhile, each module or component shown in
In step S210, a processor (e.g., the processor 110 of
In step S220, the processor can extract surface information on a surface of the protein complex. The processor can extract information about the surface of the protein complex, by focusing on at least one surface point included in a region of the surface of the protein complex, to obtain immunogenicity data from the surface of the protein complex. In step S230, the processor can provide the immunogenicity data for the protein complex based on the surface information on the surface of the protein complex.
In step S310, a processor (e.g., the processor 110 of
In step S320, the processor may filter the surface points based on peptide atoms constituting the protein complex. The processor may extract the positions of the peptide atoms based on the 3D coordinates of the peptide atoms forming the protein complex. The processor may perform the filtering of one or more surface points present or located within a predetermined distance from the position of the extracted peptide atom. The group of the filtered surface points may be referred to as a first surface point group.
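As a non-limiting illustration of the filtering in step S320, the following Python sketch keeps only the surface points that lie within a cutoff distance of at least one peptide atom. The function name `filter_surface_points`, the `cutoff` parameter, and the toy coordinates are hypothetical and are not taken from the disclosure; Euclidean distance is assumed as the distance measure.

```python
import math

def filter_surface_points(surface_points, peptide_atoms, cutoff):
    """Return indices of surface points within `cutoff` of any peptide atom."""
    kept = []
    for idx, point in enumerate(surface_points):
        # math.dist gives the Euclidean distance between two coordinate tuples.
        if any(math.dist(point, atom) <= cutoff for atom in peptide_atoms):
            kept.append(idx)
    return kept

# Toy 3D coordinates (hypothetical values, arbitrary units).
surface_points = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (10.0, 0.0, 0.0)]
peptide_atoms = [(0.5, 0.0, 0.0)]

# Indices of the points that would form the "first surface point group".
first_surface_point_group = filter_surface_points(
    surface_points, peptide_atoms, cutoff=2.0
)
print(first_surface_point_group)
```

In this toy input, the third surface point is far from the only peptide atom and is therefore excluded from the group.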
In step S330, the processor can perform embedding for the filtered surface points. The processor can perform the embedding through a Convolutional Neural Network based on a geodesic distance for the first surface point group, an orientation filter for the first surface point group, and surface point features in the first surface point group.
According to some embodiments, the geodesic distance for the first surface point group may include a geodesic distance between any two different surface points in the first surface point group. The orientation filter can be computed based on the relative position and direction of any two different surface points within the first surface point group. The surface point feature may include, for example, but not limited to, six-dimensional chemical features and ten-dimensional geometric features.
In step S340, the processor can perform reduction of the surface points to a residue level based on the embedding results of step S330. In step S340, the processor can also predict immunogenicity (IMM) using residue-level features to provide immunogenicity data.
According to certain embodiments, the processor can reduce surface points to the residue level. At step S340, the processor can extract residue-level features using Equation 1.
According to some embodiments, the processor can reduce the number of surface points embedded at step S330 to the residue level and utilize the surface points reduced to the residue level to predict immunogenicity along with residue-level features. Here, neighboring surface points N(i) of an i-th residue-level feature Ri can be defined as surface points within a predetermined distance (e.g., r) from a nearest atom among atoms constituting Ri. The i-th residue-level feature Ri may include a mean value of the neighboring surface point features that are within the predetermined distance from the peptide atoms constituting the residue-level feature.
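As a non-limiting illustration of the residue-level reduction described above, the following Python sketch averages the embedded surface point features over the neighborhood N(i) of each residue, where N(i) contains the surface points within a distance r of the nearest atom of that residue. The function name, the data layout, and the zero-vector fallback for a residue with no neighboring surface point are assumptions made for this sketch only.

```python
import math

def residue_level_features(point_coords, point_feats, residue_atoms, r):
    """Average surface point features over each residue's neighborhood N(i)."""
    dim = len(point_feats[0])
    residue_feats = []
    for atoms in residue_atoms:
        # N(i): points whose distance to the nearest atom of residue i is <= r.
        neighbors = [
            feat
            for coord, feat in zip(point_coords, point_feats)
            if min(math.dist(coord, atom) for atom in atoms) <= r
        ]
        if neighbors:
            mean = [sum(column) / len(neighbors) for column in zip(*neighbors)]
        else:
            mean = [0.0] * dim  # assumption: zero vector when N(i) is empty
        residue_feats.append(mean)
    return residue_feats

# Toy inputs (hypothetical values): three surface points with 2-dim features.
point_coords = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (5.0, 0.0, 0.0)]
point_feats = [[1.0, 2.0], [3.0, 4.0], [10.0, 20.0]]
# Each residue is given as a list of its atom coordinates.
residue_atoms = [[(0.5, 0.0, 0.0)], [(5.0, 0.5, 0.0)], [(100.0, 0.0, 0.0)]]

features = residue_level_features(point_coords, point_feats, residue_atoms, r=1.0)
print(features)
```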
According to some embodiments, at
According to some embodiments, the processor may filter and determine only surface points 520 that are located or exist within a predetermined distance (e.g., r in
Referring to
In Equation 2, the processor can update the surface point embedding $f_i$ using quasi-geodesic convolution operations. Here, $d_{ij}$ represents the geodesic distance between surface points i and j, and $w(d_{ij}) = \exp(-d_{ij}^2 / 2\sigma^2)$ is a Gaussian window to which a radius of 4 is applied. $\mathrm{MLP}(p_{ij})$ is a learnable orientation filter used for performing convolution along the MHC-peptide complex surface, calculated based on the relative position and direction of surface points i and j through their position coordinates and normal vectors. $f_i$ is a 16-dimensional surface point-wise feature, which is initialized with 6-dimensional chemical features and 10-dimensional geometric features and then updated accordingly.
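As a non-limiting illustration of how the Gaussian window w(d_ij) and the orientation filter MLP(p_ij) may combine, the following Python sketch assumes an update of the form f_i ← Σ_j w(d_ij) · g(p_ij) · f_j. This form is an assumption made for illustration and is not a reproduction of Equation 2. The callable `orientation_filter` is a scalar stand-in for the learnable MLP, and the pairwise distances and relative positions are taken as precomputed inputs; all names and toy values are hypothetical.

```python
import math

def gaussian_window(d, sigma):
    """w(d) = exp(-d^2 / (2 * sigma^2))."""
    return math.exp(-d * d / (2.0 * sigma * sigma))

def quasi_geodesic_conv(features, distances, rel_positions, orientation_filter, sigma):
    """One assumed update step: f_i <- sum_j w(d_ij) * g(p_ij) * f_j."""
    n = len(features)
    dim = len(features[0])
    updated = []
    for i in range(n):
        out = [0.0] * dim
        for j in range(n):
            w = gaussian_window(distances[i][j], sigma)
            g = orientation_filter(rel_positions[i][j])  # stand-in for MLP(p_ij)
            for k in range(dim):
                out[k] += w * g * features[j][k]
        updated.append(out)
    return updated

# Toy inputs (hypothetical): two surface points with 2-dim features.
features = [[1.0, 0.0], [0.0, 1.0]]
distances = [[0.0, 1.0], [1.0, 0.0]]  # pairwise (approximate) geodesic distances
rel_positions = [[(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)],
                 [(-1.0, 0.0, 0.0), (0.0, 0.0, 0.0)]]

updated = quasi_geodesic_conv(
    features, distances, rel_positions,
    orientation_filter=lambda p: 1.0,  # constant filter, for illustration only
    sigma=1.0,
)
print(updated)
```

With a constant filter, each point's updated feature is simply a Gaussian-weighted sum of the features of all points, so nearer points contribute more than distant ones.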
According to certain embodiments, the processor can perform the convolution operation of
Referring to
A Protein Structure Prediction Algorithm 803 according to an embodiment may be a deep learning-based model configured or developed for predicting the 3D structure of a protein. The Protein Structure Prediction Algorithm 803 may predict the 3D structure of a protein from the one-dimensional amino acid sequence of the protein utilizing a technique called Markov Deep Learning. For example, the Protein Structure Prediction Algorithm 803 may include an algorithm that utilizes the ESMFold model. 3D coordinates of the HLA 804 may be 3D coordinates of the HLA predicted by the Protein Structure Prediction Algorithm 803 based on the HLA sequence 802.
A Docking Algorithm 805 according to an embodiment may be an algorithm that utilizes an operational model to predict structural changes of a protein alone or a protein-ligand complex. The Docking Algorithm 805 may predict the structural changes occurring in a ligand-bound protein structure and the effects of ligand binding. For example, the Docking Algorithm 805 may include an algorithm that utilizes a DiffDock model. The Docking Algorithm 805 may allow for structural prediction of protein complexes, protein-ligand interaction analysis, drug design, and the like.
3D coordinates of the HLA-peptide complex 806 according to an embodiment may be extracted via a second model based on the 3D coordinates of the HLA and the peptide sequence.
According to an embodiment, surface point embedding 807 may represent a feature of points for a surface of the 3D model in a vector form. The 3D model may be represented by a collection of numerous points, and the individual points may have coordinates on the 3D space. The surface point embedding 807 may perform the function of converting surface points into low-dimensional vectors to abstract and represent features of the 3D model. After performing the surface point embedding 807, the processor may reduce the feature of the 3D model to a residue level and perform self-attention 811 in combination with peptide feature 808 extracted from the peptide sequence 801.
The pre-learned ESM2 809 according to certain embodiments may be a previously trained second version of an evolutionary scale modeling (ESM) model. The ESM model may be a deep learning model that is utilized to understand and handle protein sequences. The ESM model may be used to predict the structure of protein complexes, inter-protein interactions, protein-ligand interactions, and the like.
Referring to
According to an embodiment, the self-attention 811 may be one of the key components of a transformer, which is a deep learning model used in the field of natural language processing (NLP). The self-attention 811 may be a method of calculating a correlation of each position with another position for every position in a given input sequence. As a result, individual words (e.g., tokens) of the input sequence may be represented by vectors, and the self-attention 811 may calculate how much individual word vectors are associated with other word vectors. Cross attention 812 is utilized to model the interaction with other input sequences.
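As a non-limiting illustration of the attention operations described above, the following Python sketch implements scaled dot-product attention and applies it first as self-attention over peptide features and then as cross-attention against HLA features. The learned query, key, and value projections of a transformer are omitted for brevity, and the choice of peptide features as queries and HLA features as keys and values is an assumption of this sketch; all names and toy values are hypothetical.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    """Scaled dot-product attention over lists of vectors."""
    scale = math.sqrt(len(keys[0]))
    outputs = []
    for q in queries:
        # Correlation of this position with every key position.
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in keys]
        weights = softmax(scores)
        # Weighted sum of the value vectors.
        outputs.append([
            sum(w * v[d] for w, v in zip(weights, values))
            for d in range(len(values[0]))
        ])
    return outputs

# Toy features (hypothetical): 2 peptide positions, 3 HLA positions, 2 dims.
peptide = [[1.0, 0.0], [0.0, 1.0]]
hla = [[1.0, 1.0], [0.5, -0.5], [0.0, 2.0]]

# Self-attention: queries, keys, and values all come from the same sequence.
self_attended = attention(peptide, peptide, peptide)
# Cross-attention: queries from one sequence, keys and values from another.
cross_attended = attention(self_attended, hla, hla)
print(self_attended)
print(cross_attended)
```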
An array feature according to an embodiment may refer to a characteristic expressed in a multidimensional arrangement form in data. A multilayer perceptron (MLP) may be a type of artificial neural network. The MLP is a deep learning model including multiple hidden layers, and an individual hidden layer 813 may have multiple neurons.
According to some embodiments, a processor (e.g., the processor 110 in
According to certain embodiments, a processor may predict a first protein complex based on data relating to at least one protein sequence. For example, the protein complex may comprise an MHC-peptide complex. That is, the processor may provide immunogenicity data based on learned information about whether the structure of the protein complex can be recognized by a T cell as a particular antigen and thereby activate the T cell.
According to some embodiments, a processor may extract 3D coordinates of HLA through a Protein Structure Prediction Algorithm based on a HLA sequence. The processor may insert the HLA sequence as an input value into the Protein Structure Prediction Algorithm, and predict a 3D coordinate (e.g., a 3D structure) of the HLA with respect to the input value. For example, the processor may extract the 3D coordinates of the HLA through a learning model based on the HLA sequence.
According to certain embodiments, a processor may extract 3D coordinates of a HLA-peptide complex through a model utilizing Docking Algorithm based on the 3D coordinates of the HLA and a peptide sequence. The processor may input the 3D coordinates of the HLA and the peptide sequence as input values into the model utilizing Docking Algorithm, and predict the 3D coordinates (e.g., 3D structures) of the HLA-peptide complex with respect to the input values. That is, the processor may extract the 3D coordinates of the HLA-peptide complex through a learning model based on the peptide sequence and the 3D coordinates of the HLA extracted from the HLA sequence. By these operations, according to some embodiments, the processor can perform embedding for the surface points.
According to certain embodiments, a processor may perform self-attention by summing peptide features extracted based on a peptide sequence and positional encoding. The processor may extract an HLA embedding through a pre-learned model based on the HLA sequence. Here, the pre-learned model may be, for example, but not limited to, ESM2. The processor may perform cross-attention using a result of summing the positional encoding and the HLA embedding together with a result of performing the self-attention. In this case, the processor may apply individual weights to the 3D coordinates of the alpha carbon of each individual residue in a positional encoding operation.
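As a non-limiting illustration of the positional encoding operation described above, the following Python sketch assumes that the individual weights form a 3-by-d matrix that projects the alpha-carbon coordinates of each residue into the feature dimension, and that the result is summed element-wise with the residue embedding. The function names, the weight layout, and the toy values are hypothetical and are not taken from the disclosure.

```python
def coordinate_positional_encoding(ca_coords, weights):
    """Project each residue's alpha-carbon (x, y, z) with a 3-by-d weight matrix."""
    return [
        [sum(c * w for c, w in zip(coord, column)) for column in zip(*weights)]
        for coord in ca_coords
    ]

def add_encoding(features, encoding):
    """Element-wise sum of per-residue features and their positional encoding."""
    return [
        [f + e for f, e in zip(feat, enc)]
        for feat, enc in zip(features, encoding)
    ]

# Toy inputs (hypothetical): one residue, feature dimension d = 2.
ca_coords = [(1.0, 2.0, 3.0)]
weights = [[1.0, 0.0],   # weights applied to x
           [0.0, 1.0],   # weights applied to y
           [1.0, 1.0]]   # weights applied to z
hla_embedding = [[0.5, 0.5]]

encoding = coordinate_positional_encoding(ca_coords, weights)
encoded_hla = add_encoding(hla_embedding, encoding)
print(encoding, encoded_hla)
```

In a trained model the weights would be learned parameters; fixed values are used here only so that the arithmetic can be followed by hand.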
Meanwhile, some embodiments of the present disclosure may be implemented in the form of a recording medium that stores instructions executable by a computer. The instructions may be stored in the form of program code and, when executed by a processor, may generate program modules to perform the operations of the disclosed embodiments. The recording medium may be implemented as a computer-readable recording medium.
The computer-readable recording medium includes any type of recording medium in which instructions that may be decoded by a computer are stored. Examples include a read only memory (ROM), a random access memory (RAM), a magnetic tape, a magnetic disk, a flash memory, an optical data storage device, and the like.
As described above, the disclosed embodiments are described with reference to the accompanying figures. A person of ordinary skill in the art may understand that the present disclosure may be implemented in a different form from the disclosed embodiments without changing the technical spirit or essential features of the present disclosure. The disclosed embodiments are illustrative and not restrictive.
| Number | Date | Country | Kind |
|---|---|---|---|
| 10-2023-0142144 | Oct 2023 | KR | national |