Recent years have seen significant developments in hardware and software platforms for managing and operating complex computer-implemented pipelines. For example, conventional systems often utilize a variety of computing devices to attempt to validate and/or perform various tasks within a complex operational pipeline, such as a compound exploration program. Such conventional systems, however, often utilize large computational data volumes, causing significant technical problems in validating compound program exploration tasks across computer devices and networks. Accordingly, conventional systems suffer from a number of technical deficiencies, particularly with regard to inaccuracy, inefficiency, and operational inflexibility of implementing computing devices.
Embodiments of the present disclosure provide benefits and/or solve one or more of the foregoing or other problems in the art with systems, non-transitory computer-readable media, and methods for utilizing processed biological representations and language machine learning models for initiating compound exploration programs. For example, the disclosed systems implement a language machine learning model to orchestrate a series of workflows to analyze genes and/or compounds for future exploration. In particular, in some embodiments, the disclosed systems identify predicted biological relationships for an anchor compound or an anchor gene from a processed biological representation (e.g., phenomic image embeddings or protein binding machine learning predictions). Moreover, in some embodiments, the disclosed systems generate digital text prompts that contain the anchor compound or the anchor gene with text rating instructions for the language machine learning model. Furthermore, in some embodiments, from the digital text prompts, the disclosed systems use the language machine learning model to generate corresponding rating metrics according to the text rating instructions. Moreover, in some embodiments the disclosed systems combine the rating metrics to generate a program rating for the anchor compound or the anchor gene for initiating one or more compound exploration programs.
Additional features and advantages of one or more embodiments of the present disclosure are outlined in the description which follows, and in part will be obvious from the description, or may be learned by the practice of such example embodiments.
The detailed description provides one or more embodiments with additional specificity and detail through the use of the accompanying drawings, as briefly described below.
This disclosure describes one or more embodiments of a compound exploration initiation system that utilizes processed biological representations and language machine learning models to generate rating metrics and program ratings for compound exploration programs across computer networks. For example, in some embodiments the compound discovery process involves multiple groups of computing devices individually querying and extracting information related to compound exploration programs. In one or more implementations, the compound exploration initiation system analyzes compound exploration programs at scale by utilizing a language machine learning model. For instance, in some embodiments, the compound exploration initiation system utilizes processed biological representations to identify various predicted biological relationships and feeds those relationships to a language machine learning model with text rating instructions. To illustrate, the compound exploration initiation system utilizes the language machine learning model and text prompts to generate rating metrics and program ratings for anchor genes and/or anchor compounds. Moreover, the compound exploration initiation system combines individual rating metrics returned from the language machine learning model to generate an overall program rating. In some embodiments, the compound exploration initiation system intelligently decides whether to initiate one or more compound exploration programs utilizing the program rating.
As mentioned above, in one or more embodiments, the compound exploration initiation system utilizes processed biological representations to identify predicted biological relationships. In one or more embodiments, the compound exploration initiation system generates these machine learning representations based on various digital signals. For example, in some implementations, the compound exploration initiation system generates processed biological representations from phenomic digital images representing cell perturbations (e.g., gene knockouts or applying compounds to cells). Further, in some embodiments, the processed biological representations further include compound protein pocket interaction predictions. In some such embodiments, the processed biological representation includes machine learning binding representations that indicate the relationships between compounds and proteins. Moreover, in some embodiments the compound exploration initiation system obtains the processed biological representations from additional digital signals (e.g., multiomic datasets such as proteomics, metabolomics, invivomics, and transcriptomics) that contain information relating to genetic features, compound features, and/or protein features.
As mentioned above, in one or more implementations the compound exploration initiation system generates multiple digital text prompts based on predicted biological relationships. For example, the compound exploration initiation system stores digital text prompt templates within a digital text prompt repository and populates placeholder fields within the digital text prompt templates based on predicted biological relationships (e.g., an anchor compound or anchor gene) identified from processed biological representations. Moreover, in some embodiments the digital text prompts further include text rating instructions for the anchor compound or the anchor gene. Additionally, in some embodiments the digital text prompts also include context generation instructions.
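By way of non-limiting illustration, the following sketch shows one way placeholder fields of stored digital text prompt templates could be populated from a predicted biological relationship. The template wording, the repository keys, the `build_prompts` helper, and the example gene and disease names are all hypothetical and are not taken from this disclosure.

```python
from string import Template

# Hypothetical digital text prompt repository. The template wording and the
# placeholder names ($anchor, $relationship, $disease) are assumptions.
PROMPT_TEMPLATES = {
    "gene_impact": Template(
        "Anchor gene: $anchor\n"
        "Predicted relationship: $relationship\n"
        "Rate the impact of this gene on $disease on a scale of 1 to 5. "
        "Answer with a single integer."
    ),
    "tractability": Template(
        "Anchor gene: $anchor\n"
        "Is this gene tractable as a target? Answer yes or no."
    ),
}


def build_prompts(anchor, relationship, disease):
    """Populate every stored template with fields from one predicted relationship."""
    fields = {"anchor": anchor, "relationship": relationship, "disease": disease}
    # Template.substitute ignores unused keys and raises KeyError if a
    # placeholder in the template has no matching field.
    return {name: tpl.substitute(fields) for name, tpl in PROMPT_TEMPLATES.items()}


prompts = build_prompts(
    anchor="GENE_A",
    relationship="knockout phenotype similar to GENE_B",
    disease="disease X",
)
```

Each resulting prompt carries both the anchor and its text rating instructions, so a separate prompt can be sent to the language machine learning model for each rating metric.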
As mentioned above, in some embodiments the compound exploration initiation system utilizes a language machine learning model to generate rating metrics according to text rating instructions of digital text prompts. For example, the compound exploration initiation system utilizes a large language model, transformer machine learning model, or other text-based machine learning architecture to process input digital text prompts having text rating instructions and generate rating metrics according to the text rating instructions. For example, the compound exploration initiation system can utilize a language machine learning model to generate gene impact rating metrics, previous analysis rating metrics, and/or tractability rating metrics from gene impact text prompts, previous analysis text prompts, and/or tractability text prompts, each having their own unique rating instructions (e.g., gene impact rating instructions, previous analysis rating instructions, and/or tractability rating instructions). In this manner, the compound exploration initiation system integrates a language machine learning model to generate precise and specific responses at scale (e.g., rating metrics on a full genomic scale). Further, in some embodiments, by utilizing the language machine learning model, the compound exploration initiation system is not restricted to a given dataset, but broadly peruses multiple datasets to generate rating metrics for a variety of different tasks in determining whether to initiate compound program exploration.
As mentioned above, in some embodiments the compound exploration initiation system dynamically combines rating metrics to generate a program rating. For example, the compound exploration initiation system utilizes the language machine learning model to generate rating metrics from digital text prompts, where the rating metrics include binary responses and/or scaled responses. For instance, in some embodiments the compound exploration initiation system takes the binary responses and/or scaled responses and combines them (e.g., via a combination algorithm) to generate a program rating. For instance, the compound exploration initiation system can utilize learned weights to combine individual rating metrics and generate an overall program rating for an anchor compound and/or anchor gene. Furthermore, in some embodiments, in generating the program rating, the compound exploration initiation system determines whether a subset of rating metrics satisfies a predetermined rating metric threshold.
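As a minimal sketch of one possible combination algorithm, the following example normalizes binary and scaled responses onto a common range, gates on a subset of rating metrics that must satisfy a threshold, and then computes a weighted average. The 1-to-5 scale, the weight values, the threshold of 0.5, and the function names are assumptions made for illustration only; the disclosure does not prescribe them.

```python
def normalize(metric):
    """Map a binary (bool) or scaled (assumed 1-5) response onto [0, 1]."""
    # bool is checked first because bool is a subclass of int in Python.
    if isinstance(metric, bool):
        return 1.0 if metric else 0.0
    return (metric - 1) / 4


def program_rating(metrics, weights, required=(), threshold=0.5):
    """Combine rating metrics into one program rating in [0, 1].

    `required` names the subset of metrics that must each meet `threshold`;
    if any of them falls short, the rating is 0.0 (an assumed convention).
    """
    scores = {name: normalize(value) for name, value in metrics.items()}
    if any(scores[name] < threshold for name in required):
        return 0.0
    total_weight = sum(weights[name] for name in scores)
    return sum(weights[name] * scores[name] for name in scores) / total_weight


# Hypothetical responses returned by the language machine learning model.
metrics = {"gene_impact": 5, "previous_analysis": 3, "tractability": True}
# Hypothetical weights; in practice these could be learned.
weights = {"gene_impact": 2.0, "previous_analysis": 1.0, "tractability": 1.0}

rating = program_rating(metrics, weights, required=("tractability",))
```

With these illustrative inputs the normalized scores are 1.0, 0.5, and 1.0, so the weighted average is (2.0 + 0.5 + 1.0) / 4.0 = 0.875.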
In addition, in one or more implementations, the compound exploration initiation system utilizes downstream analysis of compounds or genes to further learn and improve rating metrics and/or program ratings. For example, the compound exploration initiation system can monitor what compounds are selected for future hits or leads in compound discovery pipelines and then modify weights, combination algorithms, or other parameters to more accurately identify those biological relationships that will lead to successful compounds. Thus, the compound exploration initiation system can form a virtuous feedback loop to iteratively improve selected relationships to explore through additional programs.
As mentioned above, although conventional systems can validate and perform various tasks related to determining a biological relationship, such systems have a number of problems in relation to accuracy, efficiency, and flexibility of operation. For instance, conventional systems inaccurately explore some biological relationships due to the technical difficulties associated with examining and exploring disparate information stored across large data volumes. Indeed, conventional systems often fail to accurately filter through digital signals indicating millions of potential biological relationships to accurately focus on relationships that need additional computer-implemented analyses. To illustrate, conventional systems have access to various digital databases or other repositories having digital signals indicating potential relationships to explore. However, conventional systems cannot accurately extract and select the most promising relationships to explore from the potential millions (or billions) of pertinent combinations. Thus, conventional systems often inaccurately identify and select anchors/targets for initiating downstream compound analysis programs.
In addition to their inaccuracies, conventional systems are also inefficient. More specifically, conventional systems require an excessive number of interactions and graphical user interfaces to identify potential relationships from digital signals available across different computing devices or digital repositories. For instance, conventional systems require significant inputs and user interfaces to query, search, and review digital signals (e.g., digital articles, spreadsheets, or test results) and identify biological relationships for anchor compounds or anchor genes. The time, number of user interactions, and number of user interfaces required to search and review digital literature and datasets relating to potential biological relationships through conventional systems waste significant computing resources (e.g., memory and processing power). Moreover, these inefficiencies become more and more pronounced as the number of desired relationships and the size/number of pertinent information sources increase.
Furthermore, in addition to their inaccuracies and inefficiencies, conventional systems suffer from operational inflexibility. Indeed, conventional systems cannot analyze digital databases effectively to filter and select pertinent anchors/targets for initiating compound exploration programs. Rather, conventional systems rigidly rely on client device queries, including user interactions described above, to sort and analyze digital information to select compounds or genes to explore. Moreover, conventional systems lack the ability to scale to a large number (millions or billions) of potential biological relationships with regard to genes across the human genome, a litany of potential compounds, and various diseases/biological activities.
As suggested by the foregoing discussion, the compound exploration initiation system provides a variety of technical advantages relative to conventional systems. For example, by utilizing processed biological representations to generate digital text prompts for language machine learning models, the compound exploration initiation system accurately identifies and selects biological relationships for initiating compound exploration programs. For instance, in some embodiments the compound exploration initiation system identifies potential biological relationships from machine learning embeddings (e.g., processed biological representations such as image embeddings generated from cell perturbations or other machine learning predictions). Moreover, in some embodiments, the compound exploration initiation system further utilizes these potential biological relationships to generate dynamic text prompts for a language machine learning model. The language machine learning model thus determines rating metrics and corresponding program ratings to guide downstream computer-implemented processes. In this manner, the compound exploration initiation system can more accurately identify and select anchor genes and/or anchor compounds for initiating compound exploration programs.
Furthermore, in some embodiments, the compound exploration initiation system improves efficiency relative to conventional systems. Specifically, the compound exploration initiation system can identify a predicted biological relationship (e.g., from processed biological representations) and further generate digital text prompts to query the language machine learning model to return rating metrics. Thus, the compound exploration initiation system can significantly reduce interactions and interfaces required by conventional systems to search and review digital literature and datasets. Accordingly, the compound exploration initiation system can significantly reduce the time, number of user interactions, and number of user interfaces needed for comparing and analyzing potential biological relationships relative to conventional systems.
Moreover, by identifying predicted biological relationships, generating digital text prompts, and utilizing the language machine learning model, the compound exploration initiation system can improve operational flexibility relative to conventional systems. Specifically, by utilizing a language machine learning model to generate rating metrics, the compound exploration initiation system can intelligently analyze large repositories of data (e.g., with millions or billions of potential combinations of genes and/or compounds relative to biological activities/diseases) and validate potential biological relationships on a large scale for determining whether to initiate compound program exploration.
Additional detail regarding a compound exploration initiation system 102 will now be provided with reference to the figures. In particular,
As shown in
As shown in
For instance, the tech-bio exploration system 104 can generate and access experimental results corresponding to gene sequences, protein shapes/folding, protein/compound interactions, phenotypes resulting from various interventions or perturbations (e.g., gene knockout sequences or compound treatments), and/or invivo experimentation on various treatments in living animals. By analyzing these signals (e.g., utilizing various machine learning models), the tech-bio exploration system 104 can generate or determine a variety of predictions and inter-relationships for improving treatments/interventions.
To illustrate, the tech-bio exploration system 104 can generate maps of biology indicating biological inter-relationships or similarities between these various input signals to discover potential new treatments as part of the complex compound discovery process. For example, the tech-bio exploration system 104 can utilize machine learning and/or maps of biology to identify a similarity between a first gene associated with disease treatment and a second gene previously unassociated with the disease based on a similarity in resulting phenotypes from gene knockout experiments. The tech-bio exploration system 104 can then identify new treatments based on the gene similarity (e.g., by targeting compounds that impact the second gene). Similarly, the tech-bio exploration system 104 can analyze signals from a variety of sources (e.g., protein interactions or invivo experiments) to predict efficacious treatments based on various levels of biological data.
The tech-bio exploration system 104 can generate GUIs comprising dynamic user interface elements to convey tech-bio information and receive user input for intelligently exploring tech-bio information. Indeed, as mentioned above, the tech-bio exploration system 104 can generate GUIs displaying different maps of biology that intuitively and efficiently express complex interactions between different biological systems for identifying improved treatment solutions. Furthermore, the tech-bio exploration system 104 can also electronically communicate tech-bio information between various computing devices.
As shown in
As shown in
For example, as shown in
As also illustrated in
To illustrate, the administrator client device(s) 112 can include computing devices that implement, manage, or initiate a compound program exploration. For example, the administrator client device(s) 112 can receive data from the compound exploration initiation system 102 regarding an anchor gene or anchor compound and in response, the administrator client device(s) 112 can automatically generate additional machine learning representations, perform additional analysis, and/or initiate various compound exploration programs. In some embodiments, the administrator client device(s) 112 via the client application 114 (upon execution) cause the experimental device(s) 116 to perform various actions. Accordingly, a user can interact with the client application of the administrator client device(s) 112 to cause the experimental device(s) 116 to perform analyses, access results or perform other actions.
For example, a user of a user account can interact with the client application 114 on the administrator client device(s) 112 to execute experiments or other multi-faceted processes and to further access tech-bio information, initiate a request for validating gene/compound relationships, and/or access various data related to various processed biological representations.
As just mentioned, the environment includes the experimental device(s) 116. For example, the compound exploration initiation system 102 can utilize the experimental device(s) 116 to, for example, generate cell perturbations, apply compounds to specific gene anchors, and/or perform gene target knockouts. For example, the tech-bio exploration system 104 can interact with the experimental device(s) 116 that include intelligent robotic devices and camera devices for generating and capturing digital images of cellular phenotypes resulting from different perturbations (e.g., genetic knockouts or compound treatments of stem cells). Similarly, the experimental device(s) 116 can include camera devices and/or other sensors (e.g., heat or motion sensors) capturing real-time information from animals as part of invivo experimentation. The tech-bio exploration system 104 can also interact with a variety of other experimental device(s) such as devices for determining, generating, or extracting gene sequences or protein information.
For example, the experimental device(s) 116 may include computing devices linked to biosensors, electrophysiological platforms, x-ray crystallography machines, liquid chromatography mass spectrometry systems, nuclear magnetic resonance spectrometers, and/or mass spectrometers. In some implementations, the compound exploration initiation system 102 manages, schedules, executes, and tracks operation of the experimental device(s) 116 based on other events within the environment.
As further shown in
In addition, the environment can also include dedicated machine learning device(s) 118. For example, the dedicated machine learning device(s) 118 can include computing devices or virtual machines dedicated to training or implementing large-scale machine learning models. For example, the dedicated machine learning device(s) 118 can generate machine learning predictions and/or embeddings based on digital biological data (e.g., digital images of phenotypes resulting from different perturbations or compound-protein interactions from compound features). For instance, as shown, in some embodiments the dedicated machine learning device(s) 118 generate phenomic image embedding(s) 122, compound features 126 and protein features 128 (e.g., part of the machine learning binding representation(s) 124), and the multiomic representation(s) 130.
As mentioned above, in one or more implementations, the compound exploration initiation system 102 utilizes a language machine learning model to generate rating metrics and a program rating for initiating one or more compound exploration programs. For example,
As shown in
Moreover, the processed biological representation(s) further include Trekseq data, which includes RNA sequencing to determine the number of expressed proteins that map to a particular gene (e.g., knocking out a gene to see how much of a gene is expressed). Further, in some embodiments Trekseq involves analyzing the transcriptome of a cell, where the transcriptome includes messenger RNA, non-coding RNA, and other RNA molecules in a cell. Additional details regarding the processed biological representation(s) 200 are given below in the description of
Furthermore, as shown in
As mentioned, from the processed biological representation(s) 200, the compound exploration initiation system 102 identifies the predicted biological relationship 202 for the anchor compound or the anchor gene. For example, the predicted biological relationship 202 can include a predicted biological connection or affiliation (e.g., corresponding to a gene or compound). For instance, a predicted biological relationship includes a hypothesis regarding an affiliation between a gene and/or compound relative to a disease, treatment, or biological activity. For example, a predicted biological relationship can include a predicted impact on a disease (e.g., cancer) relative to an anchor gene or anchor compound. Similarly, a predicted biological relationship can include a compound having a particular biological activity (that impacts a particular gene or protein). A predicted biological relationship can thus include relationships between genes, between compounds, between compounds and genes, and/or between diseases and compounds/genes.
In one or more embodiments an anchor gene includes a specific gene targeted or identified as part of a predicted biological relationship. To illustrate, an anchor gene can include a gene identified for a predicted function or activity (e.g., a predicted effect on a particular disease or condition). For instance, the compound exploration initiation system 102 identifies an anchor gene as part of the process of compound program exploration. Further, in some instances, the compound exploration initiation system 102 utilizes the anchor gene to identify compounds that interact with the anchor gene. Specifically, in the compound program exploration process, the compound exploration initiation system 102 utilizes compounds (e.g., anchor compounds) to inhibit or enhance the expression or function of an anchor gene.
In one or more embodiments, the compound exploration initiation system 102 identifies a predicted biological relationship for an anchor gene from a processed biological representation(s) (e.g., phenomic image embeddings or machine learning binding representations). Further, in some embodiments, the compound exploration initiation system 102 identifies the anchor gene from multiomic processes such as genomics, clinical genomics (e.g., measured genetic information from clinical treatment of humans with one or more biological conditions or diseases), transcriptomics, proteomics, and/or invivomics.
In one or more embodiments, an anchor compound includes a molecule (or soluble factor) targeted or identified as part of a predicted biological relationship. To illustrate, an anchor compound can include a molecule identified for a predicted function or activity (e.g., predicted to treat a particular disease or condition). For instance, in some embodiments the anchor compound has the potential to interact with a biological substrate such as a protein, enzyme, receptor, or gene that is associated with the particular disease or condition.
Similar to anchor genes, in some embodiments the compound exploration initiation system 102 identifies anchor compounds based on the processed biological representations 200 (e.g., phenomic image embeddings or machine learning binding representations). Furthermore, in some embodiments the compound exploration initiation system 102 identifies the anchor compound from identifying the anchor gene. For instance, from identifying a gene of interest that has a high correlation with a particular disease or condition, the compound exploration initiation system 102 further identifies an anchor compound with a statistically significant relationship with the anchor gene.
Further, as shown in
As mentioned, in one or more embodiments, the compound exploration initiation system 102 generates rating metrics utilizing the language machine learning model 206. As used herein, the term machine learning model includes a computer algorithm or a collection of computer algorithms that can be trained and/or tuned based on inputs to approximate unknown functions. For example, a machine learning model can include a computer algorithm with branches, weights, or parameters that change based on training data to improve at a particular task. Thus, a machine learning model can utilize one or more learning techniques to improve in accuracy and/or effectiveness. Example machine learning models include various types of decision trees, support vector machines, Bayesian networks, random forest models, or neural networks (e.g., deep neural networks).
As used herein, a neural network includes a machine learning model of interconnected artificial neurons (e.g., organized in layers) that communicate and learn to approximate complex functions and generate outputs based on a plurality of inputs provided to the model. In some instances, a neural network includes an algorithm (or set of algorithms) that implements deep learning techniques that utilize a set of algorithms to model high-level abstractions in data. To illustrate, in some embodiments, a neural network includes a convolutional neural network, a recurrent neural network (e.g., a long short-term memory neural network), a transformer neural network, a generative adversarial neural network, a graph neural network, a diffusion neural network, or a multi-layer perceptron. In some embodiments, a neural network includes a combination of neural networks or neural network components.
As used herein, the term language machine learning model refers to a machine learning model that analyzes a language input (e.g., text or verbal input) to generate a predicted output. For instance, a language machine learning model includes a neural network that generates text based on an input text or query. The compound exploration initiation system 102 can utilize a variety of architectures for a language machine learning model, such as a large language model or other transformer neural network model. For instance, a large language model includes one or more neural networks capable of processing natural language text to generate outputs that range from predictive outputs, analyses, or combinations of data within stored content items. In particular, a large language model can include parameters trained (e.g., via deep learning) on large data volumes to learn patterns and rules of language for summarizing and/or generating digital content. Examples of large language models include BLOOM, Bard AI, ChatGPT (e.g., GPT-3, GPT-4, etc.), LaMDA, and/or DialoGPT. Moreover, in some embodiments a language transformer model includes bidirectional encoder representations from transformers (BERT), robustly optimized BERT approach (RoBERTa), and other text transformer models.
Moreover, as shown in
Furthermore, as shown, the compound exploration initiation system 102 combines the rating metrics 208 to generate the program rating 210. For instance, the compound exploration initiation system 102 receives the rating metrics 208 from the language machine learning model 206 and utilizes various weights of a computer-implemented model to combine the rating metrics 208 to determine the program rating 210. For example, in some embodiments the program rating 210 includes an average of the rating metrics 208 (e.g., a weighted average), a sum of the rating metrics 208, or a binary response. Specifically, the program rating 210 indicates whether to initiate one or more compound exploration programs. Additional details regarding the rating metrics 208 and the program rating 210 are given below in the description of
As mentioned above, the compound exploration initiation system 102 generates processed biological representations that include phenomic image embeddings. For example,
As shown in
Thus, a gene perturbation can include gene-knockout perturbations (performed through a gene knockout experiment). For instance, a gene perturbation includes a gene-knockout in which a gene (or set of genes) is inactivated or suppressed in the cell (e.g., by CRISPR-Cas9 editing).
Moreover, a compound perturbation can include a cell perturbation using a molecule and/or soluble factor. For instance, a compound perturbation can include reagent profiling such as applying a small molecule to a cell and/or adding soluble factors to the cell environment. Additionally, a compound perturbation can include a cell perturbation utilizing the compound or soluble factor at a specified concentration. Indeed, compound perturbations performed with differing concentrations of the same molecule/soluble factor can constitute separate compound perturbations. A soluble factor perturbation is a compound perturbation that includes modifying the extracellular environment of a cell to include or exclude one or more soluble factors. Additionally, soluble factor perturbations can include exposing cells to soluble factors for a specified duration wherein perturbations using the same soluble factors for differing durations can constitute separate compound perturbations.
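For illustration only, the rule that differing concentrations or exposure durations of the same molecule or soluble factor constitute separate compound perturbations can be sketched as a record whose identity includes those fields. The class name, field names, and the micromolar and hour units below are assumptions rather than part of the disclosure.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class CompoundPerturbation:
    """Hypothetical record; equality and hashing cover every field."""
    agent: str                              # molecule or soluble factor name
    concentration_um: float                 # micromolar (assumed unit)
    duration_hours: Optional[float] = None  # exposure time, if specified


a = CompoundPerturbation("compound_X", 1.0)
b = CompoundPerturbation("compound_X", 10.0)        # same agent, different dose
c = CompoundPerturbation("factor_Y", 1.0, 24.0)
d = CompoundPerturbation("factor_Y", 1.0, 48.0)     # same factor, longer exposure

# Duplicates collapse; perturbations that differ in dose or duration do not.
distinct = {a, b, c, d, CompoundPerturbation("compound_X", 1.0)}
```

Because the record is frozen, it is hashable and can serve directly as a key when grouping phenomic digital images by perturbation.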
Thus, for example, the compound exploration initiation system 102 performs the cell perturbations 300 by performing gene perturbations, compound perturbations or other perturbation processes. To illustrate, cell perturbations can include thawing cells, plating them, transfection (for CRISPR-treated wells), adding compounds or soluble factors, fixation, staining, and (ultimately) imaging. To illustrate,
Furthermore,
For example, upon capturing phenomic digital images, the compound exploration initiation system 102 utilizes a deep image embedding model to generate phenomic image embeddings. For instance, a deep image embedding model includes a neural network (e.g., a convolutional neural network) or other embedding model that generates a vector representation of an input digital image.
In some implementations, the compound exploration initiation system 102 trains the deep image embedding model through supervised learning (e.g., to predict perturbations from digital images). For instance, the compound exploration initiation system 102 trains the deep image embedding model to generate predicted perturbations from phenomic digital images. To do so, the compound exploration initiation system 102 utilizes neural network layers to generate vector representations of the phenomic digital images at different levels of abstraction and then utilizes output layers to generate predicted perturbations. The compound exploration initiation system 102 then trains the deep image embedding model by comparing the predicted perturbations with ground truth perturbations. Although the foregoing example describes a particular training approach and embedding model, the compound exploration initiation system 102 can utilize a variety of image embedding models.
With regard to
Thus, utilizing the convolutional neural network, the compound exploration initiation system 102 can embed each image into a low dimensional feature space (e.g., as a phenomic image embedding). Accordingly, a phenomic image embedding refers to a numerical feature representation (e.g., a feature vector) of a phenomic digital image generated by a machine learning model. Indeed, the compound exploration initiation system 102 can generate a multi-dimensional representation of each image within the low dimensional feature space. These multi-dimensional representations thus represent the features of different underlying perturbations (e.g., genes and compounds) as reflected in phenomic digital images utilized to generate the embeddings.
As mentioned above, in one or more embodiments, the compound exploration initiation system 102 utilizes the phenomic image embeddings 306 as the processed biological representations. The compound exploration initiation system 102 utilizes the phenomic image embeddings 306 to determine predicted biological relationship 314. For example, the compound exploration initiation system 102 compares the phenomic image embeddings 306 in an embedding feature space to determine the predicted biological relationship 314.
As illustrated, in some embodiments the compound exploration initiation system 102 stores the phenomic image embeddings 306 in a processed biological representation database. Specifically, as shown in
Furthermore, as shown, in one or more implementations the compound exploration initiation system 102 utilizes a statistical model 310 to compare embeddings of the phenomic image embedding database 308 to identify potential biological relationships. Specifically, the statistical model 310 determines a measure of similarity 312 between phenomic image embeddings 306. For instance, the measure of similarity 312 can include a cosine similarity between the phenomic image embeddings 306. The measure of similarity 312 can also include a distance measure (e.g., Euclidean distance) within an embedding feature space.
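To illustrate the two measures named above, the following is a minimal sketch, not the disclosed implementation. The function names and the three-dimensional example vectors are hypothetical and chosen only for illustration; actual phenomic image embeddings would have many more dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length embedding vectors."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def euclidean_distance(a, b):
    """Straight-line distance between two equal-length embedding vectors."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Hypothetical embeddings for one gene perturbation and one compound perturbation.
gene_embedding = [1.0, 0.0, 0.0]
compound_embedding = [1.0, 1.0, 0.0]

similarity = cosine_similarity(gene_embedding, compound_embedding)
distance = euclidean_distance(gene_embedding, compound_embedding)
```

Cosine similarity ignores vector magnitude and compares direction only, whereas Euclidean distance is sensitive to magnitude, so the two measures can rank the same pair of embeddings differently.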
The compound exploration initiation system 102 can utilize a threshold similarity to compare the phenomic image embeddings 306 (e.g., those embeddings satisfying the threshold similarity are identified for the predicted biological relationship 314). Further, in some embodiments the compound exploration initiation system 102 utilizes a significance threshold (e.g., a statistical threshold). To illustrate, the compound exploration initiation system 102 can create a statistical distribution of cosine similarity scores for the phenomic image embeddings 306 (e.g., for a specific gene or compound). The compound exploration initiation system 102 can then utilize a significance threshold (e.g., a p-value of 0.01) to determine the predicted biological relationship 314.
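One way to apply a similarity threshold together with a significance threshold can be sketched as follows. The description above specifies only a distribution of cosine similarity scores and an example p-value of 0.01; the one-sided empirical p-value with add-one smoothing, the function names, and the synthetic reference scores below are assumptions made for illustration.

```python
def empirical_p_value(observed, reference_scores):
    """Share of reference scores at least as large as the observed score.

    Add-one smoothing keeps the estimate above zero. This estimator is an
    assumption of this sketch, not a detail taken from the description.
    """
    if not reference_scores:
        raise ValueError("reference_scores must not be empty")
    at_least = sum(1 for s in reference_scores if s >= observed)
    return (at_least + 1) / (len(reference_scores) + 1)


def is_predicted_relationship(observed, reference_scores,
                              similarity_threshold=None, alpha=0.01):
    """Apply an optional raw similarity threshold, then a significance threshold."""
    if similarity_threshold is not None and observed < similarity_threshold:
        return False
    return empirical_p_value(observed, reference_scores) <= alpha


# Synthetic reference distribution of cosine similarities for one gene
# (1000 evenly spaced values from -0.5 up to 0.499), for illustration only.
reference = [i / 1000 for i in range(-500, 500)]

strong = is_predicted_relationship(0.9, reference)   # exceeds every reference score
weak = is_predicted_relationship(0.3, reference)     # well inside the distribution
```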
As shown in
In some implementations, the compound exploration initiation system 102 determines the predicted biological relationship 314 based on user interaction with a graphical user interface element illustrating similarity measures between genes and/or compounds. For example, the compound exploration initiation system 102 can provide a user interface that includes a table or heatmap of similarity measures between genes and compounds. In response to selection of a particular field of the table or heatmap (e.g., a field showing a similarity measure), the compound exploration initiation system 102 can select the predicted biological relationship 314.
Although
As shown in
As shown in
The compound exploration initiation system 102 can utilize the machine-learning binding representation 406 to determine a predicted biological relationship 412. For instance, the compound exploration initiation system 102 can utilize the machine-learning binding representation 406 to identify proteins (and/or related genes) that will be impacted by a particular molecule. For example, consider a particular gene known to have a particular function (e.g., cancer impact). The compound exploration initiation system 102 can also identify a particular protein resulting from the particular gene (e.g., the protein produced in a cell by expressing the gene). The compound exploration initiation system 102 can utilize the compound protein-pocket interaction machine-learning model 400 to generate a machine-learning binding representation 406 for a compound relative to the particular protein (e.g., a prediction of whether the compound will bind to a protein pocket of the particular protein). The compound exploration initiation system 102 can then generate the predicted biological relationship 412 based on the machine-learning binding representation 406. For example, if the machine-learning binding representation 406 indicates that the compound will bind to the particular protein, the compound exploration initiation system 102 can generate the predicted biological relationship 412; namely, that the compound may impact the function (e.g., cancer) correlated to the particular gene that results in the particular protein.
As shown in
Although not shown in
Although the above description has a specific number of steps and order of steps, in one or more embodiments the compound exploration initiation system 102 omits, reorders, or adds one or more steps to identify a predicted biological relationship for an anchor gene or anchor compound.
In one or more embodiments, the compound protein-pocket interaction machine-learning model 400 can include a variety of machine learning model architectures. In some implementations, the compound protein-pocket interaction machine-learning model 400 includes supervised discriminative classification or regression models such as a random forest, support vector machine, single layer perceptron, or multiple layer artificial neural network. In some embodiments, the compound protein-pocket interaction machine-learning model 400 takes the form of a fully-connected neural network with a feature input layer, hidden layers, and output nodes corresponding to interacting and non-interacting pairs. In some embodiments, an artificial neural network with multiple hidden layers omits connections between input types to create separate latent spaces representing ligand fingerprints, global protein features, local protein features, and protein functional features.
Further, in some embodiments the compound exploration initiation system 102 trains the compound protein-pocket interaction machine-learning model 400 by identifying a plurality of ghost ligands/compounds (and confidence scores) relative to particular proteins. In particular, the compound exploration initiation system 102 generates synthetic data by determining ghost compounds similar to selected compounds and proteins based on the confidence scores. The compound exploration initiation system 102 trains the compound protein-pocket interaction machine-learning model 400 based on features corresponding to known and synthetic compounds and proteins. For example, in one or more implementations, the compound exploration initiation system 102 trains and utilizes a compound protein interaction machine-learning model as described in METHOD AND SYSTEM FOR PREDICTING DRUG BINDING USING SYNTHETIC DATA, application Ser. No. 17/420,582, filed Jan. 2, 2020, which is incorporated by reference herein in its entirety.
Although
For example, in some embodiments genomics includes representations based on genes and inter-gene interactions, as well as a representation of the identification and characterization of the genetic makeup of a specific organism. Moreover, in some embodiments the compound exploration initiation system 102 utilizes a variety of bioinformatic tools to extract genes for a genome. Further, in some embodiments clinical genomics includes representations based on an intersection between biological data and human health. For instance, clinical genomics includes determining genetics of a human organism (e.g., DNA) and/or RNA, mRNA, metabolites, proteins, and/or health records associated with a certain condition/disease.
To illustrate, invivomics includes representations, machine learning embeddings or predictions from in vivo data (e.g., experiments conducted in a living organism). Further, to illustrate, invivomic machine learning models can generate machine learning liability predictions or embeddings based on digital video and/or other digital signals from sensors of intelligent cages holding animals. Further, transcriptomics includes representations, machine learning predictions, or embeddings based on transcription mechanisms. Similarly, proteomics includes representations, machine learning predictions or embeddings that indicate protein information in a biological system and metabolomics includes representations, machine learning predictions, or embeddings that indicate metabolites in biological systems. The compound exploration initiation system 102 can utilize one or a combination of these various representations to generate digital text prompts, program ratings, and initiate compound exploration programs.
As mentioned above, the compound exploration initiation system 102 generates digital text prompts from a predicted biological relationship. For example,
As shown in
As shown in
For instance, in some embodiments the compound exploration initiation system 102 generates the digital text prompt template repository 504 by receiving a plurality of digital text prompt templates from an administrator computing device. Further, in some embodiments the compound exploration initiation system 102 receives indications from the administrator computing device registering each of the plurality of digital text prompt templates with a specific type or classification.
Moreover, in one or more embodiments, the compound exploration initiation system 102 generates the digital text prompt template repository 504 by utilizing a generative model to generate a plurality of digital text prompt templates. In particular, the compound exploration initiation system 102 provides a request to generate specific digital text prompt types or classifications (e.g., to a trained prompt generation machine learning model). Further, in some embodiments the compound exploration initiation system 102 receives the plurality of digital text prompt templates back from the generative model and stores the plurality of digital text prompt templates in the digital text prompt template repository 504 tagged with a specific type or classification.
In one or more embodiments, the compound exploration initiation system 102 utilizes the digital text prompt template repository 504 to store multiple digital text prompt templates. For instance, the digital text prompt templates include pre-defined prompts to query a language machine learning model for biological information relating to a specific compound or gene. Moreover, in some embodiments the digital text prompt templates include pre-defined prompts with placeholder fields for inserting or populating the placeholder fields with an anchor gene or anchor compound.
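A repository of typed templates with placeholder fields can be sketched as follows. This is a minimal illustration under stated assumptions: the `PromptTemplate` class, the `templates_of_type` helper, the template wording, and the identifier `EXAMPLE_GENE` are all hypothetical and are not taken from the digital text prompt template repository 504.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PromptTemplate:
    """A pre-defined prompt with placeholder fields and a registered type."""
    template_type: str  # e.g., "tractability", "gene_impact", "previous_analysis"
    text: str           # prompt text containing placeholders such as {gene}

    def fill(self, **fields):
        """Populate the placeholder fields to produce a digital text prompt."""
        return self.text.format(**fields)


# Hypothetical repository contents, for illustration only.
repository = [
    PromptTemplate(
        "tractability",
        "Rate from 0-5 how tractable {gene} is as a drug target. "
        "Format the response as: {gene}; score.",
    ),
    PromptTemplate(
        "gene_impact",
        "Rate from 0-5 the human relevance of {gene} to the named disease.",
    ),
]


def templates_of_type(templates, template_type):
    """Return every template registered under the given type."""
    return [t for t in templates if t.template_type == template_type]


template = templates_of_type(repository, "tractability")[0]
prompt = template.fill(gene="EXAMPLE_GENE")
```

Keeping the type alongside the text lets the same repository serve selection by type and population of the placeholder fields in a single pass.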
As shown in
Moreover, in one or more embodiments, the compound exploration initiation system 102 utilizes an intelligent model to select one or more digital text prompt templates from the digital text prompt template repository 504. For instance, in some such embodiments the compound exploration initiation system 102 utilizes a mapping machine learning model trained on various predicted biological relationships and digital text prompt templates (e.g., inputs to the model) to generate an output of a digital text prompt template (or a digital text prompt) most similar to a predicted biological relationship. Accordingly, in some cases the compound exploration initiation system 102 implements the mapping machine learning model to select the digital text prompts 506, 508, and 510.
To illustrate, in some embodiments the compound exploration initiation system 102 feeds as input the predicted biological relationships 500 into an encoder of the mapping machine learning model. In some such instances, the compound exploration initiation system 102 generates an embedding of the predicted biological relationships 500 and compares the embedding in a latent vector space to identify a similar embedding for digital text prompt templates. Based on the embedding comparison, the compound exploration initiation system 102 can generate an output of digital text prompt templates that satisfy a threshold with the predicted biological relationships 500.
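The embedding comparison described above can be sketched as follows. The encoder itself is outside the scope of this sketch, so fixed two-dimensional vectors stand in for its output; the use of cosine similarity as the comparison, the threshold value, and all names are assumptions for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def select_templates(relationship_embedding, template_embeddings, threshold):
    """Return template names whose similarity meets the threshold, best first."""
    scored = [
        (cosine_similarity(relationship_embedding, embedding), name)
        for name, embedding in template_embeddings.items()
    ]
    return [name for score, name in sorted(scored, reverse=True)
            if score >= threshold]


# Hypothetical encoder outputs in a shared latent space.
relationship_embedding = [0.9, 0.1]
template_embeddings = {
    "gene_impact": [1.0, 0.0],
    "tractability": [0.7, 0.7],
    "previous_analysis": [0.0, 1.0],
}

selected = select_templates(relationship_embedding, template_embeddings, 0.75)
```

With these example vectors, the gene impact and tractability templates meet the 0.75 threshold and the previous analysis template does not.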
Further, in one or more embodiments, the digital text prompt template repository 504 includes a plurality of different digital text prompt template types or classifications. For example, digital text prompt template types or classifications include previous analysis digital text prompt templates, gene impact digital text prompt templates, and tractability digital text prompt templates. Thus, in some embodiments the compound exploration initiation system 102 identifies the predicted biological relationships 500 for the anchor gene or anchor compound 502 and selects digital text prompt templates based on the above-mentioned types or classifications. Additional specific examples of each of these digital text prompt template types or classifications are given below in the description of
As shown in
Moreover, as shown in
Further, as shown in
Although
Further, although
Moreover, although
As just described in relation to
As discussed above, the rating metrics 608-612 can include a variety of different values or metrics corresponding to different text rating instructions. Thus, for example, the language machine learning model 606 generates the rating metric 608 according to a first text rating instruction (e.g., a scale from 1 to 10), resulting in a rating metric of 3.2. Similarly, the language machine learning model 606 generates the rating metric 610 according to a second text rating instruction (e.g., a score of 1, 2, 3, or 4), resulting in a rating metric of 4. In addition, the language machine learning model 606 generates the rating metric 612 according to a third text rating instruction (e.g., binary yes/no), resulting in a rating metric of “no.”
Further, in some embodiments the compound exploration initiation system 102 combines the rating metrics 608-612 to generate a program rating 614. For instance, the compound exploration initiation system 102 can assign a binary rating metric a score (e.g., “yes”=5 and “no”=0) and add the scored binary rating metric to the other rating metrics. In some such instances, the compound exploration initiation system 102 generates the program rating 614 by adding the rating metrics 608-612. In other instances, the compound exploration initiation system 102 averages the rating metrics 608-612.
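The combination just described can be sketched with the example values given above (3.2, 4, and “no”). The function names are hypothetical. Note that the sketch, like the example, adds metrics that were generated on different scales without rescaling them; whether to normalize first is a design choice the description leaves open.

```python
# Score assigned to a binary rating metric, following the example above.
BINARY_SCORES = {"yes": 5.0, "no": 0.0}


def to_score(metric):
    """Convert a rating metric (a number or a yes/no string) to a float."""
    if isinstance(metric, str):
        return BINARY_SCORES[metric.strip().lower()]
    return float(metric)


def combine_rating_metrics(metrics, how="sum"):
    """Combine rating metrics into a program rating by summing or averaging."""
    if how not in ("sum", "average"):
        raise ValueError("how must be 'sum' or 'average'")
    scores = [to_score(m) for m in metrics]
    if not scores:
        raise ValueError("at least one rating metric is required")
    total = sum(scores)
    return total if how == "sum" else total / len(scores)


rating_metrics = [3.2, 4, "no"]  # rating metrics 608, 610, and 612 in the example
summed = combine_rating_metrics(rating_metrics, how="sum")        # about 7.2
averaged = combine_rating_metrics(rating_metrics, how="average")  # about 2.4
```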
As mentioned, in some embodiments, the compound exploration initiation system 102 utilizes various rating metric thresholds to generate the program rating 614. For instance, the compound exploration initiation system 102 can apply a rating metric threshold to each rating metric (e.g., to determine whether each rating metric will receive a passing or failing score). The compound exploration initiation system 102 can also apply a rating metric threshold to a number of passing rating metrics. For instance, the compound exploration initiation system 102 can require a certain number of passing rating metrics (e.g., 4). If the number of passing rating metrics fails to satisfy the threshold (e.g., 4), then the compound exploration initiation system 102 can determine a corresponding program rating (e.g., a failing program rating). If the number of passing rating metrics satisfies the threshold (e.g., 4), then the compound exploration initiation system 102 can determine a corresponding program rating (e.g., a passing program rating). In one or more embodiments, for each rating metric that satisfies an initial rating metric threshold, the compound exploration initiation system 102 can add one point to the program rating 614.
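The two-level thresholding described above (a threshold per rating metric, then a threshold on the number of passing rating metrics) can be sketched as follows. The metric names and values are hypothetical, and treating "satisfies" as greater-than-or-equal is an assumption of this sketch.

```python
def rate_program(metrics, metric_thresholds, required_passing):
    """Count passing rating metrics and compare the count to a required number.

    One point is added to the program rating for each rating metric that
    satisfies its threshold (interpreted here as value >= threshold).
    """
    points = sum(
        1 for name, value in metrics.items()
        if value >= metric_thresholds[name]
    )
    return {"points": points, "passing": points >= required_passing}


# Hypothetical rating metrics on a 0-5 scale, each with a threshold of 3.
metrics = {
    "previous_analysis": 4,
    "expression": 3,
    "direction": 5,
    "toxicity": 2,
    "tractability": 4,
}
thresholds = {name: 3 for name in metrics}

result = rate_program(metrics, thresholds, required_passing=4)
```

In this example four of the five rating metrics pass, so a required count of 4 yields a passing program rating, while a required count of 5 would not.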
In some instances, the compound exploration initiation system 102 can establish that at least a majority of the rating metrics have to satisfy the rating metric threshold to generate a favorable program rating. In other instances, the compound exploration initiation system 102 can establish that only one rating metric has to satisfy the predetermined threshold to generate a favorable program rating.
In some implementations, the rating metric threshold includes a combined threshold (e.g., applied after combining one or more rating metrics). For instance, the disclosed system can establish the rating metric threshold as 4 for the combination of three rating metrics. If the average score after combining the three rating metrics fails to exceed 4, then the program rating generated by the disclosed system indicates to not move forward with a compound exploration program.
In one or more embodiments, the compound exploration initiation system 102 utilizes weights for the rating metrics 608-612. For example, the compound exploration initiation system 102 assigns a 60% weight to the rating metric 608, a 20% weight to the rating metric 610, and a 20% weight to the rating metric 612. Further, in one or more embodiments, the compound exploration initiation system 102 utilizes the weights assigned to the rating metrics 608-612 to generate the program rating 614. For instance, if both the program rating 614 and the rating metric 608 are scaled from 0-5 and the rating metric 608 is a 5, the lowest possible score for the program rating 614 is a 3 (e.g., 60% of the total).
In some implementations, the compound exploration initiation system 102 learns weights to apply to various rating metrics in generating a program rating. For instance, the compound exploration initiation system 102 can identify which anchor genes or anchor compounds are identified in subsequent compound exploration programs. The compound exploration initiation system 102 can adjust the weights to emphasize those rating metrics corresponding to these anchor genes or anchor compounds.
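The weighted combination can be sketched as a weighted sum using the 60%/20%/20% example above. The function name and the dictionary keys are hypothetical, and the sketch assumes all rating metrics share the same 0-5 scale and that the weights sum to one.

```python
def weighted_program_rating(metrics, weights):
    """Weighted sum of rating metrics that share a common scale."""
    if set(metrics) != set(weights):
        raise ValueError("metrics and weights must cover the same rating metrics")
    if abs(sum(weights.values()) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    return sum(weights[name] * metrics[name] for name in metrics)


weights = {"metric_608": 0.6, "metric_610": 0.2, "metric_612": 0.2}

# Rating metric 608 at its maximum and the others at their minimum gives the
# floor described above: 0.6 * 5 = 3 on a 0-5 scale.
floor_rating = weighted_program_rating(
    {"metric_608": 5, "metric_610": 0, "metric_612": 0}, weights
)
ceiling_rating = weighted_program_rating(
    {"metric_608": 5, "metric_610": 5, "metric_612": 5}, weights
)
```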
In some implementations, the compound exploration initiation system 102 automatically initiates a compound exploration program based on the program rating 614. For example, the compound exploration initiation system 102 can initialize additional machine learning analysis (e.g., additional phenomic digital images) for an anchor gene or anchor compound based on the program rating 614.
In one or more embodiments, the compound exploration initiation system 102 provides the program rating 614 to an administrator device to determine whether to initiate one or more compound exploration programs. In some embodiments a “compound exploration program” includes a process of identifying and selecting potential chemical compounds or molecules for development into new or enhanced drugs or agents. For instance, a compound exploration program includes utilizing the anchor gene involved in an underlying disease and testing many compounds to identify how the compounds interact with the specific anchor. Additionally, compound exploration programs involve optimizing identified compounds and analyzing results of the compounds applied to the specific anchor. Furthermore, as used herein, the term compound can include small molecules or large molecules. Thus, a compound exploration program can involve small molecules (e.g., molecules below a threshold size, such as smaller than antibodies) and large molecules (e.g., molecules above a threshold size, such as larger than antibodies). A compound exploration program can relate to a variety of therapeutics and biological relationships, including antibodies, antibody drug conjugates, proteolysis-targeting chimeras (e.g., PROTACS), other targeting chimeras, soluble factors, and RNA therapeutics.
For instance, in some embodiments the compound exploration initiation system 102 utilizes the program rating 614 to determine to initiate an industrial program generation (IPG) process. To illustrate, IPG includes (i) a hit selection to identify statistically strong connections in a biological map to patient-informed phenotypes, (ii) phenomic confirmation (e.g., promising actives are confirmed by automated similarity and concentration-response analytics), (iii) Trekseq confirmation (e.g., compound and gene relationships are confirmed with transcriptomics in the map background), and (iv) Structure-Activity Relationship (SAR) confidence (e.g., actives that behave as a series are identified, and an automated recommendation for expansion is identified).
Moreover, in some embodiments the compound exploration initiation system 102 utilizes the program rating 614 to determine to initiate an industrialized compound generation (ICG) process. For instance, ICG applies to steps subsequent to IPG. Further, in some embodiments ICG includes rapidly searching and expanding from potential hit series in the chemical space (e.g., identified at the IPG stage) and testing the potential hits with various analytical tests (e.g., SAR screens).
As discussed above, in some embodiments the compound exploration initiation system 102 generates a program rating from a language machine learning model and additional data sources.
As previously discussed in
As used herein, a previous analysis rating metric refers to a value, score, measure, or indication of historical investigation, inquiry, or examination. In particular, a previous analysis rating metric can include a value within a range indicating the extent to which a biological relationship has previously been examined or researched (e.g., the extent to which a compound has been analyzed for its impact on cancer). For example, the previous analysis rating metric can include a measure of previous model validation (e.g., preclinical validation utilizing one or more previous models), therapy availability (e.g., oncology therapy availability or an indication of cancer unmet need), compound availability (e.g., oncology compound availability or an indication of the competitive landscape of compounds for treating cancer or some other disease), or relationship analysis (e.g., known relationship between genes and/or compounds and/or whether the biological relationship is novel). The compound exploration initiation system 102 can generate a previous analysis digital text prompt that includes previous analysis rating instructions (e.g., instructions for generating a previous analysis rating metric). To illustrate, the compound exploration initiation system 102 can generate a digital text prompt that includes previous analysis rating instructions identifying a particular rating scale (e.g., from 0-10). A previous analysis digital text prompt can also include previous analysis contextual instructions (e.g., return or summarize previous research or articles regarding the predicted biological relationship).
Similarly, as used herein, a gene impact rating metric refers to a value, score, measure, or indication of activity, expression, relevance, effect, or influence of a gene. In particular, a gene impact rating can include a measure of gene expression (e.g., oncology expression), gene impact direction (e.g., oncology direction), or toxicity. The compound exploration initiation system 102 can generate a gene impact digital text prompt that includes gene impact rating instructions (e.g., instructions for generating a gene impact rating metric). To illustrate, the compound exploration initiation system 102 can generate a gene impact digital text prompt that includes gene impact rating instructions identifying a particular rating scale (e.g., from 0-5 rate human relevance of this gene with regard to a particular disease). A gene impact digital text prompt can also include gene impact contextual instructions (e.g., return or summarize the manner in which this gene impacts a particular disease).
Thus, as illustrated, the compound exploration initiation system 102 utilizes the language machine learning model 700 to generate the previous analysis rating metric 712 and the gene impact rating metric 714. For instance, the previous analysis rating metric 712 can include a rating metric for the level of previous research for the anchor gene or anchor compound 502. Similarly, the gene impact rating metric 714 can include determining a disease connection source for an anchor gene (e.g., an association with cancer).
As illustrated, the compound exploration initiation system 102 utilizes the language machine learning model 700 to generate other rating metrics. For example, the language machine learning model 700 generates a rating metric 722 (and the rating metrics 724-726). For instance, the rating metric 722 can include a tractability/druggability rating metric.
As used herein, the term tractability rating metric (or druggability rating metric) refers to a value, score, measure, or indication of influence of compounds or drugs. For example, a tractability rating metric includes a measure of influence of compounds or drugs with regard to a particular disease or biological activity. Thus, a tractability rating metric includes a measure of impact of a drug or compound in treating a disease (e.g., feasibility of treating a disease using a compound). The compound exploration initiation system 102 can generate a tractability digital text prompt that includes tractability rating instructions (e.g., instructions for generating a tractability rating metric). To illustrate, the compound exploration initiation system 102 can generate a tractability digital text prompt that includes tractability rating instructions identifying a particular rating scale (e.g., from 0-5 rate tractability or druggability of a particular disease or biological activity). A tractability digital text prompt can also include tractability contextual instructions (e.g., describe one or more sources for the tractability rating metric).
Further,
Moreover,
Further,
Moreover, for the tractability determination, the compound exploration initiation system 102 further performs a first database query 708 from the cancer dependency map, which was discussed above, and a second database query 710 from a cancer database. For example, the cancer database can include data related to drug discovery, data visualization for cancer related biological substrates, ongoing cancer research, and other data related to medical compounds to anchor potential cancer substrates. As such, the cancer dependency map and the cancer database can contain some overlaps in data, which helps the compound exploration initiation system 102 reinforce findings related to oncological properties of an anchor gene. Based on the first database query 708 and the second database query 710, the compound exploration initiation system 102 generates the rating metric 724 in a same or similar manner as discussed above and generates the rating metric 726 by parsing the cancer database to extract a correlation or indicator from the cancer related biological substrates, ongoing cancer research, and other data related to medical compounds for an anchor gene.
Similar to the principles discussed above in
Likewise, by combining the rating metrics 722-726, the compound exploration initiation system 102 generates a combined score 730 to determine whether a predetermined threshold is satisfied. To illustrate, the compound exploration initiation system 102 takes the combined score 728 and the combined score 730 and further determines a program rating 732. For instance, the program rating 732 can indicate whether a predetermined threshold was satisfied from a combination of rating metrics (or for individual rating metrics). For instance, the compound exploration initiation system 102 can indicate the program rating 732 as favorable if one of the rating metrics satisfies the predetermined threshold.
Although the above discussion describes utilizing a predetermined threshold to generate the program rating 732 (e.g., based on the combined scores), in some embodiments the compound exploration initiation system 102 utilizes the combined score(s) as the program rating 732 without the predetermined threshold. For instance, the compound exploration initiation system 102 provides the program rating 732 to an administrator computing device and a user of the administrator computing device can determine whether to initiate one or more compound exploration programs.
Furthermore, although
As mentioned above, in some embodiments the compound exploration initiation system 102 utilizes particular digital text prompts to determine whether to initiate compound exploration programs.
For example,
In one or more embodiments, the gene impact digital text prompt 808 includes a prompt with text rating instructions for the language machine learning model 812 to rate a gene impact for a particular gene (e.g., the anchor gene). For instance, as mentioned previously a gene impact can include an effect, significance, or importance of a specific gene or variant of the gene in relation to various aspects. To illustrate, the gene impact can include the significance or importance to human relevance, gene function or activity, and gene toxicity signals.
For example, in some embodiments, the gene impact digital text prompt 808 can include a gene expression digital text prompt (e.g., a prompt related to expression of a gene with regard to a biological activity/disease), a gene impact direction digital text prompt (e.g., a prompt related to the functional significance of a gene such as its particular directional role in different cellular processes or biochemical pathways such as whether the gene is considered an oncogene or a suppressor gene), or a gene toxicity digital text prompt (e.g., a prompt related to a gene's role in toxicity such as how a gene's activity or expression relates to toxic effects).
For instance, an oncology expression digital text prompt (e.g., human relevance) can include a prompt as follows: “I will give you a gene and a list of cancer indications in humans. Supply a score and information about the gene's relevance to each of the indications. {gene} scoring rubric: 0—weak relevance of {gene} in cancer indication; 1—some evidence of altered expression of {gene} in cancer; 2—target mutation in the cancer indication; 3—putative target.” Moreover, in some embodiments, the oncology expression digital text prompt can further include a context generation prompt. To illustrate, the context generation prompt can include (in addition to the rating metric e.g., score), “with the score please provide evidence linking {gene} to cancer development and also evidence that modulation of {gene} slows growth.”
Further, an oncology direction digital text prompt (e.g., direction) can include a prompt as follows: “I will provide a gene, supply information about this gene in the following format. Give a confidence score from 0-5 classifying {gene} in each of the following categories: oncogene, tumor suppressor, tumor dependency, driver of drug resistance, loss in drug resistance. Please format the response as follows: {gene}; score for each of the categories.”
Moreover, a gene toxicity digital text prompt (e.g., toxicity) can include a prompt as follows: “Please provide information about how well validated a {gene} is linked to toxicity. For each {gene}, provide information on the quality and reproducibility of evidence linking the gene to toxicity. Assign a score from 0.0-5.0 based on the amount of evidence that validates the gene linked to toxicity.”
In one or more embodiments, the tractability digital text prompt 810 includes a prompt with text rating instructions for the language machine learning model to rate compound tractability. For instance, the tractability digital text prompt includes text for generating a tractability rating metric indicating druggability of a particular biological activity or disease. Specifically, as discussed, druggability includes a likelihood that a specific target can be modulated by a drug.
To illustrate, in one or more embodiments, the tractability digital text prompt 810 includes a compound tractability digital text prompt. For example, the tractability digital text prompt 810 can include a prompt as follows: “I will give you a gene, supply a score and information about the difficulty of developing a drug targeting this gene. Please format the response as follows: {gene}; score. Information regarding the difficulty.”
As shown in
In one or more embodiments, in response to the compound exploration initiation system 102 initiating one or more compound exploration programs, the compound exploration initiation system 102 generates an additional processed biological representation. For instance, the compound exploration initiation system 102 generates an additional processed biological representation to begin downstream analysis tasks for the compound exploration program. The additional processed biological representations can include, for example, additional phenomic image embeddings generated from additional phenomic digital images. The additional processed biological representations can also include additional machine learning binding predictions between compounds and proteins. To illustrate, in some embodiments the compound exploration initiation system 102 generates the additional biological machine learning model for use in IPG and/or ICG processes discussed above.
Although the above description in relation to
For instance, although not shown in
For example, in some embodiments the previous model validator digital text prompt includes an oncology validation digital text prompt. For instance, the oncology validation digital text prompt includes a text prompt for generating oncology validation rating metrics indicating a measure of previous models validating a relationship between a gene/compound and cancer.
To illustrate, an oncology validation digital text prompt can include a digital text prompt as follows: “I will provide a gene, supply information about this gene in the following format. Here is a scoring rubric from 0-4 for {gene}. 0—no in-vivo or in-vitro data; 1—in-vitro evidence showing target link; 2—single in-vitro and in-vivo study; 3—in-vivo data in more than two models; 4—in-vivo data in multiple models in greater than two peer reviewed studies. Please format the response as follows: {gene}; 0-4 score based on the rubric.”
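By way of illustration and not limitation, the following Python sketch shows one way a response in the requested “{gene}; score” format could be parsed and checked against the 0-4 rubric range. The function name `parse_rating_response` and the sample response strings are hypothetical, and the sketch assumes the language machine learning model actually follows the requested format.

```python
import re
from typing import Tuple


def parse_rating_response(
    response: str, min_score: int = 0, max_score: int = 4
) -> Tuple[str, int]:
    """Parse a '{gene}; score' response and validate the score range.

    Assumption: the response begins with the gene name, a semicolon,
    and an integer score, as requested by the prompt.
    """
    match = re.match(r"\s*([^;]+?)\s*;\s*(\d+)", response)
    if match is None:
        raise ValueError(f"Response not in '{{gene}}; score' format: {response!r}")
    gene, score = match.group(1), int(match.group(2))
    if not min_score <= score <= max_score:
        raise ValueError(f"Score {score} outside rubric range {min_score}-{max_score}")
    return gene, score


# Hypothetical response text used only for illustration.
gene, score = parse_rating_response("GENE_A; 3")
```

Rejecting out-of-range or malformed responses, rather than coercing them, keeps unreliable model output from silently entering a downstream program rating.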
In some embodiments, a therapy availability digital text prompt can include an oncology therapy availability text prompt (which further includes cancer indication unmet need). For instance, the oncology therapy availability text prompt can include a text prompt for generating an oncology therapy rating metric indicating a measure of existing cancer treatments/therapies.
For example, an oncology therapy availability digital text prompt can include a prompt as follows: “I will provide a human cancer indication and you will supply a score and information about the unmet need in this indication. Here is the cancer indication: {indication}. Scoring rubric: 0—curative treatments exist; 1—low unmet need, treatments available; 2—medium low unmet need, multiple lines of therapy available; 3—medium unmet need, some targeted treatments available; 4—medium high unmet need, treatment options limited; 5—high unmet need, no treatments available.” Moreover, in some instances, the oncology therapy availability digital text prompt can include a context generation prompt. For example, the context generation prompt can include “score+estimated number of new {country} cases per year, estimated number of {country} deaths per year, and an explanation for the reason of the score.”
In addition, a compound availability digital text prompt can include an oncology compound availability text prompt. For instance, the compound availability text prompt can include a text prompt for generating an oncology availability rating metric indicating a measure of cancer treating compounds (e.g., already existing or utilized in the competitive landscape).
For example, in some instances, the oncology compound availability digital text prompt can include a prompt as follows: “I will give you a human cancer indication and you will supply a score and information about the presence, phase, and progress of efforts targeting the cancer indication. Cancer {indication}. Scoring rubric: 0—high, >2 competitors with approved drugs for the {indication}; 1—medium high, >4 competitors in clinical trials; 2—medium, <4—clinical trials; 3—medium low, <4 competitors in early phase clinical trials; 4—low, 2 competitors in early phase clinical trials; 5—very low, no competitors in clinical trials.” Moreover, the oncology compound availability digital text prompt can include a context generation prompt that includes “score+the number of programs and the latest phase reached in a clinical trial.”
In some embodiments, a relationship analysis digital text prompt includes a known relationship digital text prompt (e.g., a prompt for measuring novel relationships between genes or between a gene and a compound). For instance, the relationship analysis digital text prompt includes a text prompt for generating a relationship analysis rating metric indicating a measure of novelty for a particular relationship between genes, drug candidates, and/or biological targets. To illustrate, a known relationship digital text prompt can include a prompt as follows: “I will give you a gene pair, tell me whether the two genes are known to be biologically related (in the same pathway). {gene 1} and {gene 2}.”
Although
In one or more embodiments, the compound exploration initiation system 102 parses through the phenomic image embeddings database 900 to identify a plurality of predicted biological relationships by comparing phenomic image embeddings with one another. For example, “the top stream” of
As shown, the compound exploration initiation system 102 can utilize multiple data streams to identify related gene-compound/protein-compound interactions and further generate different digital text prompts for the language machine learning model to generate rating metrics.
Specifically, the compound exploration initiation system 102 can identify a related protein to the anchor gene and utilize protein binding predictions to analyze a predicted biological relationship (and generate rating metrics utilizing a language machine learning model). For instance, the compound exploration initiation system 102 can identify a protein synthesized by an anchor gene. In some such instances, the compound exploration initiation system 102 can further utilize a binding representation database 914 to analyze protein-compound interactions for this corresponding protein to determine a predicted biological relationship. Moreover, the compound exploration initiation system 102 can analyze this predicted biological relationship utilizing prompts and a language machine learning model to generate rating metrics. In other words, the compound exploration initiation system 102 identifies protein-compound interactions in the binding representation database 914 and utilizes these interactions as a predicted biological relationship for generating rating metrics (in parallel with the predicted biological relationships generated from the phenomic image embeddings database 900). In other words,
As shown in the top stream, the compound exploration initiation system 102 generates a gene impact direction digital text prompt 906. The compound exploration initiation system 102 can then generate a compound tractability digital text prompt 908 (e.g., a druggability digital text prompt) and/or a relationship analysis digital text prompt 910. Accordingly, the compound exploration initiation system 102 can utilize rating metrics generated from each of the digital text prompts to adjust/modify digital text prompts used in a parallel data stream (e.g., the bottom data stream).
As shown in “the bottom stream” of
In utilizing the digital text prompts shown in
As shown, following the compound tractability digital text prompt 908 and/or the relationship analysis digital text prompt 910, the compound exploration initiation system 102 further generates a compound tractability digital text prompt 912 and/or a relationship analysis digital text prompt 926 to generate additional rating metrics related to the anchor gene or anchor compound 904 and the predicted biological relationships 918.
As shown, from the series of digital text prompts submitted to a language machine learning model, the compound exploration initiation system 102 generates various rating metrics and a program rating to determine whether to perform the act 928 of initiating compound exploration programs.
Although not illustrated in
Further, in some instances the compound exploration initiation system 102 can utilize the combination algorithm to receive as input multiple rating metrics corresponding to multiple digital text prompts and determine a subsequent digital text prompt. Specifically, in some instances the compound exploration initiation system 102 assigns weights to different digital text prompts, and for a combination of rating metrics below a certain number, the compound exploration initiation system 102 identifies a subsequent digital text prompt with a lower weight. Conversely, for a combination of rating metrics above a certain number, the compound exploration initiation system 102 identifies a subsequent digital text prompt with a higher weight. Moreover, in some instances the compound exploration initiation system 102 utilizes the combination algorithm to generate the program rating from combining multiple rating metrics.
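By way of illustration and not limitation, the following Python sketch shows one possible form of such a combination algorithm. The disclosure does not fix a particular formula, so the weighted average, the cutoff value, the treatment of a combined value exactly at the cutoff, and all names (`combine_ratings`, `select_next_prompt`, the prompt labels) are assumptions made only for this example.

```python
from typing import Dict, Optional


def combine_ratings(ratings: Dict[str, float], weights: Dict[str, float]) -> float:
    """Combine rating metrics with a weighted average (an assumed formula)."""
    total_weight = sum(weights[name] for name in ratings)
    if total_weight == 0:
        raise ValueError("Weights for the supplied rating metrics sum to zero")
    return sum(ratings[name] * weights[name] for name in ratings) / total_weight


def select_next_prompt(
    combined: float, cutoff: float, candidate_weights: Dict[str, float]
) -> Optional[str]:
    """Pick a subsequent prompt: lower weight below the cutoff, else higher weight.

    Assumption: a combined value equal to the cutoff is treated like a value
    above it; the text leaves that case unspecified.
    """
    if not candidate_weights:
        return None
    if combined < cutoff:
        return min(candidate_weights, key=candidate_weights.get)
    return max(candidate_weights, key=candidate_weights.get)


# Hypothetical rating metrics and weights used only for illustration.
ratings = {"human_relevance": 4.0, "druggability": 2.0}
weights = {"human_relevance": 3.0, "druggability": 1.0}
combined = combine_ratings(ratings, weights)  # (4*3 + 2*1) / 4 = 3.5
next_prompt = select_next_prompt(
    combined, cutoff=3.0, candidate_weights={"toxicity": 0.5, "relationship_analysis": 2.0}
)
```

The same `combine_ratings` function can serve to produce a program rating from the full set of rating metrics once all prompts have been answered.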
Although not shown in
Furthermore, although
In some implementations, the compound exploration initiation system 102 generates user interfaces for efficiently displaying rating metrics, contextual information, and/or program ratings in initiating a compound exploration program.
As shown,
Furthermore, the compound exploration initiation system 102 causes the graphical user interface 1002 to display a first rating metric 1006, a second rating metric 1010, and a third rating metric 1012. For instance, the compound exploration initiation system 102 receives the predicted biological relationship and identifies a set of digital text prompts to send to the language machine learning model. Further, in some such instances the compound exploration initiation system 102 generates rating metrics for the set of digital text prompts utilizing the language machine learning model. Moreover, in such instances the compound exploration initiation system 102 causes the graphical user interface 1002 to provide for display the rating metrics obtained from the language machine learning model.
As shown, the first rating metric 1006 reads “preclinical validation rating metric: 5-strong in vivo data supporting anti-tumor activity from 3 independent peer reviewed studies in paper 1, paper 2, and paper 3.” As shown, the text following the first rating metric 1006 includes context 1008 returned with the rating metric in response to context generation instructions. Further, the underlined paper 1, paper 2, and paper 3 can indicate links to the cited papers. For instance, the compound exploration initiation system 102 receives the predicted biological relationship and generates a set of digital text prompts that includes context generation instructions. Moreover, the context generation instructions instruct the language machine learning model to return published papers related to the rating metrics. Accordingly, the compound exploration initiation system 102 receives from the language machine learning model the rating metrics and corresponding published papers that support the rating metrics. Thus, the compound exploration initiation system 102 causes the graphical user interface 1002 to display the context 1008 obtained from the language machine learning model.
Moreover, as shown, the second rating metric 1010 reads “human relevance rating metric: 4” and the third rating metric 1012 reads “druggability rating metric: 3.” For instance, the compound exploration initiation system 102 identifies the predicted biological relationship for an anchor gene and identifies a corresponding human relevance digital text prompt. Moreover, in some such instances the compound exploration initiation system 102 sends the human relevance digital text prompt to the language machine learning model and receives a rating metric of 4 (which indicates, in the example given above, a putative target with known significance to cancer). Likewise, the compound exploration initiation system 102 receives the druggability rating metric in a similar manner as just described. Specifically, the compound exploration initiation system 102 receives these rating metrics and causes the graphical user interface 1002 to display the rating metrics obtained from the language machine learning model.
As further shown, the compound exploration initiation system 102 causes the graphical user interface 1002 to display a program rating 1014, which reads “program rating: 4.” For instance, the compound exploration initiation system 102 receives the rating metrics 1006, 1010, and 1012 from the language machine learning model and further combines the rating metrics to determine a program rating. As described above, in some embodiments, the compound exploration initiation system 102 utilizes a program rating model (e.g., combination model) to combine individual rating metrics and determine the program rating 1014. Further, after generating the program rating 1014, the compound exploration initiation system 102 causes the graphical user interface 1002 to display the program rating 1014.
In addition to showing the program rating, the compound exploration initiation system 102 also causes the graphical user interface 1002 to provide an element 1016 that reads “initiate program.” In some embodiments, selecting the element 1016 causes the compound exploration initiation system 102 to trigger one or more compound exploration program(s) related to the predicted biological relationship 1004 (e.g., generating an additional machine learning representation for downstream analysis).
As mentioned above, in some embodiments the compound exploration initiation system 102 scales the validation of hundreds of thousands to millions of predicted biological relationships by automatically querying the language machine learning model with digital text prompts related to the predicted biological relationships. Although not shown in
In some implementations, the compound exploration initiation system 102 generates digital text prompts and provides the digital text prompts for display via the administrator computing device 1000. An administrator can then view and/or modify the text prompts via the administrator computing device 1000 before the compound exploration initiation system 102 applies the language machine learning model.
Although not illustrated, in some implementations, the compound exploration initiation system 102 monitors performance of compounds/genes in future programs to improve program initiation. For example, the compound exploration initiation system 102 monitors IPG and/or ICG processes to identify successful compounds that modulate biology. The compound exploration initiation system 102 can then utilize these successful compounds to modify weights, combination algorithms, or parameters to improve program initiation predictions in the future.
While
For example, in one or more embodiments, the acts 1102-1108 include identifying, from a processed biological representation (or biological machine learning representation), a predicted biological relationship for an anchor compound or an anchor gene; generating, from the predicted biological relationship for the anchor compound or the anchor gene, a plurality of digital text prompts, wherein the plurality of digital text prompts comprise the anchor compound or the anchor gene and a plurality of text rating instructions for a language machine learning model; generating, from the plurality of digital text prompts utilizing the language machine learning model, a plurality of rating metrics according to the plurality of text rating instructions; and combining the plurality of rating metrics to generate a program rating for the anchor compound or the anchor gene for initiating one or more compound exploration programs.
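By way of illustration and not limitation, the prompt generation, rating, and combination acts can be sketched in Python as follows. The sketch takes the anchor gene as given (the identification of the predicted biological relationship is not shown), uses a stub function in place of a language machine learning model, and combines rating metrics with a simple mean. The function names, prompt wording, stub scores, and the mean itself are assumptions for this example only.

```python
from typing import Callable, Dict


def run_rating_pipeline(
    anchor_gene: str,
    prompt_templates: Dict[str, str],
    rate_prompt: Callable[[str], float],
) -> Dict[str, object]:
    """Generate prompts, obtain rating metrics, and combine them."""
    if not prompt_templates:
        raise ValueError("At least one digital text prompt template is required")
    # Generate digital text prompts that contain the anchor gene.
    prompts = {
        name: template.format(gene=anchor_gene)
        for name, template in prompt_templates.items()
    }
    # Generate one rating metric per prompt using the supplied model callable.
    rating_metrics = {name: rate_prompt(prompt) for name, prompt in prompts.items()}
    # Combine the rating metrics (simple mean; an assumed combination rule).
    program_rating = sum(rating_metrics.values()) / len(rating_metrics)
    return {
        "prompts": prompts,
        "rating_metrics": rating_metrics,
        "program_rating": program_rating,
    }


def stub_model(prompt: str) -> float:
    """Stand-in for a language machine learning model; returns fixed scores."""
    return 4.0 if "relevance" in prompt else 2.0


# Hypothetical templates and anchor gene used only for illustration.
result = run_rating_pipeline(
    "GENE_A",
    {
        "human_relevance": "Supply a score for the relevance of {gene} to cancer.",
        "tractability": "Supply a score for the difficulty of developing a drug targeting {gene}.",
    },
    stub_model,
)
```

Passing the model as a callable keeps the sketch independent of any particular language machine learning model or client library.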
In one or more implementations, the series of acts 1100 includes generating, utilizing a machine-learning model, a plurality of phenomic image embeddings from a plurality of perturbation images portraying a plurality of cell perturbations; comparing the plurality of phenomic image embeddings to determine a measure of similarity; and identifying the predicted biological relationship from the measure of similarity.
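By way of illustration and not limitation, the comparison of phenomic image embeddings can be sketched in Python as follows. Cosine similarity is used here as one possible measure of similarity; the choice of measure, the threshold value, the two-dimensional example vectors, and the names `cosine_similarity` and `predicted_relationships` are assumptions for this example.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two embedding vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # Treat a zero vector as dissimilar to everything.
    return dot / (norm_a * norm_b)


def predicted_relationships(
    anchor: str, embeddings: Dict[str, Sequence[float]], threshold: float = 0.8
) -> List[Tuple[str, float]]:
    """Return perturbations whose similarity to the anchor meets the threshold."""
    anchor_vec = embeddings[anchor]
    scored = [
        (name, cosine_similarity(anchor_vec, vec))
        for name, vec in embeddings.items()
        if name != anchor
    ]
    kept = [pair for pair in scored if pair[1] >= threshold]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


# Toy embeddings used only for illustration; real embeddings are much larger.
embeddings = {
    "GENE_A": [1.0, 0.0],
    "COMPOUND_X": [2.0, 0.0],
    "COMPOUND_Y": [0.0, 1.0],
}
related = predicted_relationships("GENE_A", embeddings)
```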
In addition, in one or more implementations, the series of acts 1100 includes identifying compound features corresponding to a compound and protein features corresponding to a protein; generating, utilizing a compound protein-pocket interaction machine-learning model, a machine learning binding representation between the compound and the protein utilizing the compound features and the protein features; and identifying the predicted biological relationship from the machine learning binding representation.
Further, in some implementations, the series of acts 1100 includes identifying a plurality of digital text prompt templates comprising one or more placeholder query fields; and generating the plurality of digital text prompts by populating the one or more placeholder query fields of the plurality of digital text prompt templates based on the anchor compound or the anchor gene.
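By way of illustration and not limitation, populating placeholder query fields can be sketched in Python with the standard library `string.Formatter`. The template text is taken from the tractability example above; the function names and the value `GENE_A` are hypothetical.

```python
from string import Formatter
from typing import Dict, List


def placeholder_fields(template: str) -> List[str]:
    """List the placeholder query fields (e.g., 'gene') found in a template."""
    return [field for _, field, _, _ in Formatter().parse(template) if field]


def populate_templates(
    templates: Dict[str, str], values: Dict[str, str]
) -> Dict[str, str]:
    """Fill every template, failing loudly if a placeholder has no value."""
    prompts = {}
    for name, template in templates.items():
        missing = [f for f in placeholder_fields(template) if f not in values]
        if missing:
            raise KeyError(f"Template {name!r} is missing values for {missing}")
        prompts[name] = template.format(**values)
    return prompts


templates = {
    "tractability": (
        "I will give you a gene, supply a score and information about the "
        "difficulty of developing a drug targeting this gene. Please format "
        "the response as follows: {gene}; score."
    ),
}
prompts = populate_templates(templates, {"gene": "GENE_A"})
```

Checking for unfilled placeholders before formatting avoids sending a prompt that still contains a literal `{gene}` to the language machine learning model.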
In one or more implementations of the series of acts 1100, generating the plurality of digital text prompts further comprises generating, for the anchor compound or the anchor gene, a gene impact digital text prompt comprising gene impact text rating instructions; and generating the plurality of rating metrics comprises generating, from the gene impact digital text prompt comprising the gene impact text rating instructions utilizing the language machine learning model, a gene impact rating metric indicating a measure of impact corresponding to the target gene.
In addition, in some implementations, the series of acts 1100 includes generating, for the anchor compound or the anchor gene, at least one of: a previous analysis digital text prompt comprising previous analysis text rating instructions indicating a measure of previous analysis of the predicted biological relationship, or a tractability digital text prompt comprising tractability text rating instructions indicating a measure of tractability of impacting the anchor gene utilizing a compound.
Further, in one or more implementations, the series of acts 1100 includes generating a digital text prompt comprising the anchor compound or the anchor gene, a text rating instruction, and a context generation instruction; generating, from the context generation instruction utilizing the language machine learning model, a contextual text description for the anchor compound or the anchor gene; and providing, for display, via a graphical user interface of an administrator computing device, the program rating and the contextual text description.
In addition, in one or more implementations, the series of acts 1100 includes generating the program rating based on determining that a subset of rating metrics of the plurality of rating metrics satisfies a predetermined rating metric threshold.
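By way of illustration and not limitation, threshold-based gating of the program rating can be sketched in Python as follows. The sketch assumes that “satisfies” means greater than or equal to the threshold and that the program rating is a simple mean of all rating metrics; both are assumptions, as are the function and metric names.

```python
from typing import Dict, Optional, Sequence


def gated_program_rating(
    ratings: Dict[str, float], gate_names: Sequence[str], threshold: float
) -> Optional[float]:
    """Return a program rating only if every gating metric meets the threshold.

    Assumptions: 'satisfies' means >= threshold, and the program rating is
    the mean of all rating metrics.
    """
    if not ratings:
        raise ValueError("At least one rating metric is required")
    if any(ratings[name] < threshold for name in gate_names):
        return None  # The gating subset does not satisfy the threshold.
    return sum(ratings.values()) / len(ratings)


# Hypothetical rating metrics used only for illustration.
ratings = {"human_relevance": 4.0, "druggability": 3.0, "toxicity": 1.0}
rating = gated_program_rating(ratings, ["human_relevance", "druggability"], threshold=3.0)
blocked = gated_program_rating(ratings, ["toxicity"], threshold=3.0)
```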
Further, in one or more implementations, the series of acts 1100 includes providing, for display via a graphical user interface of an administrator computing device, the program rating for the anchor compound or the anchor gene and the plurality of rating metrics. Moreover, in one or more implementations, the series of acts 1100 includes initiating the one or more compound exploration programs based on the program rating by generating an additional processed biological representation for the anchor compound or the anchor gene.
Embodiments of the present disclosure may comprise or utilize a special purpose or general-purpose computer including computer hardware, such as, for example, one or more processors and system memory, as discussed in greater detail below. Embodiments within the scope of the present disclosure also include physical and other computer-readable media for carrying or storing computer-executable instructions and/or data structures. In particular, one or more of the processes described herein may be implemented at least in part as instructions embodied in a non-transitory computer-readable medium and executable by one or more computing devices (e.g., any of the media content access devices described herein). In general, a processor (e.g., a microprocessor) receives instructions from a non-transitory computer-readable medium (e.g., memory) and executes those instructions, thereby performing one or more processes, including one or more of the processes described herein.
Computer-readable media can be any available media that can be accessed by a general purpose or special purpose computer system. Computer-readable media that store computer-executable instructions are non-transitory computer-readable storage media (devices). Computer-readable media that carry computer-executable instructions are transmission media. Thus, by way of example, and not limitation, embodiments of the disclosure can comprise at least two distinctly different kinds of computer-readable media: non-transitory computer-readable storage media (devices) and transmission media.
Non-transitory computer-readable storage media (devices) include RAM, ROM, EEPROM, CD-ROM, solid state drives (“SSDs”) (e.g., based on RAM), Flash memory, phase-change memory (“PCM”), other types of memory, other optical disk storage, magnetic disk storage or other magnetic storage devices, or any other medium which can be used to store desired program code means in the form of computer-executable instructions or data structures and which can be accessed by a general purpose or special purpose computer.
A “network” is defined as one or more data links that enable the transport of electronic data between computer systems and/or modules and/or other electronic devices. When information is transferred or provided over a network or another communications connection (either hardwired, wireless, or a combination of hardwired or wireless) to a computer, the computer properly views the connection as a transmission medium. Transmission media can include a network and/or data links which can be used to carry desired program code means in the form of computer-executable instructions or data structures and which can be accessed by a general purpose or special purpose computer. Combinations of the above should also be included within the scope of computer-readable media.
Further, upon reaching various computer system components, program code means in the form of computer-executable instructions or data structures can be transferred automatically from transmission media to non-transitory computer-readable storage media (devices) (or vice versa). For example, computer-executable instructions or data structures received over a network or data link can be buffered in RAM within a network interface module (e.g., a “NIC”), and then eventually transferred to computer system RAM and/or to less volatile computer storage media (devices) at a computer system. Thus, it should be understood that non-transitory computer-readable storage media (devices) can be included in computer system components that also (or even primarily) utilize transmission media.
Computer-executable instructions comprise, for example, instructions and data which, when executed by a processor, cause a general-purpose computer, special purpose computer, or special purpose processing device to perform a certain function or group of functions. In some embodiments, computer-executable instructions are executed by a general-purpose computer to turn the general-purpose computer into a special purpose computer implementing elements of the disclosure. The computer-executable instructions may be, for example, binaries, intermediate format instructions such as assembly language, or even source code. Although the subject matter has been described in language specific to structural features and/or methodological acts, it is to be understood that the subject matter defined in the appended claims is not necessarily limited to the features or acts described above. Rather, the described features and acts are disclosed as example forms of implementing the claims.
Those skilled in the art will appreciate that the disclosure may be practiced in network computing environments with many types of computer system configurations, including personal computers, desktop computers, laptop computers, message processors, hand-held devices, multi-processor systems, microprocessor-based or programmable consumer electronics, network PCs, minicomputers, mainframe computers, mobile telephones, PDAs, tablets, pagers, routers, switches, and the like. The disclosure may also be practiced in distributed system environments where local and remote computer systems, which are linked (either by hardwired data links, wireless data links, or by a combination of hardwired and wireless data links) through a network, both perform tasks. In a distributed system environment, program modules may be located in both local and remote memory storage devices.
Embodiments of the present disclosure can also be implemented in cloud computing environments. As used herein, the term “cloud computing” refers to a model for enabling on-demand network access to a shared pool of configurable computing resources. For example, cloud computing can be employed in the marketplace to offer ubiquitous and convenient on-demand access to the shared pool of configurable computing resources. The shared pool of configurable computing resources can be rapidly provisioned via virtualization and released with low management effort or service provider interaction, and then scaled accordingly.
A cloud-computing model can be composed of various characteristics such as, for example, on-demand self-service, broad network access, resource pooling, rapid elasticity, measured service, and so forth. A cloud-computing model can also expose various service models, such as, for example, Software as a Service (“SaaS”), Platform as a Service (“PaaS”), and Infrastructure as a Service (“IaaS”). A cloud-computing model can also be deployed using different deployment models such as private cloud, community cloud, public cloud, hybrid cloud, and so forth. In addition, as used herein, the term “cloud-computing environment” refers to an environment in which cloud computing is employed.
As shown in
In particular embodiments, the processor(s) 1202 includes hardware for executing instructions, such as those making up a computer program. As an example, and not by way of limitation, to execute instructions, the processor(s) 1202 may retrieve (or fetch) the instructions from an internal register, an internal cache, memory 1204, or a storage device 1206 and decode and execute them.
The computing device 1200 includes memory 1204, which is coupled to the processor(s) 1202. The memory 1204 may be used for storing data, metadata, and programs for execution by the processor(s). The memory 1204 may include one or more of volatile and non-volatile memories, such as Random-Access Memory (“RAM”), Read-Only Memory (“ROM”), a solid-state disk (“SSD”), Flash, Phase Change Memory (“PCM”), or other types of data storage. The memory 1204 may be internal or distributed memory.
The computing device 1200 includes a storage device 1206 that includes storage for storing data or instructions. As an example, and not by way of limitation, the storage device 1206 can include a non-transitory storage medium described above. The storage device 1206 may include a hard disk drive (HDD), flash memory, a Universal Serial Bus (USB) drive, or a combination of these or other storage devices.
As shown, the computing device 1200 includes one or more I/O interfaces 1208, which are provided to allow a user to provide input to (such as user strokes), receive output from, and otherwise transfer data to and from the computing device 1200. These I/O interfaces 1208 may include a mouse, keypad or a keyboard, a touch screen, camera, optical scanner, network interface, modem, other known I/O devices or a combination of such I/O interfaces 1208. The touch screen may be activated with a stylus or a finger.
The I/O interfaces 1208 may include one or more devices for presenting output to a user, including, but not limited to, a graphics engine, a display (e.g., a display screen), one or more output drivers (e.g., display drivers), one or more audio speakers, and one or more audio drivers. In certain embodiments, I/O interfaces 1208 are configured to provide graphical data to a display for presentation to a user. The graphical data may be representative of one or more graphical user interfaces and/or any other graphical content as may serve a particular implementation.
The computing device 1200 can further include a communication interface 1210. The communication interface 1210 can include hardware, software, or both. The communication interface 1210 provides one or more interfaces for communication (such as, for example, packet-based communication) between the computing device 1200 and one or more other computing devices or one or more networks. As an example, and not by way of limitation, the communication interface 1210 may include a network interface controller (NIC) or network adapter for communicating with an Ethernet or other wire-based network or a wireless NIC (WNIC) or wireless adapter for communicating with a wireless network, such as WI-FI. The computing device 1200 can further include a bus 1212. The bus 1212 can include hardware, software, or both that connects components of the computing device 1200 to each other.
In one or more implementations, various computing devices can communicate over a computer network. This disclosure contemplates any suitable network. As an example, and not by way of limitation, one or more portions of a network may include an ad hoc network, an intranet, an extranet, a virtual private network (“VPN”), a local area network (“LAN”), a wireless LAN (“WLAN”), a wide area network (“WAN”), a wireless WAN (“WWAN”), a metropolitan area network (“MAN”), a portion of the Internet, a portion of the Public Switched Telephone Network (“PSTN”), a cellular telephone network, or a combination of two or more of these.
In particular embodiments, the computing device 1200 can include a client device that includes a requester application or a web browser, such as MICROSOFT INTERNET EXPLORER, GOOGLE CHROME or MOZILLA FIREFOX, and may have one or more add-ons, plug-ins, or other extensions, such as TOOLBAR or YAHOO TOOLBAR. A user at the client device may enter a Uniform Resource Locator (“URL”) or other address directing the web browser to a particular server (such as server), and the web browser may generate a Hyper Text Transfer Protocol (“HTTP”) request and communicate the HTTP request to server. The server may accept the HTTP request and communicate to the client device one or more Hyper Text Markup Language (“HTML”) files responsive to the HTTP request. The client device may render a webpage based on the HTML files from the server for presentation to the user. This disclosure contemplates any suitable webpage files. As an example, and not by way of limitation, webpages may render from HTML files, Extensible Hyper Text Markup Language (“XHTML”) files, or Extensible Markup Language (“XML”) files, according to particular needs. Such pages may also execute scripts such as, for example and without limitation, those written in JAVASCRIPT, JAVA, MICROSOFT SILVERLIGHT, combinations of markup language and scripts such as AJAX (Asynchronous JAVASCRIPT and XML), and the like. Herein, reference to a webpage encompasses one or more corresponding webpage files (which a browser may use to render the webpage) and vice versa, where appropriate.
In particular embodiments, the tech-bio exploration system 104 may include a variety of servers, sub-systems, programs, modules, logs, and data stores. In particular embodiments, the tech-bio exploration system 104 may include one or more of the following: a web server, action logger, API-request server, transaction engine, cross-institution network interface manager, notification controller, action log, third-party-content-object-exposure log, inference module, authorization/privacy server, search module, user-interface module, user-profile (e.g., provider profile or requester profile) store, connection store, third-party content store, or location store. The tech-bio exploration system 104 may also include suitable components such as network interfaces, security mechanisms, load balancers, failover servers, management-and-network-operations consoles, other suitable components, or any suitable combination thereof. In particular embodiments, the tech-bio exploration system 104 may include one or more user-profile stores for storing user profiles and/or account information for credit accounts, secured accounts, secondary accounts, and other affiliated financial networking system accounts. A user profile may include, for example, biographic information, demographic information, financial information, behavioral information, social information, or other types of descriptive information, such as interests, affinities, or location.
The web server may include a mail server or other messaging functionality for receiving and routing messages between the tech-bio exploration system 104 and one or more client devices. An action logger may be used to receive communications from a web server about a user's actions on or off the tech-bio exploration system 104. In conjunction with the action log, a third-party-content-object-exposure log may be maintained of user exposures to third-party-content objects. A notification controller may provide information regarding content objects to a client device. Information may be pushed to a client device as notifications, or information may be pulled from a client device responsive to a request received from the client device. Authorization servers may be used to enforce one or more privacy settings of the users of the tech-bio exploration system 104. A privacy setting of a user determines how particular information associated with the user can be shared. The authorization server may allow users to opt in to or opt out of having their actions logged by the tech-bio exploration system 104 or shared with other systems, such as, for example, by setting appropriate privacy settings. Third-party-content-object stores may be used to store content objects received from third parties. Location stores may be used for storing location information received from a client device associated with users.
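As a further non-limiting illustration, the interaction between an action logger and user privacy settings can be sketched as follows. The class names `PrivacySettings` and `ActionLogger`, their fields, and the choice to treat users as opted out until they opt in are assumptions made only for this sketch; the disclosure does not specify a particular data model or default.

```python
from dataclasses import dataclass, field


@dataclass
class PrivacySettings:
    """Hypothetical per-user privacy settings (defaults are assumptions)."""
    log_actions: bool = False               # user has opted in to logging
    share_with_other_systems: bool = False  # user has opted in to sharing


@dataclass
class ActionLogger:
    """Records user actions only where the user's settings permit it."""
    settings: dict = field(default_factory=dict)
    action_log: list = field(default_factory=list)

    def set_privacy(self, user_id, prefs):
        # Store or update the privacy settings for one user.
        self.settings[user_id] = prefs

    def record(self, user_id, action):
        # Users without stored settings fall back to the opted-out default.
        prefs = self.settings.get(user_id, PrivacySettings())
        if not prefs.log_actions:
            return False
        self.action_log.append((user_id, action))
        return True

    def shareable(self):
        # Return only the logged actions whose users allow external sharing.
        return [
            (user_id, action)
            for user_id, action in self.action_log
            if self.settings.get(user_id, PrivacySettings()).share_with_other_systems
        ]


if __name__ == "__main__":
    logger = ActionLogger()
    logger.set_privacy("requester-1", PrivacySettings(log_actions=True))
    logger.record("requester-1", "viewed_report")
    logger.record("requester-2", "viewed_report")  # not logged: opted out
    print(logger.action_log, logger.shareable())
```

Checking the settings at the moment of logging and again at the moment of sharing keeps the two permissions independent, so a user may allow logging without allowing the logged actions to be shared with other systems.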
In the foregoing specification, the invention has been described with reference to specific example embodiments thereof. Various embodiments and aspects of the invention(s) are described with reference to details discussed herein, and the accompanying drawings illustrate the various embodiments. The description above and drawings are illustrative of the invention and are not to be construed as limiting the invention. Numerous specific details are described to provide a thorough understanding of various embodiments of the present invention.
The present invention may be embodied in other specific forms without departing from its spirit or essential characteristics. The described embodiments are to be considered in all respects only as illustrative and not restrictive. For example, the methods described herein may be performed with fewer or more steps/acts, or the steps/acts may be performed in differing orders. Additionally, the steps/acts described herein may be repeated or performed in parallel to one another or in parallel to different instances of the same or similar steps/acts. The scope of the invention is, therefore, indicated by the appended claims rather than by the foregoing description. All changes that come within the meaning and range of equivalency of the claims are to be embraced within their scope.