This U.S. patent application claims priority under 35 U.S.C. § 119 to: Indian Patent Application No. 202221014386, filed on Mar. 16, 2022. The entire contents of the aforementioned application are incorporated herein by reference.
The disclosure herein generally relates to search optimization techniques, and, more particularly, to optimal variables selection for generating predictive models using population based exhaustive replacement techniques.
Advancements in the physical, life, and social sciences, among others, have generated large amounts of data, and there is great interest in using these data to create new knowledge, as this is expected to improve the quality of human life. The quest for new knowledge, including insights, rules, alerts, and predictive models, and its associated positive impact on humanity have created an urgent need for efficient data analytics techniques and for technologies such as high performance computing and cloud computing that can handle large amounts of data. Variable selection is one such data analytics approach, in which a subset of variables (X) is selected from a large pool of variables based on various statistical measures. The selected variables can be used to develop prediction models for a dependent variable (Y) when combined with modelling techniques such as multiple linear regression, nonlinear regression, etc. Variable selection can be accomplished using a random or an exhaustive search technique. The random approach includes heuristic methods such as ant colony optimization, particle swarm optimization, genetic algorithms, and the like, which are less compute intensive; however, these methods cannot guarantee an optimal solution because they do not explore the complete problem (variable) space. Unlike the random approach, the exhaustive search approach evaluates each possible combination and thus provides the best solution; however, it is a computationally hard problem, which limits its application to the selection of smaller subsets.
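By way of a non-limiting illustration, the growth of the exhaustive search space can be made concrete by counting the number of distinct subsets of size r that can be drawn from n variables. The function name `subset_count` and the pool size of 1000 variables are assumptions chosen purely for illustration.

```python
import math


def subset_count(n: int, r: int) -> int:
    """Number of distinct r-variable subsets that can be drawn from n variables."""
    return math.comb(n, r)


# Hypothetical pool of 1000 variables: the count grows very quickly with r,
# which is why exhaustive evaluation is practical only for small subsets.
for r in (2, 5, 10):
    print(f"r={r}: {subset_count(1000, r)} candidate subsets")
```

Each candidate subset would require one model fit, so the count above is also the number of models an exhaustive search would have to build.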
Predictive regression model generation, in principle, involves the following three critical steps: a) data division, b) optimal feature/variable selection from a large pool of structural features, and c) model generation from the selected optimal features. Data quality and the efficiency of these three steps determine the robustness of the predictive models and their applications/business impact. For example, failures of drug candidates in late stages can be addressed using reliable and easily applicable predictive regression ADMET (Absorption, Distribution, Metabolism, Excretion and Toxicity) models. Because these computational models rationalize experimental observations, offer potential for virtual screening applications, and can help reduce the time and cost of the drug discovery and development process, they have wide applications within the pharmaceutical industry. The generation of predictive ADMET models based on structural features of drugs and drug candidates typically involves the three critical steps discussed above. The variable selection step enables researchers to a) derive rules/alerts that can be used for improving research outcomes, and b) provide the best subset of variables to generate robust predictive models that are applicable in virtual screening of drug candidates even before they are produced in the laboratory.
The use of a large number of descriptors/physico-chemical properties that are indicative of molecular structural features is becoming more common, and the selection of an optimal number of non-correlated features (variable selection) from a large number of features is a computationally intensive step. Systematic search methods are expected to provide the best possible solutions. For instance, a conventional approach such as the Replacement Method (RM) involves selecting, at random, an initial set of descriptors of size ‘r’ whose inter-correlation is less than a user defined threshold. Then, for each path ‘l’, which ranges from ‘1’ to ‘r’, the lth descriptor is replaced with each of the remaining descriptors and the objective function for each updated subset is calculated. Thereafter, a subset (referred to as the optimal subset) having the maximum or minimum objective function, depending on the problem description, is chosen from the generated subsets. Following this, the standard errors/deviations of the descriptors of the optimal subset are computed, and the descriptor with the highest standard error is chosen to be replaced next, until one or more stopping criteria are met.
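The replacement idea described above can be sketched, under simplifying assumptions, as a greedy sweep over the paths of a subset. This is not the conventional RM in full: the inter-correlation check and the standard-error based choice of the next path are omitted, descriptors are represented by integer indices, and the names `replacement_sweep` and `objective` are hypothetical. The objective is assumed to be maximized.

```python
from typing import Callable, List, Sequence, Tuple


def replacement_sweep(
    subset: Sequence[int],
    n: int,
    objective: Callable[[Sequence[int]], float],
) -> Tuple[List[int], float]:
    """Visit each path (position) once; at each path, try every descriptor
    index in range(n) that is not already in the subset and keep any
    replacement that increases the objective."""
    best = list(subset)
    best_score = objective(best)
    for path in range(len(best)):
        for candidate in range(n):
            if candidate in best:
                continue  # descriptors within a subset stay unique
            trial = best.copy()
            trial[path] = candidate
            score = objective(trial)
            if score > best_score:
                best, best_score = trial, score
    return best, best_score


# Toy objective (sum of indices) used only to exercise the sweep.
print(replacement_sweep([0, 1, 2], 6, sum))
```

In practice the objective would be a model statistic such as R-Squared computed from a regression on the descriptor columns selected by the subset.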
Another conventional approach, referred to as the ‘Modified Replacement Method’ (MRM), follows an algorithm similar to RM, but in each step the descriptor with the largest error is substituted even if that substitution results in a lower objective function. In MRM, the stopping criteria are the frequency of an optimal set and the number of steps.
Yet another conventional approach, referred to as the ‘Enhanced Replacement Method’ (ERM), is a combination of the conventional methods RM and MRM in which MRM is performed in between RM runs. In this algorithm, RM is performed first, the output of RM is set as the input for MRM, and the output of MRM is again used as the input for RM.
A further conventional approach includes the RM and ERM First Step Modification (RMfsm and ERMfsm). The authors of this conventional approach proposed many versions of RM and ERM, which can contain one or more of the following steps: (a) first step modification, in which only one path is selected for replacement based on relative standard deviation rather than all possible ‘d’ paths; (b) arbitrary first step, in which the descriptor to be replaced is selected randomly; (c) using more than one starting initial subset; (d) selecting an initial subset with high standard deviation; and the like.
Yet another conventional approach is ERM with a Genetic Algorithm (GA). The authors of this conventional approach proposed various strategies for integrating the genetic algorithm and the enhanced replacement algorithm. They are: (a) GA with the mutation operator position determined by rsd instead of a random selection; (b) GA with the crossover operator position determined by rsd instead of a random selection; (c) GA with both the modified mutation and the modified crossover operator; and (d) ERM with an initial population (ERMp) similar to GA.
Though these conventional approaches have each performed well in some respects and produced near satisfactory results, they have their own limitations, such as unrealistic computing time, particularly when the number of variables to be selected is higher. Consequently, an optimal solution is not certain, and these approaches suffer from slower convergence of the solution, lower predictive capability of the models, and the like. Because of the unrealistic computing time required, selecting and generating models with a larger number of variables becomes a challenging task, thereby limiting the applications of such models within the pharmaceutical industry. In other words, the conventional approaches as known in the art are computationally intensive and, consequently, may not be employed regularly in solving real world problems within an industry (e.g., the pharmaceutical industry) due to a lack of the needed resources and time when the end-user needs to build models based on a large number of variables.
Embodiments of the present disclosure present technological improvements as solutions to one or more of the above-mentioned technical problems recognized by the inventors in conventional systems. For example, in one aspect, there is provided a processor implemented population based exhaustive replacement method for selecting one or more optimal variables from a set of unique variables and generating predictive models thereof. The method comprises: (i) inputting, via one or more hardware processors, a set of physico-chemical properties X derived from chemical structures of drugs and drug like chemical compounds, a biological response Y associated thereof and a size of variables set r; (ii) initializing a population Spop comprising matrix subsets {X1, X2, . . . Xpop} of variables selected from a filtered set of variables Xf of size n, wherein Xf is derived from a set of variables X, wherein Spop is initialized based on one or more pre-defined criteria, wherein pop represents size of the population, and wherein each matrix subset Xi from Spop comprises variables that are unique from each other and is of the size of variables set r; (iii) selecting at least one subset Xi from Spop and at least one path Pl of the at least one selected subset Xi, wherein the at least one selected path Pl comprises a variable vq to be replaced, wherein each variable vq comprised in the at least one selected path Pl is a vector of size m describing a property of input chemical compounds, and wherein m represents a number of chemical compounds used for building predictive models; (iv) replacing the variable vq from at least one subset Xi with remaining (n−r) variables of the set of variables Xf to obtain a set of modified subsets X′i {X′i1, X′i2, . . . , X′i(n-r)}, wherein each of the modified subsets X′ij comprises replaced variables for the at least one selected path Pl, and wherein size of the set of modified subsets X′i is of (n−r); (v) generating a predictive model for each of the modified subsets X′ij and calculating an objective function thereof to obtain a first set of predictive models Mi,rm and associated objective functions OFi,rm, wherein the first set of predictive models Mi,rm are generated based on the biological response Y and each of the modified subsets X′ij variable vectors of a set of input chemical compounds; (vi) identifying an optimal modified subset of replaced variables Xioptimal from the set of modified subsets X′i based on an optimal objective function associated with an optimal predictive model Mioptimal being identified from the first set of predictive models Mi,rm; (vii) updating the at least one selected path Pl for the optimal modified subset of replaced variables Xioptimal, wherein the steps (iv) till (vii) are iteratively performed until one or more predefined criteria are met to obtain an optimal population Spopoptimal; (viii) identifying an optimal element Xoptimal of Spopoptimal, wherein Xoptimal comprises an optimal objective function amongst objective functions comprised in other Xioptimal; (ix) performing an exhaustive search on a pool of variable Xpool created using the optimal population Spopoptimal to obtain a set of variable subsets Sx, wherein each element of Sx comprises set of r variables; (x) generating the predictive model for each element of Sx and calculating the objective function thereof to obtain a second set of predictive models Mx and associated objective functions OFx; (xi) identifying a pop number of optimal elements from Sx to update the optimal population Spopoptimal and to obtain an updated population Ses; (xii) identifying an optimal element Xoptimal,es amongst elements Xes comprised in the updated population Ses, and comparing an objective function of Xoptimal,es with an objective function of the identified optimal element Xoptimal for updation of the identified optimal element Xoptimal; (xiii) randomly replacing variables of the updated population Ses to obtain a perturbed set of population Sp; and (xiv) generating one or more predictive models based on the selected optimal subset of variables Xoptimal.
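As a non-limiting sketch of steps (ix) through (xi), the exhaustive search over the pool can be expressed as an enumeration of all r-variable combinations followed by retention of the pop best-scoring ones. The sketch assumes that the pool is the union of the variables present in the population, that variables are represented by integer indices, and that the objective is maximized; these are illustrative assumptions, and the names `exhaustive_on_pool` and `objective` are hypothetical.

```python
from itertools import combinations
from typing import Callable, List, Sequence, Tuple


def exhaustive_on_pool(
    population: Sequence[Sequence[int]],
    r: int,
    pop: int,
    objective: Callable[[Sequence[int]], float],
) -> List[Tuple[float, Tuple[int, ...]]]:
    """Enumerate every r-variable combination of the pool, score each one,
    and return the `pop` highest-scoring (score, subset) pairs."""
    # Assumption: the pool is the set of all variables present in the population.
    pool = sorted({v for subset in population for v in subset})
    scored = [(objective(combo), combo) for combo in combinations(pool, r)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:pop]


# Toy objective (sum of indices) used only to exercise the search.
print(exhaustive_on_pool([[0, 1], [1, 2], [3, 0]], r=2, pop=2, objective=sum))
```

Because the pool is much smaller than the full filtered set of n variables, the number of combinations enumerated here remains tractable where a search over all n variables would not be.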
In an embodiment, the one or more predefined criteria comprise: (i) subsets in which variables have inter correlation amongst each other below a first predefined threshold; (ii) subsets whose variables are correlated with the biological response Y above a second defined threshold or a system generated dynamic threshold; (iii) seeded subsets having variables whose sizes are less than a current subset size; and (iv) user defined preferences for the set of variables X.
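Criteria (i) and (ii) above can be illustrated, in a hedged manner, by a check that accepts a candidate subset only if its variables are weakly correlated with each other and sufficiently correlated with the response. The use of the Pearson correlation coefficient, the threshold values 0.9 and 0.1, and the names `pearson` and `is_admissible` are assumptions made for illustration only.

```python
import math
from itertools import combinations
from typing import Sequence


def pearson(a: Sequence[float], b: Sequence[float]) -> float:
    """Pearson correlation coefficient; returns 0.0 if either input is constant."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0
    return cov / math.sqrt(var_a * var_b)


def is_admissible(
    columns: Sequence[Sequence[float]],
    y: Sequence[float],
    max_inter: float = 0.9,      # hypothetical first threshold
    min_response: float = 0.1,   # hypothetical second threshold
) -> bool:
    """True if all pairwise |correlations| are below max_inter and every
    variable's |correlation| with the response y is above min_response."""
    low_inter = all(
        abs(pearson(a, b)) < max_inter for a, b in combinations(columns, 2)
    )
    linked = all(abs(pearson(c, y)) > min_response for c in columns)
    return low_inter and linked


a = [1.0, 2.0, 3.0, 4.0]
b = [2.0, 4.0, 6.0, 8.0]      # perfectly correlated with a
c = [1.0, -1.0, -1.0, 1.0]    # uncorrelated with a
y = [2.0, 1.0, 2.0, 5.0]
print(is_admissible([a, c], y), is_admissible([a, b], y))
```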
In an embodiment, the at least one selected path Pl is identified based on a relative error value associated with the variable vq.
In an embodiment, the steps (ii) till (xiii) are iteratively performed either sequentially or in parallel across multiple processor threads.
In an embodiment, the steps (iv) till (vii) are iteratively performed until the one or more predefined criteria are met, and wherein the one or more predefined criteria further comprise at least one of (a) a predefined number of iterations, (b) frequency of the matrix subset of variables, (c) the optimal objective function reaching a predetermined value, and (d) an improvement in the objective function in subsequent iterations reaching a predefined saturation value or machine epsilon.
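The stopping criteria (a), (c) and (d) listed above may be combined, for example, as in the following sketch. Criterion (b), the frequency of the matrix subset of variables, is deliberately left out for brevity. The objective is assumed to be maximized, and the function name `should_stop` and the default values are hypothetical.

```python
import sys
from typing import Sequence


def should_stop(
    history: Sequence[float],
    max_iter: int = 100,                    # (a) predefined number of iterations
    target: float = 0.95,                   # (c) predetermined objective value
    tol: float = sys.float_info.epsilon,    # (d) saturation value / machine epsilon
) -> bool:
    """`history` holds the best objective value recorded after each iteration."""
    if not history:
        return False
    if len(history) >= max_iter:
        return True
    if history[-1] >= target:
        return True
    # Improvement between the last two iterations has saturated.
    if len(history) >= 2 and abs(history[-1] - history[-2]) <= tol:
        return True
    return False


print(should_stop([0.50, 0.60]), should_stop([0.50, 0.50]))
```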
In an embodiment, the steps (ix) till (xii) are performed either sequentially or in parallel across multiple processor threads.
The method may further comprise, upon satisfying the one or more predefined criteria and upon iteratively performing the steps (ii) till (xiii), generating a final optimal subset based on the identified optimal element Xoptimal for predicting biological responses of chemical compounds.
In an embodiment, the step of initializing a population Spop comprising matrix subsets {X1, X2, . . . Xpop} of variables selected from the filtered set of variables Xf is preceded by transforming the set of variables X to obtain a set of transformed variables; and filtering the set of transformed variables based on one or more statistical criteria and predefined thresholds to obtain a filtered set of variables Xf.
In an embodiment, the one or more predictive models comprise one or more linear regression models, one or more non-linear regression models, or one or more classification models, and wherein the one or more predictive models and the identified optimal subset Xoptimal are used for generating one or more rules or alerts.
In another aspect, a processor implemented population based exhaustive replacement system for selecting one or more optimal variables from a set of unique variables and generating predictive models thereof is provided. The system comprises a memory storing instructions; one or more communication interfaces; and one or more hardware processors coupled to the memory via the one or more communication interfaces, wherein the one or more hardware processors are configured by the instructions to: (i) receive a set of physico-chemical properties X derived from chemical structures of drugs and drug like chemical compounds, a biological response Y associated thereof and a size of variables set r; (ii) initialize a population Spop comprising matrix subsets {X1, X2, . . . Xpop} of variables selected from a filtered set of variables Xf of size n, wherein Xf is derived from a set of variables X, wherein Spop is initialized based on one or more pre-defined criteria, wherein pop represents size of the population, and wherein each matrix subset Xi from Spop comprises variables that are unique from each other and is of the size of variables set r; (iii) select at least one subset Xi from Spop and at least one path Pl of the at least one selected subset Xi, wherein the at least one selected path Pl comprises a variable vq to be replaced, wherein each variable vq comprised in the at least one selected path Pl is a vector of size m describing a property of input chemical compounds, and wherein m represents the number of chemical compounds used for building predictive models; (iv) replace the variable vq from at least one subset Xi with remaining (n−r) variables of the set of variables Xf to obtain a set of modified subsets X′i {X′i1, X′i2, . . . , X′i(n-r)}, wherein each of the modified subsets X′ij comprises replaced variables for the at least one selected path Pl, and wherein size of the set of modified subsets X′i is of (n−r); (v) generate a predictive model for each of the modified subsets X′ij and calculate an objective function thereof to obtain a first set of predictive models Mi,rm and associated objective functions OFi,rm, wherein the first set of predictive models Mi,rm are generated based on the biological response Y and each of the modified subsets X′ij variable vectors of a set of input chemical compounds; (vi) identify an optimal modified subset of replaced variables Xioptimal from the set of modified subsets X′i based on an optimal objective function associated with an optimal predictive model Mioptimal being identified from the first set of predictive models Mi,rm; (vii) update the at least one selected path Pl for the optimal modified subset of replaced variables Xioptimal, wherein the steps (iv) till (vii) are iteratively performed until one or more predefined criteria are met to obtain an optimal population Spopoptimal; (viii) identify an optimal element Xoptimal of Spopoptimal, wherein Xoptimal comprises an optimal objective function amongst objective functions comprised in other Xioptimal; (ix) perform an exhaustive search on a pool of variable Xpool created using the optimal population Spopoptimal to obtain a set of variable subsets Sx, wherein each element of Sx comprises set of r variables; (x) generate the predictive model for each element of Sx and calculate the objective function thereof to obtain a second set of predictive models Mx and associated objective functions OFx; (xi) identify a pop number of optimal elements from Sx to update the optimal population Spopoptimal and to obtain an updated population Ses; (xii) identify an optimal element Xoptimal,es amongst elements Xes comprised in the updated population Ses, and compare an objective function of Xoptimal,es with an objective function of the identified optimal element Xoptimal for updation of the identified optimal element Xoptimal; (xiii) randomly replace variables of the updated population Ses to obtain a perturbed set of population Sp; and (xiv) generate one or more predictive models based on the selected optimal subset of variables Xoptimal.
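The random replacement of step (xiii) may be sketched, purely by way of example, as swapping a chosen number of positions in each subset for variables drawn from outside that subset, so that the variables in each subset remain unique. The number of swaps per subset, the integer-index representation of variables, the optional seed, and the name `perturb` are assumptions made for illustration.

```python
import random
from typing import List, Optional, Sequence


def perturb(
    population: Sequence[Sequence[int]],
    n: int,
    n_swaps: int = 1,
    seed: Optional[int] = None,
) -> List[List[int]]:
    """Return a perturbed copy of the population in which `n_swaps` randomly
    chosen positions of each subset are replaced by variables not in it."""
    rng = random.Random(seed)
    perturbed = []
    for subset in population:
        new = list(subset)
        for _ in range(n_swaps):
            outside = [v for v in range(n) if v not in new]
            if not outside:
                break  # the subset already contains every variable
            new[rng.randrange(len(new))] = rng.choice(outside)
        perturbed.append(new)
    return perturbed


print(perturb([[0, 1, 2], [3, 4, 5]], n=10, seed=0))
```

The perturbed population can then seed the next iteration, which gives the search a chance to leave a local optimum reached by the replacement and exhaustive search steps.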
In an embodiment, the one or more predefined criteria comprise: (i) subsets in which variables have inter correlation amongst each other below a first predefined threshold; (ii) subsets whose variables are correlated with the biological response Y above a second defined threshold or a system generated dynamic threshold; (iii) seeded subsets having variables whose sizes are less than a current subset size; and (iv) user defined preferences for the set of variables X.
In an embodiment, the at least one selected path Pl is identified based on a relative error value associated with the variable vq.
In an embodiment, the steps (ii) till (xiii) are iteratively performed either sequentially or in parallel across multiple processor threads.
In an embodiment, the steps (iv) till (vii) are iteratively performed until the one or more predefined criteria are met, and wherein the one or more predefined criteria further comprise at least one of (a) a predefined number of iterations, (b) frequency of the matrix subset of variables, (c) the optimal objective function reaching a predetermined value, and (d) an improvement in the objective function in subsequent iterations reaching a predefined saturation value or machine epsilon.
In an embodiment, the steps (ix) till (xii) are performed either sequentially or in parallel across multiple processor threads.
In an embodiment, the one or more hardware processors are further configured by the instructions to generate a final optimal subset based on the identified optimal element Xoptimal for predicting biological responses of chemical compounds, upon satisfying the one or more predefined criteria and upon iteratively performing the steps (ii) till (xiii).
In an embodiment, the step of initializing a population Spop comprising matrix subsets {X1, X2, . . . Xpop} of variables selected from the filtered set of variables Xf is preceded by transforming the set of variables X to obtain a set of transformed variables; and filtering the set of transformed variables based on one or more statistical criteria and predefined thresholds to obtain a filtered set of variables Xf.
In yet another aspect, there are provided one or more non-transitory machine readable information storage mediums comprising one or more instructions which when executed by one or more hardware processors cause selecting one or more optimal variables from a set of unique variables and generating predictive models thereof by (i) receiving, via one or more hardware processors, a set of physico-chemical properties X derived from chemical structures of drugs and drug like chemical compounds, a biological response Y associated thereof and a size of variables set r; (ii) initializing a population Spop comprising matrix subsets {X1, X2, . . . Xpop} of variables selected from a filtered set of variables Xf of size n, wherein Xf is derived from a set of variables X, wherein Spop is initialized based on one or more pre-defined criteria, wherein pop represents size of the population, and wherein each matrix subset Xi from Spop comprises variables that are unique from each other and is of the size of variables set r; (iii) selecting at least one subset Xi from Spop and at least one path Pl of the at least one selected subset Xi, wherein the at least one selected path Pl comprises a variable vq to be replaced, wherein each variable vq comprised in the at least one selected path Pl is a vector of size m describing a property of input chemical compounds, and wherein m represents the number of chemical compounds used for building predictive models; (iv) replacing the variable vq from at least one subset Xi with remaining (n−r) variables of the set of variables Xf to obtain a set of modified subsets X′i {X′i1, X′i2, . . . , X′i(n-r)}, wherein each of the modified subsets X′ij comprises replaced variables for the at least one selected path Pl, and wherein size of the set of modified subsets X′i is of (n−r); (v) generating a predictive model for each of the modified subsets X′ij and calculating an objective function thereof to obtain a first set of predictive models Mi,rm and associated objective functions OFi,rm, wherein the first set of predictive models Mi,rm are generated based on the biological response Y and each of the modified subsets X′ij variable vectors of a set of input chemical compounds; (vi) identifying an optimal modified subset of replaced variables Xioptimal from the set of modified subsets X′i based on an optimal objective function associated with an optimal predictive model Mioptimal being identified from the first set of predictive models Mi,rm; (vii) updating the at least one selected path Pl for the optimal modified subset of replaced variables Xioptimal, wherein the steps (iv) till (vii) are iteratively performed until one or more predefined criteria are met to obtain an optimal population Spopoptimal; (viii) identifying an optimal element Xoptimal of Spopoptimal, wherein Xoptimal comprises an optimal objective function amongst objective functions comprised in other Xioptimal; (ix) performing an exhaustive search on a pool of variable Xpool created using the optimal population Spopoptimal to obtain a set of variable subsets Sx, wherein each element of Sx comprises set of r variables; (x) generating the predictive model for each element of Sx and calculating the objective function thereof to obtain a second set of predictive models Mx and associated objective functions OFx; (xi) identifying a pop number of optimal elements from Sx to update the optimal population Spopoptimal and to obtain an updated population Ses; (xii) identifying an optimal element Xoptimal,es amongst elements Xes comprised in the updated population Ses, and comparing an objective function of Xoptimal,es with an objective function of the identified optimal element Xoptimal for updation of the identified optimal element Xoptimal; (xiii) randomly replacing variables of the updated population Ses to obtain a perturbed set of population Sp; and (xiv) generating one or more predictive models based on the selected optimal subset of variables Xoptimal.
In an embodiment, the one or more predefined criteria comprise: (i) subsets in which variables have inter correlation amongst each other below a first predefined threshold; (ii) subsets whose variables are correlated with the biological response Y above a second defined threshold or a system generated dynamic threshold; (iii) seeded subsets having variables whose sizes are less than a current subset size; and (iv) user defined preferences for the set of variables X.
In an embodiment, the at least one selected path Pl is identified based on a relative error value associated with the variable vq.
In an embodiment, the steps (ii) till (xiii) are iteratively performed either sequentially or in parallel across multiple processor threads.
In an embodiment, the steps (iv) till (vii) are iteratively performed until the one or more predefined criteria are met, and wherein the one or more predefined criteria further comprise at least one of (a) a predefined number of iterations, (b) frequency of the matrix subset of variables, (c) the optimal objective function reaching a predetermined value, and (d) an improvement in the objective function in subsequent iterations reaching a predefined saturation value or machine epsilon.
In an embodiment, the steps (ix) till (xii) are performed either sequentially or in parallel across multiple processor threads.
The method may further comprise, upon satisfying the one or more predefined criteria and upon iteratively performing the steps (ii) till (xiii), generating a final optimal subset based on the identified optimal element Xoptimal for predicting biological responses of chemical compounds.
In an embodiment, the step of initializing a population Spop comprising matrix subsets {X1, X2, . . . Xpop} of variables selected from the filtered set of variables Xf is preceded by transforming the set of variables X to obtain a set of transformed variables; and filtering the set of transformed variables based on one or more statistical criteria and predefined thresholds to obtain a filtered set of variables Xf.
In an embodiment, the one or more predictive models comprise one or more linear regression models, one or more non-linear regression models, or one or more classification models, and wherein the one or more predictive models and the identified optimal subset Xoptimal are used for generating one or more rules or alerts.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the invention, as claimed.
The accompanying drawings, which are incorporated in and constitute a part of this disclosure, illustrate exemplary embodiments and, together with the description, serve to explain the disclosed principles.
Exemplary embodiments are described with reference to the accompanying drawings. In the figures, the left-most digit(s) of a reference number identifies the figure in which the reference number first appears. Wherever convenient, the same reference numbers are used throughout the drawings to refer to the same or like parts. While examples and features of disclosed principles are described herein, modifications, adaptations, and other implementations are possible without departing from the scope of the disclosed embodiments.
Expression ‘Subset’ refers to a small set of elements which are taken from a larger set of elements/variables/features.
Expression ‘Path’ refers to a position of the element to be replaced in the subset.
Expression ‘Descriptor(s)’ refers to indices of the features/variables/attributes of a chemical compound.
Expression ‘Exhaustive Search’ refers to a technique of searching all the possible combinations of a given subset size.
Expression ‘Replacement’ refers to replacing a current descriptor with remaining descriptors.
Expression ‘MLR (Multiple Linear Regression)’ refers to a technique that models the relationship between one continuous dependent variable and two or more independent variables.
Expression ‘Objective Function’ refers to a function whose numeric value is to be maximized or minimized.
Expression ‘Convergence’ refers to an act of converging and especially moving toward union or uniformity.
Expression ‘model’ refers to a statistical predictive model having dependent variable and independent variables.
Expression ‘Predictivity of Model’ refers to an ability of the model to predict the biological responses of compound(s) using statistics. Biological response refers to the changes measured or predicted within a biological system e.g., drug targets, animals, humans on exposure to a drug, marketed or under development within a pharmaceutical industry.
Expression ‘Correlation’ refers to a statistical technique that is used to measure and describe the strength and direction of the relationship between two variables.
Expression ‘Population’ refers to a set of subsets having descriptors.
Expression ‘R-Squared’ refers to a coefficient of determination and is the proportion of the variance in the dependent variable that is predictable from the independent variables.
Expression ‘Optimal subset’ refers to a subset which has the highest R-Squared in comparison with all the other possible subsets.
Expression ‘Pool’ refers to a group of descriptors which are highly correlated with the dependent variable.
Expression ‘Deterministic’ refers to the absence of randomness, such that the same output is produced from a given starting condition or initial state.
Expression ‘Redundant’ refers to data duplication.
Expression ‘Over-fitting’ refers to production of an analysis that corresponds too closely or exactly to a particular set of data and may therefore fail to fit additional data or predict future observations reliably.
Expression ‘Standard-error/Standard-deviation’ is a measure that is used to quantify the amount of variation or dispersion of a set of data values.
Expression ‘Perturbation’ refers to a method of introducing noise in the output or a subset.
All the above expressions shall be interpreted in light of the context and detailed description as described herein in the present disclosure.
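To make the expressions ‘MLR’, ‘Objective Function’ and ‘R-Squared’ concrete, the following minimal sketch fits a multiple linear regression by solving the normal equations and computes R-Squared as one minus the ratio of the residual sum of squares to the total sum of squares. It is an illustrative sketch only: the normal-equation solver, the singularity tolerance of 1e-12, and the names `fit_mlr` and `r_squared` are assumptions, and a numerical library would normally be preferred for real data.

```python
from typing import List, Sequence


def fit_mlr(X: Sequence[Sequence[float]], y: Sequence[float]) -> List[float]:
    """Least-squares coefficients [intercept, b1, ..., br] via the normal equations."""
    rows = [[1.0, *row] for row in X]  # prepend a column of ones for the intercept
    k = len(rows[0])
    # Augmented normal-equation matrix [A^T A | A^T y].
    aug = [
        [sum(row[i] * row[j] for row in rows) for j in range(k)]
        + [sum(row[i] * t for row, t in zip(rows, y))]
        for i in range(k)
    ]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(k):
        pivot = max(range(col, k), key=lambda i: abs(aug[i][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("singular design matrix (collinear variables)")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for i in range(k):
            if i != col:
                factor = aug[i][col] / aug[col][col]
                aug[i] = [a - factor * b for a, b in zip(aug[i], aug[col])]
    return [aug[i][k] / aug[i][i] for i in range(k)]


def r_squared(X: Sequence[Sequence[float]], y: Sequence[float],
              coef: Sequence[float]) -> float:
    """Coefficient of determination: 1 - SS_res / SS_tot (y must not be constant)."""
    pred = [coef[0] + sum(c * v for c, v in zip(coef[1:], row)) for row in X]
    mean_y = sum(y) / len(y)
    ss_res = sum((t - p) ** 2 for t, p in zip(y, pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y)
    return 1.0 - ss_res / ss_tot


# Toy data generated from y = 1 + 2*x1 + 3*x2 with no noise.
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 1.0], [0.0, 0.0]]
y = [3.0, 4.0, 6.0, 8.0, 1.0]
coef = fit_mlr(X, y)
print(coef, r_squared(X, y, coef))
```

When R-Squared is used as the objective function, each candidate subset selects the columns of X that are passed to the fit, and the subset with the highest resulting value is retained.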
Referring now to the drawings, and more particularly to
The I/O interface device(s) 106 can include a variety of software and hardware interfaces, for example, a web interface, a graphical user interface, and the like and can facilitate multiple communications within a wide variety of networks N/W and protocol types, including wired networks, for example, LAN, cable, etc., and wireless networks, such as WLAN, cellular, or satellite. In an embodiment, the I/O interface device(s) can include one or more ports for connecting a number of devices to one another or to another server.
The memory 102 may include any computer-readable medium known in the art including, for example, volatile memory, such as static random access memory (SRAM) and dynamic random access memory (DRAM), and/or non-volatile memory, such as read only memory (ROM), erasable programmable ROM, flash memories, hard disks, optical disks, and magnetic tapes. In an embodiment, a database 108 is stored in the memory 102, wherein the database 108 may comprise, but is not limited to, information pertaining to physico-chemical properties, chemical structures of drugs and drug like compounds, properties of input chemical compounds, rules or alerts, various models that are generated and executed for prediction of biological response, predefined threshold values, and configuration details of the system during the training phase and the test/validation phase to perform the methodology described herein. The memory 102 further comprises (or may further comprise) information pertaining to input(s)/output(s) of each step performed by the systems and methods of the present disclosure.
In the present disclosure, the one or more predefined criteria comprise: (i) subsets in which variables have inter correlation amongst each other below a first predefined threshold; (ii) subsets whose variables are correlated with the biological response Y above a second defined threshold or a system generated dynamic threshold; (iii) seeded subsets having variables whose sizes are less than a current subset size; and (iv) user defined preferences for the set of variables X.
At step 306, the one or more hardware processors 104 select at least one subset Xi from Spop and at least one path Pl of the at least one selected subset Xi, wherein the at least one selected path Pl comprises a variable vq to be replaced. Each variable vq comprised in the at least one selected path Pl is a vector of size m describing a property of input chemical compounds and wherein m represents the number of chemical compounds used for building predictive models. In the present disclosure, the at least one selected path Pl is identified based on a relative error value associated with the variable vq.
At step 308, the one or more hardware processors 104 replace the variable vq of the at least one subset Xi with each of the remaining (n−r) variables of the set of variables Xf to obtain a set of modified subsets X′i {X′i1, X′i2, . . . , X′i(n-r)}, wherein each of the modified subsets X′ij comprises a replaced variable for the at least one selected path Pl, and wherein the size of the set of modified subsets X′i is (n−r) (e.g., refer step 208 of
At step 316, the one or more hardware processors 104 identify an optimal element Xoptimal of Spopoptimal, wherein Xoptimal comprises an optimal objective function amongst objective functions comprised in other Xioptimal.
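The replacement of step 308 and the selection of step 316 may be illustrated, in simplified form, by the following sketch. Variables are represented by integer indices, the position `q` stands for the selected path, and the convention that a lower objective value is better is an assumption of this sketch; the intermediate steps between 308 and 316 are not modelled.

```python
def replace_position(subset, q, n_variables):
    """Return the (n - r) modified subsets obtained by replacing position q."""
    in_subset = set(subset)
    modified = []
    for v in range(n_variables):
        if v in in_subset:
            continue  # only the remaining (n - r) variables are candidates
        candidate = list(subset)
        candidate[q] = v
        modified.append(tuple(candidate))
    return modified


def best_subset(candidates, objective):
    """Pick the candidate with the smallest objective value (assumed: lower is better)."""
    return min(candidates, key=objective)


# Hypothetical toy usage: n = 6 variables, subset of size r = 3, replace position 1.
modified = replace_position((0, 1, 2), q=1, n_variables=6)
winner = best_subset(modified, objective=lambda s: abs(sum(s) - 7))
```

With n = 6 and r = 3 the sketch yields exactly n − r = 3 modified subsets, matching the size of X′i stated for step 308.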
The steps 306 till 316 are described by way of a flow chart depicted in
At step 318, the one or more hardware processors 104 perform an exhaustive search on a pool of variables Xpool created (e.g., refer step 212 of
The steps 318 till 324 are described by way of a flow chart depicted in
At step 326, the one or more hardware processors 104 randomly replace variables of the updated population Ses to obtain a perturbed set of population Sp (e.g., refer step 216 of
In the present disclosure, the steps 304 till 326 are iteratively performed either sequentially or in parallel across multiple processor threads. Further, the steps 318 till 324 are performed either sequentially or in parallel across multiple processor threads to reduce execution time. This parallelization of the population based exhaustive replacement method provides a technical solution to the technical problem of optimal variable/feature selection by reducing the processing time, and eventually generates optimal predictive models.
Referring back to
The modifications discussed in the present disclosure improve the search capability of the method in comparison to conventional approaches, and thus a larger number of variables can be selected, which improves the predictive performance of the generated models. Further, the selected features provide insights into the physico-chemical properties or structural features to be optimized to improve the discovery of drug candidates with better property profiles, consequently increasing the success rate and reducing the cost of drug discovery and development within the pharmaceutical industry. The method of the present disclosure was also executed on various parallel architectures to demonstrate that Parallel PERM can be used to achieve better and more reliable results in real time even for larger datasets, wherein the selection of a larger number of variables is faster. Parallelization of PERM can be done on various shared and distributed memory architectures in which the workload is distributed among the threads to reduce computation memory and time. In an example embodiment, PERM can be parallelized (e.g., refer
In an embodiment, when parallelizing PERM, the population initialization step can be executed in parallel on a number of compute threads, each of which searches for a subset satisfying the predefined conditions. As the population must contain unique subsets, the parallel initialization of the population can create some duplicates. The present disclosure may minimize the duplicates by seeding the random number generator with the population index, time, or other values as appropriate.
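A minimal sketch of index-based seeding is given below. The function names and the `base_seed` parameter are hypothetical, the check against the predefined conditions is omitted, and the redraw of a duplicate with the next index is an assumption of this sketch rather than the behaviour of the disclosed system.

```python
import random


def init_subset(population_index, n_variables, subset_size, base_seed=0):
    """Draw one random subset using a generator seeded by the population index."""
    # Each population slot gets its own generator, so slots can be filled
    # independently (for example on different threads) and reproducibly.
    rng = random.Random(base_seed + population_index)
    return tuple(sorted(rng.sample(range(n_variables), subset_size)))


def init_population(pop_size, n_variables, subset_size, base_seed=0):
    """Build a population of unique subsets; a duplicate is redrawn with a new index.

    Assumes pop_size does not exceed the number of distinct subsets.
    """
    population, seen = [], set()
    next_index = 0
    while len(population) < pop_size:
        subset = init_subset(next_index, n_variables, subset_size, base_seed)
        next_index += 1
        if subset not in seen:
            seen.add(subset)
            population.append(subset)
    return population
```

Because each slot depends only on its own index, two workers never share generator state, which is what keeps accidental duplicates rare in this sketch.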
In the same embodiment, the system can be parallelized in more than one way, some of which are described below:
The present disclosure can be parallelized based on the architecture type and parallelization technology used. The replacement technique was parallelized on multi-core processors using OpenMP and on many-core processors using CUDA.
In an example embodiment, the system was parallelized on OpenMP using the above-described techniques and additional modifications as described below. In this embodiment, replacement (RM) was performed with T OpenMP threads on an initial subset X(m, r), with replacement in two different paths. For example, Path1 and Path2 were replaced with descriptors iteratively in nested loops, and the Path2 replacement was executed in parallel using OpenMP. The load (L=n/T) of replacing Path2 was balanced equally across the T threads. Each thread has a range of ranks from which it obtains the combination X′(m, r) of descriptors, performs correlation checks, calculates the objective function O·F(X′), and updates XToptimal with X′ if it is better. Each thread may thus have a thread optimal subset. Once all the threads have evaluated the subsets assigned to them, the objective functions across threads are compared and Xoptimal is updated with XToptimal.
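The load balancing and the final comparison across threads may be sketched as follows. Python's `ThreadPoolExecutor` is used here only as a stand-in for OpenMP threads to show the decomposition (it does not provide the same speed up), and the function names, the integer representation of descriptors, and the lower-is-better objective are assumptions of this sketch. Correlation checks are omitted.

```python
from concurrent.futures import ThreadPoolExecutor


def chunk_ranges(n_items, n_threads):
    """Split range(n_items) into n_threads contiguous, near-equal ranges (load ~ n/T)."""
    base, extra = divmod(n_items, n_threads)
    ranges, start = [], 0
    for t in range(n_threads):
        size = base + (1 if t < extra else 0)
        ranges.append((start, start + size))
        start += size
    return ranges


def thread_optimal(subset, position, start, stop, objective):
    """Evaluate replacements start..stop-1 at one position; return (score, subset) or None."""
    best = None
    for v in range(start, stop):
        if v in subset:
            continue  # descriptor already present in the subset
        candidate = subset[:position] + (v,) + subset[position + 1:]
        scored = (objective(candidate), candidate)
        if best is None or scored < best:
            best = scored
    return best


def parallel_replace(subset, position, n_variables, n_threads, objective):
    """Replace one position across n_threads workers, then reduce the per-thread optima.

    Assumes at least one descriptor lies outside the subset.
    """
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        futures = [pool.submit(thread_optimal, subset, position, a, b, objective)
                   for a, b in chunk_ranges(n_variables, n_threads)]
        results = [f.result() for f in futures]
    return min(r for r in results if r is not None)
```

Keeping one optimum per thread and comparing only at the end avoids any shared state inside the loop, which mirrors the role of XToptimal in the description above.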
In another example embodiment, the system executes the two nested loops of Path1 and Path2 in parallel simultaneously. In this parallel method, the number of combinations obtained by replacing the two paths/positions with the remaining descriptors is calculated, and this load is balanced across T OpenMP threads. Each thread identifies the subsets it needs to evaluate, performs the correlation check and the objective function evaluation, and updates XToptimal with X′ if it is better. In this way each thread has a thread optimal subset. Once all the threads complete execution of their respective loads, the O·F values across the threads are compared and Xoptimal is updated with XToptimal.
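One possible way to flatten the two nested loops into a single range of ranks and divide it among T threads is sketched below. The rank-to-pair mapping via `divmod`, the contiguous share per thread, and all function names are assumptions of this sketch; the disclosure does not specify how ranks are assigned. Correlation checks are omitted and a lower objective value is assumed to be better.

```python
def rank_to_pair(rank, n_variables):
    """Map a flat rank to the (Path1, Path2) replacement pair of the nested loops."""
    return divmod(rank, n_variables)


def thread_load(thread_id, n_threads, total):
    """Contiguous [start, stop) share of `total` ranks for one thread."""
    start = thread_id * total // n_threads
    stop = (thread_id + 1) * total // n_threads
    return start, stop


def evaluate_share(subset, p1, p2, n_variables, thread_id, n_threads, objective):
    """Thread-local optimum over this thread's share of two-position replacements."""
    best = None
    start, stop = thread_load(thread_id, n_threads, n_variables * n_variables)
    for rank in range(start, stop):
        v1, v2 = rank_to_pair(rank, n_variables)
        if v1 == v2 or v1 in subset or v2 in subset:
            continue  # only distinct descriptors from outside the subset are used
        candidate = list(subset)
        candidate[p1], candidate[p2] = v1, v2
        scored = (objective(tuple(candidate)), tuple(candidate))
        if best is None or scored < best:
            best = scored
    return best
```

Each call is independent of the others, so the T shares could be evaluated concurrently and their results compared afterwards to update Xoptimal.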
In yet another example embodiment, the present disclosure is executed in parallel on an Nvidia graphics card using CUDA. In one example execution of sequential PERM, Path1 and Path2 have been replaced with descriptors iteratively in nested loops. These two nested loops were parallelized together in the CUDA implementation and were executed on b blocks with t threads per block. Here the number of blocks and the number of threads per block are each equal to the number of descriptors n. For the given subset X(m, r), Path1 and Path2 were replaced with the block index and the thread index respectively. In this way each thread in the block has a modified subset X′(m, r), for which it performs the correlation check, calculates the objective function O·F(X′), and updates the thread optimal subset. Subsequently, a reduction operation was performed across the threads within the block and Xblock,optimal was calculated. In this way each block has XToptimal with Path2 replaced with the best descriptor. Then the reduction operation is performed across the blocks and Xoptimal is updated with Xblock,optimal.
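The index mapping and the two-level reduction described above may be simulated sequentially as in the sketch below. This is not CUDA code: the loops stand in for blocks and threads, the function names are hypothetical, correlation checks are omitted, and a lower objective value is assumed to be better.

```python
def block_optimal(subset, p1, p2, block_idx, n_variables, objective):
    """Reduce over the simulated threads of one block: Path1 = block_idx, Path2 varies."""
    best = None
    if block_idx in subset:
        return best  # this block has no valid replacement for Path1
    for thread_idx in range(n_variables):  # one simulated thread per descriptor
        if thread_idx == block_idx or thread_idx in subset:
            continue
        candidate = list(subset)
        candidate[p1], candidate[p2] = block_idx, thread_idx
        scored = (objective(tuple(candidate)), tuple(candidate))
        if best is None or scored < best:
            best = scored
    return best


def grid_optimal(subset, p1, p2, n_variables, objective):
    """Second-level reduction across the per-block optima (assumes one exists)."""
    per_block = [block_optimal(subset, p1, p2, b, n_variables, objective)
                 for b in range(n_variables)]
    return min(r for r in per_block if r is not None)
```

The first function corresponds to the reduction within a block and the second to the reduction across blocks; on a GPU both levels would run concurrently rather than in loops.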
Below are results of experiments conducted using the embodiments of the present disclosure for various case studies relevant to drug discovery and development applications within the pharmaceutical industry.
The binding affinity of new chemical entities (NCEs) to Human Serum Albumin (HSA) is one of the important ADMET properties considered in drug discovery and development. HSA is probably the most extensively studied protein because of its abundance, low cost, ease of purification and stability. It plays a central role in drug pharmacokinetics, particularly in the distribution of drugs. Most drugs are transported in bound form to HSA and reach the target tissues. HSA allows solubilization of hydrophobic compounds, thus contributing to a homogeneous distribution of drugs in the body and increasing their biological lifetime. Binding of a drug to serum albumin is a reversible process and is therefore in an equilibrium state. Only the unbound drug molecules contribute to the pharmacological efficacy; however, they are equally susceptible to metabolic reactions. Given the high concentration of albumin, the binding strength of any drug to serum albumin is the main factor that determines the availability of that drug and, consequently, the diffusion of the drug from the circulatory system to target tissues. All these factors cause the pharmacokinetics of almost any drug to be dramatically influenced and controlled by its binding affinity to serum albumin.
In this case study, the system 100 built global QSPR models for HSA binding affinity using a dataset of 84 drugs and drug-like compounds, originally reported by Gonzalo Colmenarejo et al., keeping log K0hsa as the dependent variable (Y). In the study by the embodiments of the present disclosure, a total of 392 physico-chemical descriptors were used in the QSPR model generation. These 392 molecular descriptors were calculated using the system of the present disclosure, and they fall into the following classes: (1) structural descriptors, (2) physico-chemical descriptors, (3) geometrical descriptors and (4) topological descriptors.
Herein, a practical application of the present disclosure and its associated embodiments, systems and methods is discussed, and parallelization of the method of the present disclosure was performed to derive global QSPR models for human serum albumin binding affinity. In the case study being discussed in the present disclosure, HSA binding has been modeled using multiple linear regression (MLR) based on equation (1) given below, and the predictive ability of each feature subset was evaluated using the correlation value (r2) of the MLR model built on that subset, calculated using equation (2).
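For illustration only, a multiple linear regression fit and its r2 value can be computed as in the sketch below. It assumes the standard least-squares form y ≈ b0 + Σ bj·xj and the standard coefficient of determination 1 − SS_res/SS_tot; these may differ in detail from equations (1) and (2) of the present disclosure, and the function names are hypothetical.

```python
def solve_linear(a, b):
    """Solve a·x = b by Gaussian elimination with partial pivoting.

    Raises ZeroDivisionError if the system is singular.
    """
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def fit_mlr(columns, y):
    """Least-squares coefficients [b0, b1, ..., br] via the normal equations."""
    design = [[1.0] + [col[i] for col in columns] for i in range(len(y))]
    p = len(design[0])
    xtx = [[sum(row[j] * row[k] for row in design) for k in range(p)] for j in range(p)]
    xty = [sum(row[j] * yi for row, yi in zip(design, y)) for j in range(p)]
    return solve_linear(xtx, xty)


def r_squared(columns, y, coeffs):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    predicted = [coeffs[0] + sum(b * col[i] for b, col in zip(coeffs[1:], columns))
                 for i in range(len(y))]
    mean_y = sum(y) / len(y)
    ss_res = sum((yi - pi) ** 2 for yi, pi in zip(y, predicted))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1.0 - ss_res / ss_tot
```

In a variable selection loop, `r_squared` (or a quantity derived from it) would serve as the objective function evaluated for each candidate subset of descriptor columns.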
The present disclosure has transformed the generated 396 descriptors into various forms such as square, exponential, logarithmic and others, and filtered each transformation using a correlation threshold with Y of above 0.1. The system 100 has filtered 288 out of 396 descriptors using this criterion. Reproducibility of the present disclosure was assessed by executing the system for ten runs, and consistency of results across these runs was observed. It is observed that PERM is able to generate reproducible selections of variables and regression models and thus is an improved solution that increases the reliability of the model results. Table 1 below compares the results of PERM with conventional approaches (ERM).
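The transformation and filtering step may be sketched as follows. The particular set of transforms, the use of the absolute Pearson correlation, the handling of values for which a transform is undefined, and all names are assumptions of this sketch.

```python
from math import exp, log, sqrt


def pearson(a, b):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0
    return cov / sqrt(var_a * var_b)


# Illustrative transforms only; the disclosure also mentions "others".
TRANSFORMS = {
    "identity": lambda x: x,
    "square": lambda x: x * x,
    "exp": exp,
    "log": log,
}


def transform_and_filter(descriptors, y, threshold=0.1):
    """Return {(name, transform): values} for columns with |r(Y)| above threshold."""
    kept = {}
    for name, values in descriptors.items():
        for label, fn in TRANSFORMS.items():
            try:
                transformed = [fn(v) for v in values]
            except (ValueError, OverflowError):
                continue  # e.g. log of a non-positive value, or exp overflow
            if abs(pearson(transformed, y)) > threshold:
                kept[(name, label)] = transformed
    return kept
```

Applying transforms before filtering lets a descriptor that is only non-linearly related to Y survive the correlation filter in one of its transformed forms.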
For the above case study, a set of two to three subsets with inter correlation values rx
Further, PERM of the present disclosure was parallelized on two high performance computing architectures: (1) Open Multi-Processing (OpenMP) and (2) Compute Unified Device Architecture (CUDA) on a graphics processing unit (GPU). The configuration of the system used in this case study is an Intel® Xeon® CPU E5-2609 v3 @ 1.90 GHz with 6 cores and an NVIDIA Quadro K4200 GPU. Table 2 below depicts a comparison of the serial and parallel PERM approaches as described by the present disclosure herein.
The serial, OpenMP and CUDA versions of parallel PERM have been written in the C language, in one example embodiment of the present disclosure. In addition, the -Ofast or -O3 gcc compiler optimization levels were used while compiling the serial and OpenMP codes of the algorithm. A speed up of 5-15× was achieved using OpenMP and 9-20× using CUDA for various subset sizes.
Table 3 below depicts the selected optimal descriptors for predicting the human serum albumin binding affinity of drugs and drug-like compounds.
In another example embodiment, the present disclosure is used to predict Epidermal growth factor receptor (EGFR) kinase binding affinity of drugs and drug candidates. EGFR kinase, a transmembrane glycoprotein, is a proven anticancer target, and many EGFR inhibitors such as gefitinib, lapatinib, etc. have been developed and approved by the FDA for the treatment of cancers, including breast cancer. In this case study, the application of the present disclosure to the prediction of potential EGFR kinase inhibitors with high binding affinity is demonstrated, and thus the impact of the solution on drug discovery and development.
In this case study, the dataset was obtained from a publicly available kinase knowledge base. A number of filters were applied to the 93750 data entries obtained from the public source to maintain homogeneity of the data. Out of these, 4893 compounds with in-vitro EGFR kinase enzyme binding assay pIC50 data were selected for demonstrating the application of the technical solution described in the present disclosure. The following data processing steps were performed to select these compounds.
The system 100, after filtering and computing the values of pIC50, computed 352 physico-chemical descriptors for the selected 4893 compounds that fall into the following classes: (1) structural descriptors, (2) physico-chemical descriptors, (3) geometrical descriptors and (4) topological descriptors. These 352 descriptors were filtered using zero variance, resulting in 350 descriptors for modelling.
The pIC50 values of these 4893 selected compounds range from 2.18 to 11.6. Based on the pIC50 values, a stratified random sampling of the dataset was performed in a ratio of 70:30 to create training and test datasets, resulting in training data with 3430 compounds and test data with 1463 compounds.
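One simple way to perform a stratified 70:30 split on a continuous response is sketched below. The equal-width binning of the pIC50 range, the number of bins, the seed, and the function name are assumptions of this sketch; the disclosure does not state how the strata were formed.

```python
import random


def stratified_split(values, train_fraction=0.7, n_bins=5, seed=0):
    """Split indices into train/test so each value bin keeps roughly the same ratio."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins or 1.0  # guard against all-equal values
    bins = [[] for _ in range(n_bins)]
    for idx, v in enumerate(values):
        b = min(int((v - lo) / width), n_bins - 1)  # clamp the maximum into the last bin
        bins[b].append(idx)
    rng = random.Random(seed)
    train, test = [], []
    for members in bins:
        rng.shuffle(members)  # random selection within each stratum
        cut = round(train_fraction * len(members))
        train.extend(members[:cut])
        test.extend(members[cut:])
    return sorted(train), sorted(test)
```

Because the cut is rounded separately in every bin, the overall ratio is close to, but not necessarily exactly, 70:30.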
Herein, a practical application of the present disclosure and its associated embodiments, systems and methods is discussed, and parallelization of the method of the present disclosure was performed to derive global QSPR models for EGFR binding. Like the previous case study discussed herein, the current case study models EGFR binding using the multiple linear regression expression in equation (1) given below, and the predictive ability of each feature subset was evaluated using the correlation value (r2) of the MLR model built on that subset, calculated based on equation (2).
Similarly, in this case study a subset with inter correlation values rx
In this example embodiment, the system 100 has executed PERM in parallel on OpenMP and CUDA. The system used in this case study was an Intel Xeon E5-2620 v2 @ 2.1 GHz machine with 24 cores and a Tesla K20 GPU. Table 4 depicts a comparison of Parallel PERM on the OpenMP and CUDA architectures.
The OpenMP and CUDA versions of parallel PERM have been written in the C language. In addition, the -O3 gcc compiler optimization level was used while compiling the OpenMP code of the algorithm. A speed up of 1.88× to 5.57× was achieved with CUDA in comparison to the OpenMP parallel version for various subset sizes.
The present disclosure has also validated the selected feature/variable subsets on the test set compounds. Table 5 below gives the results for test set prediction (e.g., results of PERM on the test set).
While the drawbacks of conventional approaches as described above are evident, they are especially critical when applied to domains such as pharma, and it is therefore essential to address them effectively. It is also implicit that tuning the control parameters increases computational time. The results of embodiments of the present disclosure show that performing certain critical operations, such as good initial point selection and exhaustive search, is essential to achieving reliable results, rather than merely tuning parameters. The modifications suggested by the present disclosure are not a logical extension of conventional approaches, except for replacing multiple positions simultaneously; rather, they are essential elements for achieving quality results with any feature/variable selection method. The solution provided in the present disclosure is demonstrated to have the ability to select a large number of variables (e.g., twenty). Selecting a large number of variables for regression model generation has great impact in solving real-world predictive regression problems within the pharmaceutical industry. The selected variables offer key insights to practicing scientists involved in optimizing drug candidates, and hence the solution described in the present disclosure can improve the overall productivity of drug discovery in terms of cost and time.
Embodiments of the present disclosure overcome the technical problems of selecting the best possible features from an extremely large number of features in real time with better predictive models, faster convergence, and reduced complexity of the models. The above technical problems have been overcome by the present disclosure by (i) use of transformed data for building regression models, which ensures that any non-linear dependencies are also considered in model building, (ii) selecting the initial population with a set of pre-defined criteria, which ensures a better initial point selection and helps achieve a better convergence rate of the solution, (iii) use of multiple initial points, rather than a single initial point, on which further replacements are performed, which minimizes the chances of local optimization and improves the search capability of the model, (iv) exhaustive search, which ensures that the best model rather than optimal is derived out of the acquired knowledge, which improves predictability and can potentially improve reproducibility of the model, (v) replacing more than one descriptor in an iteration, which increases reproducibility and speeds up convergence as it minimizes the error due to the assumption that the model is poor due to the effect of a single variable, (vi) random perturbations in exhaustive search, which minimize the chances of local optimization, and (vii) provisioning parallelization of PERM as implemented by the system 100 of the present disclosure on various architectures to achieve results in real time with compute time decreased by a factor of 20.
The written description describes the subject matter herein to enable any person skilled in the art to make and use the embodiments. The scope of the subject matter embodiments is defined by the claims and may include other modifications that occur to those skilled in the art. Such other modifications are intended to be within the scope of the claims if they have similar elements that do not differ from the literal language of the claims or if they include equivalent elements with insubstantial differences from the literal language of the claims.
It is to be understood that the scope of the protection is extended to such a program and in addition to a computer-readable means having a message therein; such computer-readable storage means contain program-code means for implementation of one or more steps of the method, when the program runs on a server or mobile device or any suitable programmable device. The hardware device can be any kind of device which can be programmed including e.g. any kind of computer like a server or a personal computer, or the like, or any combination thereof. The device may also include means which could be e.g. hardware means like e.g. an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA), or a combination of hardware and software means, e.g. an ASIC and an FPGA, or at least one microprocessor and at least one memory with software processing components located therein. Thus, the means can include both hardware means and software means. The method embodiments described herein could be implemented in hardware and software. The device may also include software means. Alternatively, the embodiments may be implemented on different hardware devices, e.g. using a plurality of CPUs.
The embodiments herein can comprise hardware and software elements. The embodiments that are implemented in software include but are not limited to, firmware, resident software, microcode, etc. The functions performed by various components described herein may be implemented in other components or combinations of other components. For the purposes of this description, a computer-usable or computer readable medium can be any apparatus that can comprise, store, communicate, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device.
The illustrated steps are set out to explain the exemplary embodiments shown, and it should be anticipated that ongoing technological development will change the manner in which particular functions are performed. These examples are presented herein for purposes of illustration, and not limitation. Further, the boundaries of the functional building blocks have been arbitrarily defined herein for the convenience of the description. Alternative boundaries can be defined so long as the specified functions and relationships thereof are appropriately performed. Alternatives (including equivalents, extensions, variations, deviations, etc., of those described herein) will be apparent to persons skilled in the relevant art(s) based on the teachings contained herein. Such alternatives fall within the scope of the disclosed embodiments. Also, the words “comprising,” “having,” “containing,” and “including,” and other similar forms are intended to be equivalent in meaning and be open ended in that an item or items following any one of these words is not meant to be an exhaustive listing of such item or items, or meant to be limited to only the listed item or items. It must also be noted that as used herein and in the appended claims, the singular forms “a,” “an,” and “the” include plural references unless the context clearly dictates otherwise.
Furthermore, one or more computer-readable storage media may be utilized in implementing embodiments consistent with the present disclosure. A computer-readable storage medium refers to any type of physical memory on which information or data readable by a processor may be stored. Thus, a computer-readable storage medium may store instructions for execution by one or more processors, including instructions for causing the processor(s) to perform steps or stages consistent with the embodiments described herein. The term “computer-readable medium” should be understood to include tangible items and exclude carrier waves and transient signals, i.e., be non-transitory. Examples include random access memory (RAM), read-only memory (ROM), volatile memory, nonvolatile memory, hard drives, CD ROMs, DVDs, flash drives, disks, and any other known physical storage media.
It is intended that the disclosure and examples be considered as exemplary only, with a true scope of disclosed embodiments being indicated by the following claims.
Number | Date | Country | Kind |
---|---|---|---|
202221014386 | Mar 2022 | IN | national |