The present disclosure generally relates to artificial intelligence and machine learning systems, and more particularly, to methods and systems for scalable discovery of top-performing predictive modeling pipelines (hereafter referred to as “leaders”) within a dynamic combinatorial search space using a heuristic-based method.
The term “predictive modeling” in the field of data science refers to the process of analyzing data to discover the best data transformations and modeling forms for ultimately drawing accurate and meaningful inferences from new realizations of data. Predictive modeling is of fundamental interest to researchers in machine learning, pattern recognition, databases, statistics, artificial intelligence, knowledge acquisition for expert systems, and data visualization.
In the machine learning community, researchers or data scientists build a pipeline to define a series of steps to be performed on an input dataset for building a model.
According to various embodiments, a computing device, a non-transitory computer readable storage medium, and a method are provided for permitting data scientists to proficiently explore multiple pipelines to improve upon the efficiency of a model exploration task on a given dataset.
In one embodiment, a computer implemented method includes generating alternative pipeline graphs having a plurality of layers, each of the plurality of layers having one or more machine learning components for performing a predictive modeling task. The method further includes operating a plurality of pipelines through the pipeline graph to determine a respective plurality of results for a given training dataset. The method further includes comparing the plurality of results to known results based on a user-defined metric to output one or more leader pipelines, where the leaders are those predictive models with the best performance under a given performance metric.
In some embodiments, the pipeline graph is generated from one or more default pipeline graphs for the predictive modeling task.
In some embodiments, the one or more machine learning components include a no-operation component, where the training dataset passes without operation when the pipeline includes the no-operation component.
In some embodiments, the method further comprises applying a set of hyperparameters to one or more of the selected ones of the one or more machine learning components at each of the plurality of layers.
In some embodiments, the method further comprises applying a hyperparameter optimization scheme to reduce a size of a hyperparameter search space.
In some embodiments, the method further comprises initially operating the one or more machine learning components at a last layer of the pipeline graph on the training dataset using a default hyperparameter for each of the one or more machine learning components of the last layer.
In some embodiments, the method further comprises selecting a first portion of the one or more machine learning components of the last layer, the first portion being closest to the known result. In some embodiments, the first portion is about one-half of the one or more machine learning components of the last layer.
In some embodiments, the method further comprises initiating a first hyperparameter tuning on the first portion to determine a tuned set of hyperparameters for each of the first portion of the one or more machine learning components and selecting a second portion of the first portion, the second portion having the best performance. In some embodiments, the second portion is about one-half of the machine learning components of the first portion.
In some embodiments, the method further comprises initiating a second hyperparameter tuning on the second portion to determine a second tuned set of hyperparameters for each of the second portion of the one or more machine learning components of the last layer of the pipeline graph.
In some embodiments, the method further comprises adding an additional one of the plurality of layers and identifying a plurality of smaller paths using each of the one or more machine learning components of the additional one of the plurality of layers and each of the second portion. The smaller pipeline paths serve as a filtering mechanism for the full-length paths to provide a computationally fast means of pruning the overall pipeline graph of pathways that would likely not be fruitful from a predictive modeling accuracy perspective. The method further includes operating the plurality of smaller pipeline paths on the training dataset with the default hyperparameters for each of the two machine learning components of each of the plurality of smaller pipeline paths and selecting a first portion of the smaller pipeline paths, the first portion being closest to the known result. The method further includes initiating a third hyperparameter tuning on the first portion of the smaller pipeline paths to determine a tuned set of hyperparameters for each of the machine learning components of the first portion of the smaller pipeline paths.
According to various embodiments, a computer implemented method includes generating a pipeline graph having a plurality of layers, each of the plurality of layers having one or more machine learning components for performing a predictive modeling task and operating each of the one or more machine learning components at a last layer of the pipeline graph on a training dataset using a default hyperparameter for each of the one or more machine learning components of the last layer. The method further includes selecting a first portion of the one or more machine learning components of the last layer, the first portion having the best performance of the predictive modeling task and initiating a first hyperparameter tuning on the first portion to determine a tuned set of hyperparameters for each of the first portion of the one or more machine learning components. The method further includes selecting a second portion of the first portion, the second portion having the best performance of the predictive modeling task when the tuned set of hyperparameters is applied and initiating a second hyperparameter tuning on the second portion to determine a second tuned set of hyperparameters for each of the second portion of the one or more machine learning components of the last layer of the pipeline graph. The method further includes adding an additional one of the plurality of layers and identifying a plurality of extended pipeline paths using each of the one or more machine learning components of the additional one of the plurality of layers and each of the second portion. The method further includes operating the plurality of extended pipeline paths on the training dataset with the default hyperparameters for each of the one or more machine learning components of the additional one of the plurality of layers. The method further includes selecting a third portion of the extended pipeline paths, the third portion being closest to the known result and initiating a third hyperparameter tuning on the third portion of the extended pipeline paths to determine a third tuned set of hyperparameters for each of the machine learning components of the additional one of the plurality of layers.
According to various embodiments, a non-transitory computer readable storage medium tangibly embodying a computer readable program code having computer readable instructions that, when executed, cause a computer device to carry out a method of improving computing efficiency of a computing device operating a pipeline execution engine is provided. The method includes generating a pipeline graph having a plurality of layers, each of the plurality of layers having one or more machine learning components for performing a predictive modeling task and operating a plurality of pipelines through the pipeline graph on a training dataset to determine a respective plurality of results, wherein each of the plurality of pipelines is a distinct path through selected ones of the one or more machine learning components at each of the plurality of layers. The method further includes comparing the plurality of results to known results based on a user-defined metric to output one or more leader pipelines.
By virtue of the concepts discussed herein, a system and method are provided that improve upon the approaches currently used in machine learning model exploration. These concepts can assure scalability and efficiency of machine learning model exploration while providing a predetermined set of parameters for non-experts that may be fully modified as desired by more advanced users.
These and other features will become apparent from the following detailed description of illustrative embodiments thereof, which is to be read in connection with the accompanying drawings.
The drawings are of illustrative embodiments. They do not illustrate all embodiments. Other embodiments may be used in addition or instead. Details that may be apparent or unnecessary may be omitted to save space or for more effective illustration. Some embodiments may be practiced with additional components or steps and/or without all the components or steps that are illustrated. When the same numeral appears in different drawings, it refers to the same or like components or steps.
In the following detailed description, numerous specific details are set forth by way of examples to provide a thorough understanding of the relevant teachings. However, it should be apparent that the present teachings may be practiced without such details. In other instances, well-known methods, procedures, components, and/or circuitry have been described at a relatively high-level, without detail, to avoid unnecessarily obscuring aspects of the present teachings.
Unless specifically stated otherwise, and as may be apparent from the following description and claims, it should be appreciated that throughout the specification descriptions utilizing terms such as “processing,” “computing,” “calculating,” “determining,” or the like, refer to the action and/or processes of a computer or computing system, or similar electronic computing device, that manipulate and/or transform data represented as physical, such as electronic, quantities within the computing system's registers and/or memories into other data similarly represented as physical quantities within the computing system's memories, registers or other such information storage, transmission or display devices.
As discussed in greater detail below, the present disclosure generally relates to a system for machine learning model exploration using a pipeline graph where a staged distributed optimization can be used for the scalable discovery of leaders. The systems and computerized methods provide a technical improvement in the efficiency, scalability and adaptability for generating machine learning pipelines for a given task using a given dataset, while allowing for end user control of parameter specifications, execution environment, and model metrics, such as time constraints, memory constraints, diversity of results, accuracy, precision, and the like.
Reference now is made in detail to the examples illustrated in the accompanying drawings and discussed below.
Referring to
To make the model exploration task easier, the concept of a “Pipeline Graph” is established. A pipeline graph defines the nature and ordering of the operations to perform when exploring predictive models for tasks such as classification, regression, or clustering. A pipeline graph, denoted as G(V, E), is a directed acyclic rooted graph (DAG) with a set V of vertices and a set E of edges. Each vertex vi∈V represents an operation to be performed on the input data, and an edge ei∈E represents the ordering of operations between vertices. A pipeline graph can prescribe a series of steps, such as feature scaling, feature transformation, feature selection, model learning, and the like, to find the best solution for a given training data instance.
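By way of a non-limiting illustration, such a layered, rooted DAG can be sketched in Python with scikit-learn objects standing in for the operations; the specific layer contents below are assumptions for illustration rather than the disclosure's actual implementation.

```python
from sklearn.preprocessing import MinMaxScaler, StandardScaler
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest
from sklearn.neighbors import KNeighborsRegressor

# V: each vertex is an operation on the input data, grouped by layer.
layers = [
    [("minmaxscaler", MinMaxScaler()), ("standardscaler", StandardScaler())],
    [("pca", PCA()), ("selectkbest", SelectKBest())],
    [("knnregression", KNeighborsRegressor())],
]

# E: edges order operations between vertices; here every node in one
# layer feeds every node in the next, starting from an implicit root.
edges = [
    (src, dst)
    for layer_a, layer_b in zip(layers, layers[1:])
    for src, _ in layer_a
    for dst, _ in layer_b
]
```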
Referring to
In
The pipeline graph 300 can be set up by the user to include three stages: i) feature scaling, ii) feature selection, and iii) regression. The feature scaling stage, at layer 302, includes three popular methods, MinMaxScaler, RobustScaler, and StandardScaler, and may further include an option to exclude the stage (i.e., “noop”, meaning no operation). The next stage, at layer 304, is feature selection using PCA, SelectKBest, or SkipFeatureSelect. The final stage, at layer 306, includes regression models to explore.
The pipeline graph 300 includes multiple end-to-end machine learning/artificial intelligence pipelines that provide prescriptions for how input data are to be manipulated and ultimately modeled. The prescriptions include scaling operations, explored models, the definition of hyperparameter (HP) grids, and how they are explored (e.g., random-search, grid-search, and the like).
Each vertex vi in G is referred to as a pipeline node 308, and an edge 310 represents the ordering of operations between pipeline nodes 308. Each pipeline node 308 includes a name indicating the operation it performs and a reference to the Python object containing the functional implementation. It is represented by a tuple vi=(namei, objecti). For example, the tuple (“pca”, PCA()) represents a pipeline node, with name “pca”, that performs principal component analysis. The nodes in
The name given to each node 308 in the pipeline graph 300 should be unique. The node name is a tag that enables a user to supply additional information that can be used to control the behavior of the node. One such example is to specify a parameter value prefixed by the node name, adopting the convention used in scikit-learn. For example, “gradientboostingregressor__n_estimators” would associate with the parameter “n_estimators” of the scikit-learn object GradientBoostingRegressor. As described in greater detail below, by adopting such a standard convention, hyperparameter grids can be developed for parameter optimization.
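By way of a non-limiting illustration, this naming convention maps directly onto scikit-learn's double-underscore separator between node name and parameter name; the grid values below are assumptions.

```python
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

pipe = Pipeline([
    ("standardscaler", StandardScaler()),
    ("gradientboostingregressor", GradientBoostingRegressor()),
])

# The node name tags the parameter: <node name>__<parameter name>.
param_grid = {"gradientboostingregressor__n_estimators": [50, 100, 200]}

search = GridSearchCV(pipe, param_grid=param_grid, cv=3)
```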
The operation performed by a pipeline node 308 is either of two types—Transform(_.transform) or Estimate(_.fit). An Estimate operation is typically applied to a collection of data items to produce a trained model. A Transform operation uses a trained model on individual data items or on a collection of items to produce a new data item.
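A minimal illustration of the two operation types, using a scikit-learn scaler:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])

scaler = StandardScaler()
scaler.fit(X)                   # Estimate: learn per-feature mean and scale
X_scaled = scaler.transform(X)  # Transform: apply the trained model to data
```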
A pipeline path of a pipeline graph G(V, E), denoted as Pi = {vroot → v1 → … → vj → … → vk}, is a directed path of nodes vj∈V that starts from the root node vroot and ends at leaf node vk. Briefly, a pipeline path chains together artificial intelligence/machine learning-related operations common to the overall task of building predictive models. For example, the leftmost pipeline path of the pipeline graph in
P1 = {start → Standard Scaler → PCA → KNN Regression}
There are 18 total pipeline paths to explore in
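The paths of such a graph can be enumerated as a cross-product over its layers. In the sketch below, the layer contents (three scalers, three selectors, and two regressors, giving 3 × 3 × 2 = 18 paths) are assumptions chosen to match the count above, with scikit-learn's VarianceThreshold standing in for a third selection option:

```python
from itertools import product
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, RobustScaler, StandardScaler
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, VarianceThreshold, f_regression
from sklearn.neighbors import KNeighborsRegressor
from sklearn.linear_model import LinearRegression

layers = [
    [("minmaxscaler", MinMaxScaler()),
     ("robustscaler", RobustScaler()),
     ("standardscaler", StandardScaler())],
    [("pca", PCA()),
     ("selectkbest", SelectKBest(score_func=f_regression)),
     ("variancethreshold", VarianceThreshold())],
    [("knnregression", KNeighborsRegressor()),
     ("linearregression", LinearRegression())],
]

# Every root-to-leaf combination is one pipeline path: 3 * 3 * 2 = 18.
paths = [list(combo) for combo in product(*layers)]
assert len(paths) == 18

# Any single path chains directly into an executable scikit-learn Pipeline.
pipeline = Pipeline(steps=paths[0])
```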
As illustrated in
The pipeline graph 300, while illustrated for use in a regression predictive model, is a universal, general-purpose exploration mechanism for any artificial intelligence capability. The pipeline graph 300 can explore many paths to find the best one and can be customized with respect to many parameters, such as the number of layers, which models are used from a selection of available machine learning models, and the ability to utilize specialized, user-produced models. With the pipeline graph 300 used for machine learning model exploration, individual pipelines can be executed in parallel to generate optimal pipelines.
The table below illustrates example modelling tasks in which the pipeline graph concept can be applied to generate optimized pipelines for predictive models for a given task. The table illustrates typical number of layers and nodes used for various modelling tasks. The table further illustrates the number of possible pipelines that may be explored to determine the best pipelines based on the model metrics 204 (see
While the pipeline graph provides the aforementioned flexibility to add a sequence of machine learning algorithms of a user's choice, it also includes another capability. Many machine learning algorithms come with a set of parameters, such as the choice of the kernel function in a support vector machine (SVM), the depth of the tree in decision tree classification, the number of projected dimensions in a principal component analysis (PCA) transformation, and the like. Such algorithm-specific parameters are referred to as “hyperparameters”.
In a pipeline graph design, the set of hyperparameters associated with a pipeline node (v) is represented by H(v). Each hyperparameter ρ∈H(v) is further associated with a set of values V(ρ), for example:
(v, ρ) → V(ρ), ∀ ρ ∈ H(v)
For example, for the PCA algorithm, a user may be interested in setting the parameter for the number of components to a value in the set [3, 5, 7, 10].
v = PCA, ρ = n_components → [3, 5, 7, 10]
This functionality is achieved in the system, according to aspects of the present disclosure, by making the hyperparameter an attribute of the pipeline node and separately defining mappings which associate an algorithm and hyperparameter to a set of values. Note that while V(ρ) is referred to as a set of values, it may be a discrete set of values or continuous values sampled from a range or a particular distribution.
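A hypothetical sketch of such a mapping, keeping the value sets V(ρ) separate from the node objects (the node and parameter names below are illustrative):

```python
from scipy.stats import loguniform

# Maps (node name, hyperparameter) -> V(rho); values may be a discrete
# set or a continuous distribution to sample from.
hyperparameter_space = {
    ("pca", "n_components"): [3, 5, 7, 10],
    ("svr", "C"): loguniform(1e-3, 1e3),  # hypothetical continuous example
}
```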
Typically, machine learning algorithms have more than one hyperparameter, so users may wish to try combinations of parameters for different algorithms within a pipeline path. To support this concept, methods according to aspects of the present disclosure define “hyperparameter grids” for many common machine learning tasks, grouped by function (e.g., a regression grid, a classification grid, and the like). Coarse and fine grids can be defined, which differ in size and thus in the computational burden they impose.
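For instance, a coarse and a fine regression grid might take the following shape; the node names, parameters, and values here are illustrative assumptions rather than the disclosure's actual grids:

```python
# Coarse grid: few values per parameter, lower computational burden.
coarse_regression_grid = {
    "pca__n_components": [3, 10],
    "knnregression__n_neighbors": [3, 15],
}

# Fine grid: more values per parameter, higher computational burden.
fine_regression_grid = {
    "pca__n_components": [3, 5, 7, 10],
    "knnregression__n_neighbors": [3, 5, 9, 15, 25],
}
```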
An objective of a pipeline execution engine 720 (see
A task is a discrete section of computational work. For example, an execution of a single pipeline path is treated as one task. One strategy is to run multiple tasks (i.e., pipeline paths) in a parallel (preferably distributed) and time-bounded manner. In one embodiment, tasks can be created at multiple levels of granularity to achieve different levels of parallelism.
Path-level parallelism involves running each pipeline path in parallel. In this setting, the evaluation is parallelized across paths, where each task contains a distinct path and is responsible for evaluating the path for a number of parameter choices. This option is referred to as “path_learning”.
Parameter-level parallelism involves running each pipeline path for different hyperparameter combinations in parallel. In this setting, the evaluation is parallelized across the parameters so that each task gets the same path, but a different point in the hyperparameter space. This option is referred to as “param_learning”.
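Both levels of parallelism can be sketched with joblib, reusing the paths list from the enumeration sketch above; the evaluation helper and the parameter points are assumptions for illustration:

```python
from joblib import Parallel, delayed
from sklearn.base import clone
from sklearn.datasets import make_regression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

X, y = make_regression(n_samples=200, n_features=10, random_state=0)

def evaluate(path, params, X, y):
    # One task: evaluate one pipeline path under one hyperparameter choice.
    pipe = Pipeline(steps=[(name, clone(obj)) for name, obj in path])
    pipe.set_params(**params)
    return cross_val_score(pipe, X, y, cv=3).mean()

# Path-level parallelism ("path_learning"): parallelize across paths.
path_scores = Parallel(n_jobs=-1)(
    delayed(evaluate)(path, {}, X, y) for path in paths
)

# Parameter-level parallelism ("param_learning"): same path, different
# points in the hyperparameter space.
param_points = [{"pca__n_components": k} for k in (3, 5, 7)]
param_scores = Parallel(n_jobs=-1)(
    delayed(evaluate)(paths[0], params, X, y) for params in param_points
)
```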
The pipeline execution engine 720 can support parameter and path-level parallelism. The latter is used by default unless a user specifies a hyperparameter grid or when a particular hyperparameter optimization strategy only supports sequential operation (such as those that estimate gradients within the search routine).
Another strategy for speeding up pipeline execution is to use an optimization technique to reduce the size of the hyperparameter search space (compared to, for example, a fully enumerated grid search) for each pipeline path. Aspects of the present disclosure can use various known optimization schemes. For example, the pipeline execution engine 720 can support six different optimization schemes: Complete search (compS), Random search (randS), Evolutionary search (evolS), Hyperband search (hbandS), Bayesian-optimization search (bayesS), and RBFOpt search (rbfOptS). A user can select one of the optimizers listed above for discovering the best combination of hyperparameter values and pipeline path. In some embodiments, the selection of the optimizer can be automated by the system.
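As one concrete possibility, the randS scheme behaves like scikit-learn's RandomizedSearchCV applied to a single pipeline path; the distributions below, and the reuse of paths, X, and y from the earlier sketches, are assumptions:

```python
from scipy.stats import randint
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import Pipeline

# Random search (randS): sample n_iter points from the parameter space
# instead of fully enumerating a grid.
search = RandomizedSearchCV(
    Pipeline(steps=paths[0]),
    param_distributions={
        "pca__n_components": randint(2, 10),
        "knnregression__n_neighbors": randint(1, 30),
    },
    n_iter=10,
    cv=3,
    random_state=0,
)
search.fit(X, y)
```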
The capability of the pipeline execution engine 720 can be extended to support automated learning. This autolearning functionality can be referred to as DAG-AI. First, an end user prepares a deep pipeline graph for a given machine learning task. For an entry-level data scientist, the system can include prebuilt pipeline graphs for various tasks, such as classification, regression, imbalanced learning, and the like. For example, the pre-built classification graph has a depth of 5 layers, 130+ nodes, ~160,000 paths, and a hyperparameter grid with ~150 entries. The output of the process on a pipeline graph is a best-performing pipeline path with a parameter configuration for a given dataset. However, the best-performing pipeline path may not include all the nodes along the paths in the graph. In other words, the method explores variable-length pipeline paths from a given pipeline graph.
An example solution designed to implement a DAG-AI system aims to discover the best pipeline path, with its parameters, as early as possible. Discovering a reasonably good solution as early as possible eliminates the need to run experiments for a longer duration, thereby conserving valuable computational resources. The idea is to execute a pipeline graph iteratively over multiple rounds and progressively generate results at the end of each round. The results generated at the end of the current round are available to the user for quick inspection, as well as used in the subsequent iteration to prune the search space.
In the second round 504, a random search based hyperparameter tuning is initiated on the models that were selected in the previous round (the first round 502). A randomized hyperparameter tuning is a highly parallel activity. The number of parameters to be tried out for each model in the current round is adjustable but is typically kept to a small value, such as 10. In the early stage of execution, the exploration search space can be controlled. In this round, roughly ten or more different models can be run, each with 10 different randomly generated parameter values. Of these models, approximately the top-performing 50% can be selected as candidates for the third round 506.
In the third round 506, a random search based hyperparameter tuning is initiated on the models that were selected in the previous round (the second round 504). Compared to the previous round, the number of parameters to be tried out for each model is greater than the number of parameters in the second round 504 and is adjustable. In some embodiments, roughly five or more different models are run with 30 different parameter values in the third round 506. It should be noted that some models do not have many parameters to be tuned.
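A hedged sketch of this progressive halving across rounds two and three follows; evaluate and sample_params are assumed helpers rather than the system's actual API:

```python
def tune_round(candidates, n_draws, evaluate, sample_params, keep_frac=0.5):
    # One round: try n_draws randomly drawn parameter settings per
    # candidate model and keep the top-performing fraction.
    scored = []
    for model in candidates:
        best = max(evaluate(model, sample_params(model)) for _ in range(n_draws))
        scored.append((best, model))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    keep = max(1, int(len(scored) * keep_frac))
    return [model for _, model in scored[:keep]]

# `models` holds the survivors of the first (default-parameter) round.
round2_survivors = tune_round(models, n_draws=10, evaluate=evaluate,
                              sample_params=sample_params)
round3_survivors = tune_round(round2_survivors, n_draws=30, evaluate=evaluate,
                              sample_params=sample_params)
```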
After successful completion of the first three rounds, length-1 pipeline paths (along with their parameters) are identified that perform better than the other pipeline paths. To reduce exploration of a huge search space, embodiments of the present disclosure can select the k (k ≤ 5) top-performing models and derive k pipeline graphs, one for each top-performing model. The last layer of each of these new pipeline graphs has only one node. The new pipeline graph for the kth top-performing model can be denoted as PGk. For example, assuming that “KNN Regression” is a top-performing algorithm, then the resultant pipeline graph only has one node in the last layer.
In rounds 4 and 5, one focus is on discovering smaller pipeline paths for each top-performing model. Given a pipeline graph Gk for a kth top-performing model, Gk can be decomposed into multiple pipeline graphs of depth-2, for example. It should be noted that the last layer in each decomposed graph is the same, that is, the kth top-performing model. Next, each decomposed graph can be processed in two stages (i.e., a fourth round 508 and a fifth round 510) to discover a smaller pipeline path that performs better than the pipeline path from which it was derived. The fourth round 508 is similar to the first round 502, where pipelines with default parameters are tried, whereas the fifth round 510 is similar to the second round 504, where randomized hyperparameter tuning is conducted on the top-performing paths output by the fourth round 508. After the fourth and fifth rounds 508, 510, it is known which nodes, other than nodes from previous stages, also help to improve the performance. In the sixth round 512, these nodes can be used to grow a longer pipeline path. The process can be repeated to extend the length of the pipeline, incrementally, as needed or desired to improve the performance of the pipeline. Use of the process 500 can reduce the search space by about 90 percent, thus increasing processing speed and reducing required processing resources.
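The decomposition and growth steps of rounds four through six can be sketched as follows; these helper functions are illustrative assumptions:

```python
def depth2_paths(candidate_nodes, model_node):
    # Rounds 4-5: depth-2 paths that all end at the same kth
    # top-performing model node.
    return [[node, model_node] for node in candidate_nodes]

def grow_paths(current_paths, helpful_nodes):
    # Round 6: extend each surviving path by inserting one node, found
    # helpful in rounds 4-5, just before the final model node.
    return [
        path[:-1] + [node, path[-1]]
        for path in current_paths
        for node in helpful_nodes
        if node not in path
    ]
```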
Up until the sixth round 512, highly parallelized randomized search operations can be used. In the seventh round 514, intelligent search mechanisms can be applied to the promising pipeline paths that have been discovered. In particular, given a path, other hyperparameter optimization schemes, such as evolS, hbandS, bayesS, and rbfOptS, can be applied on each path to discover hyperparameter tunings that can help to improve the performance. Instead of applying the intelligent method on each and every pipeline path, this method is applied only on top-performing pipeline paths to improve the execution time.
With the foregoing overview of the example system 210 (see
Referring to
At act 606, the system can offer a suitable hyperparameter grid for the graph G. At act 608, a user may optionally modify the hyperparameter grid based on need.
At act 610, the end user can specify parameters for leader discovery. For example, the user can define how many leaders are to be discovered and their diversity (e.g., share the model but use different data processing steps). The user can define how much time and the maximum memory that is given to each leader for training. The user can set evaluation metrics (e.g., the model metrics 204 of
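Such user-specified parameters might be collected in a configuration of the following shape; the key names are hypothetical, not the system's actual interface:

```python
leader_discovery_config = {
    "num_leaders": 3,           # how many leaders to discover
    "diversity": "model",       # e.g., share the model, vary preprocessing
    "max_train_time_s": 600,    # time budget per leader during training
    "max_memory_mb": 4096,      # memory budget per leader
    "metric": "r2",             # user-defined evaluation metric
}
```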
At act 612, the system can execute an optimization process for discovering pipelines. The optimization process can include an iterative search over the pipeline graph, can use incremental pipeline pattern growth, and can use a distributed execution which can result in improvement in leader discovery as time progresses.
At act 614, the user can interact with the system throughout the process, based on continuous feedback provided by the system. The user may perform various tasks, such as increasing time or memory (if many pipelines fail in an early stage, for example), removing certain components that result in poor performance or are a bottleneck, performing an iterative refinement in case of an execution restart, suggesting models that are top performers in early stages, printing a results summary of the common components of leaders, and the like.
The computer platform 700 may include a central processing unit (CPU) 704, a hard disk drive (HDD) 706, random access memory (RAM) and/or read only memory (ROM) 708, a keyboard 710, a mouse 712, a display 714, and a communication interface 716, which are connected to a system bus 702.
In one embodiment, the HDD 706 has capabilities that include storing a program that can execute various processes, such as the pipeline execution engine 720, in a manner described herein.
The descriptions of the various embodiments of the present teachings have been presented for purposes of illustration but are not intended to be exhaustive or limited to the embodiments disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the described embodiments. The terminology used herein was chosen to best explain the principles of the embodiments, the practical application or technical improvement over technologies found in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein.
While the foregoing has described what are considered to be the best mode and/or other examples, it is understood that various modifications may be made therein and that the subject matter disclosed herein may be implemented in various forms and examples, and that the teachings may be applied in numerous applications, only some of which have been described herein. It is intended by the following claims to claim any and all applications, modifications and variations that fall within the true scope of the present teachings.
The components, steps, features, objects, benefits and advantages that have been discussed herein are merely illustrative. None of them, nor the discussions relating to them, are intended to limit the scope of protection. While various advantages have been discussed herein, it will be understood that not all embodiments necessarily include all advantages. Unless otherwise stated, all measurements, values, ratings, positions, magnitudes, sizes, and other specifications that are set forth in this specification, including in the claims that follow, are approximate, not exact. They are intended to have a reasonable range that is consistent with the functions to which they relate and with what is customary in the art to which they pertain.
Numerous other embodiments are also contemplated. These include embodiments that have fewer, additional, and/or different components, steps, features, objects, benefits and advantages. These also include embodiments in which the components and/or steps are arranged and/or ordered differently.
Aspects of the present disclosure are described herein with reference to a flowchart illustration and/or block diagram of a method, apparatus (systems), and computer program products according to embodiments of the present disclosure. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer readable program instructions.
These computer readable program instructions may be provided to a processor of an appropriately configured computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks. These computer readable program instructions may also be stored in a computer readable storage medium that can direct a computer, a programmable data processing apparatus, and/or other devices to function in a particular manner, such that the computer readable storage medium having instructions stored therein comprises an article of manufacture including instructions which implement aspects of the function/act specified in the flowchart and/or block diagram block or blocks.
The computer readable program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other device to cause a series of operational steps to be performed on the computer, other programmable apparatus or other device to produce a computer implemented process, such that the instructions which execute on the computer, other programmable apparatus, or other device implement the functions/acts specified in the flowchart and/or block diagram block or blocks.
The call-flow, flowchart, and block diagrams in the figures herein illustrate the architecture, functionality, and operation of possible implementations of systems, methods, and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of instructions, which comprises one or more executable instructions for implementing the specified logical function(s). In some alternative implementations, the functions noted in the blocks may occur out of the order noted in the Figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems that perform the specified functions or acts or carry out combinations of special purpose hardware and computer instructions.
While the foregoing has been described in conjunction with exemplary embodiments, it is understood that the term “exemplary” is merely meant as an example, rather than the best or optimal. Except as stated immediately above, nothing that has been stated or illustrated is intended or should be interpreted to cause a dedication of any component, step, feature, object, benefit, advantage, or equivalent to the public, regardless of whether it is or is not recited in the claims.
It will be understood that the terms and expressions used herein have the ordinary meaning as is accorded to such terms and expressions with respect to their corresponding respective areas of inquiry and study except where specific meanings have otherwise been set forth herein. Relational terms such as first and second and the like may be used solely to distinguish one entity or action from another without necessarily requiring or implying any actual such relationship or order between such entities or actions. The terms “comprises,” “comprising,” or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. An element preceded by “a” or “an” does not, without further constraints, preclude the existence of additional identical elements in the process, method, article, or apparatus that comprises the element.
The Abstract of the Disclosure is provided to allow the reader to quickly ascertain the nature of the technical disclosure. It is submitted with the understanding that it will not be used to interpret or limit the scope or meaning of the claims. In addition, in the foregoing Detailed Description, it can be seen that various features are grouped together in various embodiments for the purpose of streamlining the disclosure. This method of disclosure is not to be interpreted as reflecting an intention that the claimed embodiments have more features than are expressly recited in each claim. Rather, as the following claims reflect, inventive subject matter lies in less than all features of a single disclosed embodiment. Thus, the following claims are hereby incorporated into the Detailed Description, with each claim standing on its own as a separately claimed subject matter.