Embodiments provide a single application and corresponding user interface that helps a user manage all phases of a lifecycle of a data science project.
Data science is the study of data to extract meaningful insights. Data science is a multidisciplinary approach that combines principles and practices from the fields of mathematics, statistics, artificial intelligence, and computer engineering to analyze large amounts of data.
A data science project has a lifecycle that includes determining an objective, data gathering, generating a model, and deploying the model or model output. Each phase of the lifecycle includes interfacing with a different database, software, or the like.
The following description and the drawings sufficiently illustrate specific embodiments to enable those skilled in the art to practice them. Other embodiments may incorporate structural, logical, electrical, process, and other changes. Portions and features of some embodiments may be included in, or substituted for, those of other embodiments. Embodiments set forth in the claims encompass all available equivalents of those claims.
Each phase of a data science project lifecycle requires context switching across applications. Experts suggest that each transition from one activity to another costs about 26 minutes of data science expert productivity. Seamless integration of all phases of a data science project lifecycle in a single platform significantly reduces or eliminates the 26 minutes lost per transition and reduces the total amount of time it takes to move an analytic through the data science project lifecycle.
Data acquisition and cleansing consume a significant amount of the time of a data science project. Disparate datasets can be ingested to an integrated data lake. The data lake allows the data scientist to locate and reuse previously ingested and cleansed data, thus saving significant time and resources. Embodiments offer integrated cleansing solutions via a modular open systems architecture (MOSA) or a proprietary approach.
Data science projects are often collaborative endeavors. Multiple data science experts are typically involved in a single data science project. There are currently no known data science project solutions that allow for email, chat, text, or other collaboration embedded in the data science project deployment process. Current data science project programs neglect collaboration around technical processes involving large datasets (e.g., greater than a terabyte (TB)) that are not suitable for transmission by email.
Embodiments include pre-loaded templates that assist with model re-use with new or current datasets. Models and useful data are hard to find. A data science environment (DSE) of embodiments packages (embedding all elements) and deploys projects, models, widgets, dashboards, and data to a self-serve analytics marketplace for use and reuse, as well as to other endpoints or environments.
The DSE of embodiments integrates all phases of the data science project lifecycle into a series of graphical user interfaces (GUIs) that are presented by a single software application. The GUI provides a data scientist, whether experienced or more junior, a simple series of graphics that guide the data scientist through project setup, configuration, and deployment. The GUI encourages best practices for data science project implementation and deployment. The GUI encourages collaboration between data scientist experts. The GUI reduces an amount of time from project conception to deployment by reducing time spent gathering data, cleaning data, choosing a model type, generating a model, populating/training the model, generating a final product, deploying the final product to a marketplace, or the like.
The operation 102 includes gathering all the raw data for the data science project into a single location. The data can, instead of being raw, be represented by a link to the data. The data can be from a public repository, a private repository, a combination thereof, or the like. The operation 102 can include cleaning the gathered data. Cleaning data is distinct from data transformation. Data transformation alters one or more values of the data itself and is the process of converting data from one format or structure into another. Transformation processes, sometimes referred to as data wrangling or data munging, transform and map data from one “raw” form into another format for warehousing and analysis. Cleaning data, in contrast, fixes or removes incorrect, corrupted, duplicated, or incomplete data. When combining multiple data sources, there are many opportunities for data to be duplicated or mislabeled. If data is incorrect, outcomes and algorithms are unreliable, even though they may look correct. There is no one absolute way to prescribe the exact steps in the data cleaning process because the processes vary from dataset to dataset.
Removing unwanted observations from a dataset, including duplicate observations or irrelevant observations, can be performed at operation 102. Duplicate observations most often arise during data collection. When combining datasets from multiple places, such as scraped data or data received from clients or multiple departments, there are opportunities to create duplicate data.
Irrelevant observations can be removed at operation 102. Irrelevant observations are observations that do not fit into the specific objective of the data science project. For example, if data regarding millennial customers is desired, and the dataset includes older generations, those observations of the older generations are irrelevant. Removing irrelevant observations makes analysis more efficient and minimizes distraction from the primary target, as well as creating a more manageable and more performant dataset.
Structural errors can occur when measuring or transferring data. Structural errors include different naming conventions, typos, incorrect capitalization, or the like. These inconsistencies can cause mislabeled categories or classes. For example, one dataset can include “N/A” and another dataset can include “Not Applicable”, but they should be analyzed as the same category. The operation 102 can make the naming conventions consistent so that such entries are analyzed under the same category, as in the sketch below.
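For illustration, a minimal pandas sketch of the duplicate removal and naming-convention fixes described above follows; the column names and values are hypothetical and not part of the disclosed embodiments.

```python
# A minimal sketch, using pandas, of duplicate removal and structural-error
# fixes at operation 102. The "customer_id" and "status" columns are
# hypothetical examples.
import pandas as pd

df = pd.DataFrame({
    "customer_id": [1, 1, 2, 3],
    "status": ["N/A", "N/A", "Not Applicable", "active"],
})

# Remove duplicate observations created when combining data sources.
df = df.drop_duplicates()

# Make naming conventions consistent so "N/A" and "Not Applicable"
# are analyzed as the same category.
df["status"] = (
    df["status"]
    .str.strip()
    .str.lower()
    .replace({"not applicable": "n/a"})
)
print(df)
```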
Often, there will be observations that do not fit within the data being analyzed. If there is a legitimate reason to remove an outlier, like improper data entry, doing so will help the performance of the corresponding model. However, sometimes it is the appearance of an outlier that proves a theory. Just because an outlier exists does not mean it is incorrect; this step is needed to determine the validity of that number. If an outlier proves to be irrelevant for analysis or is a mistake, it can be removed at operation 102.
Missing data is where an entry or an entire observation is missing. Sometimes missing data cannot be ignored because many algorithms will not accept missing values. There are a few ways to deal with missing data. As a first option, observations that have missing values can be dropped, but doing so drops or loses information. As a second option, missing values can be provided based on other observations. Doing so risks losing integrity of the data, because the provided values are assumptions rather than actual observations, and those assumptions can be wrong. As a third option, the way the data is used can be altered to effectively navigate null values. The operation 102 can include handling missing data in any of these ways.
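A hedged sketch of these three options, using pandas, follows; the median imputation and zero fill are illustrative assumptions, and which option is appropriate varies by dataset.

```python
# A sketch of the three missing-data strategies described above.
import numpy as np
import pandas as pd

df = pd.DataFrame({"age": [34, np.nan, 29], "income": [72000, 58000, np.nan]})

# Option 1: drop observations with missing values (loses information).
dropped = df.dropna()

# Option 2: provide missing values based on other observations
# (an assumption, e.g., the column median, which may be wrong).
imputed = df.fillna(df.median())

# Option 3: alter how the data is used so the algorithm can navigate
# null values, e.g., add an indicator column and a neutral fill.
flagged = df.assign(age_missing=df["age"].isna()).fillna(0)
```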
An example software application that cleans data is Amazon Web Services (AWS) Glue Studio® of Seattle, Washington, United States of America (USA). The operation 102 can include a GUI that interfaces, such as through an application programming interface (API), with Glue Studio® so that Glue Studio® can perform the data cleaning of operation 102. In some instances, a model that cleans data can be created using the GUI 200, hosted by infrastructure supporting the GUI 200, or a combination thereof. The model can then be used to cleanse data, such as from a standard formatted file.
The operation 104 can include putting the goal of the data science project into words. Metes and bounds of the data science project can be defined at operation 104. The objective captures the purpose of the data science project. The objective can be a sentence, paragraph, or the like. The objective can include limits on the data to be included, a model to be used, a final product to be provided, a combination thereof, or the like. The objective can be to build on a prior project, start a new project, or the like.
The operation 106 can include importing cleaned data into a single data lake. While such an operation may seem trivial, it is not a common operation performed in current data science projects. Data science projects commonly operate on data from a variety of different locations, rather than gathering the data into a single location. Gathering the data into the single location can include providing links to the data, storing the raw data, storing a compressed form of the raw data, a combination thereof, or the like. The GUI can allow the user the same access to the data regardless of the form in which it is stored in the data lake. That is, the user does not know whether there is a link to the data, raw data, or that the data is in any particular format, as all of this is hidden and handled by the GUI. Such a GUI makes the user interface easier for the user, as the user does not need to manage separate tabs of a browser, application interfaces, or the like. Amazon Web Services (AWS) Simple Storage Service (Amazon S3)® from Amazon Incorporated of Seattle, WA, USA, Amazon RedShift® from Amazon Incorporated, and Amazon Athena® from Amazon Incorporated, among others, are example services for creating and maintaining a data lake. Note the imported data can be in a variety of formats, such as comma separated value (CSV), portable document format (PDF), Excel, JavaScript object notation (JSON), keyhole markup language (KML), open cybersecurity schema framework (OCSF), or a proprietary format or database.
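As one hedged sketch of such an import, using the boto3 client for Amazon S3®, the bucket and key names below are hypothetical; in other embodiments a link to the data could be stored instead of the data itself.

```python
# A minimal sketch of ingesting a cleaned dataset into an S3-backed
# data lake with boto3. Bucket and key names are hypothetical.
import boto3

s3 = boto3.client("s3")

# Store the cleaned file in the data lake. Formats such as CSV, JSON,
# KML, or a proprietary format can be uploaded the same way.
s3.upload_file(
    Filename="cleaned_customers.csv",
    Bucket="example-data-lake",  # hypothetical bucket
    Key="projects/demo/cleaned_customers.csv",
)
```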
A data quality operator 116 can determine a quality score for the data in the data lake. The data quality score can indicate how likely it is that the data in the data lake, when used, will generate a reliable result. The data quality score can be determined based on completeness (% total of non-nulls and non-blanks), uniqueness (% of non-duplicate values), consistency (% of data having patterns), validity (% of reference matching), accuracy (% of unaltered values), linkage (% of well-integrated data), a combination thereof, or the like.
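A simplified sketch of such a score follows, computing only the completeness and uniqueness components with equal weights; the equal weighting is an illustrative assumption, not part of the disclosure.

```python
# A simplified data quality score over two of the components named
# above: completeness and uniqueness.
import pandas as pd

def quality_score(df: pd.DataFrame) -> float:
    # Completeness: % of cells that are non-null and non-blank.
    cells = df.size
    non_null = cells - df.isna().sum().sum()
    blank = (df == "").sum().sum()
    completeness = (non_null - blank) / cells

    # Uniqueness: % of rows that are not duplicates.
    uniqueness = (~df.duplicated()).sum() / len(df)

    # Equal weighting is an assumption made for illustration.
    return 100 * (completeness + uniqueness) / 2
```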
The transform operation 108 determines features of the data imported at operation 106 or otherwise in the data lake. Features are aspects of the data that are relevant to a given model. Features can be generated by a machine learning (ML) model or generated based on a heuristic. Learning the features can include training an embedding layer and then inputting the data from the data lake into the embedding layer. A heuristic can include a statistic (e.g., average (e.g., mean, median, mode, or the like), standard deviation, whether the data is above or below a specified value, a percentile, an expected value, variance, covariance, among many others) or a series of mathematical operations performed on the data. An example application that performs operation 108 is AWS Glue Studio® from Amazon Incorporated.
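A minimal sketch of heuristic feature generation follows, computing a few of the statistics listed above per numeric column; the choice of statistics is illustrative.

```python
# Heuristic features: simple per-column statistics over numeric data.
import pandas as pd

def heuristic_features(df: pd.DataFrame) -> dict:
    numeric = df.select_dtypes("number")
    return {
        "mean": numeric.mean().to_dict(),
        "std": numeric.std().to_dict(),
        "p90": numeric.quantile(0.9).to_dict(),
        # Fraction of values above the column mean (a specified value).
        "above_mean": (numeric > numeric.mean()).mean().to_dict(),
    }
```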
At operation 110, a user can select a model type of a model to generate from scratch, a model template of a model that was previously generated, a model that was previously generated, or the like. The model type can be on an application level or a more granular level. The model types can include data analytics or machine learning models. The data analytics models can be descriptive, predictive, prescriptive, or diagnostic. Assessment, tracking, comparing, and reporting are all example operations performed by a descriptive model. Prediction, root cause, data mining, forecasting, Monte-Carlo simulating, and pattern of life are all example operations performed by a predictive model. Prescriptive models perform optimization. Diagnostic models determine likelihoods, probabilities, or distribution outcomes. ML model types can be reinforcement learning (RL), supervised, unsupervised, or semi-supervised. Example RL model operations include real-time decisions, game artificial intelligence (AI), learning tasks, skill acquisition, robot navigation, or the like. The supervised type can perform classification or regression. Classification can include detecting fraud, performing image classification, customer retention, diagnostics, or the like. Regression can provide new insights, forecasting, predictions, process optimization, or the like. Unsupervised learning can include clustering or dimensionality reduction. Clustering can be used for providing a recommendation, targeting, segmentation, or the like. Dimensionality reduction can be used for providing a big data visual, compression, structure discovery, feature elicitation, or the like.
There are many model generation applications to which the GUI can interface. Examples of such applications include Amazon Web Services (AWS) SageMaker® from Amazon Incorporated, Jupyter Notebook from Project Jupyter (an open-source endeavor), and Zeppelin from the Apache Software Foundation, among others.
The operation 110 can further include training or otherwise setting parameters of the selected model. Training the model can include supervised or semi-supervised learning based on training samples (e.g., input-output samples) in the data lake. The training data can be a subset of the data. The data lake can further include testing data for performing operation 112. Setting parameters can include setting hyperparameters (e.g., number of layers, types of layers, order of layers, number of neurons, gradient step size, number of samples in a batch, number of batches, activation functions, cost function, or the like) or model parameters (e.g., weights, bias, or the like), or the like.
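A hedged sketch of such training, using scikit-learn with placeholder data standing in for samples from the data lake, follows; the model choice and hyperparameter values are illustrative assumptions only.

```python
# Training with a train/test split and explicit hyperparameters.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# Placeholder data standing in for training samples from the data lake.
X, y = make_classification(n_samples=200, random_state=0)

# Hold out testing data, as described for operation 112.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

model = MLPClassifier(
    hidden_layer_sizes=(64, 32),  # number and size of layers (hyperparameters)
    activation="relu",            # activation function
    learning_rate_init=1e-3,      # gradient step size
    batch_size=32,                # number of samples in a batch
    max_iter=500,
)
model.fit(X_train, y_train)  # sets model parameters (weights, biases)
print(model.score(X_test, y_test))
```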
A template recommender 118 can recommend a new model type, a template, an existing model, or other model to the user through the GUI. The template recommender 118 can include an ML model that is trained to provide the template recommendation. The template recommender 118 can receive the objective defined at operation 104, a project description, or other text describing the purpose of the data science project as input. The template recommender 118 can perform a semantic comparison of the input to other text descriptions of data science projects. The template recommender 118 can return an output that is a model, model type, or template that was used in the data science project that is most semantically similar to the data science project being created.
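A minimal sketch of such a semantic comparison follows, approximated here with TF-IDF cosine similarity rather than a trained ML model; the project descriptions are hypothetical.

```python
# Semantic comparison sketch for the template recommender 118,
# approximated with TF-IDF cosine similarity.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical descriptions of previously created projects.
prior_projects = {
    "fraud-model-template": "classify fraudulent credit card transactions",
    "forecast-template": "forecast monthly sales from historical data",
}

def recommend(objective: str) -> str:
    texts = [objective] + list(prior_projects.values())
    vectors = TfidfVectorizer().fit_transform(texts)
    scores = cosine_similarity(vectors[0], vectors[1:])[0]
    # Return the template whose description is most similar.
    return list(prior_projects)[scores.argmax()]

print(recommend("classify fraudulent payment transactions"))
# -> fraud-model-template
```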
The operation 112 includes testing the generated model and monitoring model performance over time. The operation 112 can include detecting model drift and alerting that drift has occurred, identifying a class that is causing the drift, or the like. The operation 112 can provide a visual depiction of model performance, such as a receiver operating characteristic (ROC) curve, training accuracy, output and confidence of the output, among other performance metrics. Performance metrics will vary by model type. For example, a data analytics model has different performance metrics than an ML model. Also, the subtype of the model can dictate different performance metrics. For example, performance metrics are different for regression and classification models. The GUI 200 accounts for these nuances.
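As a brief sketch of this nuance, classification and regression models can be scored with different scikit-learn metrics; the metric selection below is illustrative only.

```python
# Different performance metrics for different model subtypes.
from sklearn.metrics import (accuracy_score, roc_auc_score,
                             mean_squared_error, r2_score)

def classification_metrics(y_true, y_score, y_pred):
    # Metrics suited to classification, e.g., accuracy and an ROC summary.
    return {"accuracy": accuracy_score(y_true, y_pred),
            "roc_auc": roc_auc_score(y_true, y_score)}

def regression_metrics(y_true, y_pred):
    # Error-based metrics suited to regression.
    return {"mse": mean_squared_error(y_true, y_pred),
            "r2": r2_score(y_true, y_pred)}
```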
The operation 114 includes preparing and publishing the model generated or selected at operation 110 for use by other data scientists or analysts. The operation 114 includes selecting a tool to visualize results produced by the model generated at operation 110. The tool can be, for example: Kibana, an open-source data visualization tool that generates pie charts, bar graphs, heat maps, or the like; Tableau, which has an extensive online community portal of open-source data, dashboards, infographics, and plots and allows users to create reports and publish them to the Tableau public portal; or CloudWatch from Amazon Incorporated, a cloud monitoring tool that collects and tracks metrics on operating cloud resources, such as the AWS resources that the user uses. CloudWatch allows a user to create custom dashboards, applications, and metrics.
A visual analytics recommender 120 can recommend a visual analytic to the user through the GUI. The visual analytics recommender 120 can include an ML model that is trained to provide the visual analytics recommendation. The visual analytics recommender 120 can receive the structure or features of the data used to generate the model, data to be input to the model, the objective defined at operation 104, a project description, or other text describing the purpose of the data science project as input. The visual analytics recommender 120 can perform a semantic comparison of the input to other text descriptions of data science projects. The visual analytics recommender 120 can return an output that is a visual analytics tool, graphic type, or other visual object type that was used in the data science project that is most semantically similar to the data science project being created.
The data science environment provided by the flow 100 provides a robust pipeline that optimizes the creation of models in a collaborative environment. Data engineers and scientists can tackle the toughest of data science challenges in a single, integrated, open platform. Regardless of objective, model, and data type, data and models may be synthesized into powerful visualizations at enterprise scale. The open platform allows users to assemble models with a variety of data sources while creating a single source of truth for data storage without displacing or disrupting existing data, tools, or processes.
Models can be reused in the form of pre-loaded templates. All data science assets can be taken through a facilitated package and deploy process where models are paired with visual analytics to form widgets, then published to a private analytics marketplace, or analogous environment. The flow 100 can include monitoring to track model accuracy and drift. The flow 100 can include embedded algorithms (e.g., data quality score operator, template recommender, visual analytic recommender) to support decision making across the user journey: a data quality score (cleanliness, completeness, objective, and model type); a template recommender (model templates with explainable AI); and a visual analytic recommender (the right visual for the objective and dataset).
Embodiments provide advantages and improvements in terms of the data science project lifecycle through an open platform (MOSA) that can be accessed through a cloud or a more local network. Embodiments provide model templates that can be accessed from a variety of tools. Embodiments allow for controlled data scientist collaboration through an entire data science project lifecycle. Embodiments allow the data scientist to create projects to organize a team and collaborate with teammates throughout the lifecycle for all resources: data, models, projects, products, and dashboards.
Embodiments seamlessly provide for project packaging or model packaging, such as for easy reuse. Embodiments pair models with advanced visual analytics so that a user can view model performance metrics seamlessly. Embodiments provide for deployment of a variety of resources to a variety of environments across an enterprise. Embodiments provide for monitoring of model performance, such as accuracy and drift. Embodiments support decision making throughout the project lifecycle by providing best-practice recommendations across the user journey, such as providing a data quality score and explanation based on data cleanliness, completeness, objective, and model type; a template recommender (model templates with explainable artificial intelligence (AI)); or a visual analytic recommender (the right visual for the dataset and objective).
To the best of the inventors' knowledge, there is no single platform that provides all of the following features: an open framework, project management, pre-loaded model templates, collaboration through the project lifecycle, automated (or at least semi-automated) data acquisition and cleansing, data modeling, model performance monitoring, and packaging and deploying the model as a product.
A graphical software control is a software element that is displayed on a graphical user interface and with which a user can interact. The interaction can include providing input, selecting, a combination thereof, or the like. Example software control types include radio buttons, checklists, buttons, drop-down menus, sticky menus, scroll panels, cards, tables, chips/badges, icons, links, checkboxes, toggle switches, segmented controls, input boxes, a combination thereof, or the like. A graphical software control, when interacted with, executes a function associated with the graphical software control.
The control 220, when selected, launches a wizard through which a user can work through the whole lifecycle of a data science project. More details of the wizard are provided elsewhere. A wizard or setup assistant is a user interface that presents a series of dialog boxes to lead the user through a sequence of smaller steps that achieve a larger objective. The control 220 can be selected by a user that is looking to setup a project that does not yet exist.
The control 222, when selected, presents a user with a view of projects that have been started previously, such as through the wizard provided responsive to the control 220 being selected. The projects can be in process, completed, or the like. The projects displayed to the user can be limited to projects for which the user that is logged into the data science environment has been assigned a role. The projects displayed can be displayed in a rank order. The rank can be determined based on date/time created, date/time last accessed, amount of time personnel are spending on the project, a combination thereof, or the like.
The control 224, when selected, presents a user with a view of models that have been started previously, such as through the wizard provided responsive to the control 220 being selected. The models can be in process, completed, or the like. The models presented can be any and all public models or models for which the user has been assigned a role. The models displayed can be displayed in a rank order. The rank can be determined based on date/time created, date/time last accessed, amount of time personnel are spending on the model, a number of times the model has been deployed, an accuracy of the model, a combination thereof, or the like.
The control 226, when selected, presents a user with a view of deployed products that are available for use. The products are widgets that are a combination of analytics (data analysis results) and a model. The widgets allow a user to enter new data into the model to return a result, explore other results provided by the model, present the data in a graphically meaningful manner, or the like. There are many combinations of models and relevant graphics, too many to list here.
The list 228 details projects for which the user has a role. The list 228 includes links that, when selected, cause the GUI to display a home page for the selected project. The list includes a snippet of an objective or description of each project (e.g., directly underneath the link to the project). The link can include text that is the name of the project. The list 228 includes further details of the projects in the list, including an owner name, a date the project was created, a date the project was last updated, and tags associated with the project, among others. The list 228 includes a software control through which the user can add a project to the list 228.
The list 230 details models that the user has executed. The list 230 includes links that, when selected, cause the GUI to display a home page for the selected model. The list 230 includes a snippet of an objective or description of each project of which the model is a part (e.g., directly underneath the link to the model). The link can include text that is the name of the model. The list 230 includes further details of the models in the list, including an owner name, a date the model was created, a date the model was last updated, and tags associated with the model, among others. The list 230 includes a software control through which the user can add a model to the list 230.
The list 232 provides a view of some projects for which the user would like to stay up-to-date on progress. The list 232 is similar to the list 228, with the list 232 including more details regarding each of the projects enumerated in the list 232. The details provided in the list include the project name (presented as a selectable link that, when selected, provides the project homepage on the GUI), a status of the project (e.g., running, ready for deployment, deployed, or the like), a more detailed description of the project than what is provided on the list 228, an owner of the project, contributors to the project, a date the project was created, a date the project was last updated, and tags associated with the project.
After selecting the model, file, or notebook desired, the GUI 200 presents an interface to interact with the selected item. In the example of
The tags dropdown menu 888 allows the user to associate keywords with the project. The tags can include common catchphrases, jargon, or the like that describes the type of project at a high level. Example tags include “experiment”, “Lorem”, “test”, among many others. The collaborators dropdown menu 890 allows the user to select who is going to work on the project and under what respective capacities those people are going to work. The add to watch list binary control 892, when slid to the right, indicates that the project is to be present on the watch list 232.
All of the text entered in the controls 882, 884, 886 and tags selected using the tags dropdown menu 888 is searchable and can be used to provide a recommendation to the user later in the project creation process. The text and tags can form a string, description, or other representation of the semantics of the project.
The encryption code input control 1018 allows a user to specify a decryption key for the data. The data can be stored in encrypted form and decrypted using the decryption key. The tags control 1020 allows the user to select tags associated with the data. The text from the description control 1016 and the tags from the tags control 1020 can additionally be used along with the other description, tags, or a combination thereof to provide a recommendation to a user.
Responsive to selecting the continue control on the GUI 200 in the state illustrated in
A name of the model appears on the GUI 200 of
A model performance dashboard provides visual depictions of the model performance. Graphs 2020, 2022, 2028 of confusion, accuracy, and processor usage are illustrated in the dashboard of
Performance metrics will vary by model type. For example, a data analytics model has different performance metrics than an ML model. Also, the subtype of the model can dictate different performance metrics. For example, performance metrics are different for regression and classification models. The GUI 200 accounts for these nuances.
Another performance metric that can be visualized is model drift. Model drift adds the attribute of confidence to existing metrics. Drift exists where an ML model is confidently incorrect. The confidence metric can be tracked for model degradation based on the published baseline.
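A hedged sketch of such tracking follows; the alert threshold and the use of mean confidence on incorrect predictions as the drift signal are illustrative assumptions.

```python
# Drift sketch: a model is drifting where it is confidently incorrect,
# tracked against a published baseline.
import numpy as np

def drift_alert(confidence, correct, baseline, threshold=0.05):
    """confidence: per-prediction confidence; correct: boolean array;
    baseline: published mean confidence on incorrect predictions."""
    confidence = np.asarray(confidence)
    correct = np.asarray(correct, dtype=bool)
    # Mean confidence of the incorrect predictions.
    wrong = ~correct
    confident_wrong = confidence[wrong].mean() if wrong.any() else 0.0
    # Alert if confidently-wrong behavior exceeds the baseline margin.
    return confident_wrong > baseline + threshold
```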
The project database and the GUI 200 presentation of the project database to the user were visited in
The project details pop-up page (
The GUI 200 of
The GUI 200 conceals all of the expertise required to connect to, interact with, and otherwise access the functionality of the servers 4994, 4996, 4998. The GUI 200 further conceals the knowledge of what server is to be connected to at a given point in a data science project deployment process and the actions that are to be performed. By providing a presentation of the series of GUI states based on the user interaction with the GUI, the data scientist, data engineer, or other user does not need to know what application to connect to, how to connect to the application, where data resides, how the data is accessed, where the data is accessed from, or other knowledge typically needed to complete a data science project.
The method 5000 can further include providing, on the GUI and concurrently with the first, second, third, and fourth graphical software controls, a selectable list of the previously generated projects. The method 5000 can further include providing, on the GUI and concurrently with the first, second, third, and fourth graphical software controls, a selectable list of the previously generated models. The method 5000 can further include providing, on the GUI and concurrently with the first, second, third, and fourth graphical software controls, a watchlist of models, projects, and deployments that a user has previously indicated is to be included on the watchlist.
The method 5000 can further include, wherein the wizard configures the GUI in a series of consecutive states that guides a user through data acquisition, model generation, and deployment of the model or a widget to the marketplace. The method 5000 can further include, wherein the series of consecutive states further guides the user through data cleansing. The method 5000 can further include, wherein the series of consecutive states further guides the user through data transformation.
The method 5000 can further include receiving, by a model recommender that compares a representation of a project objective, project description, and tags of a model being generated to corresponding descriptions of previously generated models, a list of recommended models. The method 5000 can further include providing, by the GUI, the list of recommended models in guiding the user through model generation. The method 5000 can further include receiving, by a visual analytics recommender that compares a representation of a project objective, project description, and tags of a deployment being generated to corresponding descriptions of previously generated deployments, a list of recommended deployments. The method 5000 can further include providing, by the GUI, the list of recommended deployments in guiding the user through the deployment of the model.
The method 5000 can further include receiving, by a data quality score operator, a data quality score. The method 5000 can further include providing, by the GUI, the data quality score in guiding the user through the data acquisition.
Existing data science project solutions scored a 46 of 100 for usability. The GUI 200 of embodiments, which provides the functionality of a complete analytics platform, received a notional usability score of 96 of 100, thus increasing usability by 111%. The GUI 200 of embodiments further increases learnability by 80% according to a survey. Further, the survey indicated that, when asked
“on a scale of 1-5, with 1 meaning very difficult and 5 meaning very easy: how difficult would you say it is to create a model using the tools you are familiar with?” (4 participants) average of 1.83;
“how difficult would you say it is to deploy a model using the tools you are familiar with?” (4 participants) average of 2.25;
“how difficult would you say it is to create a model in the Data Science Environment?” (4 participants) average of 3.7; and
“how difficult would you say it is to deploy a model in the Data Science Environment?” (4 participants) average of 4.3.
Artificial Intelligence (AI) is a field concerned with developing decision-making systems to perform cognitive tasks that have traditionally required a living actor, such as a person. Neural networks (NNs) are computational structures that are loosely modeled on biological neurons. Generally, NNs encode information (e.g., data or decision making) via weighted connections (e.g., synapses) between nodes (e.g., neurons). Modern NNs are foundational to many AI applications, such as object recognition, device behavior modeling (as in the present application), or the like. The template recommender 118, the data quality operator 116, the visual analytics recommender 120, or another component or operation can include or be implemented using one or more NNs.
Many NNs are represented as matrices of weights (sometimes called parameters) that correspond to the modeled connections. NNs operate by accepting data into a set of input neurons that often have many outgoing connections to other neurons. At each traversal between neurons, the corresponding weight modifies the input and is tested against a threshold at the destination neuron. If the weighted value exceeds the threshold, the value is again weighted, or transformed through a nonlinear function, and transmitted to another neuron further down the NN graph. If the threshold is not exceeded, then, generally, the value is not transmitted to a down-graph neuron and the synaptic connection remains inactive. The process of weighting and testing continues until an output neuron is reached, with the pattern and values of the output neurons constituting the result of the NN processing.
The optimal operation of most NNs relies on accurate weights. However, NN designers do not generally know which weights will work for a given application. NN designers typically choose a number of neuron layers or specific connections between layers, including circular connections. A training process may be used to determine appropriate weights, starting from selected initial weights.
In some examples, initial weights may be randomly selected. Training data is fed into the NN, and results are compared to an objective function that provides an indication of error. The error indication is a measure of how wrong the NN's result is compared to an expected result. This error is then used to correct the weights. Over many iterations, the weights will collectively converge to encode the operational data into the NN. This process may be called an optimization of the objective function (e.g., a cost or loss function), whereby the cost or loss is minimized.
A gradient descent technique is often used to perform objective function optimization. A gradient (e.g., partial derivative) is computed with respect to layer parameters (e.g., aspects of the weight) to provide a direction, and possibly a degree, of correction, but does not result in a single correction to set the weight to a “correct” value. That is, via several iterations, the weight will move towards the “correct,” or operationally useful, value. In some implementations, the amount, or step size, of movement is fixed (e.g., the same from iteration to iteration). Small step sizes tend to take a long time to converge, whereas large step sizes may oscillate around the correct value or exhibit other undesirable behavior. Variable step sizes may be attempted to provide faster convergence without the downsides of large step sizes.
Backpropagation is a technique whereby training data is fed forward through the NN (here “forward” means that the data starts at the input neurons and follows the directed graph of neuron connections until the output neurons are reached) and the objective function is applied backwards through the NN to correct the synapse weights. At each step in the backpropagation process, the result of the previous step is used to correct a weight. Thus, the result of the output neuron correction is applied to a neuron that connects to the output neuron, and so forth until the input neurons are reached. Backpropagation has become a popular technique to train a variety of NNs. Any well-known optimization algorithm for backpropagation may be used, such as stochastic gradient descent (SGD), Adam, etc.
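A self-contained sketch of this training loop follows, using NumPy for a small two-layer network with randomly selected initial weights and a fixed step size; the network size and data are illustrative only.

```python
# Forward pass, objective (loss) evaluation, backpropagation of the
# error, and a fixed-step gradient descent weight update.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(float).reshape(-1, 1)

# Randomly selected initial weights for one hidden layer.
W1, b1 = rng.normal(size=(2, 8)), np.zeros(8)
W2, b2 = rng.normal(size=(8, 1)), np.zeros(1)
step = 0.1  # fixed step size

for _ in range(500):
    # Forward: data follows the directed graph of connections.
    h = np.tanh(X @ W1 + b1)
    out = 1 / (1 + np.exp(-(h @ W2 + b2)))
    # Objective function: error against the expected result.
    err = out - y
    # Backward: each step uses the previous step's result.
    d_out = err * out * (1 - out)
    d_h = (d_out @ W2.T) * (1 - h ** 2)
    W2 -= step * h.T @ d_out / len(X)
    b2 -= step * d_out.mean(axis=0)
    W1 -= step * X.T @ d_h / len(X)
    b1 -= step * d_h.mean(axis=0)
```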
The set of processing nodes 5110 is arranged to receive a training set 5115 for the artificial neural network (ANN) 5105. The ANN 5105 comprises a set of nodes 5107 arranged in layers (illustrated as rows of nodes 5107) and a set of inter-node weights 5108 (e.g., parameters) between nodes in the set of nodes. In an example, the training set 5115 is a subset of a complete training set. Here, the subset may enable processing nodes with limited storage resources to participate in training the ANN 5105.
The training data may include multiple numerical values representative of a domain, such as an image feature, or the like. Each value of the training data, or of the input 5117 to be classified after the ANN 5105 is trained, is provided to a corresponding node 5107 in the first layer or input layer of the ANN 5105. The values propagate through the layers and are changed by the objective function.
As noted, the set of processing nodes is arranged to train the neural network to create a trained neural network. After the ANN is trained, data input into the ANN will produce valid classifications 5120 (e.g., the input data 5117 will be assigned into categories), for example. The training performed by the set of processing nodes 5110 is iterative. In an example, each iteration of training the ANN 5105 is performed independently between layers of the ANN 5105. Thus, two distinct layers may be processed in parallel by different members of the set of processing nodes. In an example, different layers of the ANN 5105 are trained on different hardware. The different members of the set of processing nodes may be located in different packages, housings, computers, cloud-based resources, etc. In an example, each iteration of the training is performed independently between nodes in the set of nodes. This example is an additional parallelization whereby individual nodes 5107 (e.g., neurons) are trained independently. In an example, the nodes are trained on different hardware.
The example computer system 5200 includes a processor 5202 (e.g., a central processing unit (CPU), a graphics processing unit (GPU) or both), a main memory 5204 and a static memory 5206, which communicate with each other via a bus 5208. The computer system 5200 may further include a video display unit 5210 (e.g., a liquid crystal display (LCD) or a cathode ray tube (CRT)). The computer system 5200 also includes an alphanumeric input device 5212 (e.g., a keyboard), a user interface (UI) navigation device 5214 (e.g., a mouse), a mass storage unit 5216, a signal generation device 5218 (e.g., a speaker), a network interface device 5220, and a radio 5230 such as Bluetooth, WWAN, WLAN, and NFC, permitting the application of security controls on such protocols.
The mass storage unit 5216 includes a machine-readable medium 5222 on which is stored one or more sets of instructions and data structures (e.g., software) 5224 embodying or utilized by any one or more of the methodologies or functions described herein. The instructions 5224 may also reside, completely or at least partially, within the main memory 5204 and/or within the processor 5202 during execution thereof by the computer system 5200, the main memory 5204 and the processor 5202 also constituting machine-readable media.
While the machine-readable medium 5222 is shown in an example embodiment to be a single medium, the term “machine-readable medium” may include a single medium or multiple media (e.g., a centralized or distributed database, and/or associated caches and servers) that store the one or more instructions or data structures. The term “machine-readable medium” shall also be taken to include any tangible medium that is capable of storing, encoding, or carrying instructions for execution by the machine and that cause the machine to perform any one or more of the methodologies of the present invention, or that is capable of storing, encoding, or carrying data structures utilized by or associated with such instructions. The term “machine-readable medium” shall accordingly be taken to include, but not be limited to, solid-state memories, and optical and magnetic media. Specific examples of machine-readable media include non-volatile memory, including by way of example semiconductor memory devices, e.g., Erasable Programmable Read-Only Memory (EPROM), Electrically Erasable Programmable Read-Only Memory (EEPROM), and flash memory devices; magnetic disks such as internal hard disks and removable disks; magneto-optical disks; and CD-ROM and DVD-ROM disks.
The instructions 5224 may further be transmitted or received over a communications network 5226 using a transmission medium. The instructions 5224 may be transmitted using the network interface device 5220 and any one of a number of well-known transfer protocols (e.g., HTTPS). Examples of communication networks include a local area network (“LAN”), a wide area network (“WAN”), the Internet, mobile telephone networks, Plain Old Telephone (POTS) networks, and wireless data networks (e.g., WiFi and WiMax networks). The term “transmission medium” shall be taken to include any intangible medium that is capable of storing, encoding, or carrying instructions for execution by the machine, and includes digital or analog communications signals or other intangible media to facilitate communication of such software.
Example 1 includes a method, performed by way of a graphical user interface (GUI), for data analytics and machine learning (ML) model project deployment, the method comprising providing, on the GUI, a first graphical software control that causes a project creation wizard to be presented on the GUI, providing, on the GUI, a second graphical software control that causes previously generated projects to be presented on the GUI, providing, on the GUI, a third graphical software control that causes previously generated models to be presented on the GUI, and providing, on the GUI, a fourth graphical software control that causes a marketplace of data science projects to be presented on the GUI.
In Example 2, Example 1 further includes providing, on the GUI and concurrently with the first, second, third, and fourth graphical software controls, a selectable list of the previously generated projects.
In Example 3, at least one of Examples 1-2 further includes providing, on the GUI and concurrently with the first, second, third, and fourth graphical software controls, a selectable list of the previously generated models.
In Example 4, at least one of Examples 1-3 further includes providing, on the GUI and concurrently with the first, second, third, and fourth graphical software controls, a watchlist of models, projects, and deployments that a user has previously indicated is to be included on the watchlist.
In Example 5, at least one of Examples 1-4 further includes, wherein the wizard configures the GUI in a series of consecutive states that guides a user through data acquisition, model generation, and deployment of the model or a widget to the marketplace.
In Example 6, Example 5 further includes, wherein the series of consecutive states further guides the user through data cleansing.
In Example 7, at least one of Examples 5-6 further includes, wherein the series of consecutive states further guides the user through data transformation.
In Example 8, at least one of Examples 5-7 further includes receiving, by a model recommender that compares a representation of a project objective, project description, and tags of a model being generated to corresponding descriptions of previously generated models, a list of recommended models, and providing, by the GUI, the list of recommended models in guiding the user through model generation.
In Example 9, at least one of Examples 5-8 further includes receiving, by a visual analytics recommender that compares a representation of a project objective, project description, and tags of a deployment being generated to corresponding descriptions of previously generated deployments, a list of recommended deployments, and providing, by the GUI, the list of recommended deployments in guiding the user through the deployment of the model.
In Example 10, at least one of Examples 5-9 further includes receiving, by a data quality score operator, a data quality score, and providing, by the GUI, the data quality score in guiding the user through the data acquisition.
Example 11 includes a non-transitory machine-readable medium including instructions that, when executed by a machine, cause the machine to perform the method of one of Examples 1-10.
Example 12 includes a system comprising processing circuitry, a GUI, and a memory including instructions that, when executed by the processing circuitry, cause the processing circuitry to perform operations comprising the method of one of Examples 1-10.
Although an embodiment has been described with reference to specific example embodiments, it will be evident that various modifications and changes may be made to these embodiments without departing from the broader spirit and scope of the invention. Accordingly, the specification and drawings are to be regarded in an illustrative rather than a restrictive sense. The accompanying drawings that form a part hereof, show by way of illustration, and not of limitation, specific embodiments in which the subject matter may be practiced. The embodiments illustrated are described in sufficient detail to enable those skilled in the art to practice the teachings disclosed herein. Other embodiments may be utilized and derived therefrom, such that structural and logical substitutions and changes may be made without departing from the scope of this disclosure. This Detailed Description, therefore, is not to be taken in a limiting sense, and the scope of various embodiments is defined only by the appended claims, along with the full range of equivalents to which such claims are entitled.
Although specific embodiments have been illustrated and described herein, it should be appreciated that any arrangement calculated to achieve the same purpose may be substituted for the specific embodiments shown. This disclosure is intended to cover any and all adaptations or variations of various embodiments. Combinations of the above embodiments, and other embodiments not specifically described herein, will be apparent to those of skill in the art upon reviewing the above description.
In this document, the terms “a” or “an” are used, as is common in patent documents, to include one or more than one, independent of any other instance or usages of “at least one” or “one or more.” In this document, the term “or” is used to refer to a nonexclusive or, such that “A or B” includes “A but not B,” “B but not A,” and “A and B,” unless otherwise indicated. In this document, the terms “including” and “in which” are used as the plain-English equivalents of the respective terms “comprising” and “wherein.” Also, in the following claims, the terms “including” and “comprising” are open-ended, that is, a system, article, composition, formulation, or process that includes elements in addition to those listed after such a term in a claim are still deemed to fall within the scope of that claim. Moreover, in the following claims, the terms “first,” “second,” and “third,” etc. are used merely as labels, and are not intended to impose numerical requirements on their objects.
The Abstract of the Disclosure is provided to comply with 37 C.F.R. §1.72(b), requiring an abstract that will allow the reader to quickly ascertain the nature of the technical disclosure. It is submitted with the understanding that it will not be used to interpret or limit the scope or meaning of the claims. In addition, in the foregoing Detailed Description, it can be seen that various features are grouped together in a single embodiment for the purpose of streamlining the disclosure. This method of disclosure is not to be interpreted as reflecting an intention that the claimed embodiments require more features than are expressly recited in each claim. Rather, as the following claims reflect, inventive subject matter lies in less than all features of a single disclosed embodiment. Thus the following claims are hereby incorporated into the Detailed Description, with each claim standing on its own as a separate embodiment.