1. Field of the Invention
The present specification relates to facilitating analysis of Big Data. More specifically, the present specification relates to systems and methods for providing a unified data science platform. Still more particularly, the present specification relates to user interfaces for a unified data science platform, including management of models, experiments, data sets, projects, actions, reports, and features.
2. Description of Related Art
The model creation process of the prior art is often described as a black art. At best, it is a slow, tedious, and inefficient process. At worst, it compromises model accuracy and delivers sub-optimal results more often than not. These problems are exacerbated when the data sets are massive, as is the case in Big Data analysis. Existing solutions fail to be intuitive to a novice user and burden the user with an intense, time-consuming learning curve. Such a deficiency may decrease user productivity, as the user may waste effort trying to interpret the complexity inherent in data science without success.
Thus, there is a need for a system and method that provides an enterprise-class machine learning platform to automate data science, thereby making machine learning easier for enterprises to adopt, and that provides intuitive user interfaces for the management and visualization of models, experiments, data sets, projects, actions, reports, and features.
The present disclosure overcomes one or more of the deficiencies of the prior art at least in part by providing a system and method for providing a unified, project-based data scientist workspace to visually prepare, build, deploy, visualize and manage models, their results and datasets.
According to one innovative aspect of the subject matter described in this disclosure, a system comprises: one or more processors; and a memory including instructions that, when executed by the one or more processors, cause the system to: generate a user interface for presentation to a user, the user interface oriented around a first machine learning object in a data science process; determine a first context associated with the first machine learning object in the data science process; identify a second machine learning object related to the first machine learning object in the first context; generate a suggestion of a first action based on the first context; transmit, for display, the suggestion of the first action to the user on the user interface; receive, from the user, a confirmation to perform the first action; and manipulate one or more of the first machine learning object and the second machine learning object related to the first machine learning object in the first context based on the first action.
In general, another innovative aspect of the subject matter described in this disclosure may be embodied in methods that include generating a user interface for presentation to a user, the user interface oriented around a first machine learning object in a data science process; determining a first context associated with the first machine learning object in the data science process; identifying a second machine learning object related to the first machine learning object in the first context; generating a suggestion of a first action based on the first context; transmitting, for display, the suggestion of the first action to the user on the user interface; receiving, from the user, a confirmation to perform the first action; and manipulating one or more of the first machine learning object and the second machine learning object related to the first machine learning object in the first context based on the first action.
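The sequence of operations above — orienting the interface around an object, determining its context, suggesting an action, and manipulating the object and its related objects on confirmation — might be sketched as follows. This is a minimal illustrative sketch only: the names (`MLObject`, `suggest_action`, `run_step`) and the example suggestion rule are assumptions, not part of the disclosure.

```python
from dataclasses import dataclass, field

@dataclass
class MLObject:
    """A machine learning object in the data science process (e.g., a dataset or model)."""
    name: str
    kind: str  # e.g., "dataset", "model", "workflow"
    related: list = field(default_factory=list)  # contextually related objects

def suggest_action(obj, context):
    """Generate a suggested action based on the object's context (toy rule)."""
    if context == "data_preparation" and obj.kind == "dataset":
        return "impute_missing_values"
    return "inspect"

def run_step(obj, context, confirm):
    """Suggest an action; on user confirmation, apply it to the object and its related objects."""
    action = suggest_action(obj, context)
    if confirm(action):
        applied = [obj.name] + [r.name for r in obj.related]
        return action, applied
    return None, []

# Usage: a dataset in the data-preparation context, with a related workflow.
wf = MLObject("training-workflow", "workflow")
ds = MLObject("sales-data", "dataset", related=[wf])
action, touched = run_step(ds, "data_preparation", confirm=lambda a: True)
```

The `confirm` callable stands in for the user's confirmation received through the user interface.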
Other aspects include corresponding methods, systems, apparatus, and computer program products for these and other innovative features. These and other implementations may each optionally include one or more of the following features.
For instance, the operations further include generating a main workspace card including a snapshot of the first machine learning object and the first context associated with the first machine learning object in the data science process, the snapshot identifying one or more of an input and output of the first machine learning object, generating a dashboard card including a dynamic view of one or more key performance indicators for the first machine learning object in the data science process, generating a history card including a temporal history of commands applied to the one or more of the first machine learning object and the second machine learning object related to the first machine learning object in the first context, generating a palette card including a list of reusable cards in the data science process, and placing the main workspace card, the dashboard card, the history card, and the palette card in a relative position with respect to each other on the user interface to receive user interaction for manipulating the one or more of the first machine learning object and the second machine learning object. For instance, the operations further include determining a first analysis phase of the first machine learning object and a history of analysis associated with the one or more of the first machine learning object and the second machine learning object related to the first machine learning object in the first context. For instance, the operations further include identifying a second action previously performed on another instance of the first machine learning object in a second analysis phase within a second context in the data science process, wherein the second analysis phase and the second context are identical to the first analysis phase and the first context, and the first action is learned based on the second action.
For instance, the operations further include selecting the suggestion based on one or more of seeded suggestions, heuristics, and a set of best practices in the data science process. For instance, the operations further include displaying a preview of an effect of the first action on the one or more of the first machine learning object and the second machine learning object related to the first machine learning object in the first context. For instance, the operations further include generating a checklist for the data science process based on one or more of learning from a previous checklist, seeded checklists, heuristics, and a set of best practices, the checklist identifying an overall progress of the data science process. For instance, the operations further include generating one or more report elements for inclusion in a report for the data science process responsive to receiving the confirmation to perform the first action. For instance, the operations further include generating a documentation of the first action in the data science process responsive to receiving the confirmation to perform the first action.
For instance, the features further include the suggestion of the first action including a sequence of actions comprising one or more of a demo, a lesson, and a tutorial for guiding the user in the data science process. For instance, the features further include the first machine learning object including one or more from a group of projects, datasets, workflows, code, model, deployment, knowledge, and jobs.
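The card arrangement described above — a main workspace card, a dashboard card, a history card, and a palette card placed in relative positions on the user interface — can be pictured as a simple layout structure. The function name, position keys, and example values below are illustrative assumptions, not part of the disclosure.

```python
def build_workspace(obj_name, kpis, history, palette):
    """Assemble the four-card user interface as a dict keyed by relative position."""
    return {
        "main": {"card": "workspace", "snapshot": obj_name},     # snapshot of the ML object
        "top": {"card": "dashboard", "kpis": list(kpis)},        # dynamic KPI view
        "right": {"card": "history", "commands": list(history)}, # temporal command history
        "bottom": {"card": "palette", "reusable": list(palette)},# reusable cards
    }

# Usage with hypothetical example values.
ui = build_workspace("churn-model", kpis=["accuracy", "auc"],
                     history=["load", "clean"], palette=["join", "filter"])
```

A real implementation would render these entries as interactive cards; the dict merely shows the relative-position relationship.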
The present disclosure is particularly advantageous because it provides a unified, project-based data scientist workspace to visually prepare, build, deploy, visualize, and manage models, their results, and datasets. The unified workspace increases adoption of advanced data analytics and makes machine learning accessible to a broader audience, for example, by providing, in some embodiments, a series of user interfaces to guide the user through the machine learning process. In some embodiments, the project-based approach allows users to easily manage items including projects, models, results, activity logs, and the datasets used to build models, features, experiments, etc. In some embodiments, a user may be educated and/or guided through the process and provided suggestions with regard to a next step in the user's project, best practices, etc.
The features and advantages described herein are not all-inclusive and many additional features and advantages will be apparent to one of ordinary skill in the art in view of the figures and description. Moreover, it should be noted that the language used in the specification has been principally selected for readability and instructional purposes, and not to limit the scope of the inventive subject matter.
The invention is illustrated by way of example, and not by way of limitation in the figures of the accompanying drawings in which like reference numerals are used to refer to similar elements.
A system and method for providing one or more user interfaces under a unified platform for the data science process end-to-end is described. In the following description, for purposes of explanation, numerous specific details are set forth in order to provide a thorough understanding of the disclosure. It should be apparent, however, that the disclosure may be practiced without these specific details. In other instances, structures and devices are shown in block diagram form in order to avoid obscuring the disclosure. For example, the present disclosure is described in one implementation below with reference to particular hardware and software implementations. However, the present disclosure applies to other types of implementations distributed in the cloud, over multiple machines, using multiple processors or cores, using virtual machines or integrated as a single machine.
Reference in the specification to “one implementation” or “an implementation” means that a particular feature, structure, or characteristic described in connection with the implementation is included in at least one implementation of the disclosure. The appearances of the phrase “in one implementation” in various places in the specification are not necessarily all referring to the same implementation. In particular, the present disclosure is described below in the context of multiple distinct architectures, and some of the components are operable in multiple architectures while others are not.
Some portions of the detailed descriptions that follow are presented in terms of algorithms and symbolic representations of operations on data bits within a computer memory. These algorithmic descriptions and representations are the means used by those skilled in the data processing arts to most effectively convey the substance of their work to others skilled in the art. An algorithm is here, and generally, conceived to be a self-consistent sequence of steps leading to a desired result. The steps are those requiring physical manipulations of physical quantities. Usually, though not necessarily, these quantities take the form of electrical or magnetic signals capable of being stored, transferred, combined, compared, and otherwise manipulated. It has proven convenient at times, principally for reasons of common usage, to refer to these signals as bits, values, elements, symbols, characters, terms, numbers or the like.
It should be borne in mind, however, that all of these and similar terms are to be associated with the appropriate physical quantities and are merely convenient labels applied to these quantities. Unless specifically stated otherwise as apparent from the following discussion, it is appreciated that throughout the description, discussions utilizing terms such as “processing” or “computing” or “calculating” or “determining” or “displaying” or the like, refer to the action and processes of a computer system, or similar electronic computing device, that manipulates and transforms data represented as physical (electronic) quantities within the computer system's registers or memories into other data similarly represented as physical quantities within the computer system memories or registers or other such information storage, transmission or display devices.
The present disclosure also relates to an apparatus for performing the operations herein. This apparatus may be specially constructed for the required purposes, or it may comprise a general-purpose computer selectively activated or reconfigured by a computer program stored in the computer. Such a computer program may be stored in a non-transitory computer readable storage medium, such as, but not limited to, any type of disk including floppy disks, optical disks, CD-ROMs, and magnetic-optical disks, read-only memories (ROMs), random access memories (RAMs), EPROMs, EEPROMs, magnetic or optical cards, or any type of media suitable for storing electronic instructions, each coupled to a computer system bus.
Aspects of the method and system described herein, such as the logic, may also be implemented as functionality programmed into any of a variety of circuitry, including programmable logic devices (PLDs), such as field programmable gate arrays (FPGAs), programmable array logic (PAL) devices, electrically programmable logic and memory devices and standard cell-based devices, as well as application specific integrated circuits (ASICs). Some other possibilities for implementing aspects include: memory devices, microcontrollers with memory (such as EEPROM), embedded microprocessors, firmware, software, etc. Furthermore, aspects may be embodied in microprocessors having software-based circuit emulation, discrete logic (sequential and combinatorial), custom devices, fuzzy (neural) logic, quantum devices, and hybrids of any of the above device types. The underlying device technologies may be provided in a variety of component types, e.g., metal-oxide semiconductor field-effect transistor (MOSFET) technologies like complementary metal-oxide semiconductor (CMOS), bipolar technologies like emitter-coupled logic (ECL), polymer technologies (e.g., silicon-conjugated polymer and metal-conjugated polymer-metal structures), mixed analog and digital, and so on.
Finally, the algorithms and displays presented herein are not inherently related to any particular computer or other apparatus. Various general-purpose systems may be used with programs in accordance with the teachings herein, or it may prove convenient to construct more specialized apparatus to perform the required method steps. The required structure for a variety of these systems should appear from the description below. In addition, the present disclosure is described without reference to any particular programming language. It should be appreciated that a variety of programming languages may be used to implement the teachings of the disclosure as described herein.
In some implementations, the system 100 includes a data science platform server 102 coupled to the network 106 for communication with the other components of the system 100, such as the plurality of client devices 114a . . . 114n, the production server 108, and the data collector 110 and associated data store 112. In some implementations, the data science platform server 102 may include a hardware server, a software server, or a combination of software and hardware. In some implementations, the data science platform server 102 is a computing device having data processing (e.g., at least one processor), storing (e.g., a pool of shared or unshared memory), and communication capabilities. For example, the data science platform server 102 may include one or more hardware servers, server arrays, storage devices and/or systems, etc. In the example of
In some implementations, the data science platform server 102 may be a web server that couples with one or more client devices 114 (e.g., negotiating a communication protocol, etc.) and may prepare the data and/or information, such as forms, web pages, tables, plots, visualizations, etc. that is exchanged with one or more client devices 114. For example, the data science platform server 102 may generate a first user interface to allow the user to enact a data transformation on a set of data for processing and then return a second user interface to display the results of data transformation as applied to the submitted data. Also, instead of or in addition, the data science platform server 102 may implement its own API for the transmission of instructions, data, results, and other information between the data science platform server 102 and an application installed or otherwise implemented on the client device 114. Although only a single data science platform server 102 is shown in
In some implementations, the system 100 includes a production server 108 coupled to the network 106 for communication with the other components of the system 100, such as the plurality of client devices 114a . . . 114n, the data science platform server 102, and the data collector 110 and associated data store 112. In some implementations, the production server 108 may be either a hardware server, a software server, or a combination of software and hardware. The production server 108 may be a computing device having data processing, storing, and communication capabilities. For example, the production server 108 may include one or more hardware servers, server arrays, storage devices and/or systems, etc. In some implementations, the production server 108 may include one or more virtual servers, which operate in a host server environment and access the physical hardware of the host server including, for example, a processor, memory, storage, network interfaces, etc., via an abstraction layer (e.g., a virtual machine manager). In some implementations, the production server 108 may include a web server (not shown) for processing content requests, such as a Hypertext Transfer Protocol (HTTP) server, a Representational State Transfer (REST) service, or other server type, having structure and/or functionality for satisfying content requests and receiving content from one or more computing devices that are coupled to the network 106 (e.g., the data science platform server 102, the data collector 110, the client device 114, etc.). In some implementations, the production server 108 may include machine learning models, receive a transformation sequence and/or machine learning models for deployment from the data science platform server 102 and generate predictions prescribed by the machine learning models, and use the transformation sequence and/or models on a test dataset (in batch mode or online) for data analysis. 
For purposes of this application, the terms “prediction” and “scoring” are used interchangeably to mean the same thing, namely, to make predictions (in batch mode or online) using the model. In machine learning, a response variable, which may occasionally be referred to herein as a “response,” refers to a data feature containing the objective result of a prediction. A response may vary based on the context (e.g., based on the type of predictions to be made by the machine learning model). For example, responses may include, but are not limited to, class labels (classification), targets (generically, but particularly relevant to regression), rankings (ranking/recommendation), ratings (recommendation), dependent values, predicted values, or objective values. Although only a single production server 108 is shown in
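The batch and online scoring modes mentioned above can be illustrated with a short sketch that applies a deployed transformation sequence and model to incoming data. The function names and the stand-in callables for the transform and model are assumptions for illustration only, not the production server's actual interface.

```python
def score_batch(model, transform, rows):
    """Batch scoring: apply the transformation sequence, then predict, for all rows."""
    return [model(transform(r)) for r in rows]

def score_online(model, transform, row):
    """Online scoring: transform and predict a single incoming row."""
    return model(transform(row))

# Usage with toy stand-ins for the deployed transform and model.
transform = lambda r: r * 2          # hypothetical transformation sequence
model = lambda x: x > 5              # hypothetical binary classifier
preds = score_batch(model, transform, [1, 3, 4])
```

In practice the same transform/model pair would serve both modes, which is why the disclosure treats batch and online deployment uniformly.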
The data collector 110 is a server/service which collects data and/or analysis from other servers (not shown) coupled to the network 106. In some implementations, the data collector 110 may be a first or third-party server (that is, a server associated with a separate company or service provider), which mines data, crawls the Internet, and/or receives/retrieves data from other servers. For example, the data collector 110 may collect user data, item data, and/or user-item interaction data from other servers and then provide it and/or perform analysis on it as a service. In some implementations, the data collector 110 may be a data warehouse or a data repository belonging to an organization. In some embodiments, the data collector 110 may receive data, via the network 106, from one or more of the data science platform server 102, a client device 114, and a production server 108. In some embodiments, the data collector 110 may receive data from real-time or streaming data sources.
The data store 112 is coupled to the data collector 110 and comprises a non-volatile memory device or similar permanent storage device and media. The data collector 110 stores the data in the data store 112 and, in some implementations, provides the data science platform server 102 access to retrieve the data collected by the data collector 110 (e.g., training data, response variables, rewards, tuning data, test data, user data, experiments and their results, learned parameter settings, system logs, etc.).
Although only a single data collector 110 and associated data store 112 is shown in
The network 106 is a conventional type, wired or wireless, and may have any number of different configurations such as a star configuration, token ring configuration or other configurations known to those skilled in the art. Furthermore, the network 106 may comprise a local area network (LAN), a wide area network (WAN) (e.g., the Internet), and/or any other interconnected data path across which multiple devices may communicate. In yet another implementation, the network 106 may be a peer-to-peer network. The network 106 may also be coupled to or include portions of a telecommunications network for sending data in a variety of different communication protocols. In some instances, the network 106 includes Bluetooth communication networks or a cellular communications network for sending and receiving data including via short messaging service (SMS), multimedia messaging service (MMS), hypertext transfer protocol (HTTP), direct data connection, wireless application protocol (WAP), electronic mail, etc.
The client devices 114a . . . 114n include one or more computing devices having data processing and communication capabilities. In some implementations, a client device 114 may include a processor (e.g., virtual, physical, etc.), a memory, a power source, a communication unit, and/or other software and/or hardware components, such as a display, graphics processor (for handling general graphics and multimedia processing for any type of application), wireless transceivers, keyboard, camera, sensors, firmware, operating systems, drivers, various physical connection interfaces (e.g., USB, HDMI, etc.). The client device 114a may couple to and communicate with other client devices 114n and the other entities of the system 100 via the network 106 using a wireless and/or wired connection.
A plurality of client devices 114a . . . 114n are depicted in
Examples of client devices 114 may include, but are not limited to, mobile phones, tablets, laptops, desktops, netbooks, server appliances, servers, virtual machines, TVs, set-top boxes, media streaming devices, portable media players, navigation devices, personal digital assistants, etc. While two client devices 114a and 114n are depicted in
It should be understood that the present disclosure is intended to cover the many different embodiments of the system 100 that include the network 106, the data science platform server 102, the production server 108, the data collector 110 and associated data store 112, and one or more client devices 114. In a first example, the data science platform server 102, the production server 108, and the data collector 110 may each be dedicated devices or machines coupled for communication with each other by the network 106. In a second example, any one or more of the servers 102, 108, and 110 may each be dedicated devices or machines coupled for communication with each other by the network 106 or may be combined as one or more devices configured for communication with each other via the network 106. For example, the data science platform server 102 and the production server 108 may be included in the same server. In a third example, any one or more of the servers 102, 108, and 110 may be operable on a cluster of computing cores in the cloud and configured for communication with each other. In a fourth example, any one or more of one or more servers 102, 108, and 110 may be virtual machines operating on computing resources distributed over the internet. In a fifth example, any one or more of the servers 102 and 108 may each be dedicated devices or machines that are firewalled or completely isolated from each other (i.e., the servers 102 and 108 may not be coupled for communication with each other by the network 106). For example, the data science platform server 102 and the production server 108 may be included in different servers that are firewalled or completely isolated from each other.
While the data science platform server 102 and the production server 108 are shown as separate devices in
Referring now to
The processor 202 comprises an arithmetic logic unit, a microprocessor, a general purpose controller, a field programmable gate array (FPGA), an application specific integrated circuit (ASIC), or some other processor array, or some combination thereof to execute software instructions by performing various input, logical, and/or mathematical operations to provide the features and functionality described herein. The processor 202 processes data signals and may comprise various computing architectures including a complex instruction set computer (CISC) architecture, a reduced instruction set computer (RISC) architecture, or an architecture implementing a combination of instruction sets. The processor(s) 202 may be physical and/or virtual, and may include a single core or plurality of processing units and/or cores. Although only a single processor is shown in
The memory 204 may store and provide access to data to the other components of the data science platform server 102. The memory 204 may be included in a single computing device or a plurality of computing devices. In some implementations, the memory 204 may store instructions and/or data that may be executed by the processor 202. For example, as depicted in
The instructions stored by the memory 204 and/or data may comprise code for performing any and/or all of the techniques described herein. The memory 204 may be a dynamic random access memory (DRAM) device, a static random access memory (SRAM) device, flash memory or some other memory device known in the art. In some implementations, the memory 204 also includes a non-volatile memory such as a hard disk drive or flash drive for storing information on a more permanent basis. The memory 204 is coupled by the bus 220 for communication with the other components of the data science platform server 102. It should be understood that the memory 204 may be a single device or may include multiple types of devices and configurations.
The display module 206 may include software and routines for sending processed data, analytics, or results for display to a client device 114, for example, to allow an administrator to interact with the data science platform server 102. In some implementations, the display module may include hardware, such as a graphics processor, for rendering interfaces, data, analytics, or recommendations.
The network I/F module 208 may be coupled to the network 106 (e.g., via signal line 214) and the bus 220. The network I/F module 208 links the processor 202 to the network 106 and other processing systems. In some implementations, the network I/F module 208 also provides other conventional connections to the network 106 for distribution of files using standard network protocols such as transmission control protocol and the Internet protocol (TCP/IP), hypertext transfer protocol (HTTP), hypertext transfer protocol secure (HTTPS) and simple mail transfer protocol (SMTP) as should be understood to those skilled in the art. In some implementations, the network I/F module 208 is coupled to the network 106 by a wireless connection and the network I/F module 208 includes a transceiver for sending and receiving data. In such an alternate implementation, the network I/F module 208 includes a Wi-Fi transceiver for wireless communication with an access point. In another alternate implementation, the network I/F module 208 includes a Bluetooth® transceiver for wireless communication with other devices. In yet another implementation, the network I/F module 208 includes a cellular communications transceiver for sending and receiving data over a cellular communications network such as via short messaging service (SMS), multimedia messaging service (MMS), hypertext transfer protocol (HTTP), direct data connection, wireless application protocol (WAP), email, etc. In still another implementation, the network I/F module 208 includes ports for wired connectivity such as but not limited to USB, SD, or CAT-5, CAT-5e, CAT-6, fiber optic, etc.
The input/output device(s) (“I/O devices”) 210 may include any device for inputting or outputting information from the data science platform server 102 and may be coupled to the system either directly or through intervening I/O controllers. An input device may be any device or mechanism for providing or modifying instructions in the data science platform server 102. For example, the input device may include one or more of a keyboard, a mouse, a scanner, a joystick, a touchscreen, a webcam, a touchpad, a stylus, a barcode reader, an eye gaze tracker, a sip-and-puff device, a voice-to-text interface, etc. An output device may be any device or mechanism for outputting information from the data science platform server 102. For example, the output device may include a display device, which may include light emitting diodes (LEDs). The display device represents any device equipped to display electronic images and data as described herein. The display device may be, for example, a cathode ray tube (CRT), liquid crystal display (LCD), projector, or any other similarly equipped display device, screen, or monitor. In one implementation, the display device is equipped with a touch screen in which a touch-sensitive, transparent panel is aligned with the screen of the display device. The output device indicates the status of the data science platform server 102, such as: 1) whether it has power and is operational; 2) whether it has network connectivity; and 3) whether it is processing transactions. Those skilled in the art should recognize that there may be a variety of additional status indicators beyond those listed above that may be part of the output device. The output device may include speakers in some implementations.
The storage device 212 is an information source for storing and providing access to data, such as the data described in reference to
The bus 220 represents a shared bus for communicating information and data throughout the data science platform server 102. The bus 220 may represent one or more buses including an industry standard architecture (ISA) bus, a peripheral component interconnect (PCI) bus, a universal serial bus (USB), or some other bus known in the art to provide similar functionality, such as transferring data between components of a computing device or between computing devices, a network bus system including the network 106 or portions thereof, a processor mesh, a combination thereof, etc. In some implementations, the processor 202, memory 204, display module 206, network I/F module 208, input/output device(s) 210, storage device 212, various other components operating on the data science platform server 102 (operating systems, device drivers, etc.), and any of the components of the data science unit 104 may cooperate and communicate via a communication mechanism included in or implemented in association with the bus 220. The software communication mechanism may include and/or facilitate, for example, inter-process communication, local function or procedure calls, remote procedure calls, an object broker (e.g., CORBA), direct socket communication (e.g., TCP/IP sockets) among software modules, UDP broadcasts and receipts, HTTP connections, etc. Further, any or all of the communication could be secure (e.g., SSH, HTTPS, etc.).
As depicted in
It should be recognized that the data science unit 104 and the disclosure herein apply to and may work with Big Data, which may have billions or trillions of elements (rows×columns) or even more, and that the user interface elements are adapted to scale to such large datasets and the resulting large models and results, and to provide visualization, while maintaining intuitiveness and responsiveness to interactions.
The project module 245 includes computer logic executable by the processor 202 to manage and organize a project-based data science automation process. In some implementations, the project module 245 exposes machine learning objects for user interaction in the data science process. The machine learning objects in the data science process include, for example, projects, datasets, workflows, code, models, deployment, knowledge, and jobs. In some implementations, the project module 245 sends instructions to the user interface module 275 to generate a user interface to orient around, display and/or expose the machine learning objects as different cards, or entries in a table. For example, the user interface may show a plurality of proof-of-concept projects initiated by an enterprise as different cards, or entries in a table of projects. Furthermore, each project may include one or more contextually related machine learning objects, such as datasets, workflows, models, and users who have access to the project.
In some implementations, the project module 245 handles the specification of a checklist for a project. The checklist clarifies and organizes information or data for completing the project in the data science workflow. The checklist represents phases of analytics work and/or analytics diagnostics. The phases of analytics work are parts of the overall analytics work in a project. For example, the phases include, but are not limited to, project specification, data collection, data preparation, data featurization, training of models, selection of models, reporting of models, and deployment of models. The project module 245 includes a specification of diagnostics in the checklist. The diagnostics are validation steps which are prescribed as necessary or desirable to perform, for example, checking for the presence of outliers in the training data. Each diagnostic may include a set of visualizations/plots to be created, a set of statistics to be computed, and thresholds or other conditions on those statistics that define whether the diagnostic has been passed (or any subset of these three). The project module 245 monitors these statistics and thresholds and can automatically check a machine learning object, such as a workflow, to see which diagnostics have been passed. The checklist may help make the data science project error-checkable, progress-trackable, and structured. In some implementations, the phases of the analytics work are customizable to meet the demands of each individual group or enterprise involved in the data science process. In some implementations, the project module 245 sends instructions to the user interface module 275 to generate a user interface that provides a way for a user to create or modify a checklist, and to view the status of a checklist (which items have been checked off, when and by whom, and a timeline by which they should be checked off).
A checklist can be shown in a horizontal or vertical fashion, indicating the overall progress of the machine learning/data science project.
One of the checklist items can be the specification of the project. The project module 245 receives a specification including a primary objective of the project from a user. For example, the primary objective may be a quantitative metric such as predictive accuracy, and may include constraints based on other metrics. The constraints may dictate, for example, that the scoring time of the final model in the project must be less than a specified threshold. In another example, the quantitative metric may be a metric which combines multiple metrics, such as a weighted combination of more than one quantitative value. The specification of the project may also include values/costs such as the entries in a classification cost matrix. In another example, the specification of the project may also include the specification of the generalization mechanism (e.g., 10-fold cross-validation). In some implementations, the project module 245 generates a checklist that is hierarchical. For example, the checklist includes a diagnostic, which may itself be comprised of sub-diagnostics that check more detailed issues.
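The hierarchical checklist and diagnostic structure described above can be sketched as follows. This is a minimal illustration, not the platform's actual implementation: the class names, the outlier rule, and the three-standard-deviation threshold are assumptions chosen for the example.

```python
class Diagnostic:
    """A validation step: a statistic plus a threshold condition that
    defines a pass, optionally refined by sub-diagnostics."""
    def __init__(self, name, statistic=None, condition=None, sub_diagnostics=None):
        self.name = name
        self.statistic = statistic          # callable computing a statistic from data
        self.condition = condition          # predicate on the statistic's value
        self.sub_diagnostics = sub_diagnostics or []

    def passed(self, data):
        # A diagnostic passes when its own condition holds and all of its
        # sub-diagnostics (more detailed checks) pass as well.
        own_ok = True
        if self.statistic and self.condition:
            own_ok = self.condition(self.statistic(data))
        return own_ok and all(d.passed(data) for d in self.sub_diagnostics)

# Example diagnostic: the training data contains no value more than three
# standard deviations from the mean (an assumed outlier rule).
def outlier_fraction(values):
    mean = sum(values) / len(values)
    std = (sum((v - mean) ** 2 for v in values) / len(values)) ** 0.5
    if std == 0:
        return 0.0
    return sum(1 for v in values if abs(v - mean) > 3 * std) / len(values)

outliers = Diagnostic("no-outliers", statistic=outlier_fraction,
                      condition=lambda frac: frac == 0.0)
```

Because `passed` recurses through `sub_diagnostics`, a checklist item can be composed of more detailed checks to arbitrary depth, matching the hierarchical checklist described above.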
In some implementations, the project module 245 receives data science tags for a plurality of machine learning objects from one or more users of a project. For example, each type of object (e.g., projects, datasets, workflows, code, models, deployments, knowledge, jobs, features, cards) may have tags associated with it, which may be pre-assigned in the data science process or created by users participating in the project. Tags may be searched, edited, filtered, and viewed by the user. In some implementations, the project module 245 configures pre-conditions and post-conditions for the machine learning object manipulated in the project. For example, a machine learning object, such as a workflow, may have its pre-conditions or post-conditions specified in a standardized representation or set of representations. The pre-conditions and post-conditions may be preconfigured by the data science process or user specified. The pre-conditions and post-conditions inform the data science process of the input and/or output of each machine learning object and of what the result of an interaction of two or more machine learning objects should be, for error checking and automation in the data science process.
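A minimal sketch of the pre-condition/post-condition error check described above, assuming a simple set-based representation of conditions; the object names and condition labels are hypothetical, not the platform's standardized representation.

```python
def can_chain(upstream, downstream):
    """Error check: the upstream object's post-conditions must satisfy the
    downstream object's pre-conditions before the two are connected."""
    return downstream["pre"].issubset(upstream["post"])

# Hypothetical machine learning objects with set-valued conditions.
featurizer = {"pre": {"raw rows"}, "post": {"raw rows", "feature matrix"}}
trainer = {"pre": {"feature matrix"}, "post": {"model"}}
scorer = {"pre": {"model", "feature matrix"}, "post": {"scores"}}
```

Under this representation, chaining the featurizer into the trainer passes the check, while chaining the trainer directly into the scorer fails it (the scorer also requires the feature matrix), which is the kind of automated error checking the conditions enable.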
The data preparation module 250 includes computer logic executable by the processor 202 to receive a request from a user to import a dataset from various information sources, such as computing devices (e.g. servers) and/or non-transitory storage media (e.g., databases, Hard Disk Drives, etc.). In some implementations, the data preparation module 250 imports data from one or more of the servers 108, the data collector 110, the client device 114, and other content or analysis providers. For example, the data preparation module 250 may import a local file. In another example, the data preparation module 250 may link to a dataset from a non-local file (e.g. a Hadoop distributed file system (HDFS)). In some implementations, the data preparation module 250 processes a sample of the dataset and sends instructions to the user interface module 275 to generate a preview of the sample of the dataset. The data preparation module 250 manages the one or more datasets in a project and performs special data preparation processing to import the external file during the import of the dataset. In some implementations, the data preparation module 250 processes the dataset to retrieve metadata. For example, the metadata can include, but is not limited to, name of the feature or column, a type of the feature (e.g., integer, text, etc.), whether the feature is categorical (e.g., true or false), a distribution of the feature in the dataset based on whether the data state is sample or full, a dictionary (e.g., when the feature is categorical), a minimum value, a maximum value, mean, standard deviation (e.g. when the feature is numerical), etc. In some implementations, the data preparation module 250 scans the dataset on import and automatically infers the data types of the columns in the dataset based on rules and/or heuristics and/or dynamically using machine learning. For example, the data preparation module 250 may identify a column as categorical based on a rule. 
In another example, the data preparation module 250 may determine that 80 percent of the values in a column are unique and may identify that column as an identifier type column of the dataset. In yet another example, the data preparation module 250 may detect time series of values, monotonic variables, etc. in columns to determine appropriate data types. In some implementations, the data preparation module 250 determines the column types in the dataset based on machine learning on data from past usage. In some implementations, the data preparation module 250 sends instructions to the user interface module 275 to generate a user interface oriented around the dataset as a machine learning object and display features generated for the dataset for user interaction.
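The rule-based part of the type inference above can be sketched as follows. The 80-percent-unique identifier rule follows the example in the text; the remaining thresholds and the rule ordering are assumptions made for illustration.

```python
def infer_column_type(values):
    """Infer a column's data type from its values using simple rules."""
    non_null = [v for v in values if v is not None]
    if not non_null:
        return "unknown"
    distinct_ratio = len(set(non_null)) / len(non_null)
    if distinct_ratio >= 0.8:
        return "identifier"                 # mostly unique values (per the text)
    if all(isinstance(v, (int, float)) for v in non_null):
        return "numeric"
    if distinct_ratio <= 0.05:
        return "categorical"                # few distinct values relative to rows (assumed threshold)
    return "text"
```

In practice such rules would be combined with heuristics for time series and monotonic columns, and refined dynamically by machine learning on past usage, as described above.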
The model management module 255 includes computer logic executable by the processor 202 for generating one or more models based on the data prepared by the data preparation module 250 in the project of the data science process. In some implementations, the model management module 255 includes a one-step process to train, tune and test models. The model management module 255 may use any number of various machine learning techniques to generate a model. In some implementations, the model management module 255 automatically and simultaneously selects between distinct machine learning models and finds optimal model parameters for various machine learning tasks. Examples of machine learning tasks include, but are not limited to, classification, regression, and ranking. The performance can be measured by and optimized using one or more measures of fitness. The one or more measures of fitness used may vary based on the specific goal of a project. Examples of potential measures of fitness include, but are not limited to, error rate, F-score, area under curve (AUC), Gini, precision, performance stability, time cost, etc. In some implementations, the model management module 255 provides the machine learning specific data transformations used most by data scientists when building machine learning models, significantly cutting down the time and effort needed for data preparation on big data.
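The simultaneous selection between distinct models and search for optimal parameters can be sketched as an exhaustive search over candidate model families and parameter grids, scored by the project's measure of fitness. The toy "constant" model family and the negative-mean-absolute-error fitness below are illustrative assumptions; the platform's actual search strategy and fitness measures are not specified beyond the examples in the text.

```python
import itertools

def select_model(candidates, fitness, data):
    """Evaluate every (model family, parameter setting) pair and return the
    (name, params, score) tuple that is best by the chosen measure of
    fitness (higher is better)."""
    best = None
    for name, (train, grid) in candidates.items():
        keys = sorted(grid)
        for combo in itertools.product(*(grid[k] for k in keys)):
            params = dict(zip(keys, combo))
            score = fitness(train(params, data))
            if best is None or score > best[2]:
                best = (name, params, score)
    return best

# Toy example: one model family ("constant") with one tunable parameter.
data = [1.0, 2.0, 2.0, 9.0]

def constant_model(params, values):
    if params["stat"] == "mean":
        return sum(values) / len(values)
    return sorted(values)[len(values) // 2]   # median

def fitness(prediction):
    # Measure of fitness: negative mean absolute error (higher is better).
    return -sum(abs(v - prediction) for v in data) / len(data)

candidates = {"constant": (constant_model, {"stat": ["mean", "median"]})}
best = select_model(candidates, fitness, data)
```

Here the median constant beats the mean because the dataset contains an outlier, which is exactly the kind of trade-off a fitness-driven search resolves automatically.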
In some implementations, the model management module 255 identifies variables or columns in a dataset that were important to the model being built and sends the variables to the reporting module 265 for creating partial dependence plots (PDP). In some implementations, the model management module 255 analyses the data of the built model and sends the data to the reporting module 265 for creating diagnostic reports. In some implementations, the model management module 255 determines the tuning results of models being built and sends the information to the user interface module 275 for display. In some implementations, the model management module 255 stores the one or more models in the storage device 212 for access by other components of the data science unit 104. In some implementations, the model management module 255 performs testing on models using test datasets, generates results and stores the results in the storage device 212 for access by other components of the data science unit 104.
In some implementations, the model management module 255 manages and builds a workflow in the project. The workflow may or may not include a model. The model management module 255 monitors the building and exporting of the workflow and sends data to the auditing module 260 for building an audit trail of changes that have transpired in the building and exporting of the workflow. For example, the workflow may be a complex transformation composed of individual, simpler transformations. In another example, a user-developed transformation may be a workflow that is composed of column extraction transformation, column addition transformation, column subtraction transformation, etc. In another example, the workflow can be a subset of one or more transformations from a data transformation pipeline, which may also occasionally be referred to herein as a transformation workflow, project workflow or similar, exported by a user. In another example, the workflow may be a machine learning model that can be an input to another workflow.
In some implementations, the model management module 255 may deploy and manage models in a training and/or production environment. The model management module 255 sends instructions to the user interface module 275 to generate a user interface for displaying a scoreboard of the models, or experiments involving models. The model management module 255 sends instructions to the user interface module 275 to generate a user interface for displaying information relating to deployment of models.
The auditing module 260 includes computer logic executable by the processor 202 to create a full audit trail of models, projects, datasets, results and other machine learning objects in a data science project. In some implementations, the auditing module 260 creates self-documenting models with an audit trail. Thus, the auditing module 260 improves model management and governance with self-documenting models that include a full audit trail. The auditing module 260 generates an audit trail for items so that they may be reviewed to see when and how they were changed and who made the changes to, for example, the machine learning object. Moreover, models generated by the model management module 255 automatically document all datasets, transformations, commands, algorithms and results, which are displayed in an easy to understand visual format. In some implementations, the auditing module 260 sends instructions to the user interface module 275 to generate a user interface that displays a running log or history of actions (by user or as part of the automated data analysis process) with respect to the machine learning object of the data science project. The auditing module 260 tracks all changes and creates a full audit trail that includes information on what changes were made (e.g., using commands programmatically or via the user interface), when, and by whom. The audit trail or the auto-documentation explains what was done, in digestible chunks that provide clarity. The audit trail can be shared with other users or regulatory bodies. This level of model management and governance is critical for data science teams working in enterprises of all sizes, including regulated industries. The auditing module 260 also provides a rewind function that allows a user to re-create any past pipelines. The auditing module 260 also tracks software versioning information. The auditing module 260 also records the provenance of data sets, models and other files.
The auditing module 260 also provides for file importation and review of files or previous versions.
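An append-only audit trail recording what changed, when, by whom, and through which mechanism can be sketched as follows; the field names and helper functions are illustrative assumptions, not the platform's schema.

```python
import datetime

audit_trail = []                            # append-only log of changes

def record(user, obj, action, via="user interface"):
    """Record what change was made, to which object, when, by whom, and how."""
    audit_trail.append({
        "who": user,
        "what": f"{action} {obj}",
        "via": via,                         # programmatic command or user interface
        "when": datetime.datetime.now(datetime.timezone.utc).isoformat(),
    })

def history(obj):
    """Running log of actions touching a given machine learning object."""
    return [entry for entry in audit_trail if entry["what"].endswith(obj)]

record("alice", "model-1", "trained")
record("bob", "model-1", "deployed", via="command")
```

Because entries are only ever appended, the log supports the review, sharing, and rewind uses described above: replaying the entries for an object re-creates its pipeline history.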
The reporting module 265 includes computer logic executable by the processor 202 for generating reports, visualizations, and plots on items including models, datasets, results, etc. In some implementations, the reporting module 265 determines a visualization that is a best fit based on the variables being compared. For example, in partial dependence plot visualization, if the two PDP variables being compared are categorical-categorical, then the plot may be a heat map visualization. In another example, if the two PDP variables being compared are continuous-categorical, then the plot may be a bar chart visualization. In some implementations, the reporting module 265 receives one or more custom visualizations developed in different programming platforms from the client devices 114, receives metadata relating to the custom visualizations, adds the visualizations to the visualization library, and makes the visualizations accessible project-to-project, model-to-model or user-to-user through the visualization library.
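The best-fit selection can be sketched as a mapping from the pair of PDP variable types to a plot type. The categorical-categorical and continuous-categorical cases follow the examples in the text; the continuous-continuous case is an assumption added to complete the sketch.

```python
def best_fit_plot(type_a, type_b):
    """Pick a best-fit visualization from the two PDP variable types."""
    pair = frozenset((type_a, type_b))
    if pair == {"categorical"}:             # categorical vs. categorical
        return "heat map"
    if pair == {"categorical", "continuous"}:
        return "bar chart"
    if pair == {"continuous"}:              # assumed choice for this pair
        return "contour plot"
    raise ValueError(f"unknown variable types: {type_a}, {type_b}")
```

Using an unordered pair (`frozenset`) makes the choice symmetric, so continuous-categorical and categorical-continuous comparisons yield the same visualization.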
In some implementations, the reporting module 265 cooperates with the user interface module 275 to identify any information provided in the user interfaces to be output in a report format individually or collectively. Moreover, the visualizations, the interaction of the items (e.g., experiments, features, models, data sets, and projects), the audit trail or any other information provided by the user interface module 275 can be output as a report. For example, the reporting module 265 allows for the creation of a directed acyclic graph (DAG) and a representation of it in the user interface as shown below in the example of
In some implementations, the modules 250, 255, and 265 may receive user defined code sequences that manipulate the dataset, the model, and the plot visualization of one or more of the objects in the data science project. The modules 250, 255, and 265 send instructions to the user interface module 275 to generate a user interface that integrates coding, where the user may edit the code sequence. This integration addresses a large span of skills and allows customization of the data science process. The modules 250, 255, and 265 send instructions to the user interface module 275 to update the user interface with generated report elements indicating, for example, the successful debugging or wrapping of the code sequence for use in the data science project.
The suggestion module 270 includes computer logic executable by the processor 202 for generating a suggestion of a next action to interactively guide the user in the data science process. The suggestion may be used to teach the user why the action is preferred at a particular juncture of the data analysis in the project. For example, the suggestion may help ensure a good outcome in the project, prevent the user from getting stalled in the data science process, and raise the skill level of the user to create a trained user. The suggestion module 270 determines a context of one or more related machine learning objects and generates the suggestion of a next action based on the context. The context identifies an analysis phase of the data science process involving the one or more related machine learning objects. The context also considers a history of analysis performed on the one or more related machine learning objects.
In some implementations, the suggestion module 270 selects the suggestion from one or more of seeded suggestions, heuristics, and a set of best practices. In some implementations, the suggestion module 270 learns the actions of one or more other users (e.g. an expert user) in similar context, and generates a next action suggestion for a novice user based on learning the actions (e.g. those of the expert user). In some implementations, the suggestion module 270 sends instructions to the user interface module 275 to generate a user interface that includes an option (which may appear as a button or other interaction cue) for the user to select to receive a suggestion of a next action. In some implementations, a user may repeatedly select the option and the user interface module 275 generates successive steps guiding the user through the machine learning/data science process from end-to-end.
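The seeded-suggestion path can be sketched as a lookup keyed by the context, here reduced to the analysis phase and the most recent action in the object's history. The phase names, action names, and suggestion strings are illustrative; the platform would also draw on heuristics and learned expert behavior as described above.

```python
# Seeded best-practice rules keyed by (analysis phase, most recent action).
SEEDED_SUGGESTIONS = {
    ("data preparation", "imported dataset"): "Check for missing values and outliers",
    ("data preparation", "checked outliers"): "Featurize the dataset",
    ("model training", "trained model"): "Review the partial dependence plots",
}

def suggest_next_action(phase, history):
    """Suggest a next action from the current analysis phase and the most
    recent action in the object's history, with a generic fallback."""
    last = history[-1] if history else None
    return SEEDED_SUGGESTIONS.get(
        (phase, last),
        "Review the project checklist for the next unchecked item")
```

Invoking this repeatedly after each completed action yields the successive end-to-end guidance described above, with the fallback keeping the user unblocked when no seeded rule matches.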
In some implementations, the suggestion module 270 accesses a knowledge base for machine learning/data science and selects a knowledge element from the knowledge base. The suggestion module 270 bundles the suggestions with an appropriate knowledge element to describe the reasoning behind the suggestions. The knowledge base is user-editable in some implementations. The suggestion module 270 receives question-and-answer knowledge from a user and adds the knowledge to the knowledge base for other users to access. In some implementations, the suggestion module 270 may specify a sequence of actions as suggestions, thus constituting the equivalent of a lesson or demo. The lesson or demo may guide the user through both the knowledge elements and the associated software actions, and the user learns the data science process taught by the lesson or demo by doing as per the suggestions.
In some implementations, the suggestion module 270 maintains a machine learning/data science point system within the knowledge base. The point system may encourage certain user behaviors by displaying an amount of “points” gained by the user and stored by the point system, for example, for completing or passing certain lessons or demos, for creating and teaching lessons or demos, for adding knowledge nodes to the knowledge base, for creating models which perform well compared to others on scoreboards, for performing actions in the data science process, or for performing any other action associated or not associated with the product or the company, or any subset of these. Such points may be used to compare with other users' points, to gain rewards which may be monetary or other gifts or rights, or to exchange with other users. They may be bought or sold for real currency.
The user interface module 275 includes computer logic executable by the processor 202 for creating any or all of the user interfaces illustrated in
In some implementations, the user interface module 275 cooperates and coordinates with other components of the data science unit 104 to generate a user interface that allows the user to perform operations on experiments, features, models, data sets, deployment, projects, and other machine learning objects in the same or different user interface. This is advantageous because it may allow the user to perform operations and modifications to multiple items at the same time. The user interface includes graphical elements that are interactive. The user interface is adaptive. The graphical elements can include, but are not limited to, radio buttons, selection buttons, checkboxes, tabs, drop down menus, scrollbars, tiles, text entry fields, icons, graphics, directed acyclic graph (DAG), plots, tables, etc.
In some implementations, the user interface module 275 receives processed information of a dataset from the data preparation module 250 and generates a user interface for representing the features of the dataset. The processed information may include, for example, a preview of the dataset that can be displayed to the user in the user interface. In one embodiment, the preview samples a set of rows from the dataset which the user may verify and then confirm in the user interface for including a plot of the data features into a report as shown in the example of
In some implementations, the user interface module 275 cooperates with other components of the data science unit 104 to recommend a next, suggested action to the user on the user interface. In some implementations, the user interface module 275 generates a user interface including a suggestion box that serves as a guiding wizard in building a model as shown in the example of
In some implementations, the user interface module 275 cooperates with the reporting module 265 to generate a user interface displaying dependencies of items and the interaction of the items (e.g., experiments, features, models, data sets, and projects) in a directed acyclic graph (DAG) view. The user interface module 275 receives information representing the DAG visualization from the reporting module 265 and generates a user interface as shown in the example of
In some implementations, the user interface module 275 receives information including the audit trail from the auditing module 260 and generates a user interface as shown in the example of
The user interface module 275 generates one or more user interfaces oriented around a plurality of fundamental objects of the machine learning/data science process. For example,
Referring to
As shown in
In some implementations, the main workspace card 304 is a screen object which is rectangular, either with corners or rounded edges, generally smaller than the standard screen size of the user interface 300, containing text and/or images. For example, the main workspace card 304 displays an associated input command accepted by the system, and the visual output of that command such as a plot or diagram or table or scoreboard, or its output in text form. In some implementations, the main workspace card 304 may include an area for the user to input a command or other inputs which specify a system action on one or more machine learning objects. The main workspace card 304 may include user-authorable cards that allow the specification of inputs in the manner of a form screen, and display actions taken based on the inputs. In some implementations, the main workspace card 304 may present a unified representation of all of the inputs of a workflow, comprising a concatenation of all of the inputs of cards in the workflow.
In some implementations, the dashboard card 306 may provide an at-a-glance view of one or more key performance indicators relevant to the context of the machine learning object. Any card from other screen areas can be placed into the dashboard area 306 for a dynamic, live-updating visualization of that card. For example, cards can be selected for inclusion in the dashboard area 306 (and the selection mechanism can include drag-and-drop into the dashboard area 306). When a card is shown in the dashboard area 306, it may be shown in one or more of a smaller, compressed, abbreviated, and vignette format. Examples of multiple cards in a dashboard include a machine learning/data science scoreboard, a workflow diagram, and a machine learning/data science checklist as shown in
The history area 308 is a machine learning/data science history area. The history area 308 is shown in
Regardless, as illustrated, the user interface 400 includes one or more cards in the history area 308 that may be individually selectable by the user for inclusion in a report for the project involving the dataset object. The one or more cards in the history area 308 may be organized by report topic and may include a diagnostics report for the project checklist (see below for a more detailed description). For example, the user may select the explicit features report topic card 404 in the history area 308 by checking the box for inclusion into the report. The explicit features report topic card 404 shows a plot of the missing values by features, which gives the user an indication of the quality of the dataset(s) used in the data science process for the user's current project. In some implementations, the report generation may be set up by the user in such a way as to automatically document everything the user has performed on the dataset and include such documentation as a report. Such implementations may beneficially provide an audit trail.
Referring now also to graphical representation in
Referring to
Referring to graphical representation in
In some implementations, the user interface 1200 may accommodate guided teaching or learning of machine learning or data science. The next-action suggestion interaction mechanism in the user interface 1200 can be used as a teaching or learning system. The user can specify or request a sequence of actions in the user interface 1200 to suggest, thus constituting the equivalent of a lesson or demo, wherein the user interface 1200 steps the user through one or both of the knowledge elements and the associated software and/or machine learning actions. The user learns via the user interface 1200 by doing as per the suggestions. For example, the user may select the option 1202 at one or more junctures of the data science process to receive one or more suggestions of next actions to perform. In some implementations, the user interface 1200 may gather the actions performed by the user for learning. For example, the user may be allowed to perform actions other than the one the user interface 1200 has suggested, in order to allow a non-linear teaching/learning experience. In some implementations, the user interface 1200 may request a confirmation from the user that the user has read a knowledge element in the demo which the user interface 1200 presented to the user via the main workspace area 304. The user interface 1200 may present a question or a series of questions, i.e., a quiz, to test learning of the knowledge. The user interface 1200 may change the next action suggestion based on the user answers.
One of the checklist items can be the specification of the project. This includes the project's primary objective, which is a quantitative metric such as predictive accuracy, and may include constraints based on other metrics. For example, a constraint can be that the scoring time of the final model must be less than a specified threshold. The metric may be a metric which combines multiple metrics, for example, a weighted combination of more than one quantitative value. The checklist may also include values/costs such as the entries in a classification cost matrix. The checklist may also include the specification of the generalization mechanism, for example, a 10-fold cross-validation. The checklist may be hierarchical, i.e., a diagnostic may itself consist of sub-diagnostics which check more detailed issues. Another one of the checklist items can be diagnostic questions. Diagnostics are validation steps which are prescribed as necessary or desirable to perform, for example, checking for the presence of outliers in the training data. Each diagnostic included in the checklist may include a set of visualizations/plots to be created, a set of statistics to be computed, and thresholds or other conditions on those statistics that define whether the diagnostic has been passed (or any subset of these three). In some implementations, the selection of report elements (e.g., visualizations, plots, etc.) for inclusion in the report can be done through the specification of the project checklist.
The foregoing description of the implementations of the present disclosure has been presented for the purposes of illustration and description. It is not intended to be exhaustive or to limit the present disclosure to the precise form disclosed. Many modifications and variations are possible in light of the above teaching. It is intended that the scope of the present disclosure be limited not by this detailed description, but rather by the claims of this application. As should be understood by those familiar with the art, the present disclosure may be embodied in other specific forms without departing from the spirit or essential characteristics thereof. Likewise, the particular naming and division of the modules, routines, features, attributes, methodologies and other aspects are not mandatory or significant, and the mechanisms that implement the present disclosure or its features may have different names, divisions and/or formats. Furthermore, as should be apparent to one of ordinary skill in the relevant art, the modules, routines, features, attributes, methodologies and other aspects of the present disclosure may be implemented as software, hardware, firmware or any combination of the three. Also, wherever a component, an example of which is a module, of the present disclosure is implemented as software, the component may be implemented as a standalone program, as part of a larger program, as a plurality of separate programs, as a statically or dynamically linked library, as a kernel loadable module, as a device driver, and/or in every and any other way known now or in the future to those of ordinary skill in the art of computer programming. Additionally, the present disclosure is in no way limited to implementation in any specific programming language, or for any specific operating system or environment. 
Accordingly, the disclosure of the present disclosure is intended to be illustrative, but not limiting, of the scope of the present disclosure, which is set forth in the following claims.
The present application claims priority, under 35 U.S.C. §119, of U.S. Provisional Patent Application No. 62/233,969, filed Sep. 28, 2015 and entitled “Improved User Interface for a Unified Data Science Platform Including Management of Models, Experiments, Data Sets, Projects, Actions, Reports and Features,” which is incorporated by reference in its entirety. The present application is also a continuation-in-part of U.S. patent application Ser. No. 15/042,086, filed Feb. 11, 2016 and entitled “User Interface for Unified Data Science Platform Including Management of Models, Experiments, Data Sets, Projects, Actions, Reports and Features,” which claims priority to U.S. Provisional Patent Application No. 62/115,135, filed Feb. 11, 2015 and entitled “User Interface for Unified Data Science Platform Including Management of Models, Experiments, Data Sets, Projects, Actions, Reports and Features.” Each of the foregoing applications is incorporated by reference herein in its entirety.
Number | Date | Country
---|---|---
62233969 | Sep 2015 | US
62115135 | Feb 2015 | US

Number | Date | Country
---|---|---
Parent 15042086 | Feb 2016 | US
Child 15279223 | | US